WWDC10 • Session 411

Fundamentals of Digital Audio for Mac OS X and iPhone OS

Graphics & Media • iOS, OS X • 56:07

Core Audio features a state-of-the-art audio engine for Mac OS X and iPhone OS, enabling the world's most innovative desktop and mobile music and audio applications. Understand the fundamental principles of digital audio processing and codecs that are critical to any application delivering audio functionality and how these relate to the Core Audio API and services.

Speakers: James McCartney, Eric Allamanche

Unlisted on Apple Developer site

Downloads from Apple

HD Video (114.3 MB)

Transcript

This transcript has potential transcription errors. We are working on an improved version.

[James McCartney]

My name is James McCartney. After me, Eric Allamanche will be up to talk. There are going to be two parts to this talk. I'm going to talk about audio processing basics, and then Eric will talk about the Voice Processing Audio Unit and audio codecs. So, in many of the past WWDCs, we in Core Audio have given a lot of talks about Core Audio and how to use it. And we've always assumed a certain knowledge about audio, that everyone is an audio programmer. But a lot of people just want to be able to play audio.

So, we're going to step back a bit, and I'm going to talk about what digital audio is and how it's represented. There'll be three parts to my talk: audio representation formats, converting audio, and processing audio. So, what is digital audio? Sound is moving air molecules, and a microphone transduces that into an electronic waveform. You sample that waveform at a periodic interval called the sampling rate, and each sample is a number.

And so, you need to have at least two numbers per cycle of a sine wave in order to be able to reconstruct that sine wave. Therefore, the highest reproducible frequency is half of the sampling rate. Once you have sampled the waveform as numbers, there are different ways to represent that in the computer. I'm going to talk about linear PCM, nonlinear PCM, and packetized compressed formats. Linear PCM, or LPCM, is the most direct way to represent the sampled audio: you just store the sampled numbers in binary form.

But there are a lot of ways to do that. Are the numbers you're storing integer or floating point? If you're storing integers, are they signed or unsigned? How many bits are in each number? That's called the bit depth. What order are you storing the bytes in? If the most significant byte is coming first, that's called big endian.

If the least significant byte is coming first, that's called little endian. How many channels of audio are there? And are you storing the channels of audio together, which is called interleaved, or are you storing them separately? That's called non-interleaved. And are the bytes packed or is there padding?

So, now, different groups in different areas refer to these things differently. So, in Core Audio, we have a consistent way of referring to samples, frames, and packets. Now, if you work in codecs, you might call a frame what we call a packet. Or if you work in the music industry, you might call things in another way. So, we've decided on a certain way of calling things. So, a sample is one sample of a waveform.

A frame is a collection of samples, one for each channel in the audio stream, for the same vertically aligned moment in time. And a packet, for a particular stream of audio, is the smallest cohesive unit of data for that format. It's what you pass around. For linear PCM, one packet and one frame are synonymous. But for compressed formats, one packet is some group of bytes that you can decompress into some number of frames of linear PCM.

And so, this shows the difference between what interleaved audio and non-interleaved audio look like. In Core Audio, we have a universal container for audio called the AudioBufferList, which we use to pass audio to all our APIs. It contains an array of buffers. In the interleaved case, you see that for stereo sound, the left and right channels alternate; they're interleaved in a single buffer. And in the non-interleaved case, each channel is in its own buffer.

So, when we look at what one frame looks like in each of these formats, for interleaved stereo linear PCM, one frame would be the left and right sample together in the same buffer. So, if you're talking about 2-byte, 16-bit audio samples, the left and right sample together will be 4 bytes of audio.

Whereas in the non-interleaved case, you have one sample in each buffer, so a frame would occupy 2 bytes in each buffer.
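
As a rough illustration of those two layouts (not code from the session), here is a minimal C sketch of how an AudioBufferList might be set up for 16-bit stereo in the interleaved and non-interleaved cases; the helper names are made up for this example.

```c
#include <CoreAudio/CoreAudioTypes.h>
#include <stdlib.h>

// Interleaved stereo, 16-bit samples: one buffer, 2 channels, 4 bytes per frame.
AudioBufferList *MakeInterleavedStereo(UInt32 frames) {
    AudioBufferList *abl = malloc(sizeof(AudioBufferList)); // holds 1 AudioBuffer
    abl->mNumberBuffers = 1;
    abl->mBuffers[0].mNumberChannels = 2;
    abl->mBuffers[0].mDataByteSize = frames * 2 * sizeof(SInt16); // 4 bytes per frame
    abl->mBuffers[0].mData = malloc(abl->mBuffers[0].mDataByteSize);
    return abl;
}

// Non-interleaved stereo: two buffers, 1 channel each, 2 bytes per frame per buffer.
AudioBufferList *MakeNonInterleavedStereo(UInt32 frames) {
    AudioBufferList *abl = malloc(sizeof(AudioBufferList) + sizeof(AudioBuffer));
    abl->mNumberBuffers = 2;
    for (UInt32 i = 0; i < 2; i++) {
        abl->mBuffers[i].mNumberChannels = 1;
        abl->mBuffers[i].mDataByteSize = frames * sizeof(SInt16); // 2 bytes per frame
        abl->mBuffers[i].mData = malloc(abl->mBuffers[i].mDataByteSize);
    }
    return abl;
}
```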

So, in linear PCM, there are basically two dimensions of quality: sample rate and bit depth. The sample rate, as I mentioned earlier, determines what the highest reproducible frequency is. It's the bandwidth of the audio you're listening to, which is just half the sampling rate. So, this is a list of common sample rates and how they're used. Eight kHz is narrow-band speech. You can only represent frequencies up to 4 kHz, so it's not very good for music. Sixteen kHz is wide-band speech. You're able to represent other incidental sounds in this format, and it just sounds better.

Then 44.1 kHz, that's CD quality audio. That covers basically the full human audible spectrum. Another common rate is 48 kHz, which is used in digital audio tape and a lot of audio hardware. Then you see higher rates used in pro equipment, and now showing up in home theater equipment, with 96 kHz and 192 kHz sampling rates.

So, human hearing extends up to 20 kHz. You don't really need sampling rates above 48 kHz to reproduce the entire human audible spectrum, but there are technical reasons why you might want higher sampling rates than 48 kHz, and that has to do with simplifying your audio processing. So, in a lot of pro situations, or internally in audio processing, you'll have rates like 96 kHz or 192 kHz. So, the other dimension is bit depth.

And bit depth determines Signal to Noise Ratio. So what is Signal to Noise Ratio? It's the amplitude of the signal, which is what you're interested in, the music or the speech, divided by the amplitude of the noise, and the noise in this case is quantization noise. When you convert the analog signal to a number, especially if you convert it to an integer, there's only a limited number of steps available to represent the amplitude of the waveform.

And the error between the value you chose and the actual value of the audio is called the quantization error, and that becomes noise in your audio stream. Signal to Noise Ratio is measured in decibels, and every 6 decibels is roughly a factor of two in amplitude. So, every bit you add to the audio gives you 6 more decibels of Signal to Noise Ratio. And in integer formats, Signal to Noise Ratio is amplitude dependent.

So, when you see a quoted Signal to Noise Ratio for an integer format, that's referring to what you get with a full amplitude signal. Quieter signals have a worse Signal to Noise Ratio. So, if you've got a signal that's at -20 dB in an 8-bit audio format, which only gives you 48 dB to begin with, then you're going to have a 20 dB worse Signal to Noise Ratio. But in floating point, Signal to Noise Ratio is independent of the amplitude.
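
As a back-of-the-envelope illustration of the 6-dB-per-bit rule being described here (not code from the session), a tiny C sketch; FullScaleSNR is a made-up helper name.

```c
#include <math.h>
#include <stdio.h>

// Rough full-scale quantization SNR for an N-bit integer format:
// each extra bit doubles the number of quantization steps,
// i.e. adds 20*log10(2), roughly 6 dB.
static double FullScaleSNR(int bits) {
    return bits * 20.0 * log10(2.0);   // about 6.02 dB per bit
}

int main(void) {
    printf("8-bit : %.0f dB\n", FullScaleSNR(8));    // ~48 dB
    printf("16-bit: %.0f dB\n", FullScaleSNR(16));   // ~96 dB
    printf("24-bit: %.0f dB\n", FullScaleSNR(24));   // ~144 dB
    // A signal 20 dB below full scale loses ~20 dB of SNR in an integer format.
    printf("8-bit at -20 dBFS: %.0f dB\n", FullScaleSNR(8) - 20.0); // ~28 dB
    return 0;
}
```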

So, these are some common bit depths and how they're used. 8 bit integer is sort of the format that was used when it was very expensive to store and process audio. It's used in the old games and a lot of gear made in the 1980s. It's got 48 decibels of Signal to Noise Ratio for a full scale signal. So, that's not very good, especially if you've got things that are not at full scale, which is quite often the case.

Then 16 bit integer is CD quality sound. It's 96 decibels of Signal to Noise Ratio. That's quite good, except if you're playing a quiet sound and you turn the volume way up, then you can start to hear the limitations of 16 bit integers. 24 bit integers give you 144 decibels of Signal to Noise Ratio. That's quite good; actually, I think almost all hardware can't even reproduce that much quality. Then for internal processing, we use 32 bit values.

The AudioUnitSampleType on iPhone OS, or iOS, is a 32 bit integer in 8.24 fixed point, which is 8 bits of integer, including the sign bit, and 24 bits of fraction. That gives you 144 decibels of Signal to Noise Ratio, but you've also got 42 decibels of headroom, so you can go over unity gain and not worry about some kinds of processing problems when you're doing internal processing. Then on the desktop, we use 32 bit floating point, which has 144 decibels of Signal to Noise Ratio at any amplitude, even for very quiet signals, and it's got an essentially unlimited dynamic range.
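
As an illustration of that 8.24 layout, a minimal sketch; these conversion helpers are hypothetical, not part of Core Audio.

```c
#include <CoreAudio/CoreAudioTypes.h>

// 8.24 fixed point as used by AudioUnitSampleType on iOS: 24 fractional bits,
// so unity gain is 1 << 24. Values up to about 128 still fit in the 32-bit word,
// which is where the roughly 42 dB of headroom comes from.
static inline SInt32 FloatTo824(Float32 x) {
    return (SInt32)(x * (Float32)(1 << 24));
}

static inline Float32 Fixed824ToFloat(SInt32 x) {
    return (Float32)x / (Float32)(1 << 24);
}
```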

So, one point about quality: once you've lost quality, you can't add it back. Sometimes you hear about people converting their audio to a higher sample rate or a higher bit depth. That's not going to gain you any quality. Similarly, re-encoding compressed audio at a higher bit rate or with a better codec is not going to give you quality back. Only if you have the original uncompressed source at a higher quality will you be able to re-encode to a better format than you previously encoded it to.

So, the next step past linear PCM is kind of historical, and it's called nonlinear PCM. That's where, instead of storing the number, you store the logarithm of the number, and that increases the Signal to Noise Ratio of quiet signals at the expense of the loud ones. There are two common algorithms, mu-law and A-law, and they differ in how they quantize or approximate the logarithm function. Both encode audio in 8 bits per sample.
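
To illustrate the idea of storing the logarithm (this is only the companding curve, not the bit-exact G.711 mu-law codec), a small sketch:

```c
#include <math.h>

// Illustrative mu-law companding: store the logarithm of the amplitude so quiet
// signals get more resolution at the expense of loud ones.
#define MU 255.0

// x is a linear sample in [-1.0, 1.0]; the result, also in [-1.0, 1.0],
// would then be quantized to 8 bits.
static double MuLawCompress(double x) {
    double sign = (x < 0.0) ? -1.0 : 1.0;
    return sign * log(1.0 + MU * fabs(x)) / log(1.0 + MU);
}

static double MuLawExpand(double y) {
    double sign = (y < 0.0) ? -1.0 : 1.0;
    return sign * (pow(1.0 + MU, fabs(y)) - 1.0) / MU;
}
```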

Then you get to packetized compressed formats, and that's when a group of frames is compressed into a packet of bytes. One thing to note is that packets often have dependencies on preceding packets. When you're decoding audio, you're putting the codec into a particular state, and the next packet will assume that the codec is in that state. So, if you take chunks of compressed audio from different streams and splice them together just by appending the packets to each other, the codec is not going to be in the state that the next packet after the splice expects, and you're going to get a glitch.

So, the way you really have to do that is to edit the uncompressed data together and then compress it, or do something more sophisticated about how you splice your packetized audio together, by overlapping and decompressing or re-compressing. But there's going to be more about compressed formats in the next section. Now, formats are represented in Core Audio by a structure called the AudioStreamBasicDescription. It's used in nearly every API in Core Audio, and it's been covered extensively in previous WWDC talks.

There's information about how to fill one out and how you get one from the various APIs like AudioFormat, AudioFile, and AudioConverter. One thing to note is that if you use the AV Foundation classes, which play a file to the audio hardware, then you can avoid AudioStreamBasicDescriptions altogether.
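
As a minimal sketch of filling one out, assuming the common case of CD-quality interleaved 16-bit stereo linear PCM (the helper name is made up):

```c
#include <CoreAudio/CoreAudioTypes.h>

static AudioStreamBasicDescription MakeCDQualityASBD(void) {
    AudioStreamBasicDescription asbd = {0};
    asbd.mSampleRate       = 44100.0;
    asbd.mFormatID         = kAudioFormatLinearPCM;
    // Signed integer, packed; add kAudioFormatFlagIsNonInterleaved for the
    // non-interleaved layout described earlier.
    asbd.mFormatFlags      = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked;
    asbd.mChannelsPerFrame = 2;
    asbd.mBitsPerChannel   = 16;
    asbd.mFramesPerPacket  = 1;                                    // always 1 for LPCM
    asbd.mBytesPerFrame    = asbd.mChannelsPerFrame * (asbd.mBitsPerChannel / 8); // 4
    asbd.mBytesPerPacket   = asbd.mBytesPerFrame;                  // 1 frame per packet
    return asbd;
}
```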

So then, converting audio. When you've got audio in all these formats, how do you get from one format to another? We have an API called the AudioConverter which does this. There are three main conversions you can do. There's linear PCM to linear PCM, which handles all kinds of transformations like sample rate conversion, bit depth changes, converting integer to floating point, interleaved to non-interleaved, changing the number of channels, or any combination of these. And then you have encoding and decoding: encoding is taking linear PCM and converting it to a compressed format, and decoding is taking the compressed format and turning it back into linear PCM.

Some of our APIs have AudioConverters built into them. So, if you use these APIs, then you can avoid having to deal with the complexity of the AudioConverter and take advantage of some of the work that's been done for you. If you're in one of these scenarios, like playing or recording buffers of audio in memory, where you want to play it out to the hardware or record from the hardware, then you can use the AudioQueue API, and that will take care of converting between your format and the hardware format.

If you want to read and write audio files to and from memory, then you can use the ExtendedAudioFile API, and that will handle the conversion between the format you want and the format the file is in. Then if you just want to play a file out to the audio hardware, or record a file from the audio hardware, AVAudioPlayer will handle the conversion between the file format and the hardware format for you, so you don't have to deal with format conversion at all. And in that case, you don't even have to name the formats using an AudioStreamBasicDescription.
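
A minimal sketch of the ExtendedAudioFile approach, assuming you already have a CFURLRef and a client AudioStreamBasicDescription; error handling is reduced to passing the status back, and the function name is made up.

```c
#include <AudioToolbox/ExtendedAudioFile.h>

// Let ExtAudioFile convert whatever is in the file into the linear PCM
// "client" format we want to work with in memory.
OSStatus ReadSomeFrames(CFURLRef fileURL,
                        AudioStreamBasicDescription clientFormat,
                        AudioBufferList *bufferList,
                        UInt32 *ioFrames) {
    ExtAudioFileRef file = NULL;
    OSStatus err = ExtAudioFileOpenURL(fileURL, &file);
    if (err) return err;

    // Ask for our format; ExtAudioFile sets up the AudioConverter internally.
    err = ExtAudioFileSetProperty(file, kExtAudioFileProperty_ClientDataFormat,
                                  sizeof(clientFormat), &clientFormat);
    if (!err)
        err = ExtAudioFileRead(file, ioFrames, bufferList); // frames come back converted
    ExtAudioFileDispose(file);
    return err;
}
```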

OK. So, when you're doing a sample rate conversion using the AudioConverter, you have a number of ways to set the quality. Sample rate conversion is a relatively expensive operation, depending on what quality you want.

There's a property called kAudioConverterSampleRateConverterComplexity which allows you to choose between several different algorithms with different levels of quality. There's linear, normal, and, on the desktop, mastering. Linear is just a linear interpolation between samples. It's fast, but it's not very good quality. Normal does a better job using a more sophisticated algorithm.

Within the normal and mastering complexities, you have several bands of quality, from minimum to maximum, that you can set. Linear is just linear; there's no quality setting for it. A higher quality costs more CPU. The other thing is, on the desktop where you have normal and mastering, the lowest quality of mastering is better than the highest quality of normal. They're completely disjoint bands of quality, but mastering is quite a lot more expensive, especially if you choose the maximum mastering quality.
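
A minimal sketch of setting the complexity and quality band on an existing AudioConverterRef; the mastering complexity, as noted above, is desktop only, and the function name is made up.

```c
#include <AudioToolbox/AudioConverter.h>

static OSStatus UseMasteringSRC(AudioConverterRef converter) {
    // Pick the sample rate converter algorithm (linear / normal / mastering).
    UInt32 complexity = kAudioConverterSampleRateConverterComplexity_Mastering;
    OSStatus err = AudioConverterSetProperty(converter,
                                             kAudioConverterSampleRateConverterComplexity,
                                             sizeof(complexity), &complexity);
    if (err) return err;

    // Pick the quality band within that complexity, from min to max.
    UInt32 quality = kAudioConverterQuality_Max;
    return AudioConverterSetProperty(converter,
                                     kAudioConverterSampleRateConverterQuality,
                                     sizeof(quality), &quality);
}
```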

So, processing audio. In Core Audio, we have AudioUnits, which are used to process audio; they are components. Their main attributes are that they have inputs and outputs, so you have ways to get audio in and out, and parameters that you can adjust in real time. And there are a lot of different kinds of AudioUnits. There are I/O units, which talk to the hardware so you can read audio from the hardware or play it out.

Then there are effects, which is the most numerous category, giving you filters, compressors, delays, reverbs, and time/pitch changes. And then there are panners and mixers. For the I/O AudioUnits on iOS, or iPhone OS, you have the RemoteIO unit, which is your most direct access to the audio hardware.

And then on the desktop, you have AUHAL, which fulfills basically the same role. It's an AudioUnit that talks to the hardware. On the desktop, the AUHAL is built on top of the HAL, which is the hardware abstraction layer, and that's the low level access to the hardware. If you use the AUHAL AudioUnit, you're going to benefit from having a lot of the details of dealing with the low level handled for you, including format conversion. And there's no latency cost for using that AudioUnit.
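
A minimal sketch of opening that I/O unit with the AudioComponent API; the only difference between the platforms is the componentSubType, and the function name is made up.

```c
#include <TargetConditionals.h>
#include <AudioUnit/AudioUnit.h>

static OSStatus OpenIOUnit(AudioUnit *outUnit) {
    AudioComponentDescription desc = {0};
    desc.componentType         = kAudioUnitType_Output;
#if TARGET_OS_IPHONE
    desc.componentSubType      = kAudioUnitSubType_RemoteIO;   // iOS
#else
    desc.componentSubType      = kAudioUnitSubType_HALOutput;  // AUHAL on the desktop
#endif
    desc.componentManufacturer = kAudioUnitManufacturer_Apple;

    AudioComponent comp = AudioComponentFindNext(NULL, &desc);
    if (comp == NULL) return -1;
    OSStatus err = AudioComponentInstanceNew(comp, outUnit);
    if (err) return err;
    return AudioUnitInitialize(*outUnit);
}
```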

Just as an example, on the desktop you have various filter AudioUnits; these images are from the UI of the AudioUnits and show a graph of frequency versus gain for the various filters. There's a Parametric EQ, Graphic EQ, Lowpass, Highpass, Bandpass, and Low and High Shelf Filters. You also have Compressors, a Delay unit, a Reverb unit, and Panner units. And we have Mixers.

On the iPhone, you have a Multichannel Mixer, which actually does mono and stereo, and then there's the Embedded 3D Mixer. And then on the desktop, you have Multichannel Mixers, 3D Mixers, a Stereo Mixer, and a Matrix Mixer. So, I'm going to talk about the Embedded 3D Mixer. It gives you two basic algorithms: equal power panning for stereo, and then a spherical head algorithm which gives you interaural time delay cues, intensity differences, filtering due to the head, and distance filtering. The 3D Mixer uses Azimuth, Elevation, and Distance as its parameters, a listener-centric parameterization of the source of the audio. Azimuth is the angle from directly forward for the listener.

Positive is around to the listener's right, negative is around to the listener's left, and 180 is to the rear. This just illustrates these parameters: the source is at some distance from the listener. And then on some of the desktop panners and 3D Mixers, you also have Elevation.

Then there's also a property on the 3D Mixer for distance attenuation. There's a reference distance below which there is no change in the amplitude of the audio, and a maximum distance above which there is no change in the amplitude. In between, there's a distance curve, and there are several different distance curves you can choose.
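
A minimal sketch of positioning one input bus of the 3D mixer with AudioUnitSetParameter, assuming the mixer instance already exists; the helper name and the value choices are made up.

```c
#include <AudioUnit/AudioUnit.h>

// Position one input of the 3D mixer relative to the listener.
static OSStatus PositionSource(AudioUnit mixer, UInt32 bus,
                               Float32 azimuthDegrees, Float32 distanceMeters) {
    // Positive azimuth is to the listener's right, negative to the left, 180 is behind.
    OSStatus err = AudioUnitSetParameter(mixer, k3DMixerParam_Azimuth,
                                         kAudioUnitScope_Input, bus,
                                         azimuthDegrees, 0);
    if (err) return err;
    return AudioUnitSetParameter(mixer, k3DMixerParam_Distance,
                                 kAudioUnitScope_Input, bus,
                                 distanceMeters, 0);
}
```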

Another way to access 3D spatialization is through OpenAL. OpenAL is an OpenGL-like library for 3D audio. It's cross-platform and allows 3D spatialized source positioning. It's built on top of the 3D Mixer, so you're using the 3D Mixer underneath, but it lets you use world coordinates in x, y, z, so the listener can be anywhere and the source can be anywhere in space, rather than the listener-centric coordinates the 3D Mixer uses.
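
A minimal OpenAL sketch of that world-coordinate positioning, assuming a source has already been generated with alGenSources; the positions are just example values.

```c
#include <OpenAL/al.h>

// Place the listener and a source in world (x, y, z) coordinates instead of the
// 3D Mixer's listener-centric azimuth/elevation/distance.
static void PlaceSourceAndListener(ALuint source) {
    alListener3f(AL_POSITION, 0.0f, 0.0f, 0.0f);          // listener at the origin
    alSource3f(source, AL_POSITION, 2.0f, 0.0f, -1.0f);   // to the right and ahead
}
```

OK. So, now it's time for Eric Allamanche to talk about the Voice Processing Audio Unit and Audio Codecs.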

[ Applause ]

[Eric Allamanche]

Thank you, James. And welcome again. My name is Eric Allamanche, and I'm going to walk you through the Voice Processing Audio Unit and the Audio Codecs we provide on the iPhone. So, let's start with the Voice Processing Audio Unit. The Voice Processing Audio Unit was added to iPhone OS 3.0 last year.

And it is basically a dedicated RemoteIO unit with a built-in Acoustic Echo Canceler. So, from a programmer's perspective, this Echo Canceler can be accessed exactly the same way as you would access the RemoteIO unit: setting it up, creating the instance, setting parameters, and so on.

But this year, in iOS 4, we added a new algorithm which provides significantly better quality, of course at the cost of a heavier CPU load. This Echo Canceler was specifically designed to allow extremely high quality audio chat, like in the FaceTime application. So now we offer two algorithms in this RemoteIO unit, and this allows you to make a tradeoff between the quality and the CPU you want to spend for this kind of application.

Let me just recall the functionality of an Acoustic Echo Canceler. On the left-hand side, you have what we call the far end speaker, who has a device with a microphone and a loudspeaker. And on the right-hand side, there is the near end speaker with the same kind of device.

So, the far end speaker starts to talk, and this is visualized with the blue arrows: the speech signal goes into the microphone on the far end speaker's side. Then it's encoded and propagated over the internet and comes to the near end speaker's device. This signal is then rendered through the loudspeaker. But because on most devices the microphone is really in the vicinity of the loudspeaker, there is a certain amount of acoustic energy which goes back into the near end speaker's microphone.

And if we don't take any measures at this level, this signal is then propagated back to the far end speaker. Because of all the delays in the chain, from the encoding, network delays, and so on, the far end speaker's signal comes back after a certain amount of time, typically around 100 or 200 milliseconds, and this is perceived as an echo.

So, what happens now if the near end speaker talks on top of the far end speaker? Well, the signals are mixed together acoustically in the air, and this mixed signal is then captured by the microphone and sent back to the far end speaker. This is where the Echo Canceler comes into play, because we want to eliminate the blue signal here from the lower path. And this is done with several algorithms which have been developed over the last decades.

Basically, the purpose of the Echo Canceler is to analyze both signals: the signal which is about to be sent to the loudspeaker and the signal which is captured by the microphone. From these two signals, the Echo Canceler tries to estimate the amount of echo included in the lower path, and this echo is then removed from that path.

This is visualized by the subtraction sign on the lower part of this diagram. In either case, what should go back to the far end speaker is actually the red signal, which is the speech of the near end speaker only. Now, what happens if other app sounds are playing, or you get a notification, an e-mail or whatever? Then this sound gets mixed in with the far end speaker's signal and then rendered through the speaker.

But what we want to do here is eliminate this sound, which could also be some background audio playing in the case of a game, for example; we don't want this signal to be echoed back to the far end speaker. In order to do this, we have to put the summation point of the signals on the leftmost side of the system, and this is why this mixing happens before the Echo Canceler is invoked.

One important thing to note here is that your application will see all the signals that it has to deal with, but it won't see any other signals coming in, like an e-mail notification or sounds played by other applications. Because you won't see these signals, if you were doing the echo cancellation yourself, the mixing would happen further to the right side, and those sounds wouldn't be eliminated. That's why this RemoteIO with the built-in Canceler is an ideal solution to eliminate all ancillary sounds coming from other applications.

The way you open and interact with this processing unit is exactly the same as with the RemoteIO unit, and this will be explained in much more detail in the next session by Murray. I just wanted to point out some basic setup here. You create the audio component description, and the difference here is that you just provide the componentSubType as VoiceProcessingIO instead of RemoteIO, and that's basically it. Once you find the component and create a new instance, you have your RemoteIO with a built-in Canceler ready to be used.
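
A minimal sketch of that component description; the fields not shown are left at zero, and finding the component and creating the instance works as in the earlier I/O unit sketch.

```c
#include <AudioUnit/AudioUnit.h>

AudioComponentDescription desc = {
    .componentType         = kAudioUnitType_Output,
    // The only change from the plain RemoteIO setup:
    .componentSubType      = kAudioUnitSubType_VoiceProcessingIO,
    .componentManufacturer = kAudioUnitManufacturer_Apple,
};
```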

Now, with this voice processing unit, we provide a few parameters. The first one is the Bypass parameter, which allows you to bypass the whole process, which means that basically nothing happens: everything is mixed together and nothing is removed. This can be useful in certain circumstances. The voice processing unit also has a built-in automatic gain control to boost the resulting signal coming out of the Echo Canceler; this can be controlled by a property and is on by default. And another property is for ducking the NonVoiceAudio, as I mentioned in the diagram. All the other app sounds are what we call the NonVoiceAudio, and there's a property to duck this audio to a certain extent.

Now, in iOS 4, we've added two more properties. As I mentioned on the first slide, we added a new algorithm which provides much better quality, and this algorithm is now controlled by the VoiceProcessingQuality property. With this property, you can select either the old echo suppressor we had in iPhone OS 3.0 or the new, better one available in iOS 4. And the last property we added is MuteOutput, which basically zeroes out the signal coming out of the Echo Canceler. This is for muting a conversation.
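
A minimal sketch of setting these properties on a VoiceProcessingIO instance; the specific values, including the assumed quality range, are only illustrations, and the function name is made up.

```c
#include <AudioUnit/AudioUnit.h>

static OSStatus ConfigureVoiceProcessing(AudioUnit vpio) {
    UInt32 bypass = 0;   // leave echo cancellation active
    OSStatus err = AudioUnitSetProperty(vpio, kAUVoiceIOProperty_BypassVoiceProcessing,
                                        kAudioUnitScope_Global, 0, &bypass, sizeof(bypass));
    if (err) return err;

    UInt32 agc = 1;      // automatic gain control, on by default
    err = AudioUnitSetProperty(vpio, kAUVoiceIOProperty_VoiceProcessingEnableAGC,
                               kAudioUnitScope_Global, 0, &agc, sizeof(agc));
    if (err) return err;

    UInt32 quality = 127; // assumed 0..127 range; higher selects the better iOS 4 algorithm
    err = AudioUnitSetProperty(vpio, kAUVoiceIOProperty_VoiceProcessingQuality,
                               kAudioUnitScope_Global, 0, &quality, sizeof(quality));
    if (err) return err;

    UInt32 mute = 0;     // set to 1 to mute the signal sent to the far end
    return AudioUnitSetProperty(vpio, kAUVoiceIOProperty_MuteOutput,
                                kAudioUnitScope_Global, 0, &mute, sizeof(mute));
}
```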

So, that was the Voice Processing Unit. Now, let's dive into the Audio Codecs. The term codec is a contraction of encoder and decoder, and it's not specific to audio; it applies to any kind of codec, like video codecs and so forth. The main purpose of a codec is to compress and decompress signals, and because we're talking about audio codecs, we only deal with PCM audio signals.

In general, we differentiate two big categories of codecs: the lossy codecs, which are associated with loss of information, and on the other hand, the lossless codecs. And of course, audio codecs nowadays are a core technology in digital audio; they are basically the backbone of the iPod, of any media player application, and of the iTunes Store. Now, let's talk about lossy versus lossless audio codecs.

In the case of lossless codecs, there is no loss of information. So, after one encoding and decoding cycle, the resulting signal should be bit identical to the input signal, regardless of the bit depth, be it 16, 24, or 32 bit integer or float. No loss of information.

This can be compared to the Unix zip command, for example. But zip is a general-purpose tool, and it doesn't provide good compression ratios for audio signals in general, so that's why it's better to have dedicated lossless audio codecs. Typical compression factors for state of the art lossless codecs nowadays are in the range of 1.5 to 2, with 2 being a very good compression ratio already.

On the other hand, there are the lossy codecs, which are the most widely used ones, like MP3 and AAC. These typically rely on a perceptual model of the human auditory system. What this means is that, as James mentioned before, we can only hear signals up to 20 kHz, but that is already a very optimistic case, because as we grow older this limit shifts towards 15 kHz and even lower.

Lossy codecs take advantage of many properties of our auditory system, especially the masking effect, which basically is: you have a tone playing back at a certain magnitude and another tone coming in somewhere in the same frequency range but with a much lower magnitude. What happens is that this second tone won't be perceived at all. And because it is not perceived, there is no reason to encode it. This is what the lossy codecs try to achieve by first evaluating what information can be discarded and what should remain.

So, that is the irrelevant part of the information, which we're going to try to remove; the other part is the redundant information, which is more of a mathematical nature and is basically the predictable part of the signal. A lossy codec is basically controlled by the bit rate. Obviously, the higher the bit rate, the higher the quality, and conversely, the lower the bit rate, the worse the quality.

So, it is extremely important to make good decisions regarding the bit rate because we want to make sure that we don't degrade the signal or that degradations won't be perceived. And in contrast to lossless codecs, lossy codecs have typical compression factors ranging between 6 and 24, which is a very big range, and this is achieved with very sophisticated algorithms.

Now, I'm going to talk about the audio decoders which are available on the iPhone. First, we have the Adaptive Differential Pulse Code Modulation codecs like IMA4, DVI, and MS-ADPCM, which are very simple codecs. They don't provide very good audio quality, but because they were simple, they were widely used historically, and they still are; for example, when Voice over IP providers send you an e-mail because you can't pick up the phone, the voice messages are typically encoded in one of these formats. Then we have QDesign version 1 and 2, which is actually the old audio codec that was used by QuickTime before we moved over to AAC, so it's there for historical reasons. Then of course there's GSM, the GSM Full Rate codec used on mobile networks.

Then we have the Internet Low Bitrate Codec, iLBC, which is a free and open codec providing decent quality at pretty low bit rates. Then of course MP3, which is MPEG-1/2 Layer 3; Apple Lossless, which is the only lossless codec we provide on the iPhone; and then the MPEG-4 AAC family of codecs, which I will discuss in more detail in the next slides. Now, regarding encoders, we don't provide an encoder for every decoder.

There is not always a need for an encoder for every format, so the choice of encoders is much more restricted here. For ADPCM, we have IMA4. We have the iLBC codec, which you can use for Voice over IP applications, for example. There's Apple Lossless, which in certain cases is used for voice recording.

And then for MPEG-4 AAC, we provide three different encoders: the Low Complexity codec, the Low Delay codec, and the Enhanced Low Delay codec. Now, regarding MP3, we don't provide any MP3 encoder, and this is also true on the desktop; Core Audio doesn't provide an MP3 encoder. So, if you want to encode to MP3 on the desktop, you need to use iTunes. The AudioConverter can't encode to MP3.

If we put all the codecs together with their characteristics, we get the following table. Let's first start with iLBC. iLBC is 8 kHz, it is optimized for speech, it is a speech codec, and it offers two bit rates in the ballpark of 15 kilobits per second. Then of course we have MP3, which has a sampling rate range of 16 to 48 kHz and can handle mono and stereophonic signals. MP3 is what we call a general audio codec.

What this means is that MP3 hasn't been designed to encode a certain class of signals; it can be used for any kind of signal, including speech, of course. ALAC has the particularity that the bit rate can't be set, and the reason is obvious: because it is a lossless codec, the content itself will determine what the final bit rate will be. It is also a general audio codec.

Then the AAC Low Complexity codec provides a very broad sample rate range, going from 8 to 48 kHz, and it is also a general audio codec, same as MP3. In iOS 4, we've added two more channel configurations, which are 5.1 and 7.1. What this means is that if you have an AAC file encoded as 5.1 or 7.1, you will be able to decode it; however, it will only be rendered in stereo.

We do a downmix at the end of the decoding process. Now, for the High Efficiency AAC codec, we also provide mono and stereo channels. As I will go into in more detail on the next slide, this codec has been optimized for streaming audio. And AAC Enhanced Low Delay has been specifically optimized for AV chat kinds of applications.

Now, let's talk about the AAC codecs in more detail. Why do we even have to bother with AAC? I mean, isn't MP3 ubiquitous now, and why isn't MP3 good enough? AAC is Advanced Audio Coding, and I just want to point out that none of the A's in AAC stands for Apple; it's all about advanced audio, and it's an MPEG standard. MP3 is almost 20 years old now. It was standardized in 1991. At that time, the requirements for audio codecs were completely different than they are nowadays, and even then, it was a challenge to decode an MP3 stream in real time.

This could only be done with expensive DSP boards. Therefore, MP3 has serious limitations in its design; most specifically, it is limited in the bit rates it supports, the sampling rates, and the channel configurations. So, right off the bat, MP3 can only handle stereo signals. And there are some other mathematical underpinnings in the codec design itself which don't allow it to be transparent for certain signal classes.

So, all of this together led the MPEG consortium to start a new work group which was focused on designing a much better, non-backwards-compatible codec, and this is where AAC came to life. The first version was standardized as part of MPEG-2 in 1997.

Since then, AAC was adopted as the basic codec in MPEG-4, and it has seen many, many additions and new variations coming in, which makes it a very versatile codec. So, about the AAC codecs: first, there is the Low Complexity codec. The Low Complexity codec is the codec you use for any kind of media playback application. This is what the iPod uses.

This is what the iPod application uses on the iPhone or iPod touch. Then we have the High Efficiency and High Efficiency v2 codecs, which have the advantage that they provide similar quality, to some extent, but at a significantly lower bit rate. This makes them really interesting for internet radio stations. And if you look at the iTunes radio station library, you will notice that low bit rate streams, something around 64 kilobits and below, are most of the time encoded using High Efficiency or High Efficiency v2.

On the other side, we have the Low Delay codec, which was first introduced in Mac OS X Leopard, and it is the default codec for the iChat AV application. And today, in iOS 4, we've added a new codec which is called the Enhanced Low Delay codec; this is the codec which is used for the FaceTime application. Now, Low Complexity and High Efficiency. Low Complexity provides the highest audio quality and multi-channel support. High Efficiency uses some tricks in order to significantly reduce the bit rate.

What High Efficiency basically does is that during the decoding process, it synthesizes the upper frequency bands rather than encoding them, and this results in some significant bit rate savings. As an extension to High Efficiency, the High Efficiency v2 codec expands mono to stereo signals using parametric stereo techniques. This v2 version can provide even lower bit rates, down to 20 kilobits per second.

To summarize, the highest quality will always be achieved using the Low Complexity codec, and the lowest bit rate, with degraded quality, can be provided using High Efficiency v2. Now, just to give you an idea, I have a few sound examples here. I hit the wrong button, sorry. The first sound example is encoded using the Low Complexity codec at 128 kilobits per second, which is a very popular bit rate.

[ Music ]

The next example is encoded using the High Efficiency codec at 64 kilobits per second, so half the bit rate.

[ Music ]

And the next one is going to be High Efficiency v2 encoded at 32 kilobits per second, so again half of what we just heard.

[ Music ]

Now, to put this in contrast with Low Complexity, the next example is actually the same item encoded with Low Complexity but at 32 kilobits per second, so exactly the same bit rate as High Efficiency v2.

[ Music ]

So, even with these acoustics, the difference is obvious. But just to be clear, I don't want you to rush back to your offices and re-encode all your assets using High Efficiency v2 at the lowest possible bit rate; that's not the message here.

I just want to show how efficient this technique is. As I mentioned before, High Efficiency does some clever synthesis of the upper frequency bands, but they do not really reflect what the original content was, so you should always be aware of this. The way High Efficiency works is that, when it was introduced, many systems were already using AAC decoders, and they wanted to preserve backwards compatibility.

So, what the MPEG consortium did was to use a layered approach. We start with the Low Complexity base layer, which is either mono or stereo and typically at a 22 kHz sampling rate. On top of this Low Complexity layer, we add the High Efficiency layer, which is also mono or stereo, but the High Efficiency layer operates at double the sampling rate of the Low Complexity one. What this means is that, as I said before, we synthesize the upper frequency bands, and therefore only the lower frequency portion will be encoded using Low Complexity.

Similarly, for High Efficiency v2, we start with a Low Complexity layer, but this time it is a mono-only layer. Then we add a High Efficiency layer at twice the sampling rate, which will also be mono. And on top of this comes the High Efficiency v2 layer, which will then expand the mono signal to stereo. The discovery mechanism and how you deal with this format is described in Technical Note TN2236, so I recommend having a closer look at that document.

Now, given that Low Complexity provides such high quality and we get very good efficiency using the High Efficiency codecs, what about Low Delay? Why do we need Low Delay? Well, if we look at the waveform as it would come out directly from a Low Complexity decoder, we would see something like this. The input signal is actually the same signal, but left-justified.

We see here that before the signal onset there is a huge region which contains only zeros. This lag is obviously too much for full duplex, chat-like applications. This problem was addressed with the Low Delay AAC codec, and you see that for the output of the AAC Low Delay codec, the lag is substantially less than for the Low Complexity one.

In the case of Enhanced Low Delay, the lag is only half of the Low Delay one. If we put this in terms of numbers, we see that Low Complexity has a lag of 2112 samples, whereas Enhanced Low Delay has only a 240 sample lag. This shows that this codec is well suited for AV chat applications.

The Low Delay codecs share the same foundations as AAC, so they're actually an extension of the AAC standard, but they provide much smaller delays, typically 15 to 40 milliseconds, and they have been specifically designed for full-duplex communication applications. Low Delay can be requested using the format ID constant kAudioFormatMPEG4AAC_LD.

It has a minimum delay of 20 milliseconds. Enhanced Low Delay provides 15 milliseconds as the minimum delay. And because they are part of the AAC family of codecs, they have a large bit rate range and they even allow for transparent quality. One thing I want to point out is that, because of the block sizes involved, for all these codecs the delay in time is inversely proportional to the sampling rate: the higher the sampling rate, the lower the delay will be, and conversely, the delay will increase if you go down in sampling rate.
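
A minimal sketch of describing that output format to an encoder; only the fields a compressed format needs are filled in, and the sample rate, channel count, and helper name here are just example values.

```c
#include <AudioToolbox/AudioToolbox.h>

static AudioStreamBasicDescription MakeAACELDFormat(void) {
    AudioStreamBasicDescription out = {0};
    out.mSampleRate       = 44100.0;
    out.mFormatID         = kAudioFormatMPEG4AAC_ELD;  // kAudioFormatMPEG4AAC_LD for AAC-LD
    out.mChannelsPerFrame = 1;                         // e.g. mono voice chat
    // Packet layout fields are left at zero; the converter fills them in.
    return out;
}
```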

Now, I just want to briefly give you an overview of where those codecs live. Allan, in the previous session, explained when a software codec comes into play. This table just summarizes in which world each codec lives. We see that the High Efficiency and High Efficiency v2 codecs aren't yet available as software codecs.

This is important for your application, because if you want to use High Efficiency encoded material, you have to be aware that you may not be able to decode it at its full quality if the hardware codec is already in use, either by another application or by another AV player instance.

But the Low Delay codecs are software only, so you can have multiple instances of them. Now, I just want to go over some key parameters for the encoding process. First, we have the sampling rate, of course, the number of channels, the bit rate, and the bit rate modes, and all of this leads to the subjective quality, which we want to maintain. The bit rate determines the compression ratio: the higher the bit rate, the bigger the resulting file, but the better the quality. And the bit rate typically accounts for all channels.

We don't specify it on a per-channel basis, because sometimes that doesn't make sense, like for 5.1 or 7.1 material, where the Low Frequency Effects channel doesn't require many resources. And the bit rate typically grows with the number of channels and also with the sampling rate.

One thing to be aware of is that the software AAC encoder has an internal sample rate converter, and it may do a sample rate conversion if you specify a bit rate which is too low for the sampling rate of the input signal. So, in the example I just played, the 32-kilobit Low Complexity one, the sampling rate was actually taken down to 16 kHz.

Bit rate modes are another knob you can turn to make tradeoffs. The simplest mode is the Constant Bit Rate mode, which allocates a fixed number of bytes for every packet, but therefore it is not flexible at all; it doesn't adapt to the content. Encoding one second of silence takes as many resources as encoding a complete symphony orchestra.

We recommend, and the default is actually, the Average Bit Rate mode, which has much more flexibility in the sense that it dynamically allocates resources to every packet according to its content, but with the constraint of trying to maintain the average bit rate provided by the user. The most flexible mode, which is also known from MP3, is the so-called Variable Bit Rate mode, where there are basically no limitations on the bit rate, and VBR is expressed in terms of quality instead of bit rate.
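
A minimal sketch of asking an AAC-encoding AudioConverter for a target bit rate; the function name is made up, and the constant, average, and variable modes described above correspond to the kAudioCodecBitRateControlMode constants declared in AudioCodec.h.

```c
#include <AudioToolbox/AudioToolbox.h>

// "converter" is assumed to already be set up to encode LPCM to kAudioFormatMPEG4AAC.
// The bit rate covers all channels, so 128000 means 128 kbps for a stereo pair.
static OSStatus SetEncoderBitRate(AudioConverterRef converter, UInt32 bitsPerSecond) {
    return AudioConverterSetProperty(converter, kAudioConverterEncodeBitRate,
                                     sizeof(bitsPerSecond), &bitsPerSecond);
}
```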

Just to wrap up about the encoder, I want to give a few recommendations, and hopefully you will follow them. You should choose the codec according to the use case and its limitations. If you want high-quality audio, like media playback, then there is no doubt you should always use Low Complexity. For streaming kinds of applications, like streaming radio, High Efficiency is the best choice, obviously because of the significantly lower bit rates. And if you want high-quality voice chat, then you should use the Enhanced Low Delay codec.

And whenever possible, you should favor the highest possible quality by choosing the right codec, the best encoding mode, and the highest possible bit rate. Also, as James stated in the previous section, lost information can't be recovered. What this means is that if you convert an MP3 to AAC, the quality will be degraded, even if the bit rate is higher. So, this is something you should really avoid: if you don't have the original source material, avoid transcoding from one format to the next.

Now, for the way these AAC streams are packaged: you know the .mp4 ending, which is the MPEG-4 native file format. There's also .m4a, which is MPEG-4 compatible but adds iTunes-specific data chunks, and this can also be used to embed ALAC material. And there's the preferred format, which is the Core Audio file format with the ending .caf.
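
A minimal sketch of choosing the container when creating a file with ExtAudioFile; the AudioFileTypeID picks among these formats, and the function name is made up.

```c
#include <AudioToolbox/ExtendedAudioFile.h>

// kAudioFileMPEG4Type -> .mp4, kAudioFileM4AType -> .m4a, kAudioFileCAFType -> .caf
static OSStatus CreateAACFile(CFURLRef url,
                              const AudioStreamBasicDescription *aacFormat,
                              ExtAudioFileRef *outFile) {
    return ExtAudioFileCreateWithURL(url, kAudioFileM4AType, aacFormat,
                                     NULL,                        // default channel layout
                                     kAudioFileFlags_EraseFile,   // overwrite if it exists
                                     outFile);
}
```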

For streaming, you have the endings .adts or .aac, which is a self-framing format, and this is what is used for SHOUTcast, internet broadcast, and HTTP Live Streaming. So, this concludes my talk. I just want to point you to the next session, Audio Development for iPhone OS, Part 2, by Murray, who will go into much more detail about the RemoteIO unit and the AudioUnits.