WWDC03 • Session 405

Audio Formats

Application Frameworks • 46:57

Mac OS X includes a rich set of technologies and resources for reading and playing back audio files in a variety of formats. This session specifically addresses the APIs for Audio Converter, Audio Codec, and Audio File. This session is of interest to all audio developers.

Speakers: James McCartney, Doug Wyatt

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.

Hello. So I thought, as an introduction to James' session here (he's going to talk about the details of several of our APIs, including the Audio File APIs and the Audio Converter APIs), I'd just show a quick demo program I wrote, which shows you how this looks from the outside. This doesn't use QuickTime. It's using all of our native audio APIs to read and convert audio files. So here I've got a few files I brought with me. This is a simple AIFF stereo 16-bit file. And I can listen to it.

So that's just using the Audio File API to recognize the file, read the samples out of it, and pump it through an audio output unit. There's very little actual manipulating of audio data in this program. It's all using our APIs. So another thing I can do here, I can do a simple sample rate conversion.

These are the same APIs that QuickTime is based on top of in their upcoming version. This program also, by the way, is going to be available as sample code. And we can see some of the things the Audio File API provides here in Panther. We can see all the file formats that are supported.

I'm going to keep this in AIFF. I'll just say same as source. Actually if I were to choose AIFF here, I could make it a 24-bit AIFF, which doesn't make sense because it's 16 bits. I'll just say same as source. And I'll down sample it to 22K. And hit convert.

This is going through our sample rate converter, which is built into our Audio Converter API. This is very heavily AltiVec-optimized code, as are all of our integer to float conversions now in Panther. These are good reasons for using our APIs for your int to float conversions. They're heavily optimized for both G4 and G3, and soon for G5 as well. This is my converted file. It's going to sound about the same.

Okay, so another thing you can do with the Audio Converter is to use it to encode and decode formats such as AAC. Here I've got an example of a six-channel AIFF file which one of my coworkers authored in Logic a year ago. Here it is in its AIFF format, and I'll leave it playing while I convert it to an AAC. You might want to turn the volume down just a little so I can keep talking, please.

Okay, so I can say I want to convert it to AAC and leave it to the same sample rate. There's some more parameters you can put on the converter here, such as the bit rate, the quality. You can do channel remapping in multi-channel AAC, for example. Depending on the channel layout, it can save bits by putting a filter on the sub or LFE channel. So this example program doesn't show that right now, but that is an option that there is on the converter. So while it keeps playing, you can turn the volume back up now.

So this is a pretty substantial file. It's 40 megabytes as an AIFF. So we're encoding that six-channel file to AAC in a little less than real time here. And so here we have, if we go to the Finder, we'll see now it's a 3.1 megabyte file. And it's also in surround.

This is kind of cool. I'm scrubbing around inside the AAC file. It's decoding multiple packets. Really nice and efficient. By the way, our AAC decoder in our test is about three times as efficient as MP3 decoding. So if you're considering putting sounds in games, you might take a close look at using AAC encoding instead of MP3. Okay, so that's the sample program, and I'll turn it back over to James to explain the APIs that are underneath this program. Thank you. Okay, back to slides, please.

Slides. Okay, there we go. Okay, so what I'm going to talk about is how to handle audio formats with Core Audio. First I'll sort of review the basics of how formats are represented in Core Audio. Then I'll talk about the Audio Converter API and the Audio Format API, which is new in Panther. And I'll talk about some new features for audio units for supporting multi-channel and surround. And I'll talk about the new Audio File Get Global Info API and the new Matrix Mixer Audio Unit.

Okay, so one thing that's been a source of some confusion in Core Audio is what is the definition of frames and packets and bytes and samples. And when you're dealing with compressed formats, it gets pretty important to take them all into account. So this graphic shows what a five-channel interleaved 24-bit stream looks like. You can see that one sample is three bytes and five channels are interleaved into one frame.

And for linear PCM, one frame equals one packet in the way we count things in the Core Audio APIs. And the way you describe a format is by specifying an Audio Stream Basic Description, which is a structure that's used throughout the Core Audio API. It has the sample rate for the stream and the format ID, which tells you whether it's a PCM stream or it's some kind of compressed format.

The format ID is a four-character code for the format. Then there's flags that are specific to that format, and some fields that tell you the relationship between the bytes, the packets, and the frames in that format, and the number of channels in that format and how many bits per channel.

Okay, so here's an example of how to fill one of these out for the five-channel 24-bit interleaved stream that was in the first graphic there. The format ID is linear PCM. The flags are set to big-endian, signed integer, packed. It's one frame per packet; all linear PCM is one frame per packet. And then for this interleaved stream, there's 15 bytes per packet and 15 bytes per frame, five channels per frame, 24 bits per channel.
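As an editor's illustration of what that description looks like in code, here is a minimal sketch in C; the sample rate and the header path are assumptions, not something from the session:

```c
#include <CoreAudio/CoreAudioTypes.h>

// Five channels of interleaved, packed, big-endian, signed 24-bit integer PCM.
AudioStreamBasicDescription fiveChannel24 = { 0 };
fiveChannel24.mSampleRate       = 44100.0;               // assumed sample rate
fiveChannel24.mFormatID         = kAudioFormatLinearPCM;
fiveChannel24.mFormatFlags      = kAudioFormatFlagIsBigEndian
                                | kAudioFormatFlagIsSignedInteger
                                | kAudioFormatFlagIsPacked;
fiveChannel24.mFramesPerPacket  = 1;        // always 1 for linear PCM
fiveChannel24.mChannelsPerFrame = 5;
fiveChannel24.mBitsPerChannel   = 24;       // one sample is 3 bytes
fiveChannel24.mBytesPerFrame    = 5 * 3;    // 15 bytes: all channels interleaved in one frame
fiveChannel24.mBytesPerPacket   = fiveChannel24.mBytesPerFrame;
```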

Okay, so this shows an example of a non-interleaved stream. There's two buffers. Each of them holds floating point samples; they're four bytes each. You see on the left the names mBuffers[0].mData; those are fields in the Audio Buffer List structure, which I'll show in a minute. Each buffer holds the samples from one channel of the data.

So this non-interleaved stream has two channels, and so there's two buffers. Okay, and then here's the Audio Stream Basic Description for that. The one difference you'll see here, well, for one thing, the format flag for non-interleaved is set. And another difference you'll see here is that bytes per packet and bytes per frame is four, which describes one of those buffers. So even though there's two channels, it's not two times four bytes. It's just four bytes per packet, because they're split into two buffers.
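For comparison, here is a sketch of the non-interleaved case being described; again, an illustration rather than the session's own code:

```c
// Two channels of non-interleaved 32-bit float PCM. With the non-interleaved
// flag set, mBytesPerFrame and mBytesPerPacket describe ONE mono buffer.
AudioStreamBasicDescription stereoFloat = { 0 };
stereoFloat.mSampleRate       = 44100.0;                 // assumed sample rate
stereoFloat.mFormatID         = kAudioFormatLinearPCM;
stereoFloat.mFormatFlags      = kAudioFormatFlagsNativeFloatPacked
                              | kAudioFormatFlagIsNonInterleaved;
stereoFloat.mFramesPerPacket  = 1;
stereoFloat.mChannelsPerFrame = 2;    // two channels, split across two buffers
stereoFloat.mBitsPerChannel   = 32;
stereoFloat.mBytesPerFrame    = 4;    // one 4-byte float, for one channel's buffer
stereoFloat.mBytesPerPacket   = 4;
```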

Okay, so now when you get into compressed formats, the simplest kind of compressed format is a constant bit rate format. That is, there's a constant number of bytes per packet. The number of frames per packet depends on the compressed format that you're dealing with. The other kind is variable bit rate data, where the number of bytes per packet can vary. And in the Audio Stream Basic Description, the mBytesPerPacket field is just set to zero because it's not a constant value.

Now when you're dealing with some formats like AAC that don't have information in the bitstream about where the packet boundaries are, you need something, an external piece of data to tell you where those packet boundaries are. And we use the Audio Stream Packet Description in our APIs to tell you what the starting byte offset of a packet is and the length in bytes of that packet. And when you're passing around AAC data, you have to pass around an array of these packet descriptions to tell you where the packet boundaries are.

Okay, now, audio data is stored in Audio Buffer Lists in Core Audio throughout our APIs. This is in the Audio Units, the Audio Converter, and the HAL. The Audio Buffer List tells you the number of buffers, and then there's an array of buffer structures, and each buffer tells you how many interleaved channels there are in that buffer, and the size of the buffer, and then there's a pointer to the buffer.
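A rough sketch of allocating an Audio Buffer List for the two-buffer, non-interleaved float stream from the earlier example; the frame count is an assumption, and this uses the common variable-length-struct idiom rather than anything shown in the session:

```c
#include <CoreAudio/CoreAudioTypes.h>
#include <stddef.h>
#include <stdlib.h>

enum { kFrames = 512 };    // assumed buffer length

UInt32 numBuffers = 2;     // one mono buffer per channel for non-interleaved data
AudioBufferList *abl = (AudioBufferList *)malloc(
    offsetof(AudioBufferList, mBuffers) + numBuffers * sizeof(AudioBuffer));

abl->mNumberBuffers = numBuffers;
for (UInt32 i = 0; i < numBuffers; ++i) {
    abl->mBuffers[i].mNumberChannels = 1;                        // mono buffer
    abl->mBuffers[i].mDataByteSize   = kFrames * sizeof(float);
    abl->mBuffers[i].mData           = malloc(kFrames * sizeof(float));
}
```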

Okay, all these structures that I'm talking about, and some I'll talk about later, are defined in CoreAudioTypes.h. They're used everywhere in Core Audio, and there are public utility classes that are shipped with the SDK that provide common operations on these structures and make it easier to fill them out and do the things that you commonly need to do with them.

Okay, so now I'm going to talk about how frames and packets are used by the various Core Audio APIs. Audio Units express their buffer sizes in frames. By default, the format is 32-bit float, non-interleaved. People have sort of gotten the impression that that's the only format you can do with Audio Units, but since there is a stream description that you can set on the inputs and outputs of the Audio Units, you can actually put any kind of audio data into those Audio Units.

Okay, the HAL counts time in frames. When you get your I/O proc callback, the sample time is provided there in frames. In PCM mode, the HAL tells you its buffer size in frames. The buffer sizes are set in frames, and unless you tell it otherwise, the format will be 32-bit float. In non-linear PCM mode, you're restricted to dealing with the data one packet at a time. The buffer frame size range is restricted to the number of frames per packet.

Okay, so for the Audio File API, there's two pairs of calls: Audio File Read Bytes and Write Bytes deal in bytes, and Audio File Read Packets and Write Packets deal in packets. Since frames equal packets for PCM, you can use Audio File Read Packets for PCM, and that'll return you PCM frames. If you're dealing with compressed data, then Read Packets will return you some number of packets of that compressed data.
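Here is a hedged sketch of the packet-based read call for a PCM file, assuming an AudioFileID that has already been opened and a 16-bit stereo format; the helper name and sizes are invented for illustration:

```c
#include <AudioToolbox/AudioFile.h>

// Read packets starting at packet 0. For linear PCM, one packet is one frame,
// so no packet descriptions are needed.
OSStatus ReadSomePCM(AudioFileID file, void *buffer, UInt32 bufferByteSize)
{
    UInt32 numBytes   = bufferByteSize;
    UInt32 numPackets = bufferByteSize / 4;   // 16-bit stereo: 4 bytes per frame/packet
    return AudioFileReadPackets(file,
                                false,         // don't use the cache
                                &numBytes,     // out: bytes actually read
                                NULL,          // no packet descriptions for PCM
                                0,             // starting packet
                                &numPackets,   // in: packets wanted, out: packets read
                                buffer);
}
```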

Okay, so now you can describe different formats. How do you convert from one format to another? For that we have the Audio Converter. It can do floating point to integer, various bit depths, sample rate conversion, interleaving, deinterleaving, and channel reordering, and new in Panther, it can convert between PCM and compressed formats for codecs that are installed in the system.

In order to create an audio converter, you use Audio Converter New. You give it an input and output format, and it returns you an Audio Converter Ref, which is your audio converter object. So in order to call it here, you would fill out two audio stream descriptions. You can use CAStreamBasicDescription, the SDK class, to help you do that.

In this example, I'm just showing creating a decoder. I do Audio Converter New, and I'll get my decoder instance out from that. Then, after you've created your audio converter, you want to convert audio with it. So you call Audio Converter Fill Complex Buffer.
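A minimal sketch of that creation step, assuming an AAC-to-16-bit-PCM decoder; the specific formats are illustrative, not taken from the session:

```c
#include <AudioToolbox/AudioConverter.h>

// Source: stereo AAC at 44.1 kHz. AAC carries 1024 frames per packet.
AudioStreamBasicDescription srcFormat = { 0 };
srcFormat.mSampleRate       = 44100.0;
srcFormat.mFormatID         = kAudioFormatMPEG4AAC;
srcFormat.mChannelsPerFrame = 2;
srcFormat.mFramesPerPacket  = 1024;

// Destination: interleaved, packed, signed 16-bit PCM, stereo.
AudioStreamBasicDescription dstFormat = { 0 };
dstFormat.mSampleRate       = 44100.0;
dstFormat.mFormatID         = kAudioFormatLinearPCM;
dstFormat.mFormatFlags      = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked;
dstFormat.mFramesPerPacket  = 1;
dstFormat.mChannelsPerFrame = 2;
dstFormat.mBitsPerChannel   = 16;
dstFormat.mBytesPerFrame    = 4;
dstFormat.mBytesPerPacket   = 4;

AudioConverterRef decoder = NULL;
OSStatus err = AudioConverterNew(&srcFormat, &dstFormat, &decoder);
```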

That is a call that takes an audio converter instance in. It takes a pointer to an input procedure, which is the data source for getting input into the audio converter, and a user data field for storing your instance data for the audio converter. And then there's ioOutputDataPacketSize, which on input is the number of packets you want to get out of the audio converter.

On return, it will be the number of packets you actually got out. That number could be less than what you requested if you're at the end of the stream or there was an error.

Then you pass in the audio buffer list which you want the audio converter to write your converted data into. And then there's a pointer to an array of packet descriptions. If you're converting to AAC and you're asking for some number of packets of AAC, you need to pass in an array of packet descriptions so you'll know where the packet boundaries of the data you get back are.

Okay, so in order to call Audio Converter Fill Complex Buffer, you need to prepare an output buffer list. If it's interleaved data, then there will be one buffer in the buffer list array, and if it's non-interleaved, you'll need multiple mono buffers. The mData pointer contains the pointer to the buffer that will have the audio data written into it, and the data byte size tells the size of the buffer. And if you ask for more packets than will fit into that size, it will get truncated down to whatever space you provided.

So in order to call Audio Converter Fill Complex Buffer, basically this shows just passing the arguments. There's the decoder, the input procedure pointer, the user data, the packet count I'm requesting, which is 8192 in this case, and a buffer list, which I have filled out. And then I'm passing null here for the packet descriptions because I'm decoding, so I'm getting PCM out, so I don't need packet descriptions in this case. Now, when you use an audio converter, you need to write an input procedure, and the input procedure implements the source of the demand-driven model.
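A sketch of that call for the decode case; MyInputProc and mySourceState are hypothetical names (the input proc itself is sketched a little further down), and the 8192-packet request mirrors the number mentioned above:

```c
// One interleaved output buffer; decoding to PCM, so no packet descriptions
// are needed on the output side.
enum { kMaxOutputPackets = 8192 };
static SInt16 outputSamples[kMaxOutputPackets * 2];     // stereo, 16-bit

AudioBufferList outList;
outList.mNumberBuffers              = 1;
outList.mBuffers[0].mNumberChannels = 2;
outList.mBuffers[0].mDataByteSize   = sizeof(outputSamples);
outList.mBuffers[0].mData           = outputSamples;

UInt32 ioPackets = kMaxOutputPackets;    // in: packets requested, out: packets produced
OSStatus err = AudioConverterFillComplexBuffer(decoder,
                                               MyInputProc,      // hypothetical input proc
                                               &mySourceState,   // hypothetical user data
                                               &ioPackets,
                                               &outList,
                                               NULL);            // no packet descriptions for PCM out
```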

When you ask the audio converter for converted data, it will call your input proc to get the input-side data. It's demand-driven, so if you're doing a sample rate conversion, it might be pulling data at a different rate than you're getting it out. And also there's some internal buffering that can happen for doing compression or sample rate conversion, so the pulling of input needs to be decoupled from the pulling of output, and that's what the input proc implements.

So you need to set up your buffer list. Inside your input proc, your job is to provide data to the audio converter. You don't have to copy data into the Audio Converter's buffers. You just give the audio converter pointers to your data. It passes you a buffer list, and you fill out that buffer list with the pointers to the data that it wants to convert. The Audio Converter in general goes out of its way to eliminate copying, so if it can, it will just convert from your input buffer into your output buffer without buffering internally. In certain cases, it cannot do that.

In the input proc, you provide pointers to your data; you don't copy it. And in your input proc, when you pass data to the audio converter, you have to keep that data valid until the next time your input proc is called. And that might be across calls to Audio Converter Fill Complex Buffer. So you may call Audio Converter Fill Complex Buffer.

It calls your input proc, which returns data to the audio converter. And then Audio Converter Fill Complex Buffer exits and returns you data. You still have to keep that input data live until the next time you've called Audio Converter Fill Complex Buffer and it's called your input proc, because it's still looking at that data.

Now, your input proc gets passed the number of packets that the Audio Converter wants from you, but you're allowed to return either more or less data than it's asked you for. If you return less, then it will be called again. If you return more, then it will just ask you less frequently; it will just ask you the next time it needs data.

In the input proc, you get an Audio Converter instance, and there's the number of data packets. On input, it's the number of data packets that have been requested, and on output, you return the number of data packets that you're actually returning. It passes you a buffer list, which you need to fill out for the output. And then there's the Audio Stream Packet Description. If you're returning AAC data, then you need to set this pointer to the array of packet descriptions that describe the packet boundaries of the data.

And then there's the user data that's passed in, which is your own instance data for use, however you want to use it in your input proc. There's a couple special conditions that you have to deal with in your input proc. One is when you reach the end of the stream and you're out of data, what you need to do is set the number of data packets you're returning to zero and return no error. Your input proc may be called several more times and you just keep returning zero. And that will signal to the Audio Converter that you're indeed out of data and it will flush its buffers.

Another situation you could be in is that if you're doing real-time streaming, you may be in a situation where you're not at the end of the stream but you don't have any data available right now. So the Audio Converter needs to just return whatever it's got converted. But what you do in that case is you return no packets available and you return an error.

And this error gets propagated back to the caller and any data that had been converted up to that point will be returned to the caller. But the Audio Converter will keep any unconverted data that it has internally until the next time Audio Converter fill complex buffer is called.
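Putting those input proc rules together, here is a hedged sketch; the MySource struct and its fields are invented for illustration, and it assumes a constant bit rate source that doesn't need packet descriptions:

```c
#include <AudioToolbox/AudioConverter.h>

typedef struct {
    const UInt8 *data;            // compressed source bytes that we keep alive
    UInt32       bytesLeft;
    UInt32       bytesPerPacket;  // constant bit rate assumed
} MySource;

static OSStatus MyInputProc(AudioConverterRef              converter,
                            UInt32                        *ioNumberDataPackets,
                            AudioBufferList               *ioData,
                            AudioStreamPacketDescription **outDataPacketDescription,
                            void                          *inUserData)
{
    MySource *src = (MySource *)inUserData;
    UInt32 packetsAvailable = src->bytesLeft / src->bytesPerPacket;

    if (packetsAvailable == 0) {
        // End of stream: return zero packets and no error. The converter may
        // call us a few more times; keep returning zero and it will flush.
        *ioNumberDataPackets = 0;
        return noErr;
    }

    if (*ioNumberDataPackets > packetsAvailable)
        *ioNumberDataPackets = packetsAvailable;      // returning less is fine

    UInt32 byteCount = *ioNumberDataPackets * src->bytesPerPacket;

    // Hand the converter a pointer to our data; no copying. This memory must
    // stay valid until the next time this proc is called.
    ioData->mBuffers[0].mData           = (void *)src->data;
    ioData->mBuffers[0].mDataByteSize   = byteCount;
    ioData->mBuffers[0].mNumberChannels = 2;          // assumed stereo source

    // For externally framed formats like AAC you would also set
    // *outDataPacketDescription to an array describing each packet.

    src->data      += byteCount;
    src->bytesLeft -= byteCount;
    return noErr;
}
```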

[Transcript missing]

So for Audio Stream Basic Description operations with the Audio Format API, you can get a format's name. You pass an Audio Stream Basic Description to the Audio Format API, and it will generate a name, either a name for a compressed format, or if you pass it a certain linear PCM format, it will generate a CFString and tell you what that is.

You can also pass it a partially filled out Audio Stream Basic Description. That's mostly useful for constant bit rate data formats, to find out what the bytes per packet is for IMA4, for example. You can ask, is a format variable bit rate? Is it externally framed? You can ask what encoders are installed on the system, and what decoders are installed on the system.

Okay, so this is what the Audio Format API looks like. There's two calls. One is Audio Format Get Property Info. You use that to find out the size of a property. There's a property ID, which tells the Audio Format API what property you're asking about, and then there's the specifier, which gives an argument. And then it returns you the size of the property you're asking for. And then there's the Get Property call, which you use to actually get the value of the property.

All right, so here's an example of using it. Here I'm finding out what encoders are installed in the system. I make a get property info call to find out the size of the array of encoders. It's going to return me a list of OS types for these encode formats. Then I call audio format get property to get the array of format IDs.

Then I enter a loop here where I call the format API to give me a name for each format. I create a small Audio Stream Basic Description, get a name for the format, and print them out. Then on my system this is what I got printed out.
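For reference, a hedged sketch of that enumeration using the Audio Format API; error handling is omitted, and the property names are the ones in today's AudioToolbox headers:

```c
#include <AudioToolbox/AudioFormat.h>
#include <CoreFoundation/CoreFoundation.h>
#include <stdlib.h>

// Ask which encode formats are installed, then print a name for each one.
UInt32 size = 0;
AudioFormatGetPropertyInfo(kAudioFormatProperty_EncodeFormatIDs, 0, NULL, &size);

UInt32 count = size / sizeof(UInt32);
UInt32 *formatIDs = (UInt32 *)malloc(size);
AudioFormatGetProperty(kAudioFormatProperty_EncodeFormatIDs, 0, NULL, &size, formatIDs);

for (UInt32 i = 0; i < count; ++i) {
    AudioStreamBasicDescription asbd = { 0 };
    asbd.mFormatID = formatIDs[i];        // a small, partially filled out description

    CFStringRef name = NULL;
    UInt32 nameSize = sizeof(name);
    AudioFormatGetProperty(kAudioFormatProperty_FormatName,
                           sizeof(asbd), &asbd, &nameSize, &name);
    CFShow(name);                          // print the format's name
    if (name) CFRelease(name);
}
free(formatIDs);
```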

[Transcript missing]

describes the channel ordering of a stream. Now, Audio Stream Basic Description tells you the number of channels in a stream, but it doesn't really tell you what they are. So, if you have one channel or two channels, you can pretty much guess that's mono or stereo. But if you have five channels, it could be one of several orderings of 5.0. If you have six channels, it could be 6.0 or it could be 5.1 in several different orderings. So, you need a way to find out what that is.

The Audio Channel Layout has several ways of specifying orderings. There's an integer tag for a bunch of predefined layouts. In a lot of cases, you can just pass around these integer tags to tell you what channel ordering you have. There's a bitmap for USB WAV style layouts. So in WAV files or USB, there's a bitmap to tell you which channels are present in a stream, and then they have to be present in a certain order. And then there's an array of channel descriptions which you can use to describe arbitrary layouts. So the structure looks like this. There's a channel layout tag, which is one of these predefined layouts. There's a bitmap. And then there's an array of channel descriptions.

Lots of formats are predefined. We defined just about everything we could think of or find references to. And then one thing you can do with these integer tags is mask off the low 16 bits, and that will tell you the number of channels in the format. The whole tag will be more specific and tell you what kind of ordering there is.
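As a small illustration of that masking, assuming the AudioChannelLayoutTag constants from CoreAudioTypes.h:

```c
#include <CoreAudio/CoreAudioTypes.h>

// The low 16 bits of a layout tag carry the channel count; the high bits
// distinguish the particular ordering.
AudioChannelLayoutTag tag = kAudioChannelLayoutTag_MPEG_5_1_A;
UInt32 channelCount = tag & 0x0000FFFF;    // 6, for any of the 5.1 orderings
```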

There's two special tags. One is use channel descriptions, which means that you can't know anything from looking at the tag; you have to look at the array of channel descriptions to find out what channels are present. And then there's use channel bitmap, for if you're dealing with a USB/WAV style layout.

You can see here that there's four different kinds of layouts for 5.1. This sort of illustrates the kind of problem that this is trying to solve. Various streams will be in different ones of these formats, and you have to be able to differentiate them.

The channel description struct is for the array of channel descriptions. There's a channel label that tells you whether it's left, right, left surround, and so on. And then there's some optional coordinates. So if you wanted to specify speaker positions using floating point coordinates, you can do that in rectangular or spherical coordinates. Okay, the channel label is an integer. It tells you just basically which channel you're dealing with.

There's basic ones and then there's lots of more esoteric ones. These are some of the channels defined by the theater industry, along with my favorite channel, the left surround direct channel, the LSD channel. For the audio channel layout operations, you can get the number of channels in a layout.

You can get a full description from a layout tag or a bitmap. So if you have one of these integer tags like 5.1a, you can have it give you an array of the channel descriptions telling you what the channel label is all filled out so you know which channel is where in a certain layout.

You can get a matrix of coefficients for use with the matrix mixer, which I'll go over in a bit, for doing downmixing. You can get a name for a layout. So if you have some layout, you can pass it here to the Audio Format API and it'll give you a CFString, which you can print out. You can get a name for a channel.

So if you have a certain channel and you want to find out a name you can print for the user, these are localized strings. This is what the left channel looks like. You can get that, and that'll be localized. And then you can also put these audio channel layouts into AIFF files.

All right, so with Audio Units, you can get the channel layouts an Audio Unit supports. This is a new feature for Panther. So Audio Units can support channel layouts, and you can get or set a channel layout for an Audio Unit stream. For example, the Matrix Reverb can have stereo, quad, or 5.0 channel layouts as output.

Okay, you don't have to support channel layouts. If you're doing an Audio Unit that's only doing mono or stereo, or it doesn't really care about spatial location like a filter, then you don't have to support audio channel layouts. But if you're doing a Reverb or a Panner or something that can deal with certain numbers of channels, like if you're dealing with five channels and you want to be able to support multiple channel orderings, you can do that with Audio Units. So the Audio Converter also supports channel layouts. You can use the AAC Codec to support various channel layouts when encoding to AAC. There's a property for getting the available encode channel layouts and for setting the channel layouts when you're encoding.

So there's also the Audio File Global Info API, which is related to the Audio Format API, but it gets information about audio files. You can ask it what file types can be read, what can be written. You can get names for the file types, and you can find out what stream formats a certain file type can have put into it. You can also find out what file extensions go with a given file type.

So it's basically symmetrical to the Audio Format API. There's a property ID, a specifier, and then for the info you get the size of the property, and for the get property call you get the property data itself. So here's an example of finding out what writable file types are on the system.

It's almost identical to finding out what encoders are on the system. I find out the size of the array I'm going to get back using the info size call, then I get the array of file types that can be written, and then I go into a loop and use the file type name property to get a CFString that I can print out. And so this is what I get in Panther for the Audio File API.
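A hedged sketch of that file-type enumeration, using the Audio File global info property names found in today's AudioToolbox headers; error handling is omitted:

```c
#include <AudioToolbox/AudioFile.h>
#include <CoreFoundation/CoreFoundation.h>
#include <stdlib.h>

// Ask which file types can be written, then print a name for each one.
UInt32 size = 0;
AudioFileGetGlobalInfoSize(kAudioFileGlobalInfo_WritableTypes, 0, NULL, &size);

UInt32 count = size / sizeof(UInt32);
UInt32 *types = (UInt32 *)malloc(size);
AudioFileGetGlobalInfo(kAudioFileGlobalInfo_WritableTypes, 0, NULL, &size, types);

for (UInt32 i = 0; i < count; ++i) {
    CFStringRef name = NULL;
    UInt32 nameSize = sizeof(name);
    AudioFileGetGlobalInfo(kAudioFileGlobalInfo_FileTypeName,
                           sizeof(types[i]), &types[i], &nameSize, &name);
    CFShow(name);                          // e.g. a string like "AIFF" or "WAVE"
    if (name) CFRelease(name);
}
free(types);
```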

Okay, one new audio unit in Panther is the Matrix Mixer Audio Unit. It's an audio unit that can take any number of inputs to any number of outputs, and they can be bundled in streams of any size. And the CPU usage depends only on the number of non-zero cross points in the matrix, not the size of the matrix. And you can get metering on inputs, cross points, and outputs.

So Matrix Mixer is useful for signal routing, channel reordering, surround downmixing, generalized panning, and generalized mixing. All the input and output buses are flattened to arrays of mono channels; it's a big matrix of mono channels. Four gains control each cross point: there's a gain on input, a cross point gain, an output gain, and a master gain for the entire matrix. This is a game of Go where black is losing.

So this shows the input buses coming in. They're flattened to a set of four channels; I have stereo buses, so they're flattened to four channels. There's a gain on each input. And the channels get numbered across all the buses in just a linear fashion, so they go 0, 1, 2, 3 for addressing each channel.

And then I'm using a black circle here to represent a cross point that has a non-zero gain, and the open circles show zero gain. So here I've got input bus 0 being mixed to output buses 0 and 1, and input bus 1 is being mixed just to output bus 0. So that's how it's laid out. And you're only paying CPU cost for the black circles. So as you turn up more gains, you'll have more CPU load. Okay, so I'm going to demonstrate that now.

Okay, so here's the Matrix Mixer, or just my UI on top of the Matrix Mixer. I've got two stereo input buses coming in, and I've got five channels going out in one bus. So if I hit play here, you can see on the left, these are the pre-fader input meters. I'll turn up my master gain here.

So as I turn these input faders on... You can see this is the post fade input meter. And I'll turn on these cross point gains. So I'm going to map this channel to channel zero of the output and I'll map this.

[Transcript missing]

And then I can sort of put this backwards into the surround channels here. Okay, so you can see the metering on the cross points here and the metering on the outputs.

So I'm mixing another sound in here. And I can mix it also into the surrounds here. This is center here. Okay, so that shows basically setting the levels and the metering. In addition, you can enable and disable buses, and that basically turns off pulling on that bus of input. So basically it's like pausing a section of your input.

Yeah, so, all right, I implemented this so I can just demonstrate; you know, you can automate these programmatically to do any kind of panning algorithm you want to do. One of the kind of interesting things you can do with a matrix mixer is multi-channel panning in various manners.

It's just the rear speakers. Okay. I can also disable the output. That just mutes it basically. Okay, so that's the Matrix Mixer. I'll go back to slides now. All right, so in order to set the gains on the Matrix Mixer, you can set all gains from global scope.

Audio units, when you set parameters, you set parameters using either input scope, output scope, or global scope. With the Matrix Mixer, you can set everything from global scope, although you can set input gains and output gains from input or output scope. But this shows how you would specify a certain gain in the cross points of the matrix. You need to specify these using the element argument to Audio Unit Set Parameter, and you do that by shifting the input channel left by 16 and ORing in the output channel.

For an input gain, you set the output channel part to hex all Fs, four Fs, ORed with the input channel shifted left by 16; for the output gain you do the converse operation; and the master gain is all Fs. Okay. So, as you saw, you can get metering: prefade and postfade metering on inputs, and postfade metering on outputs and cross points.

In order to get the metering, you need to set the audio unit property metering mode to one. Metering does take some CPU, so you want to have the option to turn it off if you don't need it. And the parameters for metering are accessed in the same manner as the gains, by shifting the input left 16 in the element and ORing with the output.
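Here is a rough sketch of addressing a cross point and turning metering on; matrixMixer is an already-created Matrix Mixer audio unit (an assumption), and the parameter and property names are the ones in today's AudioUnit headers:

```c
#include <AudioUnit/AudioUnit.h>

// The element for a cross point packs the input channel into the high 16 bits
// and the output channel into the low 16 bits.
UInt32 inputChannel  = 1;
UInt32 outputChannel = 0;
AudioUnitElement crossPoint = (inputChannel << 16) | outputChannel;

AudioUnitSetParameter(matrixMixer, kMatrixMixerParam_Volume,
                      kAudioUnitScope_Global, crossPoint,
                      0.5f,                  // cross point gain
                      0);

// The master gain lives at element 0xFFFFFFFF.
AudioUnitSetParameter(matrixMixer, kMatrixMixerParam_Volume,
                      kAudioUnitScope_Global, 0xFFFFFFFF,
                      1.0f, 0);

// Metering costs some CPU, so it stays off until you enable it.
UInt32 meteringOn = 1;
AudioUnitSetProperty(matrixMixer, kAudioUnitProperty_MeteringMode,
                     kAudioUnitScope_Global, 0,
                     &meteringOn, sizeof(meteringOn));
```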

Okay, so the bus enable that I showed you, the Matrix Mixer parameter bus enable: if an input bus is disabled, then it won't be pulled. So that's one way you can manage CPU load when you're using the Matrix Mixer. If you disable input buses, then you're basically turning that part of your input graph off. If you just set the input gain to zero, that input will still be playing; you're just not hearing it. So it's a different way to do something like that.

In order to set up the MatrixMixer, before you can use it, you need to set the number of input and output buses using Audio Unit Property Bus Count. And you need to set the number of channels in the stream formats of each bus. This defines the size of the matrix so that it can allocate itself to the proper size.
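A sketch of that setup step; matrixMixer, stereoIn, and fiveOut are assumed to exist already (an audio unit instance and two stream descriptions filled out elsewhere):

```c
#include <AudioUnit/AudioUnit.h>

// Size the matrix before initializing: the bus counts plus the stream format
// (channel count) on each bus define how big the matrix is.
UInt32 numInputBuses  = 2;
UInt32 numOutputBuses = 1;

AudioUnitSetProperty(matrixMixer, kAudioUnitProperty_BusCount,
                     kAudioUnitScope_Input, 0,
                     &numInputBuses, sizeof(numInputBuses));
AudioUnitSetProperty(matrixMixer, kAudioUnitProperty_BusCount,
                     kAudioUnitScope_Output, 0,
                     &numOutputBuses, sizeof(numOutputBuses));

for (UInt32 bus = 0; bus < numInputBuses; ++bus)
    AudioUnitSetProperty(matrixMixer, kAudioUnitProperty_StreamFormat,
                         kAudioUnitScope_Input, bus,
                         &stereoIn, sizeof(stereoIn));     // stereo on each input bus

AudioUnitSetProperty(matrixMixer, kAudioUnitProperty_StreamFormat,
                     kAudioUnitScope_Output, 0,
                     &fiveOut, sizeof(fiveOut));           // five channels out

AudioUnitInitialize(matrixMixer);    // now the mixer can allocate itself to the proper size
```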

Okay, also in Panther, there's a new Panner unit, as Bill was saying, a Panner unit class for doing mono, stereo, or N-channel inputs to N-channel outputs, possibly using audio channel layouts. You can use it to do panning, or you can use it to do inline faders for channel volumes. Alright, so that's about it for the wrap up. Okay, I think a couple of these sessions are over. There's Audio and QuickTime tomorrow morning. And, uh, quality documentation.