Media • 1:03:11
Professional-level audio is designed right into Mac OS X and iPhone OS with Core Audio. Get an overview of the Core Audio architecture and learn to take advantage of its richness in your own code. Learn how to write code that works on both the Mac and iPhone. Deepen your understanding of audio codecs and how to work with popular audio formats.
Speaker: Jeff Moore
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it may have transcription errors.
So I'm Jeff Moore, and I'm going to be talking about the architecture of Core Audio. The talk's going to be divided into two parts. In the first part, I want to give you a taste of some of the generalities behind the Core Audio architecture. I'm going to talk a little bit about what generally holds the pieces together. And then I'm going to talk a little bit about two specific concepts that are used throughout the APIs: the first is the property mechanism, which is an API construct, and the second is format agnosticism, which is more of an API concept and philosophy.
After that, I'm going to introduce you to some of the specific core audio APIs using four general usage cases, such as converting audio data from one format to another, reading and writing audio data from a file, as well as adding an effect to some data and playing audio data out to the speaker.
When you look at the Core Audio API set, you see a couple of hundred functions across a couple of dozen header files in a couple of different frameworks. It's kind of a lot to get your head around. But there is some method to that madness. The first thing to know is that even though there seem like a lot of APIs, each API is focused on one specific job, and they tend not to color outside the lines. So if you're dealing with an audio file API, you're not going to have to worry about it doing other things besides dealing with audio files. Now, even though each API is focused in on its own world, we built the APIs such that they are cooperative and can be strung together in order to perform more complicated tasks.
We've also organized the API into basically three layers. At the lowest layer, you have mostly APIs that deal with audio hardware, that deal with talking to the hardware, such as the IO Audio driver family, the audio HAL that talks to those drivers, CoreMIDI, et cetera. Now, at the middle tier of APIs, that's where you're going to find kind of the meat of the Core Audio API set. That's where you're going to find the APIs you use the most, such as the audio converter, the audio file API, the audio unit APIs, AU graph, and a bunch of others. And then we have a few high-level APIs. Now, when we talk about high-level APIs in the core audio world, what we really mean is an API that combines two or more of the lower-level APIs into one easier-to-use package. And examples of our high-level APIs are the extended audio file API, OpenAL, the audio queue services, and a few others.
So another part of the APIs that kind of makes everything kind of hang together is this notion of properties. A property represents some attribute or some piece of state of an API object. And you use a key value pair to describe this little aspect of the audio object and to manipulate it as well. Now, a given object in the API might have lots of properties on it. So in order to be able to tell what property you're talking about, each property has an address. Now, for most of the Core Audio APIs, the address is just a 32-bit integer selector. However, several APIs further qualify the address using a scope and element selector to allow you to talk about different sub-pieces of a given API object.
Now, the value of a property can be pretty much whatever the API needs. If it's a CF object, it'll be a CF object. If it's a structure, it'll be a structure. Basically, it's organized as a pointer to a blob of memory and a size that says how big that blob of memory is. Now, some APIs provide what we call property listeners. A property listener is a callback that you can install on an object in the API that will get called whenever the value of that property changes. And this is how you keep up with some of the busyness of some of the APIs as they go about doing their work. Now, as I said, properties are kind of a unifying mechanism within the API set, and as such, you're going to find that all the Core Audio APIs have several property-related functions in common. The first one you'll notice is the getPropertyInfo function. This is the function that provides you metadata about a property, such as whether or not it exists, whether or not you can change the value, as well as how many bytes the value of the property is going to take up in memory. Then you also have the get property and set property methods. These routines are there so that you can get the value and change the value. And finally, for APIs that support listeners, you have the add and remove property listener functions, which allow you to install and remove the callback that will get called when the value of a property changes.
Now, in addition to properties, which is kind of an API mechanics thing that makes the API set hang together, we also have this concept of format agnosticism that runs throughout the API set. Now, what we mean by format agnosticism is more of a mindset than anything mechanical. It's about organizing your code so that you approach it without any preconditions on the data formats you're dealing with. This allows you to use whatever data comes your way, no matter what it is, as long as it can be properly described by the APIs we'll talk about in a minute. As I said, pretty much every Core Audio API practices format agnosticism. So by following along with what the APIs do, you will naturally build it into your program.
Now, even though we talk about not really trying to focus on the individual details of a format, the Core Audio API set does have what we call canonical sample formats. And the job of the canonical sample format is first and foremost to be the default format you get when you don't specify anything else. The canonical format is also chosen to be rich enough that it can be used as an intermediate format in various format conversions. And finally, the other big job in picking the canonical format is it has to be an efficient format for the platform you're on and in the situation you're in.
Now, on Mac OS X and on the iPhone, we have basically two kinds of canonical sample formats, and we represent them both using a typedef as well as a set of flags. The first one I want to talk about is AudioSampleType and kAudioFormatFlagsCanonical. This represents the canonical format that is used in I/O situations. On the desktop, this is a 32-bit native-endian floating point format. On the iPhone, however, it's different. It's a 16-bit integer format.
The other canonical format is the one that's used for audio units and also in other signal processing situations. And it's represented by AudioUnitSampleType and kAudioFormatFlagsAudioUnitCanonical. Now, this format differs from the I/O format. As I said, it comes up in places where you're going to be doing processing and you need headroom. As such, on the desktop, the format is going to be a 32-bit float, exactly the same as the I/O format. However, on the phone, you're going to find that it's an 8.24 fixed point sample format, which is much more efficient on a processor that doesn't have good floating point hardware. Now, even though you're being format agnostic, you still have to know a certain amount of base-level information about the format you're dealing with. And we've boiled that information down into this structure here, the AudioStreamBasicDescription.
Throughout this talk, and in our documentation and other places, we often refer to the AudioStreamBasicDescription by its abbreviation, ASBD. So as you can see, this structure has several fields, some of which might jump out at you as familiar, such as the sample rate field and the bits per channel and channels per frame fields. And those mean pretty much what you think they do. They represent the sample rate of the data stream, how many bits deep the samples are, as well as how many channels are in the data stream. The other fields are just as important, starting with the format ID and format flags fields. Now, these two fields together allow you to identify the general category of the audio data, whether it be linear PCM or AAC or MP3, Apple Lossless, what have you. Now, the other fields, the bytes per frame, the frames per packet, and the bytes per packet fields, are there to describe different ways of breaking down the data stream into component pieces. And we'll talk a little bit more about where that's important in a few minutes.
Now, before we go on and look at a few ASBDs, I just wanted to talk a little bit about how the structure works. Now, for any given format, not necessarily all the fields in the ASBD are going to be used. For some formats, a field just doesn't make any sense, or you just can't know that information about the format. Now, in those cases where you have a field in an ASBD that's not needed for the format, you have to set that field to zero. You have to be sure you do that, or otherwise you're going to get the ASBD rejected by various API calls. Now, I want to talk a little bit about some terminology we use in the Core Audio world, particularly these three words here: sample, frame, and packet. Now, these terms are very overloaded in the computer science world as well as in the signal processing world. But when we use them within the Core Audio context, we mean something very specific. A sample is a single data point for a given channel in an audio stream.
A frame is a collection of samples that are time coincident. You can think of it as kind of the left and right stereo pair of a stereo signal. And then you have packets. Now, packets are important here, but for now it's sufficient to define a packet as a collection of frames that go together for some purpose. It's also important to know that the frames in a packet are all contiguous.
Here we see an AudioStreamBasicDescription filled out for two channels of data in the I/O canonical sample format at a 44.1 kHz sample rate. As you walk through the fields here, you can see that we've filled out each field appropriately for this format. Starting with the format ID and format flags fields, you can see that we've set them to linear PCM, and we're using kAudioFormatFlagsCanonical to set the flags. We've also set the sample rate, and we've set the bits per channel by using the AudioSampleType typedef; we've taken the size of that and multiplied by eight to get the number of bits for each channel. Then for the other fields, we multiply that size by the number of channels, which gives us two times that size for the bytes per frame, and the same for the bytes per packet. Now, it's interesting to note with linear PCM that the number of frames in each packet is always going to be one.
And when we talk about packets a little bit more later, you'll understand why that is. Now, another interesting thing to note about this structure is that it is also platform agnostic. This structure is filled out correctly whether you're on the desktop or the iPhone. Now, the way we did that is by making sure you're using the AudioSampleType data type and the canonical format flags. Those data types will change in size depending on what platform you're compiling for.
Now, here we see the audio unit canonical sample format, similarly with a two-channel 44.1 kHz stream. Now, going through this, the main differences to notice are that we're using kAudioFormatFlagsAudioUnitCanonical to describe the format flags, and we're using AudioUnitSampleType to describe the size of the sample. Now, one thing to note about the audio unit canonical type is that this format is what we call a non-interleaved format. What this means is that each channel is represented by its own buffer in the processing chain. As such, you have to be sure you fill out the ASBD to reflect this. So you can see here we filled out the bytes per frame as one times the size of AudioUnitSampleType, whereas in the previous ASBD we did two times. And that's because of the difference between interleaved versus non-interleaved formats.
Now, here we see a slightly more complicated ASBD. It represents a two-channel format that has 24-bit samples that have been aligned high in 32 bits of a big-endian word. Now, that sounds like a mouthful, and it's kind of a complicated format, but you can see here, as you go through the different fields, that we can represent the sample type pretty much exactly the way you would expect. Now, the key thing to notice here is how we filled out the bytes per frame field. Here, we use the value of 8, which is 2, the number of channels, times 4 bytes, because each sample is actually 4 bytes wide: it's 24 actual bits carried in a 32-bit word.
And finally, we see an ASBD here that describes the AAC format. Now, the key thing to note here is that the AAC format does not actually allow you to fill out a lot of these fields, because they're not really relevant or even knowable about the AAC data stream. For example, the bits per channel. You can't really talk about the bits per channel of an encoded format. So in this case, we've set it to zero. And the fields that otherwise depend on that are also set to zero, such as the bytes per frame and others. One key field that you will always see filled out for AAC, however, is the frames per packet field, so that you know how many frames are in each packet.
Now, in addition to an ASBD, some formats require an extra blob of data to fully describe the content of a data stream. In Core Audio, we refer to this blob as a magic cookie. And magic cookies have to accompany the data stream that they go with. They can't really be separated. You get the magic cookie when you first create the stream using an encoder, and you absolutely have to be sure you pass it along to the decoder when it comes time to decode the data. Now, as you might imagine, the magic cookie has some fairly interesting information in it, and it's tempting to parse what's inside. But you just have to remember that it is a black box. You really shouldn't do that. Instead, you should use the various Core Audio APIs, such as the audio format API, to get access to the information contained in the magic cookie.
Now, we talked a little bit earlier about packets, and before, we defined a packet as a collection of sample frames that go together for some purpose. Now, going forward, I want to strengthen that definition a little bit and say that a packet is actually the smallest indivisible unit of a given data format.
Now, you might remember I mentioned earlier that the number of frames in each packet of linear PCM is always one. When you think of that in terms of this new definition of a packet, it makes some sense, because the smallest logical block you can break a linear PCM stream down into and still have it make sense is a frame. And that's why a packet of linear PCM is exactly one frame of data.
Now, there are three general kinds of packetization of various data formats that you're going to run into. The first is the simplest, constant bitrate. What this means is that each packet of the format has the same size in terms of both the number of bytes in the packet and the number of sample frames.
And examples of a constant bitrate format are linear PCM and IMA. And the great thing about constant bitrate formats is you always know how to break the stream apart into pieces and manipulate it. You don't need to have any external framing information to know that information, to know how to do that.
The next form of packetization you get is variable bitrate packetization. With variable bitrate, or VBR, as you'll hear me say, the packets have the same number of frames in them, but each packet might vary in size in terms of bytes from packet to packet. Now, that gives you a lot of flexibility in terms of managing the bit rate of an encoded format. And because of that, you're going to find that pretty much most of the encoded formats you run into these days are of a variable bitrate nature, such as AAC, Apple Lossless, MP3, et cetera.
Now, the final packetization I want to mention to you is the variable frame rate packetization. Now, this is kind of the most general packetization type that you can get because each packet can have both a different size in terms of the frames in it as well as the number of bytes in the packet. And as you might imagine, it can become tricky to deal with a packetization of a stream like that.
In the Core Audio world, we represent a packet using the AudioStreamPacketDescription structure. Now, one thing to note is that you're really unlikely to ever run into just one audio stream packet description. You're most likely to see packet descriptions ganged up into arrays, and they are used to describe another buffer of memory that contains audio data. Now, packet descriptions are absolutely required when you're dealing with variable bitrate and variable frame rate formats. Every time you run into an API and you're dealing with that kind of format, you're going to need to be filling out and passing along packet descriptions.
Now, at the bottom of the slide, you can see how a packet description is declared. It has a field that represents an offset into the buffer of audio data where the packet starts, and it has a field that represents how long the packet is in terms of bytes. Now, that middle field is a little tricky. In a variable bitrate format, the number of frames in each packet is the same, so you can get that information from the ASBD for the format. As such, you're going to find that this field, the variable frames in packet field, is always going to be set to zero for VBR data. However, in the case of variable frame rate data, you're going to find that this field is going to contain the number of frames in that specific packet in that part of the data stream.
Now, as you might imagine, ASBDs are kind of important. They're also kind of complicated to fill out. The good news is that there are lots of ways to fill them out. Probably the easiest way to get an ASBD filled out is to just let the Core Audio APIs do it. Every Core Audio API uses an ASBD to represent and describe the format of its data. So consequently, you're going to find that you're going to be getting ASBDs from the APIs and giving them back all the time. So you might as well just use the ones the API hands out to you.
We also provide the audio format API to help you out with filling in ASBDs. The audio format API is a property-based API. In fact, all the API calls in the audio format API are related to property manipulation. Among the things the audio format API is there for is to tell you what formats are installed in the system in terms of encoders and decoders, as well as to answer various questions about how those formats are working on the system at the time. One of the interesting things the audio format API can do for you is that, given just the format ID and the magic cookie, you can have it fill out an ASBD for you.
Now, the final mechanism I want to talk about in terms of filling out ASBDs is to just use the Core Audio SDK. You'll find a class in that SDK, in our public utility section, called CAStreamBasicDescription. It is a big, giant raft of information about formats and filling out ASBDs. Even though the code is C++, and a lot of the Objective-C programmers may be wary of using it, you should still take a look and understand what this code is doing, so that you can at least get the knowledge transferred out of that code and into your head so that you can use it. Now that we've looked at some of the general attributes of the Core Audio API, and we've looked a little bit at ASBDs because they're really important, I want to talk a little bit about some of the usage cases and use those as a stepping stone to introduce you to lots of the actual Core Audio APIs you're going to use in order to accomplish them.
So the first case I want to start with is converting audio data from one format to another. Now, one thing to be aware of: the audio converter is only supported on the desktop. It is not supported on the phone. So everything I'm going to talk about in terms of the audio converter for now does not apply to the iPhone.
Now, in order to use a converter, you have to actually make one, and that's what the AudioConverterNew function is for. In order to call AudioConverterNew, you need to have properly filled out ASBDs for the input format and the output format of the conversion. Now, one thing to note is that one or both of the input or output formats has to be linear PCM. In other words, you can go from linear PCM to another linear PCM format. Or you can go from linear PCM to an encoded format. Or you can go from an encoded format back to a linear PCM format. The audio converter does not do transcoding between encoded formats.
Now, once you've created your audio converter, there are usually a bunch of other properties you're going to want to set on it to control the conversion. This is also the time and place where you will tell the audio converter about the magic cookie for your input stream, and you will also use this as an opportunity to set up various settings on the encoder, such as the quality of the encoding or the bit rate to use, etc. Now, after you have an audio converter, you have to be able to move the data through it in order to do the transformation. The audio converter provides two basic functions for you to do this. The first is a very specialized function, AudioConverterConvertBuffer. This function is there strictly for converting between linear PCM formats, and only then if you do not have a sample rate conversion involved.
Now, we also provide the AudioConverterFillComplexBuffer function. This function can do linear PCM to linear PCM conversions, and it can also do it with rate conversion. In fact, this is the function you're going to call to convert pretty much any kind of format to another.
Now, we were talking a little bit earlier about format agnosticism. And if you're following those practices, you probably won't even use AudioConverterConvertBuffer, and will just concentrate your code on AudioConverterFillComplexBuffer, given that it's the most flexible way to move the data through the converter.
So you can pull data out of the converter using the appropriate call. Now you need to be able to get data into the converter in order for it to be converted. You do this by implementing an input data callback, and you pass the input data callback to the converter when you make your call to AudioConverterFillComplexBuffer. Now, what the converter does with this function is a couple of things. First, this is the function that gets called whenever the converter needs input data for it to do the conversion. Now, in the callback, what you provide back to the converter is just pointers to your data. You don't copy your data. In fact, the converter has been specifically plumbed to eliminate as many extra copies of data as it can for performance reasons. Now, the key thing about these pointers that you return is that they have to remain valid. In other words, they have to continue to point at good memory until your input callback is called again by the converter. Now, the other big job you have to do in your input callback, as I said earlier, is that if you're dealing with a variable bitrate or variable frame rate format, you have to be sure you provide the packet descriptions for the buffers that you're providing to the converter.
So you can get data out, you can push data in. How do you know when you're done? Basically, there are two end-of-stream modes that you need to know about. The first is the one where you just have no more data: you're really at the end of the stream, and you're never going to have any more input data. The way you signal that to the converter is that you return zero packets of data from your input proc, along with no error. Your input proc returns an error code, and in this case you should return zero, meaning no error. This tells the converter that you're done and you don't have any more input data to give it, so the converter will know not to call your input data proc anymore.
Now, the other situation is kind of like the first situation, only it's a momentary situation. Let's say you're decoding data coming over the network, and you're in the middle of downloading, and the packets you need for the next part of the conversion haven't been downloaded yet. You can signal this situation to the converter, again, by returning no packets of data, but also returning an error code. Now, it doesn't really matter what the error code is, but you should keep track of it, because what's going to happen is you're going to return that error code to the converter, the converter is going to note that and see that you mean for this to be the end of the stream for now, and then it will return that error back to you through the return value of the call to AudioConverterFillComplexBuffer.
So then after you're all done, there are a couple of other things you need to know. If you're going to use the converter to do another conversion, let's say, for example, you're playing through a data stream and you're seeking around randomly in the stream, as you finish each segment, you're going to need to call the AudioConverterReset function before you start on the next segment. Now, what the AudioConverterReset function does is tell the converter that you're done with a given segment, to return the converter to its ground state, and to clear out any cached data or any other leftovers. Now, the audio converter has a plugin API called audio codecs. And audio codecs are also used to plug into the audio format API, because codecs are expected to be the experts on the data format that they represent, and as such are going to be used by the rest of the system whenever you ask questions about a given format. Now, codecs come in two basic flavors. You have encoders that take in linear PCM data and output the encoded format. And then you have decoders that take in the encoded format and produce linear PCM.
So in your application, there are a lot of factors that go into deciding on what kind of audio data to use. The first ones are the obvious ones, bit rate and quality. For your application, you may or may not have bandwidth limitations on how much space you have to fit your data into. And if you do, you're going to be very concerned about the bit rate, how much space the encoding will take. And most encoders will have control throttles on bit rate.
Now, the other things you're gonna wanna be concerned about are how much does it cost to encode in terms of CPU, and decode for that matter. And that's gonna need to play into the logic in your application. For example, you might be very concerned about the performance of the decoder if you're trying to decode MP3 and do a bunch of 3D graphics or something like that at a high frame rate in a game or something like that.
Now, another factor that goes into codec choice is something that doesn't come up in a lot of people's thinking too often, and that's the data latency. Now, what we mean by data latency is the amount of delay between putting the first samples into the encoder and the encoder actually getting to the beginning of your encoded data; in other words, how many zeros the encoder has to spit out first.
And this can vary widely from codec to codec. And if you're in a real-time situation, such as like iChat, you might look at using something like AAC low delay, which is a codec that's been specifically optimized to reduce data latency. The final factor you're probably going to want to consider is, you know, what platform are you on and what encoders and decoders does that platform have? The iPhone has some peculiar limitations, including the fact that it has a limited set of decoders available on it. And the other interesting aspect of it is the iPhone only allows you to decode a single instance of AAC, MP3, or Apple lossless at a time. That's not one of each at a time. That's one of any of the three at a time. So you have to keep that in mind when you're developing your iPhone application.
Now, here's a little chart I put together that compares and contrasts several common formats according to the topics I just talked about. I threw 16-bit linear PCM up there just to give you a baseline to compare against. Now, just a word about the encoding cost and decoding cost per channel: those are desktop numbers on a fairly hefty CPU. So you want to take those numbers as a general way to compare and contrast. Don't take them as written in stone for the CPU that you happen to be working on.
One interesting format I want to call out here is iLBC. This is a new codec that we've just added with QuickTime 7.5, which just shipped yesterday. iLBC is a speech codec, and as such, it's geared mostly toward very low bandwidth situations. Among the things that iLBC is really good at is dealing with lossy packet situations, such as in a network environment, and reconstructing the data from what you have.
Now I want to talk a little bit about reading and writing audio data to a file. The aptly named audio file API provides the abstraction that you're going to use for reading and writing files. The audio file API provides a set of global properties that describe the capabilities of the system. They have much in common with the audio format API, except that they deal specifically with the file types that are supported. It'll tell you what the readable file types are. It'll tell you what the writable file types are. And then for each of those types, it'll tell you what kind of audio data you can stick in each of those files.
In the audio file API, files are specified using a CFURL that points into the file system. And they have to be file system URLs. You can't pass an HTTP URL or anything like that. And when you're creating a new file, you need to have the audio file type ID of the file you want to create. This tells the audio file API specifically whether you want a WAV file or an MPEG-4 file or an AIFC file or what have you. You also need to have the ASBD of the format of the data you're going to put in the file. And then beyond that, other information about the file is going to be set on the file after you've created it, using the audio file property API.
Existing files can be opened and just like every other API, the audio file API has properties in order to access the information about that file. That's how you're going to find the ASBD and Magic Cookie for the file, how you're going to find the channel layout of the file, as well as the packet descriptions, et cetera. And then there's also properties for metadata such as markers, regions, lyrics, album covers, pretty much whatever you can stick in the file.
One word of caution: when you're using some kinds of audio files, you have to be careful about what sort of information you query of the file in order to avoid taking some rather large performance hits. For example, files like MPEG-1 files or ADTS or AC3 files are self-packetized files. In other words, the packetization is embedded in the file. There's no external framing information in those files. So in order to discover all the information about all the packets in the file, you have to go through and pretty much parse every packet in the file. And as you might imagine, for longer files, that can take a significant amount of time when all you want to do is open the file up and ask how long it is. So what we provide in the audio file API are some less time-consuming alternatives that give you some estimations as to some of these important quantities you might want to know. For example, there's a property to get the estimated upper bound on the packet size, which is how you might want to govern your buffer allocation when reading from the file. And then we also have a property for returning an estimated duration, which does not need to parse the whole file to get to it.
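The property names here are real (kAudioFilePropertyEstimatedDuration and kAudioFilePropertyPacketSizeUpperBound); the arithmetic behind that kind of estimate can be sketched portably. For a self-packetized file, you can approximate the duration from the data size and an average bit rate rather than parsing every packet. This is an illustrative calculation, not the audio file API's actual implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate a duration without walking the packets: total audio bytes
   times 8 bits per byte, divided by an average bit rate in bits/second.
   A sketch of the idea, not Core Audio's internal algorithm. */
double EstimateDurationSeconds(uint64_t audioDataBytes, double averageBitsPerSecond)
{
    if (averageBitsPerSecond <= 0.0) return 0.0;
    return (double)audioDataBytes * 8.0 / averageBitsPerSecond;
}
```

One megabyte of 128 kbps data comes out to roughly 65.5 seconds, and the estimate costs nothing no matter how long the file is; that's the trade the estimated-duration property is making for you.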
So reading and writing in the audio file API are pretty much mirror images of each other in terms of the operation. Both calls will block until they're complete. And both calls can either go through the file system cache or not, depending on what you want. Now, there are two styles of reads and writes in the audio file API, one that goes by bytes and one that goes by packets. You can use the one that uses bytes, but in keeping with the format agnosticism theme, you really want to do your IO in terms of whole packets. For instance, it's really the only way to reasonably handle variable bitrate data. And it also makes it much easier to manipulate time, because packets represent a unit of time as well as being an organizational concept.
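Here is a portable sketch of what packet-oriented reading looks like, against an in-memory mock of a VBR "file". The packet-description struct mirrors Core Audio's AudioStreamPacketDescription; the real call with this shape is AudioFileReadPacketData. The point is that you always move whole packets, never arbitrary byte ranges, which is what makes VBR data and time arithmetic tractable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors AudioStreamPacketDescription: where a packet lives in the
   file and how big it is. For VBR data, sizes differ packet to packet. */
typedef struct {
    int64_t  mStartOffset;
    uint32_t mVariableFramesInPacket;
    uint32_t mDataByteSize;
} PacketDesc;

/* Copy up to *ioNumPackets whole packets starting at startingPacket into
   outBuffer, stopping early if the buffer fills. Returns bytes copied and
   updates *ioNumPackets with the count actually read. */
uint32_t ReadPackets(const uint8_t *fileData, const PacketDesc *descs,
                     uint32_t totalPackets, int64_t startingPacket,
                     uint32_t *ioNumPackets, uint8_t *outBuffer, uint32_t bufferSize)
{
    uint32_t copied = 0, n = 0;
    for (uint32_t i = 0; i < *ioNumPackets; i++) {
        int64_t p = startingPacket + i;
        if (p >= (int64_t)totalPackets) break;
        if (copied + descs[p].mDataByteSize > bufferSize) break;
        memcpy(outBuffer + copied, fileData + descs[p].mStartOffset,
               descs[p].mDataByteSize);
        copied += descs[p].mDataByteSize;
        n++;
    }
    *ioNumPackets = n;
    return copied;
}
```

Because each packet represents a fixed number of frames for most formats, packet index times frames per packet divided by the sample rate gives you the stream time, which is exactly why packet-based IO makes time manipulation easy.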
Now, another operation the audio file API provides is a means to optimize the file. And so when you tell the audio file API to optimize a file, it'll go through and re-lay the file out on disk to optimize access performance, so that the data is offset into the file at an appropriate place and you don't take cache misses in order to read it, and stuff like that.
So we also provide a high-level API called the Extended Audio File API. Now, this API, unlike the regular audio file API, is only available on the desktop. And what the extended audio file does is it unites an audio file with an audio converter. And what this does is it allows you to think about the operations on the file as if you were dealing purely with linear PCM, even if the file was encoded in something like AAC or lossless or what have you. The extended audio file has analogous routines to the regular audio file API for reading and writing to both existing files and creating new files.
And then just like the audio file API, once you've created or opened a file with the extended audio file API, you're going to use properties in order to get more information about it. And in particular, there's one property you need to set in order to tell the audio file what format of data you want to give it, or in the case of reading, what format of data you want it to give you.
And that property is really important, or else things aren't going to work quite the way you expect. Now as I said, the point of the extended audio file is to make it so that you are conceptually thinking about the data in the file as if it was linear PCM. As such, all the operations in the API are handled in terms of sample frames. The reads and writes are also very analogous to the audio file API. The difference is that while reads always block, the writes in the extended audio file come in two basic flavors. You have the blocking flavor, which is pretty much the same as what you get with the audio file API. But you also have a non-blocking form. Now, the interesting thing about the non-blocking form is that it will offload the work of writing to the audio file to a separate worker thread on your behalf. The cool thing there is that this is safe to use in an IOProc context. And an IOProc context, which we'll talk a little bit more about in a few minutes, is a place where you're up against a real-time deadline, and doing things like blocking is going to get in the way. So now I want to talk a little bit about applying effects to some audio data.
Now to do that, you're going to be talking about audio units. Audio units are our plug-in API that encapsulates signal processing operations. Now audio units are great: they can be hooked together either manually or using an AUGraph to control the connections between them. As I mentioned before, this is a processing context. So the canonical format that's used in this case is going to be the audio unit canonical format. Now, there are some exceptions to this, and those are the converter units and the output units. The interesting thing about those two categories of audio units is that they also encompass an audio converter, so that they can handle other formats as well as just the canonical format. Now, on the desktop, audio units also provide a GUI component that your app can use to display to the user to allow the user to interact with the parameters of the audio unit.
Parameters are exactly what you think: they're much like properties, except that they represent the controllable portions of the signal processing algorithm. For example, if you're talking about an audio unit that implements an EQ, the parameters are things like the cutoff frequency and the amplitude and the resonance. Parameters, unlike properties, are always the same type of value, and that value is a single 32-bit floating point number. However, the range and meaning of this value is going to be different from parameter to parameter. And the audio unit can fill out an audio unit parameter info structure for you in order to describe what the valid range is, what the units of the parameter are, et cetera.
Audio units also allow you to change parameters by scheduling them in the future, as well as being able to ramp them from one value to another value continuously. And these two operations are really important for applications that are trying to automate their signal processing, such as in a digital audio workstation environment.
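The ramping behavior can be illustrated with the arithmetic a unit would apply across a render cycle. This mirrors the idea behind scheduled and ramped parameter changes, not any specific Core Audio signature: the parameter moves linearly from its start value to its end value over a span of sample frames, which is what automation in a digital audio workstation relies on.

```c
#include <assert.h>

/* The value a ramped parameter takes at a given frame while moving
   linearly from startValue to endValue over durationInFrames frames.
   Past the end of the ramp, the parameter holds its final value. */
float RampedValue(float startValue, float endValue,
                  unsigned durationInFrames, unsigned frame)
{
    if (frame >= durationInFrames) return endValue;
    float t = (float)frame / (float)durationInFrames;
    return startValue + (endValue - startValue) * t;
}
```

A fade from 0 to 1 over 100 frames sits at exactly 0.5 at frame 50; evaluating this per frame (or per small slice) is what makes ramped automation click-free, instead of jumping the value once per render call.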
Audio units organize all their I/O into buses. And each audio unit will always have some number of input buses and some number of output buses. And each individual bus has some number of channels in its stream and may or may not have an audio channel layout that describes the usage of the channels in that bus in surround contexts. Now, when you're dealing with the property API and you're talking about a bus-related property, you have to make sure that you address each bus using its own element.
So in order to use an audio unit, the first thing you have to do is locate it and find the one that you're interested in. Now, one thing we'll say up front is the Component Manager has been deprecated on the desktop. And it's not available at all on the iPhone. So in order to use the various audio components on the system, such as audio units and audio codecs, you need to use the new API in the Audio Unit framework called AudioComponent.h. The AudioComponent API is basically a one-for-one replacement for the Component Manager. For example, you're going to use AudioComponentFindNext instead of FindNextComponent. You're going to use AudioComponentInstanceNew instead of OpenComponent. Now, you have to be sure that as a host you don't try to mix and match audio components with Component Manager components. They are not interchangeable at the host level, even though underneath they may still be implemented as the same underlying API object. Now, don't worry, even though the Component Manager is deprecated, your existing Component Manager code is still going to continue to work. But we would really advise you to start moving away from using the Component Manager for audio unit discovery.
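As an aside on how that discovery works: AudioComponentFindNext matches against an AudioComponentDescription, where a zero in any field acts as a wildcard. Here is a portable mock of that matching logic; the struct and search function are illustrative stand-ins, not the real API, but the wildcard semantics are the same.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the shape of AudioComponentDescription: four-char codes for
   type, subtype, and manufacturer. Zero means "match anything". */
typedef struct {
    uint32_t componentType;
    uint32_t componentSubType;
    uint32_t componentManufacturer;
} ACDesc;

int Matches(const ACDesc *want, const ACDesc *have)
{
    return (want->componentType == 0         || want->componentType == have->componentType)
        && (want->componentSubType == 0      || want->componentSubType == have->componentSubType)
        && (want->componentManufacturer == 0 || want->componentManufacturer == have->componentManufacturer);
}

/* Iterate a registry the way repeated AudioComponentFindNext calls do:
   resume the search after the previously found index. Returns -1 when
   there are no more matches. */
int FindNext(const ACDesc *registry, int count, int afterIndex, const ACDesc *want)
{
    for (int i = afterIndex + 1; i < count; i++)
        if (Matches(want, &registry[i])) return i;
    return -1;
}
```

Searching for {type: 'aufx', subtype: 0, manufacturer: 0} is how a host enumerates every effect unit on the system, calling the find function repeatedly until it returns no match.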
So once you've found your audio unit, you configure the audio unit by, guess what, using properties. And this is the time where you're going to set the sample rate of the audio unit, as well as the maximum number of frames that you're ever going to call the audio unit to render in one single call.
Now, both properties are important. You know, the sample rate for the obvious reason. But you have to be sure you set the max frames on the audio unit, because if you don't, you might get an unexpected error down the road when you're on the real-time thread trying to render, because you're asking for more data than the audio unit is prepared to render. Now, you can then set up other properties for the audio unit, such as algorithm-specific attributes like reverb time or EQ cutoff frequency or whatever. And this is also the time at which you're going to set up the audio unit's connections to other audio units. Audio units support several different flavors of callback, and you're going to want to install your callbacks at initialization time as well.
So once you've finished configuring your audio unit, you then call AudioUnitInitialize to tell the audio unit that you're done configuring it and you want it to get ready to process some audio data. This is the signal to the audio unit that it can go ahead and allocate large tables, delay lines, load impulse response files, and do other things that might take up a lot of space or otherwise take some time to accomplish, things that need to be done before you start rendering.
And then in order to tell the audio unit to render, you just call AudioUnitRender. And when you call it, you have to pass in the timestamp that indicates the relative stream position that you want the audio unit to render for. You also have to pass in the number of frames you want the audio unit to render. And again, be careful: you have to make sure you don't pass a number of frames larger than the max frames that you set in the initialization sequence. And then you also have to pass in an audio buffer list in order to receive the rendered buffers from the audio unit. Now, the buffers that you pass in the audio buffer list can come two ways. You can either pass pointers to actual blocks of memory, in which case you're telling the audio unit that you want it to render its output into the memory that you provided.
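One consequence of that max-frames rule is that a host rendering a long stretch of audio has to slice its AudioUnitRender calls so no single call exceeds the configured kAudioUnitProperty_MaximumFramesPerSlice. Here is a portable sketch of that slicing loop, with a function pointer standing in for the actual render call; the names are illustrative.

```c
#include <assert.h>

/* Stand-in for a render call: render numFrames starting at startFrame,
   returning 0 on success (as AudioUnitRender returns noErr). */
typedef int (*RenderFn)(unsigned startFrame, unsigned numFrames, void *user);

/* Break totalFrames into calls of at most maxFramesPerSlice frames,
   which is what a host must do to honor the max-frames property. */
int RenderInSlices(unsigned totalFrames, unsigned maxFramesPerSlice,
                   RenderFn render, void *user)
{
    unsigned pos = 0;
    while (pos < totalFrames) {
        unsigned n = totalFrames - pos;
        if (n > maxFramesPerSlice) n = maxFramesPerSlice;
        int err = render(pos, n, user);
        if (err) return err;
        pos += n;
    }
    return 0;
}

/* Example render proc for the usage below: just counts frames delivered. */
int CountFrames(unsigned startFrame, unsigned numFrames, void *user)
{
    (void)startFrame;
    *(unsigned *)user += numFrames;
    return 0;
}
```

Asking for 4196 frames with a 4096-frame maximum produces two calls, one full slice and one short one, and every frame is still accounted for exactly once.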
Or you can pass null pointers. And this is telling the audio unit that you want it to render into its own internal buffers and to provide you with pointers into them. Now, this is useful for optimizing the data flow and memory copies in your overall chain of audio units. Then after you're done rendering, much like the audio converter, there's an AudioUnitReset call that will return the audio unit to its ground state. And this will also kill things like reverb tails, stop delays, clear out FFT buffers, what have you.
So that's how you get the data out. To get data into the audio unit, you basically have two choices. You can either get the data from another audio unit, in which case you can make a manual connection in the setup phase of your audio unit using kAudioUnitProperty_MakeConnection to directly connect the output bus of one audio unit to the input bus of another. Or you can use the AUGraph API. The AUGraph API's whole reason for being is to provide for the management of the connections in what can be a very complicated graph of connections.
Now, the other way you can get audio into your audio unit is to use a render callback. You install a render callback using kAudioUnitProperty_SetRenderCallback. And much like the input data callback for the audio converter that we talked about earlier, this is the function that the audio unit will then call whenever it needs data for that bus. Now, you have to be careful. These two methods, whether you use a connection to another audio unit or a callback, are mutually exclusive. You can only do one or the other on each bus of the audio unit. But you can mix and match on a per-bus basis.
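The pull model a render callback implies can be sketched portably. The names below are illustrative, not the real AURenderCallback signature, but the flow is the same: when the unit needs input for a bus, it invokes the function you installed, then processes whatever came back.

```c
#include <assert.h>

/* Stand-in for a render callback: fill buffer with frames samples,
   return 0 on success. The real type is AURenderCallback. */
typedef int (*InputCallback)(float *buffer, unsigned frames, void *user);

/* A mock effect unit: it pulls its input through the installed callback,
   then applies its processing (here, just a gain). */
typedef struct {
    InputCallback inputProc;
    void         *inputProcRefCon;
    float         gain;
} MockEffectUnit;

int MockRender(MockEffectUnit *unit, float *out, unsigned frames)
{
    /* Pull: the unit asks you for input, rather than you pushing it. */
    int err = unit->inputProc(out, frames, unit->inputProcRefCon);
    if (err) return err;
    for (unsigned i = 0; i < frames; i++) out[i] *= unit->gain;
    return 0;
}

/* Example input proc for the usage below: a constant full-scale signal. */
int FillOnes(float *buffer, unsigned frames, void *user)
{
    (void)user;
    for (unsigned i = 0; i < frames; i++) buffer[i] = 1.0f;
    return 0;
}
```

This pull structure is why the callback runs on the render thread: the request for input originates from the output side of the chain and propagates backward, one bus at a time.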
So we ship a lot of audio units in the system, of a lot of different kinds, starting with effects and music effects. These are your delays, EQs, high-pass and low-pass filters, dynamics processing, what have you. The difference between a music effect and an effect is that music effects can be controlled using MIDI. And we have panner units. Panner units are there for applying spatialization and other kinds of sound positioning algorithms to a sound. We have format converters, which I mentioned earlier.
Format converters, in addition to being able to accept very non-canonical audio formats, can also be flexible in the way they pull for their input data. As such, this category is also where you find audio units that do things like pitch shifting, time compression and expansion, and other kinds of algorithms that manipulate time.
We also have generators and music devices. These are sources of new audio in an audio graph. And music devices in particular represent things like software synthesizers. Generators can also represent other things like file players or the scheduled sound player audio unit, which we ship on the desktop.
And then we also have mixer units, and you're going to find that we ship several different kinds of mixers on each of the platforms, and that's always going to at least be a stereo mixer and a 3D mixer. On the desktop, we also ship a few other mixers, including the matrix mixer, which is really good for doing routing and other complex mix setups. And then finally, we have output units. Output units are the sinks in a graph. That's where the data goes in order to go out to the hardware, or to be pulled out of the graph and written to a file or what have you.
So now I want to finish up by going over a little bit about how you get your audio data out to the speaker. Now, there are probably more APIs in Core Audio for playing audio than for just about any other single task. So the first method I want to talk about is using the audio HAL. Now, the HAL is among the lowest level pieces of software we have in the stack. And as such, the HAL's job is really to be focused on providing access to the audio device on the audio device's terms. As such, you're not going to find that the HAL provides much in the way of convenience functions. It doesn't provide much in the way of utilities. And it's also very chatty. The HAL provides a lot of notifications, and applications that are talking directly to the HAL are going to be expected to sign up for and handle all these notifications appropriately.
Now, when you're dealing with mixing with the HAL, you're pretty much, as with anything else in the HAL, on your own. In your IOProc, which is what the HAL calls in order to get data from you, you also have to be very, very careful. This proc is called on a real-time thread and is up against a hard deadline. So you can't do things on that thread that are going to cause it to potentially miss that deadline, like block while you're trying to lock a lock, spin on a spin lock, or do file IO, or what have you. Now, given all the general complexity of using the HAL directly, we really don't recommend that many applications do it. We provide higher level abstractions that make it a bit easier. The other thing is that the HAL's not available to you on the iPhone, so you can't even go there on the phone.
So to step up from the HAL, we provide output audio units. Now, the only difference between an output audio unit and a regular audio unit is the addition of two transport methods, AudioOutputUnitStart and AudioOutputUnitStop. And they do what their names imply. They start and stop IO.
There are two basic flavors of output unit that you're gonna wanna deal with. The first is AUHAL, which is only available on the desktop. This audio unit is what does all the really hard and complicated work I was talking about earlier with being a proper HAL client, and it does all this so you don't have to. It's how your audio data gets to the speaker more often than not. On the phone, we don't have AUHAL, because we don't have the HAL. Instead, we have AURemoteIO. However, even though the name is a little different, AURemoteIO still functions pretty much like AUHAL does on the desktop. And it performs the same basic job, in that it's there to do all the hard work of being a proper client on the system, so that your application doesn't have to.
One other thing about output units is that they include an audio converter on their input bus. This allows you to feed an output unit with pretty much any kind of linear PCM format on the desktop and on the phone, except that on the phone we do not support floating point sample formats, so you have to pass fixed-point integer data. And then when you're doing mixing with an output unit, again, you can do it yourself by using a render callback or whatever mechanism we talked about earlier. Or you can connect in one of those mixer audio units I mentioned earlier, and then you can connect your data inputs to the inputs of the mixer. Now, one thing to bear in mind is that the HAL's IOProc rules, which I mentioned briefly earlier, still apply in your render callback when you're talking to an output unit.
Now, another mechanism you can use for getting data out is the OpenAL API. OpenAL is a high level cross platform API for doing 3D audio mixing. Now, it's mostly been optimized for the needs of game developers, but we've also found that it provides a pretty good basis for development on any kind of application.
Now, the good thing about OpenAL is that it's built on top of our 3D mixer. So the capability is there no matter which API you're going to use. Finally, I want to talk a little bit about the Audio Queue. It's one of the high-level APIs we have. And its job is mostly to take buffers in and shepherd them along until they get to the audio hardware, and to do all this with a more simplified interface than what you would get with using an output unit or the HAL directly. One interesting aspect about the queue is that the audio queue owns the buffers that it uses. So whenever you need a buffer to fill up, you're going to ask the queue for it. Now, the queue also supports scheduled playback and has, in general, just a very rich timing mechanism. So even though it's a high level API, it can still be used to perform some pretty delicate synchronization and playback operations. It's the basis of what we use in a variety of our products.
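That buffer-ownership model, where the queue allocates the buffers (AudioQueueAllocateBuffer in the real API) and hands each one back to you for refilling once it has been played (via the callback you pass when enqueuing), can be mocked portably like this. The types and function names here are illustrative stand-ins for the round-trip, not the Audio Queue API itself.

```c
#include <assert.h>
#include <stddef.h>

#define kNumBuffers 3

/* The queue owns the storage; a client only ever borrows a buffer. */
typedef struct { float data[256]; int inUse; } Buffer;
typedef struct { Buffer bufs[kNumBuffers]; } MockQueue;

/* Ask the queue for a buffer to fill. Returns NULL when every buffer is
   still out for playback, which is the client's cue to wait. */
Buffer *QueueGetFreeBuffer(MockQueue *q)
{
    for (int i = 0; i < kNumBuffers; i++)
        if (!q->bufs[i].inUse) { q->bufs[i].inUse = 1; return &q->bufs[i]; }
    return NULL;
}

/* The queue calls back after playback, returning the buffer for reuse;
   in the real API this is when your output callback refills and re-enqueues. */
void QueueBufferConsumed(MockQueue *q, Buffer *b)
{
    (void)q;
    b->inUse = 0;
}
```

Typically you prime two or three buffers up front and then just keep refilling whichever one the queue hands back; because the queue owns the memory, the client never allocates audio buffers on the playback path.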
Now, in review, I started off by talking a bit about the generalities of the Core Audio architecture, the focused, cooperating, layered APIs that make it up. We talked about the property mechanism and how that's a unifying mechanism in the API. And we also talked about being format agnostic. And then I talked a little bit through a few usage cases here that introduced you to several of the APIs that you are gonna run into when you're trying to do things.
So next up, directly after this session, you can run upstairs and learn more about the Core Audio APIs in action on the iPhone. In particular, you'll hear a lot more about the Queue, OpenAL, and a lot of the other services we provide. And then tomorrow afternoon, all the Core Audio engineers will be in Lab D, and you can come and ambush us and ask us lots of questions, get help debugging your tricky problems, what have you.
Then we have lots of resources out there. There are documents, like the Core Audio Overview. Mostly you can get at them through the Apple Developer site; you can see the URL here. You'll find tech notes, our sample code, what have you. And then there's the WWDC Attendee site, which has even more examples and documentation that haven't made it out to the public site yet. Then I want to plug two other things. I want to plug the Core Audio SDK. Even though a lot of the code is in C++, there's lots of stuff there that you can get even if you're not a C++ programmer. And finally, to get in touch with us in general, you should use the Core Audio API mailing list. All the Core Audio engineers monitor that list, and we will try to answer your questions as quickly and succinctly as we can.
Allan Schaffer is our evangelist, and you can get in touch with him for kind of higher-level developer relations problems. And then there's the URL for the documentation, sample code, and other things on, wow, that's an old site, isn't it? I bet it's still there anyway, and believe it or not, there's still useful information there.