Media • 1:03:11
Professional-level audio is designed right into Mac OS X and iPhone OS with Core Audio. Get an overview of the Core Audio architecture and learn to take advantage of its richness in your own code. Learn how to write code that works on both the Mac and iPhone. Deepen your understanding of audio codecs and how to work with popular audio formats.
Speaker: Jeff Moore
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it may have transcription errors.
So I'm Jeff Moore, and I'm going to be talking about the architecture of Core Audio. The talk's going to be divided into two parts. In the first part, I want to give you a taste of some of the generalities behind the Core Audio architecture. I'm going to talk a little bit about what generally holds the pieces together. And then I'm going to talk a little bit about two specific concepts that are used throughout the APIs: the first is the property mechanism, which is an API construct, and the second is format agnosticism, which is more of an API concept and philosophy.
After that, I'm going to introduce you to some of the specific core audio APIs using four general usage cases, such as converting audio data from one format to another, reading and writing audio data from a file, as well as adding an effect to some data and playing audio data out to the speaker.
When you look at the Core Audio API set, you see a couple of hundred functions across a couple of dozen header files in a couple of different frameworks. It's kind of a lot to get your head around. But there is some method to that madness. The first thing to know is that even though there seem like a lot of APIs, each API is focused on one specific job, and they tend not to color outside the lines. So if you're dealing with an audio file API, you're not going to have to worry about it doing other things besides dealing with audio files. Now, even though each API is focused in on its own world, we built the APIs such that they are cooperative and can be strung together in order to perform more complicated tasks.
We've also organized the API into basically three layers. At the lowest layer, you have mostly APIs that deal with audio hardware, that deal with talking to the hardware, such as the IO Audio driver family, the audio HAL that talks to those drivers, CoreMIDI, et cetera. Now, at the middle tier of APIs, that's where you're going to find kind of the meat of the Core Audio API set. That's where you're going to find the APIs you use the most, such as the audio converter, the audio file API, the audio unit APIs, AU graph, and a bunch of others. And then we have a few high-level APIs. Now, when we talk about high-level APIs in the core audio world, what we really mean is an API that combines two or more of the lower-level APIs into one easier-to-use package. And examples of our high-level APIs are the extended audio file API, OpenAL, the audio queue services, and a few others.
So another part of the APIs that kind of makes everything kind of hang together is this notion of properties. A property represents some attribute or some piece of state of an API object. And you use a key value pair to describe this little aspect of the audio object and to manipulate it as well. Now, a given object in the API might have lots of properties on it. So in order to be able to tell what property you're talking about, each property has an address. Now, for most of the Core Audio APIs, the address is just a 32-bit integer selector. However, several APIs further qualify the address using a scope and element selector to allow you to talk about different sub-pieces of a given API object.
Now, the value of a property can be pretty much whatever the API needs. If it's a CF object, it'll be a CF object. If it's a structure, it'll be a structure. Basically, it's organized as a pointer to a blob of memory and a size that says how big that blob of memory is. Now, some APIs provide what we call property listeners. A property listener is a callback that you can install on an object in the API that will get called whenever the value of that property changes. And this is how you keep up with some of the busyness of some of the APIs as they go about doing their work. Now, as I said, properties are kind of a unifying mechanism within the API set, and as such, you're going to find that all the Core Audio APIs have several property-related functions in common. The first one you'll notice is the getPropertyInfo function. This is the function that provides you metadata about a property, such as whether or not it exists, whether or not you can change the value, as well as how many bytes the value of the property is going to take up in memory. Then you also have the get property and set property methods. These routines are there so that you can get the value and change the value. And finally, for APIs that support listeners, you have the add and remove property listener functions, which allow you to install and remove the callback that will get called when the value of a property changes.
Now, in addition to properties, which is kind of an API mechanics thing that makes the API set hang together, we also have this concept of format agnosticism that runs throughout the API set. Now, what we mean by format agnosticism is more of a mindset than anything mechanical. It's about organizing your code so that you approach it without any preconditions on the data formats you're dealing with. This allows you to use whatever data comes your way, no matter what it is, as long as it can be properly described by the APIs we'll talk about in a minute. As I said, pretty much every Core Audio API practices format agnosticism. So by following along with what the APIs do, you will naturally build it into your program.
Now, even though we talk about not really trying to focus on the individual details of a format, the Core Audio API set does have what we call canonical sample formats. And the job of the canonical sample format is first and foremost to be the default format you get when you don't specify anything else. The canonical format is also chosen to be rich enough that it can be used as an intermediate format in various format conversions. And finally, the other big job in picking the canonical format is it has to be an efficient format for the platform you're on and in the situation you're in.
Now, on Mac OS X and on the iPhone, we have basically two kinds of canonical sample formats, and we represent them both using a typedef as well as a set of flags. The first one I want to talk about is AudioSampleType and kAudioFormatFlagsCanonical. This represents the canonical format that is used in I/O situations. On the desktop, this is a 32-bit native-endian floating point format. On the iPhone, however, it's different. It's a 16-bit integer format.
The other canonical format is the one that's used for audio units and also in other signal processing situations. And it's represented by AudioUnitSampleType and kAudioFormatFlagsAudioUnitCanonical. Now, this format differs from the I/O format. As I said, it comes up in places where you're going to be doing processing and you need headroom. As such, on the desktop, the format is going to be a 32-bit float, exactly the same as the I/O format. However, on the phone, you're going to find that it's an 8.24 fixed point sample format, which is much more efficient on a processor that doesn't have good floating point hardware. Now, even though you're being format agnostic, you still have to know a certain amount of base-level information about the format you're dealing with. And we've boiled that information down into this structure here, the AudioStreamBasicDescription.
Throughout this talk, and in our documentation and other places, we often refer to the AudioStreamBasicDescription by its abbreviation, ASBD. So as you can see, this structure has several fields, some of which might jump out at you as familiar, such as the sample rate field and the bits per channel and channels per frame fields. And those mean pretty much what you think they do. They represent the sample rate of the data stream, how many bits deep the samples are, as well as how many channels are in the data stream. The other fields are just as important, starting with the format ID and format flags fields. Now, these two fields together allow you to identify the general category of the audio data, whether it be linear PCM or AAC or MP3, Apple Lossless, what have you. Now, the other fields, the bytes per frame, the frames per packet, and the bytes per packet fields, are there to describe different ways of breaking down the data stream into component pieces. And we'll talk a little bit more about where that's important in a few minutes.
Now, before we go on and look at a few ASBDs, I just wanted to talk a little bit about how the structure works. Now, for any given format, not necessarily all the fields in the ASBD are going to be used. For some formats, a field just doesn't make any sense, or you just can't know that information about the format. Now, in those cases where you have a field in an ASBD that's not needed for the format, you have to set that field to zero. You have to be sure you do that, or otherwise you're going to get the ASBD rejected by various API calls. Now, I want to talk a little bit about some terminology we use in the Core Audio world, particularly these three words here: sample, frame, and packet. Now, these terms are very overloaded in the computer science world as well as in the signal processing world. But when we use them within the Core Audio context, we mean something very specific. A sample is a single data point for a given channel in an audio stream.
A frame is a collection of samples that are time coincident. You can think of it as kind of the left and right stereo pair of a stereo signal. And then you have packets. Now, packets are important here, but for now it's sufficient to define a packet as a collection of frames that go together for some purpose. It's also important to know that the frames in a packet are all contiguous.
Here we see an AudioStreamBasicDescription filled out for two channels of data in the I/O canonical sample format at a 44.1 kHz sample rate. As you walk through the fields here, you can see that we've filled out each field appropriately for this format. Starting with the format ID and format flags fields, you can see that we've set them to linear PCM, and we're using kAudioFormatFlagsCanonical to set the flags. We've also set the sample rate, and we've set the bits per channel by using the AudioSampleType typedef; we've taken the size of that and multiplied by eight to get the number of bits for each channel. Then for the other fields, we multiply that size by the number of channels, which gives us two times that size for the bytes per frame, and the same for the bytes per packet. Now, it's interesting to note with linear PCM that the number of frames in each packet is always going to be one.
And when we talk about packets a little bit more later, you'll understand why that is. Now, another interesting thing to note about this structure is that it is also platform agnostic. This structure is filled out correctly whether you're on the desktop or the iPhone. Now, the way we did that is by making sure you're using the AudioSampleType data type and the canonical format flags. Those data types will change in size depending on what platform you're compiling for.
Now, here we see the audio unit canonical sample format, similarly with a two-channel 44.1 kHz stream. Now, going through this, the main differences to notice are that we're using kAudioFormatFlagsAudioUnitCanonical to describe the format flags, and we're using AudioUnitSampleType to describe the size of the sample. Now, one thing to note about the audio unit canonical type is that this format is what we call a non-interleaved format. What this means is that each channel is represented by its own buffer in the processing chain. As such, you have to be sure you fill out the ASBD to reflect this. So you can see here we filled out the bytes per frame as one times the size of AudioUnitSampleType, whereas in the previous ASBD we did two times. And that's because of the difference between interleaved versus non-interleaved formats.
Now, here we see a slightly more complicated ASBD. It represents a two-channel format that has 24-bit samples that have been aligned high in 32 bits of a big-endian word. Now, that sounds like a mouthful, and it's kind of a complicated format, but you can see here, as you go through the different fields, that we can represent the sample type pretty much exactly the way you would expect. Now, the key thing to notice here is how we filled out the bytes per frame field. Here, we use the value of 8, which is 2, the number of channels, times 4 bytes, because each sample is actually 4 bytes wide: it's 24 actual bits carried in a 32-bit word.
And finally, we see an ASBD here that describes the AAC format. Now, the key thing to note here is that the AAC format does not actually allow you to fill out a lot of these fields, because they're not really relevant or even knowable about the AAC data stream. For example, the bits per channel. You can't really talk about the bits per channel of an encoded format. So in this case, we've set it to zero. And the fields that otherwise depend on that are also set to zero, such as the bytes per frame and others. One key field that you will always see filled out for AAC, however, is the frames per packet field, so that you know how many frames are in each packet.
Now, in addition to an ASBD, some formats require an extra blob of data to fully describe the content of a data stream. In Core Audio, we refer to this blob as a magic cookie. And magic cookies have to accompany the data stream that they go with. They can't really be separated. You get the magic cookie when you first create the stream using an encoder, and you absolutely have to be sure you pass it along to the decoder when it comes time to decode the data. Now, as you might imagine, the magic cookie has some fairly interesting information in it, and it's tempting to parse what's inside. But you just have to remember that it is a black box. You really shouldn't do that. Instead, you should use the various Core Audio APIs, such as the audio format API, to get access to the information contained in the magic cookie.
Now, we talked a little bit earlier about packets, and before, we defined a packet as a collection of sample frames that go together for some purpose. Now, going forward, I want to strengthen that definition a little bit and say that a packet is actually the smallest indivisible unit of a given data format.
Now, you might remember I mentioned earlier that the number of frames in each packet of linear PCM is always one. When you think of that in terms of this new definition of a packet, it makes some sense, because the smallest logical block you can break a linear PCM stream down into and still have it make sense is a frame. And that's why a packet of linear PCM is exactly one frame of data.
Now, there are three general kinds of packetization of various data formats that you're going to run into. The first is the simplest, constant bitrate. What this means is that each packet of the format has the same size in terms of both the number of bytes in the packet and the number of sample frames.
And examples of a constant bitrate format are linear PCM and IMA. And the great thing about constant bitrate formats is you always know how to break the stream apart into pieces and manipulate it. You don't need to have any external framing information to know that information, to know how to do that.
The next form of packetization you get is variable bitrate packetization. With variable bitrate, or VBR, as you'll hear me say, the packets have the same number of frames in them, but each packet might vary in size in terms of bytes from packet to packet. Now, that gives you a lot of flexibility in terms of managing the bit rate of an encoded format. And because of that, you're going to find that pretty much most of the encoded formats you run into these days are of a variable bitrate nature, such as AAC, Apple Lossless, MP3, et cetera.
Now, the final packetization I want to mention to you is the variable frame rate packetization. Now, this is kind of the most general packetization type that you can get because each packet can have both a different size in terms of the frames in it as well as the number of bytes in the packet. And as you might imagine, it can become tricky to deal with a packetization of a stream like that.
In the Core Audio world, we represent a packet using the AudioStreamPacketDescription structure. Now, one thing to note is that you're really unlikely to ever run into just one audio stream packet description. You're most likely to see packet descriptions ganged up into arrays, and they are used to describe another buffer of memory that contains audio data. Now, packet descriptions are absolutely required when you're dealing with variable bitrate and variable frame rate formats. Every time you run into an API and you're dealing with that kind of format, you're going to need to be filling out and passing along packet descriptions.
Now, at the bottom of the slide, you can see how a packet description is declared. It has a field that represents an offset into the buffer of audio data where the packet starts, and it has a field that represents how long the packet is in terms of bytes. Now, that middle field is a little tricky. In a variable bitrate format, the number of frames in each packet is the same, so you can get that information from the ASBD for the format. As such, you're going to find that this field, the variable frames in packet field, is always going to be set to zero for VBR data. However, in the case of variable frame rate data, you're going to find that this field is going to contain the number of frames in that specific packet in that part of the data stream.
Now, as you might imagine, ASBDs are kind of important. They're also kind of complicated to fill out. The good news is that there are lots of ways to fill them out. Probably the easiest way to get an ASBD filled out is to just let the Core Audio APIs do it. Every Core Audio API uses an ASBD to represent and describe the format of its data. So consequently, you're going to find that you're going to be getting ASBDs from the APIs and giving them back all the time. So you might as well just use the ones the API hands out to you.
We also provide the audio format API to help you out with filling in ASBDs. The audio format API is a property-based API. In fact, all the API calls in the audio format API are related to property manipulation. Among the things the audio format API is there for is to tell you what formats are installed in the system in terms of encoders and decoders, as well as to answer various questions about how those formats are working on the system at the time. One of the interesting things the audio format API can do for you is that, given just the format ID and the magic cookie, you can have it fill out an ASBD for you.
Now, the final mechanism I want to talk about in terms of filling out ASBDs is to just use the Core Audio SDK. You'll find a class in that SDK, in our public utility section, called CAStreamBasicDescription. It is a big, giant raft of information about formats and filling out ASBDs. Even though the code is C++, and a lot of the Objective-C programmers may be wary of using it, you should still take a look and understand what this code is doing, so that you can at least get the knowledge transferred out of that code and into your head so that you can use it. Now that we've looked at some of the general attributes of the Core Audio API, and we've looked a little bit at ASBDs because they're really important, I want to talk a little bit about some of the usage cases and use those as a stepping stone to introduce you to lots of the actual Core Audio APIs you're going to use in order to accomplish them.
So the first case I want to start with is converting audio data from one format to another. Now, one thing to be aware of: the audio converter is only supported on the desktop. It is not supported on the phone. So everything I'm going to talk about in terms of the audio converter for now does not apply to the iPhone.
Now, in order to use a converter, you have to actually make one, and that's what the AudioConverterNew function is for. In order to call AudioConverterNew, you need to have properly filled out ASBDs for the input format and the output format of the conversion. Now, one thing to note is that one or both of the input or output formats has to be linear PCM. In other words, you can go from linear PCM to another linear PCM format. Or you can go from linear PCM to an encoded format. Or you can go from an encoded format back to a linear PCM format. The audio converter does not do transcoding between encoded formats.
Now, once you've created your audio converter, there are usually a bunch of other properties you're going to want to set on it to control the conversion. This is also the time and place where you will tell the audio converter about the magic cookie for your input stream, and you will also use this as an opportunity to set up various settings on the encoder, such as the quality of the encoding or the bit rate to use, etc. Now, after you have an audio converter, you have to be able to move the data through it in order to do the transformation. The audio converter provides two basic functions for you to do this. The first is a very specialized function, AudioConverterConvertBuffer. This function is there strictly for converting between linear PCM formats, and only then if you do not have a sample rate conversion involved.
Now, we also provide the AudioConverterFillComplexBuffer function. This function can do linear PCM to linear PCM conversions, and it can also do it with rate conversion. In fact, this is the function you're going to call to convert pretty much any kind of format to another.
Now, we were talking a little bit earlier about format agnosticism. And if you're following those practices, you probably won't even use AudioConverterConvertBuffer, and will just concentrate your code on AudioConverterFillComplexBuffer, given that it's the most flexible way to move the data through the converter.
So you can pull data out of the converter using the appropriate call. Now you need to be able to get data into the converter in order for it to be converted. You do this by implementing an input data callback, and you pass the input data callback to the converter when you make your call to AudioConverterFillComplexBuffer. Now, what the converter does with this function is a couple of things. First, this is the function that gets called whenever the converter needs input data for it to do the conversion. Now, in the callback, what you provide back to the converter is just pointers to your data. You don't copy your data. In fact, the converter has been specifically plumbed to eliminate as many extra copies of data as it can for performance reasons. Now, the key thing about these pointers that you return is that they have to remain valid. In other words, they have to continue to point at good memory until your input callback is called again by the converter. Now, the other big job you have to do in your input callback, as I said earlier, is that if you're dealing with a variable bitrate or variable frame rate format, you have to be sure you provide the packet descriptions for the buffers that you're providing to the converter.
So you can get data out, you can push data in. How do you know when you're done? Basically, there are two end-of-stream modes that you need to know about. The first is the one where you just have no more data: you're really at the end of the stream, and you're never going to have any more input data. The way you signal that to the converter is that you return zero packets of data from your input proc, along with no error. Your input proc returns an error code, and in this case you should return zero, meaning no error. This tells the converter that you're done and you don't have any more input data to give it, so the converter will know not to call your input data proc anymore.
Now, the other situation is kind of like the first situation, only it's a momentary situation. Let's say you're decoding data coming over the network, and you're in the middle of downloading, and the packets you need for the next part of the conversion haven't been downloaded yet. You can signal this situation to the converter, again, by returning no packets of data, but also returning an error code. Now, it doesn't really matter what the error code is, but you should keep track of it, because what's going to happen is you're going to return that error code to the converter, the converter is going to note that and see that you mean for this to be the end of the stream for now, and then it will return that error back to you through the return value of the call to AudioConverterFillComplexBuffer.
So then after you're all done, there are a couple of other things you need to know. If you're going to use the converter to do another conversion, let's say, for example, you're playing through a data stream and you're seeking around randomly in the stream, as you finish each segment, you're going to need to call the AudioConverterReset function before you start on the next segment. Now, what the AudioConverterReset function does is tell the converter that you're done with a given segment, to return the converter to its ground state, and to clear out any cached data or any other leftovers. Now, the audio converter has a plugin API called audio codecs. And audio codecs are also used to plug into the audio format API, because codecs are expected to be the experts on the data format that they represent, and as such are going to be used by the rest of the system whenever you ask questions about a given format. Now, codecs come in two basic flavors. You have encoders that take in linear PCM data and output the encoded format. And then you have decoders that take in the encoded format and produce linear PCM.
So in your application, there are a lot of factors that go into deciding on what kind of audio data to use. The first ones are the obvious ones, bit rate and quality. For your application, you may or may not have bandwidth limitations on how much space you have to fit your data into. And if you do, you're going to be very concerned about the bit rate, how much space the encoding will take. And most encoders will have control throttles on bit rate.
Now, the other things you're gonna wanna be concerned about are how much does it cost to encode in terms of CPU, and decode for that matter. And that's gonna need to play into the logic in your application. For example, you might be very concerned about the performance of the decoder if you're trying to decode MP3 and do a bunch of 3D graphics or something like that at a high frame rate in a game or something like that.
Now, another factor that goes into codec choice is something that doesn't come up in a lot of people's thinking too often, and that's the data latency. Now, what we mean by data latency is the amount of delay between putting the first samples into the encoder and the encoder actually getting to the beginning of your encoded data; in other words, how many zeros the encoder has to spit out first.
And this can vary widely from codec to codec. And if you're in a real-time situation, such as like iChat, you might look at using something like AAC low delay, which is a codec that's been specifically optimized to reduce data latency. The final factor you're probably going to want to consider is, you know, what platform are you on and what encoders and decoders does that platform have? The iPhone has some peculiar limitations, including the fact that it has a limited set of decoders available on it. And the other interesting aspect of it is the iPhone only allows you to decode a single instance of AAC, MP3, or Apple lossless at a time. That's not one of each at a time. That's one of any of the three at a time. So you have to keep that in mind when you're developing your iPhone application.
Now, here's a little chart I put together that compares and contrasts several common formats according to the topics I just talked about. I threw 16-bit linear PCM up there just to give you a baseline to compare against. Now, just a word about the encoding cost and decoding cost per channel: those are desktop numbers on a fairly hefty CPU. So you want to take those numbers as a general way to compare and contrast. Don't take them as written in stone for the CPU that you happen to be working on.
One interesting format I want to call out here is iLBC. This is a new codec that we've just added with QuickTime 7.5, which just shipped yesterday. iLBC is a speech codec, and as such, it's geared mostly toward very low bandwidth situations. Among the things that iLBC is really good at is dealing with lossy packet situations, such as in a network environment, and reconstructing the data from what you have.
Now I want to talk a little bit about reading and writing audio data to a file. The aptly named audio file API provides the abstraction that you're going to use for reading and writing files. The audio file API provides a set of global properties that describe the capabilities of the system. They have much in common with the audio format API, except that they deal specifically with the file types that are supported. It'll tell you what the readable file types are. It'll tell you what the writable file types are. And then for each of those types, it'll tell you what kind of audio data you can stick in each of those files.
In the audio file API, files are specified using a CFURL that points into the file system. And they have to be file system URLs. You can't pass an HTTP URL or anything like that. And when you're creating a new file, you need to have the audio file type ID of the file you want to create. This tells the audio file API specifically whether you want a WAV file or an MPEG-4 file or an AIFC file or what have you. You also need to have the ASBD of the format of the data you're going to put in the file. And then beyond that, other information about the file is going to be set on the file after you've created it, using the audio file property API.
Existing files can be opened and just like every other API, the audio file API has properties in order to access the information about that file. That's how you're going to find the ASBD and Magic Cookie for the file, how you're going to find the channel layout of the file, as well as the packet descriptions, et cetera. And then there's also properties for metadata such as markers, regions, lyrics, album covers, pretty much whatever you can stick in the file.
One word of caution: when you're using some kinds of audio files, you have to be careful about what sort of information you query of the file in order to avoid taking some rather large performance hits. For example, files like MPEG-1 files or ADTS or AC3 files are self-packetized files. In other words, the packetization is embedded in the file. There's no external framing information in those files. So in order to discover all the information about all the packets in the file, you have to go through and pretty much parse every packet in the file. And as you might imagine, for longer files, that can take a significant amount of time when all you want to do is open the file up and ask how long it is. So what we provide in the audio file API are some less time-consuming alternatives that give you some estimations as to some of these important quantities you might want to know. For example, there's a property to get the estimated upper bound on the packet size, which is how you might want to govern your buffer allocation when reading from the file. And then we also have a property for returning an estimated duration, which does not need to parse the whole file to get to it.
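The property names here are real (kAudioFilePropertyEstimatedDuration and kAudioFilePropertyPacketSizeUpperBound); the arithmetic behind that kind of estimate can be sketched portably. For a self-packetized file, you can approximate the duration from the data size and an average bit rate rather than parsing every packet. This is an illustrative calculation, not the audio file API's actual implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate a duration without walking the packets: total audio bytes
   times 8 bits per byte, divided by an average bit rate in bits/second.
   A sketch of the idea, not Core Audio's internal algorithm. */
double EstimateDurationSeconds(uint64_t audioDataBytes, double averageBitsPerSecond)
{
    if (averageBitsPerSecond <= 0.0) return 0.0;
    return (double)audioDataBytes * 8.0 / averageBitsPerSecond;
}
```

One megabyte of 128 kbps data comes out to roughly 65.5 seconds, and the estimate costs nothing no matter how long the file is; that's the trade the estimated-duration property is making for you.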
So reading and writing in the audio file API are pretty much mirror images of each other in terms of the operation. Both calls will block until they're complete. And both calls can either go through the file system cache or not, depending on what you want. Now, there are two styles of reads and writes in the audio file API, one that goes by bytes and one that goes by packets. You can use the one that uses bytes, but in keeping with the format agnosticism theme, you really want to do your IO in terms of whole packets. For instance, it's really the only way to reasonably handle variable bitrate data. And it also makes it much easier to manipulate time, because packets represent a unit of time as well as being an organizational concept.
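Here is a portable sketch of what packet-oriented reading looks like, against an in-memory mock of a VBR "file". The packet-description struct mirrors Core Audio's AudioStreamPacketDescription; the real call with this shape is AudioFileReadPacketData. The point is that you always move whole packets, never arbitrary byte ranges, which is what makes VBR data and time arithmetic tractable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors AudioStreamPacketDescription: where a packet lives in the
   file and how big it is. For VBR data, sizes differ packet to packet. */
typedef struct {
    int64_t  mStartOffset;
    uint32_t mVariableFramesInPacket;
    uint32_t mDataByteSize;
} PacketDesc;

/* Copy up to *ioNumPackets whole packets starting at startingPacket into
   outBuffer, stopping early if the buffer fills. Returns bytes copied and
   updates *ioNumPackets with the count actually read. */
uint32_t ReadPackets(const uint8_t *fileData, const PacketDesc *descs,
                     uint32_t totalPackets, int64_t startingPacket,
                     uint32_t *ioNumPackets, uint8_t *outBuffer, uint32_t bufferSize)
{
    uint32_t copied = 0, n = 0;
    for (uint32_t i = 0; i < *ioNumPackets; i++) {
        int64_t p = startingPacket + i;
        if (p >= (int64_t)totalPackets) break;
        if (copied + descs[p].mDataByteSize > bufferSize) break;
        memcpy(outBuffer + copied, fileData + descs[p].mStartOffset,
               descs[p].mDataByteSize);
        copied += descs[p].mDataByteSize;
        n++;
    }
    *ioNumPackets = n;
    return copied;
}
```

Because each packet represents a fixed number of frames for most formats, packet index times frames per packet divided by the sample rate gives you the stream time, which is exactly why packet-based IO makes time manipulation easy.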
Now, another operation the audio file API provides is a means to optimize the file. And so when you tell the audio file API to optimize a file, it'll go through and re-lay the file out on disk to optimize access performance, so that the data is offset into the file at an appropriate place and you don't take cache misses in order to read it, and stuff like that.
So we also provide a high-level API called the Extended Audio File API. Now, this API, unlike the regular audio file API, is only available on the desktop. And what the extended audio file does is it unites an audio file with an audio converter. And what this does is it allows you to think about the operations on the file as if you were dealing purely with linear PCM, even if the file was encoded in something like AAC or lossless or what have you. The extended audio file has analogous routines to the regular audio file API for reading and writing to both existing files and creating new files.
And then just like the audio file API, once you've created or opened a file with the extended audio file API, you're going to use properties in order to get more information about it. And in particular, there's one property you need to set in order to tell the audio file what format of data you want to give it, or in the case of reading, what format of data you want it to give you.
And that property is really important, or else things aren't going to work quite the way you expect. Now as I said, the point of the extended audio file is to make it so that you are conceptually thinking about the data in the file as if it was linear PCM. As such, all the operations in the API are handled in terms of sample frames. The reads and writes are also very analogous to the audio file API. The difference is that while reads always block, the writes in the extended audio file come in two basic flavors. You have the blocking flavor, which is pretty much the same as what you get with the audio file API. But you also have a non-blocking form. Now, the interesting thing about the non-blocking form is that it will offload the work of writing to the audio file to a separate worker thread on your behalf. The cool thing there is that this is safe to use in an IOProc context. And an IOProc context, which we'll talk a little bit more about in a few minutes, is a place where you're up against a real-time deadline, and doing things like blocking is going to get in the way. So now I want to talk a little bit about applying effects to some audio data.
Now to do that, you're going to be talking about audio units. Audio units are our plug-in API that encapsulates signal processing operations. Now audio units are great: they can be hooked together either manually or using an AUGraph to control the connections between them. As I mentioned before, this is a processing context. So the canonical format that's used in this case is going to be the audio unit canonical format. Now, there are some exceptions to this, and those are the converter units and the output units. The interesting thing about those two categories of audio units is that they also encompass an audio converter, so that they can handle other formats as well as just the canonical format. Now, on the desktop, audio units also provide a GUI component that your app can use to display to the user to allow the user to interact with the parameters of the audio unit.
Parameters are exactly what you think: they're much like properties, except that they represent the controllable portions of the signal processing algorithm. For example, if you're talking about an audio unit that implements an EQ, the parameters are things like the cutoff frequency and the amplitude and the resonance. Parameters, unlike properties, are always the same type of value, and that value is a single 32-bit floating point number. However, the range and meaning of this value is going to be different from parameter to parameter. And the audio unit can fill out an audio unit parameter info structure for you in order to describe what the valid range is, what the units of the parameter are, et cetera.
Audio units also allow you to change parameters by scheduling them in the future, as well as being able to ramp them from one value to another value continuously. And these two operations are really important for applications that are trying to automate their signal processing, such as in a digital audio workstation environment.
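The ramping behavior can be illustrated with the arithmetic a unit would apply across a render cycle. This mirrors the idea behind scheduled and ramped parameter changes, not any specific Core Audio signature: the parameter moves linearly from its start value to its end value over a span of sample frames, which is what automation in a digital audio workstation relies on.

```c
#include <assert.h>

/* The value a ramped parameter takes at a given frame while moving
   linearly from startValue to endValue over durationInFrames frames.
   Past the end of the ramp, the parameter holds its final value. */
float RampedValue(float startValue, float endValue,
                  unsigned durationInFrames, unsigned frame)
{
    if (frame >= durationInFrames) return endValue;
    float t = (float)frame / (float)durationInFrames;
    return startValue + (endValue - startValue) * t;
}
```

A fade from 0 to 1 over 100 frames sits at exactly 0.5 at frame 50; evaluating this per frame (or per small slice) is what makes ramped automation click-free, instead of jumping the value once per render call.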
Audio units organize all their I/O into buses. And each audio unit will always have some number of input buses and some number of output buses. And each individual bus has some number of channels in its stream and may or may not have an audio channel layout that describes the usage of the channels in that bus in surround contexts. Now, when you're dealing with the property API and you're talking about a bus-related property, you have to make sure that you address each bus using its own element.
So in order to use an audio unit, the first thing you have to do is locate it and find the one that you're interested in. Now, one thing we'll say up front is the Component Manager has been deprecated on the desktop. And it's not available at all on the iPhone. So in order to use the various audio components on the system, such as audio units and audio codecs, you need to use the new API in the Audio Unit framework called AudioComponent.h. The AudioComponent API is basically a one-for-one replacement for the Component Manager. For example, you're going to use AudioComponentFindNext instead of FindNextComponent. You're going to use AudioComponentInstanceNew instead of OpenComponent. Now, you have to be sure that as a host you don't try to mix and match audio components with Component Manager components. They are not interchangeable at the host level, even though underneath they may still be implemented as the same underlying API object. Now, don't worry, even though the Component Manager is deprecated, your existing Component Manager code is still going to continue to work. But we would really advise you to start moving away from using the Component Manager for audio unit discovery.
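As an aside on how that discovery works: AudioComponentFindNext matches against an AudioComponentDescription, where a zero in any field acts as a wildcard. Here is a portable mock of that matching logic; the struct and search function are illustrative stand-ins, not the real API, but the wildcard semantics are the same.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the shape of AudioComponentDescription: four-char codes for
   type, subtype, and manufacturer. Zero means "match anything". */
typedef struct {
    uint32_t componentType;
    uint32_t componentSubType;
    uint32_t componentManufacturer;
} ACDesc;

int Matches(const ACDesc *want, const ACDesc *have)
{
    return (want->componentType == 0         || want->componentType == have->componentType)
        && (want->componentSubType == 0      || want->componentSubType == have->componentSubType)
        && (want->componentManufacturer == 0 || want->componentManufacturer == have->componentManufacturer);
}

/* Iterate a registry the way repeated AudioComponentFindNext calls do:
   resume the search after the previously found index. Returns -1 when
   there are no more matches. */
int FindNext(const ACDesc *registry, int count, int afterIndex, const ACDesc *want)
{
    for (int i = afterIndex + 1; i < count; i++)
        if (Matches(want, &registry[i])) return i;
    return -1;
}
```

Searching for {type: 'aufx', subtype: 0, manufacturer: 0} is how a host enumerates every effect unit on the system, calling the find function repeatedly until it returns no match.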
So once you've found your audio unit, you configure the audio unit by, guess what, using properties. And this is the time where you're going to set the sample rate of the audio unit, as well as the maximum number of frames that you're ever going to call the audio unit to render in one single call.
Now, both properties are important. You know, the sample rate for the obvious reason. But you have to be sure you set the max frames on the audio unit, because if you don't, you might get an unexpected error down the road when you're on the real-time thread trying to render, because you're asking for more data than the audio unit is prepared to render. Now, you can then set up other properties for the audio unit, such as algorithm-specific attributes like reverb time or EQ cutoff frequency or whatever. And this is also the time at which you're going to set up the audio unit's connections to other audio units. Audio units support several different flavors of callback, and you're going to want to install your callbacks at initialization time as well.
So once you've finished configuring your audio unit, you then call AudioUnitInitialize to tell the audio unit that you're done configuring it and you want it to get ready to process some audio data. This is the signal to the audio unit that it can go ahead and allocate large tables, delay lines, load impulse response files, and do other things that might take up a lot of space or otherwise take some time to accomplish, things that need to be done before you start rendering.
And then in order to tell the audio unit to render, you just call AudioUnitRender. And when you call it, you have to pass in the timestamp that indicates the relative stream position that you want the audio unit to render for. You also have to pass in the number of frames you want the audio unit to render. And again, be careful: you have to make sure you don't pass a number of frames larger than the max frames that you set in the initialization sequence. And then you also have to pass in an audio buffer list in order to receive the rendered buffers from the audio unit. Now, the buffers that you pass in the audio buffer list can come two ways. You can either pass pointers to actual blocks of memory, in which case you're telling the audio unit that you want it to render its output into the memory that you provided.
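One consequence of that max-frames rule is that a host rendering a long stretch of audio has to slice its AudioUnitRender calls so no single call exceeds the configured kAudioUnitProperty_MaximumFramesPerSlice. Here is a portable sketch of that slicing loop, with a function pointer standing in for the actual render call; the names are illustrative.

```c
#include <assert.h>

/* Stand-in for a render call: render numFrames starting at startFrame,
   returning 0 on success (as AudioUnitRender returns noErr). */
typedef int (*RenderFn)(unsigned startFrame, unsigned numFrames, void *user);

/* Break totalFrames into calls of at most maxFramesPerSlice frames,
   which is what a host must do to honor the max-frames property. */
int RenderInSlices(unsigned totalFrames, unsigned maxFramesPerSlice,
                   RenderFn render, void *user)
{
    unsigned pos = 0;
    while (pos < totalFrames) {
        unsigned n = totalFrames - pos;
        if (n > maxFramesPerSlice) n = maxFramesPerSlice;
        int err = render(pos, n, user);
        if (err) return err;
        pos += n;
    }
    return 0;
}

/* Example render proc for the usage below: just counts frames delivered. */
int CountFrames(unsigned startFrame, unsigned numFrames, void *user)
{
    (void)startFrame;
    *(unsigned *)user += numFrames;
    return 0;
}
```

Asking for 4196 frames with a 4096-frame maximum produces two calls, one full slice and one short one, and every frame is still accounted for exactly once.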
Or you can pass null pointers. And this is telling the audio unit that you want it to render into its own internal buffers and to provide you with pointers into them. Now, this is useful for optimizing the data flow and memory copies in your overall chain of audio units. Then after you're done rendering, much like the audio converter, there's an AudioUnitReset call that will return the audio unit to its ground state. And this will also kill things like reverb tails, stop delays, clear out FFT buffers, what have you.
So that's how you get the data out. To get data into the audio unit, you basically have two choices. You can either get the data from another audio unit, in which case you can make a manual connection in the setup phase of your audio unit using kAudioUnitProperty_MakeConnection to directly connect the output bus of one audio unit to the input bus of another. Or you can use the AUGraph API. The AUGraph API's whole reason for being is to provide for the management of the connections in what can be a very complicated graph of connections.
Now, the other way you can get audio into your audio unit is to use a render callback. You install a render callback using kAudioUnitProperty_SetRenderCallback. And much like the input data callback for the audio converter that we talked about earlier, this is the function that the audio unit will then call whenever it needs data for that bus. Now, you have to be careful. These two methods, whether you use a connection to another audio unit or a callback, are mutually exclusive. You can only do one or the other on each bus of the audio unit. But you can mix and match on a per-bus basis.
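The pull model a render callback implies can be sketched portably. The names below are illustrative, not the real AURenderCallback signature, but the flow is the same: when the unit needs input for a bus, it invokes the function you installed, then processes whatever came back.

```c
#include <assert.h>

/* Stand-in for a render callback: fill buffer with frames samples,
   return 0 on success. The real type is AURenderCallback. */
typedef int (*InputCallback)(float *buffer, unsigned frames, void *user);

/* A mock effect unit: it pulls its input through the installed callback,
   then applies its processing (here, just a gain). */
typedef struct {
    InputCallback inputProc;
    void         *inputProcRefCon;
    float         gain;
} MockEffectUnit;

int MockRender(MockEffectUnit *unit, float *out, unsigned frames)
{
    /* Pull: the unit asks you for input, rather than you pushing it. */
    int err = unit->inputProc(out, frames, unit->inputProcRefCon);
    if (err) return err;
    for (unsigned i = 0; i < frames; i++) out[i] *= unit->gain;
    return 0;
}

/* Example input proc for the usage below: a constant full-scale signal. */
int FillOnes(float *buffer, unsigned frames, void *user)
{
    (void)user;
    for (unsigned i = 0; i < frames; i++) buffer[i] = 1.0f;
    return 0;
}
```

This pull structure is why the callback runs on the render thread: the request for input originates from the output side of the chain and propagates backward, one bus at a time.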
So we ship a lot of audio units in the system, of a lot of different kinds, starting with effects and music effects. These are your delays, EQs, high-pass and low-pass filters, dynamics processing, what have you. The difference between a music effect and an effect is that music effects can be controlled using MIDI. And we have panner units. Panner units are there for applying spatialization and other kinds of sound positioning algorithms to a sound. We have format converters, which I mentioned earlier.
Format converters, in addition to being able to accept very non-canonical audio formats, can also be flexible in the way they pull for their input data. As such, this category is also where you find audio units that do things like pitch shifting, time compression and expansion, and other kinds of algorithms that manipulate time.
We also have generators and music devices. These are sources of new audio in an audio graph. And music devices in particular represent things like software synthesizers. Generators can also represent other things like file players or the scheduled sound player audio unit, which we ship on the desktop.
And then we also have mixer units, and you're going to find that we ship several different kinds of mixers on each of the platforms, and that's always going to at least be a stereo mixer and a 3D mixer. On the desktop, we also ship a few other mixers, including the matrix mixer, which is really good for doing routing and other complex mix setups. And then finally, we have output units. Output units are the sinks in a graph. That's where the data goes in order to go out to the hardware, or to be pulled out of the graph and written to a file or what have you.
So now I want to finish up by going over a little bit about how you get your audio data out to the speaker. Now, there are probably more APIs in Core Audio for playing audio than for just about any other single task. So the first method I want to talk about is using the audio HAL. Now, the HAL is among the lowest level pieces of software we have in the stack. And as such, the HAL's job is really to be focused on providing access to the audio device on the audio device's terms. As such, you're not going to find that the HAL provides much in the way of convenience functions. It doesn't provide much in the way of utilities. And it's also very chatty. The HAL provides a lot of notifications, and applications that are talking directly to the HAL are going to be expected to sign up for and handle all these notifications appropriately.
Now, when you're dealing with mixing with the HAL, you're pretty much, as with anything else in the HAL, on your own. In your IOProc, which is what the HAL calls in order to get data from you, you also have to be very, very careful. This proc is called on a real-time thread and is up against a hard deadline. So you can't do things on that thread that are going to cause it to potentially miss that deadline, like block while you're trying to lock a lock, spin on a spin lock, or do file IO, or what have you. Now, given all the general complexity of using the HAL directly, we really don't recommend that many applications do it. We provide higher level abstractions that make it a bit easier. The other thing is that the HAL's not available to you on the iPhone, so you can't even go there on the phone.
So to step up from the HAL, we provide output audio units. Now, the only difference between an output audio unit and a regular audio unit is the addition of two transport methods, AudioOutputUnitStart and AudioOutputUnitStop. And they do what their names imply. They start and stop IO.
There are two basic flavors of output unit that you're gonna wanna deal with. The first is AUHAL, which is only available on the desktop. This audio unit is what does all the really hard and complicated work I was talking about earlier with being a proper HAL client, and it does all this so you don't have to. It's how your audio data gets to the speaker more often than not. On the phone, we don't have AUHAL, because we don't have the HAL. Instead, we have AURemoteIO. However, even though the name is a little different, AURemoteIO still functions pretty much like AUHAL does on the desktop. And it performs the same basic job, in that it's there to do all the hard work of being a proper client on the system, so that your application doesn't have to.
One other thing about output units is that they include an audio converter on their input bus. This allows you to feed an output unit with pretty much any kind of linear PCM format on the desktop and on the phone, except that on the phone we do not support floating point sample formats, so you have to pass fixed-point integer data. And then when you're doing mixing with an output unit, again, you can do it yourself by using a render callback or whatever mechanism we talked about earlier. Or you can connect in one of those mixer audio units I mentioned earlier, and then you can connect your data inputs to the inputs of the mixer. Now, one thing to bear in mind is that the HAL's IOProc rules, which I mentioned briefly earlier, still apply in your render callback when you're talking to an output unit.
Now, another mechanism you can use for getting data out is the OpenAL API. OpenAL is a high level cross platform API for doing 3D audio mixing. Now, it's mostly been optimized for the needs of game developers, but we've also found that it provides a pretty good basis for development on any kind of application.
Now, the good thing about OpenAL is that it's built on top of our 3D mixer. So the capability is there no matter which API you're going to use. Finally, I want to talk a little bit about the Audio Queue. It's one of the high-level APIs we have. And its job is mostly to take buffers in and shepherd them along until they get to the audio hardware, and to do all this with a more simplified interface than what you would get with using an output unit or the HAL directly. One interesting aspect about the queue is that the audio queue owns the buffers that it uses. So whenever you need a buffer to fill up, you're going to ask the queue for it. Now, the queue also supports scheduled playback and has, in general, just a very rich timing mechanism. So even though it's a high level API, it can still be used to perform some pretty delicate synchronization and playback operations. It's the basis of what we use in a variety of our products.
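That buffer-ownership model, where the queue allocates the buffers (AudioQueueAllocateBuffer in the real API) and hands each one back to you for refilling once it has been played (via the callback you pass when enqueuing), can be mocked portably like this. The types and function names here are illustrative stand-ins for the round-trip, not the Audio Queue API itself.

```c
#include <assert.h>
#include <stddef.h>

#define kNumBuffers 3

/* The queue owns the storage; a client only ever borrows a buffer. */
typedef struct { float data[256]; int inUse; } Buffer;
typedef struct { Buffer bufs[kNumBuffers]; } MockQueue;

/* Ask the queue for a buffer to fill. Returns NULL when every buffer is
   still out for playback, which is the client's cue to wait. */
Buffer *QueueGetFreeBuffer(MockQueue *q)
{
    for (int i = 0; i < kNumBuffers; i++)
        if (!q->bufs[i].inUse) { q->bufs[i].inUse = 1; return &q->bufs[i]; }
    return NULL;
}

/* The queue calls back after playback, returning the buffer for reuse;
   in the real API this is when your output callback refills and re-enqueues. */
void QueueBufferConsumed(MockQueue *q, Buffer *b)
{
    (void)q;
    b->inUse = 0;
}
```

Typically you prime two or three buffers up front and then just keep refilling whichever one the queue hands back; because the queue owns the memory, the client never allocates audio buffers on the playback path.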
Now, in review, I started off by talking a bit about the generalities of the Core Audio architecture, the focused, cooperating, layered APIs that make it up. We talked about the property mechanism and how that's a unifying mechanism in the API. And we also talked about being format agnostic. And then I talked a little bit through a few usage cases here that introduced you to several of the APIs that you are gonna run into when you're trying to do things.
So next up, directly after this session, you can run upstairs and learn more about the Core Audio APIs in action on the iPhone. In particular, you'll hear a lot more about the Queue, OpenAL, and a lot of the other services we provide. And then tomorrow afternoon, all the Core Audio engineers will be in Lab D, and you can come and ambush us and ask us lots of questions, get help debugging your tricky problems, what have you.
Then we have lots of resources out there. There are documents, like the Core Audio Overview. Mostly you can get at them through the Apple Developer site; you can see the URL here. You'll find tech notes, our sample code, what have you. And then there's the WWDC Attendee site, which has even more examples and documentation that haven't made it out to the public site yet. Then I want to plug two other things. I want to plug the Core Audio SDK. Even though a lot of the code is in C++, there's lots of stuff there that you can get even if you're not a C++ programmer. And finally, to get in touch with us in general, you should use the Core Audio API mailing list. All the Core Audio engineers monitor that list, and we will try to answer your questions as quickly and succinctly as we can.
Allan Schaffer is our evangelist, and you can get in touch with him for kind of higher-level developer relations problems. And then there's the URL for the documentation, sample code, and other things on, wow, that's an old site, isn't it? I bet it's still there anyway, and believe it or not, there's still useful information there.