Graphics and Imaging • 1:04:27
Professional-level audio is designed right into Mac OS X Leopard with Core Audio. Get an overview of the Core Audio architecture and learn to take advantage of its richness in your own code. Deepen your understanding of audio codecs and how to work with popular audio formats. This is an important session for audio developers who need to know the strategy and fundamental paradigms driving audio development in Mac OS X Leopard.
Speaker: Jeff Moore
Unlisted on Apple Developer site
Transcript
This transcript has potential transcription errors. We are working on an improved version.
[Jeff Moore]
( Applause )
[Jeff Moore]
This is, in fact, my presentation. I'm here to introduce you to the Core Audio architecture and some of the general concepts that have gone into its design. In particular, I want to focus in on the property mechanism that's used throughout most of the Core Audio APIs as well as to talk a little bit about a concept that we call format agnosticism, which is another thing that runs throughout our APIs.
After that, I want to introduce you to some specific Core Audio APIs by going over some simple usage cases, including converting audio from one format to another, reading and writing audio data to and from a file, applying an effect to some data, and playing audio data to hardware. Now for those of you that have taken a look at our API set, you've noticed that there are an awful lot of headers spread across several frameworks, but there is some method to that madness.
Now each individual Core Audio API is very focused on the specific task that it was designed to handle. It tries not to veer outside of the lines, so to speak, as far as dealing with its own task. But even though they are focused, all the APIs are also cooperative in that they were designed to work well with each other, so that you can basically pick and choose amongst the APIs to accomplish what you need to get done.
Now the API set is also a very layered architecture. We kind of break it down into three layers. At the lowest layer we have the APIs that deal primarily with hardware, such as the IOAudio driver family, the Audio HAL for talking to those drivers, Core MIDI, and a few others.
And then we have the middle tier of our APIs, and that's where you're going to find all the workhorse APIs in the Core Audio set, and that includes things such as the Audio Converter, the Audio File API, Audio Units, AUGraph, etcetera. Most of our APIs are going to fall into this middleware layer.
And finally we have a few high level APIs. And when we talk about high level APIs, basically we just mean an API that combines some of the lower level APIs to synthesize a solution to a particular problem. APIs like that are the Extended Audio File API, OpenAL, and the Audio Queue API.
( Laughter )
Now, in the Core Audio API set, we do have this concept called properties, and the property mechanism that we have is kind of a unification device, because properties occur throughout our API set. A property is basically a way to access and manipulate some particular piece of the state of a particular API object.
Now because the API objects more than likely have many, many properties, each property has to have an address so that you and the object know what you're talking about. Now most of the APIs use just a simple 32 bit integer selector for this address, although several of the APIs further qualify the address using scope and element selectors as well.
Now the value of a property can be pretty much any kind of data that the API has a need for. In fact, in the APIs property values are actually represented as a block of memory and a size that says how much space that memory is occupying. Now, some of the APIs also provide what we call a listener mechanism, and that's just a way to install a callback that gets called whenever the value of a particular property changes.
Now as I said, the property mechanism is a bit of a unification device in our various APIs. And as such, you're going to run into several common routines in pretty much all the APIs that use properties. Now the first routine that you're going to run into is the get property info routine. The get property info routine is used to discover whether or not a property at that address exists, how big its value is, and whether or not you can actually change the value of that property.
Now the next pair of functions you're going to run into quite a bit is the get property and set property functions. And as their name implies, these are your basic accessor methods for getting and setting the value of a property. Now in the APIs that support listeners, you will also see the add property listener and remove property listener functions. And those are there specifically to allow you to install your callback so that it can get called when a property you're interested in changes value.
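In code, the common property pattern looks roughly like this. It's a minimal sketch using an audio unit as the example object (audio units are covered later in the session); the same get-info/get/set/listener shape appears across the other property-based APIs.

```c
#include <AudioUnit/AudioUnit.h>

// A sketch of the common property pattern, using an audio unit as the example
// object. The address here is a property ID plus a scope and element.
static void InspectAndChangeStreamFormat(AudioUnit unit)
{
    UInt32  size     = 0;
    Boolean writable = 0;

    // Get property info: does it exist, how big is its value, can we change it?
    if (AudioUnitGetPropertyInfo(unit, kAudioUnitProperty_StreamFormat,
                                 kAudioUnitScope_Output, 0,
                                 &size, &writable) != noErr)
        return;

    // Get property: the value comes back as a block of memory of that size.
    AudioStreamBasicDescription format;
    size = sizeof(format);
    if (AudioUnitGetProperty(unit, kAudioUnitProperty_StreamFormat,
                             kAudioUnitScope_Output, 0, &format, &size) != noErr)
        return;

    // Set property: write a changed value back, if the property is writable.
    if (writable) {
        format.mSampleRate = 48000.0;
        AudioUnitSetProperty(unit, kAudioUnitProperty_StreamFormat,
                             kAudioUnitScope_Output, 0, &format, sizeof(format));
    }
}

// Listener: APIs that support listeners call you back when a value changes.
static void MyFormatListener(void *inRefCon, AudioUnit inUnit,
                             AudioUnitPropertyID inID,
                             AudioUnitScope inScope, AudioUnitElement inElement)
{
    // React to the change here; for example, re-fetch the property value.
}
// Installed with:
//   AudioUnitAddPropertyListener(unit, kAudioUnitProperty_StreamFormat,
//                                MyFormatListener, NULL);
```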
Now I'd like to talk a little bit about another concept that we have in the design of Core Audio, and that's format agnosticism. By format agnosticism, all we really mean is that you write your code in such a way that you're prepared to handle pretty much any kind of data that the system might throw at you, even data formats that weren't on the system when you wrote that code.
And all of the Core Audio APIs are designed with format agnosticism in mind. They will go out of their way to help you out in this regard. And the reason why you want to do all of this is because, as I said, you really want to be able to deal with any kind of data your users might want to throw at you. So it's best to kind of keep this in mind when you're writing your code.
Now even though we try to be agnostic about what formats we're using, the Core Audio APIs do designate linear PCM with 32 bit floating point samples as the canonical sample format. Now the canonical sample format's job is really twofold. First, it's going to be the default format you get in absence of any other specified preference for a format. In other words, whenever you bring up a new audio unit, you're going to get the canonical sample format, for example.
The other big job that the canonical sample format provides is to be the intermediate format when doing conversions. In other words, if you're trying to transform data in format A into format B and there isn't a direct transformation, the transformation is going to go from format A into the canonical format and then from the canonical format into format B. The canonical format is always chosen to be big enough to hold pretty much any kind of data without losing any precision.
Now as I said, format agnosticism tries to know as little about the data as possible, but there is still a small amount of information that you're going to need to know in order to properly handle formatted data. And in the Core Audio APIs we've kind of boiled down a good portion of this information into a structure called the AudioStreamBasicDescription. Now AudioStreamBasicDescription is a bit of a mouthful, and quite often you're going to see it written, and in my talk you're going to hear me use, the abbreviation ASBD. I really just mean an AudioStreamBasicDescription when I say that.
Now here you can see the declaration of an audio stream basic description and there are quite a few fields in here. Some of their usage should be obvious, such as the sample rate field. But some of them are a little less obvious. Now the format ID and the format flags field are there so that you can identify the general category of audio data you're dealing with, such as AAC or Linear PCM or MP3 or what have you.
The other fields that should be pretty obvious are like the channels per frame field. That just basically tells you how many channels of audio are in this stream of data. And then you have the bits per channel field, the bytes per frame fields, the frames per packet and bytes per packet field, and these are just fields that are used to describe how the audio data is broken down in the stream in memory.
Now there are a lot of fields in an ASBD and not every format actually needs to use all those fields in order to describe itself. Now in such a case, you will find that the fields that are unknown or that are not applicable will be set to zero.
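For reference, the declaration in CoreAudioTypes.h looks essentially like this:

```c
// From CoreAudioTypes.h (comments summarize the field descriptions above).
struct AudioStreamBasicDescription
{
    Float64 mSampleRate;        // sample frames per second of the data
    UInt32  mFormatID;          // the general category, e.g. kAudioFormatLinearPCM
    UInt32  mFormatFlags;       // format-specific flags that qualify the format ID
    UInt32  mBytesPerPacket;    // bytes in a packet of data
    UInt32  mFramesPerPacket;   // sample frames in each packet
    UInt32  mBytesPerFrame;     // bytes in a single sample frame
    UInt32  mChannelsPerFrame;  // channels in each frame
    UInt32  mBitsPerChannel;    // bits of sample data per channel
    UInt32  mReserved;          // pads the structure to an 8-byte boundary
};
typedef struct AudioStreamBasicDescription AudioStreamBasicDescription;
```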
Now before we move on and look at a few audio stream basic descriptions, I just want to take a few minutes to talk about some terminology. Now the words sample, frame, and packet are thrown about in computer science and signal processing literature and are terribly overloaded terms. But when we're talking about the Core Audio APIs, we mean something very, very specific when we say sample, frame, and packet. A sample for us is a single data point for one channel in an audio stream.
And then a frame is a collection of samples that are all coincident at the same instant in time. Now one interesting aspect of a frame is that when you're talking about relating a data format to time, the frame is really the smallest unit at which you could measure time accurately with audio data. Now the final term I want to talk a little bit about is a very important term.
And it's important enough that we're going to talk a lot more about it here in a few minutes. But I just wanted to say that a packet is, in its simplest terms, just a collection of frames that go together for some purpose. Now, like I said, we're going to talk a little bit more about packets in a few moments.
Now here you see an audio stream basic description structure filled out to represent a four channel stream in the canonical sample format at a 44.1 kHz sample rate. And as you go through, you can see each of the fields is filled out to represent what that format is. The sample rate is 44.1, the format ID and format flags are specifically set for linear PCM and the canonical sample format tags, which are, as we recall, 32 bit floating point.
And then you have the bits per channel, which is 32. And the channels per frame is four. Now when it comes to the bytes per frame, as I said, a frame is a collection of all the samples at that instant in time, so there are four channels, so that's going to be four times four, which is 16.
Now one interesting aspect I want to point out about linear PCM is that the number of frames in a packet of linear PCM is always going to be one. Now that seems kind of arbitrary, but it'll make a little bit more sense when we talk about packets later.
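In code, that ASBD comes out roughly as follows; this is a sketch of the slide's values, assuming interleaved samples as implied by the 16 bytes per frame:

```c
// Four channels of the canonical format (32-bit float linear PCM) at 44.1 kHz.
AudioStreamBasicDescription asbd = { 0 };
asbd.mSampleRate       = 44100.0;
asbd.mFormatID         = kAudioFormatLinearPCM;
asbd.mFormatFlags      = kAudioFormatFlagsNativeFloatPacked; // 32-bit native-endian float
asbd.mBitsPerChannel   = 32;
asbd.mChannelsPerFrame = 4;
asbd.mBytesPerFrame    = 4 * sizeof(Float32);   // 4 channels x 4 bytes = 16
asbd.mFramesPerPacket  = 1;                     // always 1 for linear PCM
asbd.mBytesPerPacket   = asbd.mBytesPerFrame;   // 1 frame per packet
```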
Now here you see an audio stream basic description filled out for a slightly more complicated linear PCM format. Now this one has 24 bit samples, but they're carried within a 32 bit word, and you have two channels of them at 96K. Now you can go through this and you can see how all the fields are filled out to represent pretty much exactly what I said in English here. But I do want to point out a couple of interesting things about this format.
Because you have a format that is 24 valid bits represented in a 32 bit word, you can see some little discrepancies between this ASBD and the previous ASBD I showed you. In particular, you see that the bits per channel is set to 24, but the bytes per frame is actually eight. Now this is because each sample is actually 32 bits wide, but only 24 of those bits are actually valid, and so that comes out to 4 bytes per sample, times two channels, which is eight bytes per frame.
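Sketched out, that ASBD might look like this, assuming the 24 valid bits sit high-aligned in the 32-bit word (the alignment flag on the slide may have differed):

```c
// Two channels of 24-in-32-bit integer linear PCM at 96 kHz.
AudioStreamBasicDescription hiRes = { 0 };
hiRes.mSampleRate       = 96000.0;
hiRes.mFormatID         = kAudioFormatLinearPCM;
hiRes.mFormatFlags      = kAudioFormatFlagIsSignedInteger
                        | kAudioFormatFlagsNativeEndian
                        | kAudioFormatFlagIsAlignedHigh;  // 24 valid bits in a 32-bit word
hiRes.mBitsPerChannel   = 24;                 // only the valid bits are counted here
hiRes.mChannelsPerFrame = 2;
hiRes.mBytesPerFrame    = 2 * 4;              // each sample still occupies 4 bytes => 8
hiRes.mFramesPerPacket  = 1;
hiRes.mBytesPerPacket   = hiRes.mBytesPerFrame;
```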
Now the final ASBD I wanted to show you is one here for a stereo AAC stream. Now AAC is one of those formats that doesn't require all the fields in an ASBD in order to describe itself. So as you can see, several of the fields here are set to zero to indicate this. In fact, the only fields that are really interesting to talk about with AAC are the number of channels in the channels per frame field and the frames per packet field, which tells you how much to expect in each packet of AAC data.
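A sketch of that AAC ASBD (the 44.1 kHz rate is assumed rather than taken from the slide; 1024 frames per packet is the standard AAC value):

```c
// A stereo AAC stream; the fields that don't apply stay zero.
AudioStreamBasicDescription aac = { 0 };
aac.mSampleRate       = 44100.0;
aac.mFormatID         = kAudioFormatMPEG4AAC;
aac.mChannelsPerFrame = 2;
aac.mFramesPerPacket  = 1024;  // every AAC packet decodes to 1024 sample frames
// mBytesPerPacket, mBytesPerFrame, and mBitsPerChannel remain 0: the packets
// vary in size and the individual samples aren't directly addressable.
```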
Now in addition to audio stream basic descriptions, some formats also require what we call a magic cookie. Now a magic cookie is just a little blob of data that provides some extra out of band information about the specific instance of the format of the data in the stream.
Now magic cookies, as I said, are very particular to their stream of data so they have to be paired with the stream at all times. In fact, a magic cookie comes about as part of the encoding process to create the stream of data and you absolutely have to pass the magic cookie along in order to properly decode the data.
Now as you might imagine, the magic cookie does have some interesting bits of information in it. But the contents of the magic cookie should be considered opaque and you should treat it like a black box. You shouldn't be dereferencing or trying to parse a magic cookie. Instead you should use the APIs that Core Audio provides that allow you to probe the contents of a magic cookie and extract the interesting bits of information.
Now we talked a little bit earlier about packets in passing, and at that point I defined a packet as a collection of frames that go together. I want to strengthen that definition a little bit here and define a packet to be the smallest indivisible unit of a data format. That is to say you can't really break that data format down into pieces smaller than a packet.
Now from that definition you can pretty much see why the number of frames per packet in a linear PCM format is always going to be one, because that's the smallest unit into which you can break a linear PCM stream. Now there are basically three kinds of packetizations that you're going to run into with various data formats.
Now the first is constant bit rate. In a constant bit rate format, each packet is going to have the same number of frames and it's going to occupy the same number of bytes in memory. Now the most common constant bit rate format is linear PCM. With linear PCM you always know how big a packet is going to be by the definition of the format.
Now the next packetization you're going to run into is what we call variable bit rate. Now in a variable bit rate packetization, each packet will have the same number of frames in it, but the number of bytes that the packet will occupy will vary from packet to packet.
In fact, the variable bit rate packetization scheme is flexible enough and useful enough that it's used in most, if not all, of the modern codecs that we use on the system, such as AAC, Apple Lossless, MPEG1 Layer 3, etcetera. Most of your common formats these days are, in fact, variable bit rate in nature.
Now the final packetization type I want to talk about is variable frame rate. In a variable frame rate packetization, each packet will potentially have a different number of frames and it will potentially occupy a different number of bytes in memory. Now this is a very generalized packetization scheme, and the only commonly used codec that you're going to find that uses this style of packetization is the Ogg Vorbis codec.
Now in the Core Audio APIs we use the structure audio stream packet description in order to identify a packet in memory. But it's very unusual to run into an audio stream packet description on its own. More likely you're going to run into them in the form of an array of packet descriptions that are used to describe how another buffer of memory is broken up into packets.
Now packet descriptions are absolutely required when you're dealing with a variable bit rate format or a variable frame rate format. And the reason should be obvious, when you see them in memory there's no way to know where one packet begins and the next packet ends without having a packet description that tells you this information. Now at the bottom of this slide you can see the declaration of the audio stream packet description structure. It has a field that has a byte offset for the start of the packet.
It has a field to indicate how many frames are in that packet. Now this field is going to be set to zero except for variable frame rate packetization formats, in which case it'll be set to the number of frames that are in that particular packet. And finally the audio stream packet description also has a field for how many bytes the packet occupies in memory. Now one interesting property that this definition of a packet description gives you is that it allows you to describe a buffer of memory where the audio packets might be intermingled with other data, such as ID3 tags or video frames or what have you.
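The structure itself, essentially as declared in CoreAudioTypes.h:

```c
struct AudioStreamPacketDescription
{
    SInt64 mStartOffset;             // byte offset of the packet in the buffer
    UInt32 mVariableFramesInPacket;  // 0 unless the format has a variable frame rate
    UInt32 mDataByteSize;            // how many bytes the packet occupies
};
typedef struct AudioStreamPacketDescription AudioStreamPacketDescription;
```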
Now to wrap up the first part of this talk, I'd like to talk a little bit about how you'll go about filling out an ASBD, because from what I've said so far, you're probably thinking, wow, you kind of have to know something about the format you're dealing with in order to fill out the ASBD. And in some cases that's true, but there are some techniques you can use that can help you out and kind of lift some of the burden of filling the ASBD out.
Now the first technique to use should be pretty obvious. Just use the ASBDs that the Core Audio APIs hand to you. Every Core Audio API uses an audio stream basic description to describe the data that it wants and the data it produces. So you should feel free to just ask the APIs you're interacting with to fill out your ASBDs for you.
Now in some cases you're still going to have to fill them out by hand and the APIs you're using aren't going to be able to help you much. In that case, we do have another API, the audio format API, which can help you still fill out an ASBD, provided that you have some other bits of information to help it out.
Now the audio format API is just a collection of global properties, and its purpose is to give you kind of a handle on what formats, in terms of encoders and decoders, are installed on the system, as well as other bits of information about the encoders and decoders, such as the name of the format, and a bunch of other things. But there are also properties there that will fill out an ASBD for you when all you happen to have is the format ID of the format and a magic cookie.
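A sketch of that last case, using the audio format API's format-info property to finish an ASBD from just a format ID and an optional magic cookie (the helper function name here is hypothetical):

```c
#include <string.h>
#include <AudioToolbox/AudioFormat.h>

// Hypothetical helper: fill out as much of an ASBD as possible given only a
// format ID and, if available, the stream's magic cookie.
static OSStatus FillOutASBD(UInt32 inFormatID,
                            const void *inCookie, UInt32 inCookieSize,
                            AudioStreamBasicDescription *ioASBD)
{
    memset(ioASBD, 0, sizeof(*ioASBD));
    ioASBD->mFormatID = inFormatID;            // the one thing we know for sure

    UInt32 size = sizeof(*ioASBD);
    return AudioFormatGetProperty(kAudioFormatProperty_FormatInfo,
                                  inCookieSize, inCookie,  // cookie is the specifier
                                  &size, ioASBD);          // ASBD filled out in place
}
```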
Now the final technique I wanted to mention, kind of in passing, is to use the CAStreamBasicDescription class. Now CAStreamBasicDescription is part of the Core Audio SDK. As such, you'll find it in the Developer Examples folder on your install right now. And even though it is C++ code, and even if you're uncomfortable with C++, I really advise you to take a look at the code anyway. Even if you're not going to use it directly, there's probably a lot of good information in that code that you can make use of in your own code.
Even to the point of just copying and pasting it to wherever you need it, as with all of the SDK code. It's really there to help you, even if it is in C++. I do encourage you to at least take a look at it and be familiar with how it works. Wrong way.
So that kind of wraps up the first part of the talk where I've been kind of talking in generalities about the Core Audio APIs. Here for the second part of this talk, I'd like to move more into some concrete examples of doing something and through that introduce you to several of the more commonly used Core Audio APIs.
Now the first task I want to talk about is converting data from one format to another. And the Core Audio API that you're going to use to do this is the aptly named audio converter API. In order to do a conversion, the first thing you need to do is instantiate a new audio converter.
Now in order to do that, all you do is call audio converter new, and you have to pass the input format and the output format to the audio converter new function. Now one thing to be aware of is that one or both of the input and output formats has to be linear PCM.
In other words, you can use the converter to go from an encoded format to linear PCM, or from linear PCM to an encoded format, or to convert between the various linear PCM formats. Now after you've instantiated your audio converter, all the other settings that you need to make on the converter are handled by setting properties.
Now some of the properties you're definitely going to want to set include the magic cookie, in case you're dealing with a format that requires one. And when you're doing an encoding, you're also going to want to configure the encoder for such things as what quality of encoding to use, what bit rate to run the encoder at, etcetera.
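A minimal sketch of that setup, creating a decoder and handing it the stream's magic cookie (the formats and cookie are assumed to come from wherever you got the stream, an audio file for example):

```c
#include <AudioToolbox/AudioConverter.h>

static OSStatus MakeDecoder(const AudioStreamBasicDescription *inEncodedFormat,
                            const AudioStreamBasicDescription *inPCMFormat,
                            const void *inCookie, UInt32 inCookieSize,
                            AudioConverterRef *outConverter)
{
    // One side of the conversion (here, the output) has to be linear PCM.
    OSStatus err = AudioConverterNew(inEncodedFormat, inPCMFormat, outConverter);
    if (err != noErr) return err;

    // Formats that use a magic cookie won't decode properly without it.
    if (inCookie != NULL && inCookieSize > 0)
        err = AudioConverterSetProperty(*outConverter,
                                        kAudioConverterDecompressionMagicCookie,
                                        inCookieSize, inCookie);
    return err;
}
```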
So when you have an audio converter, you need to know how to get data out of it. And the audio converter API provides two functions that you're going to want to use to do this. Now the first function is a specialist function, audio converter convert buffer. It specializes in doing linear PCM to linear PCM transformations, provided that they don't involve any sample rate conversion. This is useful in the cases where you are dealing with low level stuff and you need to massage the linear PCM format and you want to directly access the high performance conversion routines in the audio converter.
But aside from that one special case, the general routine that you're going to want to use for getting data out of an audio converter is audio converter fill complex buffer. This is the routine that can handle every kind of data that the system supports. As such, along with the concept of format agnosticism, this is really the routine you're going to want to organize your code around.
So presuming you're using audio converter fill complex buffer, you're going to need to write an input data callback. Now this is the function that the audio converter is going to call whenever it needs to get some new input data from you. Now your input data callback has two basic jobs. The first job is to provide the data for the input side of the conversion. Now the audio converter has been specifically plumbed so that it minimizes the number of copies involved in doing a particular transformation.
As such, your input data callback does not copy data. Instead it has to provide a pointer to a buffer containing the input data you want converted. Now this buffer has to remain valid for the entire duration of a call to audio converter fill complex buffer, and further it has to remain valid until the next time you call audio converter fill complex buffer. Now the other big job that your input data callback has to do is that it has to fill out the packet descriptions in the case where you're dealing with a format of data that requires them.
So you can get data into and out of an audio converter. What do you do when you're done? How do you even know when you're done? Well, there are two cases that you really ought to know about that signal the end of the stream in a bit of a different way.
Now the first way is that in your input callback, you return zero packets of data, but you also return no error from the function. Doing this will signify to the audio converter that you are out of data and this is, in fact, the logical end of the stream and that the converter should not expect you to give it any more data.
Now the second mechanism that's used to indicate kind of an end of stream situation is that you return no data again, but you return an error code from your input data callback. Now by returning the error, you're indicating to the converter that you don't have any input data to give it at this point in time, but you are not actually at the logical end of the stream, and so the converter should keep on trying to get more data from you in the future.
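Here's a sketch of an input data callback that covers both of those cases, along with the fill complex buffer call that drives it. The bookkeeping structure and its fields are hypothetical:

```c
#include <AudioToolbox/AudioConverter.h>

// Hypothetical bookkeeping for the input side of the conversion.
typedef struct {
    void                         *srcData;        // buffer of encoded packets we own
    UInt32                        srcDataSize;    // bytes currently in srcData
    UInt32                        srcPackets;     // packets currently in srcData
    AudioStreamPacketDescription *srcPacketDescs; // one per packet, for VBR formats
    Boolean                       endOfStream;
} MyConverterSource;

static OSStatus MyInputProc(AudioConverterRef              inConverter,
                            UInt32                        *ioNumberDataPackets,
                            AudioBufferList               *ioData,
                            AudioStreamPacketDescription **outPacketDescs,
                            void                          *inUserData)
{
    MyConverterSource *src = (MyConverterSource *)inUserData;

    if (src->endOfStream) {
        *ioNumberDataPackets = 0;   // zero packets + noErr == logical end of stream
        return noErr;
    }
    if (src->srcPackets == 0) {
        *ioNumberDataPackets = 0;   // zero packets + an error == nothing right now,
        return -1;                  // but not the end; the converter will ask again
    }                               // (the error code here is ours to choose)

    // Don't copy; point the converter at the buffer we already own. That buffer
    // has to stay valid until the next call to AudioConverterFillComplexBuffer.
    ioData->mNumberBuffers            = 1;
    ioData->mBuffers[0].mData         = src->srcData;
    ioData->mBuffers[0].mDataByteSize = src->srcDataSize;
    *ioNumberDataPackets              = src->srcPackets;

    // Hand back packet descriptions for formats that need them (AAC, etc.).
    if (outPacketDescs != NULL)
        *outPacketDescs = src->srcPacketDescs;

    src->srcPackets = 0;            // consumed; refill before the next callback
    return noErr;
}

// Driving the conversion: ask for some number of output packets at a time.
//   UInt32 outPackets = 512;  // however much output we want this pass
//   AudioConverterFillComplexBuffer(converter, MyInputProc, &mySource,
//                                   &outPackets, &outputBufferList, NULL);
```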
So once you've signified the end of stream and you're done with the conversion, you still have a little bit more that you need to do. In the case where you're going to reuse that audio converter for something, you have to make sure that you call audio converter reset before you try to do a new conversion with it. Now audio converter reset basically discards all the internally cached data in the audio converter and basically returns everything to its ground state.
Now another situation where you're going to call audio converter reset is if you're seeking around in a given data stream. Now the reason why you want to do this, even if you're decoding the same stream, is that most data formats have some kind of inter-packet dependency, and by moving around in the packet stream, you're going to need to make sure that the codec is returned to its ground state before it tries to decode anything further.
Now dealing with audio formats implies audio codecs, and I've kind of mentioned codecs in passing. Officially, audio codecs constitute the plug-in mechanism for both the audio converter API and the audio format API. Now codecs come in two flavors. You have encoders, which take in linear PCM data and output their encoded format, and then you have decoders, which take in the encoded format and spit out linear PCM data.
Now one of the other roles of an audio codec is that it's expected to be the expert on the data format it works with. And so the audio converter and the audio format API will always defer questions about the formats to the audio codec. Now in the case where you have both an encoder and a decoder for a given format present, the decoder is always consulted prior to consulting any encoders.
Now when you're looking at what kind of audio format to use for a particular job, there are several factors you're going to want to think about in the choosing. Now the first factor is bit rate. How much compression do you actually need? And this is going to be governed by the specific application. In the case of a telecommunications application, you may only have a certain allocated amount of bandwidth that you can use. And in those cases you're going to need to use a codec that can fit into that bandwidth.
Now along with the amount of compression, there's always a trade off in terms of quality, so you're going to want to balance the amount of compression that you need with the audio signal quality that you get out of an encoder. Most of the time you're going to be trading one for the other.
Now another big factor in your decision is going to be how much CPU does it cost to use this format? If you're encoding, you're going to want to look at how much CPU does it take to encode the format. The same with decoding. And another factor that a lot of people don't necessarily think about when they think about codecs is the latency in the codec. Now all of these modern codecs are based on all sorts of great signal processing and cool math tricks. But all of these things have some inherent amount of latency in the calculation.
That is, you'll stick some data in, but you'll have some extra silence put out before you get your actual bits back out. Now latency becomes a really important thing for certain classes of applications, again, such as telecommunications. And when you fall into that category you usually know it. And the Core Audio system does provide codecs that have excellent low latency characteristics.
Now, the final factor you're probably going to look at is compatibility. You're going to deploy your app potentially across several platforms, and you're going to want to make sure you pick a format that is accessible and usable across all those platforms. So here's a little table I made that kind of compares and contrasts some of the more common formats on the system. Now the first format is 16 bit linear PCM. I put that up there kind of to give you a baseline for comparison. It's pretty much the uncompressed format of the data and it doesn't actually incur any CPU cost to use it.
Now the next format on the chart is Apple Lossless. Apple Lossless, as its name implies, doesn't have any generation loss when you do the encoding. In other words, the bits that you encode are exactly the bits you get back when you decode them. Now the cost for lossless is that it's somewhat more expensive to encode and decode, in terms of CPU cost, and it doesn't support particularly low bit rates either. Now Apple Lossless is available across Mac OS X and Windows, provided you have QuickTime.
The next format I want to talk about is IMA. Now IMA is a fairly old codec standard. It's been with us for a long time. It basically provides a four to one compression ratio, and the quality, while not lossless, is still pretty high. And the CPU cost of IMA is next to nothing, it's basically a couple of multiplies and adds. And the other great thing about IMA is that it has absolutely no latency in its encoding.
Now the final three formats are all variations of MPEG formats. The first one is MPEG-4 AAC. This is the format we use on the iTunes Store. This is the format that we've been promoting in the system for a long time now. All three of these formats can trade bit rate for quality.
But it's interesting to note that when you compare and contrast these three codecs together, they kind of fall in about that order in terms of quality at the same bit rate. I kind of want to point out a little bit about MPEG-4 AAC-LD. This is a new codec in Leopard.
You might have heard it mentioned yesterday in the keynote, as iChat has delivered a significant quality improvement to its users because they've switched over to using the AAC-LD codec. Now AAC-LD is a modified form of AAC and it has a slightly worse quality at the same bit rate as regular AAC.
But what you get for that slight trade off is a significantly improved latency performance in the codec, which was a key feature for getting iChat on board for using it. And the final entry in the table is MPEG-1 Layer 3, aka MP3. And I threw that up there because pretty much everybody has an MP3 file somewhere, and it's good to look at because it's so common throughout the various platforms you're likely to use. Now I'd like to move on to talking about reading and writing audio data to a file.
The aptly named audio file API is the API basically you're going to use for doing all of that. The audio file API provides a set of global properties that allow you to find out all the readable file types, all the writable file types, and then for each of those file types, what kinds of data you can put in each one of them.
These properties are kind of the audio file equivalent to what the audio format API provides with the audio converter in describing the actual data formats. When you specify a file to the audio file API, you can do that either using a CFURL that points to a file system object, in other words a file URL, or you can use a Carbon FSRef.
Now when you want to create a new file with the audio file API, you need to have the audio file type ID. That basically says what kind of file you want to create, such as AIFF or MPEG4 or what have you. You need to know the ASBD of the format of the data you're going to put in the file. And then any other information you need to include in the file such as magic cookies or meta data, such as regions or markers or what have you. All of that stuff is taken care of after creating the file by setting properties on the file object.
Now when you're accessing an existing file, much like with most file system APIs, you just simply open those files. And then you can find out all the information about the file by accessing the various properties, such as the ASBD and the magic cookie for the format, the channel layout for a multi channel situation, the various packet descriptions for the data, and other kinds of metadata, markers, regions, lyrics, album covers, what have you.
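A sketch of opening an existing file and pulling a couple of those properties back out. Note that the exact permission constant name has varied a little between SDK releases:

```c
#include <stdlib.h>
#include <AudioToolbox/AudioFile.h>

static OSStatus OpenAndInspectFile(CFURLRef inURL, AudioFileID *outFile,
                                   AudioStreamBasicDescription *outFormat)
{
    OSStatus err = AudioFileOpenURL(inURL, kAudioFileReadPermission, 0, outFile);
    if (err != noErr) return err;

    // The data format of the audio in the file.
    UInt32 size = sizeof(*outFormat);
    err = AudioFileGetProperty(*outFile, kAudioFilePropertyDataFormat,
                               &size, outFormat);
    if (err != noErr) return err;

    // The magic cookie, if the format has one; it travels with the stream.
    UInt32 cookieSize = 0;
    if (AudioFileGetPropertyInfo(*outFile, kAudioFilePropertyMagicCookieData,
                                 &cookieSize, NULL) == noErr && cookieSize > 0) {
        void *cookie = malloc(cookieSize);
        AudioFileGetProperty(*outFile, kAudioFilePropertyMagicCookieData,
                             &cookieSize, cookie);
        // ...hand the cookie to whatever will decode the data, then:
        free(cookie);
    }
    return noErr;
}
```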
Now one word of warning, when you're trying to access the packet table of certain kinds of files, you're potentially going to have a pretty large performance hit. The reason for this is that some formats of data, such as MPEG-1 files and ADTS and AC-3 files, don't have a packet table per se; rather they have an implicit packet table signaled in the stream itself, so that in order to discover the packet table, you have to go through and parse every single packet in the data stream. And as you can imagine, for even small files, that can be a significant performance hit in order to calculate that information.
As such, the audio file API tries to provide estimates for the sorts of things you would want to access the packet table for, such as the upper bound on the size of a packet in the stream, as well as an estimate of how long, in terms of real time, the file is. And you should use these alternatives if you can, rather than necessarily incurring the overhead of having to parse the entire file to get the packet table.
Now, in the audio file API, reading and writing data work basically the same way. In fact, the API calls all have basically the same set of arguments. And they all have the same characteristics in terms of behavior. Both reading and writing are blocking calls. That means that when you make the audio file read or audio file write call, that call is not going to return until either there's some kind of system error, in which case that'll be reported to you, or the request you made is satisfied. So you should keep that in mind when you're making these calls.
Now another thing that you can do with these calls is you can say whether or not you want the audio file API to read or write the data through the system cache. This is a performance optimization, because you may or may not want to incur the extra overhead of sticking the data in the file in the general system cache.
Usually you don't want to do this in the case where you're just reading a packet of data from the disk once, potentially maybe to play it to the hardware or what have you. Now both the audio file read and write calls come in two flavors. There's a form that works in terms of bytes and there's another form that works in terms of packets.
Now while generally you can do it in terms of bytes for pretty much any format, we've been talking a little bit about format agnosticism and this is another place where that creeps into things: you're really going to want to deal with these formats in terms of whole packets. It's just about the only way to handle variable bit rate data, for example.
But generally speaking, if you're dealing with things in terms of whole packets, it's just generally easier to compute things like durations and to keep track of the data in memory. Now after you're done writing to the file, the audio file API has an optimization routine. And you can call that routine on a file to cause the file to be rewritten in such a way that it can improve the actual access performance to the data in the file.
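A sketch of a packet-oriented read, using the cache flag mentioned above. The caller is assumed to have sized outBuffer using the packet-size-upper-bound property and, for VBR data, to supply a packet description array:

```c
#include <AudioToolbox/AudioFile.h>

static OSStatus ReadSomePackets(AudioFileID inFile, SInt64 inStartingPacket,
                                UInt32 *ioNumPackets, void *outBuffer,
                                AudioStreamPacketDescription *outPacketDescs)
{
    UInt32 numBytesRead = 0;
    // false => don't pull this data through the system cache; good for a
    // one-shot read that's just going to be handed off for playback.
    return AudioFileReadPackets(inFile, false, &numBytesRead, outPacketDescs,
                                inStartingPacket, ioNumPackets, outBuffer);
}
// After you're done writing a file, AudioFileOptimize(file) is the
// optimization routine mentioned above.
```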
Now the audio file API is all well and good, it's fairly robust and has a lot of stuff in it, but we also have another API called the extended audio file API, and it unites an audio file with an audio converter. And this allows you to easily read and write files in whatever format you want while being able to do all the bookkeeping and whatnot in terms of linear PCM. This is a conceptual thing that can simplify the code that you have to write in terms of handling all the various kinds of formats.
Just like the audio file API, the extended audio file API can read and write existing files. It can create new files. In fact, creating and opening with the extended audio file is very analogous to the regular audio file API, and everything I've talked about so far applies to the extended audio file as well.
And once you've created or opened a file with the extended audio file API, you have to also set a property after the fact that will tell the extended audio file what data format you want to give it, or if you're reading what format you want it to give you.
Now when you're using the extended audio file API, because it deals with everything as if it were linear PCM, all the reading, writing, file positions, sizes, etcetera, are always handled in terms of whole sample frames. And much like the regular audio file API, the extended audio file reads block.
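A sketch of reading any supported file as linear PCM with the extended audio file API: set the client data format property, then read whole sample frames.

```c
#include <AudioToolbox/ExtendedAudioFile.h>

static OSStatus ReadAsPCM(ExtAudioFileRef inFile,
                          const AudioStreamBasicDescription *inClientPCMFormat,
                          AudioBufferList *ioBufferList, UInt32 *ioFrames)
{
    // Tell the extended audio file what format to hand us, regardless of
    // what's actually stored in the file.
    OSStatus err = ExtAudioFileSetProperty(inFile,
                                           kExtAudioFileProperty_ClientDataFormat,
                                           sizeof(*inClientPCMFormat),
                                           inClientPCMFormat);
    if (err != noErr) return err;

    // On input *ioFrames is how many frames we want; on output, how many we got.
    // Getting zero frames back with no error means we've reached the end.
    return ExtAudioFileRead(inFile, ioFrames, ioBufferList);
}
```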
The writes, on the other hand, come in two flavors. There's the blocking version, but there's also a non blocking asynchronous version that's pretty useful in lots of contexts. For example, you can use the non blocking form of extended audio file write from an I/O context, which is a place where the timing of everything is very critical. Now I'd like to talk a little bit about adding an effect to some audio data.
And when you do that in the context of Core Audio, you're going to be talking about the audio units that we ship on the system. Now an audio unit is a plug-in for encapsulating some kind of signal processing operation. Audio units can live on their own, but they can generally be assembled together, either manually or using the AUGraph API, to make a collection of them that can perform more complex signal processing operations.
Now when you're dealing with I/O in an audio unit, all the I/O is always done in the canonical sample format. It can be at any sample rate, but it's always going to be in the canonical sample format. There are a couple of exceptions. Converter audio units and output audio units can be a little more flexible with the kind of data they can take in, and we'll talk a little more about that in a minute. Audio units also can provide one or more GUI components that your application can show to the user so that the user can interact with the audio unit and can shape the sound that it creates.
And in addition to properties, audio units also make use of parameters. Now a parameter differs from a property in a couple of key ways. First, parameters only apply to things that modify the real time aspect of the signal processing. Further, parameters are always 32 bit floating point values. Now the range and the meaning of that 32 bit float is going to vary from parameter to parameter, and you can get the audio unit to fill out an audio unit parameter info structure that describes the usage of the parameter to you.
Now the changes to a parameter can be scheduled in advance, and the changes can also be ramped smoothly from one value to another over some length of time. Both of those operations are absolutely key to using an audio unit in a digital audio workstation environment where you might be automating the processing of the data over time.
Now the I/O of an audio unit is organized into busses. Now each bus on an audio unit has some number of channels, and each audio unit can have potentially many input busses and output busses. Now each bus will also have an audio channel layout that can describe the usage of each channel in a multi channel context. And when you're dealing with the audio unit API itself and you're talking about properties that relate to the input and output busses, the actual bus number is passed to the audio unit as the element in the property address.
Now in order to set up an audio unit to render some data, the first thing you have to do is locate and open the audio unit, and this is done using the Component Manager. This means you're going to use routines like FindNextComponent and OpenAComponent, etcetera, in order to locate the audio unit component that you're interested in.
Once you've opened the audio unit, you configure the audio unit by again, just like every other Core Audio object, you set properties on it. Things such as the sample rate and the maximum number of frames that you're going to ask the audio unit to render at a given time have to be set at this time. You also set up any algorithm specific attributes, such as rendering quality or pointing the audio unit at data files such as samples or impulse responses or what have you.
And then at this point you're also going to want to set up whatever connections the audio unit has to whatever other audio units you're using, and you're also going to use this point to install the various callbacks you're going to need in order to do the various processing.
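A sketch of locating, opening, and configuring an audio unit with the Component Manager (the delay effect is used here purely as an example):

```c
#include <CoreServices/CoreServices.h>
#include <AudioUnit/AudioUnit.h>

static OSStatus OpenDelayUnit(AudioUnit *outUnit)
{
    // Describe the component we're looking for.
    ComponentDescription desc = { 0 };
    desc.componentType         = kAudioUnitType_Effect;
    desc.componentSubType      = kAudioUnitSubType_Delay;
    desc.componentManufacturer = kAudioUnitManufacturer_Apple;

    Component comp = FindNextComponent(NULL, &desc);
    if (comp == NULL) return -1;                 // nothing matched

    OSStatus err = OpenAComponent(comp, outUnit);
    if (err != noErr) return err;

    // Configure before initializing: here, the most frames we'll ever ask
    // the unit to render in a single call.
    UInt32 maxFrames = 4096;
    return AudioUnitSetProperty(*outUnit, kAudioUnitProperty_MaximumFramesPerSlice,
                                kAudioUnitScope_Global, 0,
                                &maxFrames, sizeof(maxFrames));
}
```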
Now in order to get processed data out of the audio unit, the first step after you configure it is to call audio unit initialize. This tells the audio unit that you're done configuring it and that it's okay for the audio unit to go ahead and finalize its setup and allocate buffers, set up delay lines, basically get itself ready in order to do processing. Now once you've initialized the audio unit, you call audio unit render in order to have the audio unit provide you with data.
Now when you call audio unit render, you pass in a time stamp that indicates the position in the stream that you want the audio unit to render for. You also have to tell the audio unit how many frames to render. And recall that this number that you tell it can't be larger than the maximum frame size that you set earlier in the configuration step.
Now you will also pass in an audio buffer list that can either be populated with buffers, in which case you're telling the audio unit that you would like it to render into the buffers that you provide, or you can populate the audio buffer list with nulls, in which case you're telling the audio unit that you want it to render into its own buffers and then provide you with pointers to them. Now both of those options are important for when you're trying to optimize the signal flow in your application so that you can minimize the number of copies that occur in the operation.
And finally, much like the audio converter, audio units have a reset method that is there to return the audio unit to its ground state. And you're going to call this to do things like quiet reverb tails, stop delay lines, etcetera. Basically it returns the audio unit to its just-after-being-initialized state.
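A sketch of that render pull, assuming the unit has been configured and initialized and that its output is de-interleaved stereo in the canonical format (so the buffer list needs two buffers):

```c
#include <string.h>
#include <AudioUnit/AudioUnit.h>

// AudioBufferList declares room for one buffer, so tack on space for a second.
typedef struct {
    AudioBufferList list;
    AudioBuffer     extra;
} StereoBufferList;

static OSStatus PullAudio(AudioUnit unit, UInt32 inFrames, Float64 inSampleTime,
                          Float32 *left, Float32 *right)
{
    StereoBufferList buffers;
    buffers.list.mNumberBuffers              = 2;
    buffers.list.mBuffers[0].mNumberChannels = 1;
    buffers.list.mBuffers[0].mDataByteSize   = inFrames * sizeof(Float32);
    buffers.list.mBuffers[0].mData           = left;  // or NULL to use the unit's buffers
    buffers.list.mBuffers[1].mNumberChannels = 1;
    buffers.list.mBuffers[1].mDataByteSize   = inFrames * sizeof(Float32);
    buffers.list.mBuffers[1].mData           = right;

    // Where we are in the stream, expressed as a sample time.
    AudioTimeStamp timeStamp;
    memset(&timeStamp, 0, sizeof(timeStamp));
    timeStamp.mSampleTime = inSampleTime;
    timeStamp.mFlags      = kAudioTimeStampSampleTimeValid;

    AudioUnitRenderActionFlags flags = 0;
    return AudioUnitRender(unit, &flags, &timeStamp, 0 /* output bus */,
                           inFrames, &buffers.list);
}

// And when you're done (or after a seek), return the unit to its ground state:
//   AudioUnitReset(unit, kAudioUnitScope_Global, 0);
```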
So you can get data out of the audio unit, you also have the task of getting the data into the audio unit that you want to process. Now there are two ways that you can get audio into the audio unit. The first way is that the audio can come from another audio unit.
You can manually use the make connection property on the audio unit to directly connect an input bus of the audio unit to another audio unit's output bus. When you make that connection, whenever you ask the audio unit to render on that bus, it's going to go pull on the output bus of the other audio unit to get its input data. Now manual connections are good, but the AUGraph API is there specifically to help you manage all the complexity involved in dealing with the connectivity of your audio units.
Now another way you can get input data into the audio unit is to use the render callback. The render callback is called by the audio unit whenever it needs new input data to process. And you install your render callback using the set render callback property. Now one thing to note about these two methods is that they are mutually exclusive on a bus by bus basis. That is on a given bus you can either connect it to another audio unit or you can install an input callback to provide the data, but you can't do both.
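Here are sketches of both ways of feeding an input bus; remember that they're mutually exclusive on any given bus:

```c
#include <AudioUnit/AudioUnit.h>

// Option 1: connect sourceUnit's output bus 0 directly to destUnit's input bus 0.
static OSStatus ConnectUnits(AudioUnit sourceUnit, AudioUnit destUnit)
{
    AudioUnitConnection conn;
    conn.sourceAudioUnit    = sourceUnit;
    conn.sourceOutputNumber = 0;
    conn.destInputNumber    = 0;
    return AudioUnitSetProperty(destUnit, kAudioUnitProperty_MakeConnection,
                                kAudioUnitScope_Input, conn.destInputNumber,
                                &conn, sizeof(conn));
}

// Option 2: a render callback the unit calls whenever it needs input data.
static OSStatus MyRenderCallback(void *inRefCon,
                                 AudioUnitRenderActionFlags *ioActionFlags,
                                 const AudioTimeStamp *inTimeStamp,
                                 UInt32 inBusNumber, UInt32 inNumberFrames,
                                 AudioBufferList *ioData)
{
    // Fill ioData with inNumberFrames frames of input here.
    return noErr;
}

static OSStatus InstallInputCallback(AudioUnit destUnit)
{
    AURenderCallbackStruct cb = { MyRenderCallback, NULL /* refCon */ };
    return AudioUnitSetProperty(destUnit, kAudioUnitProperty_SetRenderCallback,
                                kAudioUnitScope_Input, 0 /* input bus */,
                                &cb, sizeof(cb));
}
```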
Now OS X ships a lot of audio units, and this is a partial list of some of them, and of some of the general categories of audio units that are in the system. You have effects and music effects, which are things like delays and filters, EQs, what have you. The difference between an effect and a music effect is that music effects can get their control information from MIDI.
And you have panner units. Panner units are there to manipulate the spatialization of a signal, and we have units that do HRTF panning, vector panning, sound field panning, etcetera. Then we have format converter audio units. Now I mentioned format converter audio units earlier in that they're one of the kinds of audio units that can take non canonical format data. And the reason for that is some of them actually have an audio converter built into them, but not all of them. But mostly the key aspect of a format converter is the cadence at which it's going to pull for its input data.
Then you have generators and music devices. Generators and music devices represent sources of audio in a graph of audio units. And the difference between a music device and a generator is that music devices are representative of software synthesizers, whereas generators cover a much wider range of things that can provide sound to the system.
And then you have mixers, which are pretty standard signal processing operations. We have several mixers on the system, including a stereo mixer, a mixer that does 3D mixing, the matrix mixer, which is for handling routing and other forms of complex mixing operations, and there are a couple of others. Finally you have output units, AU HAL being the primary example. Output units are the sinks in a graph. They're basically the place where audio goes to exit the graph and to be handed off to places like the hardware or to be written to a file.
And then the final kind of audio unit that I want to mention is the offline unit. An offline unit is a special kind of audio unit where the signal processing operation involved requires either random access to the audio stream data, or it requires being able to see all of the data ahead of time before it can do its processing. An example of an offline unit is normalization. Now I'd like to wrap up this part of the talk by talking about how you output data to the hardware and actually get it to come out of the speaker.
Now in the Core Audio API set there are probably more ways to output data to the speaker than there are to do just about any other kind of task. I'm going to talk about four basic ways right now, in fact. The first one I want to talk about is using the Audio HAL directly. Now the Audio HAL, as you might recall from earlier, is a very low level API.
The Audio HAL's job is to be focused on providing access to the audio hardware on the audio device's terms. That is, the Audio HAL does not provide convenience functions and it doesn't emulate things that the audio device doesn't actually do. For example, the HAL will not provide volume control on a channel if the hardware does not actually support it.
Now as such, the HAL's a pretty complex beast and applications that use it are expected to deal with all the various notifications and all the changes in state that the HAL will provide that relate to the ever evolving state of the audio hardware on the system. Applications are also expected to handle all their own mixing and also all their own format conversions so that they can provide the data to the hardware in the format that the hardware is expecting.
Now one other aspect of using the HAL is that when you're inside your I/O Proc, which is the callback the HAL will call when it wants to get some data for output from you, you have to make sure that you're not going to block. This is very important because this thread has to meet a deadline, and when you block there's no guarantee that you're going to get back to the thread in any reasonable time for making the deadline. Now given all the complexities of using the HAL and being a proper HAL client, generally it's not recommended for most applications to use the HAL.
Instead, most applications should probably look at using AU HAL, which is the output audio unit that talks to the HAL in order to send the data from a graph of audio units out to the hardware. Now AU HAL is an output audio unit and everything I've mentioned up to now about dealing with audio units definitely applies.
The main difference between output units and the rest of the audio units is that output units add two routines, the audio output unit start and stop methods. And these routines are used to start and stop I/O. Now AU HAL is there to do all the heavy lifting and all that complex code for being a proper HAL client on your behalf so that your application doesn't have to.
One of the big jobs that it does is it takes the wide variety of formats you're likely to get from an audio device and massages them down so that they appear in the audio unit world as just one input bus on the audio unit, and this saves you a lot of hassle and kind of normalizes things out so that it makes it easy to feed data to the hardware. Now AU HAL comes in two basic flavors.
There is the basic AU HAL flavor, and it tracks the device that you tell it to. You can go to the HAL and say I want to use the built-in speaker, and it will track that device across any changes, or whether it gets unplugged or what have you.
Now the other kind of AU HAL output unit is the default output unit. Now the default output units only differ from the regular AU HAL class in terms of what device they track. The default output units come in two flavors. There's the regular default output unit that tracks the HAL's content default device, which is the device that the system has designated for all content output, such as the output for iTunes or QuickTime Player or what have you. And then there's the system default device, and that's the device that's used for interface sounds. And they may not be the same device, so you have to make sure, in the case where you want to deal with a default device, that you open the right component.
Now one other interesting aspect of AU HAL is that it provides an audio converter on its input bus, but this audio converter supports only linear PCM data. But this does give you an added bonus of flexibility in terms of what kind of data you can supply to the output unit in order to map it onto the hardware.
Now when you're dealing with AU HAL, you can deal with mixing in two ways. You can either install a render callback and do it yourself, or you can connect a mixer unit to the input bus and connect your input data to the various inputs of the mixer audio unit. Now one thing to be aware of with AU HAL is that when you're inside your render callback for AU HAL, you're also on the HAL's IO Proc so all those rules I mentioned earlier still apply.
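Pulling that together, here's a sketch of playing through the default output unit with a render callback (MyRenderCallback as sketched earlier). Everything in the callback runs on the HAL's I/O thread, so the no-blocking rule applies:

```c
#include <CoreServices/CoreServices.h>
#include <AudioUnit/AudioUnit.h>

static OSStatus StartDefaultOutput(AudioUnit *outUnit)
{
    // The default output unit tracks the HAL's content default device.
    ComponentDescription desc = { 0 };
    desc.componentType         = kAudioUnitType_Output;
    desc.componentSubType      = kAudioUnitSubType_DefaultOutput;
    desc.componentManufacturer = kAudioUnitManufacturer_Apple;

    Component comp = FindNextComponent(NULL, &desc);
    if (comp == NULL) return -1;

    OSStatus err = OpenAComponent(comp, outUnit);
    if (err != noErr) return err;

    // Feed the unit's input bus from our render callback.
    AURenderCallbackStruct cb = { MyRenderCallback, NULL };
    err = AudioUnitSetProperty(*outUnit, kAudioUnitProperty_SetRenderCallback,
                               kAudioUnitScope_Input, 0, &cb, sizeof(cb));
    if (err != noErr) return err;

    err = AudioUnitInitialize(*outUnit);
    if (err != noErr) return err;

    // Output units add start/stop; this begins pulling our callback for data.
    return AudioOutputUnitStart(*outUnit);
}
```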
Now the next method I want to talk about for outputting data to the hardware is using OpenAL. OpenAL is a high level API that provides cross platform 3D audio mixing. And while OpenAL has mostly been tailored for the needs of game developers, we found that it can be very useful in a wide variety of contexts.
And if you need cross platform support, OpenAL is a pretty good place to start. Now the Mac OS X implementation of OpenAL is built on top of the 3D mixer audio unit that I mentioned earlier. And I'm not going to go into too much more detail about Open AL because we're going to have a great demo about OpenAL in the very next session here, in the Core Audio hands-on session.
And the final way I'd like to talk about for outputting data to the hardware is using the Audio Queue API. The audio queue is a new API in Leopard that provides a very high level API for playing audio data. It can handle both linear PCM and compressed audio data, but one of the interesting aspects about the queue is that it owns the buffers that you're going to be filling up to play. Now the queue also supports very rich timing mechanisms and it also has support for scheduled playback of your sound. And again, much like OpenAL we're going to talk a lot more about the audio queue in the next session, the audio hands-on session.
So to kind of think back I've been talking about the core audio architecture in general, what sort of APIs we have, and I've gone into some usage cases at kind of a 10,000 foot level to kind of introduce you to some of the Core Audio APIs you're going to end up using.
And the next session, as I said earlier, is the Core Audio hands on session where you're going to get some more practical experience with the Core Audio APIs, in particular we're going to talk about playing and recording with the audio queue, using the audio file and audio file streams APIs, using the extended audio file API and more about OpenAL.
Now there are a lot of resources out there that you can use to find out more about Core Audio and its architecture. There's documents, including the Core Audio Overview, Getting Started With Audio, and various programming guides, including The Audio Unit Programming Guide, and the new Audio Queue Services Programming Guide, which you can get from the WWDC attendee site. You'll also find tech notes and sample code on the developer website at the URL here.
And again, here are some more resources I'd like to particularly plug, the Core Audio SDK again. There's a lot of code in there, and even though a lot of it is C++, you can adapt it pretty much to whatever you're trying to do, and there's a lot of knowledge kind of coded up in that SDK. And finally, I just want to plug our mailing list. The Core Audio API mailing list is probably the best way to get in touch with us. All of the Core Audio engineering staff monitors that list and we answer questions all the time there.