iPhone • 1:04:11
Core Audio delivers professional-grade audio processing and playback capabilities to any application. Dive deeper into the functionality of Core Audio and see how to take advantage of its richness in your own code. See how to apply the AV Foundation API to play sounds or record audio. Understand the best practices to adopt when using OpenAL for game audio, and uncover the power of I/O units for low-latency input and output.
Speakers: James McCartney, Bob Aron, Doug Wyatt, Eric Allamanche
Transcript
My name is James McCartney and I'm going to talk about AVFoundation. So AVFoundation is a set of Objective-C classes. I'm going to talk about AVAudioPlayer and AVAudioRecorder. Bill covered AVAudioSession in a previous talk. So AVAudioPlayer is pretty much the simplest way to play a sound. It's layered on top of Audio Queue and Audio File in the AudioToolbox, and it plays file types that are supported by Audio File, including CAF files, M4A, MP3, et cetera.
And it provides basic playback operations like creating a player, preparing it, playing, pausing, skipping, stopping. It also supports volume, metering, and looping, and it uses a delegate for notifications for interesting events like your sound finished playing or there was a decode error, or there was an interruption that's begun or ended.
So there are two ways to create an AVAudioPlayer: you can create one using a URL for a file in the file system, or you can create one from NSData that contains audio file data. So the first method that you'd call would be prepareToPlay, and that allocates all the resources that are necessary to play a sound.
It creates the AudioQueue, it allocates the buffers, and it performs priming, so it's decoding the beginning of the sound before playback begins. And then it returns Yes if there was no problem, and No if it failed. And if you don't call prepareToPlay and you just call play, prepareToPlay will happen automatically anyway.
But the reason you would want to call prepare is to get that first bit of audio decoded, and so when you call play it's just ready to go and there's going to be less of a bump in the CPU use right at that moment. So in order to play you call play and it starts playing the sound. And if you were stopped or you were paused, then play is the way to resume playback.
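To make that concrete, here's a minimal sketch of the create/prepare/play flow described above; the file name, URL, and variable names are placeholders rather than code from the session.

    #import <AVFoundation/AVFoundation.h>

    // Create a player from a file in the app bundle (hypothetical file name).
    NSError *error = nil;
    NSURL *soundURL = [NSURL fileURLWithPath:
        [[NSBundle mainBundle] pathForResource:@"sound" ofType:@"m4a"]];
    AVAudioPlayer *player = [[AVAudioPlayer alloc] initWithContentsOfURL:soundURL
                                                                   error:&error];

    // Optional: decode the first buffers now so play starts without a CPU bump.
    [player prepareToPlay];

    // Start (or resume) playback; pause keeps the queue allocated, stop disposes it.
    [player play];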
So pause stops playback, but the player remains prepared, all the resources are allocated, the AudioQueue and buffers. And so when you call play it will just resume playing from where you left off. Now stop also stops playback, but in this case the player is no longer prepared to play. The AudioQueue and the buffers have all been disposed. And so when you call play again it will need to do a prepare in order to start.
Now stop does not reset the player back to the beginning of the sound, it works just like on a cassette deck if you hit stop and then you hit play, you're going to continue from where you left off. If you want to go back to the beginning of the sound, you need to set currentTime back to the start.
So currentTime is the current position in the audio file, in seconds. You can get currentTime or you can set it at any time. If you set it while you're stopped or paused, that controls where you begin playing back when play is called again. And if you set it while you're playing, you'll jump to that time. And I'm going to demonstrate setting currentTime to implement a fast-forward feature in the demo. So AVAudioPlayer also supports metering.
You need to explicitly enable metering because it's somewhat expensive to run. So once you've enabled metering, then periodically you'll call updateMeters to have the meters update themselves, and then you can get the values for each channel. Other properties for AVAudioPlayer are the volume, you can get and set the volume.
Number of loops, you can turn looping on or off or set the number of loops, the number of channels and duration of the file. And also there's a delegate. You can create a delegate object for your AVAudioPlayer, and then when certain interesting events happen you'll be notified. One is that your player finished playing, or there was a decode error because maybe your file had corrupt packets in it.
And then also, as was covered before, there might have been an interruption from a phone call, and then you'll get a begin- and end-interruption notification; you'll only get the end-interruption if the phone call was declined, or if your app wasn't terminated. So if you do get an end interruption, you might still want to play, so your delegate callback would look something like this: you might check a condition to see if you still want to play.
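Here's a sketch of what that end-interruption delegate method might look like; the stillWantToPlay flag is a hypothetical app-specific condition, not the session's actual slide code.

    // AVAudioPlayerDelegate: the interruption (for example a declined call) has ended.
    - (void)audioPlayerEndInterruption:(AVAudioPlayer *)player
    {
        if (self.stillWantToPlay) {     // hypothetical condition your app maintains
            [player play];              // resume from where playback left off
        }
    }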
And then if you do, you can just call play on the player and your playback will continue. So in the demo that I'm about to do, which implements fast forward, what's going to happen is we're going to play 0.1 seconds of sound at a time and then skip 0.3 seconds ahead.

So that will give us a roughly three-times, CD-style skip forward through the file. So when the UI fast-forward button is pressed, we're going to use an NSTimer that is an instance variable in this class. If the timer exists, we're going to invalidate it; otherwise we're going to schedule it to send the fast-forward selector to this controller object every 0.1 seconds, and that's going to repeat periodically.

And then when the button is released, we're going to invalidate the timer. So every time the fast-forward selector is called, which is every 0.1 seconds while the button is down, we're going to advance currentTime by 0.3 seconds, and that gives us a way of skipping through the file. And so I'll just go to the demo unit here.
[ Video game sound ]
When I hold down fast forward, we're skipping around, and reverse is implemented the same way, just skipping around in the file by setting currentTime. So it's that simple really; a sketch of that timer logic follows.
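Putting the pieces from the demo together, the timer-based fast forward might look like this; fastForwardTimer, player, and the method names are hypothetical stand-ins for the demo's actual code.

    // Called when the fast-forward button is pressed and again when it's released.
    - (void)toggleFastForward
    {
        if (self.fastForwardTimer) {
            [self.fastForwardTimer invalidate];
            self.fastForwardTimer = nil;
        } else {
            // Every 0.1 s, send -fastForward to this controller, repeating.
            self.fastForwardTimer =
                [NSTimer scheduledTimerWithTimeInterval:0.1
                                                 target:self
                                               selector:@selector(fastForward)
                                               userInfo:nil
                                                repeats:YES];
        }
    }

    - (void)fastForward
    {
        // Skip 0.3 s ahead; playback continues from the new position.
        self.player.currentTime += 0.3;
    }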
Back to slides please. So new in 3.0 is AVAudioRecorder. It's very similar to AVAudioPlayer, but it records sound. You create an AVAudioRecorder, you give it a URL for the file that you want to record to, and you give it a settings dictionary, which is an NSDictionary containing key-value pairs that specify what format you want to record, the number of channels, and the sample rate. For linear PCM you would also specify bit depth and endianness, or for an encoded format you can specify quality, bit rate, and bit depth.
So these are the keys that are supported for the settings dictionary, and this is how you would set one up. So there's a key for the format ID, which here is MPEG-4 AAC; the sample rate is 44.1 kHz, the number of channels is two, the encoder bit rate is 128 kilobits, and the encoder quality is set to max.
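As a sketch, that settings dictionary and the recorder creation might look like this; recordingURL is a placeholder, and kAudioFormatMPEG4AAC comes from the Core Audio headers in AudioToolbox.

    NSDictionary *settings = [NSDictionary dictionaryWithObjectsAndKeys:
        [NSNumber numberWithInt:kAudioFormatMPEG4AAC],  AVFormatIDKey,
        [NSNumber numberWithFloat:44100.0],             AVSampleRateKey,
        [NSNumber numberWithInt:2],                     AVNumberOfChannelsKey,
        [NSNumber numberWithInt:128000],                AVEncoderBitRateKey,
        [NSNumber numberWithInt:AVAudioQualityMax],     AVEncoderAudioQualityKey,
        nil];

    NSError *error = nil;
    AVAudioRecorder *recorder =
        [[AVAudioRecorder alloc] initWithURL:recordingURL
                                    settings:settings
                                       error:&error];
    [recorder prepareToRecord];
    [recorder record];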
So we're just creating an NSDictionary that way. So recording has the same corresponding methods as AVAudioPlayer: prepare, record, pause, and stop. There's also metering, which works exactly the same as in AVAudioPlayer, and there are delegate methods which respond to the same kinds of events that AVAudioPlayer responds to. So that's AVFoundation: simple Cocoa classes to play back and record files on the iPhone. And now I'm going to bring up Bob Aron; he's going to talk about OpenAL.
Good afternoon, my name is Bob Aron, I'm an engineer on the Core Audio team. And I'm going to talk a little bit about some best practices for when you're using OpenAL in your application. So first, just for a quick review: OpenAL, if you're not familiar, is an open, cross-platform API for doing spatialized audio. There are implementations on several platforms, Mac OS X, Windows, Linux, and of course the iPhone. It's primarily used for game development, but it can be used for other audio purposes.
And if you're not already familiar, I'm not really going to go into any of the APIs directly, so if you're not familiar with OpenAL, I'd recommend that you go to the openal.org site for more info. As far as Apple is concerned, we've been delivering an implementation of OpenAL since we shipped Tiger 10.4. And then when the 1.1 specification was completed, we did implementations for a Tiger update on 10.4.7, and that's also what's on the 2.0 iPhone release, and of course 3.0. So OpenAL is basically organized into four primary objects.
There's an OpenAL device, an OpenAL context, an OpenAL source, and an OpenAL buffer. The OpenAL device is basically the object that is responsible for delivering audio to the hardware on the system that you're running on. On the iPhone this is implemented on top of the Remote I/O audio unit, which we'll go into in more depth later in the session.
The OpenAL context is basically your spatial environment. In our implementation it sits on top of the 3D mixer, where all the heavy mixing and the spatialization gets done. It has an implicit listener, and it's responsible for rendering all of the objects that are moving around in your virtual space.
And those objects that are moving around are the OpenAL sources. So the OpenAL source's job, basically, is to queue up audio data and provide it to the mixer, or the context, and also to move around using the spatial X, Y, Z coordinates that you might also use in OpenGL. The last object is the OpenAL buffer, and the reason I have it separated on the diagram here is that it doesn't really have any connection to the sources or the context, which all have some connection in terms of how they're set up.
With the OpenAL buffers, you just generate a set of these buffers, populate them with audio data, and they're available for use by any of the sources that are being rendered by the context that you have in your application. So let's just jump into some best practices. One of the things that you have to do on the device with the existing 2.0 iPhone OS, when you get an interruption such as a phone call, is to save off the state of your contexts and all the associated sources. So you have to save that off, and then you have to destroy your contexts.
And then when your interruption ends, if you're still running, you can then create a new context and restore all the state for those objects. Well, with 3.0 you don't have to do that anymore; all you really have to do when you get your interruption in your listener is to set the current context to NULL. And then when your interruption ends, you can just restore your current context back to the context that you had been using.
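A sketch of that 3.0-style pattern inside an AudioSession interruption listener; openALContext is an assumption about how your app keeps track of its ALCcontext.

    #include <OpenAL/alc.h>
    #include <AudioToolbox/AudioToolbox.h>

    extern ALCcontext *openALContext;   // kept by your app (assumption)

    // AudioSession interruption listener registered with AudioSessionInitialize.
    static void MyInterruptionListener(void *inClientData, UInt32 inInterruptionState)
    {
        if (inInterruptionState == kAudioSessionBeginInterruption) {
            // On 3.0, just detach the current context; no need to tear it down.
            alcMakeContextCurrent(NULL);
        } else if (inInterruptionState == kAudioSessionEndInterruption) {
            AudioSessionSetActive(true);
            // Restore the context we were using before the interruption.
            alcMakeContextCurrent(openALContext);
        }
    }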
Now, if you're already doing this on 2.0, where you're saving off your state, and you want to stay compatible with the 2.0 release, then in a 3.0 app you can make the decision at run time. So there's a little code here where you go get the major OS version, and then when you get your interruption you can decide which mechanism to use at run time. We would recommend that if you're running on 3.0, you really do want to use the alcMakeContextCurrent method. The next best practice has to do with OpenAL sources. Now, with OpenAL there's no callback mechanism, so for you to find out state about your OpenAL source, you have to poll.

So if you want to find out if your source is playing or paused or stopped, or you want to know how many buffers have been processed, you have to poll. And so if you're using OpenAL already, you may recognize this little code snippet; you have to poll and ask the source for some state.

And something we've seen in a lot of applications is a tight loop like this, where you actually can hurt your performance because you're calling into the source just too often. So what we recommend is that you put a short sleep in while you're doing your polling, and that actually can give you a benefit in your performance.
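For example, a sketch of a polling loop with a short sleep between queries; the source variable and the 10 ms interval are placeholders.

    #include <OpenAL/al.h>
    #include <unistd.h>

    // Wait for a source to finish without hammering the library in a tight loop.
    ALint state = AL_PLAYING;
    do {
        usleep(10000);                               // sleep ~10 ms between polls
        alGetSourcei(source, AL_SOURCE_STATE, &state);
    } while (state == AL_PLAYING);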
Another thing we've seen in regards to OpenAL sources, in a lot of the applications we've had to investigate since we shipped the 2.0 release, is a calling pattern that looks a little bit like this. Your application wants to play a sound, so you create a source, you attach a buffer for playback, you start playback, you poll until your sound finishes playing, and then you delete the source. And we've seen this even in circumstances like a game that's playing a machine gun or something, where each one of those rat-a-tat-tats is doing this particular calling pattern.
Now this is really kind of inefficient for that kind of scenario, so what you really need to do is be aware of what your source is doing. If your source is moving around in your space and you know it's going to be playing some more sounds, or maybe you're going to play the same sound again, you really just need to attach a buffer to that source, play it, and then detach it; then you'll have the source around, and you won't call generate and delete sources so often when it's not necessary.

OpenAL buffers. So the OpenAL buffer object, as I mentioned earlier, is the way that you encapsulate some audio data that you want to play back via an OpenAL source. And there are two ways that you can provide that audio data to the buffer object.
The first is the alBufferData call; that's the standard OpenAL API for providing audio data to your object. And you'll notice that the parameters pass some information about the format, how big the data is, and a pointer to your data. The other way that you can provide data is using an extended version called alBufferDataStatic.

And you'll notice that the functions look quite similar and take the same parameters. The main difference is that alBufferData makes a copy of the data into the implementation's own internal buffers, so that your application can release the memory, whereas with alBufferDataStatic the implementation uses the data that you provided directly, so it's really important that you don't dispose of that audio data before the library is finished using it.

So we'd recommend that you actually use the alBufferDataStatic method, primarily because you avoid a memory copy, which is kind of an expensive operation on the device. And if you're constantly changing the data that's being played by these buffers, those memory copies can add up, so calling the alBufferDataStatic method really is a bonus for you in terms of performance. As I mentioned, you really don't want to release this data before we're done, because we may be rendering that data at the time you're trying to release it.

So when you call alDeleteBuffers, or call alBufferDataStatic again in order to replace the data that the buffer object is representing, it's really important that you check your error; and if calling alGetError returns no error, that's a great time for you to release the memory that the library was using.
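Here's a sketch of that alBufferDataStatic pattern. On the iPhone the function is an Apple extension that you look up at run time, so the lookup below is based on how such extensions are normally obtained via alcGetProcAddress; myPCMData and myPCMSize are placeholders.

    #include <OpenAL/al.h>
    #include <OpenAL/alc.h>
    #include <stdlib.h>

    typedef ALvoid (*alBufferDataStaticProcPtr)(ALint bid, ALenum format,
                                                ALvoid *data, ALsizei size, ALsizei freq);

    ALuint buffer;
    alGenBuffers(1, &buffer);

    // The library uses our memory directly; no copy is made.
    alBufferDataStaticProcPtr alBufferDataStaticProc =
        (alBufferDataStaticProcPtr)alcGetProcAddress(NULL, "alBufferDataStatic");
    if (alBufferDataStaticProc)
        alBufferDataStaticProc(buffer, AL_FORMAT_STEREO16, myPCMData, myPCMSize, 22050);

    // ... later, when deleting (or replacing) the buffer's data ...
    alDeleteBuffers(1, &buffer);
    if (alGetError() == AL_NO_ERROR) {
        free(myPCMData);   // safe: the library is finished with our memory
    }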
So the last thing I really want to talk about is tuning your assets. And what I mean by that is being aware of sample rates, both in terms of the sample rates of the audio data in your assets, and the sample rate of the rendering chain as it renders your data through the OpenAL pipeline. So what you'll see on the diagram here, on the far right-hand side, is 44k; that's the default rate the device is running at, and so by default, if you don't make any other changes, everything will be running at 44k.
And then you'll see on the left-hand side, there's a stack of sources here that represent the audio data that each source is playing, and there's a bunch of various sample rates, 8k, 11k, 22k. This is a really typical scenario in the applications that we see using OpenAL. So to have an understanding of how you can tune this, let's kind of go down the list here.
By default, the context or the mixer will render its audio at the same sample rate that the device is currently set at. So in our case here, the context is going to do all its mixing at 44k. And this has a couple of implications for the performance of your app.
You'll see in the big red circle what's going on here: every one of those sources that's not at 44k has to get upsampled, so that's a processing hit, doing the sample rate conversion. The next thing that you get hit with is that since we're at 44k, we're now pushing a lot more data through the mixing process, so that's another performance hit you're going to take when you're rendering this particular scenario.
So one of the things that you can do is set the context's mixer rate to the highest sample rate of your assets, for instance. So in our case here, the highest sample rate that we have in our assets is 22k, so we can set our mixer to 22k.
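That mixer rate is set through an Apple OpenAL extension; the sketch below assumes the alcMacOSXMixerOutputRate extension, looked up at run time and called before the context is created.

    #include <OpenAL/alc.h>

    typedef ALvoid (*alcMacOSXMixerOutputRateProcPtr)(const ALdouble value);

    // Set the context/mixer render rate to match the highest-rate asset (22.05 kHz here),
    // before calling alcCreateContext.
    alcMacOSXMixerOutputRateProcPtr alcMacOSXMixerOutputRateProc =
        (alcMacOSXMixerOutputRateProcPtr)alcGetProcAddress(NULL, "alcMacOSXMixerOutputRate");
    if (alcMacOSXMixerOutputRateProc)
        alcMacOSXMixerOutputRateProc(22050.0);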
And there's really no benefit at running at 44k, you don't get any extra quality if none of your assets are actually sampled at 44k. So we get a couple of wins in this scenario. First of all, we've cut the amount of data that the mixer is using in half, so instead of 44k data running through the mixing process, we're at 22k.
And then you'll also notice that the samples that were already at the mixer sample rate no longer need a sample rate conversion, so you get two wins. And this is probably one of the best things that you can do in terms of getting some extra performance out of your OpenAL pipeline. Now another thing that you can do is actually resample your assets at the same rate that you know you'll be setting your mixer to when you're at runtime. So that eliminates all the sample rate conversion in the mixing process.
And to make this decision you'll have to make the trade-off because of course now your assets in your application package are going to be larger, so this may or may not make sense for you, but it is another thing that you can do to maybe tweak your performance just a little bit more.
The last thing that you can do, if your application isn't playing its own 44k stereo background track and you're just playing the sounds that are moving around, is to consider setting the hardware to 22k too. And you can do that using the AudioSession property for the preferred hardware sample rate. So in this case we also eliminate a sample rate conversion at the device level, where we'd otherwise have to upsample our mix up to whatever the hardware rate is.
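A sketch of asking for that hardware rate with the C AudioSession API, and checking what you actually got:

    #include <AudioToolbox/AudioToolbox.h>

    // Ask for a 22.05 kHz hardware rate before activating the session.
    Float64 preferredRate = 22050.0;
    AudioSessionSetProperty(kAudioSessionProperty_PreferredHardwareSampleRate,
                            sizeof(preferredRate), &preferredRate);
    AudioSessionSetActive(true);

    // The request may not be honored, so check the current rate afterwards.
    Float64 actualRate = 0;
    UInt32 size = sizeof(actualRate);
    AudioSessionGetProperty(kAudioSessionProperty_CurrentHardwareSampleRate,
                            &size, &actualRate);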
As Bill mentioned earlier when he talked about AudioSession, you may not actually succeed in setting it; the iPod may be playing, or something else may already be using the hardware at 44k, so you have to be prepared to not actually get that rate. But that's yet one more thing that you can do. So those three things in tuning your assets really can help you get some extra performance out of OpenAL. With that, I'd like to bring up Doug Wyatt; he's going to talk about input and output units, and here he is. Thanks.
I'm Doug Wyatt, I work in the Core Audio group also, and I'd like to talk a bit about audio units used for input and output. So the names of these two audio units, we have one on the desktop called AUHAL and on the phone we've got one called AURemoteIO. They're similar enough that I'll talk about them collectively and describe some differences here. We'll talk about when to use one, how these units work, and I'll dive into the details of actually setting one up and using one.
So when would you want to actually use an input/output audio unit as opposed to some of these higher level APIs? The main consideration is that your application needs a high degree of responsiveness and low latency, and you're ready to operate in a constrained environment in order to do that. So for example, you might have a custom audio engine of some sort, you might be doing internet telephony, you might have a software musical instrument. On the desktop you might consider using the Core Audio HAL directly for these kinds of applications.
Although the advantage of AUHAL on the desktop is that it takes care of a lot of the busy work of being a client of the HAL, namely responding to certain notifications and dealing with some complexities of multi-channel audio devices. On the iPhone, the Core Audio HAL simply isn't available, so AURemoteIO is the lowest level way for your app to get audio in and out.
So just as a big picture of what you have to do as the client of one of these I/O units, you have to say whether you want to do input or output or both, you need to tell it what PCM formats you want to use as your client formats.
Typically you'll care about things like 16-bit integer versus 8.24 fixed-point formats, or floating point. But you may want to copy the sample rate from the hardware. In terms of connecting your audio unit up and providing output to it, you can get your output from another audio unit like a mixer or an effect.
Or you can supply a callback function that actually synthesizes the audio, or is the last link in a big synthesis engine. If you're doing audio input, you specify a callback function that gets notified when input becomes available; your callback works that way for reasons of symmetry in the audio unit spec.

At the time you get that notification, you turn around and call AudioUnitRender. Now in both cases, the audio unit will convert between some virtual format, which might be the hardware format, and the client format that you set up.
So we'll go into the details of that in a minute. So here looking at some code is the mechanics of actually locating and opening one of these I/O units. This is for the iPhone where we're going to locate the remote I/O audio unit, it's an I/O unit but for historical reasons we call them output units in the API. So we set up this audio component description and we pass that to AudioComponentFindNext which gives us back the component. And given the component, then we can turn around and create an instance of it, and that's our I/O unit.
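In code, that lookup might look like this sketch:

    #include <AudioUnit/AudioUnit.h>

    // Describe the Remote I/O unit (an "output" unit in the API's terminology).
    AudioComponentDescription desc = {0};
    desc.componentType         = kAudioUnitType_Output;
    desc.componentSubType      = kAudioUnitSubType_RemoteIO;
    desc.componentManufacturer = kAudioUnitManufacturer_Apple;

    // Find the component, then create an instance of it: our I/O unit.
    AudioComponent comp = AudioComponentFindNext(NULL, &desc);
    AudioUnit ioUnit = NULL;
    OSStatus err = AudioComponentInstanceNew(comp, &ioUnit);
    // check err == noErr before going any further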
Now this audio component API is also available on the desktop starting with Snow Leopard. On earlier versions of Mac OS X you have to use the classic, well, I shouldn't say classic, but the Carbon Component Manager. And we have a lot of examples of how to do that; it's very parallel to audio components. But on the phone, and on Snow Leopard going forward, we want you to be using the audio component APIs.
Okay, so this little diagram here turns out to be really useful when you start using this unit to just make a little drawing of this and set it next to you while you program, because you're going to be referring to these four quadrants of the unit all the time.
As you may know from working with other audio units, they have this concept of elements, for example, a mixer has input elements for each source that you're giving it. In the case of the I/O units, there are these two elements, one is for input and one is for output. Element zero is for output, and I remember that because output starts with O which looks like a 0.
And similarly, input starts with I, which looks like a 1. Now we see that two of the four corners here are the virtual output and input formats that come from the HAL (well, I'll leave that detail out); those are your virtual formats. And you've got your client formats, and in both cases those orange/yellow arrows signify conversions that the audio unit is performing for you between those two formats.
And then before we get into the actual process of specifying these formats, there's just a few things to think about, some wrinkles. These apply mostly to the iPhone. One is if you're going to be doing audio input, you should be asking AudioSession first of all whether there's actually even any input hardware present, because on the iPod Touch the answer will be no unless there's a headset or other accessory attached. You will probably care a lot about the hardware sample rate, so you can use AudioSession with a property listener to ask for notifications when sample rate changes.
You will want to probably tell AudioSession what sample rate you would like to operate at. And be aware that you won't always get that sample rate. If you're mixing with others, as Bill described in the previous session, the hardware sample rate could be set to something else and it won't change for you.
And if you're dealing with an original iPhone, when it goes into a mode with input active, then the hardware sample rate with the built-in receiver and microphone can only be 8 kilohertz. Unlike the newer models which support a wide variety of sample rates for both input and output.
The other thing is especially if you're familiar with using audio units in other contexts, you want to call AudioUnitInitialize relatively early because it's only at this time on the phone that your AudioSession parameters, most notably the sample rate and the enabling of input, actually take effect. So you want to call that to establish your sessions, or to find out what sample rates you actually get from your AudioSession.
So moving on to choosing your client formats for AURemoteIO: typically you will look at the virtual I/O formats. If necessary, the audio unit is capable of doing a sample rate conversion between the hardware rate and a rate that you like, but ideally, in a lot of apps, it's a good idea to just play along with the hardware sample rate. If you specified it, then you got the one you want. If you didn't, well, then you do have to play along.
The difficulty is the rate conversion is a bit computationally expensive and it's not of the highest quality. One mistake that I've seen some people making, and it's partially our fault because we shipped a sample like this, and have since corrected it, but don't just simply blindly copy that virtual I/O format to your client format because you don't know what that virtual format is, it can and will change.
But on the other side of the coin, you are in control of the code that consumes that client format of yours. So be specific about what your code is expecting, and only copy the aspects of the virtual format that you want to treat as variable, typically the sample rate and the channel count.
So going back to our diagram here, this just shows you that you use the AudioUnitSetProperty call with the stream format property to set your client formats. And a few more small considerations for AUHAL; these are differences. Since there's no AudioSession here, you get the hardware sample rate from the virtual I/O format.
You use AudioUnitAddPropertyListener to watch for changes to those sample rates. And on the desktop with AUHAL, you want to avoid changing the hardware sample rate, because the user tends to be a bit more attached to it, especially if he's an audiophile with a 96 kilohertz output system. The consolation is that we have very good sample rate converters, in terms of quality, on the desktop. Input is a little strange on the desktop. Here you may want to give your user control over the hardware sample rate, to avoid rate conversions on input.
If he's making a recording for instance in an application like ProTools or Logic, he's going to care about how much disk space he's chewing up as he's recording.
[ Laughter ]
And part of the picture here is that AUHAL will not do sample rate conversion on the desktop, you have to record at the hardware rate, which means that if you want your audio at some other sample rate, you'll have to store it off in a ring buffer and do other rate conversions.
Okay, going back to some code, here's what it actually looks like to call AudioUnitSetProperty for the stream format. One kind of pervasive data structure in the Core Audio APIs is the audio stream basic description. And this is how you specify a PCM format. And we'll see a detailed example of that in a moment.
So now we've set our client stream formats and we're ready to connect up our code to send audio out and receive the incoming audio. On the output side, we've got two properties. The first SetRenderCallback says here's my function pointer that I want you to callback for the audio that I as an application want to send out.
You can also call or set the property MakeConnection, and what that says is here's another audio unit from which the outgoing audio is to be obtained. On the input side, the property is called SetInputCallback, and that just simply sets a function and user data, and that function of yours will get called back when input becomes available.
These calls are brain-dead simple, they just take one argument. The first one I mentioned earlier, AudioUnitInitialize, gets your session set up and established. The second one is AudioOutputUnitStart, which you call even if you're only doing input. That actually starts the hardware running and starts your callbacks getting called. So here is a diagram of what happens once the hardware has been started.
The first thing that happens, assuming you're doing both input and output, the first thing that happens is your input callback gets called and told that input is available. And in response to that you'll turn around and call AudioUnitRender, and the AU will convert the incoming audio in the virtual input format, to your client format, and that comes back to you from AudioUnitRender.
On the output side, element zero will end up pulling upstream from its source whether it's your callback function or that other audio unit that you connected to it. So let's talk about the context in which your callback functions execute. This is a bit tricky and daunting to people who haven't done it before. Your callback functions run on a real-time priority thread, and this is the highest priority thread in the system, so you have to be pretty careful about what you do here.
You can actually make the system unresponsive for long periods of time if you get into an endless loop or try to do inordinate amounts of work. So again, it's important to be efficient. You really can't allocate memory, touch the file system or the network or any other API like that. And the reason is because all of these things will take mutexes and block your thread.
And you really don't want to block the audio I/O thread, because the result is it will miss its deadlines and we'll hear a glitch. So if you really have to lock, maybe you're changing the entire configuration of the engine, okay, so you're doing that from, say, your UI thread. And then your render thread at that point should be trying to take the lock, failing, and producing silence.
It's a much better and more efficient pattern than actually blocking the render thread. One other thing to mention here: a lot of people learn Objective-C and think it's so great they want to use it absolutely everywhere. And in a UI application I completely understand that, until you get down into the level of working on real-time threads, because there is some overhead here.
In particular on the desktop, you don't want to mix Objective-C and real-time threads in a garbage collection enabled app because that makes your real-time thread start participating in the garbage collection scheme and potentially causing major glitching. On the iPhone we've seen some new converts to Objective-C invoking methods to compute every sample that they output, and that's about maybe 20 times less efficient than it could be, so it's good enough to just use C arrays and pointers in your real-time threads.
Be careful about what you do on every sample. So that caveat aside, let's walk through a bit of an example, just to look at all the code that's actually involved in getting some input from AURemoteIO on the phone. This is pretty much what we looked at before, using the audio component calls to find and open the I/O unit.
Here is a little detail that I mentioned earlier but we didn't look at in code. We have to explicitly enable audio input, and while we're at it since we're not going to do output we'll save some efficiency and turn that off. So the property is called EnableIO, the value is 1 for enabling input and we're going to do that on the input scope of element 1, which as we remember, element 1 is like I for input. And completely parallel to that, we disable audio output, and then we're ready to initialize the audio unit. And at that point we have a valid virtual I/O format.
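Sketched out, that enable/disable step looks roughly like this, using the ioUnit created earlier:

    UInt32 one = 1, zero = 0;

    // Enable input on the input scope of element 1 ("1 looks like I for input").
    AudioUnitSetProperty(ioUnit, kAudioOutputUnitProperty_EnableIO,
                         kAudioUnitScope_Input, 1, &one, sizeof(one));

    // Disable output on the output scope of element 0, since this example is input-only.
    AudioUnitSetProperty(ioUnit, kAudioOutputUnitProperty_EnableIO,
                         kAudioUnitScope_Output, 0, &zero, sizeof(zero));

    // Initialize; on the phone this is where the AudioSession settings take effect,
    // and afterwards we have a valid virtual I/O format to look at.
    AudioUnitInitialize(ioUnit);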
The type gets a little smaller here. So we can look at this virtual input format that we got back by calling AudioUnitGetProperty for the stream format. We're going to peek at that; it's now in our ioFormat local variable. And as you scan through the code there, the only thing we care about is ioFormat.mSampleRate. We're going to set up the client format otherwise entirely programmatically, on the fly here, to be 16-bit native-endian stereo interleaved integers.
But we are going to copy that hardware sample rate. So having set up that client format, then we can make that call to AudioUnitSetProperty to set that client format as the stream format for the output scope of the input element, element 1. So we've set our formats and we're almost ready to connect. First we need to set up a little bit of context that we want to have available when it's time for us to render or process our incoming audio.
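Before moving on to that context, here's a sketch of the format setup just described: read the virtual input format, keep only its sample rate, and set a 16-bit interleaved stereo client format on the output scope of element 1.

    // Peek at the virtual input format to get the hardware sample rate.
    AudioStreamBasicDescription ioFormat;
    UInt32 propSize = sizeof(ioFormat);
    AudioUnitGetProperty(ioUnit, kAudioUnitProperty_StreamFormat,
                         kAudioUnitScope_Input, 1, &ioFormat, &propSize);

    // Build the client format programmatically: 16-bit native-endian interleaved
    // stereo integers, copying only the hardware sample rate.
    AudioStreamBasicDescription clientFormat = {0};
    clientFormat.mSampleRate       = ioFormat.mSampleRate;
    clientFormat.mFormatID         = kAudioFormatLinearPCM;
    clientFormat.mFormatFlags      = kAudioFormatFlagIsSignedInteger
                                   | kAudioFormatFlagIsPacked
                                   | kAudioFormatFlagsNativeEndian;
    clientFormat.mChannelsPerFrame = 2;
    clientFormat.mBitsPerChannel   = 16;
    clientFormat.mBytesPerFrame    = 4;   // 2 channels x 2 bytes
    clientFormat.mBytesPerPacket   = 4;
    clientFormat.mFramesPerPacket  = 1;

    // The client reads its input on the output scope of the input element (1).
    AudioUnitSetProperty(ioUnit, kAudioUnitProperty_StreamFormat,
                         kAudioUnitScope_Output, 1, &clientFormat, sizeof(clientFormat));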
So I've got a little structure called myContext where I'm going to hold onto a copy of my I/O unit. I'm going to allocate my input buffer list to be an audio buffer list structure, I'll fill that in on the fly later in the render proc. And I'm going to hold onto a copy of the client format that I set up and we'll see why when we get there. So having set up that context, I create an AURenderCallbackStruct, and it has two members, one is the function pointer, to MyRecordInput which we'll see in a moment.
And the other is a pointer to myContext. Just a note here about this example code: myContext looks like a local variable here, but it shouldn't be; it should be a member of something, or a global, because it needs to still be accessible when we get called back. Okay, so we've set the input callback and we're ready to actually start performing I/O, and to do that we call AudioOutputUnitStart.
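A sketch of that context and callback hookup; MyContext mirrors the structure just described, and MyRecordInput is the callback shown in a moment.

    #include <stdlib.h>

    // Context handed to the input callback. Keep it a global or an instance
    // variable, not a stack local, so it's still around when the callback runs.
    typedef struct {
        AudioUnit                   ioUnit;
        AudioBufferList            *inputBufferList;  // filled in inside the callback
        AudioStreamBasicDescription clientFormat;
    } MyContext;

    static MyContext myContext;

    static OSStatus MyRecordInput(void *, AudioUnitRenderActionFlags *,
                                  const AudioTimeStamp *, UInt32, UInt32,
                                  AudioBufferList *);

    // ... in the setup code, after the client format has been set ...
    myContext.ioUnit          = ioUnit;
    myContext.clientFormat    = clientFormat;
    myContext.inputBufferList = (AudioBufferList *)malloc(sizeof(AudioBufferList));

    AURenderCallbackStruct cb = { MyRecordInput, &myContext };
    AudioUnitSetProperty(ioUnit, kAudioOutputUnitProperty_SetInputCallback,
                         kAudioUnitScope_Global, 0, &cb, sizeof(cb));

    // Start the hardware; our callback will now be called as input arrives.
    AudioOutputUnitStart(ioUnit);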
And here in a nutshell is MyRecordInput, the function we registered as the input callback; this is what gets called when input becomes available. So you can see the first thing we do is retrieve our context, which came in as the first function argument, the refCon. Given the context, then we can get to the input buffer list that we allocated earlier, and now we can fill in that audio buffer list.
And what we're doing here is preparing it to receive the incoming audio data. And we really ought to be doing this every time through the loop, every time we get called to be notified that input audio is available. This code is hard coded to know that we specified stereo interleaved, so I've just got one buffer with mNumberChannels of two. mData can be NULL, because there's a buffer inside the audio unit that it's already allocated for me for this purpose.
When I come back from AudioUnitRender that will be filled in. A really crucial thing here that even I screw up from time to time when writing code, is mDataByteSize needs to be refreshed every time we call AudioUnitRender. And the value of that should reflect exactly how many bytes of audio I expect to get back from AudioUnitRender. And that computation is the number of frames that is available, this render cycle, inNumberFrames.
And here we see why I saved off my client input format. So instead of hard coding this, I just thought I'd show you, as a good abstract example: we know the number of frames, we know the number of bytes per frame, and multiplying those two is the computation to obtain the data byte size. So now we have an ABL, which is our AudioBufferList, just a short name for the context's inputBufferList.
And I can call AudioUnitRender, and if I don't get an error back from there, that means that the ABL, the context's inputBufferList, contains the samples from the hardware in my client data format. So the unit performed the conversion. And then I can do whatever I want with that, whether it's storing it in a ring buffer or processing it. If I had output enabled, I could process it and play it back backwards or whatever I liked.
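And here's a sketch of that input callback as described, using the MyContext from above; what you do with the rendered samples is left as a comment.

    static OSStatus MyRecordInput(void                       *inRefCon,
                                  AudioUnitRenderActionFlags *ioActionFlags,
                                  const AudioTimeStamp       *inTimeStamp,
                                  UInt32                      inBusNumber,
                                  UInt32                      inNumberFrames,
                                  AudioBufferList            *ioData)
    {
        MyContext *ctx = (MyContext *)inRefCon;
        AudioBufferList *abl = ctx->inputBufferList;

        // One interleaved stereo buffer. mData == NULL lets the audio unit use its
        // own internal buffer, and mDataByteSize must be refreshed on every call.
        abl->mNumberBuffers = 1;
        abl->mBuffers[0].mNumberChannels = 2;
        abl->mBuffers[0].mData = NULL;
        abl->mBuffers[0].mDataByteSize = inNumberFrames * ctx->clientFormat.mBytesPerFrame;

        OSStatus err = AudioUnitRender(ctx->ioUnit, ioActionFlags, inTimeStamp,
                                       1 /* input element */, inNumberFrames, abl);
        if (err == noErr) {
            // abl->mBuffers[0].mData now holds inNumberFrames of 16-bit interleaved
            // samples in the client format: store them in a ring buffer or process
            // them here (no allocation, no locks, no per-sample Objective-C).
        }
        return err;
    }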
So there's my input function. So that's a pretty complete example of using AURemoteIO to obtain input. So in summary, you would use AURemoteIO and AUHAL, use I/O audio units to do low latency responsive audio I/O. Be careful when you're specifying your client data formats, don't assume anything about the virtual ones. The unit knows what they are and will convert from the virtual ones to the client ones you specify. And just be really careful about what you do in your I/O callback functions, please don't allocate memory, take locks or do too much with Objective-C.
Thanks for listening.
[ Applause ]
Thank you, Doug. So my name is Eric Allamanche, I'm also on the Core Audio engineering team. This is the last talk in this session, and I'm going to talk about a new audio unit that we've added in iPhone OS 3.0, which we call the Voice Processing Audio Unit.
So this is a kind of remote I/O audio unit, like the one Doug just presented. First I'm going to explain why we need a voice processing audio unit, and it turns out that the core part of the voice processing audio unit is an Acoustic Echo Canceler.
Then I will present a few features that we exposed through the public APIs, which are available to you if you want to use this in your code. And then I'm going to show a very small example of how you interface your code with the voice processing audio unit. And finally I will wrap up this talk with a small demo.
So why a voice processing unit? In full-duplex audio systems like telephony systems, the loudspeaker and the microphone are typically in use simultaneously. And they are also typically spaced only a few inches away from each other. This means that the microphone captures part of the sound which is emitted by the loudspeaker.
And in a telephony-like application this means that this sound is looped back to the sender who just sent out the signal that went out to the loudspeaker. And due to the various delays in the processing and transmission chains we have nowadays, a ballpark figure of 100 milliseconds for a round trip, this signal is perceived as an echo by the other party.

At one time or another we've all experienced this kind of situation; it becomes extremely annoying, especially if the delay stretches over time. You can't have conversations in such conditions. So this is a general problem in telephony, and it also applies to analog telephony. It has been an extensive topic that has been studied over the last decades, and the solution is simply to apply what we call an Acoustic Echo Canceler, or AEC, to the captured signal.
So now, in the context of the iPhone for example, how would this apply? You see that on the bottom right you have the microphone, and on the bottom left you have the speaker, so these are separated by about an inch and a half. So there is a large amount of acoustic energy which is sent to the speaker and which goes back to the microphone. But even if you are using the receiver instead of the speaker, you still have some echo which is noticeable.
And this is simply due to the fact that the receiver and the microphone are in the same enclosure, and as you can see, the distance is not that big, three inches or so. So in both cases we need to use some echo cancellation techniques to reduce the amount of sound sent back over the transducers.
So one thing I wanted to point out is that the iPhone is primarily a phone, so the baseband unit is already doing this: when we get a phone call and hit the speaker button, it goes into hands-free mode and performs echo cancellation already.
But the thing is that this echo canceler is built in to the telephony part, which isn't accessible to us, and we can't expose it through the APIs. So this is why we decided to add a new audio unit in iPhone OS 3.0 which performs exactly this task, and which of course you can use in your applications.
So now I just want to go through the functionality of an Acoustic Echo Canceler. Imagine the following situation: on the left-hand side there is the person who is called the near-end speaker; for the sake of clarity it's going to be me, with my iPhone. And on the right-hand side, the blue person is called the far-end speaker. And so we have a peer-to-peer connection, or whatever kind of connection, set up.
And so the far-end speaker, who has the blue arrows, starts to speak into his own microphone on the right-hand side. The signal propagates over the transmission chain and comes into my device, where it is played back over the loudspeaker. So this acoustic wave gets reflected; in this figure it's reflected off my forehead.
And so the wave is bounced back to the microphone, and this signal is then sent back to the far-end speaker. So because of the delays I mentioned before, the far-end speaker will hear himself with a delay of 100 milliseconds or more, depending on the transmission chain. It's going to be a noticeable delay, and it's going to be really annoying.
So now if I start speaking on top of this, you see that the blue and the red signals are mixed together at the microphone, and so this mix is going back to the far-end speaker. And so the far-end speaker will hear my voice but will also hear his or her own voice. To compensate for this, we use what we call an Acoustic Echo Canceler, which is mainly built of two blocks: one is an adaptive filter, and the second block is an estimator.

So the estimator is hooked up between the signal which is about to be sent to the loudspeaker and the signal which is captured by the microphone. And from these two signals, the estimator tries to calculate a set of parameters by estimating the amount of blue signal which is captured by the microphone.
And this set of parameters is then passed over to the adaptive filter, which tries to eliminate the blue component of this signal mixture. And at the end, the output is actually just the red signal, which should be the near-end speaker's signal only. But this is of course under ideal conditions. So this is what happens basically when we have a simple setup with two different persons.
Now what happens if on my side I play back some UI sounds, or let's say a text message comes in and we have this tri-tone playing back, or other kinds of sound effects if we are in a game scenario, for example? What would happen depends on where we're going to mix in this UI signal.

So if we mix this signal in earlier in the chain, then we will be able to cancel it; otherwise, it will go directly to the speaker and the other person will hear exactly the sounds which I'm playing back on my side. And this is maybe an undesired effect, because if we have a multi-player game with voice chat capabilities, we'd like the ability to play back sounds on the local device but remove those sounds after some processing, so that the other players don't hear the sounds which are being played back.
So the voice signals are typically at 8 kilohertz, the usual telephony sampling rate, but the UI sounds or the other sounds may have higher sampling rates. So in order to cope with the differences in sampling rates, we need to have some additional sample rate converters in the chain. So just to illustrate this again: we go back, this was the scenario where we had only the two speakers, and now I'm adding an app sound.
And I'm adding it at this level here because I want this sound to be estimated by the system and also removed from the signal which is going to be sent back to the far-end speaker. And so this is what's happening: my app sound, which is characterized by a green arrow, goes into this path, and so the color of the signal which is sent out over the loudspeaker changes, of course. But since this app sound is mixed in before it is rendered through the loudspeaker, we're able to treat it.
So we've seen that there are two operation modes in this audio unit. And these operation modes relate to how we set up the sampling rates. If we have a voice-only application, then by setting the AudioSession parameters, as you learned in the prior session, so that the device sampling rate is the same sampling rate the voice client wants, we have only a single sampling rate in the whole system, from end to end.

So the far-end speaker is at 8 kilohertz, but also my loudspeaker and my microphone are operated at 8 kilohertz. But this means that if I play back a sound, then the sound will have to be downsampled to 8 kilohertz, the target sampling rate of the device, and we lose quality at this level.
On the other hand, if we use the dual sample rate mode, as shown in the previous slides, we would set up the audio device at the highest sampling rate, let's say 44.1 kilohertz for example, and then the system will internally do the down- and up-sampling of the voice signals between 8k and 44.1k, back and forth.

And this allows you to play back UI sounds and other sounds at the highest possible quality. And so this is an ideal mode for multi-player games which have voice chat capabilities. Also notice that you control this indirectly by setting the preferred hardware sampling rate, so this is done automatically behind the scenes; there is no switch where you would change this mode directly.

So UI and other sounds are canceled because they are mixed in at some point in the rendering chain. And we also added an optional automatic gain control after the processing, in order to boost the voice which is played locally. The supported voice client sampling rates are 8, 11, 12, and 16 kilohertz, the usual suspects for voice applications.
And one important note also is that the secondary audio signal is ducked to maintain intelligibility. So the problem is that if you play back a sound over the local speaker and the sound level is extremely high and you begin to speak on top of this, the system may not even notice that you are talking because the sound level is too high.
And if the system is not able to detect your voice, then it won't be able to preserve it, and it will remove everything. So the other person, the far-end speaker, won't hear you at all. In order to compensate for this, we duck the secondary audio signal. But one thing to keep in mind is that the audio unit won't handle any encoding or decoding, so if you want to encode the voice in whatever format you wish, you have to take care of that explicitly.

Currently we define three properties, which can be found in the audio unit properties header. The first one is BypassVoiceProcessing; as the name implies, it's just a Boolean switch which turns the voice processing on and off. You are also in control of enabling or disabling the automatic gain control, but this is on by default. And we also have a control for the ducking of the non-voice audio.
And this is also on by default. So setting up the voice processing unit is very similar to setting up a Remote I/O unit, as Doug explained in the previous talk, so I won't go into too much detail here. But basically what you do is set up an AudioSession, and the important thing is that you have to set the play-and-record category, because this is exactly what you want to do: send sound over the speaker and capture from the microphone at the same time. Set the preferred sampling rate, and this will tell the system whether you are using the single-rate or the dual-rate mode.

And then you create a remote audio unit, but instead of using the Remote I/O subtype, you use the newly introduced Voice Processing I/O subtype. And of course you are in charge of implementing both the input and render callbacks to handle all the signal I/O; a minimal sketch of that setup follows. So that being said, I'd like to go over to the overhead projector and show you a little demo.
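A minimal sketch of that setup, assuming the C AudioSession API and the same component-lookup pattern Doug showed; the callback and format wiring is omitted because it matches the AURemoteIO example.

    #include <AudioToolbox/AudioToolbox.h>
    #include <AudioUnit/AudioUnit.h>

    // Assumes AudioSessionInitialize has already been called.
    // Play-and-record category: render to the speaker and capture from the mic.
    UInt32 category = kAudioSessionCategory_PlayAndRecord;
    AudioSessionSetProperty(kAudioSessionProperty_AudioCategory,
                            sizeof(category), &category);

    // The preferred hardware rate selects single-rate (8 kHz end to end)
    // versus dual-rate (for example 44.1 kHz hardware) operation.
    Float64 preferredRate = 44100.0;
    AudioSessionSetProperty(kAudioSessionProperty_PreferredHardwareSampleRate,
                            sizeof(preferredRate), &preferredRate);
    AudioSessionSetActive(true);

    // Same lookup as Remote I/O, but with the Voice Processing I/O subtype.
    AudioComponentDescription desc = {0};
    desc.componentType         = kAudioUnitType_Output;
    desc.componentSubType      = kAudioUnitSubType_VoiceProcessingIO;
    desc.componentManufacturer = kAudioUnitManufacturer_Apple;

    AudioComponent comp = AudioComponentFindNext(NULL, &desc);
    AudioUnit vpioUnit = NULL;
    AudioComponentInstanceNew(comp, &vpioUnit);
    // From here: enable I/O, set client formats, and install your render and
    // input callbacks exactly as with AURemoteIO.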
So in this UI here, you see that the lower part, the red level meter, the one labeled voice unit output, is actually what's coming out of the voice processing unit. At this moment I'm talking and there is no far-end speaker and no app sound playing at the same time, so you see that the red level meter is moving along with the loudness of my voice.
So now I'm going to play back, it's just a simulation, I'm going to play back a voice file, which is a recording I made with one of my co-workers. So I'm going to play back this file.
Hello.
So this is -
How's it going?
I think they're doing really well, I'm just here trying to prepare for WWDC.
So as you see, the blue level meter is moving along his voice, but the red one is, if I stop talking -
So you see that the red level meter is moving along his voice because this is what we are capturing through the microphone and we are processing.
How's it going?
So I hope you hear the sounds coming out. So now I'm going to play both signals, the sound effects output and the voice output. And now I'm going to record what is actually coming out of the voice processing unit. And so I'm talking, and so we have a mixture of three different signals, one far-end talker and one local app sound, and now I'm going to bypass the voice unit and so we may be able to hear the difference. So now I'm going to stop the recording and I'm going to play back the recorded file.
And so I'm talking, and so we have a mixture of three different signals, one far-end talker and one local app sound, and now I'm going to bypass the voice unit and -
[ Applause ]
So I just want to summarize what we talked about in this session. First, AVAudioPlayer and AVAudioRecorder, the simplest ways of playing back or recording audio files in all the formats we support on the device.
We presented some best practices regarding the usage of OpenAL, and an in-depth review of AURemoteIO for applications requiring very low latency. And finally the Voice Processing Audio Unit, which gives you developers the ability to use our built-in echo canceler in your applications. If you need more information, please contact Allan Shaffer, our Graphics Technology Evangelist, at the email address which is shown right now.