Graphics, Media, and Games • iOS, OS X • 50:37
Learn new ways to customize media playback using AV Foundation in both OS X and iOS. Explore how to synchronize video media with audio synthesis, achieve realtime audio processing from audiovisual media, and integrate AV Foundation's hardware accelerated media decode and display with your own custom media transport. Learn how to perform realtime video effects processing and how to use video media to texture 3D OpenGL environments.
Speaker: Simon Goldrei
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Good morning. It's so great to see so many of you here. So by now, many of you have gained some familiarity with AV Foundation. And perhaps you've even used AV Foundation successfully in your applications. What we're going to discover today are the new features we've built into the framework over the last year that allow you to customize and really control the playback experience for your users.

I'm Simon, and I'm part of the Media Systems team at Apple. We're going to cover three topics today. The first topic is how you can tightly coordinate, tightly synchronize events in your application, like gameplay and generated audio, with video playback, or with playback in general, in AV Foundation.
The second topic is how you can achieve real-time audio effects and real-time audio visualization in your application. And then finally, the third topic, something that I know many of you have been requesting for some time, is the ability to do the same for video: real-time video effects and real-time processing on both iOS and OS X.

Now, there's a common goal to all three of these, and if we don't achieve that goal when building these applications, we lose the magic of a movie. The ingredient that we have to achieve is precise, accurate audiovisual synchronization. We're going to come back to this; I'll highlight how you can achieve it in all three of these topics.
There are some prerequisites for this session. You'll need familiarity with the objects that I introduced in this session last year: the static model objects, AVAsset and AVAssetTrack, and the dynamic, play-time objects, AVPlayerItem and AVPlayerItemTrack. You also need to be familiar with the media controller, AVPlayer.

I showed you this diagram last year. AV Foundation sits below the application toolkits, UIKit on iOS and AppKit on OS X, and it sits above the core frameworks: Core Audio, Core Media, and Core Animation. But today, I'd like to introduce two new public frameworks to you: Media Toolbox and Video Toolbox. Functions from these frameworks start with MT and VT respectively.

We're giving you these frameworks because they provide low-level functionality that will help enhance the customization and the experience for your users. But first, let me take this opportunity to give you a quick playback update: enhancements to playback in AV Foundation since last year. Last year, I gave you this recipe for achieving basic playback in just four easy steps. The first step was to create your AVURLAsset. Then, using the AVAsynchronousKeyValueLoading protocol, you loaded the tracks.
Once your tracks had been successfully loaded, you created your AVPlayerItem. And then, finally, you associated that AVPlayerItem with your AVPlayer. Now, given the feedback from developers like yourselves here at WWDC and since then, we've made it even easier. In fact, we made it easier even in iOS 5: you can now achieve basic playback in just three easy steps.
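Here's a minimal sketch of those three steps; the URL is a placeholder:

```objc
#import <AVFoundation/AVFoundation.h>

// A sketch of the three-step recipe; the movie URL is hypothetical.
NSURL *movieURL = [NSURL fileURLWithPath:@"/path/to/movie.m4v"];

// 1. Create the asset.
AVURLAsset *asset = [AVURLAsset URLAssetWithURL:movieURL options:nil];

// 2. Create the player item directly from the asset.
AVPlayerItem *playerItem = [AVPlayerItem playerItemWithAsset:asset];

// 3. Create the player and start playback.
AVPlayer *player = [AVPlayer playerWithPlayerItem:playerItem];
[player play];
```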
But if your application needs to discover the tracks of your asset, because you need to know about the collection of video tracks or the collection of audio tracks to do some logic in your application, you still need to load the tracks property using the AVAsynchronousKeyValueLoading protocol.
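Continuing the sketch above, loading the tracks key asynchronously before touching it might look like this:

```objc
// A sketch; `asset` is the AVURLAsset from the previous example.
NSString *tracksKey = @"tracks";
[asset loadValuesAsynchronouslyForKeys:@[tracksKey] completionHandler:^{
    NSError *error = nil;
    AVKeyValueStatus status = [asset statusOfValueForKey:tracksKey error:&error];
    if (status == AVKeyValueStatusLoaded) {
        // Now it's safe to inspect the track collections.
        NSArray *videoTracks = [asset tracksWithMediaType:AVMediaTypeVideo];
        NSArray *audioTracks = [asset tracksWithMediaType:AVMediaTypeAudio];
        // ... application logic based on the tracks ...
    } else {
        NSLog(@"The tracks property failed to load: %@", error);
    }
}];
```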
If you don't, on iOS you risk an RPC timeout, and on OS X you risk your application becoming unresponsive, and your users will experience the SPOD, the spinning pinwheel of death. So let's move on. Before we can talk about how to customize the playback experience in your applications, I need to introduce you to some synchronization primitives, some objects that will help you customize that playback experience and tightly control the synchronization.

Last year, the classes that we talked about for playback were AVPlayer, AVPlayerItem, and AVPlayerLayer. AV synchronization wasn't something that you needed to deal with; it wasn't something that concerned you. AV Foundation did the hard work of ensuring the right image was displayed in your AVPlayerLayer and that the audio coming out of the audio device was magically and perfectly in sync. But to do this, your AVPlayerItem was managing separate video and audio pipelines for you.
Now, to get all these objects perfectly in sync, they all need a common sense of time, a common source of time measurement. And as you probably know, like we do in the real world, the source of time measurement is a clock. But in our computer systems today, we have multiple clocks. On iOS, you have a host clock in your SoC, and you've got an audio clock in your audio device.

And on OS X, you have much the same. You have an audio clock for every Core Audio device that's attached to your system, and you have a host time clock associated with the CPU. To get all these objects perfectly synchronized, they all need a common clock. And so we're introducing two new objects to you: the CMClock and the CMTimebase.

So a CMClock is a source of time measurement. There are two types: the host clock and the audio clock. Now, the timebase extends this notion by introducing rate, its rate with respect to time. Timebases are under the control, either directly or indirectly, of your application. And timebases can be slaved to a master; they can be slaved to another timebase or directly to a clock. Let's take a look at a picture that describes this.

Time always marches on. Time constantly progresses so long as your system is powered. Now, currently, the timebase is static. It's not moving with the clock because its rate is set to zero. If we set the rate to one to commence regular playback, the timebase values now progress in lockstep with the clock. For every one value on the timebase, there is an equivalent value on the clock.

If our user now pushes the fast-forward button and sets the rate to two, we see that values on the timebase now progress at twice the rate of the clock. For every clock value, there are two timebase values.
And what happens if our user wants to play in reverse, and we set the rate to a negative value, say minus one?
Then we see a sort of mesmerizing pattern on the screen: the clock is still marching in the same direction as before, at the same rate, but now the timebase is moving in the opposite direction, with its values decrementing.
How do you get these CMTimebases and CMClocks? Well, on iOS, to create a reference to the audio clock, you can call CMAudioClockCreate. On OS X, you can optionally specify a Core Audio device UID to get a clock that references that specific device. On either platform, to get a reference to the host clock, you call CMClockGetHostTimeClock. To get a reference to the timebase, we've added a new property, timebase, on AVPlayerItem. And you can also create a timebase and slave it directly to a clock by calling CMTimebaseCreateWithMasterClock, and you can use this for an AVSampleBufferDisplayLayer.
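Here's a rough sketch of those calls; `playerItem` is assumed to exist already, and the rate and time values are just for illustration:

```objc
#import <CoreMedia/CoreMedia.h>

// The audio clock. (On OS X, CMAudioDeviceClockCreate additionally lets you
// pass a Core Audio device UID.)
CMClockRef audioClock = NULL;
CMAudioClockCreate(kCFAllocatorDefault, &audioClock);

// The host time clock.
CMClockRef hostClock = CMClockGetHostTimeClock();

// The player item's timebase, via the new read-only property.
CMTimebaseRef itemTimebase = playerItem.timebase;   // shown for the property name

// Or create your own timebase slaved directly to a clock, for example to use
// as an AVSampleBufferDisplayLayer's controlTimebase.
CMTimebaseRef timebase = NULL;
CMTimebaseCreateWithMasterClock(kCFAllocatorDefault, hostClock, &timebase);
CMTimebaseSetTime(timebase, kCMTimeZero);   // position the timebase
CMTimebaseSetRate(timebase, 1.0);           // rate 1 = regular playback

// CMClock and CMTimebase are CF types; release what you create.
if (audioClock) CFRelease(audioClock);
if (timebase)   CFRelease(timebase);
```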
Now that we've armed ourselves with the knowledge of how to create clocks and timebases, their relationship to one another, and their relationship with time, let's use this knowledge to tightly synchronize events in your application, such as gameplay, with video playback. In this series of slides, what I want to show you is how to synchronize video playback with our own generated audio.

Your application might use the Audio Toolbox to do this directly. Let's return to this diagram, and let me add back in the AVPlayer and the AVPlayerLayer where our video will be displayed. And let me introduce to you new API on AVPlayer that allows us to synchronize with your own generated audio.

It's custom audio: you generate it yourself using the Audio Toolbox APIs, and by generating it you're driving the CMAudioClock. What we want to do now is tell AV Foundation that it needs to synchronize its playback with that clock so that we all stay in sync.
What's the relationship between these two? How do we tell AV Foundation that's the clock we need to do our time calculations against, and that's what playback needs to be synchronized to? To do that, it's really easy. There's a new property on AVPlayer, masterClock. Once you set that property to a clock that you have created, AV Foundation will ensure that all time calculations occur with respect to that clock, and playback will be in sync with that clock. So that's a great way to stay in sync. But how do we start in sync in the first instance? For that, you can call setRate:time:atHostTime:.

We'll take a look at this in just a quick moment. By calling setRate:time:atHostTime:, what you're instructing AV Foundation to do is tightly coordinate so that, as you're playing out your audio, we'll set the rate to 1 at the exact same host time that you anticipate audio hitting the speaker.

Now, this API takes host time in a CMTime structure, and to convert to that we provide a utility, CMClockMakeHostTimeFromSystemUnits. Once you've used this utility to convert into a CMTime, you can now use setRate:time:atHostTime:. And what you're doing by calling this API is declaring to AV Foundation that there exists a pair of times, one on the timebase's timeline and one on the clock's timeline, that occur at the same moment in time. Once you've made this declaration, all the times in the past and in the future can now be calculated by AV Foundation.

You can set the rate for both a past host time as well as a future host time. If you set the rate for a past host time, AV Foundation will behave as if playback had been going from that past time all along.
It will quickly catch up, load the media, and start playback as if it had been running since that time in the past. But if you want to commence playback in the not too distant future, you want to be able to give AV Foundation the opportunity to load up some media and prepare itself for that pending playback.
And to do that, we provide two APIs. You can ask your AVPlayer to preroll for some playback rate with prerollAtRate:completionHandler:, and it will execute a completion handler when it's done. This is an asynchronous operation, so you can also ask the AVPlayer to cancel that operation, with cancelPendingPrerolls, if your user backs out of the UI or stops playback altogether.
Let's take a look now at a coding example where we tightly synchronize video playback using AV Foundation with audio that you generate yourself. You get a reference to the audio clock that you want AV Foundation to synchronize all of its video playback to. You tell the AVPlayer: that's the clock I need you to do your time calculations with respect to, and that's what we should synchronize our video playback to.

To get AV Foundation to preroll and get ready for this future host time, we call prerollAtRate:, and the playback rate in this case is one, regular playback, and then we execute a completion handler when it's done. When that completion handler fires and it was successful, we calculate a future host time for when we anticipate our audio to hit the speaker, and finally we set the rate to 1. We ask AV Foundation to set the rate to 1 when that future host time becomes current.
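Put together, the flow just described might look something like this sketch; `player` and `audioClock` are assumed to exist, and the anticipated output host time comes from your own Audio Toolbox code:

```objc
// Synchronize AVPlayer video playback with audio you generate yourself.
player.masterClock = audioClock;   // do all time calculations against this clock

[player prerollAtRate:1.0 completionHandler:^(BOOL finished) {
    if (!finished) {
        return;   // preroll was cancelled or failed
    }
    // Host time (in system units) at which you expect your audio to hit the
    // speaker; supplying this value is up to your Audio Toolbox code.
    uint64_t anticipatedOutputHostTime = 0;   // placeholder
    CMTime hostTime = CMClockMakeHostTimeFromSystemUnits(anticipatedOutputHostTime);

    // Start playback at rate 1 from the beginning of the item, exactly when
    // that future host time becomes current.
    [player setRate:1.0 time:kCMTimeZero atHostTime:hostTime];
}];
```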
So now that we can tightly synchronize our own audio with video playback, or our own gameplay with video playback, using AV Foundation, how can we manipulate the audio in an audiovisual item or an audio-only item during playback? How can we affect those samples, add an effect, or simply visualize them and draw a waveform?
[Transcript missing]
And what we want to do here is we sort of want to tap into that audio pipeline. We want to intercept those decoded audio samples as they're on their way out to the audio device.
We want AV Foundation to do the audio decode for us, but we just want to get at those audio samples, manipulate them a little bit, and then send them on their way. And for that, we're introducing a new object, the MTAudioProcessingTap. It's your opportunity for doing customized audio effects using AV Foundation.

To associate the MTAudioProcessingTap with our AVPlayerItem, we're going to use an object that you might be familiar with, and that's the AVAudioMix. If you're unfamiliar with the AVAudioMix, you can pop back to last year's session and re-familiarize yourself. To configure an AVAudioMix, like we did last year for audio ducking and audio volume ramping, we use an AVAudioMixInputParameters object, and we can configure one for each AVAssetTrack that our asset has. Now, to do the association with the MTAudioProcessingTap, we've introduced a new property on that object, audioTapProcessor.
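As a sketch, that wiring might look like this; `asset`, `playerItem`, and the `tap` we'll create in a moment are assumed to be in scope:

```objc
// Attach an MTAudioProcessingTap to a player item through an audio mix.
AVAssetTrack *audioTrack =
    [[asset tracksWithMediaType:AVMediaTypeAudio] objectAtIndex:0];

AVMutableAudioMixInputParameters *inputParameters =
    [AVMutableAudioMixInputParameters audioMixInputParametersWithTrack:audioTrack];
inputParameters.audioTapProcessor = tap;   // the new association property

AVMutableAudioMix *audioMix = [AVMutableAudioMix audioMix];
audioMix.inputParameters = @[inputParameters];

playerItem.audioMix = audioMix;   // apply the mix to the AVPlayerItem
```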
But I haven't even yet told you how to create an MTAudioProcessingTap. Let's take a look at that now. To create an MTAudioProcessingTap, which is a CF type, so even under ARC you need to manage its lifetime with retains and releases, you call MTAudioProcessingTapCreate. And to configure this object, you fill out a structure called MTAudioProcessingTapCallbacks. This structure defines a set of function pointers, callbacks into your application, which will get executed throughout the lifetime of the tap.
I want to focus on the last three of these function pointers: the prepare, the unprepare, and, finally, the process callback. The prepare callback is paired with the unprepare callback. The prepare callback is your opportunity to allocate buffers and to create objects that you'll use during audio processing. You do all your expensive operations here.

The unprepare callback, as you've probably guessed by now, is your opportunity to do the converse: to deallocate those buffers and throw away those objects. But in the process callback, you need to stay completely focused on doing your audio processing. You can't, or you shouldn't, take locks, do allocations, or do deallocations, because if you do, you risk your user encountering audio stutters, which are very noticeable; they'll hear these audible glitches.
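Putting that together, a minimal creation sketch might look like this; the function names are ours, not part of the API:

```objc
#import <MediaToolbox/MediaToolbox.h>

static void tapInit(MTAudioProcessingTapRef tap, void *clientInfo, void **tapStorageOut)
{
    *tapStorageOut = clientInfo;   // stash a context pointer for later callbacks
}

static void tapFinalize(MTAudioProcessingTapRef tap) {}

static void tapPrepare(MTAudioProcessingTapRef tap, CMItemCount maxFrames,
                       const AudioStreamBasicDescription *processingFormat)
{
    // Expensive work goes here: allocate buffers, create objects, note whether
    // the processing format is interleaved, and so on.
}

static void tapUnprepare(MTAudioProcessingTapRef tap)
{
    // Undo whatever tapPrepare set up.
}

static void tapProcess(MTAudioProcessingTapRef tap, CMItemCount numberFrames,
                       MTAudioProcessingTapFlags flags,
                       AudioBufferList *bufferListInOut,
                       CMItemCount *numberFramesOut,
                       MTAudioProcessingTapFlags *flagsOut)
{
    // Just pull the source audio through; a fuller version appears a bit later.
    MTAudioProcessingTapGetSourceAudio(tap, numberFrames, bufferListInOut,
                                       flagsOut, NULL, numberFramesOut);
}

static MTAudioProcessingTapRef CreateAudioProcessingTap(void)
{
    MTAudioProcessingTapCallbacks callbacks = {
        .version    = kMTAudioProcessingTapCallbacksVersion_0,
        .clientInfo = NULL,
        .init       = tapInit,
        .finalize   = tapFinalize,
        .prepare    = tapPrepare,
        .unprepare  = tapUnprepare,
        .process    = tapProcess,
    };

    MTAudioProcessingTapRef tap = NULL;
    OSStatus status = MTAudioProcessingTapCreate(kCFAllocatorDefault, &callbacks,
                                                 kMTAudioProcessingTapCreationFlag_PostEffects,
                                                 &tap);
    // The tap is a CF type: even under ARC, balance this create with CFRelease.
    return (status == noErr) ? tap : NULL;
}
```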
The process callback provides another structure, the AudioBufferList. And this structure, at the time you get the process callback, is only partially populated. At this point in time, it only describes the number of buffers and the number of samples in each buffer that have been made available to you for processing by the tap. To actually get those samples, you need to pull them out of the tap, and for that you call MTAudioProcessingTapGetSourceAudio. This is your opportunity to pull those samples that have been made available to you out of the tap.
It's important to note that the buffers that have been made available to you are only good for the lifetime of this callback. If you need them for longer, you should copy them. Now, if you want all the samples that have been made available to you, you can pass in the audio buffer list structure in its form that it was given to you at the time of the process callback. And this will get all the audio samples, all the audio buffers that have been made available.
So during the process callback, remain focused; keep your code honed in on just doing the audio processing. You're executing on a very high priority thread, and those of you who have worked with Core Audio before are well aware of the constraints of doing this sort of audio processing in real time. If you do need to do audio processing on some ancillary thread or ancillary work queue, you might want to consider the use of a CMSimpleQueue. This data structure is both lockless and thread safe.
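For instance, here is a sketch of handing peak levels off to a worker thread through a CMSimpleQueue; the 32-slot capacity and the float payload are just illustrations, and the preallocated slots keep the audio thread free of allocations:

```objc
#include <CoreMedia/CMSimpleQueue.h>

static CMSimpleQueueRef sLevelQueue = NULL;
static float            sLevelSlots[32];
static int              sNextSlot = 0;

static void SetUpLevelQueue(void)        // e.g. in the tap's prepare callback
{
    CMSimpleQueueCreate(kCFAllocatorDefault, 32, &sLevelQueue);
}

static void ProduceLevel(float level)    // e.g. from the tap's process callback
{
    float *slot = &sLevelSlots[sNextSlot];
    sNextSlot = (sNextSlot + 1) % 32;
    *slot = level;                             // slots are reused if the consumer lags
    CMSimpleQueueEnqueue(sLevelQueue, slot);   // no locks, no allocations
}

static void ConsumeLevels(void)          // e.g. on your drawing thread or queue
{
    const float *level;
    while ((level = CMSimpleQueueDequeue(sLevelQueue)) != NULL) {
        // ... update a waveform or level meter with *level ...
    }
}
```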
Let's take a look now at a coding example where we use the process callback to just iterate over the buffers, and the samples in each buffer, and do some trivial manipulation. The first thing we're going to do is call the MTAudioProcessingTapGetSourceAudio routine to pull out of the tap all the samples that have been made available. Now that we've learned how many buffers and how many samples in each buffer are available, and we've even pulled out the data itself, we can iterate over them.

We can do something interesting with each, say, ramp up the volume a little bit: we can multiply each sample by a scalar. And if we need to discover whether our samples are interleaved, we can do that in the init callback, sorry, the prepare callback. In this particular example, I've discovered this already and stored it in an ivar.
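A sketch of that process callback, fleshing out the stub from the earlier sketch and assuming deinterleaved 32-bit float samples (a detail you'd confirm in the prepare callback):

```objc
static void tapProcess(MTAudioProcessingTapRef tap, CMItemCount numberFrames,
                       MTAudioProcessingTapFlags flags,
                       AudioBufferList *bufferListInOut,
                       CMItemCount *numberFramesOut,
                       MTAudioProcessingTapFlags *flagsOut)
{
    // Pull the available samples out of the tap into the buffer list.
    OSStatus status = MTAudioProcessingTapGetSourceAudio(tap, numberFrames,
                                                         bufferListInOut, flagsOut,
                                                         NULL, numberFramesOut);
    if (status != noErr) {
        return;
    }

    // Trivial manipulation: multiply each sample by a scalar to ramp the volume.
    const float gain = 1.2f;
    for (UInt32 b = 0; b < bufferListInOut->mNumberBuffers; b++) {
        AudioBuffer *buffer = &bufferListInOut->mBuffers[b];
        float *samples = (float *)buffer->mData;
        UInt32 sampleCount = buffer->mDataByteSize / sizeof(float);
        for (UInt32 i = 0; i < sampleCount; i++) {
            samples[i] *= gain;
        }
    }
}
```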
Now I'd like to give you a demo of how you can easily combine the MTAudioProcessingTap with a Core Audio audio unit. The audio unit that I'm going to use in this demo is a band-pass filter. And what a band-pass filter does is it allows us to define a subrange of audio frequencies that are allowed to play out through the speaker; high frequencies and low frequencies are cut out.

There are two properties we can affect on the band-pass filter. We can affect the center of that filter, whether it's centered around the high frequencies or down in the mellow low frequencies. And we can also affect the range, or the bandwidth, of that subrange of audio frequencies. Let me show this to you right now.
I first want to play the item for you at its full fidelity so that you can appreciate the audio track before I apply the effect. As I play, you can watch the volume meters bounce in sync with the audio. I also want to highlight the audio synchronization with the visuals: you can see the artist here strumming along on the strings.
Now I'm going to play you the same item again, but I'm going to apply the bandpass filter. I'm going to leave the bandwidth roughly as is. And then as it's playing, I'm going to move the center of the range up and down the full frequency range so that you can hear how the audio unit affects the audio as it's playing out. As I move the center of the frequency up to the highs, you'll notice that the audio becomes nice and bright. As I move it down into the lows, it will become more mellow.
So how easy is it for you to use a Core Audio audio unit with the MTAudioProcessingTap? It's really quite simple; the two technologies were made to work together. To create an audio unit, you fill out an AudioComponentDescription. Once you've discovered and created this unit, you configure the render callback.

In the render callback, it's really quite simple. Every time the audio unit calls you back and says, I need more audio to render the effect onto, all you need to do is call MTAudioProcessingTapGetSourceAudio to pull out those samples and then give them back to the audio unit for processing.
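A sketch of that wiring, loosely following the pattern just described; the function names, the 5 kHz center, and the 600-cent bandwidth are our own choices, and stream-format setup and error handling are omitted:

```objc
#import <AudioToolbox/AudioToolbox.h>
#import <MediaToolbox/MediaToolbox.h>

// Render callback: when the audio unit asks for input to apply its effect to,
// pull the source audio out of the tap and hand it over.
static OSStatus RenderFromTap(void *inRefCon,
                              AudioUnitRenderActionFlags *ioActionFlags,
                              const AudioTimeStamp *inTimeStamp,
                              UInt32 inBusNumber,
                              UInt32 inNumberFrames,
                              AudioBufferList *ioData)
{
    MTAudioProcessingTapRef tap = (MTAudioProcessingTapRef)inRefCon;
    return MTAudioProcessingTapGetSourceAudio(tap, inNumberFrames, ioData,
                                              NULL, NULL, NULL);
}

// Typically called from the tap's prepare callback.
static AudioUnit CreateBandPassUnitForTap(MTAudioProcessingTapRef tap)
{
    AudioComponentDescription description = {
        .componentType         = kAudioUnitType_Effect,
        .componentSubType      = kAudioUnitSubType_BandPassFilter,
        .componentManufacturer = kAudioUnitManufacturer_Apple,
    };
    AudioComponent component = AudioComponentFindNext(NULL, &description);

    AudioUnit bandPassUnit = NULL;
    AudioComponentInstanceNew(component, &bandPassUnit);

    AURenderCallbackStruct renderCallback = {
        .inputProc       = RenderFromTap,
        .inputProcRefCon = tap,
    };
    AudioUnitSetProperty(bandPassUnit, kAudioUnitProperty_SetRenderCallback,
                         kAudioUnitScope_Input, 0,
                         &renderCallback, sizeof(renderCallback));
    AudioUnitInitialize(bandPassUnit);

    // The two parameters from the demo: center frequency (Hz) and bandwidth (cents).
    AudioUnitSetParameter(bandPassUnit, kBandpassParam_CenterFrequency,
                          kAudioUnitScope_Global, 0, 5000.0f, 0);
    AudioUnitSetParameter(bandPassUnit, kBandpassParam_Bandwidth,
                          kAudioUnitScope_Global, 0, 600.0f, 0);
    return bandPassUnit;
}

// In the tap's process callback, let the audio unit drive the pull:
//   AudioUnitRender(bandPassUnit, &actionFlags, &timeStamp, 0,
//                   (UInt32)numberFrames, bufferListInOut);
```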
Now that we've discovered how to do real-time audio effects in your application, either by manipulating the samples directly or by applying an audio unit developed by Apple, let's do the same for video. Let's discover two ways to achieve real-time video effects, to do image analysis, or to use the video images that have been decoded in real time with some other imaging technology. But before we dive in, let's discuss the challenges of all types of video processing, irrespective of which platform we're doing this on.

The first goal is to obtain images during playback. The second goal, if we're trying to do some non-trivial type of processing or image analysis, is that we'd want those images a little bit ahead of the playhead so that we have some time to manipulate them. While we're doing this, we'd like to still maintain AV sync, because that's our number one goal, and we'd like to have each of those images time-stamped with a display deadline.

Returning to our diagram: you've probably already used the AVPlayerLayer in your application. You've used it with Core Animation, and you've produced some very creative forms of video presentation in your application. For those of you who are familiar with the AVPlayerLayer, you've never been able to associate multiple AVPlayerLayers with an AVPlayer. Well, new in iOS 6 and Mountain Lion, we're now giving you support to associate multiple AVPlayerLayers with a single AVPlayer.

And we haven't even gotten to the cool part yet. By doing this, you can achieve some really creative effects. It's very simple, and you don't even have to worry about the audiovisual sync; AV Foundation will still do this for you. It's extremely easy to use, and I would go so far as to say that if you're considering doing real-time video effects in your application, you should first consider the use of an AVPlayerLayer. What this means for you is that you have less code to maintain, increasing your maintainability and freeing you up to do other things. You can do some really compelling, easy, functional things with just multiple player layers associated with a player.
In fact, I want to show this to you right now. In this demo, I'm going to show you two things. The first thing I want to show you is how very simple it is to create a functional feature in your application with just a screen full of code.
The second thing I want to highlight in this demo is how multiple player layers are perfectly in sync. What I'm going to do in this demo is show you an AV loupe. A loupe is a magnifying glass that you might be familiar with from applications such as iPhoto and Aperture. It's a magnifying glass, so we can appreciate the detail in a video image. Let me switch to the iPad and show this to you right now.

So here's our loupe. It's multiple player layers; in this case, it's two. And we can see this chap out on the San Francisco Bay. He has a lovely yacht, which you too can own if you are successful in using AV Foundation in your application. And even though it's full HD at a very high bit rate, because we're not reading back from the GPU or any of that stuff, it's super fluid.
It's like butter, or at least butter that you've left out overnight. And I can zoom in on the skipper here and magnify him. I can also zoom in on his vessel ID. And because we have this sort of feature, if you're an avid videographer like myself, you can appreciate the depth of field in this image: a shallow depth of field, how sharp the foreground is, and how beautifully bokeh'd out the background is. We've got Fort Baker in the background and Tiburon over here.
But all of this is just one screen full of code. And the synchronization between the multiple player layers is automatic. There's nothing that we needed to do in this application to achieve this effect. Let me switch back to the slides and describe to you how you can do this.
It's really quite simple. Like I said, there are two player layers. The first player layer is sized to fit the bounds of the display. It's our main view into the video. The second player layer is sized to fit four times the bounds of the first, so that we get that magnified zoom effect.
To make it look like a magnifying glass, we're going to create a CAShapeLayer and associate with it a Core Graphics path describing the circle. Applying that mask to the magnified layer, we get a visible region, and this visible region obscures a smaller, magnified region on the main player layer.

To make this interactive, we're going to have a UIImageView and a UIPanGestureRecognizer that's associated only with that view, so that touches and other user interface events only go to that view. Then we apply some fancy artwork, composite the two, and now we have an interactive application that's producing the loupe. It's that easy. Our user can move around, it's very fluid, and it looks as if they're magnifying the main player layer.
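A sketch of that construction; the container view, player, and loupe size are assumptions, and the gesture handling is only described in comments:

```objc
// Assumed to live in a view controller with AVFoundation and UIKit imported.
- (void)setUpLoupeInView:(UIView *)containerView withPlayer:(AVPlayer *)player
{
    CGRect bounds = containerView.bounds;

    // Main player layer, sized to the view.
    AVPlayerLayer *mainLayer = [AVPlayerLayer playerLayerWithPlayer:player];
    mainLayer.frame = bounds;
    [containerView.layer addSublayer:mainLayer];

    // Magnified player layer, sized to four times the main layer's bounds.
    AVPlayerLayer *zoomLayer = [AVPlayerLayer playerLayerWithPlayer:player];
    zoomLayer.frame = CGRectMake(0, 0,
                                 4 * bounds.size.width, 4 * bounds.size.height);
    [containerView.layer addSublayer:zoomLayer];

    // Circular Core Graphics path used as a mask, so the magnified layer is
    // visible only through the loupe-shaped region.
    CAShapeLayer *maskLayer = [CAShapeLayer layer];
    CGPathRef circle = CGPathCreateWithEllipseInRect(CGRectMake(0, 0, 200, 200), NULL);
    maskLayer.path = circle;
    CGPathRelease(circle);
    zoomLayer.mask = maskLayer;

    // A UIPanGestureRecognizer on an overlay view would reposition the mask
    // (and offset zoomLayer) so the loupe tracks the finger. Both layers share
    // the one AVPlayer, so their playback stays in sync automatically.
}
```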
But as powerful as associating multiple player layers with a single player is, and as efficient as it is, because there's only one media decode feeding all these player layers, you might need something more powerful. You might need the video images themselves so that you can do your own custom manipulation and your own custom image analysis.

And for that, we're introducing a whole new object: the AVPlayerItemVideoOutput. You can associate multiple of these with a single player item, and it's your key for achieving custom video processing. So how do you use one of these things? You can use an AVPlayerItemVideoOutput with other imaging technologies such as OpenGL or Core Graphics, and you can use it to do your own video effects and your own video analysis. It's available on both Mountain Lion and iOS 6.
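As a sketch, with `playerItem` assumed and BGRA chosen only for illustration (for OpenGL ES you might prefer a biplanar YUV format):

```objc
// Create a video output that vends pixel buffers, and attach it to the item.
NSDictionary *pixelBufferAttributes = @{
    (id)kCVPixelBufferPixelFormatTypeKey : @(kCVPixelFormatType_32BGRA)
};
AVPlayerItemVideoOutput *videoOutput = [[AVPlayerItemVideoOutput alloc]
    initWithPixelBufferAttributes:pixelBufferAttributes];
[playerItem addOutput:videoOutput];
```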
To use it, there are two questions that you can ask of it. Do you have a new pixel buffer that I haven't seen before, a new image? And if you do, can I copy that pixel buffer for a particular time in the item's timeline, and then optionally receive a CMTime describing its exact display deadline? To drive your use of the AVPlayerItemVideoOutput, you need to use a fixed-rate, hardware-synchronized service.

On OS X, the candidates are a CVDisplayLink and a CAOpenGLLayer. On iOS, you've got the CADisplayLink or the GLKit view or view controller. And each of these services guarantees you a callback into your application in lockstep with the vertical sync of your hardware. Here's a picture that describes this. As the device is displaying the most appropriate image for host time T0, you get a callback that prepares your application, prepares your object, for the next vertical sync at T1.
The next thing you're going to do is try to convert that host time into the item timeline. You can call itemTimeForHostTime:. And what you're doing here is asking AV Foundation to do that time calculation that we saw earlier: for T1, what's the appropriate image time? What's the value in the green circle? Once AV Foundation has done the arithmetic for you, you now know the item time that you need to ask for.

And you can ask the AVPlayerItemVideoOutput: do you have a new pixel buffer that I haven't seen yet? Each time you copy an image from your AVPlayerItemVideoOutput, it will maintain a record of consumption. And that's true for every instance; each has its own separate record of consumption.

If there is a new pixel buffer available for that item time, you now have the opportunity to copy that pixel buffer and receive a CMTime describing its exact display deadline. You render that image out using the imaging technology of your choice, it gets blitted to the screen at the next vertical sync, and then the service prepares you for the next iteration at T2.
On iOS, you might use a CADisplayLink. Let's take a look now at a coding example of how you set up the CADisplayLink and use it with an AVPlayerItemVideoOutput. The first thing you do is set up the display link with a target of yourself and a selector where you'd like the callback to occur.
[Transcript missing]
And if we want to be efficient, we can set the initial state to paused so it's not calling us back immediately, and this will save power. We'll unpause it when our user commences playback. And then we add it to a run loop.

In the selector, the first thing that we need to do is calculate the time value for the next vertical sync. Using the CADisplayLink, you ask it for its timestamp and you ask it for its duration: the time interval between each callback. Then we ask our AVPlayerItemVideoOutput the standard questions. Do the calculation from host time into item time.

Once we've done the calculation: is there a new pixel buffer that I haven't seen yet for that item time? If there is, well, let me have it. I want to pull the next pixel buffer by calling copyPixelBufferForItemTime:itemTimeForDisplay: and optionally receive the display deadline. If you're on iOS and you want to use OpenGL, now is your opportunity to create separate textures, one for luma and one for chroma, because that's the most efficient path, using a CVOpenGLESTextureCache. And what's also important to note here is that CVPixelBuffers are CF types that need to be managed with CFRetain and CFRelease, even under ARC.
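Put into code, the CADisplayLink-driven pull might look like this sketch; it assumes `displayLink` and `videoOutput` properties on the hosting class, and the OpenGL texture upload is only hinted at:

```objc
- (void)setUpDisplayLink
{
    self.displayLink = [CADisplayLink displayLinkWithTarget:self
                                                   selector:@selector(displayLinkCallback:)];
    self.displayLink.paused = YES;   // save power until playback actually starts
    [self.displayLink addToRunLoop:[NSRunLoop mainRunLoop]
                           forMode:NSRunLoopCommonModes];
}

- (void)displayLinkCallback:(CADisplayLink *)sender
{
    // Host time of the next vertical sync we're preparing for.
    CFTimeInterval nextVSync = sender.timestamp + sender.duration;

    // Convert that host time into the item's timeline.
    CMTime itemTime = [self.videoOutput itemTimeForHostTime:nextVSync];

    if ([self.videoOutput hasNewPixelBufferForItemTime:itemTime]) {
        CMTime displayTime;
        CVPixelBufferRef pixelBuffer =
            [self.videoOutput copyPixelBufferForItemTime:itemTime
                                      itemTimeForDisplay:&displayTime];
        if (pixelBuffer) {
            // Render with the imaging technology of your choice, for example by
            // uploading luma and chroma textures via a CVOpenGLESTextureCache.
            CVBufferRelease(pixelBuffer);   // CF type: release even under ARC
        }
    }
}
```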
Of course, like I promised, you can pull ahead of the playhead. You can ask the AVPlayerItemVideoOutput for images ahead of the current time: you can add an interval that's appropriate for your custom processing and have the AVPlayerItemVideoOutput calculate the item time that's appropriate. If you fall behind in your processing, you don't need to worry; the AVPlayerItemVideoOutput will automatically discard all images behind the playhead.

And as I alluded to earlier, you can pause many of these hardware-synchronized services. And you want to do this whenever your application isn't doing any playback, when it's paused, or maybe when that view has gone away because your user has closed down the playback view. You want to try to shut down those services so that they're not constantly calling you back.

And to help you start up again, you can ask the AVPlayerItemVideoOutput to give you a callback when it anticipates more images will come down from the decoder. You can ask it to request notification of media data change with an advance interval, using requestNotificationOfMediaDataChangeWithAdvanceInterval:. The advance interval that you provide here is the amount of time that you require to restart your processing operations, to recreate objects, to recreate buffers. If you make this request of an AVPlayerItemVideoOutput, your delegate will be messaged outputMediaDataWillChange:.
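A sketch of that pause-and-restart dance, assuming the hosting class adopts AVPlayerItemOutputPullDelegate; the one-second advance interval is just an example:

```objc
- (void)suspendConsumption
{
    self.displayLink.paused = YES;
    [self.videoOutput setDelegate:self queue:dispatch_get_main_queue()];
    // Ask for a wake-up call with enough lead time to rebuild our buffers.
    [self.videoOutput requestNotificationOfMediaDataChangeWithAdvanceInterval:1.0];
}

// AVPlayerItemOutputPullDelegate
- (void)outputMediaDataWillChange:(AVPlayerItemOutput *)sender
{
    self.displayLink.paused = NO;   // new images are on the way; resume pulling
}
```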
So on this slide, we have ten of our most talented Media Systems engineers. Not only can these guys code, but they can also dance. And as funny as they are, I want to show you a demo of how to use an AVPlayerItemVideoOutput. This demo seeks to highlight four things. First, I want to show you how to use the AVPlayerItemVideoOutput, and I want to show it to you within the context of an interactive OpenGL ES 2.0 game that uses GLKit to drive consumption from the AVPlayerItemVideoOutput.

For each image, we're going to apply a real-time chroma key effect. This chroma key effect is implemented in GLSL in a fragment program and executed on the GPU. And I want to show you a technique that's particularly useful on iOS called texture atlasing, where we author together multiple smaller movies into a much larger movie so that we maximize each texture that has been decoded and uploaded onto the GPU. Let me switch to the demo iPad and show you this game right now.
And here we have our game. You might begin to recognize it. Here's our interactive OpenGL ES game. It's interactive because I can swipe around and reveal other members of the team. The movie that's playing is the same movie you saw earlier on the slides. On the GPU, we're slicing up each of the individual movies and pushing them out to the pieces on this puzzle. I can move it around and manipulate it just by swiping around on the display. And like I promised, I can apply a real-time video effect at any time just by tapping on the display.
So maybe by now you're recognizing what kind of a game this is. And as much as I'd love to try and solve this for you right here, right now on stage, I'm going to do what any good programmer will do and get the computer to do it for him. I'm going to push and hold and have the application do it for me.
Everything is happening in real time, and I can turn the effects on and off at will. If you want to try texture atlasing in your application, there are three things that you might want to consider. The first is that you can create the texture atlas very easily with an AVMutableVideoComposition.

The second is, if you're using H.264, you want to try to align your tiles of video along 16-pixel boundaries. And then, for maximum effect, you want to try to have your videos loop with common start and end frames. A sketch of how you might author such an atlas follows below. I see it's solved, but I need to switch back to slides because I have a little bit more I want to summarize with you.
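Under those constraints, authoring such an atlas might look something like this sketch; the source assets, tile size, and frame rate are placeholders, and error handling is omitted:

```objc
static AVPlayerItem *MakeAtlasItem(AVAsset *assetA, AVAsset *assetB)
{
    AVMutableComposition *composition = [AVMutableComposition composition];
    NSArray *sources = @[ assetA, assetB ];
    CGSize tileSize = CGSizeMake(640, 352);   // keep tiles on 16-pixel boundaries
    CMTime duration = [assetA duration];      // loops share start and end frames
    NSMutableArray *layerInstructions = [NSMutableArray array];

    for (NSUInteger i = 0; i < sources.count; i++) {
        AVAsset *source = [sources objectAtIndex:i];
        AVAssetTrack *sourceTrack =
            [[source tracksWithMediaType:AVMediaTypeVideo] objectAtIndex:0];

        AVMutableCompositionTrack *track =
            [composition addMutableTrackWithMediaType:AVMediaTypeVideo
                                     preferredTrackID:kCMPersistentTrackID_Invalid];
        [track insertTimeRange:CMTimeRangeMake(kCMTimeZero, duration)
                       ofTrack:sourceTrack
                        atTime:kCMTimeZero
                         error:NULL];

        // Slide each tile sideways into its slot in the atlas.
        AVMutableVideoCompositionLayerInstruction *layerInstruction =
            [AVMutableVideoCompositionLayerInstruction
                videoCompositionLayerInstructionWithAssetTrack:track];
        [layerInstruction setTransform:CGAffineTransformMakeTranslation(i * tileSize.width, 0)
                                atTime:kCMTimeZero];
        [layerInstructions addObject:layerInstruction];
    }

    AVMutableVideoCompositionInstruction *instruction =
        [AVMutableVideoCompositionInstruction videoCompositionInstruction];
    instruction.timeRange = CMTimeRangeMake(kCMTimeZero, duration);
    instruction.layerInstructions = layerInstructions;

    AVMutableVideoComposition *videoComposition = [AVMutableVideoComposition videoComposition];
    videoComposition.instructions = @[ instruction ];
    videoComposition.renderSize = CGSizeMake(tileSize.width * sources.count,
                                             tileSize.height);
    videoComposition.frameDuration = CMTimeMake(1, 30);

    AVPlayerItem *atlasItem = [AVPlayerItem playerItemWithAsset:composition];
    atlasItem.videoComposition = videoComposition;
    return atlasItem;
}
```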
That's pretty neat, isn't it? There are so many very creative things that you can do with this. I think it's quite powerful, and I think the applications that you can create with an AVPlayerItemVideoOutput are almost limitless. Let's summarize the key points of today. The first thing that we saw was that AV sync is the key goal. It's what you need to maintain in your applications so that your users continue to experience the magic of a movie. If you lose it, the effect is lost.

To ensure tight synchronization with playback within your application, you tell AV Foundation specifically which clock you want to synchronize to, which clock AV Foundation should do its time calculations with respect to. With the MTAudioProcessingTap, we stressed how important it is to remain efficient during processing, to completely focus that one callback on just doing audio processing and not on any ancillary operations.

And then, using the AVPlayerItemVideoOutput, you drive consumption by using one of a collection of hardware-synchronized, fixed-rate services. For more information about AV Foundation, please contact our media technologies evangelist, Eryk Vershen. I hope you are having, and will continue to have, a wonderful WWDC. And I'm really looking forward to the applications that you can develop with these new APIs. Thank you very much.