iOS 5 Tech Talk World Tour #4

Understanding AV Foundation

2011 • 27:13

AV Foundation harnesses a streamlined media architecture that enables high-performance audio and video playback, editing, recording, and more. Gain a practical introduction to the AV Foundation framework and understand its core concepts. Discover the tremendous control and flexibility provided by AV Foundation and get instruction about working with media in your apps.

Speaker: Eryk Vershen

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

Hi, I'm Eryk Vershen, Media Technologies Evangelist. The AV Foundation framework provides facilities for capture, playback, reading, writing, and editing videos. Introduced in iOS 4, it was brought to OS X in 10.7 Lion. AV Foundation is a modern framework designed on the basis of our many years' experience with video at Apple. In this video, you'll learn how to use AV Foundation to play back video, to edit videos together, and to handle title animations, essential knowledge for incorporating video in your iOS or OS X application.

AV Foundation is our powerful low-level framework for working with timed audiovisual media. It's available on iOS as well as OS X. The first thing you want to do with a movie or song is play it. AV Foundation has great support for playback. You can control which tracks are played, stop, start, seek, and get fine-grained notifications during playback.

As we'll see, AV Foundation is below the UI layer. While this may require more work, it allows you to completely customize the controls you provide to your users. AV Foundation allows you to capture from the camera and microphone, providing you with control over features like autofocus and the camera flash. Support for editing is there, from simple cuts-only editing to compositing operations like fades and wipes, and the mixing of multiple audio tracks, like music or commentary, with the original audio.

You can both read and write at the frame or sample level, for example, to apply a sepia filter to a video. AV Foundation has a lot of functionality, and these are only some of the classes. AV Foundation is a great framework. When should you use it? When you want to inspect properties like duration, media type, or metadata; when you want to implement custom playback behavior; when you need to edit multiple pieces of media together; when you want to transcode a video into some other format; and when you want detailed control over capture.

AV Foundation is, as its name implies, a foundation-level framework. It is underneath the UI. There are no classes in AV Foundation that are part of the UI. Instead, there are model classes that provide data to the UI. AV Foundation sits on a number of core foundation frameworks for audio, video, media, and animation. I'll mention some things about these as I go along.

Now, AV Foundation can be broken up in a number of ways. For example, into playback, capture, editing, and read and write. And I really wanted to talk about the whole framework today, but you would be listening to me for several hours instead of just an hour. So I'm going to have to leave out capture, but there's an excellent presentation called "Capturing from the Camera Using AV Foundation on iOS 5" from the 2011 WWDC. And there's also another talk from that same conference, "Introducing AV Foundation Capture for Lion." So that topic's covered pretty well.

It turns out that reading and writing is also well covered. At the same conference there was another talk, "Working with Media in AV Foundation," that went into some detail on the reading and writing classes. So instead, I'm going to talk about how Core Animation interacts with AV Foundation. These two areas, editing and Core Animation, were areas that I found confusing at first. I hope I can dispel any confusion you might have about them.

Let's start with playback. In playback, we're taking data from a movie file, playing through the file, and displaying the result on our device's screen. There are three principal classes involved: AVAsset, which represents the movie file; AVPlayer, which controls the playback itself; and AVPlayerLayer, which is the main class involved in displaying the images. We'll start by discussing AVAsset.

AVAsset is an abstract base class that represents any timed audiovisual media, be it a video, a movie, a song, a podcast, or whatever. Assets might be local, that is, on the iOS device, or they may be remote, fetched over the network. While assets are typically finite, they may actually be streams of data, such as from a webcam, that have no real beginning or end.

Where do you get an asset from? Well, they can be built into your application in its bundle. They can come from either the iPod media library or the user's photo library. In these cases, you don't have arbitrary access to the files. You ask the library for a special URL that you can pass to AV Foundation.

They can also come from the network as an HTTP URL. An asset has a structure. It's made of tracks, and the tracks consist of segments. A track is a single kind of media, a video track or an audio track. And you may have more than one track of a given kind, say an English-language audio track, a German audio track, or one in some other language. And tracks have segments. While many tracks will be a single segment, we will see later on that tracks can have many segments.

Now, merely because we've created an asset object does not mean that the tracks and other properties are immediately accessible. Getting values may take a substantial time due to the file format, the sheer size of the file, or its being located across the network. To deal with the need to wait, AV Foundation implements a protocol on media objects which allows you to ask for values to be loaded and be notified when the loading has occurred. The second major class is AVPlayer. AVPlayer is a controller object. Besides being the object that you tell to play and pause, it also has methods for monitoring time passing.

Now, we can't directly connect an AVPlayer with an AVAsset. Doing so would cause either the asset to know too much about the playback state or the player to know too much about the asset. So instead, we use an AVPlayerItem to hold that state. Now, just as assets have tracks, so too player items have player item tracks that control whether the tracks are enabled.

The last major class is AVPlayerLayer. Here's our diagram again. Now, in order to display anything, it needs to be part of a UIView or a subclass of UIView. But UIView doesn't know anything about AVPlayer. It's not the kind of class it expects. UI views expect to be backed by Core Animation layer subclasses. So AV Foundation supplies a CALayer subclass, AVPlayerLayer.

So now our picture of playback classes is complete. But what do we actually need to do to cause playback to happen? There are six steps: create the asset, ask it to load its tracks, create a player item, attach that to a player, attach the player to a layer, and then, when the item is ready to play, play it.

So let's look at each of these steps in turn. How do I create an asset? The concrete subclass is AVURLAsset. In this example, I'm calling the class method URLAssetWithURL to create an asset object from some URL, which is either local or remote. Prior to iOS 4.3, you needed to instantiate remote assets differently, and you can refer to the documentation if you need to know what to do in that case.
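
A minimal sketch of that first step (the bundled filename here is purely hypothetical):

```objc
#import <AVFoundation/AVFoundation.h>

// A hypothetical local file in the app bundle; any file URL or HTTP(S) URL works.
NSURL *url = [[NSBundle mainBundle] URLForResource:@"sample" withExtension:@"mov"];

// Create the asset. No media data is inspected yet; properties such as
// tracks and duration still have to be loaded asynchronously.
AVURLAsset *asset = [AVURLAsset URLAssetWithURL:url options:nil];
```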

In order to ask for tracks or other properties to be loaded, you need to create an array of strings that are the property names. Then you pass that array to loadValuesAsynchronouslyForKeys:completionHandler:. Sometime later, the completion handler block you passed will get executed on some queue. Now, since we want to operate on AVPlayer in that block, and AVPlayer must only be accessed from the main queue, we need to trampoline onto the main queue.

So here's the completion block we're going to execute. The first thing you do is check the status of the tracks. If the tracks are loaded, then you can create a player item for the asset. Starting with iOS 5.0, you only need to load properties that you directly access, but you should still explicitly load tracks if you are supporting earlier versions.
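
Put together, the loading step might look roughly like this sketch (error handling trimmed):

```objc
NSArray *keys = [NSArray arrayWithObject:@"tracks"];

[asset loadValuesAsynchronouslyForKeys:keys completionHandler:^{
    // This block may run on an arbitrary queue, so hop to the main queue
    // before touching AVPlayer or any UI objects.
    dispatch_async(dispatch_get_main_queue(), ^{
        NSError *error = nil;
        AVKeyValueStatus status = [asset statusOfValueForKey:@"tracks" error:&error];
        if (status == AVKeyValueStatusLoaded) {
            AVPlayerItem *item = [AVPlayerItem playerItemWithAsset:asset];
            // ...attach the item to a player, as described next.
        }
    });
}];
```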

Once we have a player item, we can attach it to a player. While we are creating the AVPlayer here, it could be that the player already exists and we only set the player's player item. The player needs to be attached to a player layer in order to display. In a few slides, I'll show you how this view is hooked up to the player layer so that this does the right thing.

Now we have our objects hooked up, but the player item isn't necessarily ready to play yet. We need to key-value observe the status property of the player item so we can detect when the item is ready. In our observeValueForKeyPath: method, we look for the right context and trampoline to the main queue to adjust our UI based on the value of the player item's status. Now we can actually accept requests to play from the user and subsequently requests to pause. I said I'd tell you how setting the player view's player, which we did in our completion handler, caused the player layer's player to be set.
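
A rough sketch of that wiring, assuming a view controller with hypothetical player, playerItem, playerView, and playButton properties:

```objc
static int PlayerItemStatusContext;

// Called from the loading completion block once the tracks are loaded.
- (void)setUpPlaybackWithItem:(AVPlayerItem *)item
{
    self.playerItem = item;
    self.player = [AVPlayer playerWithPlayerItem:item];
    self.playerView.player = self.player;

    [item addObserver:self
           forKeyPath:@"status"
              options:NSKeyValueObservingOptionInitial
              context:&PlayerItemStatusContext];
}

- (void)observeValueForKeyPath:(NSString *)keyPath
                      ofObject:(id)object
                        change:(NSDictionary *)change
                       context:(void *)context
{
    if (context != &PlayerItemStatusContext) {
        [super observeValueForKeyPath:keyPath ofObject:object change:change context:context];
        return;
    }
    // Trampoline to the main queue before touching the UI.
    dispatch_async(dispatch_get_main_queue(), ^{
        self.playButton.enabled =
            (self.playerItem.status == AVPlayerItemStatusReadyToPlay);
    });
}
```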

What we have here is the simplest way of getting an AVPlayerLayer associated with a UIView. We make a UIView subclass and we make its layer class be AVPlayerLayer. This means that the UIView will instantiate an AVPlayerLayer object for us.

When we set the view's player, what we do is set the player of the player layer. And when we fetch the view's player, what we do is return the player associated with the player layer. So the player view is just mediating our access to the player layer, so we have one less object to explicitly worry about.
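
That player view is just a thin UIView subclass, something like this sketch (the class name is illustrative):

```objc
@interface PlayerView : UIView
@property (nonatomic, strong) AVPlayer *player;
@end

@implementation PlayerView

// Back this view with an AVPlayerLayer instead of a plain CALayer.
+ (Class)layerClass
{
    return [AVPlayerLayer class];
}

- (AVPlayer *)player
{
    return [(AVPlayerLayer *)self.layer player];
}

- (void)setPlayer:(AVPlayer *)player
{
    [(AVPlayerLayer *)self.layer setPlayer:player];
}

@end
```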

Another detail to worry about: when playback ends, you may need to do something, like rewind the video. So you want to observe the AVPlayerItemDidPlayToEndTimeNotification so you can respond. And then you need to implement the method that receives the notification. That got our basic playback working.
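
For example, a minimal rewind-at-the-end sketch:

```objc
// Called during setup, once the player item exists.
- (void)observePlaybackEnd
{
    [[NSNotificationCenter defaultCenter] addObserver:self
                                             selector:@selector(playerItemDidReachEnd:)
                                                 name:AVPlayerItemDidPlayToEndTimeNotification
                                               object:self.playerItem];
}

// When playback reaches the end, rewind so the next Play starts over.
- (void)playerItemDidReachEnd:(NSNotification *)notification
{
    [self.player seekToTime:kCMTimeZero];
}
```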

A common thing you'll want to do is have a scrubber or slider that the user can use for random access to the movie. Let's talk about how you do that. When the movie is playing normally, you need to keep the scrubber in sync with the current time. You do this with a periodic time observer.

You tell the player you want a block called every tenth of a second, or whatever time interval is appropriate. You want the block called on the main queue because it will update UI elements, and that block will get passed the current time so you can move the scrubber thumb to the correct place.
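
In code, that might look roughly like this sketch, with a hypothetical timeObserver property holding the returned token and a hypothetical scrubber slider:

```objc
// Called when normal playback (re)starts. Keep the returned token so the
// observer can be removed with -removeTimeObserver: when scrubbing begins.
- (void)startObservingTime
{
    CMTime interval = CMTimeMakeWithSeconds(0.1, 600);   // every 1/10 second
    self.timeObserver =
        [self.player addPeriodicTimeObserverForInterval:interval
                                                   queue:dispatch_get_main_queue()
                                              usingBlock:^(CMTime time) {
            double duration = CMTimeGetSeconds(self.player.currentItem.duration);
            if (duration > 0) {
                // Move the scrubber thumb (a UISlider) to match the current time.
                self.scrubber.value = CMTimeGetSeconds(time) / duration;
            }
        }];
}
```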

When the user starts to move the scrubber around, you need to do a bit more work. You need to pause playback and remove that time observer. And as the scrubber is moved around, you need to move around in the movie. Since seeks can take a while, and because only one seek can be in progress at a time, you want to chain seeks together. I haven't shown all the details, but basically, if no seek is in progress, then you should initiate a seek. And if one is in progress, you should remember the time the user chose so your completion handler can start the next seek.
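
One way to sketch that chaining, using a few illustrative ivars (_isSeeking, _hasPendingSeek, _pendingSeekTime):

```objc
// Called repeatedly as the scrubber moves.
- (void)scrubberMovedToTime:(CMTime)newTime
{
    if (_isSeeking) {
        // A seek is already in flight; just remember the latest request.
        _pendingSeekTime = newTime;
        _hasPendingSeek = YES;
        return;
    }
    [self seekSmoothlyToTime:newTime];
}

- (void)seekSmoothlyToTime:(CMTime)time
{
    _isSeeking = YES;
    [self.player seekToTime:time completionHandler:^(BOOL finished) {
        dispatch_async(dispatch_get_main_queue(), ^{
            _isSeeking = NO;
            if (_hasPendingSeek) {
                // Start the next seek from where the user has moved to since.
                _hasPendingSeek = NO;
                [self seekSmoothlyToTime:_pendingSeekTime];
            }
        });
    }];
}
```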

Lastly, when the user stops scrubbing, you want to recreate your time observer and continue your normal playback. We've seen a lot of multi-threading so far: completion handlers, notifications, key-value observers, and time observers. You want to remember that your UI classes and your AVPlayer need to be operated on only from the main queue. And because virtually all of the properties of an AVAsset are non-atomic, you need to make sure you aren't manipulating the same asset on more than one queue.

We've covered a bunch of stuff here. Remember, the player is about control, the player item is about presentation state, you can't see it without a player layer, and you need to stay on the right queue. Let's turn to editing. What is editing? Combining pieces of two or more tracks to make a piece of a single track.

There are three kinds of editing operations we do in AV Foundation: placing segments adjacent to each other, which is what is called cuts-only editing; mixing audio tracks together; and compositing video tracks. Let's look at the way we place segments adjacent to each other, which is with AVComposition.

An AVComposition object is just a kind of asset. It's composed of composition tracks, and the tracks are composed of segments, each of which can come from a different asset. Typically, we will be using an AVMutableComposition and AVMutableCompositionTrack, but segments are always immutable objects.
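
As a rough sketch, here is a cuts-only edit that appends a piece of one asset's video track after a piece of another's (both assets are assumed to already have their tracks loaded):

```objc
- (AVMutableComposition *)compositionFromAssetA:(AVAsset *)assetA assetB:(AVAsset *)assetB
{
    AVMutableComposition *composition = [AVMutableComposition composition];
    AVMutableCompositionTrack *videoTrack =
        [composition addMutableTrackWithMediaType:AVMediaTypeVideo
                                 preferredTrackID:kCMPersistentTrackID_Invalid];

    NSError *error = nil;

    // First segment: the opening 5 seconds of asset A.
    AVAssetTrack *sourceA = [[assetA tracksWithMediaType:AVMediaTypeVideo] objectAtIndex:0];
    CMTimeRange rangeA = CMTimeRangeMake(kCMTimeZero, CMTimeMakeWithSeconds(5, 600));
    [videoTrack insertTimeRange:rangeA ofTrack:sourceA atTime:kCMTimeZero error:&error];

    // Second segment: 10 seconds of asset B, appended right after the first.
    AVAssetTrack *sourceB = [[assetB tracksWithMediaType:AVMediaTypeVideo] objectAtIndex:0];
    CMTimeRange rangeB = CMTimeRangeMake(kCMTimeZero, CMTimeMakeWithSeconds(10, 600));
    [videoTrack insertTimeRange:rangeB ofTrack:sourceB atTime:rangeA.duration error:&error];

    return composition;
}
```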

Let's look at a picture to make it a bit clearer. We have three videos, three assets to start with. We take portions of each asset. The portion I take from each asset doesn't have to be the same across all tracks. I could take very different portions of each track. Let's look at what this looks like at the object level.

We have a composition object that contains two track objects, and each track object contains several segment objects. Looking at one of the segments, it indicates which movie is the source, which track within the movie, and the time range within that movie. I want to talk briefly about how AV Foundation represents time. We don't use floating point in AV Foundation. Time scales like 29.97 don't mesh well with binary floating point. Remember 29.97 is actually precisely 30,000 divided by 1,001 or 29.9700299700, etc. When you mix this with other time scales, it's easy to lose precision.

AV Foundation uses Core Media time, CMTime. This is not an object but a C struct. It is a value over a time scale. The value is 64 bits; the time scale is 32 bits. The epoch is used to distinguish values associated with distinct timelines, and epoch 0 is also used for durations.
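
For example, a few CMTime values expressed as a value over a time scale:

```objc
// 2.5 seconds at a timescale of 600 (600 is a common choice because it
// divides evenly by 24, 25, and 30 frames per second).
CMTime twoAndAHalfSeconds = CMTimeMake(1500, 600);

// One frame of 29.97 fps video: exactly 1001/30000 of a second.
CMTime oneNTSCFrame = CMTimeMake(1001, 30000);

// Convenience constructor from seconds.
CMTime tenSeconds = CMTimeMakeWithSeconds(10.0, 600);
```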

We also have CMTimeRange, which is a start plus a duration. And, especially for compositions, CMTimeMapping, which maps a time range in the source asset to a time range in the composition. That's compositions. Now let's talk about mixing audio. As an example, let's take adding a commentary track to a video. When we add a shorter piece like this to a track, the framework will implicitly add empty segments to fill out the track.

What we want to do is have the volume of the main audio drop during the commentary. This is called ducking the main track. We will find that the only part of this that needs to be described is denoted by the part I've highlighted in red. That is, we only need to describe when the audio level changes.
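
In code, the ducking just described might look something like this sketch, using the audio mix objects covered next (the track and time-range parameters here are assumptions):

```objc
// mainAudioTrack is the composition's original audio track;
// commentaryRange is the time range occupied by the commentary.
- (AVAudioMix *)duckingMixForTrack:(AVAssetTrack *)mainAudioTrack
                   commentaryRange:(CMTimeRange)commentaryRange
{
    AVMutableAudioMixInputParameters *params =
        [AVMutableAudioMixInputParameters audioMixInputParametersWithTrack:mainAudioTrack];

    CMTime ramp = CMTimeMakeWithSeconds(0.5, 600);

    // Fade the original audio down to 20% as the commentary starts...
    [params setVolumeRampFromStartVolume:1.0
                             toEndVolume:0.2
                               timeRange:CMTimeRangeMake(commentaryRange.start, ramp)];

    // ...and back up to full volume as the commentary ends.
    CMTime rampUpStart = CMTimeSubtract(CMTimeRangeGetEnd(commentaryRange), ramp);
    [params setVolumeRampFromStartVolume:0.2
                             toEndVolume:1.0
                               timeRange:CMTimeRangeMake(rampUpStart, ramp)];

    // The commentary track isn't mentioned, so it is mixed at normal volume.
    AVMutableAudioMix *audioMix = [AVMutableAudioMix audioMix];
    audioMix.inputParameters = [NSArray arrayWithObject:params];
    return audioMix;
}
```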

Looking at this as objects, we have an AVComposition that includes the video and audio tracks from an asset, plus the commentary track that comes from another asset. The audio mix object only needs to describe what happens to the one audio track, because the commentary track will be implicitly mixed at normal volume. So audio mix objects contain a list of AVAudioMixInputParameters objects. Each input parameters object describes how the volume of one track should be adjusted over time. And tracks that are not mentioned are implicitly mixed. Let's turn to video compositing, which is a bit different.

A video composition object describes how to composite video tracks together. So it describes the compositing order and what should be done with each layer. In this example, we want to take track A and then composite track A over B with A fading out, and then use track B by itself for a while, and then instead layer track B in front of A and fade B out to get to A, so that we get a result that looks something like this.

What you need to do is for every time range, make a composition instruction and then make a composition layer instruction for each track that you need to composite. You order the layer instructions into the composition instruction in the compositing order and then order the composition instructions into the video composition in time order.
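
A sketch of one cross-fade time range, with hypothetical trackA, trackB, and crossFadeRange values:

```objc
// One instruction covering the range where A fades out over B.
AVMutableVideoCompositionInstruction *instruction =
    [AVMutableVideoCompositionInstruction videoCompositionInstruction];
instruction.timeRange = crossFadeRange;

// Front layer: track A, fading from opaque to transparent over the range.
AVMutableVideoCompositionLayerInstruction *layerA =
    [AVMutableVideoCompositionLayerInstruction videoCompositionLayerInstructionWithAssetTrack:trackA];
[layerA setOpacityRampFromStartOpacity:1.0 toEndOpacity:0.0 timeRange:crossFadeRange];

// Back layer: track B, left fully opaque underneath.
AVMutableVideoCompositionLayerInstruction *layerB =
    [AVMutableVideoCompositionLayerInstruction videoCompositionLayerInstructionWithAssetTrack:trackB];

// Layer instructions go in compositing order (front to back);
// instructions go into the video composition in time order.
instruction.layerInstructions = [NSArray arrayWithObjects:layerA, layerB, nil];

AVMutableVideoComposition *videoComposition = [AVMutableVideoComposition videoComposition];
videoComposition.instructions = [NSArray arrayWithObject:instruction];
videoComposition.frameDuration = CMTimeMake(1, 30);
videoComposition.renderSize = CGSizeMake(1280, 720);
```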

Audio and video are very different as we can see now. Audio is organized by tracks first and then we describe how the volume of the track changes over time. Video on the other hand is organized by time ranges and then describes the order of compositing. Also, audio has implicit inclusion but with video everything needs to be explicit. These merge objects, audio mix and video composition, are not assets. They are operations to be performed on assets. What you do is pass one or both of these objects to something else in order to have the operations applied.

For example, a player item can take an audio mix and a video composition, which describe how the asset should be modified before being displayed. An export session can do the same thing, describing how the asset track should be mixed as you save a copy. For still images, we can describe how to composite the video tracks before the still is generated.

And when reading either audio or video from an asset, you can also have a mix applied as you are reading. The things to remember: a composition is an asset. Audio mixes are by track and include other tracks implicitly, whereas video compositions are by time range and must mention everything explicitly.
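
For the player item and export session cases, handing them over is just a couple of property assignments, roughly (reusing the composition, audio mix, and video composition from the sketches above):

```objc
// At playback time:
AVPlayerItem *item = [AVPlayerItem playerItemWithAsset:composition];
item.audioMix = audioMix;
item.videoComposition = videoComposition;

// Or when exporting a copy:
AVAssetExportSession *export =
    [[AVAssetExportSession alloc] initWithAsset:composition
                                     presetName:AVAssetExportPresetHighestQuality];
export.audioMix = audioMix;
export.videoComposition = videoComposition;
```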

Now let's turn to Core Animation. Your first question is probably, "What does Core Animation have to do with videos?" Core Animation underpins all UIKit rendering in iOS. You may not notice this because we hide it for the simple cases, for example, drawRect:. Core Animation has tremendous functionality for graphical operations, and rather than create some new interface, we decided to leverage Core Animation directly to enable things like fancy title sequences.

Since UIView uses CALayer subclasses, AV Foundation provides CALayer subclasses for the UI to use: AVPlayerLayer for playback and AVCaptureVideoPreviewLayer for looking at capture. Here's a class diagram. You can see that AVPlayerLayer knows about a player and AVCaptureVideoPreviewLayer knows about a capture session. The capture preview layer has some extra functionality to be able to easily mirror the video input.

What if I'm using real Core Animation, say adding some title to my video? Let's take an example we used at the developers conference. We want to add a title and some stars, and we want to spin the stars around and have the whole thing fade out after 10 seconds. We want it to look something like this. There we have our stars spinning, and then it all fades out.

So we need a top-level layer, and that layer needs to have an animation on it that will do the fade-out. And then we need a layer underneath to hold that title string, and another layer to hold our ring of stars. The ring of stars will need an animation on it to do the spin, but there's a problem.

Our UIView is going to reference an AVPlayerLayer, or maybe some layer tree that contains that player layer, and that will bring in the video. But there's a distinction between times above the line and below the line. Above the line, things happen in real time, time since boot. Below the line, things happen in movie time, which can go forward or backward. So how do we make our animation happen in movie time? We add an AVSynchronizedLayer, which fixes the layer tree underneath it to happen in movie time.

AVSynchronizedLayer needs to reference a specific player item, which is associated with the asset we want to sync with. But this only works in the context of an AVPlayer. So we need something else, an AVVideoCompositionCoreAnimationTool, which we use to sync otherwise. This is attached to a video composition. That is, I have to have a video composition in order to sync in other contexts.

Now, both of these objects are constructed only with class methods. AVSynchronizedLayer is created with synchronizedLayerWithPlayerItem:. The Core Animation tool has two methods. The first is videoCompositionCoreAnimationToolWithAdditionalLayer:asTrackID:. What this does is add the first argument, a CALayer, as a virtual track to the asset that the video composition is operating on. The second method is videoCompositionCoreAnimationToolWithPostProcessingAsVideoLayer:inLayer:.

It renders its second argument, a CALayer, and replaces the first argument, a contained CALayer, with the results of the video composition. So either we make a layer tree a track within our video composition, or we make the video composition a layer within our layer tree.
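
A sketch of the two approaches; playerItem, playerView, animationLayer (the title-and-stars tree), videoLayer (a sublayer of animationLayer where the video frames should appear), and an AVMutableVideoComposition like the one above are all assumed to exist already:

```objc
// Inside an AVPlayer context: run the animated layer tree in movie time.
AVSynchronizedLayer *syncLayer =
    [AVSynchronizedLayer synchronizedLayerWithPlayerItem:playerItem];
[syncLayer addSublayer:animationLayer];
[playerView.layer addSublayer:syncLayer];

// Outside a player (for example, during export): render the layer tree and
// put the composited video frames into videoLayer, contained in animationLayer.
videoComposition.animationTool =
    [AVVideoCompositionCoreAnimationTool
        videoCompositionCoreAnimationToolWithPostProcessingAsVideoLayer:videoLayer
                                                                inLayer:animationLayer];
```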

Now, rendering a layer tree is an expensive operation, so we can use the enablePostProcessing property to tell a composition instruction, which covers a particular time range of the video, that we don't want to render the layer tree during that time range, since we, as programmers, know it will have no effect. Another point that's important has to do with coordinate systems. The coordinate system for UIView has its origin in the upper left.

But CALayers default to having their origin at the lower left. Now, when I place a CALayer in a UIView, the coordinate system will get flipped. But this only happens when I'm in a UIView. So what do we do when we're using the Core Animation tool? Because we have no UIView there. Instead, we have to tell the layer to flip its geometry. This property is inherited, so we only need to set it at the root of the tree.
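
Concretely, that's a single property on the root of the animation layer tree (geometryFlipped is a standard CALayer property):

```objc
// No UIView is doing the flip for us in the Core Animation tool case,
// so flip the root layer; its sublayers inherit the flipped geometry.
animationLayer.geometryFlipped = YES;
```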

So, for display, remember to use AVPlayerLayer or AVCaptureVideoPreviewLayer. For timing, use AVSynchronizedLayer if you're in an AVPlayer context and use the Core Animation tool everywhere else. In closing, I talked about three groups of classes: playback, editing, and Core Animation. I didn't talk about reading and writing or capture. There are also classes around metadata and audio. If you have questions about any part of AV Foundation, feel free to email me. I mentioned several videos from the 2011 Developers Conference. If you're interested in capture, you should also look at the capture talk from the 2010 conference.

I wanted to call out several sample code programs: RosyWriter, if you're manipulating frames, and AVPlayerDemo for information on proper queue handling and also how to deal with metadata. Also, you should look at the OS X samples. Remember that AV Foundation is almost identical on the two platforms. Look especially at the reader-writer sample and the AV metadata editor. So there you have it. I look forward to seeing what you do with video in your application.