Graphics and Media • 1:08:35
QuickTime 7 supports advanced video coding formats such as H.264. With this major advancement you'll need to take a new approach when creating, editing and working with media samples. Learn how to handle H.264 and other advanced formats in your application, including how to leverage new APIs for compression, decompression, and synchronization.
Speaker: Sam Bushell
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
In this session we're going to talk about supporting the latest and greatest video compression formats in your application. We're going to start with a brief bit of technology background, and then we're going to talk about how to access this stuff in your applications using our most modern APIs. We're going to support this with a lot of sample code, which is already available for you to download. If you go to the page for this session, you can log in and download all that stuff right now and look at it while we're talking about it.
We have a great new codec in QuickTime 7 and Tiger. It's called H.264. It's a standards-based codec, the result of a joint effort by a bunch of experts from two major standards bodies, a real international collaboration. Collaboration in particular between the experts from ISO, the International Organization for Standardization, which brought us the MPEG codecs, and the ITU, the International Telecommunication Union, which developed the H.261 and H.263 video conferencing codecs.
[Transcript missing]
It's got a lot of names. You'll see people calling it AVC, which stands for Advanced Video Coding. You'll see people calling it JVT, which stands for Joint Video Team, which refers to its joint heritage.
You'll see it called MPEG-4 Part 10 because that's the name of the standard. Apple uses the name H.264 which is another of the choices. All of these are referring to the same thing. So this is our new video codec in QuickTime 7. We put a lot of energy into making it work. I'd like to give you a brief demo of it over on this wonderful demo machine here. In 20 days and 20 nights, the Emperor Penguin will march to a place so extreme it supports no other life.
Excitement, adventure, and really wild things. So this is one of the videos from our site. If you have QuickTime Player Pro you can export to H.264 yourself. It's under Movie to QuickTime Movie and if you click Options there's another dialog that has a list of settings. The video settings, H.264 is one of the choices in the compressor list.
And down here, if you've seen this dialog before, you'll see that we've rearranged it a bit in order to support our new features. A major new feature for H.264 encoding is a feature called multi-pass encoding, intelligent multi-pass encoding. This is what we use to get the best quality in our encodes.
Let's come back to slides. Since this is the year of high definition, I wanted to point out that Apple has actually been delivering quite a number of codecs which are high-def ready, as well as H.264, which is our premier delivery codec. This year, in Final Cut Pro 5, in Final Cut Studio, we delivered native support for HDV.
HDV is an MPEG-2-based, 19-megabit IPB format. It's constant bit rate. It's supported by a new breed of FireWire HDV cameras from some manufacturers. And we have native support for this in Final Cut Pro 5. iMovie HD and Final Cut Express HD support HDV by converting it to an intermediate format called the Apple Intermediate Codec.
But these aren't the only ones. We've also supported a format called DVCPRO HD, which is a 100-megabit DV-style, DCT-based high-def format, since Final Cut Pro 4.5 last year. In Panther we introduced Pixlet, and for some time we've supported uncompressed HD playback if you have a machine that has sufficient I/O bandwidth.
These movies have exceptionally high data rates, up in the hundreds of megabytes per second. Now, the last four of these codecs are I-frame only. H.264 and HDV use B-frames, and some of you may be saying, "What's that again?" So, let's have a little bit of technology background on what B-frames are and why you might need to worry about them and why they're important.
Video codecs present you with a trade-off between bit rate and quality. If you want a better quality, you generally have to use more bits. If you want to reduce the bit rate, you have to sacrifice the quality to some extent. So there's this curve, and we're constantly trying to move that curve better and better towards better quality at lower bit rates. We do this by adding more tricks.
Smart people who write codecs come up with more tricks. Some of those tricks are entirely inside the codec, but some of them involve... aspects that need to be taken care of by parts that are outside of the codec, by other things, other parts of the system that deal with compressed frames.
Suppose you want to compress some video. Here's a clip of me parking my car. Prosaic. Well, you can compress each frame independently. These are called I-frames, or keyframes, or sync samples. In this case, every frame is self-contained, so random access is fast, but the bit rate isn't too great.
[Transcript missing]
So you can improve compression performance substantially by using one frame as a base for encoding another. In the example I'm illustrating here, we're describing frame 5 in terms of frame 4 as a base. First, we describe the areas that are similar to frame 4. Now the orange area in this diagram is the piece that's the same pixels from the same place in frame 4. The green bit in the middle is where we have the same pixels, but they kind of moved over from where they were in frame 4. This is called motion compensation.
Now, that doesn't completely describe the scene. There's still a fix-up that has to be made because not everything is completely stationary. The reflection on the car doesn't move with the car. It's got to be adjusted. The tire is moving, so you have to adjust the image for that as well. This is called the residue, this fix-up image.
Now there's a whole strip of the image that wasn't in frame 4 because it was out of the picture. It's just moved in. This strip has to be coded from scratch by the residue. So this is what it looks like if we encode the last five frames of those six using difference frames, that is, using motion compensation. Difference frames use information from a previous frame.
They're also known as P-frames, for predicted. This gives you better compression because motion compensation is extremely compact relative to describing something from scratch. Another way of looking at it is to say that each of these frames depends on the previous one because each frame is described in terms of the previous one as a base. That means that if you want to display frame 6, you have to start with frame 1 and then decode all of the ones in between.
So we call this IP, when you have I-frames and P-frames. It gives you better compression than I-frames only, but random access can be slow. For example, to get to that last frame you had to decode all six. It could take some time. Also, another thing to note is that images that appear gradually have to be constructed incrementally. Like the last image in that sequence, when you see it, it's been constructed from strips in five different frames. That might not always be the most efficient way to do things.
So suppose for a moment that we encoded the first frame in that sequence as an I-frame, self-contained, and then went and encoded the last frame in the sequence as a P-frame directly based on frame one. Then we could pick a frame in the middle and we could encode it using motion compensation partly from the frame before it and partly from the frame after it.
Well, then you can see we can encode it almost completely using motion compensation. And in fact, we can do that for a whole bunch of frames in between those, mostly encoding them using motion compensation, so the bit rate is improved. These are called B-frames. B stands for bidirectional prediction. B-frames refer to information from frames that will be displayed in the future. They may also use information from a previous I or P frame. As you can see, the bit rate has improved.
Using B-frames as well as I and P frames gives you better compression when objects appear gradually. It also gives you somewhat faster random access. To decode and display any of the frames in the middle from a standing start, you only have to decode the frames at the ends, not any of the others. Also, other frames can be skipped when, for example, you don't have enough CPU to play at full rate, or you're playing fast forward. The jargon for this is temporal scalability.
One weird thing that happens when you have B frames though, is that the order the frames are decoded in is not the same as the order the frames are displayed in. Because the codec isn't able to predict what's happening in the future, it has to use information that is already decoded. This reordering is why other parts of the system need to understand about B-frames in order to support them.
So, some of you are saying that you all knew all of that already because you've been doing some work with MPEG and it's all familiar to you. Well, there is a twist. You guys may know that you can implement this kind of semantic using what's called a one-frame delay. And you build a small finite state machine and you do certain transitions when you see that the frame type is I or P or B. And that completely defines the dependency structure.
Now this is true for MPEG-2's IPB pattern because only one future frame can ever be held at one time. This is not true for the general case of H.264. In fact, H.264 allows the encoder a lot more flexibility than previous frame patterns. P frames can use more than one prior frame for decode. B frames can use more than two prior or future frames. Not all I frames reset the decoder. We have a new name for the ones that do: these are IDR frames in H.264 terminology. It stands for Instantaneous Decoder Refresh.
P and B frames can depend on other B frames, and some I and P frames may be dropped. So where you had a pattern that, if you knew what to look for, was fairly regular for MPEG-2 IPB, the general case for H.264 can be a whole lot more complicated and unpredictable, and you can't just derive it from the I, P, and B letters.
So instead of trying to store the I, P, and B letters and using that as our basis, we identified that it was important to record the following information per frame. Number one, is it a sync sample? Not every I-frame is a sync sample now, because although you could decode an I-frame alone based on no other information, the following P and B frames may need information from previous frames. Number two, is it droppable? Not all B frames are going to be droppable now, and some I and P frames might be droppable.
Number three: In what order are frames to be decoded? And possibly you might have some hints about what timing they should be decoded with. And number four: At what time should each frame be displayed? So, the moral of this little background bit is that the dependencies between frames are becoming weirder and weirder, but it's all in the cause of improved compression performance.
IPB means that frames are stored and decoded in a funny order, and some parts of the system need to be aware of that. And convenient rules that were true for MPEG-2, like the one frame delay, don't work in H.264. They're not actually sufficient to handle the general case.
So let's talk about what this means for QuickTime and for developers like you. Let's take a look at what a video track looks like so that we can describe the API. At the basic level, a video track contains a list of frames. Now in some of our APIs you'll see we say samples. In some of our APIs you'll see we say frames. We use sample in APIs that need to be more generic, because they might refer to things other than video; they might refer to sound.
But in our video APIs we tend to use the word frame. And in fact when we're talking about video we're using frame and sample interchangeably. Each sample has some timing and dependency information, and they're stored in decode order. You'll notice that these frames are kind of in a different order from how you'd expect them to be, because the car is jerking in and out of the scene as it moves along.
In the cases where the decode order is different from the display order, we also introduce information called display offsets, which are how we describe that reordering for the codec. The display offset is simply the difference between the display time and the decode time. Specifically, it's display time minus decode time. So by interpreting the decode times and adding the display offsets, we know when each frame should be displayed. And now we have the frames from that portion of compressed video that can be decoded and displayed in the order that makes sense to a user.
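To make that arithmetic concrete, here's a tiny sketch with made-up numbers (not the actual values from this movie): five frames with a decode duration of 125 each, where adding the display offsets recovers the display order.

```c
#include <stdio.h>

int main(void) {
    // Hypothetical values, one entry per frame in decode order.
    // Display offset = display time - decode time, so the display time
    // is recovered by adding the offset to the decode time.
    long long decodeTime[]    = {   0, 125,  250, 375,  500 };
    long long displayOffset[] = {   0, 125, -125, 125, -125 };

    for (int i = 0; i < 5; i++) {
        long long displayTime = decodeTime[i] + displayOffset[i];
        printf("frame %d: decode time %lld -> display time %lld\n",
               i + 1, decodeTime[i], displayTime);
    }
    return 0;
}
```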
Now in general, those display offsets don't need to be exposed to the user. They don't need to know about them at all. When the user wants to rearrange portions of media, they use a different mechanism. They use the edit mechanism in QuickTime. Edits describe the segments of video or other kinds of media that have been brought into the movie's presentation at various times.
So for example, in this case, if the user decided to trim down to just include the end part of that parking sequence and then follow it with a piece where the car's door is open, the next scene, whatever that is, then we would have an edit that described insertion of that piece of video into the track, into the movie's presentation.
Broadly, our APIs are divided into high-level APIs and low-level APIs. The high-level APIs work on tracks and movies. The low-level APIs work on media and samples. So, we have high-level APIs that implement cut, copy, and paste using the clipboard. We also have high-level APIs called track segment editing APIs, which let you insert segments of media from movie to movie or track to track directly, to delete, to rescale, and otherwise manipulate bits of movies. These work without using the clipboard. The clipboard belongs to the user, and they get antsy when it gets trashed unexpectedly. So, if you're trying to programmatically manipulate media with the high-level APIs, you should use the track segment APIs rather than cut, copy, and paste.
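As a rough sketch of the segment-editing style in the C API (not the session's own sample code; error handling is trimmed and the two movies are assumed to be open already), appending all of one movie onto the end of another without touching the clipboard looks something like this:

```c
#include <QuickTime/QuickTime.h>

// Sketch: append the whole of srcMovie to the end of dstMovie using the
// segment editing APIs rather than cut, copy, and paste.
static OSErr AppendMovie(Movie srcMovie, Movie dstMovie)
{
    TimeValue srcIn       = 0;                          // start of the source segment
    TimeValue srcDuration = GetMovieDuration(srcMovie); // take the whole source movie
    TimeValue dstIn       = GetMovieDuration(dstMovie); // insert at the end of the destination

    return InsertMovieSegment(srcMovie, dstMovie, srcIn, srcDuration, dstIn);
}
```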
Beneath that we have the media manipulation APIs, which let you access individual samples, and we have the image compression and decompression APIs, which let you compress and decode video frames. Now most of the plumbing changes we've needed to make in order to support H.264 have been in these low-level APIs. Specifically, the changes we've made in order to support H.264 have been to support frame reordering by adding display offsets at various levels of API. We've also started to store new dependency information. We have new sample flags, such as one that says that a given frame is droppable.
Applications that access individual video frames must migrate to our new APIs if they want to access this information in H.264 or other codecs with B-frames. We made this a deliberate design decision because we wanted to avoid a situation where a user's content with one of these new codecs would be damaged and messed up by an application that failed to preserve the display offsets.
These API changes are part of an evolution that you'll see happening across the QuickTime landscape. We're moving away from handles. We're moving towards core foundation style objects because retain counting helps you manage object lifetimes. We're moving away from QuickDraw. In place of Pixmaps as a place to store image data, you'll see we're using core video pixel buffers.
And as part of this plumbing we're able to do things in a multi-buffer manner. This allows us to do more pipelining and have more asynchrony between various stages. In place of GWorlds and GrafPorts as a way to describe where a movie should be played, we're moving towards a thing called the visual context, which you would have seen a bunch of detail on in this morning's session 207 on high-performance video.
The visual contexts give you advantages like being able to do convenient hookups with OpenGL. If you've seen any of the audio sessions, then you know we're moving away from Sound Manager towards Core Audio, which gives us benefits like high resolution, high fidelity, high definition audio. We're moving away from FSSpecs. We have a bunch of APIs to replace those, which all use data references.
This gives us access to Unicode and long file names. And there are two other transformations that we're going through which I wanted to touch on. We're moving towards 64-bit time values and towards 64-bit sample numbers. These enable us to have very high time scales and very long movies.
Now, those transitions are not complete yet. Not all the plumbing is completed. But you'll notice that we use 64-bit integers in the new APIs, because all of the compilers support them natively now. There's no reason for us not to. There is a tech note, in particular on the FSSpec-to-data-reference transition that you should make in your applications: it's Technical Note TN2140.
So that's now it for theory in this session. The rest of it is going to be practice. We're going to show you how to do things using our favorite modern APIs, a bunch of high-level demonstrations, a bunch of low-level demonstrations, and then I'm going to finish it out by showing you how to write your own codec with the new codec APIs. Now as I said, these samples are already available for download. You can follow along with us as you like. We won't tell you when to turn the page. To start off the high level demos I'd like to introduce for the first time on stage speaking, David Eldred.
All right, thanks a lot, Sam. So as Sam has described, we have all of these high-level APIs that are available to you that will transparently take care of all of the complexities he's been talking about. These APIs, as Sam has said, include using the clipboard to do cut, copy, and paste; that's a user-level operation. The insert segment APIs: there are both movie and track versions of these APIs. They're available in the C and the Cocoa QuickTime interfaces.
Saving movies: you don't have to worry about all of the complexities of B-frames when you're dealing with saving. This includes saving as reference movies, saving as self-contained movies, or flattening movies. And exporting movies to the various formats that we support. All high-level APIs. So let's go to demo machine 1 and we'll take a look at a high-level editing sample.
So here I've got three clips. These are all HD size H.264 clips. Let's actually open them in QuickTime Player. I have a question. Jim, he looks happy. And these are all clips of QuickTime engineers doing what they do best, standing around talking. But I'd like to edit them into a coherent composition. So I've got a little application here that does that. Very simple. Uses QtKit. And it's so simple I'm just going to step through it with you.
So the first thing we're going to do is we're going to create QTMovie objects to represent the three source movies. And we're going to create a QTMovie object to represent our target movie. And we set an attribute on that to make it editable. And using the insert-segment-of-movie API, we're going to insert a segment defined by a QTTimeRange from our source movie into our destination movie.
And finally, we write out the destination movie. So this example is so trivial where I don't believe the sample code is available to you. If you're really interested in getting your hands on it, we can give it to you in the lab. And let's see what we've made.
[Transcript missing]
And we could have added together anything there. That could be DV content, that could be H.264, MPEG-4, whatever codecs QuickTime supports, or even other track types that QuickTime supports for editing. So there's your first simple example of high-level editing. And now we're going to look at an example where we're doing multi-pass export using a custom data source.
So, multi-pass export is new in QuickTime 7, as Sam pointed out. It gives you the highest quality H.264 compression possible. And in this example, we're going to use the high-level APIs to allow QuickTime to present its compression UI. Using the custom sources, we're just going to provide the source data frames and QuickTime will handle everything else for us. It'll handle the multi-pass configuration of the codec, it'll handle the compressor configuration, it'll write out the file for us, flattening, hinting, everything else.
So how does this example work? First we're going to create a movie export component. The movie export component has video source procedures and audio source procedures. We just have to hook our movie up to it. In this case we're going to have a movie playing into a visual context. And that visual context is going to be using Core Image to do some messing around with the video. And that's going to be the source for our video source procedure. Since we're video people, we don't care about audio. We're just going to plug the audio directly through.
So, off to this demo. Demo machine 1, please. So, how many of you were in session 207 earlier today? A few of you. So this is going to look familiar to you. So I won't go into great detail about how the visual context is configured and how we're hooking up the Core Image filter. If you have more questions about that, come and see us in the lab.
But first a quick demo of what this application does. So we've got our favorite car driving along. We've got a core image filter we can change the settings of. And I like it a little darker. Go over here, make it, that's about right. Change the color some. So this is how I like it.
So when I click the export button, we're first prompted for where we want to save the movie. We'll just replace that one. And in the settings dialog--so we're bringing up the standard QuickTime compression settings dialog from our application. And here we're going to choose H.264 and multi-pass encoding. And we're going to allow frame reordering; because we're using high-level APIs, we don't care if there are B-frames.
And we go ahead with these settings. And as you see, it's doing several passes through the video. So for each pass through the video, our data proc is being called once for each frame. So if this had 100 frames of video and it did five passes, we'd be called 500 times. So it opens up our resulting movie in QuickTime Player. Vroom. And there it is. So let's take a look at the code.
So we're going to look at this pretty quickly since we've seen it before, and of course come to the lab if you want more details. But we're going to focus on what happens when that export button is clicked. So as you can see, the first thing that happens is we prompt you for the file you want to save this as.
And we open a movie export component. That's as easy as describing the component and opening an instance. There are two procs that we have inside our application, the video track property proc and the video track data proc. We create UPPs from these functions and we pass these in to the exporter component we just opened. And since we're video people, we're just going to use the standard audio source procs from the source movie and add those to the movie exporter.
We invoke the standard compression dialog. Nothing too complicated there. We ask the movie exporter that we opened to do it: MovieExportDoUserDialog. Assuming that they don't cancel, we are going to start our export with MovieExportFromProceduresToDataRef. At this point, we begin having data pulled from us. Our data proc that we provide is going to be called.
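Strung together, that setup amounts to something like the following sketch (not the shipping sample; the UPPs, refcon, and output data reference are assumed to have been created elsewhere, and error checking is trimmed):

```c
#include <QuickTime/QuickTime.h>

// Sketch: drive a movie export entirely from our own video source procedures.
static OSErr ExportFromProcs(MovieExportGetPropertyUPP propertyProc,
                             MovieExportGetDataUPP dataProc, void *refCon,
                             Handle outputDataRef, OSType outputDataRefType)
{
    ComponentInstance exporter = NULL;
    Boolean cancelled = false;
    long trackID = 0;
    OSErr err;

    // Open a movie export component that writes QuickTime movie files.
    err = OpenADefaultComponent(MovieExportType, kQTFileTypeMovie, &exporter);
    if (err) return err;

    // Register our video source procs; QuickTime will pull frames from them.
    MovieExportAddDataSource(exporter, VideoMediaType, 600 /* time scale */,
                             &trackID, propertyProc, dataProc, refCon);

    // Put up the standard compression settings dialog (H.264, multi-pass, ...).
    MovieExportDoUserDialog(exporter, NULL, NULL, 0, 0, &cancelled);

    if (!cancelled) {
        // QuickTime now drives the export, calling the data proc once per
        // frame per pass; with multi-pass H.264 that means several passes.
        err = MovieExportFromProceduresToDataRef(exporter, outputDataRef,
                                                 outputDataRefType);
    }
    CloseComponent(exporter);
    return err;
}
```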
And we'll take a quick look at this data proc. So this is the procedure that we defined. It simply calls our export frame procedure. And inside that export frame procedure, the parameters that we're called with include a current time. That's the time for the frame that it wants the data for. And in this application we just set that time on our movie, and task the movie to get it to draw.
Do a little bit of application specific stuff to get it to render that current time onto the screen. And then it's going to do a readback from that screen buffer. Since we're doing this core image filter and we want you to have visual feedback, we're rendering this to the screen. And once we read back that data, we just fill in the parameters that are provided as part of this call and return that data. That's it. That's how easy it is to use movie export procedures. So, yeah, let's go back to slides.
So there's a few things to note here. When you're using custom source procedures like this and you're doing multi-pass compression, you must return the same data for a given frame on each pass through the movie. That's very critical, otherwise you'll throw things off. And if you've got very slow rendering, you may want to consider exporting to an intermediate movie first, an uncompressed intermediate movie. That will make the entire process a little quicker since it has to do several passes through the movie.
And if you have very slow processing, that can take a while. That can also be handy if the manipulations that you're doing on the video are for some reason non-deterministic, and you can't guarantee you're going to return the same thing: render to an intermediate movie first. So with that, I'm going to bring Sam back in to dive into the low levels. Where has he gone? There he is. SAM BUSHELL: Thanks a lot. I need a drink.
Thank heavens. So, some apps need to dip below those high-level APIs and access things directly at the frame level. For example, if you want to do more complicated, more advanced editing, if you want to use effects, if you need fine-grained control, all sorts of reasons, you may need to go down to those low-level APIs. Let's talk about them.
To access one frame at a time, you should call the new API GetMediaSample2. This is a replacement for an old API, GetMediaSample. To get information about multiple frames at once, you should call CopyMediaMutableSampleTable. This is a replacement for a suite of APIs with names like GetMediaSampleReference, GetMediaSampleReferences, and GetMediaSampleReferences64.
That suite of old APIs used an array of C structs that you would allocate. Each time we changed the API, we had to introduce a whole new struct, and you had to migrate all of this code over to the new struct, and it was a bit of a mess. The new API gives you an opaque object called a QT sample table, which you get the information out of through accessor functions. That means that we'll be able to introduce more information without having to completely reinvent the world.
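Here's a minimal sketch of what that looks like, written against my reading of the QuickTime 7 declarations in Movies.h (verify the exact names and parameter order against the header; error handling is trimmed):

```c
#include <QuickTime/QuickTime.h>
#include <stdio.h>

// Sketch: pull per-frame timing and dependency information for a run of
// samples starting at a given decode time.
static void DumpSampleInfo(Media media, TimeValue64 startDecodeTime, SInt64 maxSamples)
{
    QTMutableSampleTableRef sampleTable = NULL;
    TimeValue64 firstSampleDecodeTime = 0;

    OSErr err = CopyMediaMutableSampleTable(media, startDecodeTime,
                                            &firstSampleDecodeTime,
                                            maxSamples,
                                            0,              // no decode-duration limit
                                            &sampleTable);
    if (err || !sampleTable) return;

    SInt64 count = QTSampleTableGetNumberOfSamples(sampleTable);
    for (SInt64 i = 1; i <= count; i++) {
        TimeValue64 decodeDuration = QTSampleTableGetDecodeDuration(sampleTable, i);
        TimeValue64 displayOffset  = QTSampleTableGetDisplayOffset(sampleTable, i);
        MediaSampleFlags flags     = QTSampleTableGetSampleFlags(sampleTable, i);

        printf("sample %lld: decode duration %lld, display offset %lld, "
               "sync %d, droppable %d\n",
               (long long)i, (long long)decodeDuration, (long long)displayOffset,
               !(flags & mediaSampleNotSync),
               (flags & mediaSampleDroppable) != 0);
    }
    QTSampleTableRelease(sampleTable);
}
```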
The corresponding APIs for adding samples to media are AddMediaSample2, which is a replacement for AddMediaSample, and AddSampleTableToMedia, which is a replacement for AddMediaSampleReference, AddMediaSampleReferences, and AddMediaSampleReferences64. Those older APIs will return errors if you call them on media with B-frames. Again, this is deliberate, in order to protect users from having their movies damaged by a failure to copy across the frame reordering information.
In QuickTime 7 we have a brand new decompression API called the ICM decompression session. To create one, you provide a dictionary that describes the pixel buffers that you want to get out of it. Then you push in the compressed frames, and your callback function is called with buffers that contain the newly decompressed frames and with other status messages. So, let's go and take a look at that on the demo machine.
Let's take a look at that movie that we looked at before. The traditional way to show the internal structure of movies was to use an application called Dumpster. Dumpster's been around since 1991. It's quite venerable and you can still download it from our developer tools page. We've updated it so that it is aware of the new information in QuickTime 7's movie files.
And we can take a look at that here. For example, this is the information about the video track and the video media. And if I click on this, it'll open up the information about the sample descriptions. You probably can't read this. It says H.264 here. And if I open up this, you probably can't read this either. But it says that all of the samples have the same duration, 125.
And even if you can read this, it's really hard to visualize. This is the list of display offsets or composition offsets for the movie. And they're plus or minus 125. Not very visual, so I thought it might be kind of neat to have a sample application which shows this in a more visual way.
It's called Movie Video Chart. It's already available for download. You could be doing this yourself. Let's take a look at that March of the Penguins movie with this application. You can see thumbnails for individual frames here. And you can see them in three columns. Let's move past that green thing we always have at the beginning of the movie so we can find some interesting pieces of video. Where are we here? I like this bit here.
Here we go. So, thumbnails of all of these high definition frames. This application is decoding these high def frames so they can display little thumbnails of them. And this is the same as the diagram I just showed you a little while ago. At the bottom we have frames in decode order. And you can see that they're in decode order because the penguin is going forward and back and forward and back.
And then above that you can see how the frames are reordered by the display offsets into display order. And here the penguin is jumping up, up and away. Above that is a place where we have information about the track edits that a user might have used in order to construct the movie as they wanted.
In this case the edit is rather dull, it's just the whole movie, it's all there, but the application displays it all the same. At the bottom you can see more information about the samples. You can see the data size of these frames, and I've also marked out which ones are droppable and which ones are keyframes. It actually says "sync" here when we find a keyframe.
Keyframes are rare here. Well, let's take a look at another movie. Here's the movie that David just showed us that he edited together from some high definition clips. Here's Tim and here's Jim and here's me. Let's zoom out so we can have another look at this, because the edits are much longer than the display offsets.
You can see this piece at the top here, these three different colors. I've chosen a different color for each of the edits. And you can see that we've brought together three clips of video. You can also see a couple of other things. The keyframes are actually marked in red. And none of those clips began with a keyframe.
This frame was about four or five frames after the keyframe. This edit was a few frames after as well. So there's some extra media information that we've needed to include in this movie that we constructed by bringing together these clips.
In order to prime the decompressor for displaying the frames that you do want to see, we had to bring back the keyframe and the intermediate frames. But by using the track edits, those frames are removed from the composition that has been created at the user level. Now, if you have a very long run of difference frames between the keyframe and the beginning of the edit, then playing across these cuts might be hard and might take a lot of CPU, and maybe you don't have enough CPU.
But in this case, we were able to play across this because we have a nice, fast machine. So this application is a piece of sample code. Let's take a quick look at it. It's an HIView-based application with a composited window. It has a single main view that draws all of that art, including all of the thumbnails.
I'll show a couple of important points about it. We get information about individual samples in groups. We use CopyMediaMutableSampleTable on a range and it gives us back a sample table object. Then we walk along the sample table object and call these accessors to get the information out that we're going to display in the user interface, in the chart.
In order to display the thumbnails, we have to decode those frames. To decode the frames, we create one of these new decompression session objects. To create a decompression session, you build this dictionary that says the width, the height, and the pixel format that you want. And there's other information you might put in as well. You also provide the callback function that you want to have called when frames are emitted and other information happens. And you also provide the image description.
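Creating that session looks roughly like this (a sketch with error handling trimmed; the image description comes from the media's sample description, and the tracking callback is whatever function you've written):

```c
#include <QuickTime/QuickTime.h>
#include <CoreVideo/CoreVideo.h>

// Sketch: create an ICM decompression session that hands us 32-bit ARGB buffers.
static OSStatus CreateDecompressionSession(ImageDescriptionHandle imageDesc,
                                           int width, int height,
                                           ICMDecompressionTrackingCallback callback,
                                           void *refCon,
                                           ICMDecompressionSessionRef *sessionOut)
{
    OSType pixelFormat = k32ARGBPixelFormat;

    // Describe the pixel buffers we want back: width, height and pixel format.
    CFMutableDictionaryRef bufferAttribs = CFDictionaryCreateMutable(
        kCFAllocatorDefault, 0,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);
    CFNumberRef w   = CFNumberCreate(NULL, kCFNumberIntType, &width);
    CFNumberRef h   = CFNumberCreate(NULL, kCFNumberIntType, &height);
    CFNumberRef fmt = CFNumberCreate(NULL, kCFNumberSInt32Type, &pixelFormat);
    CFDictionarySetValue(bufferAttribs, kCVPixelBufferWidthKey, w);
    CFDictionarySetValue(bufferAttribs, kCVPixelBufferHeightKey, h);
    CFDictionarySetValue(bufferAttribs, kCVPixelBufferPixelFormatTypeKey, fmt);

    // The tracking callback receives decoded frames and other status messages.
    ICMDecompressionTrackingCallbackRecord tracking;
    tracking.decompressionTrackingCallback = callback;
    tracking.decompressionTrackingRefCon   = refCon;

    OSStatus err = ICMDecompressionSessionCreate(kCFAllocatorDefault, imageDesc,
                                                 NULL /* session options */,
                                                 bufferAttribs, &tracking,
                                                 sessionOut);
    CFRelease(w); CFRelease(h); CFRelease(fmt); CFRelease(bufferAttribs);
    return err;
}
```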
Once we've created the decompression session, we can decode frames with it. To load those frames, we call GetMediaSample2 twice: once to find out the size of the sample, then we allocate some memory and call it again to read the sample into that buffer. After that, we call the decompression session to decode the frame.
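The decode call itself is a short sketch like this; the ICMFrameTimeRecord fields and the non-scheduled display time flag here are from my reading of ImageCompression.h, so treat them as assumptions and check the header (frameData and its timing values are whatever GetMediaSample2 returned):

```c
#include <QuickTime/QuickTime.h>
#include <string.h>

// Sketch: push one compressed frame, read with GetMediaSample2, into the
// decompression session. The tracking callback will be called with the
// decoded pixel buffer.
static OSStatus DecodeOneFrame(ICMDecompressionSessionRef session, Media media,
                               const UInt8 *frameData, ByteCount frameSize,
                               TimeValue64 decodeTime, TimeValue64 decodeDuration,
                               void *frameRefCon)
{
    ICMFrameTimeRecord frameTime;
    memset(&frameTime, 0, sizeof(frameTime));
    frameTime.recordSize = sizeof(ICMFrameTimeRecord);
    *(TimeValue64 *)&frameTime.value = decodeTime;   // 'value' is a wide; store the 64-bit time
    frameTime.scale    = GetMediaTimeScale(media);
    frameTime.rate     = fixed1;
    frameTime.duration = (long)decodeDuration;
    // We'll trigger emission ourselves rather than scheduling against a timebase
    // (see the non-scheduled display time API mentioned later in the session).
    frameTime.flags    = icmFrameTimeIsNonScheduledDisplayTime;

    return ICMDecompressionSessionDecodeFrame(session, frameData, frameSize,
                                              NULL /* frame options */,
                                              &frameTime, frameRefCon);
}
```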
Now, note one important thing here. Because we're working at the low level, it's our responsibility to make sure that the decoder is always primed with the right state for doing the decode of the frame we're going to give it. So, if we find that we're not...
[Transcript missing]
It calls us with a flag that says it's emitting a frame when it's emitting a frame, and when that happens, we wrap that frame as a CGImage, and then we use an HIView utility to draw that CGImage with the correct orientation. When we get the message that the Image Compression Manager no longer needs that source data to be held around, we can free the buffer. So that's it for this demo. Let's go back to slides.
Like I said, if you use the low-level APIs, it's your responsibility to ensure that frames are decoded in the right order, and that means that if you're jumping around, you may need to go back to the keyframe and decode ahead. We call this catch-up. Also, you may need to handle frame reordering.
Now, in this application, we are deliberately showing the frames in decode order because we're trying to show what the internal structure of the movie is like. Not all applications are going to do that. Normally you want to extract the frames and pull them out in display order. If all you want is pixels, then you should be using the high-level API to access the decompressed frames.
And those APIs are the visual context APIs. If you're trying to go towards OpenGL processing, then what you want is the OpenGL texture context. If you're going to do more processing on the CPU, then you want pixel buffers, and so you want to use the pixel buffer visual context instead.
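If the pixel buffer route is what you want, the hookup is a short sketch along these lines (the attribute dictionary here is a hypothetical minimal one; error handling is trimmed):

```c
#include <QuickTime/QuickTime.h>
#include <CoreVideo/CoreVideo.h>

// Sketch: create a pixel buffer visual context and attach it to a movie, so
// that decoded frames arrive as CVPixelBuffers on the CPU side.
static OSStatus AttachPixelBufferContext(Movie movie, QTVisualContextRef *contextOut)
{
    OSType pixelFormat = k32ARGBPixelFormat;
    CFNumberRef fmt = CFNumberCreate(NULL, kCFNumberSInt32Type, &pixelFormat);

    CFMutableDictionaryRef pbAttribs = CFDictionaryCreateMutable(
        kCFAllocatorDefault, 0,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);
    CFDictionarySetValue(pbAttribs, kCVPixelBufferPixelFormatTypeKey, fmt);

    CFMutableDictionaryRef ctxAttribs = CFDictionaryCreateMutable(
        kCFAllocatorDefault, 0,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);
    CFDictionarySetValue(ctxAttribs, kQTVisualContextPixelBufferAttributesKey, pbAttribs);

    OSStatus err = QTPixelBufferContextCreate(kCFAllocatorDefault, ctxAttribs, contextOut);
    if (err == noErr)
        err = SetMovieVisualContext(movie, *contextOut);

    CFRelease(fmt); CFRelease(pbAttribs); CFRelease(ctxAttribs);
    return err;
}
```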
It's time to talk about the low-level image compression APIs. Once again, sometimes the high-level APIs for movie export aren't applicable to you. For example, the movie export components always create a new file to write out their content, a new movie file, a new MP4 file or whatever. Well, if what you're trying to generate isn't a file, then that might not be what you want. So you may have to go down and use the low-level compression APIs yourself.
There's a new API in QuickTime 7 called the compression session. And when you create this, you pass in the width, the height, the codec type, and a session options object. Then you push in pixel buffers with display times, in display order. Your callback is called with the encoded frames in decode order, so frames may be reordered during compression. For that reason, you're not going to get the compressed frames returned to you immediately; they might be returned to you after some delay. This delay, this queue, is called a look-ahead window.
It also gives the codec an opportunity for further optimized compression. So I'm about to show a demonstration of compressing live video from a camera to H.264, similar to what you get with the new movie recording feature in QuickTime Player Pro. What we're doing is we're going to have RGB frames stored in pixel buffers, and then we'll pass those to a compression session, which will return us H.264 frames that we're going to drop into a new movie.
How do we get those frames? Well, I took advantage of some sample code that was already there. I took the, I think it was the Son of MungGrab sample code, which shows you how to use a sequence grabber to get frames in. But I adjusted it. I made it use decompression sessions to decode the frames from whatever format the camera gives you into RGB.
While they're in RGB format, we can also do other things. In this sample code I'm showing how to draw on top of these frames with a Core Graphics bitmap context. And for some variety, we're also going to wrap them as a CGImage in order to display a preview. Now this isn't the only way to display previews during capture. In fact, it's probably more efficient to use OpenGL to display them.
There's also some sample code around that was demonstrated yesterday in the audio session about capture. That shows how to use OpenGL to display a preview. So this is an alternative for variety. So, to help me with this demo on the demo machine, I'd like to bring up QuickTime's Mr. Canada, Tim Cherna. A big hand, please, for Mr. Cherna.
[Transcript missing]
Okay, let's open this movie back in movie video chart to prove that the frames are being reordered, which you can only get if you use the new compression APIs. Here's a whole lot of Tims. And here's a flag. So you can see that the frames are being displayed in a different order from the order they were decoded. Great. Let's have a look at the code.
Once again, we create a decompression session by constructing a dictionary that has the width and the height and the pixel format. We also indicate that we want this pixel buffer to be compatible with bitmap contexts and CGImages. And we provide the image description, which came to us from the sequence grabber, and our tracking callback. Now we construct the compression session by first creating the session options object. Now if you were looking at this on your PowerBook at the moment, you'd probably see a bit more text. I've removed some of the error checking just so it fits on the slide.
There's a bunch of options that it's very important that we set. We must set the flags to say that we allow temporal compression and frame reordering, or we won't get B-frames. We set the keyframe interval because it's nice to know what it can be, but we're not required to set that. Because we want to store the frames in a movie, we need to know their decode durations, and so we set the flag that says durations are needed.
Now we're only going to give the ICM display timestamps, but it's going to derive out of that all of the durations as well. We tell the ICM it's okay to drop frames by setting the allow frame time changes flag, and we also set the data rate. Once again, just like with the decompression session, we give it a callback that's going to be called with our encoded frames.
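Putting those options together, the setup is roughly this sketch (error checks trimmed; the data rate property's exact type and units should be checked against the ImageCompression.h header, and the numbers here are placeholders):

```c
#include <QuickTime/QuickTime.h>

// Sketch: create an H.264 compression session configured as just described.
static OSStatus CreateH264Session(int width, int height, TimeScale timescale,
                                  SInt32 averageDataRate,
                                  ICMEncodedFrameOutputCallback outputCallback,
                                  void *outputRefCon,
                                  ICMCompressionSessionRef *sessionOut)
{
    ICMCompressionSessionOptionsRef options = NULL;
    OSStatus err = ICMCompressionSessionOptionsCreate(kCFAllocatorDefault, &options);
    if (err) return err;

    // Without these two, the codec is not allowed to produce P- or B-frames.
    ICMCompressionSessionOptionsSetAllowTemporalCompression(options, true);
    ICMCompressionSessionOptionsSetAllowFrameReordering(options, true);

    ICMCompressionSessionOptionsSetMaxKeyFrameInterval(options, 30);
    ICMCompressionSessionOptionsSetDurationsNeeded(options, true);       // we'll store frames in a movie
    ICMCompressionSessionOptionsSetAllowFrameTimeChanges(options, true); // frames may be dropped

    // Target data rate, set as a property on the options object.
    ICMCompressionSessionOptionsSetProperty(options,
        kQTPropertyClass_ICMCompressionSessionOptions,
        kICMCompressionSessionOptionsPropertyID_AverageDataRate,
        sizeof(averageDataRate), &averageDataRate);

    // The output callback receives the encoded frames, in decode order.
    ICMEncodedFrameOutputRecord output;
    output.encodedFrameOutputCallback = outputCallback;
    output.encodedFrameOutputRefCon   = outputRefCon;
    output.frameDataAllocator         = NULL;

    err = ICMCompressionSessionCreate(kCFAllocatorDefault, width, height,
                                      kH264CodecType, timescale, options,
                                      NULL /* source pixel buffer attributes */,
                                      &output, sessionOut);
    ICMCompressionSessionOptionsRelease(options);
    return err;
}
```

Source frames are then fed in with ICMCompressionSessionEncodeFrame, carrying display time stamps in display order.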
So, the sequence grabber works by calling the dataproc function that we provide it, and it calls us with each frame. Once we've made sure that we have a compression session and a decompression session, we decode these frames. In this case I'm using an API called non-scheduled display time, which is what you'd use if the frames had been reordered. Here it's just the way I'm triggering those frames to be actually displayed, well, not displayed, output to our callback function. So here's the callback function for the decompression session. Once again it gives us different flags.
When it tells us it's emitting a frame, then we're going to take that pixel buffer, we're going to draw on top of it using a bitmap context, and then we wrap it in a CGImage and place it in an HIImageView inside the preview window. We also then feed the frame to the compression session.
The compression session has another callback. Here's our callback from that session. Now when we get the first frame, we have to create a new media to store the frames. And it's convenient to do that once we get the first frame, because that's when we get the image description, and we can use the information in the image description to work out what the dimensions of the media should be.
And then we add samples to the media. There's actually a utility here I'm pointing out which adds media samples directly from the encoded frame object. When we're told to stop, when I click the close button in the window, or quit, we tell the compression session to complete any frames that are still in its queue. And then we release the object.
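Sketched out, such an output callback looks something like this; AddMediaSampleFromEncodedFrame is the utility just mentioned, and the MyRecordingState struct and the media-creation step are stand-ins for the real sample's state handling:

```c
#include <QuickTime/QuickTime.h>

// Hypothetical per-recording state; the real sample keeps more than this.
typedef struct { Media media; } MyRecordingState;

// Sketch: receive encoded frames from the compression session, in decode
// order, and append them to the media.
static OSStatus MyEncodedFrameOutput(void *outputRefCon,
                                     ICMCompressionSessionRef session,
                                     OSStatus error,
                                     ICMEncodedFrameRef frame,
                                     void *reserved)
{
    MyRecordingState *state = (MyRecordingState *)outputRefCon;
    if (error) return error;

    // On the first frame, the real code creates the video media here, using
    // the frame's image description to size it. (Omitted in this sketch.)

    // Adds the frame's data, decode duration, display offset and sample
    // flags to the media in one call.
    return AddMediaSampleFromEncodedFrame(state->media, frame, NULL);
}
```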
So once we've done that, we have a movie that has a media that's full of frames, but the movie's still empty. Why is that? Because the edit list hasn't been touched. The edit list is still empty. So it's important to insert all of that media into the track so that we have an edit that includes all of that media.
It's easy to forget this, but if you do, you'll have an empty movie and you'll know what to do. Then the only thing left to do is to write out the movie header and to close the file. This sample code uses the movie storage APIs. These are a replacement for the FSSpec-based APIs.
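Those finishing steps amount to a short sketch like this (the movie, track, media and the data handler from CreateMovieStorage are assumed to exist from earlier in the capture code; error handling is trimmed):

```c
#include <QuickTime/QuickTime.h>

// Sketch: give the track an edit covering all of the captured media, then
// write the movie header out through the movie storage APIs.
static OSErr FinishRecordedMovie(Movie movie, Track track, Media media,
                                 DataHandler dataHandler)
{
    // Without this insert, the media is full but the movie's edit list is empty.
    OSErr err = InsertMediaIntoTrack(track,
                                     0,                        // track start time
                                     0,                        // media start time
                                     GetMediaDuration(media),  // the whole media
                                     fixed1);                  // normal rate
    if (err) return err;

    // Write the movie header into the storage we created with CreateMovieStorage,
    // then close the file.
    err = AddMovieToStorage(movie, dataHandler);
    if (!err)
        err = CloseMovieStorage(dataHandler);
    return err;
}
```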
The movie storage APIs use data references, and we have a bunch of useful utilities to construct data references from FSRefs, from CFURLs, from CFStrings, and from other kinds of things as well. So that's it for that demo. Let's go back to slides. And for me, another drink of water.
Okay. Once again, in order to get the advantages of H.264, you must use the new low-level compression APIs if you're doing compression at the low level. The high-level API, the movie export, already does this for you, but if you're at the low level, you've got to use the new APIs. The good news is that the new APIs support all codecs, new and old.
One more point on this. While we're doing that real-time compression there, there's a whole lot of pixel buffers that are being created and destroyed. Mapping and unmapping large pieces of virtual memory involves some per-page kernel overhead. And that can be quite a lot when all those frames are high def and quite a lot when you're doing 30 of them or 60 of them per second.
So it's important to have an efficient recycling mechanism so that you're not constantly mapping and unmapping memory over and over. Core Video provides a mechanism called the pixel buffer pool, which helps you reuse and recycle buffers efficiently. It's recommended that you use it so that, like I said, you're not always unmapping and paying kernel overheads. And in this sample code, if you look at it, you won't actually find the word "pool" in it at all. But we are taking advantage of pixel buffer pools in this sample code, because decompression sessions use pixel buffer pools to create the buffers that they return to us.
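The same pooling is available on the compression side, which is what the next paragraph recommends; a rough sketch, assuming the compression session already exists:

```c
#include <QuickTime/QuickTime.h>
#include <CoreVideo/CoreVideo.h>

// Sketch: recycle source buffers through the compression session's own pixel
// buffer pool instead of allocating a fresh CVPixelBuffer for every frame.
static CVPixelBufferRef CreateSourceBuffer(ICMCompressionSessionRef session)
{
    CVPixelBufferPoolRef pool = ICMCompressionSessionGetPixelBufferPool(session);
    CVPixelBufferRef buffer = NULL;

    if (CVPixelBufferPoolCreatePixelBuffer(kCFAllocatorDefault, pool, &buffer)
            != kCVReturnSuccess)
        return NULL;

    // ... fill the buffer with the next frame's pixels and encode it ...
    // Releasing the buffer (once the session is done with it) returns the
    // backing memory to the pool for reuse rather than unmapping it.
    return buffer;
}
```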
If you take that code and you change it and you just take the compression session stuff with you, then you should consider using a pixel buffer pool to create your source pixel buffers. There's actually a utility as part of compression sessions that helps you do that. So, one more piece: we're going to talk about writing codecs.
If you write a codec, generally that means that you need to write two components, an image compressor and an image decompressor. Now we have a brand new, bold new image compressor API for components in QuickTime 7. We have made more modest changes to the decompressor API in QuickTime 7. And I'm going to show you on the demo machine a brief survey of the example codec.
Now this is a well-commented piece of code. It should show you how to use the interface well. But it's not a very good codec. To put a better spin on it: you can do better. It's a very simplistic encoding algorithm. It just encodes the high bits first, and when it runs out of bits, it stops. Well, one benefit of this is it's very easy to see the loss. Loss means you're rounding down by some amount, and that means that images get darker, or, because we're doing it in YUV, greener.
So let me show you how to use this. I've already built this, and so I'm going to install it in /Library/QuickTime. /Library/QuickTime is the place where you'll normally install your codec components. They can also be installed in ~/Library/QuickTime in your home directory. So now that I've installed this, I can open up one of those clips that we saw before. Here's Jim.
Okay, I like it better when he's smiling. So now that I've added that codec, assuming that I quit the player--let's quit the player. You have to quit the player, and now when I relaunch it, it'll see the new codec. If you don't see your new codec, try quitting. Let's export again.
There it is, example IPB. So, at top quality we don't see any loss as we drag things down. Loss means dark and green and brooding and kind of messed up. Well, you can do better, like I said. Well, I've already compressed this one. I cooked one before the show. And here it is.
You saw a flicker there. The better-encoded frames are the B frames. The ones that look poorer are the I frames, because they had to start from scratch. The B frames could use the other frames as a better starting point. We could take a look at this in Movie Video Chart. If I quit the player, I can open it in Movie Video Chart, and you can do all of this yourself and confirm to yourself that it's using B-frames. You can see the frames are reordered there.
Okay, let's have a quick go with the source. Now, I've separated out the naive encoding logic, which is in these naive files here. You'll replace those with your very smart algorithms. I've separated that from the interface. Let's look at the important functions first in the new compressor API.
There are three main functions. In Prepare to Compress Frames you return another dictionary that describes the pixel buffers that you want to get as your source. For example, you want to say what the width and height you want are, and you can ask for extra pixels to be allocated for scratch memory on the right and bottom.
In this case we allocate memory to round up to 16 by 16 macroblocks. We can actually now allocate extra memory on the top and left, which may help if you're doing something called unrestricted motion compensation, which means you can have motion vectors from outside the picture. It's one of those geeky codec things. You can also ask for the bytes per row to be a multiple of a particular number. If you're using AltiVec or SSE code, it's likely that you want to ask for at least a multiple of 16 for the bytes per row.
Another thing you can ask for is a list of input pixel formats that you want to support. And this codec asks for our standard Y'CbCr 4:2:2 pixel format, whose pixel format type is called '2vuy'. This is a new feature of the new codec API. Previously you had to implement RGB input. We were sick of everyone having to write their own RGB-to-YUV converters, so now we provide some standard ones.
Another element that goes in this dictionary: here we're saying what the gamma level is; we ask for the video gamma level. Also, the color properties, in particular the Y'CbCr-to-RGB matrix used for standard-def and high-def video, are different, so we specify which of those we're going to use. We're using the standard-def ones in this example codec. Apart from creating that dictionary, PrepareToCompressFrames is also the place where it's appropriate to set any extra information on the image description. Here we're setting the gamma level to 2.2.
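Here's a sketch of the kind of attributes dictionary PrepareToCompressFrames hands back (built with the Core Video pixel buffer attribute keys; this version leaves out the gamma and color-matrix keys the sample also sets):

```c
#include <QuickTime/QuickTime.h>
#include <CoreVideo/CoreVideo.h>

// Sketch: build the source pixel buffer attributes a compressor returns from
// its PrepareToCompressFrames call.
static CFMutableDictionaryRef CreateSourceAttributes(int width, int height)
{
    CFMutableDictionaryRef attribs = CFDictionaryCreateMutable(
        kCFAllocatorDefault, 0,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

    // Frame dimensions.
    CFNumberRef w = CFNumberCreate(NULL, kCFNumberIntType, &width);
    CFNumberRef h = CFNumberCreate(NULL, kCFNumberIntType, &height);
    CFDictionarySetValue(attribs, kCVPixelBufferWidthKey, w);
    CFDictionarySetValue(attribs, kCVPixelBufferHeightKey, h);

    // Extra scratch pixels on the right and bottom so the image rounds up
    // to whole 16x16 macroblocks.
    int extraRight  = (16 - width  % 16) % 16;
    int extraBottom = (16 - height % 16) % 16;
    CFNumberRef er = CFNumberCreate(NULL, kCFNumberIntType, &extraRight);
    CFNumberRef eb = CFNumberCreate(NULL, kCFNumberIntType, &extraBottom);
    CFDictionarySetValue(attribs, kCVPixelBufferExtendedPixelsRightKey, er);
    CFDictionarySetValue(attribs, kCVPixelBufferExtendedPixelsBottomKey, eb);

    // Row byte alignment, handy for AltiVec/SSE code.
    int alignment = 16;
    CFNumberRef align = CFNumberCreate(NULL, kCFNumberIntType, &alignment);
    CFDictionarySetValue(attribs, kCVPixelBufferBytesPerRowAlignmentKey, align);

    // The input pixel format we accept: 8-bit Y'CbCr 4:2:2 ('2vuy').
    OSType pixelFormat = k422YpCbCr8PixelFormat;
    CFNumberRef fmt = CFNumberCreate(NULL, kCFNumberSInt32Type, &pixelFormat);
    CFDictionarySetValue(attribs, kCVPixelBufferPixelFormatTypeKey, fmt);

    CFRelease(w); CFRelease(h); CFRelease(er); CFRelease(eb);
    CFRelease(align); CFRelease(fmt);
    return attribs;
}
```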
The encode frame function is called for each source frame. You don't have to encode it immediately. You can build up a little queue of frames. In this case, we're using a C array to store that queue, but you can store it however you want. You can use core foundation arrays because these objects are core foundation retain counted objects. You can use STL if you like. I don't care.
But you probably want to have some threshold beyond which you decide that, okay, I've got enough frames, I'll start encoding, or else you could end up starting to swap the frames out because you'll have so many sitting there. So EncodeFrame is called for each source frame in display order. CompleteFrame is called to say it's time to finish encoding a given frame. You don't have to return that one first, but you must encode and emit it, or drop it, before this function returns.
So there's a function here that decides which frame to encode next. I'm not going to go into that in detail. I'll just say that it does implement an MPEG-2 IPB pattern. You could modify it to support other patterns if you want that kind of flexibility. When you decide it is time to encode a frame, you create a buffer with the ICMEncodedFrameCreateMutable function, and you pass it the worst case, the largest data size you might need. Then you fill in as much as you need.
[Transcript missing]
The other choice is that you could drop a frame. This codec doesn't actually drop frames, but if you look in the header file you'll see how to do that.
Let's have a look at the important functions in the decompressor API. There are five functions that are important. In Initialize, you set some basic flags to describe the basic characteristics of your codec. In this case we say that we support B-frames, that is, out-of-order display times.
We also say that we are multi-buffer aware. We get high performance playback if we are able to decode each frame to a separate buffer. If you don't, you'll lose some performance because we may need to make extra copies of the same buffer to upload them to OpenGL, for example.
In Preflight, you negotiate the output pixel format. You can also ask for extra scratch space at the bottom and right of the buffer; we rounded up to a multiple of 16 by 16. This is also a convenient time to allocate internal buffers.
In the beginBand function, for most codecs, band is equivalent to frame, so you could think of this as beginFrame. In the beginBand function, it's time to classify the frame. You need to read as much of your frame header as is necessary to classify the frame as a keyframe, a difference frame, or a droppable frame. This is how the Image Compression Manager is able to know which frames to drop when we are low on CPU.
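As a fragment-style sketch of that classification step (the frameType field and the kCodecFrameType constants are from my reading of ImageCodec.h, so verify them against the header; isKeyFrame and isDroppable stand in for whatever your own frame header parsing determines):

```c
// Inside ImageCodecBeginBand: tell the ICM what kind of frame this is, based
// on our codec's own frame header. drp is the ImageSubCodecDecompressRecord.
if (isKeyFrame)
    drp->frameType = kCodecFrameTypeKey;                  // sync sample
else if (isDroppable)
    drp->frameType = kCodecFrameTypeDroppableDifference;  // nothing depends on it
else
    drp->frameType = kCodecFrameTypeDifference;           // ordinary difference frame
```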
The DecodeBand function is new for QuickTime 7 in order to support B-frames and frame reordering. It's called to decode a frame that isn't the next frame to display. It's up to you. You can also choose to have it called for all frames if you want. Finally, the drawband function is called when it's time for you to output a frame to an output buffer. So that's that sample code. We'll return to slides and recap this.
Three important functions for the compressor: Prepare to compress frames, which is where you describe the source pixel buffers. Encode frame, which is called with each source frame so you can put it in your queue. Complete frame, time to put up or shut up. Five important functions for the decompressor: Initialize, report basic characteristics about your codec and what it does.
Preflight, where you negotiate the output pixel format. BeginBand, where you classify the frame as a keyframe, a droppable, whatever. DecodeBand, where you decode a frame to an internal buffer. And DrawBand, where you write a frame to an external pixel buffer. So it's now time for the Thread Safety Pledge. If you would all please raise your right hand. Repeat after me: I promise that every new component I write shall be thread safe.
It's the 21st century. It's time that we were all writing thread-safe code. Avoid using unprotected global variables. Use the instance storage to store your state. Avoid thread-unsafe APIs like the old resource manager calls. Test your codec to make sure that it's thread-safe and then ship it with the component thread-safe flag.
This excludes UI. None of our UI frameworks are actually thread-safe. The thread-safety flag refers to the core processing that your component does. There's a good tech note on thread-safe programming in QuickTime that I recommend all of you have a look at, especially if you're component authors. And another topic about components: you are a child of the universe, and your component should be too.
Now Rosetta does not include a mixed mode the way we had with 68K and PowerPC. If you're inside a PowerPC application you will only be able to use PowerPC components. Intel-based applications will only be able to see Intel-based components. So it's important that your component is universal and contains both kinds of code. There are some instructions on the universal binary page. They give you a fairly aggressive approach, a very clever way of doing things, so that the resource template changes dynamically based on which architectures you are compiling for with your Xcode project.
It isn't necessary to do all of that macro cleverness if you just want to always build universal components. In that case all you have to do is put both platform codes in the platform list. If you remember back to 68K and PowerPC components, it's just like that.
So, in summary: wake up. We've looked at a whole bunch of sample code. We've looked at the high-level APIs for movie editing and for export. We've looked at the low-level APIs, which give you access to sample-level information, for decompressing frames and compressing them. And we've shown you the new APIs and the new example for writing your own codecs. If you use the high-level APIs, all of that B-frame trickiness will be hidden from you. Or you can choose to use the low-level APIs and QuickTime will expose it to you in all its glory.
We have a lab where there are going to be lots of QuickTime engineers who are waiting to solve your problems. Not just QuickTime, but also a bunch of other graphics and media engineers. You are here, and if you go out there and go around the corner, around the back, you'll find the lab and it's open until some hour each day.