Graphics and Media • 1:08:35
QuickTime 7 supports advanced video coding formats such as H.264. With this major advancement you'll need to take a new approach when creating, editing and working with media samples. Learn how to handle H.264 and other advanced formats in your application, including how to leverage new APIs for compression, decompression, and synchronization.
Speaker: Sam Bushell
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it may contain transcription errors.
Good afternoon. My name is Sam. Hi. In this session, we're going to talk about supporting the latest and greatest video compression formats in your application. We're going to start with a brief bit of technology background, and then we're going to talk about how to access this stuff in your applications using our most modern APIs. We're going to support this with a lot of sample code, which is already available for you to download. If you go to the page for this session, you can log in and download all that stuff right now and look at it while we're talking about it.
We have a great new codec in QuickTime 7 and Tiger. It's called H.264. It's a standards-based codec that's the result of a joint effort by two major standards bodies with a bunch of experts, an international collaboration, in particular between the experts from ISO, the International Organization for Standardization, which brought us the MPEG codecs, and the ITU, the International Telecommunication Union, which developed the H.261 and H.263 video conferencing codecs. It's been chosen for a number of standards, and it's supported by a number of major players as well as Apple. It's a very flexible codec. We talk about it being scalable. By that we mean that it's a great choice at both the low end of the spectrum and the high end, from tiny little cell phone video all the way up to high definition.
It's got a lot of names. You'll see people calling it AVC, which stands for Advanced Video Coding. You'll see people calling it JVT, which stands for Joint Video Team, which refers to its joint heritage. You'll see it called MPEG-4 Part 10, because that's the name of the standard. Apple uses the name H.264, which is another of the choices. All of these refer to the same thing. So this is our new video codec in QuickTime 7. We put a lot of energy into making it work. I'd like to give you a brief demo of it over on this wonderful demo machine here. "For 20 days and 20 nights, the Emperor Penguin will march to a place so extreme, it supports no other life."
"Excitement, adventure, and really wild things." So this is one of the videos from our site. If you have QuickTime Player Pro, you can export to H.264 yourself. It's under Movie to QuickTime Movie, and if you click Options, there's another dialog that has a list of settings, and in the video settings, H.264 is one of the choices in the compressor list. And down here, if you've seen this dialog before, you'll see that we've rearranged it a bit in order to support our new features. A major new feature for H.264 encoding is a feature called intelligent multi-pass encoding. This is what we use to get the best quality in our encodes.
Let's come back to slides. Since this is the year of high definition, I wanted to point out that Apple actually has been delivering quite a number of codecs which are high def ready, as well as H.264, which is our premier delivery codec. This year in Final Cut Pro 5, Final Cut Pro Studio, we delivered native support for HDV. HDV is an MPEG-2 19 megabit IPB-based format. It's constant bit rate. It's supported by a new breed of FireWire cameras, of HDV cameras from some manufacturers. And we have native support for this in Final Cut Pro 5. iMovie HD and Final Cut Express HD support HDV by converting it to an intermediate format called the Apple Intermediate Codec.
But these aren't the only ones. We've also supported a format called DVCPRO HD, a 100-megabit DV-style, DCT-based high-def format, since Final Cut Pro 4.5 last year. In Panther, we introduced Pixlet. And for some time, we've supported uncompressed HD playback if you have a machine with sufficient I/O bandwidth. These movies have exceptionally high data rates, up in the hundreds of megabytes per second. Now, the last four of these codecs are I-frame only. H.264 and HDV use B-frames, and some of you may be saying, "What's that again?" So let's have a little bit of technology background on what B-frames are, why you might need to worry about them, and why they're important.
Video codecs present you with a trade-off between bit rate and quality. If you want better quality, you generally have to use more bits. If you want to reduce the bit rate, you have to sacrifice the quality to some extent. So there's this curve, and we're constantly trying to push that curve towards better quality at lower bit rates. We do this by adding more tricks.
The smart people who write codecs come up with more tricks. Some of those tricks are entirely inside the codec, but some of them involve aspects that need to be taken care of outside the codec, by other parts of the system that deal with compressed frames.
Suppose you want to compress some video. Here's a clip of me parking my car. Prosaic. Well, you can compress each frame independently. These are called I-frames, or keyframes, or sync samples. In this case, every frame is self-contained, so random access is fast, but the bit rate isn't too great, because we're not taking advantage of similarity between adjacent frames. If you look at these two frames, the building and the tree and the road are pretty much the same thing in the same place. And the car mostly looks the same; it's just moved a bit.
So you can improve compression performance substantially by using one frame as a base for encoding another. In the example I'm illustrating here, we're describing frame 5 in terms of frame 4 as a base. First, we describe the areas that are similar to frame 4. The orange area in this diagram is the piece that's the same pixels in the same place as in frame 4. The green bit in the middle is where we have the same pixels, but they've moved over from where they were in frame 4. This is called motion compensation.
Now, that doesn't completely describe the scene. There's still a fix-up that has to be made, because not everything is completely stationary. The reflection on the car doesn't move with the car; it's got to be adjusted. And the tire is moving, so you have to adjust the image for that as well. This fix-up image is called the residue. Now, there's a whole strip of the image that wasn't in frame 4 because it was out of the picture; it's just moved in. This strip has to be coded from scratch by the residue. So this is what it looks like if we encode the last five frames of those six using difference frames, that is, using motion compensation. Difference frames use information from a previous frame.
They're also known as P-frames, for predicted. This gives you better compression, because motion compensation is extremely compact relative to describing something from scratch. Another way of looking at it is to say that each of these frames depends on the previous one, because each frame is described in terms of the previous one as a base. That means that if you want to display frame 6, you have to start with frame 1 and then decode all of the ones in between.
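To make that random-access cost concrete, here's a minimal Python sketch. This is purely illustrative; the function name and frame-type list are hypothetical, not any QuickTime API. In an I/P-only stream, displaying a frame means decoding forward from the nearest preceding I-frame.

```python
# Hypothetical sketch: counting how many frames must be decoded to
# display a given frame in an I/P-only stream. Each P-frame depends on
# the frame before it, so we must decode forward from the nearest
# preceding I-frame (sync sample).

def frames_to_decode(frame_types, target):
    """frame_types: list like ['I', 'P', 'P', ...]; target: 0-based index."""
    # Walk back to the nearest I-frame at or before the target.
    start = target
    while frame_types[start] != 'I':
        start -= 1
    # Everything from that I-frame through the target must be decoded.
    return target - start + 1

# Six frames encoded as one I-frame followed by five P-frames,
# as in the parking-lot example:
types = ['I', 'P', 'P', 'P', 'P', 'P']
print(frames_to_decode(types, 5))  # → 6
print(frames_to_decode(types, 0))  # → 1
```

With I-frames only, the answer would always be 1; the chain of P-frame dependencies is exactly what makes random access slower.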
So we call this IP, when you have I-frames and P-frames. It gives you better compression than I-frames only, but random access can be slow, because, for example, to get to that last frame, you had to decode all six. It could take some time. Also, another thing to note is that images that appear gradually have to be constructed incrementally. Like the last image in that sequence: when you see it, it's been constructed from strips in five different frames. That might not always be the most efficient way to do things. So suppose for a moment that we encoded the first frame in that sequence as an I-frame, self-contained, and then went and encoded the last frame in the sequence as a P-frame based directly on frame 1. Then we could pick a frame in the middle, and we could encode it using motion compensation partly from the frame before it and partly from the frame after it.
Well, then you can see we can encode it almost completely using motion compensation. And in fact, we can do that for a whole bunch of frames in between those, mostly encoding them using motion compensation. These are called B-frames; B stands for bidirectional prediction. B-frames refer to information from frames that will be displayed in the future. They may also use information from a previous I or P frame. As you can see, the bit rate has improved.
Using B-frames as well as I- and P-frames gives you better compression when objects appear gradually. It also gives you somewhat faster random access: to decode and display any of the frames in the middle from a standing start, you only have to decode the frames at the ends, not any of the others.
Also, other frames can be skipped when, for example, you don't have enough CPU to play at full rate, or you're playing fast forward. The jargon for this is temporal scalability. One weird thing that happens when you have B-frames, though, is that the order the frames are decoded in is not the same as the order the frames are displayed in, because the codec isn't able to predict what's happening in the future; it has to use information that it has already decoded. This reordering is why other parts of the system need to understand about B-frames in order to support them.
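That reordering can be illustrated with a hypothetical sketch (not QuickTime code): the decoder receives frames in decode order, and the presentation layer uses the display times to put them back into display order.

```python
# Hypothetical sketch: each tuple is (frame type, display time), listed
# in decode order. In an I B B P pattern, the P-frame must be decoded
# before the two B-frames that borrow information from it, even though
# it is displayed last.
decode_order = [('I', 0), ('P', 3), ('B', 1), ('B', 2)]

# Sorting by display time recovers the order the viewer actually sees.
display_order = sorted(decode_order, key=lambda frame: frame[1])
print([frame_type for frame_type, _ in display_order])  # → ['I', 'B', 'B', 'P']
```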
So some of you are saying that you knew all of that already, because you've been doing some work with MPEG, and it's all familiar to you. Well, there is a twist. You may know that you can implement this kind of semantic using what's called a one-frame delay: you build a small finite-state machine, and you do certain transitions when you see that the frame type is I or P or B, and that completely defines the dependency structure. Now, this is true for MPEG-2's IPB pattern, because only one future frame can ever be held at one time.
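That claim can be sketched in a few lines of Python (hypothetical and illustrative only, not a real decoder): for an MPEG-2-style stream, the I/P/B letters alone are enough to reconstruct every frame's dependencies, because at most one future reference frame is ever pending.

```python
# Hypothetical sketch of the MPEG-2-style one-frame delay: each B-frame
# depends on the reference frames (I or P) on either side of it, and
# each P-frame depends on the previous reference. Tracking just the
# last two references seen in decode order is sufficient.

def mpeg2_dependencies(letters):
    """letters: decode-order string like 'IPBB'. Returns (index, deps) pairs."""
    deps = []
    last_ref = None      # most recently decoded I or P frame
    prev_ref = None      # the reference frame before that
    for i, letter in enumerate(letters):
        if letter == 'I':
            deps.append((i, []))                  # self-contained
            prev_ref, last_ref = last_ref, i
        elif letter == 'P':
            deps.append((i, [last_ref]))          # predicted from last ref
            prev_ref, last_ref = last_ref, i
        else:  # 'B': predicted from the references on either side
            deps.append((i, [prev_ref, last_ref]))
    return deps

# Decode-order pattern I P B B: the two Bs display between I and P.
print(mpeg2_dependencies('IPBB'))
# → [(0, []), (1, [0]), (2, [0, 1]), (3, [0, 1])]
```

The point of the next few paragraphs is exactly that no such small state machine exists for the general case of H.264.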
This is not true for the general case of H.264. In fact, the H.264 standard allows the encoder a lot more flexibility than previous frame patterns. P-frames can use more than one prior frame for decode. B-frames can use more than two prior or future frames. Not all I-frames reset the decoder; we have a new name for the ones that do. These are IDR frames in H.264 terminology, which stands for Instantaneous Decoder Refresh.
P- and B-frames can depend on other B-frames, and some I- and P-frames may be dropped. So where MPEG-2 IPB had a pattern that, if you knew what to look for, was fairly regular, the general case for H.264 can be a whole lot more complicated and unpredictable, and you can't just derive it from the I, P, and B letters.
So instead of trying to store the I, P, and B letters and using that as our basis, we identified that it was important to record the following information per frame. Number one: is it a sync sample? Not every I-frame is a sync sample now, because although you could decode an I-frame alone, based on no other information, the following P- and B-frames may need information from previous frames. Number two: is it droppable? Not all B-frames are going to be droppable now, and some I- and P-frames might be droppable. Number three: in what order are the frames to be decoded? And possibly, you might have some hints about the timing with which they should be decoded. And number four: at what time should each frame be displayed?
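Here's a hypothetical sketch of that per-frame record. The field names and values are illustrative, not QuickTime's actual data structures.

```python
# Hypothetical sketch of the four pieces of per-frame information the
# session calls out (field names are illustrative, not QuickTime's).
from dataclasses import dataclass

@dataclass
class FrameInfo:
    is_sync: bool        # can the decoder start cleanly from here?
    is_droppable: bool   # can it be skipped without affecting others?
    decode_time: int     # position in decode order (optionally with timing)
    display_time: int    # when the frame should appear on screen

frames = [
    FrameInfo(True,  False, 0, 0),   # sync sample, not droppable
    FrameInfo(False, False, 1, 3),   # P-frame decoded early, shown late
    FrameInfo(False, True,  2, 1),   # droppable B-frame
    FrameInfo(False, True,  3, 2),   # droppable B-frame
]

# Dropping every droppable frame must leave a stream that still decodes:
kept = [f for f in frames if not f.is_droppable]
print([f.display_time for f in kept])  # → [0, 3]
```

Note that nothing here is derived from I/P/B letters; each property is recorded explicitly, which is what lets the general H.264 case be represented.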
So the moral of this little background bit is that the dependencies between frames are becoming weirder and weirder, but it's all in the cause of improved compression performance. IPB means that frames are stored and decoded in a funny order, and some parts of the system need to be aware of that. And convenient rules that were true for MPEG-2, like the one-frame delay, don't work in H.264; they're not sufficient to handle the general case.
So let's talk about what this means for QuickTime and for developers like you. Let's take a look at what a video track looks like so that we can describe the API. At the basic level, a video track contains a list of frames. Now, in some of our APIs you'll see we say samples, and in some of our APIs you'll see we say frames. We use "sample" in APIs that need to be more generic, because they might refer to things other than video; they might refer to sound. But in our video APIs, we tend to use the word "frame", and in fact, when we're talking about video, we're using frame and sample interchangeably. Now, each sample has some timing and dependency information, and they're stored in decode order. You'll notice that these frames are in a different order from how you'd expect them to be, because the car is jerking in and out of the scene as it moves along.
In the cases where the decode order is different from the display order, we also introduce information called display offsets, which are how we describe that reordering for the codec. The display offset is simply the difference between the display time and the decode time. Specifically, it's display time minus decode time. So by interpreting the decode times and adding the display offsets, we know when each frame should be displayed. And now we have the frames from that portion of compressed video that can be decoded and displayed in the order that makes sense to a user.
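The arithmetic is simple enough to sketch in a few lines (the durations and offsets here are made-up illustrative values, not from any particular movie):

```python
# Hypothetical sketch: decode times are the running sum of sample
# durations, and display time = decode time + display offset
# (the offset being defined as display time minus decode time).
durations = [125, 125, 125, 125]
decode_times = [sum(durations[:i]) for i in range(len(durations))]
print(decode_times)           # → [0, 125, 250, 375]

display_offsets = [0, 250, -125, -125]
display_times = [dt + off for dt, off in zip(decode_times, display_offsets)]
print(display_times)          # → [0, 375, 125, 250]

# Sorting by display time gives the order that makes sense to a user:
print(sorted(display_times))  # → [0, 125, 250, 375]
```

Frame 2 in decode order here carries a positive offset (decoded early, displayed late), while the two frames after it carry negative offsets, which is the signature of reordered B-frame content.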
Now, in general, those display offsets don't need to be exposed to the user. They don't need to know about them at all. When the user wants to rearrange portions of media, they use a different mechanism: the edit mechanism in QuickTime. Edits describe the segments of video, or other kinds of media, that have been brought into the movie's presentation at various times. So, for example, in this case, if the user decided to trim down to just the end part of that parking sequence and then follow it with a piece where the car's door is open, the next scene, whatever that is, then we would have an edit that described the insertion of that piece of video into the track, into the movie's presentation.
Broadly, our APIs are divided into high-level APIs and low-level APIs. The high-level APIs work on tracks and movies. The low-level APIs work on media and samples. So we have high-level APIs that implement cut, copy, and paste using the clipboard. We also have high-level APIs called track segment editing APIs, which let you insert segments of media from movie to movie or track to track directly, and delete, rescale, and otherwise manipulate bits of movies. These work without using the clipboard. The clipboard belongs to the user, and they get antsy when it gets trashed unexpectedly. So if you're trying to programmatically manipulate media with the high-level APIs, you should use the track segment APIs rather than cut, copy, and paste.
Beneath that, we have the media manipulation APIs, which let you access individual samples. And we have the image compression and decompression APIs, which let you compress and decode video frames. Now, most of the plumbing changes we've needed to make in order to support H.264 have been in these low-level APIs. Specifically, the changes we've made to support H.264 have been to support frame reordering by adding display offsets at various levels of API. We've also started to store new dependency information. We have new sample flags, such as one that says that a given frame is droppable.
Applications that access individual video frames must migrate to our new APIs if they want to access this information in H.264 or other codecs with B frames. We made this a deliberate design decision because we wanted to avoid a situation where a user's content with one of these new codecs would be damaged and messed up by an application that failed to preserve the display offsets.
These API changes are part of an evolution that you'll see happening across the QuickTime landscape. We're moving away from handles. We're moving towards Core Foundation-style objects, because retain counting helps you manage object lifetimes. We're moving away from QuickDraw: in place of PixMaps as a place to store image data, you'll see we're using Core Video pixel buffers. And as part of this plumbing, we're able to do things in a multi-buffer manner. This allows us to do more pipelining and have more asynchrony between the various stages. In place of GWorlds and GrafPorts as a way to describe where a movie should be played, we're moving towards a thing called the visual context, which you would have seen a bunch of detail on in this morning's session 207 on high-performance video.
The visual context gives you advantages like convenient hookups with OpenGL. If you've seen any of the audio sessions, then you know we're moving away from the Sound Manager towards Core Audio, which gives us benefits like high-resolution, high-fidelity, high-definition audio. We're moving away from FSSpecs. We have a bunch of APIs to replace those, which all use data references. This gives us access to Unicode and long file names.
And there are two other transformations that we're going through, which I wanted to touch on. We're moving towards 64-bit time values and towards 64-bit sample numbers. These enable us to have very high time scales and very long movies. Now, those transitions are not complete yet. Not all the plumbing is completed. But you'll notice that we use 64-bit integers in the new APIs, because all the compilers support them natively now. There's no reason for us not to.
There is a tech note, particularly on the FSSpec to data reference transition that you should make in your applications: tech note number 2140. So that's it for theory in this session. The rest is going to be practice. We're going to show you how to do things using our favorite modern APIs, with a bunch of high-level demonstrations and a bunch of low-level demonstrations, and then I'm going to finish by showing you how to write your own codec with the new codec APIs. Now, as I said, these samples are already available for download. You can follow along with us as you like. We won't tell you when to turn the page. To start off the high-level demos, I'd like to introduce, for the first time speaking on stage, David Eldred.
All right. Thanks a lot, Sam. So as Sam has described, we have all of these high-level APIs available to you that will transparently take care of all of the complexities he's been talking about. These APIs, as Sam said, include using the clipboard to do cut, copy, and paste; that's a user-level operation. There are the insert segment APIs; there are both movie and track versions of these, and they're available in both the C and the Cocoa QuickTime interfaces.
Saving movies: you don't have to worry about all of the complexities of B-frames when you're dealing with saving. This includes saving as reference movies, saving as self-contained movies, flattening movies, and exporting movies to the various formats that we support. These are all high-level APIs. So let's go to Demo Machine 1, and we'll take a look at a high-level editing sample.
So here I've got three clips. These are all HD-size H.264 clips. Let's open them in QuickTime Player. Jim, he looks happy. And these are all clips of QuickTime engineers doing what they do best: standing around, talking. But I'd like to edit them into a coherent composition. So I've got a little application here that does that. Very simple. It uses QTKit. And it's so simple, I'm just going to step through it with you.
So the first thing we're going to do is create QTMovie objects to represent the three source movies. Then we create a QTMovie object to represent our target movie, and we set an attribute on it to make it editable. And using the insert segment of movie API, we insert a segment, defined by a QTTimeRange, from our source movie into our destination movie.
And finally, we write out the destination movie. This example is so trivial that I don't believe the sample code is available to you. If you're really interested in getting your hands on it, we can give it to you in the lab. And let's see what we've made. Hey, Jim. What is it this time?
There we are. All right, back to slides. Okay, so this wasn't really intended to be an introduction to QTKit. The real point is that when you use these high-level movie editing APIs, all of the B-frame trickiness is handled for you by QuickTime. In this example, these were all H.264 clips, and they all included B-frames. When those edits were done, QuickTime made sure that all of the samples that the frames we were interested in copying depend upon were brought along in the edits. And there was no need for us to restrict our edits to the I-frames, to the keyframes. Our edits might have fallen on B-frames or P-frames; we don't care.
And we could have edited together anything there. That could be DV content, H.264, MPEG-4, whatever codecs QuickTime supports, or even other track types that QuickTime supports for editing. So there's your first simple example of the high-level APIs. And now we're going to look at an example where we're doing multi-pass export using a custom data source.
So multi-pass export is new in QuickTime 7, as Sam pointed out. It gives you the highest quality H.264 compression possible. In this example, we're going to use the high-level APIs to let QuickTime present its compression UI. Using custom sources, we just provide the source data frames, and QuickTime handles everything else for us: the multi-pass configuration of the codec, the compressor configuration, and writing out the file, with flattening, hinting, everything else.
So how does this example work? First, we're going to create a movie export component. The movie export component has video source procedures and audio source procedures; we just have to hook our movie up to it. In this case, we're going to have a movie playing into a visual context, and that visual context is going to be using Core Image to do some messing around with the video. That's going to be the source for our video source procedure. Since we're video people, we don't care about audio; we're just going to plug the audio directly through.
So, off to this demo. Demo machine one, please. How many of you were in session 207 earlier today? A few of you. So this is going to look familiar to you, and I won't go into great detail about how the visual context is configured and how we're hooking up the Core Image filter. If you have more questions about that, come and see us in the lab. But first, a quick demo of what this application does. So we've got our favorite car driving along, and we've got a Core Image filter whose settings we can change. And I like it a little darker.
Go over here, make it... that's about right. Change the color some. Too much. So this is how I like it. When I click the Export button, we're first prompted for where we want to save the movie. We'll just replace that one. And in the settings dialog, we're bringing up the standard QuickTime compression settings dialog from our application. Here we're going to choose H.264 and multi-pass encoding. And we're going to allow frame reordering; because we're using high-level APIs, we don't care if there are B-frames.
And we go ahead with these settings. As you see, it's doing several passes through the video. For each pass through the video, our data proc is called once for each frame. So if this had 100 frames of video and it did five passes, we'd be called 500 times. Then it opens up our resulting movie in QuickTime Player. Vroom. And there it is. So let's take a look at the code.
So we're going to look at this pretty quickly, since we've seen it before, and of course, come to the lab if you want more details. But we're going to focus on what happens when that Export button is clicked. As you can see, the first thing that happens is we prompt you for the file you want to save this as.
And we open a movie export component. That's as easy as describing the component and opening an instance. There are two procs inside our application, the video track property proc and the video track data proc. We create UPPs from these functions and pass them in to the exporter component we just opened. And since we're video people, we're just going to use the standard audio source procs from the source movie and add those to the movie exporter.
We invoke the standard compression dialog. Nothing too complicated there: we ask the movie exporter that we opened to do it, with MovieExportDoUserDialog. And assuming the user doesn't cancel, we start our export with MovieExportFromProceduresToDataRef. At this point, data begins being pulled from us: the data proc that we provided is going to be called.
And we'll take a quick look at this data proc. This is the procedure that we defined; it simply calls our export frame procedure. Inside that export frame procedure, the parameters that we're called with include a current time. That's the time of the frame it wants the data for. In this application, we just set that time on our movie and task the movie to get it to draw.
We do a little bit of application-specific stuff to get it to render that current time onto the screen, and then it does a read-back from that screen buffer. Since we're applying this Core Image filter and we want you to have visual feedback, we're rendering this to the screen. Once we've read back that data, we just fill in the parameters that are provided as part of this call and return the data. That's it. That's how easy it is to use movie export procedures. So, let's go back to slides.
So there are a few things to note here. When you're using custom source procedures like this and doing multi-pass compression, you must return the same data for a given frame on each pass through the movie. That's critical; otherwise you'll throw things off. And if you've got very slow rendering, you may want to consider exporting to an uncompressed intermediate movie first.
That will make the entire process a little quicker, since it has to do several passes through the movie, and if you have very slow processing, that can take a while. It can also be handy if the manipulations you're doing on the video are for some reason non-deterministic and you can't guarantee you'll return the same thing: render to an intermediate movie first. So with that, I'm going to bring Sam back in to dive into the low levels. Where has he gone? There he is. Thanks a lot. I need a drink.
Thank heavens. So some apps need to dip below those high-level APIs and access things directly at the frame level. For example, if you want to do more complicated, more advanced editing, if you want to use effects, if you need fine-grained control, all sorts of reasons, you may need to go down to those low-level APIs. Let's talk about them.
To access one frame at a time, you should call the new API GetMediaSample2. This is a replacement for an old API, GetMediaSample. To get information about multiple frames at once, you should call CopyMediaMutableSampleTable. This is a replacement for a suite of APIs with names like GetMediaSampleReference, GetMediaSampleReferences, and GetMediaSampleReferences64.
That suite of old APIs used an array of C structs that you would allocate. Each time we changed the API, we had to introduce a whole new struct, and you had to migrate all of your code over to the new struct; it was a bit of a mess. The new API gives you an opaque object called a QTSampleTable, which you get the information out of through accessors. That means we'll be able to introduce more information without having to completely reinvent the world.
The corresponding APIs for adding samples to media are AddMediaSample2, which is a replacement for AddMediaSample, and AddSampleTableToMedia, which is a replacement for AddMediaSampleReference, AddMediaSampleReferences, and AddMediaSampleReferences64. Those older APIs will return errors if you call them on movies with B-frames. Again, this is deliberate, in order to protect users from having their movies damaged by a failure to copy across the frame reordering information.
In QuickTime 7, we have a brand new decompression API called the ICM decompression session. To create one, you provide a dictionary that describes the pixel buffers you want to get out of it. Then you push in the compressed frames, and your callback function is called with buffers containing the decoded frames, and with other status messages. So let's go take a look at that on the demo machine.
Let's take a look at that movie that we looked at before. The traditional way to show the internal structure of movies was to use an application called Dumpster. Dumpster's been around since 1991; it's quite venerable, and you can still download it from our developer tools page. We've updated it so that it is aware of the new information in QuickTime 7's movie files, and we can take a look at that here. For example, this is the information about the video track and the video media. If I click on this, it will open up the information about the sample descriptions. You probably can't read this; it says H.264 here. And if I open up this, you probably can't read this either, but it says that all of the samples have the same duration, 125. And even if you can read this, it's really hard to visualize: this is the list of display offsets, or composition offsets, for the movie, and they're plus or minus 125. It's really not very visual, so I thought it might be kind of neat to have a sample application which shows this in a more visual way.
It's called MovieVideoChart. It's already available for download; you could be doing this yourself. Let's take a look at that March of the Penguins movie with this application. You can see thumbnails for individual frames here, in three columns. Let's move past that green thing we always have at the beginning of the movie so we can find some interesting pieces of video. Where are we here? I like this bit here. Okay.
Here we go. So, thumbnails of all of these high-definition frames. This application is decoding these high-def frames so it can display little thumbnails of them. And this is the same as the diagram I showed you a little while ago. At the bottom, we have frames in decode order. You can see that they're in decode order because the penguin is going forward and back and forward and back.
And then above that, you can see how the frames are reordered by the display offsets into display order. And here, the penguin is jumping up, up, and away. Above that is a place where we have information about the track edits that a user might have used in order to construct the movie as they wanted.
In this case, the edit is rather dull: it's just the whole movie, it's all there. But the application displays it all the same. At the bottom, you can see more information about the samples. You can see the data size of these frames, and I've also marked out which ones are droppable and which ones are keyframes. It actually says "sync" here when we find a keyframe. Keyframes are rare here. Well, let's take a look at another movie. Here's the movie that David just showed us, that he edited together from some high-definition clips. Now here's Tim, and here's Jim, and here's me.
Let's zoom out so we can have another look at this, because the edits are much longer than the display offsets. You can see this piece at the top here, these three different colors. I've chosen a different color for each of the edits, and you can see that we've brought together three clips of video.
You can also see a couple of other things. The key frames are actually marked in red, and none of those clips began with a key frame. This frame was about four or five frames after the key frame. This edit was a few frames after as well. So there's some extra media information that we've needed to include in this movie that we constructed by bringing together these clips. In order to prime the decompressor for displaying the frames that you do want to see, we had to bring back the key frame and the intermediate frames. But by using the track edits, those frames are removed from the composition that has been created at the user level.
Now, if you have a very long run of difference frames between the key frame and the beginning of the edit, then playing across these cuts might be hard. It might take a lot of CPU. Maybe you don't have enough CPU. But in this case, we were able to play across this because we have a nice, fast machine.
So this application is a piece of sample code. Let's take a quick look at it. It is an HIView-based application. It has a single main view, in a composited window, that draws all of that art, including all of the thumbnails. I'll show a couple of important points about it. We get information about individual samples in groups. We call CopyMediaMutableSampleTable on a range, and it gives us back a sample table object. Then we walk along the sample table object and call these accessors to get out the information that we're going to display in the user interface, in the chart.
In order to display the thumbnails, we have to decode those frames. To decode the frames, we create one of these new decompression session objects. To create a decompression session, you build a dictionary that gives the width, the height, and the pixel format that you want; there's other information you might put in as well. You also provide the callback function that you want to have called when frames are emitted and when other events happen, and you also provide the image description.
Once we've created the decompression session, we can decode frames with it. To load those frames, we call GetMediaSample twice: once to find out the size of the sample, then we allocate some memory, and then we read the sample into that buffer. After that, we call the decompression session to decode the frame. Now, note one important thing here. Because we're working at the low level, it's our responsibility to make sure that the decoder is always primed with the right state for decoding the frame we're going to give it. So if we find that we're not decoding frames in sequence, in other words, if the next frame we need doesn't immediately follow the one we last decoded, then we have to go back to the keyframe and decode ahead. When we're doing that, we tell the ICM not to output pixel buffers, because we don't need those; we only need the ones that we're going to display thumbnails for.
When the frames are decoded, the session calls our callback function. It's called a tracking callback. It calls us with a flag that says "emitting frame" when it's emitting a frame. And when that happens, we wrap that frame as a CG image, and then we use an HIView utility to draw that CG image with the correct orientation. When we get the message that the Image Compression Manager no longer needs that source data to be held around, we can free the buffer. So that's it for this demo. Let's go back to slides.
Like I said, if you use the low-level APIs, it's your responsibility to ensure that frames are decoded in the right order, and that means that if you're jumping around, you may need to go back to the keyframe and decode ahead. We call this "catch-up." Also, you may need to handle frame reordering. Now, in this application, we are deliberately showing the frames in decode order, because we're trying to show what the internal structure of the movie is like. Not all applications are going to do that. Normally, you want to extract the frames and pull them out in display order. If all you want is pixels, then you should be using the high-level APIs to access the decompressed frames, and those are the visual context APIs. If you're headed toward OpenGL processing, then what you want is the OpenGL texture context. If you're going to do more processing on the CPU, then you want pixel buffers, and so you want to use the QuickTime pixel buffer visual context instead.
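The catch-up idea can be sketched as a tiny helper. This is a hypothetical simplification, not an actual QuickTime call: given per-sample keyframe flags, it finds where decoding must start so the target frame has the state it depends on; every frame before the target is then decoded with pixel buffer output suppressed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch of "catch-up": find the nearest keyframe at or before
   the target frame.  Frames in [start, target) are then decoded with output
   suppressed, purely to prime the decoder's state. */
size_t catchup_start(const bool *isKeyFrame, size_t target)
{
    size_t i = target;
    while (i > 0 && !isKeyFrame[i]) i--;
    return i;
}
```

If the target is itself a keyframe, decoding starts right there and no catch-up work is needed.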
It's time to talk about the low-level image compression APIs. Once again, sometimes the high-level APIs for movie export aren't applicable to you. For example, the movie export components always create a new file to write out their content, a new movie file, a new MP4 file or whatever. Well, if what you're trying to generate isn't a file, then that might not be what you want. So you may have to go down and use the low-level compression APIs yourself.
There's a new API in QuickTime 7 called the compression session. And when you create this, you pass in the width, the height, the codec type, and a session options object. Then you push in pixel buffers with display times in display order. Your callback is called with the encoded frames in decode order.
So frames may be reordered during compression. For that reason, you're not going to get the compressed frames returned to you immediately. They might be returned to you after some delay. And this delay, this queue, is called a look-ahead window. It also gives the codec an opportunity to further optimize compression.
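One way to picture that look-ahead window is a toy reordering function. It's a hypothetical sketch, not the ICM's actual scheduling: assume frames arrive in display order in groups of g, where the last frame of each group is the reference that must be encoded first, and the frames before it are B frames that depend on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: 'out' receives decode (encode) order for 'count'
   display-order frames in groups of g.  Nothing in a group can be emitted
   until its reference arrives, and that delay is the look-ahead window. */
void decode_order(size_t count, size_t g, size_t *out)
{
    size_t n = 0;
    for (size_t start = 0; start < count; start += g) {
        size_t end = start + g - 1;
        if (end >= count) end = count - 1;    /* clamp the final group */
        out[n++] = end;                       /* the reference frame first */
        for (size_t i = start; i < end; i++)  /* then its B frames */
            out[n++] = i;
    }
}
```

So a callback receiving the output sees frame 2 before frames 0 and 1, which is exactly the kind of delayed, reordered delivery described above.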
So I'm about to show a demonstration of compressing live video from a camera to H.264, similar to what you get in the QuickTime Player new movie recording feature in QuickTime Player Pro. What we're doing is we're going to have RGB frames stored in pixel buffers, and then we'll pass those to a compression session, which will return us H.264 frames that we're going to drop into a new movie. How do we get those frames?
Well, I took advantage of some sample code that was already there, the Son of MungGrab sample code, which shows you how to use a sequence grabber to get frames in. But I adjusted it: I made it use decompression sessions to decode the frames from whatever format the camera delivers into RGB.
While they're in RGB format, we can also do other things. In this sample code, I'm showing how to draw on top of these frames with a Core Graphics bitmap context. And for some variety, we're also going to wrap them as a CG image in order to display a preview. Now, this isn't the only way to display previews during capture. In fact, it's probably more efficient to use OpenGL to display them. There's also some sample code around that was demonstrated yesterday in the capture session, which shows how to use OpenGL to display a preview. So this is an alternative, for variety. To help me with this demo on the demo machine, I'd like to bring up the QuickTime Mr. Canada, Tim Cherna. Big hand, please, for Mr. Cherna.
And this time, I'm going to open up-- so just need to tell this application where to save it. I need to get it out of demo mode, don't I? Wake up this thing. Here we go. Hooray! Oh Canada. Okay, so you can see that we're... Thank you. Thanks Tim.
You can see that we're drawing on top of the video as we do this. So we've automatically opened it in QuickTime Player for my convenience, mostly. And here's Tim again. Wow. I guess it isn't so live. Waving his flag. Okay, let's open this movie back in movie video chart to prove that the frames are being reordered, which you can only get if you use the new compression APIs. Here's a whole lot of Tims. And here's a flag. So you can see that the frames are being displayed in a different order from the order they were decoded. Great. Let's have a look at the code.
Once again, we create a decompression session by constructing a dictionary that has the width and the height and the pixel format. We also indicate that we want the pixel buffers to be compatible with bitmap contexts and CG images, and we provide the image description, which came to us from the sequence grabber, and our tracking callback. Next, we construct the compression session by first creating the session options object. Now, if you were looking at this on your PowerBook at the moment, you'd probably see a bit more text; I've removed some of the error checking just so it fits on the slide. There's a bunch of options that it's very important we set. We must set the flags that say we allow temporal compression and frame reordering, or we won't get B frames.
We set the keyframe interval because it's nice to say what it can be, but we're not required to set that. Because we want to store the frames in a movie, we need to know their decode durations, and so we set the flag that says durations are needed. Now, we're only going to give the ICM display timestamps, but it's going to derive all of the durations from those as well. We tell the ICM it's OK to drop frames by setting the allow-frame-time-changes flag. And we also set the data rate. Once again, just like with the decompression session, we give it a callback that's going to be called with our encoded frames.
So the sequence grabber works by calling the data proc function that we provide, and it calls us with each frame. Once we've made sure that we have a compression session and a decompression session, we decode these frames. In this case, I'm using an API called non-scheduled display time, which is what you'd use if the frames had been reordered. Here, it's just the way I'm triggering those frames to be, well, not displayed, but output to our callback function. So here's the callback function for the decompression session. Once again, it gives us different flags.
When it tells us it's emitting a frame, then we're going to take that pixel buffer, we're going to draw on top of it using a bitmap context, and then we wrap it in a CG image and place it in an HI image view inside the preview window. We also then feed the frame to the compression session.
The compression session has another callback. Here's our callback function for that session. Now, when we get the first frame, we have to create a new media to store the frames. And it's convenient to do that once we get the first frame, because that's when we get the image description, and we can use the information in the image description to work out what the dimensions of the media should be.
And then we call AddMediaSample2 to add samples. There's actually a utility here I'm pointing out which adds media samples directly from the encoded frame object. When we're told to stop, when I click the close button in the window or quit, we tell the compression session to complete any frames that are still in its queue, and then we release the object.
So once we've done that, we have a movie that has a media that's full of frames. But the movie's still empty. Why is that? Because the edit list hasn't been touched. The edit list is still empty. So it's important to insert all of that media into the track so that we have an edit that includes all of that media.
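That distinction, a media full of samples versus an empty edit list, can be captured in a toy model. The names and struct here are hypothetical; the real step is the InsertMediaIntoTrack call he describes.

```c
#include <assert.h>

/* Toy model of the empty-movie pitfall: a track's duration comes from its
   edit list, not from the media it references.  All names hypothetical. */
typedef struct { long mediaDuration; long editDuration; } Track;

long track_duration(const Track *t) { return t->editDuration; }

/* Stands in for InsertMediaIntoTrack: create an edit covering all the media. */
void insert_all_media(Track *t) { t->editDuration = t->mediaDuration; }
```

Until the insert step runs, the track reports zero duration no matter how many samples the media holds, which is exactly the "empty movie" symptom.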
It's easy to forget this, but if you do, you'll have an empty movie, and now you'll know what to do. Then the only thing left to do is to write out the movie header and to close the file. This sample code uses the movie storage APIs. These are a replacement for the FSSpec-based APIs. The movie storage APIs use data references, and we have a bunch of useful utilities to construct data references from FSRefs, from CFURLs and CFStrings, and from other kinds of things as well. So that's it for that demo. Let's go back to slides. And, for me, another drink of water.
OK. Once again, in order to get the advantages of H.264, you must use the new low-level compression APIs if you're doing compression at the low level. The high-level API, movie export, already does this for you. But if you're at the low level, you've got to use the new APIs. The good news is that the new APIs support all codecs, new and old.
One more point on this. While we're doing that real-time compression, there's a whole lot of pixel buffers being created and destroyed. Mapping and unmapping large pieces of virtual memory involves some per-page kernel overhead, and that can be quite a lot when all those frames are high def and you're doing 30 or 60 of them per second. So it's important to have an efficient recycling mechanism so that you're not constantly mapping and unmapping memory over and over. Core Video provides a mechanism called the pixel buffer pool, which helps you reuse and recycle buffers efficiently. It's recommended that you use it so that, like I said, you're not always unmapping and paying kernel overheads. And in this sample code, if you look at it, you won't actually find the word "pool" in it at all. But we are taking advantage of pixel buffer pools in this sample code, because decompression sessions use pixel buffer pools to create the buffers that they return to us.
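The recycling idea is easy to see in a minimal free-list sketch. This is not Core Video's implementation, just an illustration of why a pool avoids repeatedly mapping and unmapping memory: once a buffer comes back, the next request reuses it instead of allocating.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical free-list pool: steady-state get/return cycles allocate
   no new memory after the first buffer is created. */
typedef struct Buffer { struct Buffer *next; unsigned char *pixels; } Buffer;
typedef struct { Buffer *freeList; size_t bufferSize; int allocations; } Pool;

Buffer *pool_get(Pool *p)
{
    if (p->freeList) {                 /* reuse a recycled buffer */
        Buffer *b = p->freeList;
        p->freeList = b->next;
        return b;
    }
    Buffer *b = malloc(sizeof *b);     /* only map new memory when needed */
    b->pixels = malloc(p->bufferSize);
    b->next = NULL;
    p->allocations++;
    return b;
}

void pool_return(Pool *p, Buffer *b) { b->next = p->freeList; p->freeList = b; }
```

A real pixel buffer pool also handles retain counts and attribute matching, but the core win is the same: one allocation, many reuses.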
If you take that code and change it, and you just take the compression session stuff with you, then you should consider using a pixel buffer pool to create your source pixel buffers. There's actually a utility as part of compression sessions that helps you do that. So, one more piece: we're going to talk about writing codecs.
If you write a codec, generally that means that you need to write two components, an image compressor and an image decompressor. Now, we have a brand new, bold new image compressor API for components in QuickTime 7. We have made more modest changes to the decompressor API in QuickTime 7. And I'm going to show you on the demo machine a brief survey of the example codec.
Now this is a well-commented piece of code. It should show you how to use the interface well. But it's not a very good codec; to put a better spin on it, you can do better. It's a very simplistic encoding algorithm. It just encodes the high bits first, and when it runs out of bits, it stops. Well, one benefit of this is that it's very easy to see the loss. Loss means you're rounding down by some amount, and that means that images get darker, or, because we're doing it in Y'CbCr, greener.
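That "keep the high bits, round down" loss model can be written in a couple of lines. This is a sketch of the idea, not the example codec's actual code:

```c
#include <assert.h>

/* Keep only the top 'bits' bits (1..8) of an 8-bit sample.  The result is
   always <= the input, i.e. rounding down: that's why the lossy output
   gets darker (and, in Y'CbCr, greener). */
unsigned char keep_high_bits(unsigned char sample, int bits)
{
    unsigned char mask = (unsigned char)(0xFF << (8 - bits));
    return sample & mask;   /* rounding down, never up */
}
```

Fewer kept bits means a bigger possible step down, which is the increasingly dark, green look as the quality slider drops.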
So let me show you how to use this. I've already built this, and so I'm going to install it in /Library/QuickTime. /Library/QuickTime is the place where you'll normally install your codec components. They can also be installed in ~/Library/QuickTime in your home directory. So now that I've installed this, I can open up one of those clips that we saw before. Here's Jim.
- This time.
- Okay. I like it better when he's smiling. So now that I've added that codec... assuming that I quit the player. Let's quit the player. You have to quit the player, and when I relaunch it, it'll see the new codec. If you don't see your new codec, try quitting. Let's export again.
There it is, example IPB. So... at top quality, we don't see any loss. As we drag things down, loss means dark and green and brooding and kind of messed up. Well, you can do better, like I said. Well, I've already compressed this one. I cooked one before the show. And here it is.
You saw a flicker there. The better encoded frames are the B frames. The ones that look poorer are the I frames, because they had to start from scratch. The B frames could use the other frames as a better starting point. We could take a look at this in movie video chart.
If I quit it, we can take a look at this in Movie Video Chart, and you can do all of this yourself and confirm that it's using B frames. You can see the frames are reordered there. Okay, let's have a quick go with the source. Now, I've separated out the naive encoding logic, which is in these "naive" files here; you'll replace those with your very smart algorithms. I've separated that from the interface. Let's look at the important functions in the new compressor API first.
There are three main functions. In prepareToCompressFrames, you return a dictionary that describes the pixel buffers that you want to get as your source. For example, you say what width and height you want, and you can ask for extra pixels to be allocated as scratch memory on the right and bottom. In this case, we allocate extra memory to round up to whole 16-by-16 macroblocks. You can now also ask for extra memory on the top and left, which may help if you're doing something called unrestricted motion compensation, which means you can have motion vectors that point outside the frame.
It's one of those geeky codec things. You can also ask for a particular-- for the bytes per row to be a multiple of a particular number. If you're using Altivec or SSE code, it's likely that you want to ask for at least a multiple of 16 for the bytes per row.
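The rounding being described, up to whole macroblocks or to a bytes-per-row multiple for AltiVec or SSE, is plain integer round-up. A trivial sketch:

```c
#include <assert.h>

/* Round 'value' up to the next multiple of 'multiple' (multiple > 0),
   e.g. a frame dimension up to whole 16-pixel macroblocks, or bytesPerRow
   up to a multiple of 16 for vector code. */
long round_up(long value, long multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}
```

So a 486-line standard-def frame would be padded to 496 lines to cover complete macroblock rows.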
Another thing you can ask for is a list of the input pixel formats that you want to support. This codec asks for our standard Y'CbCr 4:2:2 pixel format, whose pixel format type is called '2vuy'. This is a new feature of the new codec API; previously, you had to implement RGB input. We were sick of everyone having to write their own RGB-to-YUV converters, so now we provide some standard ones.
Another element that goes in this dictionary says what the gamma level is; we ask for the video gamma level. Also, the color profiles, in particular the Y'CbCr-to-RGB matrices used for standard-def and high-def video, are different, so we specify which of those we're going to use. We're using the standard-def ones in this example codec. Apart from creating that dictionary, prepareToCompressFrames is also the place where it's appropriate to set any extra information on the image description. Here we're setting the gamma level to 2.2.
The encodeFrame function is called for each source frame. You don't have to encode it immediately. You can build up a little queue of frames. In this case, we're using a C array to store that queue, but you can store it however you want. You can use core foundation arrays, because these objects are core foundation retain counted objects. You can use STL if you like. I don't care.
But you probably want to have some threshold beyond which you decide, okay, I've got enough frames, I'll start encoding, or else you could end up swapping frames out because you'll have so many sitting there. So encodeFrame is called for each source frame in display order. completeFrame is called to say it's time to encode a given frame. You don't have to return that one first, but you must encode and emit it, or drop it, before this function returns.
So there's a function here that decides which frame to encode next. I'm not going to go into that in detail; I'll just say that it implements an MPEG-2-style IPB pattern. You could modify it to support other patterns if you want that kind of flexibility. When you decide it is time to encode a frame, you create a buffer with the ICMEncodedFrameCreateMutable function, and you pass it the worst case, the largest data size you might need. Then you fill in as much as you need.
And then once you've done that, you set the actual data size to say how much you did use. And then you set the flags to say, was it a key frame? Was it droppable? And so forth. And then you emit the frame. And when you emit the frame, you say which source frames it was based on. And that's how the ICM knows that you're done with those source frames. The other choice is that you could drop a frame. This codec doesn't actually drop frames. But if you look in the header file, you'll see how to do that.
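The IPB pattern logic can be sketched as a classification by display index. This is a hypothetical simplification of the kind of decision the example codec's pattern function makes, assuming an I frame every keyFrameInterval frames and a fixed number of B frames between references:

```c
#include <assert.h>

/* Hypothetical MPEG-2-style frame type assignment by display index:
   an I frame every keyFrameInterval frames, references (P) every
   bFramesPerRef + 1 frames, and B frames in between. */
char frame_type(long displayIndex, long keyFrameInterval, long bFramesPerRef)
{
    if (displayIndex % keyFrameInterval == 0) return 'I';
    if (displayIndex % (bFramesPerRef + 1) == 0) return 'P';
    return 'B';
}
```

With a keyframe interval of 12 and two B frames per reference, this yields the classic I B B P B B P... cadence.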
Let's have a look at the important functions in the decompressor API. There are five functions that are important. In Initialize, you set some basic flags to describe the basic characteristics of your codec. In this case, we say that we support B frames, that is, out-of-order display times. We also say that we are multi-buffer aware: we get high-performance playback if we are able to decode each frame to a separate buffer. If you don't, you'll lose some performance, because we may need to make extra copies out of the same buffer to upload them to OpenGL, for example. In the Preflight function, you report what pixel formats you want for output. This codec also uses '2vuy' for output.
You can also ask for extra scratch space at the bottom and right of the buffer; we round it up to a multiple of 16 by 16. This is also a convenient time to allocate internal buffers. In the BeginBand function, for most codecs, a band is equivalent to a frame, so you can think of this as begin-frame. In BeginBand, it's time to classify the frame. You need to read as much of your frame header as is necessary to classify the frame as a key frame, a difference frame, or a droppable frame. This is how the Image Compression Manager is able to know which frames to drop when we are low on CPU.
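The classification step in BeginBand might look like the following sketch. The one-byte frame header format here is entirely hypothetical; a real codec reads whatever its own bitstream defines.

```c
#include <assert.h>

/* Hypothetical frame header: the low two bits of the first byte give the
   frame type.  BeginBand only needs to read enough header to classify. */
enum FrameType { kKeyFrame = 0, kDifferenceFrame = 1, kDroppableFrame = 2 };

enum FrameType classify_frame(const unsigned char *frameData)
{
    switch (frameData[0] & 0x3) {
        case 0:  return kKeyFrame;        /* decodable on its own */
        case 1:  return kDifferenceFrame; /* depends on earlier frames */
        default: return kDroppableFrame;  /* nothing depends on it */
    }
}
```

The important point is that this classification is cheap, just a header peek, because it runs before any real decoding and drives the drop-frame decisions.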
The DecodeBand function is new in QuickTime 7, in order to support B frames and frame reordering. It's called to decode a frame that isn't the next frame to display. It's up to you: you can also choose to have it called for all frames if you want. Finally, the DrawBand function is called when it's time for you to output a frame to an output buffer. So that's that sample code. We'll return to slides and recap this.
Three important functions for the compressor. Prepare to compress frames, which is where you describe the source pixel buffers. Encode frame, which is called with each source frame so you can put it in your queue. Complete frame, time to put up or shut up. Five important functions for the decompressor. Initialize, report basic characteristics about your codec and what it does. Preflight, where you negotiate the output pixel format. Begin band, where you classify the frame as a key frame, a droppable, whatever. Decode band, where you decode a frame to an internal buffer. And draw band, where you write a frame to an external pixel buffer. So it's now time for the Thread Safety Pledge. If you would all please raise your right hand. Repeat after me. I promise that every new component I write shall be thread safe.
It's the 21st century. It's time that we were all writing thread-safe code. Avoid using unprotected global variables. Use the instance storage to store your state. Avoid thread-unsafe APIs like the old resource manager calls. Test your codec to make sure that it's thread-safe, and then ship it with the component thread-safe flag. This excludes UI. None of our UI frameworks are actually thread-safe. The thread safety flag refers to the core processing that your component does. There's a good tech note on thread safe programming in QuickTime that I recommend all of you have a look at, especially if you're component authors. And another topic about components, you are a child of the universe. And your component should be too.
Now, Rosetta does not include a mixed mode the way we had with 68K and PowerPC. If you're inside a PowerPC application, you will only be able to use PowerPC components. Intel-based applications will only be able to see Intel-based components, so it's important that your component is universal and contains both kinds of code. There are some instructions on the universal binary page. They give you a fairly aggressive approach, a very clever way of doing things, where the resource template changes dynamically based on which architectures you're compiling for in your Xcode project. It isn't necessary to do all of that macro cleverness if you just want to always build universal components. In that case, all you have to do is put both platform codes in the platform list, just like back with 68K and PowerPC components. No problem.
So, in summary: wake up. We've looked at a whole bunch of sample code. We've looked at the high-level APIs for movie editing and for export. We've looked at the low-level APIs, which give you access to low-level sample information for decompressing frames and compressing them. And we've shown you the new APIs and the new example for writing your own codecs. If you use the high-level APIs, all of that B-frame trickiness will be hidden from you. Or you can choose to use the low-level APIs, and QuickTime will expose it to you in all its glory.
We have a lab where there are going to be lots of QuickTime engineers who are waiting to solve your problems. Not just QuickTime, but also a bunch of other graphics and media engineers. You are here, and if you go out there and go around the corner, around the back, you'll find the lab, and it's open until some hour each day.