WWDC03 • Session 209

OpenGL Optimizations

Graphics and Imaging • 1:06:51

This session focuses on techniques to optimize your application's usage of OpenGL and 3D graphics hardware. A variety of optimization strategies, such as display list and vertex array optimizations, are discussed. A specific focus is on using the OpenGL Profiler tool to locate, diagnose, and eliminate performance bottlenecks in OpenGL-based applications.

Speaker: John Stauffer

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Good afternoon. Welcome to session 209, Advanced OpenGL Optimizations. I'm Travis Brown. I'm sure by now you know what I do, so we won't cover that. But this is a really good session for you to attend because we're going to be talking about the techniques to optimize OpenGL on Mac OS X.

And I want to just make a couple quick comments and get out of the way because we sort of have a supersized session here. So we're going to try to fit it in the allotted amount of time. It's going to be a little bit of a struggle, but we've got great content.

But the key thing I want to sort of point out is that one of the advantages Apple has in terms of our implementation of an operating system is that we deliver the driver and also the OpenGL stack. And because of that, we're able to put in certain fast paths that enable you to really unlock incredible performance out of the latest generations of GPUs.

And since we also work on the operating system and create our own tools, we have a tool chain that's available to you to be able to really take a look at how your OpenGL application is performing, debug it, and also just unlock it so it runs as wide open as it can. So it's my pleasure to welcome John Stauffer, the manager of the OpenGL engineering team, to the stage to take you through the presentation.

Thank you, Travis. As Travis said, I manage the OpenGL engineering team at Apple, and this is going to be a session on optimizing OpenGL, trying to get the most out of OpenGL on the Macintosh. So what we're going to go over today is we're going to try to cover some basics.

We're going to try to move through those as quickly as possible and get into some techniques that you'll want to try to leverage in your application to get the maximum performance. So some of those things are optimizing texture uploads, optimizing your vertex data throughput, optimizing for one-shot images. If you're just blitting an image up to the screen and you want to dispose of it, that's what I call a one-shot image.

Optimizing for copy pixels, if you want to move pixels around, how do you do that quickly? Using threads in OpenGL. OpenGL is thread-safe, and a great way to leverage our systems is to use threads. And lastly, we're going to go into the OpenGL Profiler, which is a tool you can use to analyze your application, so you can look for hotspots and look for places in your code where it may be blocking up against OpenGL.

Okay, so the goals of optimizing, there's a couple goals. One is to maximize performance of your rendering. So you may want to get maximum performance, and that means utilizing the CPU and GPU in a combination that will get you that performance. Another possible goal is to minimize CPU burden.

So it's a little different. Maybe sometimes they result in the same code, but sometimes they don't. And to minimize CPU burden usually means maximize burden on the GPU. You may want to have a technique to offload as much burden on the GPU, leaving the CPU free for doing other work.

So key concepts to keep in mind while I'm talking. Keep these things in mind so that hopefully the concepts I'll be presenting will make sense. If I'm not pointing them out as a mistake, I'm not going to be able to do that. I'm going to be as clear as I need to be. Eliminating CPU copies is a key goal to any optimization.

You want to reduce the number of times you're touching data. So you want to move it through the system, get it to where it needs to go, and start operating or start drawing on it. Cache static data in VRAM. The video card has a higher memory bandwidth than system memory, so if you have static data, an ideal place to put it is up in video memory. Put it in video memory, leave it there, and draw with it from there. And you'll get dramatically higher performance, as I'll show in some of the demos.

Maximize asynchronous behavior between CPU and GPU. That's key. You've got two asynchronous pieces of hardware. You're going to want to stay asynchronous. You're going to want to run one in parallel to the other. You're not going to want to block against each other. That's a key concept to getting maximum performance. And again, like I said, using threads is a concept that can be beneficial at times. So basic things to avoid. So we're into the basics of just a general overview of what we don't want to do.

And what you usually don't want to do is call glFlush. glFlush flushes the command buffer up to the hardware. It uses resources in the driver. It can cost you a little bit of function overhead getting into the driver to make that happen. Don't do it unless you have to. There's a couple reasons you might have to. We'll cover those a little bit later. In general, avoid it. Never call glFinish. I frankly don't know of any cases where glFinish is really required. There's other ways to do it, possibly more efficiently. Avoid calling glReadPixels.

glReadPixels is only really required when you want to take the pixels back off the video card and save them somewhere. Save them for later. But if you're going to use an algorithm for rendering an effect and you have glReadPixels in there, you probably want to look for a better way to do it.

Like caching the data somewhere else in the video card and then reusing it by just copying it out of that cache in video memory back into your scene. Avoid immediate mode drawing. So immediate mode drawing is when you use a glBegin and glEnd and you have a series of vertices and colors that you put in between the begin and end.

There's one caveat to that. You can use immediate mode drawing in display lists. We have a post processor that will go through, take that begin/end sequence, post process it, optimize it, cache it in video memory for you. So display lists are static data. You can't modify a display list once you've created it.

So by definition, it fundamentally is static. So we will treat it like static data, cache it in video memory. So display lists are a good place to put your static data. Minimize state changes. So state changes in the hardware can be expensive. The more complex the hardware gets, the more expensive they tend to get.

So therefore, you want to group your rendering according to state change. There's usually a hierarchy of how you want to group it. Maybe group it by texture, group it by blending modes, group it by drawing method where you coalesce your type of primitive, triangles or quads. But coalescing your database according to state will cause fewer state transitions in the hardware, and that can be very beneficial to performance.
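The grouping described above can be sketched in plain C without touching OpenGL at all. This is only an illustration under assumptions: the `DrawItem` record, its fields, and the sort order (texture, then blend mode, then primitive type) are hypothetical and not taken from the session's sample code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical draw record; the field names are illustrative only. */
typedef struct {
    unsigned texture;    /* texture object id */
    unsigned blend_mode; /* application-defined blend mode id */
    unsigned primitive;  /* application-defined primitive type id */
} DrawItem;

/* Order by texture first, then blend mode, then primitive type. */
static int compare_draw_items(const void *a, const void *b)
{
    const DrawItem *x = a;
    const DrawItem *y = b;
    if (x->texture != y->texture)
        return x->texture < y->texture ? -1 : 1;
    if (x->blend_mode != y->blend_mode)
        return x->blend_mode < y->blend_mode ? -1 : 1;
    if (x->primitive != y->primitive)
        return x->primitive < y->primitive ? -1 : 1;
    return 0;
}

/* Sort the draw list so items sharing state end up next to each other. */
void sort_by_state(DrawItem *items, size_t count)
{
    assert(items != NULL || count == 0);
    qsort(items, count, sizeof items[0], compare_draw_items);
}

/* Count how often any state field differs between consecutive items. */
size_t count_state_changes(const DrawItem *items, size_t count)
{
    size_t changes = 0;
    for (size_t i = 1; i < count; ++i) {
        if (items[i].texture != items[i - 1].texture) ++changes;
        if (items[i].blend_mode != items[i - 1].blend_mode) ++changes;
        if (items[i].primitive != items[i - 1].primitive) ++changes;
    }
    return changes;
}
```

Calling `count_state_changes` before and after `sort_by_state` gives a rough feel for how many transitions the grouping saves; which field should be the outermost sort key depends on which state change is most expensive on the hardware at hand.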

Okay, so jumping into optimizing texture uploads. First, what we're going to go through is we're going to do an overview of the texture pipeline, and we're going to talk briefly about that. And then we're going to go into texture optimization basics, just some overview of what we want to look at.

And then we're going to go into some OpenGL extensions, and we're going to break those down into two categories. We're going to break them down into Power of Two extensions and Non-Power of Two extensions. There's ways to optimize slightly differently for those two different cases. Okay, so the OpenGL pipeline generally looks like this.

For this part of the talk, for optimizing texture uploads, we're going to focus in on the pixel pipeline. So we're going to focus in on the highlighted yellow boxes there. And to zoom in on that a little bit and talk a little more in depth on what happens while the data is moving through the system, here we have a basic block diagram. And at each block on this diagram, oval or block, the data may be copied. So you have the application; you, the application, have a copy of the texture.

You hand it to OpenGL. OpenGL may make a copy of the data and store it in the framework. When it goes to draw it, the driver may make a copy into some hardware-specific format for uploading to video memory. And then, of course, video memory has a copy, right? So theoretically, it's possible that you might have up to four copies in your system at some point. So the goal of some of this discussion is how to avoid some of that and how to control how that operates.

Okay, so basics. So again, like I said, we want to minimize CPU copies and conversions. It is possible that you pass data that is not necessarily in the format that the hardware wants it, and we'll have to do a conversion. So you're going to want to pick data formats that are optimal, and I mentioned some here.

BGRA with unsigned int 8_8_8_8 reversed (GL_UNSIGNED_INT_8_8_8_8_REV), BGRA with unsigned short 1_5_5_5 reversed (GL_UNSIGNED_SHORT_1_5_5_5_REV), and the YUV format for YCbCr data. Those formats can be used for fast paths through the system, and without any other types of state that may cause conversions, those do not need to be converted for the hardware to natively understand them.

Okay, so like I said before, you'll want to avoid glFlush calls, but there are exceptions. And one exception is when you have the GPU asynchronously reading data from your data cache, so if you have a texture and the GPU is going to read directly from it, you want to stay asynchronous with the GPU. So you may have to double buffer your data.

For instance, you may have to have texture one and texture two, and while the CPU is uploading or working on texture one, you'll have the GPU, which is going to be working on texture two. And we're going to go into more details here, but I wanted to mention it, so you'll keep it in mind as we look at some of the diagrams. Okay, so double buffering.

So like I said, what you want to do, to stay asynchronous, is to have double buffering of the data. So if you double buffer your data, it looks like this, right? Now, a few OpenGL extensions for optimizing your texture pipeline. Apple client storage is where you tell OpenGL that I will allocate a piece of memory and I'll keep it around and you can just have a pointer to it. So we will retain a pointer to your data. We will not copy the data and retain it locally. So that requires that you retain your copy of the texture until you delete it, because we're going to be referencing it.

Another extension is Apple Texture Range. That has a couple interesting modes that you can define. One is cached, which means that you're going to want the data cached in video memory. One is shared, where it says that I don't want it cached in video memory. I want you to leave it in AGP space. Don't put it in video memory.

And what Texture Range does is it defines a region of memory. You put a texture there. We will map that region of memory in AGP space and leave it there, so that the GPU is able to come and DMA directly from that piece of memory without us having to copy it out of that region of memory and put it into AGP space. We will map that memory directly in AGP space.

Okay, so if we look at what these extensions do to the stack, so client storage, what it does is it bypasses one copy that a texture may undergo. So it goes from the application to the driver without having to be copied by the framework. So that will automatically increase your performance if you happen to be incurring a copy in the framework. We'll go over some sample code. It's pretty easy to enable. All you have to do is make a single call. When you bind to a texture, you make a call to enable client storage, and you just set it to true.

Okay, the texture range and rectangle texture. Now, rectangle texture: that extension is for allowing some hardware to do direct DMA of the texture. To define power of two versus non-power of two: power of two means a power-of-two width and height, whereas rectangle means any dimension, not restricted to a power-of-two dimension. Rectangle texture is required by some hardware to do direct DMA, because some hardware requires the data to be in a hardware-specific format before it can upload it.

So, therefore, rectangle texture is required when you use texture range to get direct DMA. And you'll need to use those in conjunction. And when you do, you can bypass a driver's copy. So now we've shown how to bypass two copies independently. And looking at the sample code for texture range, like I said, there's the cached hint. And the cached hint in this case is for storing the data up in video memory. And it's for non-power of two. And if we look at the shared hint, here's how you set it.

It's the same. It's just that, instead of the cached hint, you have a shared hint. And for those who feel I'm going too fast, there's sample code that'll be up on the website. You can look at that. And it has all these features. So don't get too worried about writing these things down. You can reference the sample code. Okay, so if we use those two in conjunction, we end up bypassing all the copies.

We end up going straight from your copy of the texture, directly into video memory. And therefore, the GPU is directly DMAing from your memory. So the GPU and the application are directly talking to one another. OpenGL fundamentally has been moved out of the way. OpenGL did the setup, but the transfer is happening between your application and that GPU.

Okay, so looking at putting this all together and looking at a little bit of a piece of code that does all these things together. For non-power of two, the first thing we do is we're going to bind to a rectangle texture, and then we're going to set up the cached hint, and that is going to work in conjunction with the client storage, which is next on the fourth line. And between those two, that's going to set up for a direct DMA.

And then when we call glTexImage2D with a rectangle texture target, it is going to set up the GPU for a direct DMA of your texture from your memory, right? So pretty simple to do, but this particular setup that I'm showing right now does require that you are going to be using rectangle texture.

And you'll want to read the rectangle texture specification because it does have some restrictions. It does have some restrictions on functionality, so it's not quite the same as a power of two texture, which allows for mipmaps, allows for different clamping modes and such. So you'll want to read the extension and see if rectangle texture suits your needs. Okay, so for power of two, it's slightly different, but not much.

All I did here in this piece of code, different from the previous one, is change rectangle texture to texture 2D. And texture 2D then allows me to use a power of two texture and get some of the additional functionality that power of two textures bring to me, but it won't give you a direct DMA. It's going to incur one copy, so it's not going to be quite as fast.

Typically, we see that as okay because rectangle textures are usually okay for things like video and such. Games like Quake 3 are going to use power of two, and they're going to load them at the beginning of the game or a level, and you're not going to need to do real-time texture loading as much. So rectangle textures are very powerful for playing video, or playing through images that you want to get to the screen fast, which typically are non-power of two. Okay, so let's switch to the demo machine. I'm going to show you a demo of that.

Okay, so the first thing we're going to do is just look at this and explain what the demo is. This demo, it's hard to see, but it has numbers in the middle of that image. The numbers go from 1 to 5, and it is uploading a 1024, even though it's a small window, it is a 1024 by 1024 32-bit image. It is uploading it across AGP bus and blending it on the screen every frame.

So, you can see that we're getting about 650 megabytes a second, and you'll also see that we've got a couple sliders. I can switch from single buffered all the way up to five buffers to test the different effects that may have on the parallelization of the hardware and the CPU. You can also see a frame rate slider and a number of check boxes to turn off the different extensions. So, the frame rate slider goes all the way up to 1,000. We actually had to add that to test the G5.

But we're on a G4, so we're going to keep it in the middle. So the interesting note here is, like I had said, is that the idea is to eliminate CPU copies. Well, this actually eliminates all the CPU copies, right? And you can see that the CPU monitor is showing very little activity.

The CPU actually isn't doing much here other than running the event loop and drawing a little quad on the screen. So the CPU's been effectively removed from the bulk of the work of this demo. Now, if we turn off, say, all the extensions, let's just start turning these things off.

You can see that our performance dropped from 650 megabytes a second down to 111 megabytes a second. And you can see now we have effectively saturated one CPU. So now we're a single-threaded app. We've taken one CPU. We are basically memory-bound by a CPU copying the data to get it into a format that can be uploaded by the GPU.
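To put the two bandwidth figures in terms of frames, here is a small arithmetic sketch. It assumes the demo's 1024 by 1024 32-bit image is 4 bytes per pixel and that "megabyte" means 1,000,000 bytes; the talk does not state which convention the demo uses, so treat the results as approximate.

```c
#include <assert.h>

/* Bytes in one uncompressed image of the given size. */
unsigned long image_bytes(unsigned long width, unsigned long height,
                          unsigned long bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Frames per second that a given upload rate can sustain.
   Assumption: one megabyte is 1,000,000 bytes. */
double frames_per_second(double megabytes_per_second,
                         unsigned long bytes_per_frame)
{
    assert(bytes_per_frame > 0);
    return megabytes_per_second * 1e6 / (double)bytes_per_frame;
}
```

Under those assumptions, `frames_per_second(650.0, image_bytes(1024, 1024, 4))` comes out at roughly 155 uploads per second, while the 111 megabytes a second figure corresponds to roughly 26.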

So with those extensions, again, I'll turn them back on, not only do you get higher bandwidth, but you save CPU work, right? Because you're not using a CPU, and you're getting higher throughput. So for people who are able to use some of these extensions, you can get quite a benefit. Okay, back to the slides, please.

Okay, let's jump into optimizing vertex throughput. Optimizing vertex throughput actually is very parallel to optimizing texture throughput, and you'll see as I go through the slides that it parallels the same concepts. In fact, we did that on purpose so that it has very analogous concepts. Okay, so we're going to look at an overview of the vertex pipeline. We're going to go through some of the basic optimizations that can be done.

We're going to go through some of the APIs that can help you, and we're going to break this into a few categories: static, dynamic, and display list. So those are three separate categories we're going to touch on, with slightly different techniques for each of those categories. Okay, again, same pipeline. This time we're going to focus in on the display list and the vertex path.

Let's get into the basics. For minimizing the CPU copies for vertex data, just like pixel data, you're going to want it in a format the hardware understands. A safe data type is GLfloat. If you keep all your data in GLfloats, you're pretty safe. All the hardware knows how to read GLfloats.

If you start using doubles or bytes for vertex data or some combination that's a little bit off the normal, the driver may say, I can't directly upload this to the GPU. I may have to do a CPU copy, do a conversion, maybe some slow conversion. You may find your performance dismal. So stick with GLfloats.

That's guaranteed to be one of the faster paths. Use vertex arrays. So like I said before, stay away from immediate mode drawing, just because that incurs more overhead: per-function-call overhead and some other overhead I'm going to go into in a minute. So use the standard GL array function calls and maximize your vertices per draw command. I will show some performance charts in a little bit that will show the benefit of maximizing the number of vertices you pass OpenGL at one time.

So for instance, instead of drawing one quad at a time, if you draw 100 quads at a time, you'll get dramatically better performance because you're lowering per-function-call overhead. You're lowering the work the driver has to do on a per-primitive basis. Cache your static data in VRAM. We've already said that.
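The batching argument can be illustrated with a toy cost model. This is a sketch under stated assumptions, not a measurement: the split into a fixed cost per draw command plus a cost per quad, and any numbers plugged into it, are hypothetical.

```c
#include <assert.h>

/* Number of draw commands needed to submit `quads` quads when each
   command carries at most `quads_per_call` of them. */
unsigned long draw_calls_needed(unsigned long quads,
                                unsigned long quads_per_call)
{
    assert(quads_per_call > 0);
    return (quads + quads_per_call - 1) / quads_per_call;
}

/* Hypothetical cost model: a fixed cost per draw command plus a cost
   per quad, both in arbitrary units. */
double submission_cost(unsigned long quads, unsigned long quads_per_call,
                       double cost_per_call, double cost_per_quad)
{
    return (double)draw_calls_needed(quads, quads_per_call) * cost_per_call
         + (double)quads * cost_per_quad;
}
```

With 1,000 quads, going from one quad per call to 100 quads per call reduces the number of draw commands from 1,000 to 10, so the fixed per-call term shrinks by the same factor while the per-quad term is unchanged.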

And use vertex programs to offload your CPU work. I'm going to show a demo in a bit that will show how you can do actual work with vertex programs and freeing up CPU cycles, not just in the data transfer aspect, but actually in the effects that you may want to do with your application.

Okay, and again, the same thing, double buffer your data. It's analogous to the textures, and we have the same double buffer data diagram where if you have the GPU reading your data directly out of your application's memory, you're going to need to have some isolation between the asynchronous behavior of the GPU and your application. So you're going to want to double buffer your data, give the CPU a buffer to work on while the GPU is working on a buffer, and toggle them back and forth.

So when you do that, you get asynchronous behavior, and you can get some significant performance improvements. We'll show some of that in a demo as well. You'll notice I have a glFlush there. What you want to do is when the CPU's done with doing some work, you'll want to get that data in flight to the graphics card. So as soon as the CPU's done, issue glFlush, send it on its way, and hopefully you've done a substantial amount of work where you're not calling flush too often because that will hurt you.

Again, very much like the texture pipeline. When vertex data comes through the pipeline, it can go through multiple copies, depending on what APIs you're using and how the data is formatted. So, we can end up with the data going from your application, and if it's going through immediate mode, OpenGL is required to capture the current vertex state. So, we retain a golden image of basically the current vertex state when you're running in immediate mode. If you're running vertex arrays, we don't have to do that.

So, the first copy we have to do is into a local storage of a single vertex instance of a current vertex state. So, we incur that one copy if you're going through immediate mode, and then we're going to have to copy it to a format for the hardware to upload.
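The toggle between the buffer the CPU fills and the buffer the GPU reads can be sketched as simple index bookkeeping. This is a minimal sketch with hypothetical names (`DoubleBuffer`, `double_buffer_swap`); it only tracks which buffer belongs to whom and assumes the GPU has finished with a buffer by the time the CPU gets it back, which a real application would have to ensure separately.

```c
#include <assert.h>

enum { BUFFER_COUNT = 2 };

/* Hypothetical bookkeeping for two application-owned data buffers. */
typedef struct {
    int cpu_index; /* buffer the CPU is currently filling */
} DoubleBuffer;

void double_buffer_init(DoubleBuffer *db)
{
    db->cpu_index = 0;
}

/* Buffer the GPU may be reading while the CPU fills the other one. */
int double_buffer_gpu_index(const DoubleBuffer *db)
{
    return (db->cpu_index + 1) % BUFFER_COUNT;
}

/* Call once the CPU has finished writing and the draw has been issued
   (the point where the talk places the flush). Returns the index that
   was just handed to the GPU and gives the CPU the other buffer. */
int double_buffer_swap(DoubleBuffer *db)
{
    int submitted = db->cpu_index;
    db->cpu_index = (db->cpu_index + 1) % BUFFER_COUNT;
    assert(db->cpu_index != submitted);
    return submitted;
}
```

Per frame, the application would fill the buffer at `cpu_index`, issue its draw and flush, then call `double_buffer_swap` so the next frame's CPU work lands in the other buffer.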

So, we're going to have to copy it somewhere into AGP space for the hardware to DMA it up, and then eventually it makes it to the GPU. So, if you use vertex arrays, you immediately just eliminate that one copy, and that one's easy to do. Immediate mode's an easy one to work around. No extensions needed, just use the right API.

Okay, so let's talk a little bit about dynamic data. Analogous to the texture range extension we previously talked about, we have a vertex array range extension. And it is exactly parallel. It has the same storage hints where you have a shared hint for leaving the data in AGP space. And that's what you're going to want to do for dynamic data. You're not going to want to necessarily cache it in video memory. You're going to want to leave it in AGP space. And what happens there is that you've allocated an array of memory in your application.

We come along, we map that into AGP, we wire it down, and then the application can come along and poke values into it, tell the hardware, I'm done with that, issue a draw command, and we'll DMA it up. And therefore, the GPU is reading directly from your arrays, and it never makes it into video memory. And for dynamic data, obviously, that could have a benefit, that you don't want it to be cached in video memory, because you're going to change it again the very next frame.

Okay, so what does it look like if we use that extension? We have vertex arrays and we use the vertex array range. And just like texture range, we bypass all the copies in the driver and we're DMAing directly from your copy of the application's arrays. So we can get very high throughput doing that, very low CPU work is going on.

Okay, so looking at a little bit of sample code for that. The first two calls are just a standard vertex pointer setup, standard OpenGL for setting up a vertex array. The next two calls then are setting up a vertex array range. And what you do is you pass in a size and a pointer, and you tell us what memory to map in.

So you're just going to give us a pointer with a size, and we're going to map that memory in. And then the last call on this is a flush. Now, that's an important call, because every time you change that data, you're going to have to tell us you changed it.

And what we'll do with that is potentially flush hardware GPU caches, or we may DMA it to some other location. But what's important with that is that you have to tell us the areas that you've changed. So every frame that you come along, and you write some more vertices, and you change that data, you have to tell us the pointer and the size, measured from that pointer, that you want us to flush. You've changed it. We will then know that that has changed and update the hardware.
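Working out the pointer and size for a modified region is plain arithmetic. The sketch below assumes a tightly packed vertex array and uses hypothetical names (`FlushRange`, `modified_range`); the base pointer plus `offset`, together with `length`, is what an application would presumably hand to the extension's flush entry point (`glFlushVertexArrayRangeAPPLE` in APPLE_vertex_array_range), which is not called here.

```c
#include <assert.h>
#include <stddef.h>

/* Byte range covering a run of vertices in a tightly packed array. */
typedef struct {
    size_t offset; /* bytes from the start of the array */
    size_t length; /* bytes that were modified */
} FlushRange;

/* Range covering vertices [first, first + count).
   Assumption: every vertex occupies exactly bytes_per_vertex bytes. */
FlushRange modified_range(size_t first, size_t count,
                          size_t bytes_per_vertex)
{
    assert(bytes_per_vertex > 0);
    FlushRange r;
    r.offset = first * bytes_per_vertex;
    r.length = count * bytes_per_vertex;
    return r;
}
```

For example, rewriting 5 vertices starting at vertex 10, with three floats (12 bytes) per vertex, gives an offset of 120 bytes and a length of 60 bytes.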

Okay, so static vertex data. Very similar. You can use the vertex array range, but instead of using the shared hint, you can use the cached hint. And what will happen is that when you define the vertex array, if it has a cached hint and you say flush, then we'll know that you've changed it. We will DMA a copy up into video memory and we'll keep it there.

Every time you call flush, though, we will have to re-DMA that back up into video memory, but it'll be cached in video memory, and if you're going to draw from it multiple times, it's quite a benefit because you're not having to re-read that data across the AGP bus every time. Instead, what you're doing is you're local to the video card's bus, which is a very high-speed bus.

And like I said previously, you can use display lists with begin/end. The one caveat for using display lists is that we do have an optimizer that goes back through the data and parses it and reconfigures it into an optimal format. You can fool that optimizer. And what you want to avoid is using inconsistent vertex data.

And what I mean by that is that if you go through glBegin and you say glColor, glVertex, glColor, glVertex, that's consistent. Inconsistent would be glBegin, glColor, glVertex, glVertex, glVertex, glVertex. You did a color for the first vertex but not the following ones.

You may fool the optimizer into not being able to handle that. So if you want to play it safe, just keep it in consistent formats and the optimizer will definitely be able to take that data and pack it into a format that we can then cache in video memory, and you can get optimal performance.

So one last caveat then for display lists is that there's a minimum threshold for which it's worthwhile for us to work on the data. And that threshold is 16 vertices. If you have fewer than 16 vertices, we won't even consider optimizing it. We just found that out by testing different machines and finding out where the threshold was, and deciding that, if it's not going to give you a performance benefit, and in fact it can actually slow you down because of the other overhead of doing work on the data, 16 was the minimum.
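The two display list caveats, consistent per-vertex attributes and the 16-vertex minimum, can be expressed as a small check over a recorded command sequence. This is a sketch of the pattern only, not the optimizer itself: the `Command` tags and function names are hypothetical, and the check ignores the order in which attributes are specified before each vertex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for calls recorded between begin and end. */
typedef enum { CMD_COLOR, CMD_NORMAL, CMD_TEXCOORD, CMD_VERTEX } Command;

/* Minimum vertex count quoted in the talk for display list optimization. */
enum { MIN_OPTIMIZED_VERTICES = 16 };

/* Returns 1 if every vertex is preceded by the same set of attribute
   calls as the first vertex, and 0 otherwise. */
int is_consistent(const Command *cmds, size_t count)
{
    unsigned first_mask = 0, mask = 0;
    int seen_vertex = 0;
    assert(cmds != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (cmds[i] != CMD_VERTEX) {
            mask |= 1u << cmds[i]; /* remember which attribute was set */
            continue;
        }
        if (!seen_vertex) {
            first_mask = mask;
            seen_vertex = 1;
        } else if (mask != first_mask) {
            return 0;
        }
        mask = 0; /* start collecting attributes for the next vertex */
    }
    return 1;
}

/* Number of vertex commands in the sequence. */
size_t count_vertices(const Command *cmds, size_t count)
{
    size_t n = 0;
    for (size_t i = 0; i < count; ++i)
        if (cmds[i] == CMD_VERTEX) ++n;
    return n;
}

/* Combines both caveats: consistent data and at least 16 vertices. */
int meets_display_list_caveats(const Command *cmds, size_t count)
{
    return is_consistent(cmds, count)
        && count_vertices(cmds, count) >= MIN_OPTIMIZED_VERTICES;
}
```

The sequence color, vertex, color, vertex passes `is_consistent`, while color, vertex, vertex, vertex does not, matching the two examples given in the talk.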

Okay, so what does that diagram look like then? So when using static data with either display lists or with the cached vertex array range, what happens is the data gets DMAed into video memory, and then the GPU draws from that, right? So it's gonna be taking the data from the video memory cache and drawing. So you get very high throughput for data that you draw more than once.

And the sample code for that, again, it looks like for static data, we're setting up a standard vertex array. Again, we set up the hint. This time, the hint for the vertex array range is cached. It's not shared like it was for dynamic. And like before, we set up the vertex array range pointer and size, and then again, we tell it to flush.

And again, the flush this time is going to cause us to re-upload that data. So if it's not there already, we'll upload it. If you have touched the data again and it was uploaded, we'll refresh it with another copy. So it's like a glTexSubImage call where we're going to refresh the data in video memory.

Okay, so for basic review of what display lists look like, it's pretty simple. You just call glNewList, draw your drawing, and then call glEndList, and you can pack anything you want in there. It takes any OpenGL calls. If you put your geometry in between glNewList and glEndList, hopefully we'll be able to optimize it and get it cached in video memory.

Okay, so looking at what this can do for you for performance, this is a chart of low vertex count performance. So on the x-axis, we have the number of vertices per draw command, and on the y-axis, we have millions of triangles. So as you can see, the orange is immediate mode, and orange tops out pretty quickly as to what benefit you can get by going down that path.

And the red, then, is vertex arrays. Vertex arrays have a little bit lower per-function-call overhead and give you a little more performance. But if you look at the blue, the blue is vertex array range. Vertex array range has great potential for performance, and it doesn't give you a whole lot until you start giving OpenGL a lot of work to do at one time. So that's the key. The key is giving OpenGL lots of work to do. So you can see here, I'm giving OpenGL lots of work at one time, and then the green, the top one, is display lists.

So, you know, it goes up to, on this chart, it goes up to about 12 million triangles a second, issuing 30 triangles per draw command. Now, this is the high vertex count performance, and picking up a little bit where the other one left off, you can see that some of these continue to grow quite a bit. So you can see that vertex arrays and immediate mode stay flat. Array range basically grows until you're limited, and in this case, the test I was running was AGP bus limited.

So I was limited to about 640 megabytes a second of vertex data that I could transmit across the bus. So I pretty much bottlenecked that AGP bus, and that's all the data I could get across. But in the display list case that I was testing here, the data was effectively static.

It only went across the bus once. So the GPU was able to utilize its internal bus bandwidth, which in the case I was running on an R300, that's about 20 gigabytes a second. So it can transfer a whole lot more data. And you can see that at the number, the top number I quoted here, it was about 2.8 gigabytes a second worth of geometry going into the GPU. So that's quite a bit of data, almost 90 million triangles a second. So let's do a demo and show a little bit of this.
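The quoted figures can be cross-checked with a little arithmetic. This sketch assumes a gigabyte is 10^9 bytes and a megabyte is 10^6 bytes, and it treats "almost 90 million" as exactly 90 million; the vertex layout used in the test is not stated, so the per-triangle size is only what the two numbers imply.

```c
#include <assert.h>

/* Average bytes of geometry per triangle implied by a data rate and a
   triangle rate. */
double bytes_per_triangle(double bytes_per_second,
                          double triangles_per_second)
{
    assert(triangles_per_second > 0.0);
    return bytes_per_second / triangles_per_second;
}

/* Triangle rate a given bus bandwidth can feed at a given average
   triangle size. */
double triangle_rate(double bytes_per_second, double bytes_per_tri)
{
    assert(bytes_per_tri > 0.0);
    return bytes_per_second / bytes_per_tri;
}
```

Under those assumptions, 2.8 gigabytes a second at 90 million triangles a second works out to roughly 31 bytes per triangle, and pushing the same average triangle size across a bus limited to 640 megabytes a second would top out near 20 million triangles a second.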

Okay, so what we've got here, for anybody that's been in my session before, it's the same old thing, but the next iteration of improvement. So initially we're drawing with quads, and we're doing the standard basic glBegin/glEnd. Not too impressive, we're getting about 800,000 triangles a second. So what I'm going to do, as I move the slider up, is step through different optimizations using different extensions, and we'll see what effect that has on the performance.

Down at the bottom, you see the color coding. The color coding represents where time's being spent. So red is system time, time spent outside the application. Green is time spent calculating the wave, and blue is time spent in OpenGL. So you can see right now that I'm spending a lot of time in OpenGL, a lot of time calculating the wave.

So if we start moving up the level of optimizations, I went to quad strips. That got us quite a bit of speed improvement, about 25%. And that was pretty easy to do, worthwhile. But let's not stop there. Let's keep going up. So if we go to vertex arrays, a little bit more. That wasn't a great improvement.

Then we go to vertex array range. So here's where it gets interesting. Now you can see that the time spent in OpenGL, which was the blue, went from filling basically the top of that bar to almost nothing. So now the time spent in OpenGL is very little, and we're basically now saturated on the calculation of the wave. We are not able to calculate the wave fast enough to get the data to OpenGL. So if we move up one more notch, we see what AltiVec can do for us.

So we AltiVec the wave calculation, because that was my bottleneck. Once I optimized OpenGL, OpenGL was no longer the bottleneck, so the CPU was. I optimized that, and then we do one last thing. Like I said before, you may want to offload calculations onto the GPU. So what if we write a vertex program to do that wave? And now, again, the interesting thing to watch is that we are calculating a wave motion, and we are sending almost 12 million triangles a second to the screen. And look at the CPU.

The CPU's almost doing nothing, right? So we basically, not only have we optimized it, but we've offloaded the CPU from doing any work. Now the CPU, again, is just doing an event loop. CPU doesn't know that this complex wave is being calculated. And if we actually look at the density of this, it's a really dense wave. There's a lot of triangles there. Okay, back to the slides, please.

Okay, so let's go into a new subject, optimizing for one-shot images. One-shot images, again, are images that you want to get to the screen as fast as you can and then discard. You're not going to blit them to the screen multiple times. It's just one shot. So one possible way of doing that is draw pixels. Draw pixels is fairly effective in some cases. It's best with small images. If you have lots of little small widgets you want to draw somewhere, draw pixels is probably the fastest way to get the data there.

It's a very optimized path, very quick. For images larger than 128 by 128, you probably want to start considering doing some kind of texturing, like our previous demo showed, where you don't have to make a copy, because draw pixels is going to make a copy of the data. The larger the image gets, the more data there is to copy, and your benefit for draw pixels, for instance, is going to go down because it will make a copy.

Okay, so the trick for one-shot images using draw pixels is to get your state right where you're going to go down the optimized path. There's different paths. There's three different paths in OpenGL for how to draw these things. You want to hit the one that's fast. So the first thing you're going to need to do is get your state right, and listed here is a number of disables of things you need to have disabled before you will be going down the fast path. Again, don't worry about writing them down. We'll have a demo posted that you can look at.

Okay, so a little bit more code. DrawPixels is very basic, right? You disable some options and you call DrawPixels. You feed it the right pixel format and type, like we talked about before, that will be a format that the hardware natively understands, and you give it the image and off it goes. Okay, we're gonna do another demo, please.

Quit it and relaunch it here. Okay, so the first thing I'm gonna do, and this is a little bit strange, but I've got an infinite button, and that infinite button is to sit in a... Yeah, it doesn't really go infinite. It's going to sit in a for loop, and it's going to beat on it really hard, because it goes so fast that just running through an event loop is too slow.

So it sits in a for loop really fast and bangs on it really hard. And I reduced the image size to two by two. Now, most of you don't see that, but the key point here is how fast can we really get through the stack to OpenGL? And we can get 660,000 of these little images up to the screen.

So you can get a lot of little things up to the screen. And that's one of the things to remember, because other paths through the system may have more per-function call overhead and limit you not because of the pixel data, but because of what you have to do to get through OpenGL.

So that's the benefit of DrawPixels: it has a low per-function call overhead and can get lots of little small things up to the screen. So if I start increasing the size of these... Sorry, I wanted to go a little smaller than that. So here's a 75 by 75 image.

Throughput is about 400 megabytes a second. Believe it or not, I'm already memory bottlenecked here. I've basically saturated my memory, and it's no longer function call overhead that's stopping me. It's memory bandwidth. Obviously, with the G5, these numbers all change because these go much faster. And that's actually another trick, is how to tune for the different systems.

It can be a delicate job. So if you start increasing the size of this, we quickly... run into... some rather slower frame rates, right? So now we're down to... We're still at 400 megabytes a second. So we have bottlenecked the memory bus. We are just flat-lined now. And as I increase the number of pixels, I will proportionally decrease the frame rate because I am 400 megabytes a second limited.

That's all I can get through the system. That's my limiting factor. So as I increase it, I go slower. And that's why when you get to larger images, it's better to relieve the memory bus of that work and go down the texture path. But for small images, DrawPixels is great. Okay, back to slides, please.
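The inverse relationship between image size and frame rate described in the demo can be sketched with a little arithmetic. This is only an illustration: the helper name is made up, and 4 bytes per pixel and a flat 400,000,000 bytes a second are assumptions, not figures taken from the demo code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: images per second when every pixel has to be
   copied through a memory path limited to bytes_per_second. */
uint64_t max_images_per_second(uint64_t bytes_per_second,
                               uint32_t width, uint32_t height,
                               uint32_t bytes_per_pixel)
{
    uint64_t image_bytes = (uint64_t)width * height * bytes_per_pixel;
    assert(image_bytes > 0);
    return bytes_per_second / image_bytes;
}
```

Under those assumptions, doubling both dimensions of a 75 by 75 image quadruples the bytes per image and cuts the achievable rate to a quarter, which matches the flat-lined throughput seen once the memory bus is saturated.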

Okay, optimizing pixel copy operations. So, there's a lot of cases where you want to draw something, save it off, and then you want to be able to grab a saved copy and blit it back to maybe your back buffer, use it for some part of your scene, and you want to render it and save it.

So, one of the things you can use to do that is copy pixels. So, copy pixels will allow you to do a VRAM to VRAM copy. It's like draw pixels where you're going to have to set up your state correctly. One area you can store the data is in an auxiliary buffer. So, on OS X, you can create auxiliary buffers.

An auxiliary buffer is just another back buffer. So, if you have your main back buffer, you can create another one off to the side and use that as a temporary holding area for copying data into. You can either draw to it directly or you can copy data between your back buffers.

So, you can have a back buffer and this auxiliary buffer. One additional extension that we have that allows some more flexibility is the aux depth stencil, which will not only create the back buffer, but also create the depth and stencil buffers associated with that back buffer. So, you can have two depth buffers, two stencil buffers, and therefore you can copy not only your color data between these aux buffers, but you can copy depth and stencil data and use it as a temporary holding area for fast refresh of some pixels.

So, there's a number of techniques that people use for interacting with very complex geometry. That becomes an important technique. So, like draw pixels, there are certain states that you'll need to have right to make it go fast.

It's very similar to the draw pixels state. And basically, what you don't want to be doing is trying to dither or alpha test or blend or things like that. Basically, what it comes down to is you don't want to do anything that can't be done by the 2D engine on the GPU, because this is a 2D operation and we need to be able to stick within the feature set of the 2D pipeline on the graphics card.

So, you need to disable all the operations that require the 3D pipeline. 3D pipeline is not going to be as optimal. It's just a memory copy that you can do through the 2D pipe. So, there's a number of states that you want to disable. And you can look at the draw pixels example for what state. It's very similar.

Okay, so looking at some sample code. Very basic. We have the standard disables to get the right state, so you can go down the fast path, and then when you go to draw, you're going to want to set your read buffer and your draw buffer. It's just a source and destination. Source and destination can be any of the buffers you have allocated, whether it's the back buffer, aux buffer 1, 2, what have you, and you can copy between those two. Then you issue your copy pixels, and the transfer will be a VRAM to VRAM transfer.

Okay, so let's jump into using threads with OpenGL. So let's go over first the rules for using threads with OpenGL, and then we'll talk about some possible ways to divide up your work onto multiple threads. And then we'll talk about what data you can share between those different threads, and how to synchronize those threads.

So rules for threading. What you can't do is per-context reentrancy. So if you have an OpenGL context and you have two threads, only one of those threads can be in OpenGL referencing that context at a time. If both threads are in OpenGL with that same context, you're going to cause corruption in your OpenGL state and all kinds of bad things can happen. What you can do is you can share context state across threads and you can share surfaces across contexts.

And I'll show some diagrams of how that can be put together to help you with threaded applications. Okay, so divisions of work. One possible division of work is to move OpenGL as a whole onto a separate thread. This is like what Quake 3 does. Quake 3 moves OpenGL onto one thread and has a bunch of other CPU work for the game logic and other work on the other thread. And it's a reasonable division of work that's easy to manage. Other more complex ways to divide your work are to potentially split your texture work from your general geometry work.
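One common way to honor the one-thread-per-context rule is to guard each context with a lock. The sketch below shows only that locking pattern using POSIX threads; `GuardedContext` and its counter are hypothetical stand-ins for a real OpenGL context and its state, and no actual GL calls are made.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for a real OpenGL context; only the locking pattern matters. */
typedef struct {
    pthread_mutex_t lock;
    long commands_issued;   /* placeholder for per-context state */
} GuardedContext;

void guarded_context_init(GuardedContext *ctx)
{
    pthread_mutex_init(&ctx->lock, NULL);
    ctx->commands_issued = 0;
}

/* Every thread wraps its work on this context in lock/unlock. */
void guarded_context_issue(GuardedContext *ctx, long count)
{
    pthread_mutex_lock(&ctx->lock);
    /* A real program would make the context current and issue GL calls here. */
    ctx->commands_issued += count;
    pthread_mutex_unlock(&ctx->lock);
}

static void *worker(void *arg)
{
    GuardedContext *ctx = arg;
    for (int i = 0; i < 100000; ++i)
        guarded_context_issue(ctx, 1);
    return NULL;
}

/* Runs two threads against one guarded context and returns the final count. */
long run_two_threads(void)
{
    GuardedContext ctx;
    pthread_t a, b;
    guarded_context_init(&ctx);
    int rc_a = pthread_create(&a, NULL, worker, &ctx);
    int rc_b = pthread_create(&b, NULL, worker, &ctx);
    assert(rc_a == 0 && rc_b == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_mutex_destroy(&ctx.lock);
    return ctx.commands_issued;
}
```

Without the lock the two workers would race on the shared state, which is the same kind of corruption the rule is meant to prevent. Giving each thread its own context avoids the lock entirely.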

So you may have texture data that's getting spooled off of a disk. You may have another thread that's doing other work for the application, but then it comes along and uses geometry to utilize those textures for drawing. Another possible way to divide the work is to split your output, your surface.

And when I say a surface, I mean the OpenGL back buffer; the surface is basically the piece of video memory that you're drawing to. So that's what we call a surface. So you can split the processing of a surface. So let's say your CPU work can be divided amongst regions in the surface. Then it might be beneficial to split those onto separate threads and leverage both CPUs to get that work done.

Okay, so sharing data between contexts. What gets shared? So when you share two contexts, when two contexts are sharing state, the things that get shared are display lists, textures, vertex array objects, and vertex and fragment programs. So those are the things that get shared, and those are really all objects.

Those are things that usually have a bind and some name associated with them, and those things are the shared items between OpenGL. There's lots of other state in an OpenGL context that does not get shared, and those are going to be per context, even if you've set these contexts up to share that state. So you can share, like I briefly touched on before, you can share a surface between contexts.

So you can have two contexts, multiple contexts, drawing to the same surface. And that's another way to have sharing of data. Okay, so let's look at some of the diagrams of how to divide up your work and move it onto different threads. So here's the first example of just moving OpenGL on a separate thread.

Very basic. You have one thread doing work for the application. You have another thread that's driving OpenGL. So thread one is generating data that is used for input to thread two to draw into the OpenGL context, which goes to the surface, which gets swapped to the frame buffer.

Okay, you can split your texture and vertex processing onto different threads, different contexts. So you can have two threads, two OpenGL contexts. They're sharing the same, some of the same state, right? They have shared state, and they're going to be attached to the same surface. And what you can do is you can have asynchronous processing between the two, where you can have one thread spooling data from a disk, say, decompressing JPEGs, decompressing a movie, what have you, reading that data, those textures into the OpenGL state machine, and then having the other thread come along, referencing those textures and drawing. So that's a way to split your workload if you have spooling or some kind of work to do with imaging that you want to offload.

Another possibility, then, is to use our new API for P-buffers, where you're not reading, say, the data, but you're using geometry to generate a texture. You're using some geometry to draw into a P-buffer, which is in video memory. That, then, is used as a texture, which is then referenced by Thread1's context, which then is drawn, which then goes to the surface and to the frame buffer. So, just to be clear now, the only difference between these two examples is that one, we're dynamically, this example, we're dynamically creating the surface by drawing to it, the texture by drawing to it. Previous example, we were basically loading a texture through the OpenGL API.

Okay, so we can also split the OpenGL processing of a surface, right? So we can take the surface and we can split it across some line and use one OpenGL context to render one part of it and an OpenGL context to render another. And where this might be beneficial is if you're CPU bound and your CPU work could be regionally divided along some portion of the screen real estate, where you have a lot of work to do geometric calculations or what have you in one portion, you can split that across two CPUs and divide your work across regions of the surface.

So one way to do that is to just create two threads, two OpenGL contexts, not have the state shared, and they're just open loop drawing to a surface, but they're drawing to different regions. Now the way you separate what regions they draw to, you can use OpenGL. You can use a scissor and a viewport. So you can just set the scissor and the viewport to the region you want to control, and with a scissor rect, the pixels will not come outside of that.

So you can just set it to one half for one context and one half for the other context and allow the drawing to be open loop to that surface. Okay. Or, if your application wanted to, you could just share state. There's no reason you couldn't be sharing state. Same basic concept, just that they're sharing possibly geometry, programs, textures, to do their work of drawing into the surface.
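The region split can be sketched as plain rectangle arithmetic. `Region` and `surface_strip` are hypothetical names, and the choice of vertical strips is just one possible division; the talk only says to split the surface "across some line".

```c
#include <assert.h>

/* Rectangle in window coordinates, as taken by scissor and viewport calls. */
typedef struct { int x, y, width, height; } Region;

/* Splits a surface into `count` vertical strips and returns strip `index`.
   Any leftover columns go to the last strip. */
Region surface_strip(int surface_width, int surface_height, int count, int index)
{
    assert(count > 0 && index >= 0 && index < count);
    int base = surface_width / count;
    Region r;
    r.x = index * base;
    r.y = 0;
    r.width = (index == count - 1) ? surface_width - r.x : base;
    r.height = surface_height;
    return r;
}
```

Each context would then enable `GL_SCISSOR_TEST` and pass its own region's x, y, width, and height to `glScissor` and `glViewport`, so the two threads never write the same pixels.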

Okay, so how do we set up an OpenGL shared context? So here's a little bit of sample code to show how we would create a context and attach another context to it as a shared context. So this example assumes that you've already created an OpenGL view through, say, AppKit, and now you come to, say, your initWithFrame, and all you're going to do is create an NSOpenGLContext.

And you can see we're doing an NSOpenGLContext alloc, and then we come along and we do an initWithFormat, and we're going to take the format from self. So we're in a view already that has a context, so we're just going to take the pixel format out of that context, and we're going to use it as the pixel format to create a new context, and we're going to provide the current view's OpenGL context as the shared context.

So we're going to create a new context, and when we create it, we're going to hand in the pixel format and the already created context for sharing the state, and that's all you need to do to make sure that the two are connected.

And the key line here then would be the [self openGLContext] on the third line, which hands the new context the previously created context for sharing. Okay, then the next two are fairly standard OpenGL concepts where you're just making the context current and attaching it to the view.

So one small deviation on that, then, is that if we wanted to have two contexts that talk to a surface but don't share state, instead of passing in the already created context into the newly created context, we just pass nil, but we attach it to the same view. So we're creating two independent contexts, we're attaching it to the same view, that will allow them to talk to the same surface, but be independent as far as state.

Okay, thread synchronization. The main tool you'll have for thread synchronization, obviously, is going to be the OS tools that are provided, the OS APIs. So you'll have NSThread and NSLock. Those are going to be what you'll mostly need to leverage. There is one interesting API that you'll want to be familiar with, which is APPLE_fence. APPLE_fence is a way to insert tokens into the OpenGL command stream and then to test when they're done.

So I can do a set fence and I can test when that token I've inserted into the OpenGL command stream has gone through the GPU and made a round trip and is completed. So there's ways to test when portions of your drawing are done, and that's another way to potentially synchronize events within your OpenGL commands. So we'll look at a little bit of sample code how to do that.

Okay, so there's two basic ways you can do it. You can do it with the set fence, which you can see here on the first little piece of sample code, which is I'm setting a fence, and I'm giving it a name. So I can give it any name I want, and you can set that token into the OpenGL command stream, and then do some work and later test to see if it's done. So you would do a finish fence apple, and that call will block until that token is completed.

There's another simpler API. If you're wanting to block against a texture upload being completed, or a draw against a texture, or a draw against a vertex array object, you can just simply test for that object. So what you do is you call finish object apple, and you pass in the type of target you're wanting to check against.

So you may have a GL texture or a GL vertex array as the type, and again, you pass in the ID number. So that'll be the ID number that you used to create or bind to the texture or vertex array object.

Okay, so let's do a little bit of a demo again. And this one is going to be the same demo we did before, but what we didn't show before is we have a multi-threaded button at the top. And I want to talk a little bit about that. So, if we enable multi-threading, we can see that we went from 800,000 triangles a second to 1.5 million triangles a second. So, we got pretty good parallelization, right? We almost got a 2x speed improvement just by doing multi-threading.

So, that was pretty worthwhile. And you can see the CPU monitor. We're working two CPUs pretty hard. Now, what's interesting about this is if I increase the optimization level of OpenGL, if you look at the performance, it's not doing a whole lot, right? Well, the problem is that the workload for calculating the wave is the bottleneck, right? The drawing of the OpenGL is not. So, we're not going to gain by improving OpenGL because the wave calculation is the bottleneck.

So, with the vertex array range and the AltiVec, now we've again removed the bottleneck of the wave calculation. Now, multi-threading is paying off, right? So, now we're getting 10.5 million triangles a second as opposed to 8.5. So, it's not 2x because, as you can see, even with AltiVec, the wave calculation, which is represented by the green, is still significantly more expensive than the OpenGL drawing. So, we're going to get some benefit by moving the system and the OpenGL drawing off to another thread. Okay, back to the slides, please.

Let's go into an important subject that we wanted to spend quite a bit of time on. It's the OpenGL Profiler. It's a tool that comes with the developer CD. It's a very powerful tool. It does have a lot of features, and it will take a little bit of learning to understand how to use it effectively. What you can use it for is optimizing, debugging, and experimenting with your OpenGL application. There's a lot of different features in it, so the Profiler is a bit of a restricted name. It does a lot more than just profile.

Let's go through some of the screens and have a brief overview of what the Profiler can do for you. First, the OpenGL Profiler, like some of the other tools in the system, you can have it launch your application for you. It will launch and basically attach to your application, or you can attach to a running application. If you already have an OpenGL application running, you can just attach to it and start utilizing the services of the Profiler just by simply attaching to a pre-running application.

One of the services it provides is function statistics. It will time the in and out times of all the OpenGL functions and provide you counts and percent times and overall times spent in each function. This way, you can quickly get an idea of which OpenGL functions you're spending time in and how expensive those are for you. You can also capture call traces: simply enable call trace capture, and it'll capture all the OpenGL commands and their arguments so that you can scroll through it and look at what your application is feeding OpenGL and get an idea of the call sequence.

It will capture textures, vertex, and pixel programs. So you can actually run your program and it will capture all the textures that you've passed in. It will capture the pixel programs and vertex programs. You can look at those and you can see, make sure that you've got the textures you think you have loaded under the right names or what have you.

You can set breakpoints. So you can go to an OpenGL function and you can say, I want to break here. And at that breakpoint, it will give you the application call stack. So you can see what your application call stack was at that point. It'll also give you a complete listing of the OpenGL state.

So you can sit there and thumb through the OpenGL state, make sure that at that breakpoint, the state is what you expected it to be. It'll also, at a breakpoint, let you look at the off-screen buffers. So it'll let you look at the back buffer, depth buffer, stencil, alpha buffer, and you can look at it at any point you can set a breakpoint.

It'll let you write scripts and execute OpenGL commands. So at a breakpoint, you can type in an OpenGL command and say, well, I think that state's wrong. I'll modify it right here. Type an OpenGL command, hit execute, and it'll poke that OpenGL command right into your application and change some OpenGL state for you.

So one useful thing for this, then, is going to be debugging. If you think that you've got a bug in your state setup, you can modify it on the fly. Scripts can be attached to breakpoints, so they can be auto-executed if you wanted it to be executed every time a breakpoint came along.

The OpenGL Driver Monitor, another powerful tool. What this does is it attaches to the driver itself and starts collecting stats out of your graphics driver. There's a number of parameters you can monitor, like video memory usage, hardware wait times, so you can watch to see if the CPU is stalled against the hardware. You can watch what kind of stall it is. It breaks down into many different categories of why the CPU may be blocked up against the GPU. You can try to monitor those.

You can look at bandwidth usage of how much data you're getting through the system, so it'll track bytes per second through the system. So there's a whole bunch of useful stats. It takes a little bit of studying this tool to get useful data out of it because it is somewhat complex. So we'll go through a little bit of that in one of our demos. So why don't we switch to the demo machine and let's do that.

So, quickly here, before we go any further, screenshots I didn't show, you can also customize your pixel format. So, for instance, if I wanted to make a custom pixel format, I can come in here and change my pixel format attributes that the application uses. So, if you have a pre-compiled application, you can modify your pixel formats on the fly without having to recompile them. What you can also do is you can emulate hardware. Now, when I say emulate, yeah, it's neat, but I'm going to give you the bad part now. All it really does is deprecate your current hardware to some less capable hardware.

So, for instance, if I'm running an R300 like I am here, and I wanted this hardware to look to the application like a RAGE 128, I could say, choose driver RAGE 128 that got released, the same feature set that got released in, say, OS 10.3. And any time your application would come along and make a query into OpenGL for some kind of capability, like an extension, some kind of min-max values, the driver will return a value that looks like a RAGE 128. So, if your application is coded correctly to respect extension strings and values that are queryable, it's a powerful utility for making your application think it's running on a RAGE 128.
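Respecting extension strings usually comes down to an exact token match against the space-separated extension list. The sketch below is one way to do that check; `has_extension` is a hypothetical helper, and in a real application the list would come from `glGetString(GL_EXTENSIONS)`.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `name` appears as a whole space-separated token in
   `extensions`, 0 otherwise. A plain substring search is not enough,
   because one extension name can be a prefix of another. */
int has_extension(const char *extensions, const char *name)
{
    assert(extensions != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0)
        return 0;
    const char *p = extensions;
    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == extensions) || (p[-1] == ' ');
        int ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token)
            return 1;
        p += len;   /* keep searching after this partial match */
    }
    return 0;
}
```

An application that gates each optimized path on a check like this, with a fallback when it fails, is the kind of application the emulation feature can exercise on lesser hardware profiles.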

This was actually a feature request from last WWDC. So we got it in there. Okay, so I contrived an application, a slight variation of the texture range demo I showed before. And what I did is I did what I said not to do, and I stuck a GL finish in there. So first thing we're going to do is we're going to look at the statistics that this thing's collecting.

So what do we see? We see GL finish is taking 82% of the percent time in GL and 40% of the application time. So a couple values, let's go over the screen real quick, a couple values that are interesting to look at when you pull the screen up.

One is the number down here, which I highlighted. That's the estimated percent time spent in GL. So that will try to estimate how much of the time, total time, is spent in OpenGL. So we can see we're spending about 60%. So that's about 67% of the total time in OpenGL. Of that total time, we're spending 55%, 56% of that in GL finish.
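The two percentages on that screen combine by simple multiplication: a function's share of total application time is the share of time spent in GL multiplied by that function's share of GL time. A minimal sketch, with a hypothetical helper name and whole-percent rounding chosen only for illustration:

```c
#include <assert.h>

/* Percent of total application time spent in one GL function, given the
   percent of application time spent in GL and that function's percent of
   GL time. Rounded to the nearest whole percent. */
int percent_of_app_time(int percent_app_in_gl, int percent_gl_in_function)
{
    assert(percent_app_in_gl >= 0 && percent_app_in_gl <= 100);
    assert(percent_gl_in_function >= 0 && percent_gl_in_function <= 100);
    return (percent_app_in_gl * percent_gl_in_function + 50) / 100;
}
```

With the readings quoted above, 67% of the time in OpenGL and 56% of that in GL finish, this comes out to about 38% of the application's total time. The live numbers in the demo move around, so this is only an approximate cross-check.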

So somebody's calling a synchronous call, GL finish, and causing the application to stall and wait for the GPU to flush the pipeline on that call. So here's what we're going to do. Since I don't like that call, we're going to go in here and we're going to pull up the breakpoints window, and we're going to get rid of it. So there's GL finish. You'll notice not only can we set breakpoints before or after a function call, but we can also stop executing functions. So I'm just going to disable that function. This is a favorite tool at Apple, by the way.

When we catch applications doing things, we'll just disable it, right? So now you can see that things are looking a little better, right? So now we got rid of the GL finish. Now we're only spending 24% of the time in OpenGL. We're not spending the 65% that we were before. We are spending the time where we want to be. We're spending it basically in GL begin, where some real work is going on of getting the data up to the system. And things look good.

Now, I talked before about double buffering and the importance of double buffering data so it stays asynchronous. Well, this application has the ability to switch down to one buffer, right? So I can make it look like I'm only feeding one buffer at a time. And you can see I'm stuck on texture zero. So real quick, let's just see what the performance impact is. So right here, I'm at about 500 megabytes a second.

If I'm at five, I'm at about 600, sometimes 630 megabytes a second. So quite a performance difference by stalling, having the hardware stall on the CPU, having to prepare the next texture. So let's look at the difference of what the call stats will show in this case. And what we're going to do is we're going to pull up the driver monitor. So let's look at a couple of things in conjunction. Okay.

Okay, so quickly here, let me move this back up a little bit. So what we're seeing here on the driver monitor is we have three lines I'm drawing. I'm drawing the hardware wait time in red, which represents the total time the CPU is waiting for the hardware. And so anytime the CPU is blocked up against the hardware, it's going to start registering wait time.

In the yellow, I'm measuring texture page-in bytes. So since this is a texture demo, it's uploading lots of textures, I'm going to record the number of bytes worth of textures per second I'm sending up to the hardware. And green is the swap complete wait time. Now, let's see what happens.

We can see what the changes are when I go through and go from single buffer to double buffer, and we can watch the effect that has on some of the statistics and give you a little bit of an idea of how to use this tool and watch for the changes.

Okay, so now that was single buffered, and you can see up in the stats that when I'm single buffered, I'm spending all my time basically in glTexSubImage2D. So as I change the pixels for that texture, I'm spending all my time there, and I'm spending my time there because I'm blocked against the hardware.

The hardware maybe hasn't completed uploading that texture, and the CPU is ready to give it another one, but the CPU has to wait for the hardware to be done. So we're going to block, and that's the effect that single buffering has. Okay, so what's happening to me is that the CPU is not running asynchronous to the GPU.

So now if I move this up and I double buffer this, we can start seeing some of the effects, and I'm going to change a couple options here to give me a little bit of a better vantage point. Okay, so again, the red is the hardware wait time, and you can see that when I went to two buffers, and it's subtle, so you have to watch, the red line went down, so the CPU's now waiting less on the hardware. And you can see that the yellow line went up, meaning that I'm getting more bytes per second up to the graphics card. So by double buffering, I have made myself more asynchronous to the GPU, allowing for better parallelization and less blocking on the CPU's behalf.

So, there's a couple other things here. Let's play these stats again and just look at what effect that had on the stats. So previously, I was spending all my time in TexSubImage. I still am. Let's bump it up a little more and see what happens here. If we go up to five like we were.
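The reason extra buffers reduce blocking can be modeled with a small ring of texture slots. This is a toy model, not the demo's code: `TextureRing`, `ring_acquire`, and `ring_release` are hypothetical names, and "in flight" simply means handed to the GPU and not yet finished.

```c
#include <assert.h>
#include <stddef.h>

/* Tracks which of `count` texture buffers the CPU may fill next. */
typedef struct {
    int count;      /* number of buffers in the ring */
    int next;       /* slot the CPU will fill next */
    int in_flight;  /* slots handed to the GPU and not yet finished */
} TextureRing;

/* Returns the slot to fill, or -1 when every slot is still in flight
   and the CPU would have to wait for the GPU. */
int ring_acquire(TextureRing *ring)
{
    assert(ring != NULL && ring->count > 0);
    if (ring->in_flight == ring->count)
        return -1;
    int slot = ring->next;
    ring->next = (ring->next + 1) % ring->count;
    ring->in_flight++;
    return slot;
}

/* Called when the GPU has finished with the oldest in-flight slot. */
void ring_release(TextureRing *ring)
{
    assert(ring != NULL && ring->in_flight > 0);
    ring->in_flight--;
}
```

With one buffer, the second acquire fails immediately, which corresponds to the CPU blocking in the texture update call. With two or more, the CPU can prepare the next texture while the GPU is still consuming the previous one.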

So now the blocking point switched back again. So you can see that we're actually able to catch the driver blocking at different points as we move to different numbers of buffers. So we can see that double buffering wasn't quite enough, was it? I can see that double buffering doesn't quite get me the same behavior that three buffers does.

Now, one thing to watch out for that can potentially fool you is that there's only limited numbers of different types of resources in the driver. As you vary the way your application works, you can actually start consuming those different resources. And when you consume a resource, the driver's going to have to block waiting for that resource to become free.

So what is happening here is that as I'm only at, say, three buffers, I'm running out of one type of resource, and that is I'm probably blocked up against the hardware waiting for the completion of that command buffer. But when I go to five buffers... It changes because I believe I'm blocked against swap buffers.

There's a particular packet type in the driver that is needed to swap, and there's only four of them. When I switch up to five different buffers, I've now made the CPU so asynchronous that I'm running out of a different type of resource. I'm so separated from the hardware that I'm consuming a driver resource that is making me block somewhere else.
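The way the blocking point moves can be sketched with a small model of two finite resource pools. This is an illustrative assumption, not the driver's actual logic: the names `BlockPoint` and `first_block_point` are hypothetical, and the model simply assumes each frame in flight holds one texture buffer and one swap packet, acquired in that order.

```c
#include <assert.h>

typedef enum {
    BLOCK_ON_TEXTURE_BUFFER, /* CPU stalls waiting for a free buffer  */
    BLOCK_ON_SWAP_PACKET     /* CPU stalls waiting for a swap packet  */
} BlockPoint;

/*
 * Toy model: the CPU queues frames without the GPU retiring any.
 * Each frame first takes a texture buffer, then a swap packet.
 * Returns which pool runs dry first, i.e. where the CPU would block.
 */
BlockPoint first_block_point(int texture_buffers, int swap_packets)
{
    int free_buffers = texture_buffers;
    int free_packets = swap_packets;

    assert(texture_buffers > 0 && swap_packets > 0);

    for (;;) {
        if (free_buffers == 0)
            return BLOCK_ON_TEXTURE_BUFFER;
        free_buffers--;

        if (free_packets == 0)
            return BLOCK_ON_SWAP_PACKET;
        free_packets--;
    }
}
```

With the four swap packets mentioned above, `first_block_point(3, 4)` reports the texture buffer as the limit, while `first_block_point(5, 4)` reports the swap packet, which matches the shift observed in the demo when going from three buffers to five.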

So the key points here, though, are hardware wait time is always a good one to look at, and what kind of byte throughput you're getting. So you can look at a variety of different stats for byte throughput. Let me pull up the different stats here. So we can look at Command Bytes GL if we wanted to. For instance, if I poke that down there and I disable this other one. That one's not very interesting.

Actually looks like a bug. In any case, there's lots of different stats you can put up in here. And we're going to be releasing another version of this that has a little more descriptive names, hopefully with some better information. They are a little bit cryptic. If you need some more detailed information, don't be afraid to post information up onto the OpenGL mailing list. Can I switch back to the slides, please? Okay, so let's wrap up.

Okay, so texture optimizations. The goal is to minimize your CPU copies of pixel data. There's different ways to optimize for power of two, non-power of two. Vertex optimizations, you'll want to use the vertex array range for dynamic data with the shared storage hint. And for static data, use the vertex array range with the cached storage hint, or display lists.

Offload work from the CPU onto the GPU with vertex programs. Use threads. You can share different types of data between threads: surfaces, context data. Use draw pixels for one-shot images and copy pixels for fast VRAM to VRAM copies of your pixels. Use the OpenGL Profiler to find hotspots and points in your code that may be getting blocked in OpenGL. And with that, if you have more questions, you can contact myself or Travis Brown.

So, quickly, references. We've got the OpenGL.org web page that you can go to. We have the Apple developer page. And we have some Apple documentation that is available on the developer page. And with that, I'm going to bring Travis up for the road map.

All right, we're rapidly running out of sessions at this year's WWDC. Let me actually skip forward.

The next session in the Graphics and Imaging track isn't specifically related to OpenGL, but it's certainly a popular session nonetheless: Mac OS X Printing. Tomorrow, again, we have Introduction to Core Services. If you're a game developer or a full-screen OpenGL application developer, please attend that session. We'll be covering the core APIs that the system uses to do display configuration and management. Our hardware partners from ATI are going to give us a presentation on Friday. Well, actually, tomorrow is Friday. It's Cutting Edge OpenGL Techniques, where they're going to really show us some of the absolute latest things they're able to do with their current Radeon products.

We also have a session on accessibility. This actually will contain some content that may be of interest to game developers, because we'll be covering, at least in a slide or so, issues affecting the use of assistive technology, which is software that adapts the function of the computer, with OpenGL applications that take over the full screen. We have a suggestion there for some possible ways that you can ensure compatibility.

And then we have, historically, the last session of WWDC, which is the Graphics and Imaging Feedback Forum. If you want to voice opinions, give suggestions, please take the time to attend the feedback forum, because that's where we get a lot of the information that we use to create great new features in the operating system for next year.