Digital Media • 59:13
Delve into performance tuning in this in-depth session on optimizing OpenGL code for Mac OS X. Learn about the fast paths through Apple's OpenGL implementation and graphics pipeline. This lecture includes specially developed tools and techniques for profiling, debugging, optimizing, previewing and compiling. Specific API performance insights will also be presented.
Speaker: John Stauffer
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.
Hello, good afternoon everyone. Please have a seat. Welcome to session 514, OpenGL Performance Optimization. This is the last dedicated OpenGL session at the conference and a must-attend for all of those coding for OpenGL. We're going to show you a specially designed tool for profiling and optimizing OpenGL code, and that tool is really cool.
The OpenGL Profiler is going to ease your way to improving the performance of your application. But most importantly, you're going to hear the do's and don'ts of OpenGL coding from the man responsible for Apple's OpenGL implementation. And I'm happy to introduce to the stage Apple's OpenGL manager, John Stauffer.
Thanks Sergio. Okay, today we're going to talk about how to optimize OpenGL and let's jump right into it. So that's me. I manage the OpenGL engineering group at Apple. We have a great team of people who work really hard to bring you a quality OpenGL implementation. And what I want to talk about today, briefly, is more of a summary of what we see developers coming to us with over the year. So we want to give you some basic suggestions about how to use OpenGL, what will work well for you, what to avoid, try to direct you down the right paths.
And also to point out some of the recent additions to OpenGL, both in optimizations and in extensions. So we've been working hard to optimize current paths that you may already be using. So you may get some benefit from OpenGL just by installing Jaguar. And then the extension side, there's some really advanced optimization extensions that you'll want to try to use. And then we want to show you how to use them if you can. It'll get you a lot of performance if you can utilize these extensions, as I'll show in the demos.
Using threads. The use of threads can bring you a unique set of performance optimizations depending on what type of rendering you're doing. We'll go through a couple possible uses for threads and how you can try to leverage them in your application. Tools. As Sergio mentioned, we're going to be shipping a new tool in Jaguar called the OpenGL Profiler.
It is targeted towards helping you understand what your application performance issues are and helps you try to identify where time is being spent in the OpenGL pipeline. So you'll see that it's a powerful tool once we go through it and hopefully it'll give you some benefit when you start using it.
Sampler. I briefly mentioned Sampler here. I'm not really going to go into Sampler much, but it is a tool that you should use in conjunction with our Profiler. It's a very powerful system-level tool that will give you a broader view of your application's performance issues. We won't be focusing on it, but it is a tool that you ought to be familiar with. And then lastly we'll go into a little bit of where to look for help.
So, every now and then I'm going to pull up a slide titled Basic Optimizations. These slides are just the do's and don'ts. The first thing that I want to talk about under basic optimizations is that people need to identify, in their application, what their bottlenecks are and what kind of data they are primarily passing around the system.
So what we usually find in applications is they're either passing lots of pixels, so they're moving lots of pixels around, or they're passing lots of geometry. Rarely actually are people passing both, believe it or not. So I tried to point out that you should look at your application and try to identify what is your data type that you're trying to pass around. And we'll talk about specifically optimizations that you can apply to those particular types of data sets.
So once you've identified what you're passing around, you'll also want to look at other things your application is doing, like what's dynamic. Are some of those data sets static, or are they dynamic? The optimizations you'll apply will depend on whether the data is dynamic or static. And also, how complex is your data? There are different techniques you'll want to apply depending on how complex the data you're passing around is, and by complex I mean size.
So if it's very large data, maybe it's something that you'll have to treat in a unique way; if it's smaller, you can probably just pass it down through some of the more standard techniques, let OpenGL deal with it, and not have to get more complicated. One last thing I'd like you to look at in your application: are you spooling data from disk in real time? There are techniques for getting data spooled on and off the disk, and we'll go a little bit into that.
So first, I'm putting up a real basic drawing of the OpenGL pipeline, so we can get an idea of the different areas of the pipeline that we want to think about as we're talking. The first one is vertices. Vertices are the geometric data, obviously, that go through the system. And that is primarily what people think of when they think of passing data to OpenGL.
But also there's a category that's equally as important, and that's the state data. State is really anything that OpenGL retains. It's a current state of the pipeline. So state is the configuration of the card or anything that OpenGL is going to not just simply pass through but retain. And a lot of that state can actually be retrieved by you back out through the API.
And as you'll see in some of the discussion, there is an interesting issue when we start talking about caching data in the system. There are interesting performance issues related to the bookkeeping we have to do in OpenGL to keep that state; we may have to make a copy of it and keep it.
So those are the two categories. Textures fall under the state category in this diagram that I have here. When you're passing a texture, really a texture is a piece of state that you can retrieve back. It can be bound. It can be deleted. It can be replaced and such.
So the first thing, my first suggestion, I highlight in red because I want to make sure that everyone understands. We ask people to almost never call glFlush. And you'll see at the bottom, to never call glFinish. The reason I say almost on the top one is that the only reason you'd ever want to call glFlush is if you're a single-buffered application.
And we don't even like applications to be single-buffered. We'd rather you be double-buffered. So we really ask that people try to stay away from these. And I brought this up first because it is the number one easy performance trick that you can do for your application. And if you don't think you're calling these, but you're not 100% sure, search your code, look for glFlush and glFinish, and if you find them, comment them out. If you're a double-buffered application, just comment out those lines because you don't need them.
They can give you potentially up to about a 20% speed improvement if you're calling glFinish, just by commenting out that one line, because glFinish is actually a synchronous call that will truncate the command stream, send it to the graphics card, and wait for it to return after the graphics card is finished.
So what you've in effect done is you've stalled the CPU waiting for the graphics card to finish the processing of the data. So you're gonna wait for the graphics card to eat all the data, return back around and tell the CPU that it's done. So the CPU's stuck, it's idle, it's not doing anything while that's happening and you're fundamentally operating in synchronous behavior to the graphics card. You're not operating asynchronously to the graphics processing unit.
So, some other basic suggestions. We're going to go over the basics and then we're going to get into some more complex things, since the title of the session is advanced. But we'll cover these basics real quick. So, avoid using glReadPixels. ReadPixels calls an implicit glFinish. Anytime you retrieve pixels back out of the system, we have to flush the entire command stream, because the pixels that you're reading in the command buffer (I'm sorry, in the frame buffer) need to be the pixels that represent the state of the system at that point. So we need to guarantee that we've rendered everything. So, ReadPixels calls an implicit glFinish, and it'll have the same performance impact as if you were to call glFinish, and then some, because obviously you're then reading data back across a bus.
So, really the only time you want to be calling glReadPixels is if you're looking to spool data off the graphics card and save it to disk. Maybe you're using the graphics card for rendering a sequence and you want to save it.
We discourage it from being used for things that are meant to be interactive, like saving a copy of something so you can keep refreshing it. There are better ways to save data for refreshing dirty areas on the screen. ReadPixels is usually a slow way to do it. So we recommend that you look into the other techniques first before you resort to ReadPixels.
So the second item here is the parallel to ReadPixels: glDrawPixels. And we also recommend that you don't use that function call, because it's a very hard call to optimize in OpenGL. The reason is that it's a call that requires us to copy the data. And as you'll see in some of the demos that I'm going to give, there are ways to allow us to not even copy the data, but in fact have the data read directly from your memory up to the graphics card, so the driver's not copying the data.
So we're going to advertise to you that the best thing to do is to use APIs that allow us to operate asynchronously, where the CPU is allowed to operate asynchronously to the graphics card. DrawPixels really doesn't do a good job of that. There are better ways to do it.
So before we jump into some more complex subjects, last suggestion here, minimize state changes. State changes, again, are textures, enables, disables. Anything that you're fiddling with in the state machine, you want to minimize. It has an impact on the system. Usually the impact is that we have to synchronize some software. We have to go notify the graphics card. It has to change the state of the graphics card, maybe do some conversion of the state into something the graphics card understands.
So there's processing that goes on in the background. Some state changes, while they may seem trivial, can add up to quite a bit of time being spent. And it can be a surprising amount of performance loss if you're really banging on the state machine.
Okay, so what's new in Jaguar? So even though I mentioned previously that we discourage you from using draw pixels, read pixels, there they are, we do understand that people have applications and they have code that previously was written. They don't have an opportunity to use other techniques or it's just very inconvenient for them.
So we actually are spending quite a bit of time optimizing those paths for those people. In Jaguar, you'll see that DrawPixels now actually will, under the covers, simulate the technique that we're trying to advertise to you. It will copy the data once, so there'll be a data copy, but once we copy the data, it will do an optimized upload to the graphics card.
So it won't be as good as some of the other techniques, but it gets as close as we possibly can. So, glReadPixels. The previous release of OpenGL would actually use the CPU to read the pixels back across the bus, and reading the pixels back across the bus obviously is a bad idea. You're consuming CPU time and you're only nibbling away across a PCI bus to get the data back.
So now we DMA the data. We have the graphics card push it into a piece of memory, and it's significantly faster. The numbers that we're seeing are about an order of magnitude performance increase: we went from about 6 to 9 megabytes a second to 60 to 90 megabytes a second of ReadPixels performance. We're looking to improve that as well. We think there's still headroom to grow and improve that.
But that's something that's new. So CopyTexSubImage. That is now a VRAM-to-VRAM copy, so it will give you an on-card copy from one region of memory to another. It will never touch the CPU. And if you read that texture back, beware that we'll have to update the system copy, because the system copy has gone stale.
We've said this one is not the master copy anymore, because you have updated it on the video card. So if you read that texture back, if you want those bits, we will have to go through an operation in the driver to read the bits back off the graphics card and put them in the system copy before we give them to you. So beware that if you're using that call, don't ask for the texture back, because it's going to take a little while to get it.
Display lists. So display lists are now very optimized. They use some of the extensions I'll be talking about internally to OpenGL, one of them being the vertex array object. The vertex array object is built on top of vertex array range and offers very good performance for static data. And it's our intent to continue to optimize this and improve the algorithms.
Basically, how it works right now is that we take the data that you put into a display list, we will examine it, post-process it into something that's more optimal for the hardware, cache it into that optimal format so that when you go to draw with it, we're not reformatting it with a CPU or copying the data around. Basically, at that point, all we do is submit it to the graphics card. So drawing a display list now is fairly low burden on the CPU. It's just a matter of a submission call to the graphics card for drawing. So vertex arrays.
I'm not talking about vertex array range here. I'm talking about just standard vertex arrays without any extensions. We've optimized that as well. This is better than it was; it's still not the best you can do, as we'll see later. Image processing. So we've been improving our image processing code. There are a number of pixel formats that we have optimized with AltiVec, trying to get the maximum performance of texture upload and image conversion through the system.
Okay, so I'm going to jump into some specifics here. Let's talk about the texture pipeline a little bit. So the diagram we see here shows the basic path that textures take to get to the graphics card. So each of these blocks... Okay, how about now? Okay, how much did you miss? It was boring anyways.
Okay, so, the two boxes that I have up here, those represent the number of times a texture may get copied as it goes through the system. So, as I talked about before, textures are state, and we have to retain that state, because you may ask for it back.
So, what OpenGL does is make a retained copy of the texture, in case you ask for it back. We take a texture, we copy it into a retained area, and then the first time you go to draw with it, the graphics card takes that retained copy and copies it into a hardware-specific format for uploading to the graphics card. So if you're looking at using the standard OpenGL texturing path for spooling data up to the graphics card, you're going to see that there are two copies of the data.
We're going to undergo two copies. So, the optimizations that we want to show today are how to eliminate both those copies. So, there's extensions that will offer a way to eliminate those copies and get you substantial performance increases. So, again, this is a standard OpenGL pipeline. We'll see how we can improve it.
So, again, some basic texture optimization suggestions. You want to think about, again, how your application is using textures. You want to think about how much room you have on the graphics card and scale your textures to fit the hardware. The best thing to do, and I'm not going to go into it this session, is to examine the graphics card for how much video memory is available, and scale your textures so that they'll approximately fit that piece of hardware.
The second thing is don't second-guess OpenGL's texture paging. The texture paging in Apple's OpenGL is very efficient. It uses very little CPU time. Basically we've already pre-compiled the data into a format that's ready for the hardware to consume. So if the texture has to be uploaded to the graphics card a second time, it's very simple for us. It's very similar to what we do with display lists, for instance.
We post-process it into a format such that the next time through, all we have to do is submit a token into the stream and upload that texture. So if you think you're going to be running into a texture paging condition, you might want to not try to second-guess OpenGL's texture paging, because what you'll end up doing is fighting it or being inefficient with video memory usage.
So the formats that you want to use for textures, preferably, are GL_BGRA with the GL_UNSIGNED_SHORT_1_5_5_5_REV type and GL_BGRA with the GL_UNSIGNED_INT_8_8_8_8_REV type. And those formats are, believe it or not, the native pixel formats for the Macintosh. The naming convention came from the ARB, so they're a little confusing.
The way the naming works is that the REV on the type reverses the nomenclature of the BGRA, so it ends up being ARGB. And that is the native pixel type of the Macintosh. So these are the data types you want to use for uploading images. These will be the most direct path through the system.
Okay, so let's get into optimizing that diagram I showed you before and how to remove some of those copies so you can get maximum performance. The first texture extension is GL_APPLE_client_storage. It eliminates the first block, the first copy that OpenGL makes, which was the retained copy. What this extension does is tell OpenGL: I don't want you to keep a copy of it, I want you to use the application's copy.
So the caveat to using this extension is that you have to keep a copy around for OpenGL. You have the texture, you use this extension, and now OpenGL is simply going to keep a pointer to your copy of the texture, such that if we need it again we're going to come back and ask you for it. So you can't delete the texture out from underneath us. You've got the retained copy; you own it. We're not going to copy it again for you. So the benefit of this is obviously that we're not going to copy the pixels that you already have.
So obviously there are some caveats: you need to use a format that we can be compatible with, and such. But the other benefit is it can save memory. If you're going to keep your copy of the texture around anyway, there's no use in us keeping a copy as well. So we can eliminate one copy of that texture residing in system memory, save some memory, and maybe up the performance. You just have to follow the rules and keep the texture around in case we need it later.
Okay, so the next extension is GL_APPLE_texture_range. This extension is very similar to the vertex array range: you are telling OpenGL, I've got this texture, please map it into AGP space. So you've allocated a piece of memory and you've handed it to us.
We use this extension to define that piece of memory to OpenGL, and we map that memory into AGP. Given that you've followed some basic rules about the format of the texture, if you've given the texture in one of the preferred formats I previously listed, we will map that memory into AGP and the graphics card will talk directly to your memory. You won't even have to convert it to a hardware-specific format.
You've basically already provided that, and what we're going to do is map that into AGP and have the graphics card DMA it straight from your copy. So the performance gain, and the CPU burden that we lessen by doing this, is quite a bit, and we'll see in a demonstration momentarily how substantial that impact is.
So one thing I want to mention is that there is a synchronization issue. If you've got the graphics card coming around spooling bits out of your memory, obviously that's an asynchronous process. We don't know exactly when the graphics card is going to come around and do this, so you'll have to take care of some synchronization issues, making sure that you coordinate with the graphics card on when you're going to update the memory and find out when it's done being read.
The extension we provide to do this is GL_APPLE_fence, a synchronization extension for putting tokens into the stream; you can then query whether a particular operation is done and find out if that texture is done being used by the current commands in the stream. We'll talk a little more about that later. I wanted to just mention it and we'll go into it more.
Okay, so the third extension, off the top, doesn't seem like it has anything to do with what we were talking about. But it turns out that it does. The texture rectangle extension is the third missing piece in our full optimization path through OpenGL. What texture rectangle does is allow non-power-of-2 textures to be defined and given to OpenGL.
It is the fastest path through our system, and there are hardware-specific reasons why. On some of the hardware, power-of-2 textures are required to be swizzled. Since non-power-of-2 textures don't have to be swizzled into a hardware-specific format, we can point the hardware directly at your copy, without forcing us to swizzle it into another copy that we retain. So it is required to use this extension if you want to spool data to the graphics card at maximum performance.
So, we think it's an okay thing that we're requiring you to use this, primarily because if you're using power-of-2 textures, usually you're looking at having the graphics card retain them and use them from one frame to another; you're not looking to spool data up to the graphics card for blitting. If you're looking to play video, or just blit one-shot images to the screen and then discard them, non-power-of-2 is actually okay, because that's usually very conducive to the environment where that situation arises.
So, there are some restrictions on using this extension. The first is that it doesn't allow mipmap filtering. No mipmap filtering means you have just one layer, the base level, with no higher levels of mipmaps associated with that texture. It also doesn't allow GL_REPEAT.
So if you were mapping a texture to a surface, it won't allow you to repeat it continuously across the surface. Again, it's going to be useful for just blitting an image to the screen, but it's not going to be real useful for generic games or other types of interactive applications.
So, these three extensions that I just went through are the exact three extensions that we use to get the Quartz Extreme to run at optimal performance. So, these are the extensions and the techniques that we use. And you'll see, I don't know how many of you have seen it already, but we're able to composite with deep video compositing where we have DVD playing.
We've got ten terminals over it that are semi-translucent, and we're able to maintain a decent frame rate. We're able to do all the things you see on the desktop using those three extensions. So they're very powerful once you get them integrated correctly into your application, once you're integrated to the point where you're able to run asynchronously from the graphics processor.
One last mention about this extension: the texture coordinates are no longer normalized texture coordinates, they are texel coordinates. So if you're using this extension you'll have to adjust your texture coordinates accordingly. Okay, so we're going to run a quick demo. If we can switch to demo machine two. So, the visual side of this isn't all that appealing, but we want to show two things: the CPU monitor and the effect that these extensions have both on performance and CPU usage.
So I have three check boxes here. Each of these check boxes basically represents one of the extensions that I was just talking about. So I'm going to turn these off, and I've got two sliders here. The top one represents how many images I'm paging between, so you'll see the number toggling up there in the middle of the image, just to prove to you that these are different images.
And they are being uploaded every frame, so I'm not caching them on the graphics card. I'm uploading them and deleting them basically. And I've got a frame rate modifier here, so I can dictate how fast I want the frame rate to go. So let's just crank everything up to the maximum, turn all the extensions off. So what you'll see is that I'm basically consuming 100% of one of the CPUs.
So that's as fast as I can go. I'm pretty much CPU limited, and I'm pegged at 22 frames a second, a transfer rate of 95 megabytes a second. So there are two bad things about this. One is I'm consuming a CPU, which means I'm taking that away from what your application could be using. And the second bad thing is that I'm only getting 22 frames a second. Obviously if I'm blitting an image, I'd like to do a little better than that.
So let's do a couple things. Let's turn on client storage, which was the first extension I mentioned. You'll see that we went from 22 frames a second at 95 megabytes a second up to 152 megabytes a second. So right there we got about a 50% speed improvement. But you'll notice the CPU burden is unchanged.
So again, we're still making a copy of the data. So as you remember, we have two copies. That eliminated one of the copies. We still have one copy we're making, so we're still CPU limited. We're using the CPU to make a copy and that's the limiting factor here.
So let's turn on some more extensions and see what happens here. So now if we turn on texture rectangle, we go to 210 megabytes a second. So there we got about another 30% speed improvement. And that was simply from the fact that this is a more optimal path through the driver.
So now the last thing we're going to do is we're going to turn on the third extension. And once we turn them all on, good things happen. So you'll see that immediately the CPU burden went down. And the frame rate went up and now we're pushing 350 megabytes a second. So we're using hardly any CPU usage and now we're pushing 350 megabytes a second.
And the reason is that we're not making any copies with the CPU; we're running asynchronously to the graphics card, only synchronizing when we have to. So there are a couple things about this. One is that I'm overdriving the system. I'm telling it I want 150 frames a second here on my slider. That's not very reasonable, so let's dial it down to something more realistic. Let's say I want to push almost the refresh rate of the monitor, about 50 frames a second.
So you'll see the CPU usage, the little we were using, dropped even further once we stopped overdriving it and forcing the CPU to block against the graphics card. The last thing I want to mention about this is the reason I have the number of buffers up here at five. So five isn't all that interesting, it turns out.
But at least having two is. So now I'm toggling between texture zero and one. And you'll see that if I pump it back up to overdriving again, the performance didn't change. So I could go from here, which is 355, up to here; it's unchanged. The interesting thing is that if I go down to a single buffer now driving the system, then I drop down to 250 megabytes a second instead of 350.
The reason is that double buffering allows me to operate more asynchronously to the graphics card. As the graphics card is reading one buffer, I'm updating the other, right? So I'm ping-ponging back and forth, giving the graphics card time to read the data while I'm working on the other data set.
So you can get some benefit by double buffering. You don't necessarily have to; there's still a very large benefit even with a single buffer. And again, once I dial it down and don't overdrive it, you can still see I'm not using much CPU, and I'm getting a good transfer rate.
And in fact, the lower you dial it down, the more time that gives the graphics card to upload the data, so I'm not hitting up against the graphics card and stalling against it. So this demo kind of shows the benefit that the combination of those three extensions has.
And basically how big a performance gain you can get by using at least some combination of these. You don't necessarily have to use all three; each one has its own benefit, but the combination of the three together gives you the maximum benefit. Okay, can you switch back to slides? We're going to jump off that topic and get into the vertex pipeline a little bit. So the vertex pipeline, I like to think of it as having two distinct paths through the system. We have the immediate mode path, which is the top flow chart there, and we have the vertex array path, which is the bottom chart.
The immediate mode path has an immediate penalty basically built into it, where any data you pass in, fundamentally, we're going to have to retain a copy of. You'll notice in the OpenGL API, I'm not sure how familiar you are with it, but you can retrieve the current color, current texture coordinate, current normal. You can retrieve that data back out of the API. So that means we have to have it somewhere where you can get to it.
So, the first thing we have to do when you pass in a normal, color, or texture coordinate is save it. We have to keep it there because, in the immediate mode path, we're not sure what the next call in the stream is. You may be done there, and the next thing you may do is ask for it back. So we have to have it somewhere where we can retrieve it and give it back to you.
So, that's a penalty of the immediate mode. Then we have to take that copy and push it onto the command stream for the graphics card to consume. So right there are two copies, just like textures: we have to save it in the current state, in a location where we can retrieve it, as well as copy it into the command stream in a hardware-specific format.
Whereas vertex arrays give you the immediate benefit where you're giving us an entire stream of the data and in fact, the definition of vertex arrays in OpenGL is that the current state of the normal color and texture coordinates are undefined after the operation is completed. So, if you pass us a vertex array for drawing, the definition is that it is undefined after that's completed. So, we don't make a copy of it at all. So, we never retain it. If you ask for it back, it'll be undefined what you get.
Okay, so let's go over some basics of how we would go about optimizing this path. The first thing we would do is try to avoid the immediate mode path altogether. That's a path that is just not optimal. It never will be as optimal as it could be.
So if you're stuck on that path, let's just talk about that. Maximize the number of vertices per begin/end. So what you want to do is pack as many vertices between glBegin and glEnd as possible. And we'll go over a code example of that in just a second. Use efficient primitive types.
So use efficient primitive types. Use triangle strips, quad strips. That reduces the number of vertices per primitive, giving you more efficiency. And use aglMacro.h and CGLMacro.h. I'm not going to talk about those much, but what those do is allow you to drill down into the OpenGL API slightly, down a layer underneath our top-level framework, and that reduces the function call overhead. It also removes some of the lookups of the current context. So if you're interested, write those down. If you're using the immediate mode path, look at those.
Those macro files, if you utilize them, can give you about a 20% speed improvement. You're spending a lot of time in function calls when you're on the immediate mode path, and those will show you how much. Okay, so some extensions for how we optimize this path.
So you'll see there's a little new jelly button up there. GL_APPLE_vertex_array_range, we've talked about a little bit already. It's similar to the texture range extension, where you are basically giving us a pointer to memory. We are taking that pointer of a specified format, we're mapping it into AGP, and then we're allowing you to dump data into that, and we're going to tell the graphics card to spool it directly using a DMA engine up to the graphics card. The CPU will not make a copy of that.
So there are some restrictions. You want to use it if available, but the restriction is that it requires the graphics card to do transform, clipping, and lighting in hardware. So if the graphics card doesn't have that capability, we can't give you this extension. Because if the idea is to not have the CPU touch the data, the graphics card had better be able to do all the operations required for that data set.
So briefly, again, what it does: this is a process of you handing us memory, and just like in textures, we dynamically map that memory into AGP space. In OS X, we have a built-in technology such that we can map dispersed areas of memory into AGP and maintain those, and these memory regions can reside in the client's memory, in the driver, just about anywhere, and we manage mapping and unmapping them. So again, just like in the texture range extension, we have to provide a synchronization mechanism to applications such that if you are writing on the data, you have a mechanism for determining whether the graphics card has completed reading the data before you touch it again.
So another extension is the vertex array object. This is fairly simple. It gives you the same level of functionality for vertex arrays as texture objects provide for textures. It's a parallel of that analogy, such that you can have multiple vertex array objects, and each object can have its own AGP mapping region. So it gives you a paradigm parallel to texture objects.
Okay, so compiled vertex array. This was last year's winner for performance, and now we're demoting it to number two. You only want to use this path if the vertex array range extension is not offered. So this one is still going to be the fastest path for non-TCL cards, and you'll want to use the array range extension if it's available.
So let's go through a quick code example. This code example just starts off with a piece of code, and what we're going to do is walk through trying to optimize it and see what kind of steps people should take when they look at optimizing their code. So here it is, an average piece of code.
The first thing we notice is that the state is static. It doesn't change every time through the loop. So let's move the state change out of the loop, and save the graphics pipeline from having to thrash the state.
Because what we don't do in the graphics driver is examine whether the state really changed. We assume that you're doing your job of only sending us state changes when they've changed. So we make the assumption that if you change a piece of state, it's different than the last time. We don't check. So let's move it out of there, let's do our job on the application side, and hoist that state change out of the for loop.
So the next thing we do is notice that we're passing triangles, but we're only passing one triangle per pass through the for loop. So what we want to do is move that begin/end pair outside the for loop and get as many vertices per begin/end pair as we can.
The next thing we want to do is simplify the data types. We were passing a double form of the color. OpenGL is not going to want doubles; it's going to want floats. So let's just pass floats natively through the system and not cause us to do a double-to-float conversion. The second thing we did here is we passed a vector form of the vertex instead of individual components. Slightly more optimal.
So the next thing we do, like I said in my slides, is use optimal primitive types. So we change from individual triangles to a triangle strip. Now you'll see that we've basically reduced the number of vertex calls by a factor of three. So we're going to get significantly better bandwidth doing that.
Now, what we would like to do is use vertex arrays. So I've looked at my code and I said, well, you know what, this is just an array of data. I can pass it through a vertex array. I can eliminate the for loop altogether.
Basically, what this does is move the for loop into OpenGL. It allows us to optimize around the for loop internal to OpenGL. And the last thing I do is use vertex array range to get the maximum bandwidth through the system. So the previous slide showed using vertex arrays.
Now, I add in the three calls to enable and define the pointer to OpenGL, setting up the vertex array range. Now, the one thing I don't show in this slide is the fact that you don't want to be doing all these calls every time. Like, for instance, I'm doing a vertex pointer and enabling some state and setting up the vertex array range.
Ideally, you want to do this as little as possible, just like the other examples of moving things out of the for loops. So what you would like to do is set this up once somewhere in your application and then draw from it multiple times. Use draw elements or draw arrays multiple times to do the drawing. Okay, so let's show a quick demo of that. So we can go to demo machine two.
Okay, so again, this is a demo that I used last year, so if anyone was here, you've seen it before. What I've got is a slider that goes up through the different levels of optimization that I just showed. So the bottom one is doing the individual triangles, very slow.
You can see we're only getting about 690,000 triangles a second. This next notch up shows moving the begin/end pair outside of the for loop and using a more optimized primitive type. So you can see that we about doubled in performance just by doing that. So that was a significant benefit to OpenGL.
This notch here, and each time, by the way, you can see the color coding of where time's being spent. I guess I should have mentioned that before. The red is where time's being spent elsewhere outside the application. The green is time being spent calculating the wave and blue is the time spent in OpenGL. So it's our job as the OpenGL team to minimize the blue. So we're giving you techniques to minimize the blue. It's your job to minimize the red and the green.
So we've gone to draw arrays. We've gotten a little bit better performance. We went from 1.4 million triangles a second to about 1.5 million. This top notch is vertex array range. So now using vertex array range, you can see the time spent in OpenGL has gone down substantially. And we've gone from basically 1.7 million triangles a second to almost 5 million.
So the next thing, obviously, is to utilize the fact that we have multiple CPUs on this system. So now we're pushing about 10 million triangles a second. And remember, we started off at about half a million. So that's about 20 times the speed improvement just by using these extensions. There's no change in the algorithms elsewhere, just in how it's driving OpenGL. So let's switch back to slides.
Okay, Apple Fence. So I've been mentioning that here and there. We'll just briefly talk about it. I'm not going to go into all the details of using it. We've got some other things we want to show and we want to have time to do that. So Apple Fence provides synchronization tokens for putting into the OpenGL command stream to allow you to synchronize with these different extensions that will require you to talk to the graphics card and determine when it's done.
So there's two ways to do it. You can do it synchronously or asynchronously. There's a call that you can call and say, I'm waiting for this token. Please don't return until you have completed that. Or you can do it asynchronously where you can query and say, are you done? And if it says no, maybe you can go off and do some other work. So if there's a possibility you have something better to be doing with your time, you can go do that with the CPU.
So, the other use for this is that it provides a mechanism for synchronizing between threads. It is a general synchronization mechanism. So if you have multiple threads that are looking to perform an operation, you can start looking at this extension as a way to synchronize operations between threads. And again, you'll want to use this for synchronizing vertex and texture range operations.
So using threads. Loading textures in a second thread is one possible use of threads in OpenGL. In a second, I'll show a brief example of that. But one benefit of that is you can have two shared contexts talking to a single pipeline through the OpenGL stream. And you can use one thread, for instance, for loading those textures, and another thread for drawing. And then start utilizing the preemptive nature of our operating system to flatten out the time spent in a given operation.
So what you don't want to do, the danger of using threads, is that people immediately want to start calling a single context from multiple threads. While we do allow that, there's a danger: if you have two threads talking to a single OpenGL context without careful synchronization between those threads, you'll crash your application. What happens is you'll be poking data into the OpenGL stream from multiple threads, confusing the OpenGL engine, contaminating the command stream, and bringing the system down. So just a note about threads: if you do want to drive multiple threads, be very cautious how you do it. You can get yourself in trouble.
Okay, so a brief diagram here. What I was talking about before was using two contexts, two threads. These contexts in this diagram are set up to share textures so that one context can be loading textures while the other is drawing. And it looks basically like this, where the texture is cached into the context through the first thread and then utilized and drawn by the second thread. And let's do a quick demo of that. So if we can go to demo machine two.
So this is also a demo that I did last year, and I felt it was important enough to leave in the material. What this does is something similar to a screen saver. In fact, let's just kill the CPU monitor here. Let's get it out of the way.
So what this is doing is if you were to hear the hard disk in here, we have one thread spooling data off disk, decompressing it, pumping it into the OpenGL stream while the other thread's drawing. And what this allows us to do is to keep both CPUs busy with the process of getting data to the screen. So what you'll see is this is similar to screensaver on steroids, basically, right? You can pump up how quickly you page through the images. And again, this is paging this data off disk real time.
Spooling through it, displaying it, and you can see you can get nice smooth animations even though the disk is going nuts over here. Okay, let's switch back to slides. We're moving kind of quick because I wanted to save some time for the profiler. Profiler is a great tool.
So the OpenGL profiler, let's get to a good subject here. This is a new tool that we're providing. It has a lot of powerful features you'll want to look into and several things that it can try to help you with. As the name implies, it can help you profile, but it also potentially can help you debug your application.
And I'll show you some potential means of utilizing it for that. So one of the things it does is it provides general statistics of OpenGL calls. It also provides a call trace. So it captures a stream of your OpenGL commands. And you can view that for the sequence of commands that OpenGL is seeing coming from the application.
The other thing it does is it has a driver stats window. So you can monitor kernel level data and see what the graphics card's doing. There's lots of useful data. There's about 25 parameters you can monitor that are kernel level driver data. That is a direct view into the driver level picture of what's going on.
And the other thing it'll do is allow you to set breakpoints. So, for instance, you can set a breakpoint, and at that breakpoint you can view the call stack of the application. You can also look at the OpenGL state. So if you're trying to debug something or look at the state of OpenGL at a particular moment in time, you can set a breakpoint, stop there, and examine the contents of the OpenGL state machine at that point. The other functionality it has is that you can no-op and profile any particular function.
So, for instance, you can go to the OpenGL API and say, I don't want to call that anymore, and you can just hit a no-op command and it'll turn that function off. You can also turn off profiling for a particular function if you're not interested in it, or if profiling it is causing some performance issues; turning it off will free up profiling of that particular function. You can also cause it to flush the buffer after drawing.
So for instance, it has a flag where you can hook up a flush command to any OpenGL drawing command and it'll flush the back buffer to the front and show you the contents after you draw something. So this will allow you to potentially step through an application and watch one item appear at a time in the background. So you can step through it and see the scene being built up. So, let's get to another demo and let's look at that.
So this is on the Jaguar seed. I have a version that's about two weeks newer than what's on the seed, so I'm going to, it's going to be slightly different in a couple cases. So let's pull up a simple application. So what you do is you choose your application and you hit launch.
And now we're going to turn on a couple effects here. Okay, so you've got something going on. You've got some strange chess board going. And what you do is you have this green button here. So the green button is just a master switch for turning profiling on and off. So I'm going to click that and I'm going to start my profiling process. Now we've got a number of options here. You can see we've got the function stats.
Like I mentioned, we can collect stats. We can capture a call trace; that's what this button is. And we can force flush. So we can do these independently. One thing to remember with the profiler: it is performance invasive. If you have a lot of function calls going through the system, it will induce a slight overhead on those function calls. If you're using an immediate mode function path, it can actually start inducing quite a bit. So you'll want to potentially turn things on and off if you're seeing the profiler affect your application performance. So let's look at stats real quick.
So this is what it looks like. It's fundamentally a list of all your function calls going through the system. It times them and it tries to determine where you're spending your time in OpenGL. So for instance we can see here that it says we're spending 98% of our time in GL call list.
So let's shrink that window up a little bit. We're only working with a 1024 by 768 display here, so we're a little bit cramped for room. So let's pull up the call trace just to look at it real quickly. So here it is. It's nothing more than the function calls going through the system and... unfortunately, it's not happy with that. Okay, so there you got a quick look at it. We're going to start it again. This is still in development, obviously. Okay, so let's get back to where we were. Let's get the chessboard doing what we want.
So let's move it off to the side. Let's set these up. Okay, so the other thing, let me just check this other box here. So it's force flush. So there you see it's swap buffering many times a frame. It's just showing the contents of the back buffer in real time. Not really useful when you're not stopped and looking at it in a break point.
Frame rate: there's another button down here that will tell you the frame rate. So if you have an application that doesn't have an explicit frame rate counter built in, you can just launch this and use it to monitor the frame rate right here. Okay, so let's get into the usefulness of this. So, breakpoints. Here's the breakpoint window.
And what this does, this isn't the version I was wanting to run actually, I don't know where this came from. So what this does is this allows you to break at any given function. It allows you to turn profiling on and off and it allows you to turn off execution of that function. So we noticed from the call stats before, let's pull those back up, that we're spending time in our call list. So we'll shrink that back down. So we're spending time in call list. So let's see just briefly here what we can do with that.
So if we type in the call list right there and set a breakpoint, you can see we're stopped there. Now, one of the useful things you can do there is set force flush, and you can basically step through your functions, your call trees, and every time you're at a breakpoint, you can stop, open up the OpenGL state, and look through your state, determining whether at that given point your state is correct.
So the thing that's missing on this version that I wanted to show you and the reason that I wanted to bring a newer copy here is that there's a tab here now and you can look at your call stack. So you can look at the OpenGL call stack, you can look at your state and in conjunction with your call trace window, you can look at how you got there, right? So three pieces of information available at any given breakpoint are the call stack, call trace and the OpenGL state. So potentially, if somebody is having problems with a rendering technique you can backtrack the system and find out why that's there.
Now, as the name implies, this is actually most useful for profiling, at least we think it is. One of the things that we like to do when we are profiling an application is we just let it run. And the first thing we want to do is we want to... Let's turn off force flush.
We want to no-op out all the OpenGL commands and determine how fast the application will run if OpenGL is infinite in performance. So what we would do is we would say execute none. So now you'll see that there's nothing running, nothing visually updating here. But what we do is we go over here and we look at the frame rate counter.
So, we can see we were graphics card limited. This application obviously is not the bottleneck, right? So, you know, I encourage everybody to do that because it's good to just see how fast your application will run without OpenGL and get a feel for whether you can actually blame us or not.
Okay, so let me turn execute back on. Briefly, this is a powerful tool. You all have it on your Jaguar seed. Give it a try. If you have any suggestions, tell us about them. I'm going to show you one last thing here, which is a potential use for the no-op. So if I turn off execution of call list, I can see all my pieces went away. So those are all being drawn with call list. I turn force flush back on, and I want to break at glBegin.
So let's just set a breakpoint at glBegin. And, okay, I'm stopped there. So one potential use is you can sit here and step by step go through your application and determine the contents of it. This is a very simplistic application, obviously, but it potentially gives you the opportunity to see what the back buffer of your application has, what the state is, and debug some of the rendering that you're doing.
I'm going to quit this application and we're going to go back to the slides. Okay, so we're going to wrap up. So, briefly, the optimization techniques that you want to use are application dependent, so there's a variety of techniques, depending on your application and what you're trying to perform.
Try to understand where your bottlenecks are, use the OpenGL Profiler, use Sampler to find where you're spending your time, and apply the available extensions to the problem. We've got a number of extensions, 30 new extensions for Jaguar, that you can look at for optimizing your application, tuning it, and getting it to run great on Mac OS X.
Okay, so for more information, I suggest everybody buy a copy of the OpenGL Programming Guide and the OpenGL Reference Manual. These are great books. I use them all the time. They're always on my desk. Okay, the roadmap. So we're a little bit late into WWDC, so we're already through a few of these, but for reference, once you get back and want to review some of the material that was given here, we're going to show this again. Graphics and Imaging Overview was a great session. If you haven't seen a lot about how to use the general graphics power of OS X, some of these sessions will be good for you to review.
So, the second one as well, Exploring Quartz Compositor, and Session 504, Going into OpenGL Graphics Programming. That one talked about our OpenGL Shader application, which will help you program shaders and write code for the graphics card, and hopefully help you do that a little more efficiently.
Integrated Graphics 1 went into some of the system integration issues: how to do certain things within the operating system, integrate your code better with the operating system, and work with it, and hopefully lead to a better product. Integrated Graphics 2, same thing, went into some of the graphics techniques that are generally applicable for system integration issues. ColorSync, Digital Media, and some game solutions at the bottom there.
Advanced 3D was the session right before this. Hopefully that was useful for people who saw it, on how to put together some of the advanced rendering techniques that NVIDIA and ATI use for producing some of these amazing demos that they do. So 514 is the one we're in now.