Graphics and Media • 1:01:56
In this hands-on session, the Mac OS X OpenGL team will explain the latest techniques to optimize your application's OpenGL usage. All aspects of optimization will be covered including vertex and texture submission, elimination of redundant copies of data, and ideal pixel formats. Threading tips and techniques will also be discussed. Code along with live demonstrations of the OpenGL Profiler and Driver Monitor, and discover how to solve real-world performance problems. You'll learn to maximize your use of the GPU to increase your application's frame rate.
Speaker: John Stauffer
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Welcome to the Maximizing OpenGL Performance session. My name's John Stauffer. I manage the OpenGL Engineering group. So, let's talk a little bit about what we want to learn today. This session is a little bit different than last year, if anybody went to that session; we tried to provide a little bit different content here. What this session is about is giving you a hands-on look at how to optimize your application. So let's go down the subjects here. We're going to introduce a seed program and tell you how to get involved in the seeding program.
We're going to go over some key concepts, some just high-level key concepts that we're going to want people to keep in mind. Basic tips, and they are fairly basic. Then we're going to get into more detail about texture uploading and vertex throughput. So, the two main types of data bandwidth primitives that are going through the system, being pixels and vertices. We'll talk about how to optimize that data throughput. So, the OpenGL seed.
So, for some of you that haven't seen this already, we have an OpenGL seeding program that we have announced for developers. What this allows developers to do is get early access to the OpenGL framework and drivers. And the benefit that brings to you is it gives you early access to features, bug fixes that you may have that you're looking for, profiling and tuning, enhancements that may be coming.
But it gives you early access and gets you started developing with the technology in advance. You see the software updates and the improvements we're making earlier. You don't have to wait for a software update before getting your hands on it and giving us feedback. We want the feedback before we release it the first time.
So the email address to get access to that is [email protected]. And what you'll want to do if you want to become part of this program is send an email to that address, list who you are and why you want to become an active member of this seed, and that email will go to Apple. And we'll sit down and see if we can make you part of that program.
So, key concepts. So this is kind of kicking off what we're going to be doing in this session. What we try to do when we analyze problems at Apple is you kind of put yourself in the mind of a detective, right? You've got to sit down, you've got to gather clues, you've got to try to come to a problem with an open mind and make sure that you are basing your decisions on facts, clues that you're gathering. So what we're going to try to do in this session is give you the tools, the ability to gather clues as to where your performance is going to be.
So I'll be going through some demonstrations of going through the tools, using the tools, analyzing some problems, and seeing if we can figure out what's going on. So, what tools do we want to use? So, for OpenGL applications, clearly the OpenGL tools are an important part of that process. The OpenGL Profiler, the OpenGL Driver Monitor, but also Shark.
Shark is a very powerful tool and a very important part of any attempt to optimize your application. So, some of the things that we will do in this session. So, one of the things that we will do in this session then is start using some OpenGL extensions to optimize some of the problems we're finding. And we'll go a little bit more in detail about what extensions we're going to use, but we're only going to touch on a couple of them. There's lots of extensions out there. We'll touch on a couple.
So some basic tips. I like to repeat these every year to make sure that everyone is aware of them. Calling glFlush is something you rarely want to do. You have to have a good reason to call it. There are a few cases out there where you'll want to call it, but most people don't need it. So if your application calls glFlush, remove it. And if you call glFinish, there's no reason to ever call glFinish. Remove that. Those are the first things you want to do to an application.
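As a sketch of what that tip looks like in practice, assuming `ctx` is your application's `CGLContextObj` and `draw_scene()` is a hypothetical stand-in for your drawing code, a frame loop needs neither call:

```c
/* Sketch of a per-frame loop with no glFlush or glFinish.
   Assumes ctx is your CGLContextObj and draw_scene() is your code. */
void render_frame(CGLContextObj ctx)
{
    draw_scene();             /* issue your GL commands */
    CGLFlushDrawable(ctx);    /* end-of-frame flush; nothing else needed */
}
```

CGLFlushDrawable (or -[NSOpenGLContext flushBuffer]) already flushes the command stream as part of presenting the frame, so an explicit glFlush or glFinish before it is redundant.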
So, some OpenGL extensions we're going to be looking at today: the ARB vertex buffer object, which is an extension for moving vertex data through the system, and Apple Texture Range and Apple Client Storage, which are extensions used for optimizing texture upload. So we'll be looking at those for optimizing the data flow in our applications today. The other important basic tip is to minimize your state changes.
State changes in OpenGL can be expensive. And in fact, if you change state a lot, you can be bottlenecked simply by your state changes, not by your data movement, but by how you are managing your state. If you're doing a lot of state changes and little geometry, little in the way of moving data, then it's your state changes that will be your predominant bottleneck. Okay, so let's get into a demo here. Switch to Demo Machine 1, please.
So what I've got here, let me just run it and we'll take a look. I've got a basic demo here showing a mesh that's been texture mapped, and I want to analyze its performance. So we're going to launch a few tools, and we're going to see what they show us. The first tool I'm going to launch is Shark. Shark's fairly easy to use. We're just going to launch it, hit start to take a sample for about five seconds, hit stop, and let it generate the report.
So, looking at the top items here, the first one we see shows we're spending a lot of time in an interrupt. Okay. Well, that doesn't look good, right? Down at the bottom here, if people can see, we can see that we're using about 39% of the CPU. Well, let's say that for this application, I want to use 100% of the CPU. I want maximum performance.
So, why am I only using 39% of the CPU? Why am I spending 30% of my application time up here in an interrupt? Those are things I'll want to solve. Looking down here, I'm spending about 30% of my time calculating a wave. So let's launch some more tools to see what other information we can gather. I'm going to launch the OpenGL Profiler.
For those of you that have never seen this tool, there's a variety of ways to use it. We'll get into a little more detail later, but right now I'm just going to attach to this application while it's running. Scroll down, find my demo app there. I'm going to attach this tool to it. Now, once I'm attached to the OpenGL Profiler, I can go up here and I can pull down a statistics view. I like to clear it, just to make sure my stats are current.
Let me shrink it a little bit, not to use too much screen space. So, and I'll sort the times based on percent time OpenGL. So, looking at this, there's a variety of information we can pull out of this. One is I can see that the top OpenGL call that's consuming time in this application is draw elements.
[Transcript missing]
And what the OpenGL Driver Monitor does, it queries values directly out of the kernel graphics driver. And the parameter I have preset here is a parameter called CPU Wait for GPU. And what that means is that's the amount of time the CPU is waiting for the graphics processor to complete.
So, what we can see here, let me set this to linear scale here, and I'm using the percent numbers on the right-hand side of the scale. So what I can see is I'm spending about 35% of my CPU time waiting for the graphics processors. So my CPU is just spin looping in the driver, waiting for something to happen in the graphics processor.
Now, that's approximately correlating with Shark. Shark is telling me that something is spending about 30% of the time spin looping. So these two tools are showing me the same thing. Now the trick is to find out what's causing it. What you can do in the Driver Monitor is open this parameters window, and there's a variety of parameters you can investigate that will give you more resolution into these problems.
Anything here that has the word "wait" in it is something that monitors blocking points in the driver, points at which the CPU is waiting for the GPU. The CPU Wait for GPU is the total of all of those. So I'm looking at the total. Now what I want to do is go in and find the individual item that's contributing to that.
Well it turns out, since I set this up in advance, I already know what it is. It's CPU wait for free OpenGL command buffers. It's this command here. So if I select that, I can see the red line and the green line are fairly close to each other.
So now the question is, what's causing that? Well, that's where it gets a little tricky. So let's look at the code and see if we can analyze a little bit more detail as to what could be causing that. So let's quit this application and let's open up my actual source code. And I'm going to... Someone was fortunate enough to put comments in here labeling everything that was bad in here.
So, one piece of information I actually forgot to mention is that what's interesting about this application is that it seems to be locked in at 60 frames a second. That's actually a pretty good clue, because I know that the refresh rate of these monitors is 60 frames a second. So what it tells me is that the refresh rate of my demo is likely somehow locked to the refresh rate of the display.
The one thing I know that can do that is a CGLSetParameter call setting the swap interval to 1. And what that does is tell OpenGL to be tear-free. Okay, so my OpenGL presentation is going to be tear-free, which is good in some cases, but in this case I want maximum performance; I'm not looking for tear-free presentation. So let's go ahead and comment that out, because I'm just looking to get maximum throughput.
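For reference, the call being commented out looks something like this, a sketch assuming `ctx` is the demo's `CGLContextObj`:

```c
/* Sketch: controlling the swap interval with CGLSetParameter.
   Assumes ctx is your CGLContextObj. */
GLint swapInterval = 1;   /* 1 = sync to the display refresh (tear-free),
                             0 = present as fast as possible */
CGLSetParameter(ctx, kCGLCPSwapInterval, &swapInterval);
```

Setting the interval to 0, or removing the call entirely, uncaps the frame rate at the cost of possible tearing.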
So we commented out the CGLSetParameter call. We're going to close that window, launch it again, and let it compile. OK, so we went from 60 frames a second to 118, 115. Let's go ahead and take another Shark sample and see what that looks like.
Okay, so we can see that, in fact, that was it, right? So the top item, which was blocking spin looping down in the kernel, is now gone, and now we're spending, instead of approximately 30% of our CPU time calculating the wave, we're spending 44. Our performance got about twice as fast.
So, okay, so let's, so we did what we needed to do there. Now looking over at the driver monitor, we can see that our CPU time waiting for the GPU has dropped, right? It went from 30 plus percent down to zero. Okay, so now we believe that we are not artificially blocking ourselves against a graphics processor.
So let's back up a little bit now and review the tools. I just gave you a quick example, but let's go over the tools in a little more detail. So the OpenGL Profiler is a powerful tool that has a number of features. The first screen you see in the OpenGL Profiler is this screen right in the center, and it gives you the ability to launch or attach to an OpenGL application.
So if I wanted to, I could launch an application simply by clicking the Add button, adding an application to that list, and then it'll show up here. Now, the basic difference between launching an application and attaching to one is that when you launch it, you can actually vary some of the startup characteristics of the application. So, for instance, I could vary the pixel format that's used.
For those that program OpenGL, you know that to create an OpenGL context, you have to create, define a list of attributes for how you want that context to be created. So if you wanted to, in real time, without recompiling your application, you could come in here and vary that pixel format attribute list.
So an example of that would be, let's say I wanted to change my depth size. Let's say I wanted to make it a 16-bit depth instead of a 32-bit depth, for example. Okay, so we're not going to use that, but good to know. Another item that is useful for people is the ability to emulate different hardware, and the emulation is simply on the return values OpenGL will give you.
In other words, if your application is querying into OpenGL, this will vary the return values that OpenGL gives your application. It will not actually vary the behavior of a graphics driver. So let's say that this graphics device I have here, I want it to behave like a RAGE 128.
If you look at what that does, is it will vary these return values from OpenGL. Okay, so we have a list of extensions that will simulate a RAGE 128. We have a list of return parameters that simulate a RAGE 128. This is useful for developers who are looking to try to test certain behaviors of their application, how their application responds to different graphics devices with different feature sets. Okay, and we're not going to use that right now either.
So let's go ahead and launch the application we were just looking at previously, but launch it through the OpenGL Profiler. So there it is. Now let's go look at some of the other features of the Profiler. So one of the features is called the trace. I'm going to suspend it here because I don't want it to go too far. The trace gives you a trace view of all the OpenGL calls. It gives you a time for each OpenGL call and how long it took to execute. And it'll also give you a call stack. Now, to get the call stack, I actually have to enable it.
Let me clear it. Let's resume the application and suspend it again just to get a small snapshot. So what we see here then is we see the same basic list. And over in this window, if I select the function, I can see I get a call stack. So every call that you make into OpenGL will be tagged with time, call stack, and the parameters it used to call into OpenGL. So this is useful for debugging, getting you some feedback as to how you're driving OpenGL.
One of the other features is the ability to post-process that data. What you can do is run a script. I've made a pre-made script here called Condense, and I've loaded it as the current script. Now I'm going to click the Filter button. It asks me for a name for the file to save it as, then goes off and post-processes the data. Now let me hide the Profiler and we'll hide Xcode so I can get to the file.
So opening that up, what I did is write a script that generated a report. This report gives me a summary of how I'm driving OpenGL. You can see that what it's telling me is that I'm starting my frame with glClear, calling glDrawElements 239 times, calling glFinish, and then flushing the drawable.
So obviously, one thing that pops out at me is, didn't I just get done telling people not to call glFinish? We're going to go fix that in a minute. So let's make the OpenGL Profiler the top application again. That was the trace window; let's look at some other windows here. We saw the statistics window. Let's resume.
Okay, so here's what the new statistics looks like after we removed the wait for VBL. So we still see that draw elements is the main time consumer. Though we noticed the time spent in the application has dropped from 70% down to about 50%. So we've removed about 30, or I'm sorry, about 20% of the time that OpenGL is consuming out of the application's time.
And But draw elements remains the top item. So one important number I'm going to highlight again here is this number down at the bottom. This number here tells you what percentage OpenGL is taking out of your application's process time. So it's a good way to get an idea of whether OpenGL is your main performance culprit.
So if OpenGL is consuming 10% of your time, you probably want to look elsewhere for how to improve the performance of your application. If it's taking 20, 30, 40, 50% of your time, it's something worth investigating. So that's a good place to look right away for an indicator of what tools you ought to be going to.
Let's go down the list a little bit. So let's open up a breakpoint window. So what breakpoints do is allow us to stop at any function call in OpenGL. So let's pick the flush drawable call. So we stop there. You can see the application is paused. It gives you a backtrace where you stopped it.
If you click on this tab here, it gives you the current OpenGL state. So what you're able to do with this is you are able to examine your entire OpenGL state.
So, well, actually, while we're at a breakpoint, let's pull up the resource view here. The resource view lets you examine some of the resources that OpenGL has loaded. For instance, we know we have the OS X logo loaded here, and we can see from the list that it's under an ID of 20.
We can see it's a rectangle texture type RGBA, size of 768 by 768. So it'll give you a list of your textures. It'll also give you a list of your programs. So if you're using fragment programs, vertex programs, you can also view those. This demo doesn't actually have any, so the list is empty.
So let's look at what else we have here. One other item here that's valuable is to check for errors. If your application is having some trouble, sometimes what's happening is that your application is causing OpenGL to throw an error. And when OpenGL throws an error, it ignores the function that threw the error. The state change it was supposed to execute will not take effect.
So you can simply say break on error, hit continue, and the OpenGL Profiler will watch for any errors in which your application is throwing within OpenGL. Another important feature here is break on thread conflict. What this does is OpenGL has certain rules as far as how threading should behave. This will watch for any violations of those rules. When people use threading in OpenGL, sometimes they get it wrong, and that can cause a variety of unpredictable results.
So this will watch for threading conflicts and help you debug any threading problems you may have. We also have another item down here called break on VAR error. VAR, standing for vertex array range error, and that will watch for any time you're trying to draw your geometry and it happens to span outside of your currently enabled vertex array range.
And the net effect when you try to draw outside of your currently defined vertex array range is that you fall off the fast path you're trying to enable, and it falls onto the immediate mode path.
The extension effectively gets disabled. So it's good to know if you're running into that case. So, I'm going to leave that up, and we're going to go look and see what else we have here. So, pixel formats. A pixel format window tells you what pixel format attributes you passed into OpenGL at the time you created the context. Useful for reference.
So, let's talk about scripts. The Profiler allows you to execute scripts. You can basically add a script and write it. Let me call one, say, "clear color", and then I can go over here and write a glClearColor call. Let's say I want the background to be red. Okay. Now, let's see here. I've got to get to a breakpoint to actually execute scripts; they will only be executed at breakpoints.
Okay, so now we're at a breakpoint. I'm going to say execute that script, and I'm going to tell it to go again here and see what happens. So you can see that I overwrote some of the OpenGL state. And you can also see in the state window that it told me I did that.
So anytime you modify state from breakpoint to breakpoint, the state window will highlight in red the class, the state that actually was modified. So I can open it up and I can see that I modified that clear color value. So it will highlight state changes between breakpoints. Okay, so let's set it back to black because red is kind of annoying.
Close that window. Let's tell it continue. Oops, I set it to white. We'll leave it white for now. So what else do we want to look at? Buffer views. Another thing you can do at a breakpoint is look at the back buffer, the alpha buffer, or the depth buffer of your application. What this allows you to do is incrementally view the buffer contents.
So for instance, if I wanted to incrementally look at my back color buffer, let's choose something other than flush drawable. Let's choose--
[Transcript missing]
Get to that. Now, on the depth buffer view, you can see the slider at the top. What the slider does is helps you analyze how many significant bits you're using out of the depth buffer.
So what you do is you click this magnifying glass. The sliders will zoom in on the number of significant bits out of the total number of bits available in the depth buffer and tell you how many of those bits you're using. So I can see that I'm only using about 10 to 15% of the precision of the depth buffer. It's not very efficient. I should be using more because I've allocated that memory and I'm wasting it, basically. So let's go and do some debugging on that. Let's continue. Let's quit the application. I'm going to debug that problem of not using all of the depth buffer.
Okay, so there's the glFinish. While we're here, why don't we comment out the glFinish. Before this session, I played around with a few values here. The call that actually sets up your frustum, determining your near and far clipping planes, is the glFrustum call. glFrustum is also called from gluPerspective, which is a utility function for setting up your viewing projection matrix. So I'm going to comment in the one that I had set up.
And what that does is it changes this 10. to a 1.2. And what that does is moves my near clipping plane out, thus moving it nearer the object, giving me a tighter range of values around the actual geometry that I'm drawing. So let's save that. Let's run the application.
So there's the application again. Let's select flush drawable. Oops, sorry, I've got to attach to the application. So let's attach this time. Scroll down, find the application, attach to it. And there we are. We stopped at this breakpoint. So let's see here. I stopped at a point that it's not going to be available. Let's do this.
I'm actually going to change the way I did that. So let's do that again. Okay, now let's go look at the depth buffer. So now we can see from the previous settings, it's all red. Red tells me that the current range of the slider is outside the bounds that define the entire region. So let's click on the magnifying glass. Now we can see that we're using about 90% of the precision of the depth buffer. So this is just a simple tool to help people analyze how effectively you're using the precision of the depth buffer. Let's move on.
Now, we saw a little bit with the Driver Monitor. I'm going to explain a little bit more about the parameters of the Driver Monitor. So the Driver Monitor, again, is querying values directly out of the graphics kernel driver. So these values get exported up to this tool, and they allow you to monitor a variety of parameters from the driver. So to talk a little bit about some of the parameters you can look at, you can look at things like current free video memory.
So if I want to look at current free video memory, I select that item and drag it down into the list, and now I can see what the values are. Let me change the color on this to, say, blue. So I can see that on this graphics device, I have something over 200 megabytes of video memory available at this moment.
Other things I can examine are data throughput. So, for example, let's scroll down. You can look at command buffer data for DVD, and you can look at OpenGL command data. Command data is any data that goes through the command buffers.
There's several different ways of transporting data up to the graphics processor. One is command buffers, another is data buffers. So let's look at data buffers, because we tend to try to sneak a lot of things into data buffers. But let's just see how they're being used right now. Let's change that color to yellow.
So we don't see a lot there, do we? Let's deselect a few of these others here. Okay, so let's look at OpenGL command data and see if the data that we're sending through here right now is going through the command data channel. And there, sure enough, there it is. Looking at the green line there, we're seeing something over half a gigabyte a second of data going through the command buffer channel in OpenGL. So that's where the bulk of our data is going.
You can look at these different parameters. If you mouse over them, they will give you a description. Okay. And let you analyze OpenGL in a variety of ways. So let's hide that application. Okay. So let's -- I think we've done enough there. Let's switch back to the slides.
Okay, let's talk a little bit about what we were just seeing there and how we're going to further improve that application. When we look a little bit at the OpenGL pipeline, in terms of vertex throughput, the OpenGL pipeline can make copies of the data at various stages.
When you're using immediate mode, what you'll see is that the application hands the data in, we have to copy it off into a current vertex data structure, and then we copy it out of the current vertex structure into the command buffer. So you end up with your copy of the data, and then OpenGL will make two copies. A better way of doing it is to use vertex arrays.
Vertex arrays optimize away the current vertex state, so we don't have to maintain that in OpenGL. What happens is you hand us an array of data, we take the data directly out of that array, and we copy it directly into a command buffer to transport across to the graphics device. So this is a fairly decent optimization, and it's actually what my demo is using. It's using glDrawElements, which is a vertex array data path.
So let's talk about some basics here then. What you want to do when you're optimizing data throughput is eliminate data conversions. Some data types you can pass into OpenGL are not necessarily natively supported by the graphics device. So some basic data types you probably want to stick to are GLfloat, GLshort, and GLubyte.
Those tend to be reasonably well-supported across the different products. You want to minimize your function call overhead, and there's several ways to do that. One is to use draw arrays. Obviously, what you want to do is to minimize the number of times you're calling to OpenGL, so you're going to want to maximize the size of the array that you're passing to OpenGL at each call. The more data you pass us, the less time we're going to be spending handshaking with you.
You can also use the CGL macros. The macros is a way to reduce the per-function call overhead. I'm not going to get into it too much here, but one thing you ought to keep in mind, if you're making a lot of OpenGL calls, you can reduce that function call overhead by using the CGL macros.
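A sketch of what using the macros looks like; `myContext` here is a hypothetical stand-in for your real context variable:

```c
#include <OpenGL/CGLMacro.h>  /* rewrites gl* calls to dispatch through a
                                 local context variable */

/* CGLMacro.h expects a variable named cgl_ctx in scope; each gl* call
   then goes directly through that context, skipping the per-call lookup
   of the current context. */
void clear_to_black(CGLContextObj myContext)
{
    CGLContextObj cgl_ctx = myContext;
    glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT);
}
```

The win is per-call overhead, so the macros matter most in code that makes many small OpenGL calls per frame.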
And another thing to keep in mind is that the graphics processors are becoming programmable. So what that allows you to do is to consider offloading some of the computational work onto the graphics device. So for instance, in previous years, this exact demo in fact, I wrote a vertex program that offloaded the work onto the graphics processor for actually calculating the wave. Now it depends on what you want to achieve, what effect you're doing, or what performance profile you're trying to optimize for. But in some cases, offloading the computational work to the graphics processor can give you quite a benefit.
So, I've talked a little bit about the extensions already, but for vertex data, again, the ARB vertex buffer object is a useful extension to look at. It is primarily designed to give you a high bandwidth path for uploading data to the graphics processor. Static data you'll want to treat slightly differently as far as where the data should live. Dynamic data you probably want to reside in system memory where the CPU can touch it.
For static data, it's ideal to tell OpenGL that it's static and that we can pick it up and move it in video memory and leave it there, thereby using the internal memory bandwidth of the graphics device, not having then to transport it across the bus every frame. There's two ways to get static data into video memory.
One is using the static type qualifier for a vertex buffer object, or you can just give the data to us in a display list and we try to post-process the data and store it up in video memory without any direct interaction by you. Other than just wrapping your calls with the display list.
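The display-list route really is just a matter of wrapping your calls, roughly like this (the count and index array are hypothetical placeholders):

```c
/* Sketch: compiling static geometry into a display list so OpenGL can
   post-process it and keep it in video memory. */
GLuint list = glGenLists(1);
glNewList(list, GL_COMPILE);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);
glEndList();

/* Each frame, just replay the compiled list. */
glCallList(list);
```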
So with that, now let's go back to the demo one machine and see if we can use that extension to optimize our performance. So here's the application as we left it. For comparison's sake, we see we're running at 120 frames a second. We can see down here at the bottom the CPU is working really hard to make that happen. We're consuming about 100% of one of the CPUs at this point. If we take a shark profile and take a look at what it's doing, again, we're spending about 45% of the time calculating the wave.
We're spending then a bunch of time down here in OpenGL. We see two items down here in OpenGL taking approximately 30% of the time. You know, about 30-35% of the time spending in these top items in OpenGL. So, let's see what we can do to reduce OpenGL's overhead.
So, let me quit this app here. Now what I've done is coded up the use of the ARB_vertex_buffer_object extension. Moving over from straight vertex arrays to a vertex buffer object is actually relatively straightforward. So I've got two pieces of code here: the first one represents the items I have to turn on, and this is the items I'm going to turn off. So there are various places I'm going to comment code out and a couple places I'm going to comment code in.
So the first thing I need to do is I need to bind to a vertex buffer object. It's an object semantic, just like a texture. So you need to create an object, and you create an object by binding to a new object. So I'm going to bind to an object that gives it a name and something I can switch to when I want to draw from that object.
Now, for the remaining two calls: when you use a vertex buffer object, the meaning of glVertexPointer changes. It changes from being a memory address to an offset. That's important to remember, because you may already be using vertex arrays, and if you don't modify the parameters to your vertex array calls, you'll find that things don't work as expected.
And the reason that it's done this way is that the memory is contained within the vertex buffer object. You're no longer going to be passing arrays to OpenGL. You're just telling OpenGL the offset into your object at which you want to pull the data from. OK? So I'm going to comment those two lines in, and I'm going to comment out the two old lines that are no longer using the correct syntax.
So let's go down. Now, I've created the object right up here with the bind buffer. I've set the offsets of where the data's going to live in the object. And now down here, once I've loaded the data (see here, I'm calculating the initial wave values), I'm ready to give this data to OpenGL.
And the way you initially give data to OpenGL is a call named glBufferData. glBufferData will then take the pointer you give it and copy the data into the object. So it's not going to retain a pointer to that memory. It's going to copy the data into the buffer object.
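As a rough sketch of the setup just described (this is not the demo's actual source; the buffer name, vertex count, and data pointer are hypothetical, and a current OpenGL context with the ARB entry points is assumed):

```c
/* Create and fill a vertex buffer object (ARB_vertex_buffer_object). */
GLuint waveVBO;
glGenBuffersARB(1, &waveVBO);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, waveVBO);

/* glBufferDataARB copies the data into the object; OpenGL does not
   retain the application pointer. */
glBufferDataARB(GL_ARRAY_BUFFER_ARB,
                NUM_VERTS * 3 * sizeof(GLfloat),
                initialWaveData, GL_DYNAMIC_DRAW_ARB);

/* With a buffer object bound, the last argument to glVertexPointer is
   a byte offset into the object, not a memory address. */
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);
```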
Let's see what else we have to do. OK. So now, when we calculate the wave every frame, what we need to do is bind to the object, because we're going to have to modify the data in the object. So I want to make that object current.
Now, since the data lives in the object, it doesn't live in my space. The data actually is part of OpenGL's state, and it lives in OpenGL's memory space. I'm going to have to retrieve that memory, so I do that by calling glMapBufferARB. What that does is tell OpenGL: go give me a pointer to your object memory, I need to modify it. So I call glMapBufferARB, and it returns the pointer. So now I've got to comment out my old code I had here that set the pointer to something else, because that's no longer valid. I'm no longer in charge of the memory.
Now, anytime you call map buffer, you need to unmap it. OpenGL's buffer objects require you to bracket your access to the data. Before you access it, you call map, and when you're done accessing the data, you call unmap. So, here I am calculating the wave, and I've got various AltiVec commands in there to try to do it fast.
So down at the bottom of this whole thing, I've got the unmap. So now that I've filled in my wave calculation, which I will be executing every frame, I need to call unmap to tell OpenGL that I'm done accessing the data. So now I've done the initialization and bracketed the dynamic modification of the data. Now I have to go modify how I'm going to draw the data.
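Sketching the per-frame update just walked through (the buffer name and the wave routine are hypothetical, not the session's code):

```c
/* Make the object current, map it, rewrite the vertices, unmap.
   glMapBufferARB may block until the GPU is done reading the buffer. */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, waveVBO);
GLfloat *verts = (GLfloat *)glMapBufferARB(GL_ARRAY_BUFFER_ARB,
                                           GL_READ_WRITE_ARB);
calcWave(verts);                       /* hypothetical wave update */
glUnmapBufferARB(GL_ARRAY_BUFFER_ARB); /* done touching the data */
```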
So here in the wave display function, I have, again, the bind to the object. I want to make sure the object's current before I draw from it. And that's it. Everything else is the same. I'm going to issue the draw elements command as I did before, and everything should just work. So let's close this window and give it a try. So remember, before, we were using a hundred percent of the CPU, and we were at a hundred and twenty frames a second. So let's see what difference this makes.
So we're at 140 frames a second. And it turns out that I actually throttled this at 140 frames a second. And the CPU behavior: I'm not quite using 100% of the CPU. I'm using something that looks like maybe 80%. So let's take a Shark trace of that and see what it looks like.
So the wave calculation's still up there pretty high. But the other items of OpenGL have changed. Now all of a sudden I'm seeing this 15% spent in an idle loop, just like I had before. Whereas before I was seeing about 30%, but now all of a sudden this is showing back up.
So I'm going to talk about this in more detail later, but what happens is that when you are sharing data with the graphics processor, which vertex array range, vertex buffer objects, and texture range all do, you are talking to the same memory region that the graphics processor is accessing. You are synchronizing with the graphics processor, and you have to be careful how you do that.
So what this is telling me is that I'm potentially hitting a blocking spot where the CPU is waiting for the graphics processor. So let's go see if that is true. Let's open up the Driver Monitor, which is the ideal tool for analyzing this. And let's again look at the CPU wait for GPU time.
And sure enough, it looks like I'm spending about 15% of my CPU time waiting for the graphics processor, just like Shark told me. So this tool has confirmed it, so let's go dig in deeper and find out where that time's being spent. Well, just like before, I figured it out earlier, so we're not going to waste that time, but I'm spending time in CPU wait for user code. And that's a pretty interesting parameter to see moving around.
I'm going to go ahead and do a little bit of monitoring, because that parameter tells me that a blocking point the developer has put in their code is what's blocking OpenGL's execution. And that can happen from a fence command or a map command. Now remember, we call glMapBufferARB before we go modify the data. So this is where the time will show up in the Driver Monitor: any time I'm spending in a fencing-type routine in the OpenGL API. And again, those are going to be the glMapBufferARB calls, or they're going to be the fence calls. Those are the two calls that will start attributing time to this parameter. So what this is telling me is that the graphics processor is slowing down the CPU some.
In other words, I'm ready to go off and start modifying the data, but the graphics processor hasn't quite gotten done displaying it. So I'm stalling the CPU a little bit, waiting for the graphics processor to consume the data, before I can go and touch it with the CPU. So we'll talk a little bit more later, but keep that in mind.
Now, as a last little experiment here, let's just again open the Profiler and let's pull up the statistics. And we have to attach to this application. And let me clear the stats to get a fresh view. So what's interesting is that now the top call is the map buffer.
And that's also consistent with what we saw in Shark and the Driver Monitor, where we believe that map buffer is a stalling routine. It's stalling, waiting for the graphics processor. So all those pieces of information are consistent. And again, we'll talk about that more in a little bit. So I'm going to clean up here a little.
[Transcript missing]
Slides on the screen, please. There we go. Okay, so let's review a little bit. So we were using vertex arrays, and then we wanted to go off and start using some OpenGL extensions. And what we did is we used a vertex buffer object to try to optimize the data throughput. We saw that we got some performance, and we started using less CPU. So we optimized both CPU usage and performance.
And what it looks like when we start using that extension is that the data gets pulled directly from the application's memory. The graphics processor has a direct path to that memory; there are no copies being made of the data by the CPU. The data is being pulled directly by the graphics processor, and the CPU is not spending time copying data.
Okay, so let's look at the code, summary of the code here. So, as we showed before, we always have to do a bind before we are going to access or operate on a buffer object. We enable the vertex array state, we set up the pointers, the vertex array pointer, but we set it up as an offset, not as a pointer.
We call glBufferData to initialize the object with some data, and then we call glMapBufferARB if we're going to modify the data. So we call map buffer, we get the data pointer back, we fill in the data, and then we unmap. Just like we did in the code.
So again, for static data, you want to tell OpenGL it's static. Now, static means you're not going to be modifying it every frame; if you modify it once in a while, that's okay, but it primarily needs to be static data. So if you're only modifying it every fifth or tenth frame, put it up in video memory, and let the graphics processor use its internal bus instead of transferring it across the graphics device bus every time.
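In buffer-object terms, that static-versus-dynamic choice is the usage hint passed to glBufferDataARB. A sketch, with hypothetical names:

```c
/* Mostly-static data: hint that it can live up in video memory. */
glBufferDataARB(GL_ARRAY_BUFFER_ARB, dataSize, staticVerts,
                GL_STATIC_DRAW_ARB);

/* Data rewritten every frame: hint that the CPU will keep streaming it. */
glBufferDataARB(GL_ARRAY_BUFFER_ARB, dataSize, dynamicVerts,
                GL_DYNAMIC_DRAW_ARB);
```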
And again, display lists: you can take static data and put it into video memory by wrapping it with display list calls. You can put that data in begin/end, and we'll parse through the data, pull it out, and put it up into video memory. So I want to show you one more thing on the Vertex Performance Demo.
So let's see here. What are we going to do? So one thing I want to show you was what happens. We can see that we're using about 80% of the CPU. And we saw that we were consuming or stalling about 15% of the time. And Shark was showing that as well. So let's modify this application one more time. And let's take out the throttling of the 140 frames a second. Let's see what these profiles show us once we do that.
So, what I'm going to do, oops, wrong file here. So what I'm going to do is I'm going to go down to the timer that I have. I'm going to increase this from 140 frames a second. Let's increase it to 180 and let's see what this looks like in the profiles. Because right now I'm kind of artificially limiting the performance. First off, let's see if it can keep up. So it turns out that it doesn't get a whole lot faster.
So we're pretty much hitting the limit of this system here. We can look at the blocking behavior. As we were looking at before, we were spending about 15% of the time blocked, but now we're spending over 20% of the time blocked, the CPU being blocked against the graphics processor. So what I'm doing is I'm driving the system harder, but it didn't get all that much faster, but I'm spending approximately 7% more time blocked against the graphics processor. So I'm driving it harder, but I'm not getting that much faster.
So this is kind of the point here is that you have to make your architecture behave in a way that it's asynchronous with the graphics processor. If you drive the system in such a way that you're running synchronous, you can drive it as hard as you want, and all you're going to do is block harder against the graphics processor. So let's also look at Shark, and let's see if Shark is consistent with that data that the OpenGL Driver Monitor is giving us.
It's telling us about 15%, which looks a little low, but not all that different than what the Driver Monitor is telling us. So, okay, let's switch back slides and we'll get into more about asynchronous behavior here in a minute. So texture uploads, let's talk about that for a while. So just like vertices, texture uploads have the same kind of behavior, where the OpenGL framework will make copies of the data through the various parts of the OpenGL pipeline.
One copy could be in the OpenGL framework, one could be in the driver, and another copy could reside up in video memory. Now, for each of these copies to land somewhere in the pipeline, a copy of the data has to be performed, either by the CPU or the graphics processor physically copying the data and depositing it somewhere.
So everywhere we see an arrow, basically the data is being moved. Everywhere we see a cylinder there, the data is being deposited. So there's memory being consumed. So let's go to another demo that I'm going to show you about texture performance. Switch back to demo one, please. Thank you. OK, so we're going to run a completely different demo here.
So let me just run it as it is. So the first thing you'll see here is you'll see my boss's head popping off. He's in the audience, so this may be my last day. But let's say we have a bad prank we're doing at work and we want it to run really well.
Sorry. What I'm seeing here is that I'm using a lot of CPU time. I'm getting about 46 frames per second. My application's actually trying to get 60 frames a second. So I know that somehow I'm being limited by the performance of my application, and I'd like it to run smoother. So we're going to run through the same basic steps we did before. And what we're going to do is we're going to run Shark. We're going to take a trace.
So we take a trace for about five seconds. And we look and we see the top item: 65% of the time is being spent in memcpy. So we know we're copying a lot of data around, and that's something we would like to avoid. So let's also look at the OpenGL Profiler, and let's attach to this application. Okay, so let's attach to it and open up the stats view. Hit clear to get a nice fresh trace of this.
So the OpenGL Profiler is telling me that 98% of the application's time is being spent in glTexSubImage2D. And Shark is telling me that 65% of the application time is being spent just doing memcpy, probably happening underneath that glTexSubImage2D call. So we know that we're moving a lot of data, and it's happening in that call. So let's go back to the slides and talk about some extensions we can use to try to eliminate copies of the texture data as it goes through the pipeline.
So, first some basics. So, the goal, again, is to eliminate CPU copies and conversions. The way you do this is you use the right formats. The right formats are going to be key because what you want is you want a pixel format that is natively supported by the graphics device. If it's not natively supported by the graphics device, we're going to have to convert it to a format that is natively supported.
So, here's some formats that are common on our platform, supported by all of our chips. There's other formats that are out there. You'll have to experiment to see which ones work best, or you can email us and ask whether a format is optimal for the system or the graphics device you're using.
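As one concrete example (an assumption on my part for illustration; verify against the devices you actually target), the format combination commonly cited as native on this platform is BGRA with the reversed packed 8:8:8:8 type:

```c
/* GL_BGRA + GL_UNSIGNED_INT_8_8_8_8_REV avoids a CPU-side pixel
   conversion on formats the graphics chips support natively.
   width, height, and pixels are hypothetical. */
glTexImage2D(GL_TEXTURE_RECTANGLE_EXT, 0, GL_RGBA,
             width, height, 0,
             GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
```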
So, texture extensions. To optimize texture throughput, some extensions you want to keep in mind: first, Apple client storage. The client storage extension, what it does is retain a pointer to the texture data. It does not make a copy of it. So, what OpenGL does is it hangs onto a pointer to the data, and now the application's responsible for making sure that the data is retained. It stays around. You can't throw it away while OpenGL has a pointer to it.
So, that extension has the benefit that the framework will not make a copy of that data anymore once you use it. Next, Apple texture range. The benefit of texture range is that it will bypass the copy made by the driver. It has two modes, shared and cached. Shared and cached are similar to dynamic and static. Cached is static, meaning cache it up in video memory. Shared means leave it down in system memory, because I'm going to share it with the GPU.
And EXT_texture_rectangle. The rectangle texture can be an important extension because some hardware will require it for doing direct DMA. It has a format more compatible with the data as it resides in system memory. So, the rectangle texture extension is useful to ensure optimal performance. So, let's look at what each one of these does in this diagram. So, before I showed cylinders; now we have the images of my boss.
So, the application gives an image to OpenGL. Now, if I use client storage, I'm bypassing the framework's copy, going directly to the driver. So, I've eliminated one copy of the data. And then, if I'm using the data as cached, it's going to be cached up in video memory.
The data gets copied up to video memory and used from there. So, I've eliminated one copy, down to three. So what does the client storage change look like? It's one line: glPixelStorei with GL_UNPACK_CLIENT_STORAGE_APPLE. Just setting that to true when you bind to your texture object will enable this extension. Just one line of code.
So, texture range and rectangle texture, what do those do? Those bypass the copy that the driver is making. So now I've bypassed the graphics driver's copy of the data. It's pulling directly from the OpenGL framework, directly DMAing it into video memory. So now I've bypassed the driver's copy.
And what does that look like? Also one line of code. You just call glTexParameteri with the rectangle texture target. And in this case, I'm setting GL_TEXTURE_STORAGE_HINT_APPLE, which is the extension parameter, to GL_STORAGE_CACHED_APPLE. So I'm going to cache it up in video memory. Combining these extensions, the behavior I get is that the pixel data stored in the application will be DMA'd directly into video memory. No copies made by OpenGL. Zero CPU copies.
And putting all that code together, what it looks like is: you bind to the texture object, and you call glTexParameteri to set up the storage hint, using the rectangle texture target. Then you call glPixelStorei, which is the client storage. Those are the two lines you add to enable these two extensions. And then you load your texture: you call your glTexImage2D call to load the image.
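Put together, that sequence looks something like this sketch (the texture name and image pointer are hypothetical; note the image memory must stay valid while OpenGL holds the client-storage pointer):

```c
/* Rectangle texture + storage hint + client storage, then the load. */
glBindTexture(GL_TEXTURE_RECTANGLE_EXT, bossTexture);
glTexParameteri(GL_TEXTURE_RECTANGLE_EXT,
                GL_TEXTURE_STORAGE_HINT_APPLE, GL_STORAGE_CACHED_APPLE);
glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
glTexImage2D(GL_TEXTURE_RECTANGLE_EXT, 0, GL_RGBA,
             width, height, 0,
             GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, imageData);
```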
So let's talk a little bit about the blocking behavior I was pointing out before. When you are using some of these extensions, again, you are sharing data with the graphics processor. The application and the graphics processor are pointing to the same data out in system memory. So when you do that, there is synchronization involved. You have two asynchronous devices trying to talk to one region of memory. They need to be synchronized. And you need to make sure when you're synchronizing those that you do it in a way that you're not blocking each other.
So for instance, say I was to use only one object (one vertex buffer object, or in the case of textures, one texture range), and every time I went to touch it with the CPU, I would call a fence to block before I access the data. What would happen is the CPU would block until the graphics processor has completed using it.
The CPU would operate on it. And then the graphics processor would operate on it. And then I would draw with it, and the GPU would pull it up. So this is what this block diagram looks like, right? The CPU operates, I flush, and it gets processed up to the graphics processor.
The graphics processor completes its processing and lets the CPU go. And they're basically operating in serial, right? But if I double buffer my data, I can have a copy that the graphics processor is operating on and a copy that the CPU is operating on, in parallel. So it's important to double buffer your data when you're using these types of extensions, to make sure that the graphics processor and the CPU are operating in parallel, not serial. That's a key aspect of using these optimal extensions, because you're sharing data with the graphics processor.
So, texture range synchronization, I mentioned a little bit. For texture range, the synchronization extension is the Apple Fence extension. You'll need to use that any time you want to go and access the data which the graphics processor is pointing to. So let's look at what that means. So just like with our vertex buffer object, we had to call a map.
The map was the synchronization point for the vertex extension. glFinishObjectAPPLE with a GL_TEXTURE type is the synchronization call that you'll need before accessing texture range data, okay? You call that before you access the texture data, and that will make sure that the CPU does not access the data before the graphics processor has completed accessing it.
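A sketch of that synchronization point (texName, the update routine, and the image pointer are hypothetical):

```c
/* Block until the GPU has finished with this texture's data
   (APPLE_fence); only then is it safe for the CPU to rewrite the
   client-storage pixels the texture points at. */
glFinishObjectAPPLE(GL_TEXTURE, texName);
updatePixels(imageData);   /* hypothetical CPU-side update */
glTexSubImage2D(GL_TEXTURE_RECTANGLE_EXT, 0, 0, 0,
                width, height,
                GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, imageData);
```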
Okay, so let's go back and take some of these extensions now and apply them to this demo and see what performance improvement that does. So the first things to note are that the CPU is at about 100%, the performance of our demo is at about 742 megabytes a second, 42 frames a second. Just note those so that when we change things we can compare those back.
Okay, so let's open up my source code here. Now I'm going to scroll down to where I load my textures, down at the bottom. And like I said in the slides, there are two lines of code. There's line one and line two. I'm just going to add those two lines of code and see what behavior change that gives my application. So the glTexParameteri, I added that line, and the glPixelStorei. OK. So let's run it.
So now you can see I'm achieving my 60 frames a second. You can also see that my CPU burden has gone down substantially. So not only am I achieving frame rate, but I'm freeing up quite a bit of the CPU. I'm consuming less power for portables. I'm freeing up CPU for doing other types of processing. And I'm achieving higher performance. So let's go through and take our samples again, just to take a look at what Shark looks like now.
So, we're spending 37% of our time blocking. That's pretty interesting. It's interesting because this is actually on a timer, and the timer is firing at 60 frames a second. The rest of the time, my application is just blocked. So if I was to modify my application to fire at a much higher rate, the time that would be blocked would go down.
But if you learn to recognize what Shark is telling you, this is actually a really important symbol to recognize, ml_set_interrupts_enabled, simply because it tells me that somewhere I'm blocking, and then it becomes a process of analyzing where that blocking behavior is coming from. So now, in the OpenGL Profiler, let's go ahead and once again attach to this application... Look at my stats again just to see how I'm looking.
So, what I see is that I used to be spending 97% of my time in glTexSubImage2D. That has dropped down to 0.04% of the application time. So I'm hardly spending any time there. And CGLFlushDrawable has become the top item. Not a bad place to be. I've gone down from 97% of the application time being spent in OpenGL down to 30%.
So let's go back to slides. OK, so that's basically all I had to talk about. For more information, there is sample code up on the WWDC 2005 website. And TechPubs has a number of documents up there that have been updated for Tiger for you to review that give good information about OpenGL.
So I encourage you to go up there and look at that website for any of the follow-on information you may be looking for after these sessions. I also encourage you to look at the opengl.org website. It's the website maintained by the OpenGL organization, and it has a number of links and interesting articles, documents, and developer events that are happening.
Events after this session. An important one is we're going to be in a lab, in the Tiger Lab, right after this session. So anybody that has a question about anything we've talked about or about your application, they can come there. Myself and a number of OpenGL engineers will be there to answer your questions.
And tomorrow at 9 o'clock, we're also having a drop-in lab where you can come by and talk to OpenGL engineers. So, who to contact? If you need to contact somebody at Apple, you can contact myself, [email protected], or Travis Brown. Travis is very helpful and a good person to contact. So, we're going to go into questions and answers right now.