Optimizing OpenGL Applications for Maximum Performance - WWDC 2006

Graphics and Media • 1:03:02

Learn the latest techniques for optimizing OpenGL code in your application. This in-depth session covers all aspects of optimization including the best way to submit vertex and texture data, tips for eliminating unnecessary data copies, the ideal pixel formats to use, and advice on threading. You'll see how to use OpenGL Profiler and Driver Monitor to diagnose and solve real-world performance problems. This is a must-attend session for anyone who wants to harness the power of the GPU to increase application frame rate.

Speakers: Chris Niederauer, Dave Springer

Unlisted on Apple Developer site


Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Okay, I think I'm going to start off this talk. So I'm Chris Niederauer. I am a senior 3D graphics software engineer in the OpenGL team at Apple, and I'm going to be talking today about maximizing the performance of your OpenGL applications on the Mac OS X platform. So, by way of introduction, I'm going to be talking about basically moving your application's code, your OpenGL usage, onto the modern API. And as part of this, I'm going to be talking about how to optimize your vertex and your texture throughput. And then I'm also going to go over how to find bottlenecks in the OpenGL pipeline.

So the main point I want to make today is that there's always a fast path, and I'm going to hopefully tell you how to get there. So part of getting there is knowing which tools to use, I mean, the performance tools that we have. And Apple provides a really great suite of some tools that can help you basically pinpoint where the problems may be in your application and maximize your performance on our platform.

So the first tool, OpenGL Profiler, is a tool that lets you basically look at all of the OpenGL state in your application. You can set breakpoints, look at state. You can also get statistics like the amount of time that's being spent in certain functions. It's a very useful tool. The second tool we're going to go over is the OpenGL Driver Monitor. And this gives you more specifics about the GPU such as the amount of time the CPU is waiting for the GPU. Also, it tells you how much free VRAM is available, things like that.

And then finally, there's Shark. And I'm not going to be going over Shark today. But there's a lot of great documentation on that on the web. And basically, that tool lets you pinpoint hot spots in your source code of where you can hopefully improve the performance of your application.

So some key concepts about OpenGL. OpenGL is a state machine. And what this means is that you've got a state that you set. You do multiple draw calls. And throughout those draw calls, your state will remain constant through those calls. And you have to change the state at some point. And when you're changing the state, there's actually an overhead to doing that. And so basically, I wanted to say today, try and minimize the amount of times that you're changing the state.

Because it is overhead, it has to be done, yes, but try and minimize it. And so you do this by grouping similar state draw calls together. And I also want to mention that if you're going to be changing a lot of state, there are also objects that can encapsulate state. So you should be using objects. For example, if you're changing a lot of vertex pointers, vertex color and texture pointers and whatnot, you can use the vertex array objects extension. Framebuffer objects are another object that encapsulates state.
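As a rough illustration of grouping similar-state draw calls, here is a minimal sketch in C. The `DrawCall` record and its fields are hypothetical stand-ins for whatever your renderer tracks, not part of OpenGL. It sorts the calls so that those sharing a program and texture are adjacent, and counts how many binds a renderer would issue before and after sorting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical draw-call record: the IDs stand in for real GL object names. */
typedef struct {
    unsigned program;   /* shader or fixed-function setup */
    unsigned texture;   /* bound texture */
    unsigned mesh;      /* geometry to draw */
} DrawCall;

/* Order by program first, then texture, so calls sharing state are adjacent. */
static int compare_by_state(const void *a, const void *b)
{
    const DrawCall *x = (const DrawCall *)a;
    const DrawCall *y = (const DrawCall *)b;
    if (x->program != y->program) return x->program < y->program ? -1 : 1;
    if (x->texture != y->texture) return x->texture < y->texture ? -1 : 1;
    return 0;
}

void sort_by_state(DrawCall *calls, size_t n)
{
    assert(calls != NULL || n == 0);
    qsort(calls, n, sizeof calls[0], compare_by_state);
}

/* Count the program and texture binds issued by a renderer that rebinds
   only when a value differs from the previous draw call. */
size_t count_state_changes(const DrawCall *calls, size_t n)
{
    size_t changes = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || calls[i].program != calls[i - 1].program) changes++;
        if (i == 0 || calls[i].texture != calls[i - 1].texture) changes++;
    }
    return changes;
}
```

With four calls that alternate between two program and texture pairs, the unsorted order needs eight binds and the sorted order needs four. The sort key here is an assumption; in a real application you would order by whichever state is most expensive to change.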

One of the most important things about this talk, however, is the fact that OpenGL acts in a client-server model. And basically what this means is that you're going to be having to balance the CPU and the GPU usage of your applications. And one way of doing this is, since OpenGL is a pipeline, round-tripping can be expensive. So doing like gets of both data as well as state can cause stalls in this pipeline. So what we want to do is we want to maximize asynchronicity.

So here we see the most common way that we might think about OpenGL as a client-server model is between the CPU and the GPU. And basically, we're trying to always send data one way. In the OpenGL pipeline, it works most efficiently when you're trying to send data one way.

But OpenGL also is a client-server model between your application and the OpenGL API itself. Doing gets and so forth will cause potential stalls in this pipeline, especially important with the new multi-threaded OpenGL engine on Mac OS X Leopard and available on the Mac Pros. So what is a multi-threaded OpenGL engine?

It's basically a switch you can set that will automatically start running the OpenGL engine on a separate thread than your main thread. And what this allows you to do is if you're using asynchronous calls, maximizing the use of your pipeline, you can basically get very large performance increases simply by turning on this multi-threaded OpenGL engine.

So I was saying, to take advantage of the OpenGL multi-threaded engine, the most important part is basically maintaining asynchronous behavior, trying to keep the client-server model in a straightforward pipeline. So obviously we've been saying this forever, we always have been telling developers, never call glFinish. There's no reason whatsoever that your application should be calling glFinish. And then as part of that, glFlush as well is usually unnecessary.

So there's only select cases where you want to be using glFlush. And if you don't know that you're supposed to be using it, then you probably are not supposed to be using it at that point. Basically, whenever you call glFlush, it's going to cause a stall in the pipeline. glGet and other calls like that where you're trying to get back state will also cause synchronization between OpenGL and your application.

And so what you want to try and do is shadow as much state as you can when you can. And if you do the shadowing, you don't necessarily have to do these gets back. So only use glGet in the cases where you can't shadow state, such as when you are looking up the extension string for the first time on the video cards that you may be using.
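A minimal sketch of state shadowing in C follows. Everything here is hypothetical scaffolding: `SetCapFn` stands in for a wrapper around glEnable/glDisable, the `CAP_*` indices stand in for real capability enums, and `counting_set_cap` is a stub that only counts calls so the example runs without an OpenGL context. The idea is that the application remembers what it last set, skips redundant sets, and answers queries from its own copy instead of calling back into OpenGL.

```c
#include <assert.h>
#include <stdbool.h>

enum { CAP_BLEND, CAP_DEPTH_TEST, CAP_COUNT };   /* hypothetical capability indices */

typedef void (*SetCapFn)(int cap, bool on);      /* stand-in for glEnable/glDisable */

typedef struct {
    bool     known[CAP_COUNT];   /* false until the app has set the value once */
    bool     value[CAP_COUNT];   /* last value the app sent to the driver */
    SetCapFn set_cap;
} StateShadow;

void shadow_init(StateShadow *s, SetCapFn set_cap)
{
    for (int i = 0; i < CAP_COUNT; i++) {
        s->known[i] = false;
        s->value[i] = false;
    }
    s->set_cap = set_cap;
}

/* Forward the change to the driver only when it differs from the shadow. */
void shadow_set(StateShadow *s, int cap, bool on)
{
    assert(cap >= 0 && cap < CAP_COUNT);
    if (s->known[cap] && s->value[cap] == on) return;   /* redundant: skip */
    s->set_cap(cap, on);
    s->known[cap] = true;
    s->value[cap] = on;
}

/* Answer from the shadow copy: no round trip into OpenGL. */
bool shadow_get(const StateShadow *s, int cap)
{
    assert(cap >= 0 && cap < CAP_COUNT && s->known[cap]);
    return s->value[cap];
}

/* Stub driver call for demonstration: counts how often it is invoked. */
int driver_calls = 0;
void counting_set_cap(int cap, bool on)
{
    (void)cap;
    (void)on;
    driver_calls++;
}
```

Setting the same capability to the same value twice reaches the stub only once, and `shadow_get` never reaches it at all.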

Also, to maximize asynchronous behavior, there are points at which you do need to synchronize using fences or occlusion queries or things like that. So when you are going to be using something like fences or occlusion query testing, you want to put as much time as you can between the set and the test. The more work you do between setting the fence and testing it, the less likely you are to stall the pipeline.

Let's see. Then, again, I'm going to mention that state setting in general takes overhead, so try to avoid that. Again, group similar-state draw calls together. And if you do this, you're basically optimizing for the multi-threaded engine, and hopefully you'll get a big benefit out of that. So later I'm going to go over how to do dynamic vertex and texture data, and then I'm also going to go over framebuffer readback, how to do that in an efficient and asynchronous way.

So I was talking about call overhead. Every time you call into OpenGL, there's a slight amount of overhead that has to occur simply because you're calling a function at that point. But there's also a little bit more that's involved in the process, where it has to look up the current context. So one way you can work around this is you can use macros. So by simply including CGLMacro.h, and gluMacro.h if you're using GLU code, you can avoid this overhead.

And so I recommend that you use these. Also, try and use the OpenGL calls that move a lot of data with the fewest number of calls. So for instance, you want to use glDrawArrays or glDrawRangeElements with VBOs as opposed to doing immediate mode calls where you have glBegin, glVertex, glVertex, glVertex, glEnd.

Obviously, it's a lot easier to just call one glDrawArrays call. And secondly, as an example, the EXT_gpu_program_parameters extension, which is rather new, allows you to upload multiple parameters for a fragment program or vertex program in a single call. And the equivalent API is available already for GLSL. And then, again, I can't stress this enough, try and group your similar state objects together.

So, vertex buffer objects. I wanted to go over vertex performance. And basically, for getting the most out of your vertex performance, you want to be using vertex buffer objects instead of compiled vertex arrays. And what vertex buffer objects are is basically an efficient method for uploading the vertex data.

And so I have an example here of using vertex buffer objects. Hopefully, a lot of you are already using this, and so probably very familiar with this code. But we have in this code sample, we just have a, we bind to this vertex buffer object. We say how big it is with the buffer data call, and then we map it. And at this map, we can then put in some data in there, unmap it, and then draw with that.

So APPLE_vertex_array_range is what we used to be saying that you should be using, and that is a good way to push vertex data up. However, it does cause synchronization points in the multi-threaded OpenGL engine, and so we're recommending now that you switch to vertex buffer objects.

And so one last thing about vertex buffer objects. I wanted to make sure that you guys know that you want to provide the most restrictive hints that you can for these buffer objects when you're allocating and mapping them. So if you know you're going to be only reading a vertex buffer object, as opposed to reading and writing it, then do map it as read-only. And similarly, if you know that your vertex data is only going one way, you want to be using a draw usage such as GL_STATIC_DRAW rather than a read or copy usage.

So you can do GL_STATIC_COPY and so forth with those parameters.

[Transcript missing]

So I wanted to turn it over to Dave Springer here, who's going to go and show some examples of using vertex buffer objects.

All right. Thanks, Chris. I'm going to show a quick demo of using VBOs. And... What happened? Okay.

I, uh, no rabbits were harmed in this. I found these, and I saw these rabbits in the hay, and I won't tell you what I was doing, but these... This is not a shader. So what you're looking at here is every frame I'm updating all the colors at all the vertices. So this is immediate mode and you've got a glBegin, glColor, glVertex, glNormal, texture coordinate. So that's what's happening here. No shaders involved. Immediate mode, I'm getting about 26 frames per second. I don't know if you guys can see that number up there.

Let me flip on CGL macros. And you'll see that I do get a slight increase because I'm not making those function calls. We're going through about, we figured, a million or so GL calls to render these four rabbits because there's the number of vertices and so on. So that's with CGL macros on.

Now, with VBOs, I jumped to about 110 frames per second. So huge, huge increase, like 5x. So that's the advantage of using the VBOs. Again, I've taken the static parts of the data, pushed it all onto the card using VBOs, and then I've taken the dynamic part of the data, the color, which I'm refreshing every frame, and that's another separate VBO. And so that's how this is working. And you can see the kind of performance improvement we're getting.

Also, there's a CPU monitor down here, and you can see the difference in what's going on in my balance between CPU and GPU. When I'm in immediate mode, you can see that the CPU is, one of them anyway, is pegged. So it's working really, really hard all the time, forming these commands, setting the buffers down, so on and so forth.

When I flip over to VBOs, my CPU usage drops almost to nothing. So it's showing me that I really am offloading everything onto the GPU to render these rabbits. So that means my CPU is freed up to do a lot of other things. Like I could do a physics simulator and make the rabbits hop. I don't have that in this demo, but I could. Okay, that's it. Turn it back over to Chris.

Thanks, Dave. So now that we've gone over vertex performance and the modern API being vertex buffer objects, I wanted to go over textures. So, optimizing texture throughput. There are four parts that I'm going to go over. You want to make sure that you're using fast texture type and format pairs.

You want to minimize the number of texture copies that you have when you can, for instance by using the APPLE_texture_range and APPLE_client_storage extensions. And I'm going to go a little bit over pixel buffer objects, an API very similar to vertex buffer objects but for textures instead. And then after that, I'm going to go over speeding up your framebuffer readback by doing it asynchronously.

So with the format and type pairs for textures, it's very important that you pick something that the hardware is capable of handling, to try and get the most performance out of it. So I have here three examples of texture type and format pairs that are supported by the hardware. And so they are basically GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV, GL_BGRA with GL_UNSIGNED_SHORT_1_5_5_5_REV.

And for YCbCr textures, you need to use GL_UNSIGNED_SHORT_8_8_APPLE or GL_UNSIGNED_SHORT_8_8_REV_APPLE, depending on your usage of it. And I have a picture here of the Shark application. And you can see here we basically did a time profile of an application that was not using one of these types.

And it just happened to be using a type that the GPU was unable to deal with. And so what we see in this case is glgProcessPixels occurring at this point. So we can get rid of this completely by simply using one of these texture type and format pairs.

So, the second part of optimizing texture performance is trying to minimize the number of texture copies that you are using. So, using the normal OpenGL engine, glTexImage, and so forth, in the Mac OS X implementation of OpenGL, we actually keep around three copies of the texture in addition to potentially your fourth copy of the texture held by the application.

And so, by using some Apple-specific extensions, we can get rid of two of these copies very easily. And basically, you have it so that the application and the VRAM are the only copies of it. So, here, what we've done here, and this is basically showing what happens. There's some sample code on the developer webpage called Texture Range that has an example of doing this.

I'm going to also show a little bit more about this sample code later. And basically, what it does is it's using these three extensions, APPLE_client_storage, APPLE_texture_range, and EXT_texture_rectangle. And by using these three together, we're able to get rid of these extra textures. So, how does this work?

Well, the APPLE_client_storage extension basically allows the application to tell OpenGL that it is going to keep around a copy of the textures that OpenGL can look at at any point. So, this means that after you call your glTexImage, you have to keep your copy of the texture in your application's memory around and try and not modify it while the GPU is using it or while OpenGL is using it.

Let's see. So basically, that is different from the normal behavior in that normally when you call glTexImage, you can free your texture immediately afterwards because OpenGL is making a copy for you. So that's what client storage does. And then, in addition to that, we have APPLE_texture_range and EXT_texture_rectangle. And by using those two together, we then get rid of the OpenGL driver copy.

And here, in particular, we have APPLE_texture_range specified with a storage cached hint. And by using this cached hint, we're telling it that we want a copy in VRAM. And we can actually get rid of this copy as well and have it so that the video card is texturing directly from our system memory copy if we were to replace that GL_STORAGE_CACHED_APPLE hint with a GL_STORAGE_SHARED_APPLE hint.

So now that we're texturing from client storage and using all those extensions, we are going to have the issue of now we need to maintain that we're not modifying the texture data while the OpenGL is using it. So this means that we're going to be potentially stalling on our data synchronization. So the way that we can get around this is by double buffering our data that's going through OpenGL.

And we will use fences and object testing as needed in order to make sure that we're not modifying this texture data while it's in use. So here, again, this is showing stuff from the texture range sample code. And here the CPU is generating a texture and then drawing it on the GPU, and then modifying it for a second frame and then drawing it.

And what we see here is that the CPU is unable to modify the copy of the texture at this point, circled in red, because the GPU is still using it at that point. And so we basically get this stall in the pipeline right here as we call glFinishObjectAPPLE, waiting on that texture object. We wait until the GPU is done and we have this huge stall here. And also we get a stall on the GPU as a result of that stall on the CPU.

So how do we get rid of this synchronization or at least make sure that it's as efficient as possible? We simply double buffer it. So here we see that we have the first texture modifying it on the CPU, drawing with it. Second texture is modified with it and drawn.

And at that point, we call glFinishObjectAPPLE on the first texture. By deferring that testing of the fence until a later point, we've minimized the gap in the pipeline. And by double buffering, we've filled in the gap that was otherwise there.
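The effect of double buffering on the timeline can be sketched with a small model in C. This is a simplified simulation, not OpenGL code: it assumes each frame costs a fixed `cpu_cost` to fill a texture and a fixed `gpu_cost` to draw with it, that the GPU draws frames in order, and that the CPU may overwrite a buffer only once the GPU has finished the frame that last used it. The function name and the `MAX_FRAMES` limit are choices made for this example.

```c
#include <assert.h>

#define MAX_FRAMES 64

static double max2(double a, double b) { return a > b ? a : b; }

/* Returns the time at which the GPU finishes the last of `frames` frames.
   buffers: number of texture copies the app rotates through
            (1 = single buffered, 2 = double buffered). */
double pipeline_finish_time(int frames, int buffers,
                            double cpu_cost, double gpu_cost)
{
    double gpu_done[MAX_FRAMES];
    double cpu_done = 0.0;

    assert(frames > 0 && frames <= MAX_FRAMES && buffers > 0);

    for (int i = 0; i < frames; i++) {
        /* The buffer for frame i was last used by frame i - buffers. */
        double buffer_free = (i >= buffers) ? gpu_done[i - buffers] : 0.0;

        /* CPU waits for its previous work and for the buffer, then fills it. */
        cpu_done = max2(cpu_done, buffer_free) + cpu_cost;

        /* GPU waits for its previous frame and for the CPU, then draws. */
        double prev_gpu = (i > 0) ? gpu_done[i - 1] : 0.0;
        gpu_done[i] = max2(prev_gpu, cpu_done) + gpu_cost;
    }
    return gpu_done[frames - 1];
}
```

With four frames, a CPU cost of 2 and a GPU cost of 3 (arbitrary units), one buffer finishes at time 20 because the two processors take turns, while two buffers finish at time 14 because the CPU fills the second texture while the GPU draws with the first.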

So now I'm going to defer to Dave again. And he's going to show an example of texturing performance using texture range and pixel buffer objects.

Okay. What I have here is a demo that really shows different texturing procedures. Texturing performance is really application specific. So, your texture performance is going to vary depending on how you're actually using them. There's no real one boilerplate method for saying if you do this, your textures will just always go faster. It really depends on your usage. Although, I can tell you that nearly always immediate mode will be the slowest.

So, here we are in immediate mode. I'm uploading a new texture. There's five frames of--this is Danny's stove. And we're uploading five frames--five different textures, one, two, three, four, five in a tight loop like that. So, this is immediate mode. We're getting about 43 frames per second.

So, Chris talked about the texture range cached attributes. With that turned on, my frames per second jumps to about 300, and I'm also pushing about 1.3 gigabytes across the bus. Now, what's happening here is that I'm telling the GPU, here's your texture bits. Go ahead and render it. I'm not going to touch them. And it doesn't make those extra copies that Chris was talking about earlier. So this is pushing the texture straight into VRAM.

Now, if I turn cached to shared, you'll notice my frames per second now jumps up again, and I'm almost at the theoretical limit of the throughput in megabytes per second of the bus because what's happening here is I'm telling the GPU there's going to be one copy of this. The GPU can access it directly, and so can the CPU. Now, what's interesting about the shared attribute is that if you're making a very small number of accesses to the texture across the bus, you're going to get huge performance wins.

So I've jumped up to 1,200 frames per second. I'm at 5 gigabytes per second across the bus. But this is because I made the texture really small. So this is a 1K by 1K texture originally. Now I'm grabbing about every, oh, I don't know, 20th texel or something. So I'm not really making a lot of accesses across the bus. So that's where this method really shines.
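As a rough check on the bandwidth figures quoted in this demo, the nominal upload rate is just texture size times frame rate. The sketch below assumes 4 bytes per texel and that the whole 1K by 1K texture is counted each frame; the demo does not state either, so treat both as assumptions. The function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second needed to send a width x height texture once per frame. */
uint64_t upload_bytes_per_second(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_texel,
                                 uint32_t frames_per_second)
{
    assert(bytes_per_texel > 0);
    /* Widen before multiplying so the product cannot overflow 32 bits. */
    return (uint64_t)width * height * bytes_per_texel * frames_per_second;
}
```

Under those assumptions, 1024 x 1024 x 4 bytes at 300 frames per second is 1,258,291,200 bytes per second, about 1.26 GB/s, and at 1,200 frames per second it is 5,033,164,800 bytes per second, about 5.03 GB/s. Both are close to the numbers read off the screen in the demo.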

The final method I want to show you is static PBOs. I happen to really like PBOs. I didn't really know how to do texture mapping very well before I wrote this demo. And let me tell you something, the PBOs are a snap. So if you guys haven't played with PBOs yet, I really recommend that you get in there and do that because they're easy to set up and use. You get great performance wins. And understanding where your texture data is relative to the CPU and the GPU and across the bus is easy.

So here we are, static PBO, the contract I'm making between the CPU and the GPU is here's your data, GPU, I'm not going to modify it. So I upload it all to the GPU and the GPU just spins. And again, I'm getting great performance, 500 frames. So over here in Driver Monitor, I just want to show this tool really fast because this is a way for you to look at what's happening with your texture data. So if you're noticing that you're getting texture performance problems, Driver Monitor is a great tool to help pinpoint where those are.

If I flip down to immediate mode, you can see there's a green line there, right here. And this is the amount of texture data that I'm slamming onto the GPU. So this, you can see it jump. It jumped right here when I went to immediate mode. Also, the CPU right now is not doing its... idling, waiting for the GPU.

So if I go now to static PBOs, here I've got this drop in texture page-on data. So I'm not slamming a lot of texture data across the bus, which is better for this particular case. And again, guys, this is really application specific. So when you're going to do your texture optimizations, you need to be prepared to really look into it. There is no just boilerplate solution to making your textures go faster. But you always can make your textures go faster, depending on what your usage is. That's it. Turn it back over to Chris.

Thank you. Thanks. So now that we've just gone over how to upload your textures pretty fast, we still haven't gone over downloading, so reading back the pixel data from the video card. So those of you familiar with the old method of doing this, using async read pixels, will be happy to hear there's a much easier way to do it just by simply using pixel buffer objects.

And so using a vanilla glReadPixels is not the right thing to do. This will cause a synchronization. It will cause a stall, basically, in that it stops everything. You can see in this line of source here that we're calling glReadPixels. And as soon as we enter this line of source code, we don't get back from OpenGL until the pixel data has been read back.

So like I was talking about earlier, you want to try and maximize the amount of time between your setting and your testing of your fences, and you want to do the same thing with your readback. And you can do this by using pixel buffer objects.

So here with pixel buffer objects, what I've done is I've bound to a pixel buffer object. I stated the size of the pixel buffer object and told it that I wanted GL_STATIC_READ to say that I was going to read this data back. And then in the glReadPixels call, I pass it an offset into that pixel buffer object instead of a pointer, like we were doing before. And we notice we don't synchronize at this point with glReadPixels.

But instead, we get time to do other work on the CPU, such as physics or other types of things that are unrelated to this particular readback. And then when we're done with all our work and we need this data back, we ask OpenGL to map that pixel buffer object as read-only. And then we can get the data out and unmap it. And that's it. It's pretty straightforward. Asynchronous pixel readback was a lot more complex before.

Yeah, so now that I've gone over texture and vertex throughput, I want to talk about finding the bottleneck in your application's performance. So as I was saying earlier, OpenGL is a pipeline. And it's made up of certain parts. And if you stall any one of those parts of the pipeline, you will cause your application to slow down as a result of that. So I'm going to go over, trying to identify if that bottleneck may be on the CPU, if it's due to the bus bandwidth, also if it's due to the GPU vertex processing, and finally, if it's due to the fill rate of the GPU.

So I'm going to go over how to identify all these things. And basically, there are two sides to doing this. You can try and change the workload of a stage to lower the amount of work that it's doing. If you see an increase in speed, then you know that's likely the bottleneck. Conversely, if lowering the amount of work being done on a stage does not change the frame rate, you can rule out that particular stage.
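That elimination procedure can be written down as a small helper in C. This is only a sketch of the reasoning, not a profiling tool: the `Stage` names, the function, and the idea of a fixed improvement threshold are assumptions for illustration. You would fill in the frame rates yourself by measuring the application after cutting the workload of one stage at a time.

```c
#include <assert.h>

typedef enum {
    STAGE_CPU,      /* application and driver work on the CPU */
    STAGE_BUS,      /* uploads across the bus */
    STAGE_VERTEX,   /* GPU vertex processing */
    STAGE_FILL,     /* GPU fill rate */
    STAGE_COUNT
} Stage;

/* fps_reduced[s] is the frame rate measured after lowering the workload of
   stage s only. Returns the stage whose reduction raised the frame rate the
   most, or STAGE_COUNT if no stage improved by more than `threshold`
   (a fraction of the baseline, for example 0.10 for 10 percent). */
Stage likely_bottleneck(double baseline_fps,
                        const double fps_reduced[STAGE_COUNT],
                        double threshold)
{
    Stage best = STAGE_COUNT;
    double best_gain = threshold;

    assert(baseline_fps > 0.0);

    for (int s = 0; s < STAGE_COUNT; s++) {
        double gain = (fps_reduced[s] - baseline_fps) / baseline_fps;
        if (gain > best_gain) {
            best_gain = gain;
            best = (Stage)s;
        }
    }
    return best;
}
```

For example, with a baseline of 60 frames per second, if shrinking the textures gives 95 while the other experiments stay near 60, the helper reports the bus as the likely bottleneck. If nothing moves by more than the threshold, it reports no stage, which suggests the experiments did not reduce the right work.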

So first, CPU bound. This tends to be pretty common for a lot of the applications that are being written these days. So here, the way to identify if you've got a CPU bound application is pretty straightforward. You can simply look at the CPU monitor, for instance, in Top or in Activity Monitor. And you can basically see if you're using 100% of the CPU at that point.

And another way you could do it as well is by if you take out some sort of processing step that's unnecessary for the OpenGL particulars of the application, such as taking out, say, the sound. If you see that your frame rate goes up as a result of that, then you're likely CPU bound. So what's the trick to fixing this?

Well, other than using Shark and trying to optimize as much of your application as you can, and also following the steps I've been going over to optimize your throughput with OpenGL, what you want to try and do now is probably parallelize your work as much as you can. So if you have things like the sound engine or physics engine, if you can offload those onto other threads, then do that. Alternatively, you can try and offload OpenGL onto another thread.

But do keep in mind that when you offload OpenGL onto another thread, OpenGL itself is not a thread-safe API. So what this means, basically, is that you can only access a context from one thread at a time. Doing otherwise would be a violation of this and is not acceptable in the OpenGL implementation.

So the way that you can work around this if you do need to use a context from multiple threads, and I don't recommend this because it's very difficult, is you can use the CGLLockContext and CGLUnlockContext API. But here we have an example of an application. We see two CPUs here, and in green we have the time being spent in application code, and in blue we have OpenGL being squished on that first CPU, and we see that we've got an entire CPU there that's not being utilized.

So what we simply do is have our application try and offload some of that work onto another thread. Here we have the main CPU doing some work that doesn't have anything to do with OpenGL, and then it's synchronizing, you see the green arrow pointing down, with another thread that's doing the OpenGL work, and then we call OpenGL from that second thread, and OpenGL gets a lot more time and the application gets a lot more time as well. We're just taking much more advantage of the dual processors that we've been shipping in our systems for about six years.

So, alternatively, new to Leopard, there's the multi-threaded engine. So, the multi-threaded engine, you know, as long as you can apply those techniques I've been talking about, try and maximize your asynchronicity of the application and the client-server model, can hopefully get you some big wins. But do keep in mind that this will add more work.

You should only be doing this on dual processor systems. It would probably be slower on a single processor system because of the fact that it's doing more work. But the reason this is so good is that it's doing the work for the synchronization for you, and so you don't have to do any of this stuff. All you have to do is call a CGL set parameter with a KCGL MP Engine. And that turns it on. And as I was saying, this only works... Well, this works really well with some well-behaved applications.

So here we saw the application was using only one CPU before. And simply by flipping a switch, we see now that OpenGL adds a little bit of work on the first thread. The work that OpenGL was doing on the first CPU is now being done on the second CPU. And we have a small little portion on the first CPU that the application is interfacing with, which then synchronizes with that second thread automatically through OpenGL. And all the application did was make this one CGL set parameter call and turn on the multi-threaded engine.

So I wanted to hand it over to Dave. He's going to show you how to use the OpenGL Profiler to see how your application performs with the multi-threaded engine.

All right, thanks, Chris. New for Leopard and actually in your seed, the DVD that you guys got at the show here, the version of Profiler has a new feature which allows you to turn multi-threading on and off right from Profiler. So you don't have to do anything to your app. In fact, you can take somebody else's app and see what happens. So I'm going to do that right now.

Okay, so this is Doom 3. I'm going to run it first with multithreading off. So this is just out-of-the-box Doom 3. And what I'm going to do is run the time demo because I really suck at Doom 3. You don't want to watch me play it up here. That would be brutal.

What's happening here in this demo is Doom renders about 2,000, 2,100 frames and then times it. It has its own frame counter. Profiler also has a frame counter which is counting frames too. But in this case, we're just going to use Doom's built-in frame counter to see what kind of frame rate it thinks it's getting. And, uh... This again is on the single-threaded engine.

OK, so 62 frames per second. Is that legible to you guys up there? While this was running, you noticed the CPU monitor as well was pegged. So now what I'm going to do is flip over to Profiler. And to use this new feature, what you have to do is-- as with many things in Profiler, you have to be stopped at a breakpoint in order to change any kind of state in the app, which is only fair to the app.

I mean, really, you don't want Profiler in there mucking around with your app while you're live changing states. So it's better to stop it first, then change the state. So I'm going to stop the app here at CGLFlushDrawable, go down here to the multi-threaded control, say Force On. Then I remove the breakpoint, say Continue, and that's it. Doom is now multi-threaded.

So let's see what happens. I'm going to run that time demo again. I think you can already see that it looks like it's going faster. The kind of speed-ups that we have been seeing in apps where all we did was do this kind of a trick with Profiler, just go turn the multi-threading on behind the scenes on it, we're getting anywhere from 20 to 40%. And that's just a freebie. So, you know, why not? Chris has talked about some different times when you don't want multi-threading, but here we're getting, boy, that's hard to read, 88.5 frames per second, up from about 60. So about a 40% increase just by flipping a switch. So that's it. Thank you.

Thanks, Dave. So as you saw, Doom 3, you know, we just turned the switch on. We didn't work with the developer at all, and, you know, we're already seeing a 20 to 40% increase. If they were to follow some of the synchronization, asynchronicity stuff that I've been talking about, potentially we can get even more performance out of it.

At least that's the hope. I don't know, but we did the same thing with World of Warcraft, and we had a 90% increase after working with the developer. So it just shows that you can get some big wins if you are CPU-bound like Doom 3 and World of Warcraft are just by simply turning on the switch.

So next part of the pipeline I want to talk about is when you get bound by the bus bandwidth. So the way that you can identify if your bus bandwidth is the limitation of your pipeline, you can basically try and use smaller data sizes, such as if you have large textures that you're trying to send up to the GPU, you can try sending up smaller textures that are simply like one-by-one textures as opposed to 1,000 by 1,000 textures and see if that increases your frame rate. If that does increase your frame rate, then you know you're bus bandwidth bound. So ways to address this, use smaller data sets in general, smaller textures, smaller vertex data, that type of stuff. Use shaders to generate your texture and your vertex data when you can.

Reduce your texture and vertex uploads with data caching in VRAM. So you can use the pixel buffer object and vertex buffer object extensions to basically say that you want to have a static vertex buffer object and so forth. Or with the texture range, you tell it that you want it to be cached and that will try and keep it on the GPU as opposed to having streaming or dynamic texture or vertex data.

That would obviously be going over the bus every single time. And then conversely, I wanted to mention if you do see VRAM paging, say running out of video memory, the way that you want to address this is by trying to reduce the amount of data that's being cached in VRAM.

And you do this by using the stream or the shared texture types instead of the static and the cached ones. I overheard this in the hallways. The graphics bus is not a big truck. It's a series of tubes. Keep that in mind. You know, you can't just take everything at once. It is limited.
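To put rough numbers on the one-by-one versus 1,000-by-1,000 texture experiment described above, here is a minimal sketch of the arithmetic. The function name `upload_bytes_per_second` is hypothetical, and the 4 bytes per pixel and 60 frames per second in the usage note are assumptions chosen for illustration, not figures from the talk.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that cross the graphics bus per second if a width x height
   texture with bytes_per_pixel bytes per texel is re-uploaded on every
   frame. Hypothetical helper for illustration only. */
uint64_t upload_bytes_per_second(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_pixel, uint32_t fps)
{
    assert(bytes_per_pixel > 0);
    return (uint64_t)width * height * bytes_per_pixel * fps;
}
```

Under those assumptions, a 1,000 by 1,000 texture uploaded every frame works out to 240,000,000 bytes per second, while the one-by-one stand-in is 240 bytes per second. If the frame rate jumps when you swap in the tiny texture, the bus is a likely suspect.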

So now that we're past the bus bandwidth, let's take a look at the GPU vertex transformation. And so how do you address this, or how do you identify this, excuse me? You can try using some trivial vertex shaders and see if that increases your speed. Alternatively, you can try and submit less geometry. But do remember that if you're submitting less geometry, you're potentially affecting other stages of the pipeline such as your fill rate. If you're only drawing part of the cube, you're not using as much fill rate at that point either.

So that's a little bit harder to figure out, but that's a couple ways to identify it. And ways to address it, you can reduce the number of vertices. And so you could, for instance, use normal and bump mapping. Or if you were at Nick's talk earlier with GLSL, you saw how ATI was using parallax occlusion bump mapping or ray tracing or something. I don't know.

And, you know, with a single quad, they were able to have these tremendously high-resolution-looking... that looked like they were millions of vertices, but it was really only a few. Another technique, you know, very common to Quake, Quake 3, those types of games, they use multipass rendering so that they only submit the vertex data once.

And they use multiple textures and do the multiple passes with only one transformation of the vertices. You can also try and reduce the processing done by your vertex shader. And finally, by caching your vertices in VRAM, you get faster access to those vertices. So this will have a little bit less stall on that vertex processing.

And then finally, I want to go over the fill rate, basically the drawing of the pixels themselves. And so one way to identify this, the main way, the easiest way to identify this that I've found is simply by trying to reduce the frame size. And so if you reduce the frame size and you see the frame rate go up, you know that you're fill rate bound. So ways that you can address this, first you want to try and reduce the depth complexity, both from an algorithmic standpoint as well as, well, for instance, not using glClear unnecessarily.

For instance, if you are drawing over the entire scene, there's no reason to call glClear with the color buffer every frame since you're already overwriting it every frame anyways. Another way is to cull the back-facing polygons. You want to remove pixels as early as possible from the pipeline.

So you can use glAlphaFunc and the depth test. Using these sorts of tests, you can trivially reject fragments before any fragment processing or so forth actually occurs on those fragments. Alpha testing is really useful, by the way, if you have a texture that has a lot -- say you have a texture of a star where you have only stuff in the center of the texture, but then the outside is zero alpha and not actually being drawn. By using alpha test, you can trivially reject the outsides of those texels, and that speeds up the processing tremendously, the fill rate at least.
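As a rough way to see how much work the alpha test can save on a texture like the star described above, the following sketch counts on the CPU how many texels would survive an "alpha greater than reference" comparison (the GL_GREATER comparison). The function name `texels_passing_alpha_test` and the RGBA byte layout are assumptions for illustration; the real test runs per fragment on the GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the texels that would pass an alpha test of the form
   "alpha > ref". rgba holds texel_count texels, 4 bytes each, with
   alpha as the last byte. Hypothetical CPU-side illustration only. */
size_t texels_passing_alpha_test(const unsigned char *rgba,
                                 size_t texel_count, unsigned char ref)
{
    assert(rgba != NULL || texel_count == 0);
    size_t passed = 0;
    for (size_t i = 0; i < texel_count; ++i) {
        if (rgba[i * 4 + 3] > ref) {
            ++passed;
        }
    }
    return passed;
}
```

If only a small fraction of the texels pass, most of the quad's fragments can be thrown away before any further fragment work is done on them.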

Let's see. You can reduce fragment processing, shader processing, obviously, if you have a complex shader. Try and do something a little bit more simple. Again, reducing the frame size. And this is especially important if you have off-screen buffers. Don't forget about those. If you have an off-screen buffer that's being rendered at a huge resolution, and then you're only seeing it at, you know, a far away distance, say you were to be using framebuffer objects and rendering a tree, you wouldn't want to be rendering that at 1024 by 1024 if you're only going to have those billboards in the background at, you know, 20 by 20. Try and use a smaller texture for that. And finally, you want to use simple texture filtering. As simple as you can get. Like, for instance, don't use trilinear and anisotropic filtering.

It's not free. If it was free, it would probably be a lot easier to pronounce. So I've just gone over all the bottlenecks that are very common to run into. And there's a few more things I want to go over before I end my talk. So one of the points I want to go over is software fallback. Software fallback, it's not something you want to be hitting. It is a sync point for one thing. And so what software fallback is is when you're using, say, a shader that is too complex for a video card to support.

And you may actually fall back to the software renderer at that point. And then the software is doing that rasterization for you because the GPU was unable to do that, fulfill that program at that point. So you want to try and avoid this by basically testing it on all your supported graphics cards that you have.

And the way that you get around this is you simply reduce your shader complexity if you are falling back to software. So for instance, the noise function is not supported by any hardware that's shipping at this point, even though there's a noise function specified in GLSL. So whenever you call that noise function, it has to unfortunately fall back to software in order to support that.

So instead of using that noise function call, you can simply try and use a texture such as we use in the GLSL Showpiece example. And by using a noise texture to simulate the noise, instead of just calling noise from the shader itself, we can reduce the complexity of that shader to the point that it's hopefully going to be accelerated in the hardware.
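One way to picture the noise-texture idea above is to fill a small image with pseudo-random values on the CPU once, upload it as a texture, and sample it in the shader instead of calling noise. The sketch below is an assumption-laden illustration: `fill_noise_texture` is a hypothetical name, it produces plain white noise from a simple linear congruential generator, and it is not the method used by the GLSL Showpiece sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fills a size x size single-channel image with pseudo-random bytes.
   The same seed always produces the same image. Hypothetical helper:
   white noise only, not the smooth noise that GLSL's noise() describes. */
void fill_noise_texture(unsigned char *texels, size_t size, uint32_t seed)
{
    assert(texels != NULL && size > 0);
    uint32_t state = seed;
    for (size_t i = 0; i < size * size; ++i) {
        /* Linear congruential step; arithmetic wraps modulo 2^32. */
        state = state * 1664525u + 1013904223u;
        /* Keep the high byte, which is better mixed than the low bits. */
        texels[i] = (unsigned char)(state >> 24);
    }
}
```

The resulting buffer would then be handed to the usual texture upload path once at startup, so the shader only pays for a texture lookup per fragment.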

So then one other thing I wanted to mention. If you see GLRendererFloat as a library name in Shark, this means you're hitting the software renderer at some point. Also, one thing new to Leopard, the Leopard Profiler version 4.0, is that it's going to allow you to, similar to how you can set a breakpoint on if a glGetError error is set, you can set a breakpoint on if you ever fall back to software.

I also wanted to mention a little bit about framebuffer objects. Hopefully you saw some of the really cool demos that you could do this morning at John Rosasco's talk with Alex Eddy. He wrote some really cool applications using framebuffer objects. Basically what they are is a simple, intuitive, fast render-to-texture method for OpenGL. It's similar to pbuffers, not to be confused with pixel buffer objects.

Pbuffers is an older way of creating a context that has a drawable, and so you can basically render into a texture using pbuffers. But framebuffer objects are an actual new GL state piece that is cross-platform and very simple to use that will do the same thing without the complexities of, for instance, having multiple contexts because it's simply a state object. So because it is a state object, it's not another context. It also avoids requiring any flushing that pbuffers did require for synchronizing. And of course, we want to try and make our applications as asynchronous as possible.

And then one really cool part about framebuffer objects as well is that they are able to be used in sort of a trifecta. Framebuffer objects with vertex buffer objects and pixel buffer objects. You can basically render directly to a pixel buffer object, use that pixel buffer object as a vertex buffer object, and you can do this all on the GPU seamlessly without having to round-trip any of that data.

But one thing to keep in mind if you are going to use framebuffer objects, which are much preferred over pbuffers, the one thing that's different is although pbuffers work on everything, framebuffer objects currently are only supported on hardware that's capable of running ARB fragment programs. So that means like the ATI 9600 and above, the GeForce FX 5200 and above, and of course the Intel integrated graphics. So, those all support framebuffer objects.

Finally, I wanted to make sure that you guys try and optimize for the user's hardware. If a user buys a high-end graphics card, you've got a very capable GPU in addition to the capable CPU, but you might as well try and take advantage of that GPU resource for things that might not even be related to OpenGL.

So if you aren't using much of the GPU in your application, try and offload some of that work. There's some work being done in that field, like just a lot of research in terms of trying to use these GPUs for things other than graphics, so check that out if you haven't looked into that yet.

I wanted to talk about integrated graphics. So the Intel integrated graphics, it's very capable hardware. It's Core Image capable. It has fragment programming. It runs all the really cool apps. It runs, like, Front Row and the little special effects when you drop things into your Dashboard. It runs all those things. But do keep in mind that it does not have hardware TCL. And so whenever you're doing a lot of vertex data or complex vertex shaders, it will be going through the CPU for those.

So keep that in mind when trying to create the content that you're going to be delivering on those systems. And then finally, I just want to mention it does not have floating point texture support. Also, next part, I wanted to talk about, you know, now that we're shipping the Mac Pros, we're actually shipping build-to-order now with multiple video cards in a single machine.

You could, in theory, have both ATI and NVIDIA cards on your system, and you want to make sure that your application is able to switch between these two GPUs, say, if the user were to drag a window from one video card to the other video card. And so when you do this checking to make sure you're supporting both the video cards, do remember only try and check those capabilities once per renderer. And I finally want to mention, you know, don't turn off features just by checking, like, one unrelated extension or things like that. If you're trying to use one extension, check for that one in particular.
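Checking for one extension in particular usually comes down to searching the space-separated extension string for an exact name. The sketch below shows one way to do that matching on whole tokens, so that a name that is only a prefix of another extension does not count as a hit. The function name `has_extension` is hypothetical, and the string in the usage note is a made-up example rather than output from any real renderer.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if name appears as a whole space-separated token in
   ext_list, 0 otherwise. Hypothetical helper for illustration. */
int has_extension(const char *ext_list, const char *name)
{
    assert(ext_list != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0 || strchr(name, ' ') != NULL) {
        return 0; /* not a valid single extension name */
    }
    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == ext_list) || (p[-1] == ' ');
        int ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token) {
            return 1;
        }
        p += len; /* keep searching after this partial match */
    }
    return 0;
}
```

In an application, the list would be the string the current renderer reports, and the result would be cached and only recomputed when the renderer changes, in keeping with the advice above about checking once per renderer.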

Okay, so before you guys go, I've got a huge grab bag. There's quite a few hints I just wanted to throw out there. And so I'll just take my go at it. Let's see. So vertex data. For optimizing the amount of vertex data, you want to try to minimize the state changes that you make between those calls to the vertex data. You want to do things like try and maximize the vertices that you do per draw call.

Let's see. Display lists. Display lists are still useful even though I've been talking about vertex buffer objects. Display lists are still pretty good for when you have objects that are static. And it has an internal optimization scheme that it does that will optimize whatever you pass it. Well, not whatever you pass it, but if you pass it some good stuff, it'll be able to optimize it. So for instance, with immediate mode, it can optimize your immediate mode if you pass it consistent vertex data.

So I have an example here of some code that actually would not be optimized by the display list optimizer. And as you can see here, it's calling glBegin followed by vertex, vertex, color, vertex, and then ending. If you were to try and make a similar draw call, there would not be a similar way to convert that into a glDraw call. What I mean by that is the first vertices are going to be set to whatever the color happened to be before, which the optimizer doesn't know.

So instead of doing this, calling this color in the middle of it, we add a color at the very beginning of it as well to specify that we know what the color is going to be as we enter that immediate mode drawing. And that makes it so that that second color call is acceptable at that point.

Another hint, vertex data. You can try and use degenerate triangles to connect your strips, your triangle and quad strips. This helps you, as I was saying earlier, you can maximize the vertices per draw command, and this helps with that. Look it up if you aren't familiar with that. It's a pretty neat trick.
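For anyone who does want to look up the degenerate triangle trick, here is a minimal sketch of stitching two triangle strips' index lists into one so that both can go out in a single draw command. The function name `stitch_strips` is hypothetical. It assumes indexed triangle strips of at least three indices each, and it adds one extra repeated index when the first strip has an odd length so that the second strip keeps its original winding.

```c
#include <assert.h>
#include <stddef.h>

/* Joins triangle strips a and b into out by repeating the last index of
   a and the first index of b. The repeated indices form zero-area
   triangles that draw nothing. out must have room for
   a_count + b_count + 3 indices. Returns the number of indices written.
   Hypothetical helper for illustration. */
size_t stitch_strips(const unsigned short *a, size_t a_count,
                     const unsigned short *b, size_t b_count,
                     unsigned short *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    assert(a_count >= 3 && b_count >= 3);
    size_t n = 0;
    for (size_t i = 0; i < a_count; ++i) {
        out[n++] = a[i];
    }
    out[n++] = a[a_count - 1];      /* degenerate: repeat end of a */
    if (a_count % 2 != 0) {
        out[n++] = a[a_count - 1];  /* extra repeat preserves b's winding */
    }
    out[n++] = b[0];                /* degenerate: repeat start of b */
    for (size_t i = 0; i < b_count; ++i) {
        out[n++] = b[i];
    }
    return n;
}
```

The cost is a handful of extra indices per join, which is typically small next to the saving from issuing one draw command instead of two.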

The most efficient method I've seen of providing color data is actually as bytes, as unsigned bytes. So do that when you're able to. And finally, for your vertex data, try and align it to 16 bytes, because by doing so, SSE and AltiVec can take advantage of that, and you can do some faster processing of that data if the CPU needs to work on it.
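The two hints above, colors as unsigned bytes and 16-byte alignment, combine naturally into one vertex layout. The sketch below is an illustration under assumptions: the `Vertex` struct, `alloc_vertices`, and `is_16_byte_aligned` are hypothetical names, the layout assumes 4-byte floats, and the allocation uses C11 `aligned_alloc`, which is newer than the toolchains of the time of this talk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Three floats of position plus four unsigned bytes of RGBA color:
   12 + 4 = 16 bytes, so every vertex in an aligned array starts on a
   16-byte boundary. Hypothetical layout for illustration. */
typedef struct {
    float position[3];
    unsigned char color[4];
} Vertex;

_Static_assert(sizeof(Vertex) == 16, "Vertex should be exactly 16 bytes");

/* Allocates count vertices on a 16-byte boundary. The total size is a
   multiple of 16, as aligned_alloc requires. Release with free(). */
Vertex *alloc_vertices(size_t count)
{
    assert(count > 0);
    return aligned_alloc(16, count * sizeof(Vertex));
}

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Storing color as four bytes rather than four floats also shrinks each vertex from 28 bytes to 16, which helps with the bus bandwidth concerns discussed earlier.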

So, some grab bag hints for the tools. You can find out where you might be blocking in your application by using Shark's Time Profile (All Thread States) and also System Trace. They work pretty well for helping you find those blocks. And also, my favorite way of doing it is by using OpenGL Profiler's Trace, and you can actually see timings on a per-call basis.

So, if you see, for instance, that your glMapBuffer is taking a very long time at some point, and if you were not using the APPLE_flush_buffer_range extension, you know that the GPU must still be using that data at the point where you're trying to map it. So, you might try and push that glMapBuffer down until a later point.

Another bit of information. The driver symbols that you get in Shark can often be misleading. Usually you'll see gldPageoffBuffer, gldGetString. Symbols like this, those are usually false alarms. And if you see a symbol in the driver, keep in mind that it might not be doing the work its name suggests, because obviously we can't make the symbols for what's in the drivers available.

Also, some things to look out for when you're in Shark. If you see glgProcessPixels, again, this means that you're swizzling texture data to something that the GPU can handle. If you see GLRendererFloat as a library, this means that you're falling back to the software renderer. And finally, with the multithreaded engine, if you ever see GLI finish command buffer, this means that somewhere in your code, you are synchronizing with the multithreaded OpenGL engine. So hopefully you can optimize these parts out or minimize the parts where you have to do this.

Some hints for textures. You want to try and reduce the VRAM you use to avoid any texture paging, obviously. But some ways to do this are by, you know, you always want to try and use minimal texture formats. If you're only using, say, 8-bit per pixel textures as your source data, there's no reason to use float data as the type unless you're doing some computations after the fact that require that extra precision.

Also, you can compress your textures at the slight cost that it will take a little bit longer for the GPU to render it. It decreases the fill rate very slightly, but it will increase the amount that you can actually upload to the video card by a significant amount. And then finally, you can always use the shared and streaming texture types with pixel buffer objects and the texture range stuff to have large textures simply being drawn straight out of your system copy.

Finally, for textures, if you ever have large textures that need to be uploaded over time, during a real-time application, you can space out those uploads by only uploading sub-images of it using glTexSubImage. Otherwise, you may actually see stalls where you have too much data going up to the GPU in between your frames, and you may see hiccups as a result of that. So with shaders, make sure you only compile your shaders once. And then try and use your attributes when you're changing the data per vertex, but use uniforms otherwise.
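The idea of spacing out a large texture upload, mentioned above, can be sketched as slicing the image into horizontal bands and sending one band per frame. The helper below only computes which rows belong to a given frame. `upload_band_for_frame` is a hypothetical name, and the choice of whole-row bands and a fixed number of rows per frame are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Computes the band of rows to upload on a given frame when a texture
   of tex_height rows is sent rows_per_frame rows at a time. Writes the
   starting row and row count, and returns 1. Returns 0 once the whole
   texture has been covered. Hypothetical helper for illustration. */
int upload_band_for_frame(int tex_height, int rows_per_frame, int frame,
                          int *y_offset, int *band_height)
{
    assert(tex_height > 0 && rows_per_frame > 0 && frame >= 0);
    assert(y_offset != NULL && band_height != NULL);
    long long start = (long long)frame * rows_per_frame;
    if (start >= tex_height) {
        return 0; /* nothing left to upload */
    }
    int remaining = tex_height - (int)start;
    *y_offset = (int)start;
    *band_height = remaining < rows_per_frame ? remaining : rows_per_frame;
    return 1;
}
```

Each frame, the returned offset and height would be used as the y offset and height of that frame's sub-image upload, with the source pointer advanced by the same number of rows.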

And then finally some hints on compile time. So again, use CGL macros. Dave showed you it was getting like 10% or 20%-- or excuse me, like few percent increase in frame rate in an application that had a very significant amount of OpenGL calls. And so if you can use that, great.

And then finally, for your release builds, again, never try and call glGetError in your release builds. You can use OpenGL Profiler for that. You can use it-- same thing with seeing if you're falling back to software. You can use OpenGL Profiler for that if need be.

Try and minimize your synchronizing calls. Also, make sure that your VBL sync is turned on when you release it. But you may want to have VBL sync turned off for development purposes simply for benchmarking. But do remember that the reason that people tend to like Mac OS X a lot better than some other platforms is we tend to make things look nice in VBL sync, not tear.

And then also, if you're using the OpenGL Profiler APIs, which are basically, there's a header in the OpenGL framework that has some ways that your application can interact with OpenGL Profiler. If you are using those, try not to use those in the release builds as every little bit counts. And then finally, just wanted to mention that if you're profiling your application, if you're at -O0, you may get different results than if you're, say, at -Os or -O2 or whatever you actually have for your release builds. So sometimes you don't want to profile that way.

So, in conclusion, I just hope that you guys can move all your applications to be using some of the more modern techniques in OpenGL that we make available on Mac OS X, such as the vertex buffer objects, pixel buffer objects, and texture range, and also the frame buffer objects. And I want to make sure that you guys all use the tools because they are very, very good at what they do. And if you get good at them, you can make your applications scream.

So... More information, Alan is our 2D and 3D graphics evangelist. Also, we have Michelle Castajon, he's our OpenGL DTS engineer. We have a lab right after this, an OpenGL lab that will be in the Graphics and Media Lab. And also, an OpenGL Quartz Composer drop-in lab will be tomorrow morning in the Graphics and Media Lab at 9:00 AM. So stop on by if you have any questions.