Graphics • 1:00:11
This session focuses on techniques to optimize your application's use of OpenGL and 3D graphics hardware. A variety of optimization strategies, such as display list and vertex array optimizations, are discussed. A great session for all OpenGL developers covering the latest in high-performance graphics.
Speaker: John Stauffer
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Okay, so my name's John Stauffer. I manage the OpenGL Engineering group. So let's get right into it. So what we're going to talk about today is a brief overview of what's new in OpenGL. This will be what we have done differently or optimized since WWDC last year. Got to give you a yearly update as we go through this session.
We're going to go into some basic tips. We try to keep this short. We want to get into the more advanced optimization techniques for OpenGL. Then we're going to get into some detail on optimizing OpenGL texture uploads. And then a new optimization technique for asynchronously reading pixels back off the graphics card, which is very important for people who are trying to get pixel data back off. Vertex data throughput, how to optimize your vertex data uploads, making sure that you're getting optimal performance, uploading your vertex data up to the GPU. One-shot images. Sometimes you have images that are small.
Or aren't going to be reused. And you just want to draw them once up to the screen. So we'll talk briefly about how to optimize that. Pixel copy operations, how to optimally copy pixels around. There's certain ways that you can optimize that, making sure that they are VRAM to VRAM copies. And using threads. There's a lot of people out there trying to use threads. We see a lot of problems with that. Want to briefly cover that and make sure that people understand the limitations and how to make that work optimally.
So briefly, optimization strategies. There's two basic ways, two basic things that people are trying to do. They're either trying to maximize performance of their application, or they may be trying to minimize the CPU burden. And depending on what your application demands are, you'll want to focus on one of those two types of strategies.
And as you'll see in my presentation, effectively using the CPU can lead to greater optimizations than simply offloading all of your application's work. So there's two different types of techniques. Do be sure that you understand that the CPU is still a very effective processing device for getting performance.
[Transcript missing]
Okay, so what's new? Well, we've spent a year optimizing OpenGL. Some of the highlights are that we've been focusing quite a bit on immediate mode performance. There's a number of applications out there that have been ported that use immediate mode drawing, which makes it a key optimization, and we've been spending a lot of time on that to try to help those types of applications coming over to the platform.
Pixel transfer paths, and what I mean by that is any kind of copying of pixel data, RGBA, RGB data, what have you. We've been spending a lot of time optimizing those paths and continuing to improve that. So if your application is sending a lot of pixel data, we're working on those paths.
Vertex program emulation for any kind of application that needs to run across all of our CPUs and is relying on the vertex program feature. We are continuing to improve the emulation of vertex program on the CPU. So you can run a vertex program on all of our platforms and get the best performance possible.
Asynchronous texture downloads, as I mentioned before. That is also a new feature since the last time we talked, something that's very important. So as you can see, we have a list of extensions that we've added since last year, quite a few. We're continuing to add features regularly, and as fast as they get approved and made a part of the OpenGL standard, we will fold those into OpenGL.
Okay, so some basic tips. Things that you don't want to do or want to avoid: you want to avoid glFlushes. What glFlush does is truncate the command stream going to the processor and flush those commands up to the GPU. Now, the reason you want to avoid flushes, one, is that it's a kernel trap, so you want to avoid that kernel trap. Secondly, you only have a limited number of command buffers.
So if you keep issuing glFlushes back to back, you will run out of that resource, and that will be a synchronization point where we will have to wait for the graphics processor to finish processing those command buffers in flight before we can get a free command buffer to start working with again. So we tell people just to avoid glFlushes. Now, there are points, there are times when you'll want to use those, and we'll go into that a little bit later.
I tell people never to use glFinish. glFinish is a truly synchronous call. What glFinish does is it submits the current commands, stops, and waits for the graphics processor to be done with those commands before it will return.
So it is truly a serialized synchronization point that will cause the CPU and GPU to stall against each other. So I tell people just don't call that at all. Avoid glReadPixels if you can. You would want to use some of our more modern ways of doing it. One of the techniques for replacing glReadPixels is to use glCopyPixels.
And glCopyPixels is useful for getting VRAM to VRAM copies. So for instance, if you wanted to save off some pixel data, some depth data, some stencil data, instead of reading it across the bus, saving it with the CPU, and then uploading it back, what you want to do is use glCopyPixels to store it somewhere else in VRAM. Don't read it across the bus.
Save it somewhere in another buffer up in VRAM, and that way you can get the high bandwidth of copying it and restoring it when you need it. Asynchronous texture downloads are also a good way to replace glReadPixels: you get an asynchronous readback of your data without the stall of waiting for the read pixels to finish.
So again, immediate mode performance, we've been optimizing it, but it is still one of the slower paths. So we tell people when possible, avoid immediate mode drawing. Instead, use some of our more advanced extensions. Now, there's one exception to this, and that is in display lists. If you use immediate mode drawing in display lists, we will take that immediate mode data, we will convert it into a more optimal form, and prepare the data and then upload it into VRAM, cache it in VRAM for you.
So display lists are an acceptable place to use immediate mode, and it turns out that's fairly convenient for a lot of people who already are using immediate mode but realize their data is static. If your data is static, you wrap a glNewList/glEndList around it.
We will post-process the data and stick it in video memory, and then you'll get the benefit of that optimization. So, minimize state changes. Most people that have been working with OpenGL know this one. State changes are expensive. They do cause a revalidation of the hardware state, which can be slow if you do it a lot. So avoid redundant state changes, and do your drawing in groups of state. What you want to do is coalesce your drawing under a given state setting, which allows you to minimize your state changes.
Okay, so let's get into more detail. Texture uploads. So what we're going to talk about is the texture pipeline overview, to give people a brief description of what the pipeline looks like. We're going to talk about some of the optimization basics, and then we're going to get into some of the extensions. The extensions can be different depending on whether you're talking about a power-of-two or non-power-of-two texture.
So we'll differentiate a little bit between those two types of textures. For people who aren't familiar with that, there are power-of-two textures, which is more standard OpenGL, and recently, over the last few years, there's been the non-power-of-two texture, which allows you to have a texture of any size. That's very useful for general image data; video, pictures, what have you, will use non-power-of-two.
So here's a basic diagram of the OpenGL pipeline. The part that we're going to focus in on for this section of the talk, we're going to focus in on the Pixel pipeline. And looking at just a block diagram of what the pipeline looks like, standard OpenGL on Mac OS X, you can end up with, at any time, if you're passing a texture through the system, you can end up with four copies of the data. So, obviously, that's a lot, right? You want to avoid those. So, what we're going to talk about is we're going to talk about how to eliminate each one of those copies and get you performance increases, obviously, as you do that.
But in the default setting, you can end up with four copies of your texture as it passes through the system. One copy is going to be the copy that you have. One is going to be what the framework has. One is going to be a copy that the driver keeps. And then one is going to be in video memory. So, let's get into some of the ways to optimize that.
So, again, minimizing CPU copies is the key here. We don't want to give the CPU redundant things. We want to optimize its time. So, correct setup will minimize the CPU copies. And what we mean by that is that you're going to use the right texture formats, the right pixel formats, which will ensure optimal paths. It will also ensure that the graphics processor accepts that data type. So, you know, OpenGL supports a very large number of pixel types. And the graphics processors also accept quite a few pixel types.
But if possible, you want to stay within the confined set such that you are guaranteed that the particular graphics processor you're on has native support for that type, and it won't have to go through some kind of conversion to a type that is compatible for that graphics processor. So, here I've got listed three types: BGRA with the 8888 reversed and the 1555 reversed packings.
Now, those are the native Macintosh formats. So, when you set your monitor to 32-bit pixel mode or millions of pixels or thousands of pixels, those are the pixel types that the screen is running in. And that will give you a compatible type. It also turns out that the graphics processors understand that type natively.
Also, you'll see a YUV type there for people who are doing video or have a YUV source. They can use a YUV texture and that will be accepted as well. So, when I usually put these up, some people ask, "Well, what about RGBA?" Which is the standard OpenGL type.
RGBA isn't natively accepted by all graphics processors. Sometimes it will have to go through a copy and get swizzled into a different format. Usually it's a fairly optimal copy, and sometimes RGBA might even be natively supported. But, in general, you have to be a little careful with that type.
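To make that concrete, here's a hedged sketch in C of what uploads in those native types look like; w, h, and the pixel pointers are placeholders, and the YUV case assumes the APPLE_ycbcr_422 extension:

```c
/* 32-bit "millions of colors": BGRA bytes with reversed 8/8/8/8 packing. */
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0,
             GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels32);

/* 16-bit "thousands of colors": reversed 1/5/5/5 packing. */
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0,
             GL_BGRA, GL_UNSIGNED_SHORT_1_5_5_5_REV, pixels16);

/* YCbCr 4:2:2 video source, via the APPLE_ycbcr_422 extension. */
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, w, h, 0,
             GL_YCBCR_422_APPLE, GL_UNSIGNED_SHORT_8_8_APPLE, pixelsYUV);
```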
So let's talk about the extensions. So client storage is an Apple extension. That extension is a way to eliminate the framework's copy of a texture. What it does is that instead of having the framework make a copy of the texture, the framework instead keeps a pointer of that texture into your memory. So if the application has retained a copy of the texture, you can just tell us, use my copy, don't make a copy for yourself.
And that will eliminate one CPU copy, and that will eliminate the memory associated with keeping that copy. Apple texture range is another Apple extension. This extension eliminates the driver's copy. And there's two different ways to drive this: cached and shared. Cached tells the driver to keep a copy of the texture in video memory. Shared means simply point to the copy in system memory when you're doing your drawing. And I'll get into a little more detail on this in a bit.
But keep those concepts in mind. They are important. Now, EXT texture rectangle is an extension required by some hardware to allow texture range to work properly. And the reason for that is that some hardware requires power-of-two textures to be stored in a particular internal format.
So if you don't use that extension, you're not necessarily going to be guaranteed on all graphics processors to eliminate the graphics driver's copy of the texture. So keep in mind that Texture Rectangle is a very important extension. And Texture Rectangles tend to be more widely supported for eliminating a driver copy when using texture range.
OK. So let's go back to the block diagram. So what we see is that with client storage, we eliminate the framework copy, as I said. And looking at a little bit of source, it's fairly simple. All you do is enable the client storage option when you are building your texture. So before you load it, just call glPixelStorei, enabling the client storage option.
And that will eliminate the framework copy. Do remember that when you do that, you are now responsible for the memory. So if you go and delete your copy of the texture, the framework is still pointing to it, and if you try to do something that requires us to access that memory, you'll crash.
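A minimal sketch of that setup, assuming pixels is an application-owned buffer that outlives the texture:

```c
/* Tell OpenGL to keep a pointer to our pixels instead of copying them. */
glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0,
             GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
/* 'pixels' must stay valid: the framework now points at it. */
```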
Now, looking at the Apple texture range and EXT texture rectangle extensions, as I said before, this eliminates the driver's copy. And I'm showing a block diagram here of it running in cached mode. So what happens is, using these two extensions, the driver will be pointing directly to the framework's copy and DMAing it directly from the framework's copy up into video memory, and keeping the copy in video memory. That's running in cached mode. Okay, and again, here's a little code snippet showing how to use texture range. Very simple, one call: glTexParameteri with the texture rectangle target, the texture storage hint parameter, and GL_STORAGE_CACHED_APPLE for the hint type.
Now, using all these together, what we get is that the graphics processor is now going to be pointing directly to the application's copy of the memory, and it's going to be DMAing it directly into video memory. So what you get is you've eliminated the CPU actually making a copy.
We point directly to your copy of the texture and DMA it directly. So the CPU never actually makes a pixel-to-pixel copy. The graphics processor is DMAing it directly into video memory. And looking at those code snippets together, it simply looks like this. It adds basically two calls. You'll see that I'm using a texture rectangle type.
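Here's a hedged reconstruction of that combined snippet; width, height, and pixels are placeholders:

```c
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_RECTANGLE_EXT, tex);

/* Rectangle textures have no mipmaps; use a non-mipmapped filter. */
glTexParameteri(GL_TEXTURE_RECTANGLE_EXT, GL_TEXTURE_MIN_FILTER, GL_LINEAR);

/* Driver copy eliminated: hint that this texture should be cached in
 * VRAM and DMAed straight from the application's buffer. */
glTexParameteri(GL_TEXTURE_RECTANGLE_EXT,
                GL_TEXTURE_STORAGE_HINT_APPLE, GL_STORAGE_CACHED_APPLE);

/* Framework copy eliminated: OpenGL keeps a pointer to our pixels. */
glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);

glTexImage2D(GL_TEXTURE_RECTANGLE_EXT, 0, GL_RGBA, width, height, 0,
             GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
```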
I've inserted the two previous code snippets between the bind and the glTexImage2D call so that I'm getting the direct DMA transfer as I just showed. Now, switching gears a little bit and looking at the shared option. Now, the shared option, as I said before, makes it such that we are not going to cache a copy in video memory.
Instead, what we're going to do is set it up such that the graphics processor is going to look up the texels at rasterization time directly from system memory. So while it's drawing the polygon, each time it goes to fetch a texel, it'll go across the AGP bus and look it up out of system memory. So that eliminates the copy that's in video memory.
And there are some uses for that, and I will show a demo of that shortly. So here's what the code looks like. It's the same glTexParameteri call as the cached case, except instead of GL_STORAGE_CACHED_APPLE you would pass GL_STORAGE_SHARED_APPLE for the hint type, and that will mean don't cache a copy in video memory. Okay.
So, looking at the block diagram again. So, this is what it looks like when you run in shared mode. There's no copy in video memory. As you're rasterizing, it is looking the texels up directly from the application's memory, and you end up with one copy: your copy in the application. There's times when this is useful, times when it's better to use cached.
Okay, so same code snippet. All I had to do to get the shared option in there, again, is change the cache to shared. I'm going kind of fast. So what I should point out is that this example is actually available on the website. You'll see at the bottom that there is the sample name. You can look this up, and it has this code in there, so you don't need to be copying this down.
Okay, let's talk a little bit about cached and shared, and when one is appropriate versus the other. So cached mode is better for textures you're going to use multiple times. You don't want to be reading a texture across the AGP bus multiple times. So if you're going to use it a lot, you're going to want to cache it in video memory and then use it from there multiple times, and not require it to go across the bus. It's also best when you're using linear filtering. Linear filtering is a little bit higher bandwidth usage because it's having to look up neighboring texels to do the linear filtering.
And now, talking about shared for a second. Shared is better for one-shot images that are very large. And the reason that I say very large is that if you have a low video memory case and you wanted to upload a large texture, what you don't want to have happen is for that texture to, say, evict everything else out of video memory.
So if you're running in a low video memory case, it is possibly a benefit to run in shared mode, where you're not going to consume any video memory. You can leave what's resident in video memory there and just look up your image straight, DMAing as you're rasterizing, as opposed to DMAing into video memory.
There are some caveats to that, and that is that shared mode runs best when running in nearest filtering. If you're running in linear filtering, again, as I said, as it's rasterizing, it's going to be looking up those texels. Well, linear filtering requires neighboring texels, so it'll fetch more texels from the neighboring part of the texture, which will cause you some performance degradation. As well, shared works really well when you're scaling the image down. For the same reason I just said: if you're scaling it down, it's not going to have to pull all the pixels across the bus.
So for power-of-two, to briefly talk about that: all the same extensions I just talked about are actually applicable to power-of-two textures. This is the same code snippet I had, and all I did was replace the rectangle texture target with the GL_TEXTURE_2D enumerant. And the difference here is that, as I mentioned before, power-of-two sometimes won't get you a direct DMA. Not all graphics processors support direct DMA for power-of-two textures, and instead what will happen is the driver will make a copy. You can still use all the same extensions I've been talking about, but sometimes you won't get the direct DMA.
Okay, let's talk about how to manage texture range. Now, as we saw in the diagram, the graphics processor is going to be starting to look directly at your memory. Okay, so the graphics processor and CPU now are going to be sharing the same piece of memory as it's rasterizing.
Now, there's a problem with that, and that is that you are now going to have to synchronize the CPU and the graphics processor such that they don't collide. You can't have the CPU and the GPU touching the same piece of data at the same time, right? It's a standard problem when you have multiple devices looking at the same piece of data.
So what you want to do is double buffer it, and between the double buffering, you're going to have to issue a flush. So I've got a diagram here. I'll show this. So if you're running single buffered, and the CPU has just generated a texture, let's say read it in or decompressed it, now the CPU is going to want to flush that up to the graphics processor. So it issues a glFlush to get that command in flight and get the transfer of that data up into video memory, and then the graphics processor is going to do its work of processing it and swapping it to the screen.
So there you just did one frame, right? And when single buffered, I have to synchronize my CPU and GPU serializing the processing because I only have one data set. Only one can work on it at a time. So as we build through this, this is how the frames go. I can only have the CPU and GPU working one at a time. Now, if I go with double buffering, let's see what happens. So let's say we start this sequence. CPU generated a frame.
It flushes it. Now, in the next frame, I can see that if I had double buffering, the CPU can start working on the second texture while the GPU is processing the first, right? So now we can flush the second one and swap the first one. So I just showed one frame while I'm submitting the next one to the graphics processor for processing, and likewise, it continues on, right? Now, the CPU can start working back on texture one, and the GPU is working on texture two, and so on and so on. And basically, if you had, you know, this is an exaggeration where you have perfect parallelization, but it does make a difference. It does make a difference where you are getting asynchronous behavior between the CPU and GPU.
Okay. Fence. So let's talk about how to synchronize. How do we synchronize the two processors? How do I know when the GPU is done processing my data, for instance? It's pointing at my data set; how do I know when it's done accessing that data? Well, if you're using texture range, what you need to use is the Apple fence extension.
And the Apple Fence extension, what you can do is you insert a token into the command stream, and you can query for the token to determine when it is done reading your texture, so that you'll know that it is now safe for the CPU to start touching the data again. So there's a couple ways to do that.
You can use it by inserting a token, or you can actually use it directly by referencing a texture object. And a texture object is just your standard texture ID; what you do in the test object or finish object command is pass it the GL_TEXTURE object type and the texture's name.
So looking at a little bit of code for that. So the first two commands here show how to set up a fence. You just do a glSetFenceAPPLE. You can pass it any name you want; it's just a token that you pass into the command stream. And then when you are ready to start touching the texture again with the CPU, you would then test to make sure that the GPU is done.
And that's a synchronization point where the CPU will wait for the GPU to be done reading that data, at which point you can start touching it again with the CPU. Now, the last command up on this screen is a way that you could use the fence extension.
Without having to set a fence explicitly, you can just test for a texture object. And all you would do is call glFinishObjectAPPLE with the GL_TEXTURE target type and the name of the texture. So if you bound to a texture, you would just test against that same texture ID number.
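A hedged sketch of both forms, assuming tex is the texture being shared with the GPU:

```c
GLuint fence;
glGenFencesAPPLE(1, &fence);

/* ...draw using the shared texture... */
glSetFenceAPPLE(fence);    /* drop a token into the command stream */
glFlush();                 /* submit the commands so the GPU reaches it */

/* ...do other CPU work while the GPU drains the commands... */

glFinishFenceAPPLE(fence); /* blocks until the GPU has passed the token */
/* Now it's safe for the CPU to touch the texture's memory again. */

/* Or, without an explicit fence, wait on the texture object itself: */
glFinishObjectAPPLE(GL_TEXTURE, tex);
```

Okay. So we're going to switch to demo machine two.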
So what I wanted to show here is an example of texture range. So you'll notice that the CPU is doing very little here. First thing I wanted to point out very quickly is that the CPU is continuing to do very little. And that's a sign that the CPU is not making copies of the pixels, right? Because if the CPU was copying the pixels, I would be seeing a big spike in CPU processing time. Instead, what's happening is the graphics processor is talking directly to the memory controller, getting a direct DMA. So the data is not going through the CPU.
So now I'm going to turn on my Infinite button. It's a button I recommend everyone put on their app. Make it go infinitely fast. So even though now I'm doing 240 frames a second, I'm getting a gigabyte a second across the AGP bus, I still have no CPU time. Because again, the CPU is not doing anything here. The CPU is orchestrating this and not copying the data. It's simply directing the traffic, as you might think of it that way.
Now, what I really wanted to show with this example is using the shared option. So as I said before, the cached option is good for drawing multiple times. But the shared option is good when you shrink an image down. Now, it turns out this image actually is 1024 by 1024. It's a lot larger than my window.
So if I switch to shared mode, now you'll see I'm getting three gigabytes a second. Well, that's not even possible. The application thinks it is, though. And the reason is that I've shrunk the image, and it's nearest filtering. Some of the texels of that image aren't actually going across the AGP bus, because the graphics processor is skipping scan lines and skipping pixels, only selecting the ones it needs to draw the image.
So now I'm getting 700 frames a second, as opposed to one gigabyte a second and 240 frames a second. So quite a boost. It doesn't take any video memory, because I'm not caching and copying video memory. I'm reading it directly across the bus. So there's times when this type of technique can be a large win. I'm going to switch back slides.
Okay, asynchronous texture downloads. Asynchronous texture downloads are basically the same thing as uploading a texture, which we just talked about, where you set up a texture as an AGP texture for direct DMAing. Asynchronous texture downloads are the same setup, but in reverse.
So you set up the texture the same way, and then you use glCopyTexSubImage to copy the data into that texture. So the way it works is that glCopyTexSubImage is the call that initiates the transfer from video memory back into your texture in system memory. The reverse. And that is an autonomous call that will happen asynchronously.
So the next time you issue a flush, there will be a glCopyTexSubImage call in there, and the flush will issue a DMA transfer from video memory into system memory. Okay? And that's autonomous. The CPU doesn't need to wait for that event to happen.
Now, there needs to be a synchronization point, because the CPU needs to know when that's done. So what you use, at some later point, is the glGetTexImage call, and that's a synchronization point that will wait until the transfer is done. Now, hopefully you've done enough processing between your glCopyTexSubImage and your glGetTexImage that the transfer is done and you don't have to stall and wait. So the idea here is that you separate those as far as you can, maybe double buffer them, triple buffer them, do some processing between those.
So the basic setup of those is, again, the same. The glGetTexImage will take the same pointer as you originally passed for the texture, and the parameters must match the setup of the texture. So however you set up the texture, those same parameters will be used in these calls as well. And, again, you do this as late as possible to get the maximum asynchronous behavior. So let's look at a little bit of code. The setup, you'll notice, is the same as it is for a texture upload. It's exactly the same.
Now, the download is the key part of this. And the two key calls are glCopyTexSubImage and glGetTexImage. If you issue those calls on a properly set up texture, you'll get an asynchronous download. And on my systems at work, I can get about 500 megabytes a second download performance, which is usually pretty acceptable for most people, particularly considering it can be an asynchronous operation.
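A hedged sketch of the download path, assuming the texture was set up as above and pixels is the buffer originally handed to glTexImage2D:

```c
/* Kick off the VRAM -> system memory DMA; this returns immediately. */
glCopyTexSubImage2D(GL_TEXTURE_RECTANGLE_EXT, 0,
                    0, 0,          /* destination offset in the texture  */
                    0, 0,          /* source position in the framebuffer */
                    width, height);
glFlush(); /* submit the command so the transfer can start */

/* ...do as much other processing here as possible... */

/* Synchronization point: stalls only if the DMA hasn't finished yet.
 * Format and type must match the texture's original setup. */
glGetTexImage(GL_TEXTURE_RECTANGLE_EXT, 0,
              GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
```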
Okay, let's talk about vertex data. So you'll notice in this part of the talk that vertex data setup and optimization are about the same as for textures; it's just a different data type. So a lot of the parts of the discussion are the same, and all we're going to do is walk you through and point out some of the differences and peculiarities. So we're going to go through a pipeline overview, we're going to talk about the basics again, and then we're going to talk about the extensions. Now, we're going to separate dynamic and static data, talk about some differences, and get a little bit more detail on display lists.
So in this part of the talk, we're going to point out the geometry part of the pipeline, not the pixel part of the pipeline. And let's talk about some basics. So the first thing is that you want to pick data types, like in the pixel talk, you want to pick data types that are most optimal.
And the most optimal for the vertex paths are floats, shorts, and unsigned bytes. If possible, stick with those types. Most graphics processors will handle those types natively, so you will be able to get optimal upload performance. And in some of the cases where the CPU might be making a copy, we've spent time optimizing those paths. So these are the ones that will give you optimal performance.
The other basic point of optimizing vertex upload performance is that you want to avoid function call overhead. Now, obviously, immediate mode, where you're sending one vector of data per call, is pretty inefficient as far as a copy routine. What you want to do is use the vertex array calls, glDrawArrays and glDrawElements, to get the data through the system with as few function calls as possible.
Another good technique is to use CGL macros. CGL macros are a way to directly reduce function call overhead, and I'll show an example of that. It can be pretty dramatic how efficient you can make the function calls when you call through the context's function pointers directly instead of going through the top-level library entry points. So that's a concept that people may want to keep in mind if they're making a lot of function calls.
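A minimal sketch of the CGL macros mechanism; the convention is that CGLMacro.h expects a local variable named cgl_ctx:

```c
#include <OpenGL/CGLMacro.h>

static void draw_strip(CGLContextObj ctx)
{
    /* CGLMacro.h turns the gl* calls below into macros that dispatch
     * straight through this context's function pointers. */
    CGLContextObj cgl_ctx = ctx;

    glBegin(GL_TRIANGLE_STRIP);   /* no top-level entry-point overhead */
    glVertex3f(0.0f, 0.0f, 0.0f);
    glVertex3f(1.0f, 0.0f, 0.0f);
    glVertex3f(0.0f, 1.0f, 0.0f);
    glEnd();
}
```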
So another key concept: when you're drawing vertex data, passing vertex data into OpenGL, you want to maximize the number of vertices per draw command.
You want to get as many vertices per begin end as possible. And that can make a significant performance improvement if you can do that. Another optimization technique is to offload CPU processing using vertex programs onto the GPU. So if you have computational processing you're doing on the vertices, think about trying to offload that work to a vertex program on the graphics processor.
Okay. So how do we eliminate CPU copies? Another key concept is almost the exact same as the textures. What we can use is we can use the Apple Vertex Array Range, which is a parallel API to the texture range extension. And you could also think about using the new ARB standard, which is Vertex Buffer Object.
Those are nearly the same type of API. The difference is that the ARB vertex buffer object is a cross-platform API that allows you to optimize your vertex data throughput. Caching static data in VRAM is a key concept here.
Just like textures, you want to be able to cache that data in VRAM. Use display lists for static data. Again, we'll process that data and cache it in VRAM for you if you use display lists. It's a very effective way to get your data processed properly and into VRAM. So looking at the pipeline, like textures, there can be multiple copies of the data as it goes through the pipeline.
This is showing immediate mode drawing. With immediate mode drawing, we are required to keep the current vertex state before we pass it on to the graphics processor. So using immediate mode, you get one extra copy immediately. If I switch to using vertex arrays, I can eliminate that one copy just by using vertex arrays. So I'm saving myself some processing time by using vertex arrays immediately.
Now, let's go into these extensions. So, for Apple vertex array range with dynamic data, what you want to do is pass it GL_STORAGE_SHARED_APPLE; just like on the textures, we were using the shared hint for how we want the data to be treated. And for the ARB vertex buffer object, we want to use the GL_DYNAMIC_DRAW_ARB constant, which is the equivalent of our storage shared constant. So, it will give you the optimal treatment for dynamic data.
And what happens when we use these extensions combined with vertex arrays? We end up with the same thing we had for textures, and that is we get a direct DMA from the application's copy of the data directly into the graphics processor's pipeline. So it will read it directly into its pipeline and process it. No copies in video memory.
So, looking at some sample code, how to set up for dynamic data using vertex array range. It's very simple. There's two calls that are key to this. glVertexArrayRangeAPPLE: you just pass it a pointer and a size, and that tells us how big a piece of memory you want to map.
You malloc the data, you give it to us, tell us where the pointer is and what size it is. We map it and prepare it as AGP memory, a suitable storage area for direct DMA. Now, you need to make sure you flush that data. So, you need to tell us when the data's changed.
So, you have to call glFlushVertexArrayRangeAPPLE anytime you change the data. That includes initially when you first set it up, or when you modify some sub-region. You tell us a pointer and a size that you want us to flush. It can flush sub-regions; it can flush the whole thing. You tell us anytime you change the data, and we'll make sure that all caches, all copies, are synchronized with your copy of the data.
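Here's a hedged sketch of the dynamic setup; numVerts is a placeholder and the layout is assumed to be three floats per vertex:

```c
#include <stdlib.h>

GLsizei  size  = numVerts * 3 * sizeof(GLfloat);
GLfloat *verts = (GLfloat *)malloc(size);

glVertexArrayRangeAPPLE(size, verts);  /* map our buffer for direct DMA */
glVertexArrayParameteriAPPLE(GL_VERTEX_ARRAY_STORAGE_HINT_APPLE,
                             GL_STORAGE_SHARED_APPLE); /* dynamic data */
glEnableClientState(GL_VERTEX_ARRAY_RANGE_APPLE);

/* ...fill in or modify the vertex data... */

/* Flush every time the data changes; sub-regions work too. */
glFlushVertexArrayRangeAPPLE(size, verts);

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, verts);
glDrawArrays(GL_TRIANGLE_STRIP, 0, numVerts);
```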
Okay, Vertex Buffer Object, a little more setup. Not much more, though. You'll see that we bind. Buffer Object has an object-type binding where you will bind to a name, and that will give you the ability to switch between buffer objects. So what you want to do is you can create many of these, and you can bind to them as you need to.
And what you do is pass in your pointer and the size, basically setting up your data store, using glBufferData. And in that call, you'll see that we're passing the dynamic draw constant, and that tells us that this particular buffer object is going to be set up for dynamic drawing, and we're going to be changing the data frequently.
And then you call glMapBuffer. glMapBuffer is where you actually get the pointer back. So instead of you allocating the memory, OpenGL is going to allocate the memory for you and hand back the pointer. Then you fill out the data, and then you unmap it. And glUnmapBuffer is the equivalent of our flush. On unmap, we flush the data out, and now all the caches have been synchronized and the GPU is ready to use that data.
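And a hedged sketch of the equivalent vertex buffer object setup, using the ARB-suffixed entry points of that era; size and numVerts are placeholders:

```c
GLuint vbo;
glGenBuffersARB(1, &vbo);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);

/* Allocate storage; DYNAMIC_DRAW says we'll rewrite the data often. */
glBufferDataARB(GL_ARRAY_BUFFER_ARB, size, NULL, GL_DYNAMIC_DRAW_ARB);

/* OpenGL allocates the memory and hands the pointer back to us. */
GLfloat *p = (GLfloat *)glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
/* ...write vertex data through p... */
glUnmapBufferARB(GL_ARRAY_BUFFER_ARB); /* the flush: caches synchronized */

/* While the buffer is bound, the pointer argument is an offset into it. */
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);
glDrawArrays(GL_TRIANGLE_STRIP, 0, numVerts);
```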
So again, you'll see that the sample code that shows this is listed at the bottom of the slide. It's going to be available tonight up on the server so that everyone will be able to download this. We've updated it with vertex buffer objects so that you'll be able to see how to use the new extension.
Okay, static data. So static data is almost like dynamic data; it just uses different constants. For Apple vertex array range, we use the GL_STORAGE_CACHED_APPLE hint. For ARB vertex buffer object, we use the GL_STATIC_DRAW_ARB constant. And the rest of the setup's the same, basically.
You just pass the different constants. Display lists, again: you can use immediate mode between a begin/end call, and we'll post-process the data to put it in an optimal format for uploading. A key thing to remember for display lists is that you want to pass consistent vertices.
What I mean by that is that if you pass a glBegin, and then for the first vertex you pass a color and a vertex, and then for the next vertex you pass a normal and a vertex, you're passing different types of data per vertex.
And what that does is it'll actually confuse our optimizer, make it such that it won't optimize the data, and you won't actually get any benefit from it. So what you want to do is the first vertex you call, you want to make sure that you're passing all the data that you're going to be required per vertex.
So if I need a normal and a color, pass a normal and color for the first vertex; then you can call anything you want, as long as you don't call something besides a color and a normal. I'm not sure that's 100% clear, but we can talk about it more.
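As a hedged sketch of what "consistent vertices" means inside a display list:

```c
GLuint list = glGenLists(1);
glNewList(list, GL_COMPILE);
glBegin(GL_TRIANGLES);
/* First vertex declares the full per-vertex layout: color + normal. */
glColor3f(1, 0, 0); glNormal3f(0, 0, 1); glVertex3f(0, 0, 0);
/* Every following vertex supplies that same set of attributes... */
glColor3f(0, 1, 0); glNormal3f(0, 0, 1); glVertex3f(1, 0, 0);
/* ...and never introduces a new attribute type mid-primitive. */
glColor3f(0, 0, 1); glNormal3f(0, 0, 1); glVertex3f(0, 1, 0);
glEnd();
glEndList();
```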
Okay, so vertex array range and display lists. What that gets us for static data is that, like textures, we can cache the data in VRAM. And there's three different ways to do it, as you'll see: there's vertex array range, there's display lists, and there's vertex buffer objects.
They get you basically the same behavior, and that means that you're caching static data in VRAM, and now when you reuse that data, you're getting the full bandwidth out of the graphics processor bus, and it gets you a significant performance boost. Okay, looking at static data setup, similar to the dynamic data setup, where you are passing the cached hint instead of the shared hint for the vertex array range.
And again, for the vertex buffer object, all you would do is change it from the dynamic to the GL_STATIC_DRAW_ARB constant for the static data setup. And in parallel with the texture discussion, let's talk about how to synchronize your data. If you're using vertex array range, the application and the graphics processor could be sharing the same data, so you're going to need to manage the synchronization and flushing between your draws if you're going to double buffer the data.
So it's the same type of operation, right? The CPU generates data and flushes it to the GPU. The GPU processes the data and flushes it up to the screen, for single buffering. And the CPU and GPU are going to be running serialized; they're not going to be operating in parallel. So let's see. Same all the way through the frames.
And for double buffering the data, again, what we can do is have the CPU process the data, flush it to the GPU, and as the GPU's processing, the CPU can start processing the data again, and so on. Through the frames, we can theoretically get up to double the performance for something that's perfectly parallelized. Okay.
Similar to the texture range, you use the fence for synchronizing the vertex array range data. You'll need to know when the GPU is done processing the data. So what you do is you would set a fence or you would use the test object mechanism for referencing a vertex array object.
And that will let you know when that processing has been completed so that you can start touching the data again with the CPU. So the fence extension is what you'll be looking for to use for vertex array range. And vertex buffer objects don't require this. They have their own synchronization mechanism. And it's the -- you map, you change your data, and you unmap. So you don't need to use the fence extension for the buffer object, only for the vertex array range extension.
Just like the textures, looking at some sample code. The first two lines are the same, where I'm setting a fence and finishing against that token I've inserted, the finish fence. And then in the third line of code down there, instead of using a GL_TEXTURE type, I'm using a GL_VERTEX_ARRAY type for testing against a vertex array object. And that will allow me to set a synchronization point where I can be guaranteed the graphics processor is done touching the data in the vertex array range, and allow me to synchronize the graphics processor and CPU.
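A hedged sketch; the fence name comes from glGenFencesAPPLE as before, and the object-test form is assumed to take GL_VERTEX_ARRAY with the name unused:

```c
glSetFenceAPPLE(fence);    /* after submitting draws that read the array */
glFlush();

/* ...CPU work... */

glFinishFenceAPPLE(fence); /* GPU is done reading the vertex array range */

/* Or test against the vertex array range directly: */
glFinishObjectAPPLE(GL_VERTEX_ARRAY, 0);
```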
So here's a little bit of history. So last year I showed this slide. Not quite; I showed it going out a little bit further. But I wanted to show what we've been doing. So I talked about some optimizations. What this slide shows is exactly the data I showed last year. And with all the hardware changes and software changes we've made over the year, here's where we are this year. So it's a huge increase in performance.
And looking at some charts here. So if we look at immediate mode performance, for eight vertices per begin/end, we've gone up 800%. Vertex arrays, 1,100%; 1,000% for vertex array range; and 1,700% for display lists. Now this is using very small batches, only eight vertices per draw command. So it's a very small data set per draw command. This shows some of the function call overhead associated with small drawing batches. Looking at another point in that chart, this is using 42 vertices per draw command.
So this is a little more optimal setup. But you'll see that we're still making quite a bit of performance gains, up to 477% for immediate mode. So as I said, we're working quite a bit on immediate mode. So since last year, we've almost increased performance of our systems by 500%. And this is not only a software change, but a hardware change, right? So this is comparing state-of-the-art hardware and software last year with state-of-the-art hardware and software this year. Almost 500% faster. Okay, switching to demo two, please.
Okay, what I wanted to show here is just some of the effects I talked about. So here I've got just a simple mesh. This only has eight vertices per strip. You can see that I'm only getting 1.5 million triangles a second with immediate mode. Now, if I increase the detail of this mesh by selecting this option, so now I'm up to a mesh that has 198 vertices per strip.
You'll see that my performance jumped dramatically. Now I'm up to 12 million triangles a second. Now, as I said before, you can reduce function call overhead using CGL macros. So I've got an option here to turn on CGL macros. I just select CGL macros, and you'll see that I went from 12 million to 17 million. So I got 5 million more triangles a second in immediate mode performance by enabling CGL macros.
It's a really large difference. But now let's see what we get when we use a more optimal form of drawing. So I'm going to switch now from immediate mode to just draw arrays. Now I go to 24 million. So draw arrays is more optimal than the best you can make immediate mode. Now if I try using some of the extensions we talked about, let's switch to vertex array range.
So draw arrays with vertex array range: I go to 50 million. Okay, so we're making some pretty good strides here. Now let's say my data's static, which this happens to be. Let's switch to display lists. Again, display lists are set up to cache the static data in video memory, so it'll be using the bus bandwidth available on the graphics processor.
So I go from 50 million to over 100 million. So we started off at just a few million, and now we're up at 100 million. So using the proper extensions, understanding how to optimally pass your data through the system can make a very, very large difference. OK, back to slides, please.
Okay, one-shot images. The best way to pass up one-shot images that are small is using glDrawPixels. The reason being that the overhead for small images is not the copy of the data, which glDrawPixels does; glDrawPixels will always copy your pixel data. It's the function call overhead of getting in and out of the system.
So you have to weigh the function call cost of driving OpenGL against the expense of copying the pixels. glDrawPixels works really well for small images, and I recommend that you experiment with this if your images are smaller than 128 by 128 pixels in size. Now, one of the keys for making glDrawPixels go fast is you want to disable any complex rasterization state.
And the reason for that is that glDrawPixels goes fastest when we're not going through the 3D pipeline. As much as possible, we're allowing the graphics processor to use its 2D pipeline, so we can just get a straight blit into the frame buffer. So we're not doing blending, we're not doing dithering, no stenciling, alpha testing, nothing that the 3D pipe needs to do, so we can stay on the 2D pipeline. Disabling complex state will get you the best performance.
And again, this demo is available on the website today, so people can look at the example I'm going to be running here. Okay, so a simple little code snippet: disable complex state, and then you issue glDrawPixels. Again, same as with textures, you want to use a pixel format that is supported by the graphics processor, so we don't have to do expensive conversions. Because if you pass in some type like a float, we're going to have to convert it to something the graphics processor can handle, and that may be slower than you would like to see.
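A hedged reconstruction of that snippet; x, y, width, height, and pixels are placeholders:

```c
/* Keep rasterization simple so the blit can stay on the 2D engine. */
glDisable(GL_BLEND);
glDisable(GL_DITHER);
glDisable(GL_ALPHA_TEST);
glDisable(GL_STENCIL_TEST);
glDisable(GL_DEPTH_TEST);
glDisable(GL_FOG);

glRasterPos2i(x, y); /* destination in the window */
glDrawPixels(width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
```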
Okay, so back to demo two, please. So this is a demo showing draw pixels. Again, I put the fast button on there, something I do highly recommend. So if I zoom this down to something very small -- and you almost can't see it, and I apologize -- it's two pixels by two pixels.
So you can see I'm getting a million draw pixel commands per second. Now, you can see I'm only getting 15 megabytes a second of bandwidth of actual pixel copying performance. I'm getting a million draw commands. Now, as I move this up in size,
[Transcript missing]
And this is kind of the boundary I was describing: small images are going to go really fast, while larger images are going to start hitting a memory bandwidth limit, and you might want to start considering some of the other techniques, like uploading textures, for doing this operation. Okay. Back to slides, please.
Okay, so let's talk about pixel copy operations real quickly. The key to pixel copy operations is to get VRAM to VRAM performance. You can get extremely high bandwidth if you don't have to come across the bus. So anytime you're storing data that you want to have temporarily stashed off and you want to be able to restore back, glCopyPixels is a great way to do that.
Now, where you want to store the data, there's a couple options. One is to use an auxiliary buffer. Apple has extensions where you can have auxiliary buffers with depth and stencil associated with them, such that you can copy a depth buffer, a stencil buffer, or a color buffer off into a temporary location and then use glCopyPixels to copy it back to restore your data.
So if you wanted to refresh any one of those types of buffers, an auxiliary buffer will work well for you.
Just like with glDrawPixels, with glCopyPixels you'll want to have the state in a very simple form, because you want to use the 2D pipeline. When you're copying from one video memory location to another, you want to have the 2D engine do that operation if possible.
And to do that, you want to minimize your state, have your state at very simple settings. It turns out it's basically the same thing as glDrawPixels; it has the same basic restrictions. So you'll want to disable as much of the state as you can to try to get that VRAM to VRAM 2D blit.
Okay, so looking at the little piece of sample code for that: you just disable your state, you set up your read buffer and draw buffer, and then you copy pixels. So here you can see that I'm going to copy data from the auxiliary buffer, maybe where I stored the data temporarily, back into the back buffer, restoring, say, a depth buffer, or in this case a color buffer, and getting a very fast restore of that image.
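A hedged reconstruction of that restore path, assuming the image was previously stashed in GL_AUX0:

```c
/* Simple state so the copy can be a 2D VRAM-to-VRAM blit. */
glDisable(GL_BLEND);
glDisable(GL_DITHER);
glDisable(GL_DEPTH_TEST);

glReadBuffer(GL_AUX0);  /* source: the auxiliary buffer holding the save */
glDrawBuffer(GL_BACK);  /* destination: the back buffer */
glRasterPos2i(0, 0);
glCopyPixels(0, 0, width, height, GL_COLOR);
```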
Okay. Threads. Let's talk about threads a little bit. First off, let's talk about what I'm going to cover. Rules for threading is what I'm going to go over. And then I'm going to talk about divisions of work: what kind of strategies you can use for dividing up your OpenGL processing onto multiple threads.
What are some of the effective techniques? Sharing data between contexts: how do you effectively share data? You can set up multiple contexts to have them share some common data set. Synchronizing your threads: we'll go over a little bit about the proper mechanisms for synchronizing multiple threads.
[Transcript missing]
Let's talk about division of work a little bit. So one possibility is moving OpenGL onto a separate thread. You can have your application on one thread and OpenGL on a separate thread, a very obvious way of doing it. That's not always the optimal way. Another thing you can think about is splitting OpenGL vertex and texture processing.
That's very useful for when you want to have video data or you're generating some pixel data coming from a disk or coming from some source that you want to load into OpenGL, and then you want a second thread to be drawing it. So you can have OpenGL have multiple threads, one for loading and one for drawing.
So, what gets shared between contexts? So, a lot of times, people don't clearly understand when you have multiple contexts set up to share each other's object state. The things that get shared are display lists, textures, vertex and fragment programs, and vertex array objects. That data gets shared when you share two contexts. So, that data set will become common between multiple contexts, if you set them up properly, and we will manage the mutex locking of accessing that data.
Okay, and like I said, you can share an OpenGL surface. So you can also set it up such that a multiple context can talk to one VRAM buffer. So let's look at some diagrams of how that looks. So here in the red circles, I've got threads. And on the left, I've got the application doing some CPU processing. It passes that data off to the thread two. Thread two then takes that data and uses it to draw some OpenGL. Very simple. Simply using one OpenGL context. OpenGL is on its own thread.
Here's an example of splitting OpenGL across multiple threads. Now, what I've got here is I've got two threads, one OpenGL context per thread. I've got them set up such that they're sharing OpenGL state, and I've got it such that they are talking to the same video memory surface.
So they share state, they share the VRAM buffer, and we manage the object's shared state. And what this shows is just using texture data on one thread and vertex data on another. Obviously, those are arbitrary. You can obviously mix those up, have any kind of inputs from either talking to the shared object state machine.
A slightly different variation on that: if people want to use pbuffers, you can have one thread talking to a pbuffer, link that pbuffer into the shared state as a texture, and then have thread one reference the pbuffer for drawing, drawing some scene that uses the pbuffer as a generated texture.
So, looking at a little bit of setup code here. This is using Cocoa. So this is how to set up a shared context using Cocoa. You'll see that I create a context and init it with a pixel format, and I'm passing in a share context. So the third line down is the share context, and that's the way you can link two contexts together to have common shared object data structures. And that allows you to share textures, display lists, programs, and vertex object data.
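The slide uses Cocoa's NSOpenGLContext; as a plain-C alternative, the CGL layer exposes the same share-context link. A hedged sketch, assuming pixFmt and firstCtx already exist:

```c
#include <OpenGL/OpenGL.h>

CGLContextObj secondCtx = NULL;
/* Passing firstCtx as the share context links the two contexts' object
 * state: textures, display lists, programs, and buffer objects. */
CGLError err = CGLCreateContext(pixFmt, firstCtx, &secondCtx);
if (err != kCGLNoError) {
    /* handle the error */
}
```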
Okay, so synchronization between threads. The way you want to do that is to use standard OS thread locking. Use NSThread and NSLock, for instance, as one example; obviously you can use any other type of OS-level facility for managing threads. And I guess the main point of this slide is that there's nothing in OpenGL to manage your thread synchronization. It's standard OS tools and facilities that do that.
Don't use the Apple fence extension for managing your threads. The Apple fence extension is for managing synchronization between the CPU and GPU, not between two CPU threads. So that's an important point to remember if you're going to start dabbling in multiple threads. And by the way, just as a point: if you mess up your threading and you have multiple threads talking into the same OpenGL context, you will cause all kinds of bad things.
And bad things can go as far as hanging your system. You'll introduce a bad command into the graphics processor. The graphics processor may hang and your screen will wedge and the CPU will block up against that and everybody will come to a halt. Okay, so let's switch to demo machine two, please.
So in the beginning of this talk, I talked about effectively using the CPU and the GPU. And one of the things I wanted to show is, first off, we wrote an AltiVec routine for this little sinusoidal wave simulator here. And you'll see it's going pretty fast. It's generating 18 million triangles a second.
So what we've got here on this chart is we've got the red time at the bottom is time spent in the system outside of just the application or OpenGL. The green is time spent calculating the wave, and blue is time spent in OpenGL. So I'm going to just multi-thread that.
I'm going to split it across both of the CPUs that I have on the system. So that bumped my performance up a little bit. I was at 19. Now I'm up to 21-something. And you can see that I've improved performance a little bit. Now the thing that is surprising, this actually has a high-end graphics card in it.
And now what I'm going to do is move the wave calculation into a vertex program on the GPU. So before I do that, you'll see that the CPUs are very busy. There's a lot of time going into calculating this wave with the CPU. And now I'm going to move the wave calculation off onto the GPU. And the thing to watch is the performance: 21 million triangles a second, 338 frames a second.
And look at that. It dropped down to 15 million. And I know there's people out here from the hardware vendors, and they're saying, no, that can't be possible. Our hardware always can outrun the CPU, but it's not true. The CPU is really good at some things. And you can actually write really efficient code sometimes that'll outrun the graphics processors.
And you'll see that, by the way, notice that the CPU is now barely doing anything, but my performance went down. So that kind of is what I'm trying to point out here, is that if your goal is maximum performance, sometimes you want the CPU to be doing work.
But if your goal is to have the CPU do nothing, for sure offload all the processing onto the GPU, and the CPU can be free to do something else. But that won't guarantee maximum performance. Maximum performance will be found by experimenting what the optimal combination is. Okay, back to slides, please.
Okay, let's wrap up. Okay, so after this session, there's a couple more OpenGL sessions that are really good and I recommend people go to. There's the optimization live session, which is going to have a live session talking about our tools, using our tools, the OpenGL profiler, OpenGL driver monitor, live on stage showing people how to use it. Really good session. I find that I'm always using the OpenGL profiler for analyzing applications, figuring out where the bottlenecks are, what I need to be optimizing. It's a similar tool for OpenGL as Shark is for the CPU.
And then on Friday, we've got the introduction to the OpenGL Shader Language. For those that don't know what the OpenGL Shader Language is, it's a good introduction to what that language looks like, some of the capabilities it has, and highly recommended for people that are interested in programming the graphics processor.
And who to contact? Contact myself or Travis Brown. If people want to talk to me, they can come up afterwards and I can give them a business card, so you don't need to write that down too quickly. So for more information, you can go to the Apple website. So it's developer.apple.com/opengl. That's a good resource for OpenGL information from Apple.
Or you can go to the OpenGL.org website, www.opengl.org. That's OpenGL's official website. It contains specifications and pointers to a variety of resources that people will find useful, and reference libraries. So we do have some references out there you might want to take note of, including a couple of pieces of documentation that are out on the system.