Digital Media • 53:19
Delve into OpenGL performance tuning at this in-depth session on optimizing OpenGL code for Mac OS X. This session provides information on the fast paths through Apple's OpenGL implementation and key optimization and profiling techniques. Specific API performance insights are presented.
Speaker: John Stauffer
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.
Good morning, everyone. To present session 408, OpenGL Advanced Optimization, I would like to introduce OpenGL manager John Stauffer. Hi. So today we're going to talk about advanced optimizations in OpenGL. So hopefully we'll learn a few things about how OpenGL works and things you can do to try to tune your applications.
So what we'll learn is the key components that you need to look at when you're trying to tune your application for higher frame rates. The thing that I always like to start talking about is the application component. And the reason for that is that about 75% of the time is spent in the application in a common OpenGL app. And so therefore, since 75% of the time is spent there, that's where you have the potential for getting the most benefit.
So, if you don't tune your application, obviously you're not going to get a lot out of OpenGL because you'll spend too much time in the application. So, we'll spend some time just talking about techniques for tuning your application to drive OpenGL better and some hints and tips on how to do that.
The second thing is setup. So, how to properly set up OpenGL, how to get some machine information, how to properly configure and scale your application such that it will run well on the machine that you're using. The third thing is state management. So, a large percentage of the time that is actually spent in OpenGL is spent in state management. So, state management actually is more important than a lot of people think. If you do a lot of thrashing of state in OpenGL, you can actually decrease your performance quite a bit.
Texture management. Texture management is important to keep your application correctly scaled for the hardware so that you're not paging a lot, you're not spending a lot of time running out of video memory and paging on and off. Vertex operations. So, vertex operations are important obviously to be able to get a lot of data to the card, have an optimal format for sending the data, keeping the data flow moving quickly to the card.
Per-fragment operations. So per-fragment operations are the operations that the card itself is going to do. So it's not CPU related, but it's what the graphics card is going to have to do to generate your final image. And there's some tips there to offload some of the work the graphics card is going to need to do.
Extensions. So a lot of times there's extensions that you can utilize that are either directly geared towards optimizing your application or will help you get the animation effect you're looking for with a simpler path. So you won't have to do, you can simplify your CPU work by utilizing an extension. Multi-CPU or multi-thread utilization. Obviously if there's a machine that has two CPUs, it's an ideal situation to spawn another thread and maybe move your graphics off to that other CPU. And lastly what we'll talk about briefly is where to look for more information.
So, starting off here, just to get the image of what OpenGL looks like and how data goes to OpenGL. It's important to think of OpenGL as a data stream. So OpenGL fundamentally is a data stream going to the card, and how the data is organized in that stream is very important.
Because it will give you hesitations if you have too many operations of one type, or if you're flushing and breaking that stream and causing discontinuities. So the fundamental type of data that goes to OpenGL is vertices, which is your 3D data, and state. So you can fundamentally-- this is a simplistic view-- but you can fundamentally break it down into those two types of data sets that go to the card. And how that data, again, gets organized and sent to the card can make a big difference.
So, application. So, the thing to remember when you're looking at writing an OpenGL application is first, you have to decide obviously what type of performance you're looking for. And to do that, you need to obviously decide what type of user interaction there's going to be. You know, whether you need high frame rates because the user needs a fast response time on the graphical feedback, which may mean you need 30, 50, 60 frames per second. You need to decide what type of quality display in your application you're going to need. And obviously, those two things can be related.
So, adjusting the right quality with the frame rate is going to give your user the best experience. So, it's important to keep those in mind. Your target platform. So, deciding what your ideal platform is going to be and what you're going to run best on is going to be important, so that you can potentially scale your application to run well on those target platforms. And the things to remember about the target platform are video memory size, how much system memory you're going to be needing for the application, and potentially what graphics cards are in the system so that you can have the animation effects that you're looking for.
So the thing that a lot of applications provide obviously is a mechanism for users to adjust the quality settings within the application. And this is usually important such that a user can himself or herself select the tradeoff they want between performance and quality such that they can have some influence on their preferences as to how fast the application will run or what the quality will look like.
So the first thing we do, and we do this a lot at Apple, is we'll take an application to try to analyze where the time is being spent. We'll take the application, and we will run it with a null layer of OpenGL. We'll try to figure out, if OpenGL was infinitely fast, how fast would that application run? And this gives us an upper bound, and this helps us understand what the application itself is doing and what further profiling may be needed in the application to tune it.
So two ways to, depending on your programming environment, and just a reminder, actually, all the code that I'm going to be showing today is Mac OS X Cocoa-based. I'm going to, since we have limited slide space, I'm going to stick to those function calls. So to no-op out the OpenGL layer, there's a couple ways you can do it in your application very easily.
For the CGL layer, if you're programming straight to the Core OpenGL layer, you can simply set your OpenGL context to null, and what that does is, internal to OpenGL, that will have OpenGL set all the entry points to a no-op, so they will do nothing. And if you're at the AppKit layer, then you can use an AppKit call just to clear the current context, and it's equivalent to setting it to null, and again, that will just set all the entry points to no-ops.
And so what you want to do, once you've done that, is you want to measure the time that's spent in your application to get a feel for what level of performance your application's at. And here we see a little code snippet using gettimeofday to just quickly calculate time spent in the application. So once we've done that, we can calculate an open-loop, OpenGL no-op frames per second that your application's capable of.
So obviously, once you've gotten to this point, you've no-opped OpenGL out, so it's infinitely fast, and if you're still not achieving the frame rates that you would like to be at, you can immediately start thinking about going into your application and tuning your application.
What you can do is a quick calculation. Assuming an average application spends about 25% of the time in OpenGL, you can take that open-loop frames per second and just multiply it by 0.75. So lower that frame rate down and get an estimate of what your performance is going to be once you enable OpenGL.
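The estimate described here is simple arithmetic, and it can be sketched in a few lines. This is a minimal illustration in Python rather than the gettimeofday snippet from the slides; the function names and the sample numbers are hypothetical, and the 0.75 factor is just the 75%/25% split quoted in the talk.

```python
import time

APP_SHARE = 0.75  # assumed share of frame time spent in the application (figure from the talk)

def open_loop_fps(frames, elapsed_seconds):
    """Frames per second measured with OpenGL no-opped out."""
    return frames / elapsed_seconds

def estimated_fps(open_loop, app_share=APP_SHARE):
    """Scale the open-loop rate down to allow for the time OpenGL will add."""
    return open_loop * app_share

def time_frames(render_frame, frames):
    """Time `frames` calls of a caller-supplied frame function, in seconds."""
    start = time.perf_counter()
    for _ in range(frames):
        render_frame()
    return time.perf_counter() - start

rate = open_loop_fps(frames=200, elapsed_seconds=2.5)
print(rate, estimated_fps(rate))  # 80.0 60.0
```

If the estimated number is already below the target frame rate, no amount of OpenGL tuning will recover it, which is why the application gets profiled first.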
And if this estimated frames per second isn't where you want it to be, again, you're going to have to start looking at either OpenGL or your application. So to start tuning your application, on OS X there's a variety of tools to do this. One tool that's very useful is called Sampler. For anybody that hasn't used Sampler, it's a threaded tool that will go out and look where your call stack is at any given time. And it will generate a sample, a heuristic of where the time is being spent in the application. So, this tool actually is very useful. It works for CFM apps and Mach-O apps. And it's part of the developer install. So, it's on your disk at /Developer/Applications/Sampler.
And it's a very useful tool. We suggest everybody become familiar with how to use it. And it will show you where all the hot spots are in your application code. It will even show you where the hot spots are in the operating system itself. But you may want to run this without OpenGL and just run your application open loop and just stress your application and find out where the hotspots are.
Okay, so that was enough talking about the application. So setting up OpenGL, the first thing you need to do obviously is to go out and query for devices and find what devices you have, how many devices and such. So I've got a couple code snippets up here that will show you in Core Graphics how to get your main device and how to, from the main device, generate an OpenGL display mask. So the first code snippet here is just the main device.
If you wanted to go through all the devices, you could get all the active display devices from Core Graphics, loop through them, generate a display mask that represents all of the devices on the system. And really, what the right thing for you to do is going to depend on whether you're a full screen or a windowed application.
So what we can do with this information is we can find out how much video memory is in the system on each graphics card. So here we've got a code snippet that will query the renderer for video memory. And it goes through the loop and it will look at each device, querying it for video memory size.
And this is going to be important because as we start to try to adjust or tune our application, we're going to want to make sure that the amount of textures we have, the resources we're going to be consuming on the card are going to fit in video memory. So we're going to want to know this usually up front if we have a texture intensive application.
OK, so when we look at the video memory size, there are several things that we may want to adjust. Again, we may want to adjust texture sizes, but we may also want to adjust the screen resolution. If we're going to be switching into a full screen mode, let's say, we're going to have the opportunity for picking a screen depth and a screen resolution.
If you have determined that your application needs more video memory than is potentially available in the current display mode, then you'll want to switch down to 16-bit color potentially, or you'll want to switch down the resolution, give the application more breathing room on the graphics card, and that will help with keeping your application out of a texture paging mode and give you a higher frame rate during the running of the application.
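To make the trade-off concrete, here is a rough budget calculation. It is only a sketch under stated assumptions: a double-buffered display, a depth buffer with the same bit depth as the color buffer, and a hypothetical 16 MB card. Real drivers may pad or allocate buffers differently, so treat the numbers as estimates.

```python
MB = 1024 * 1024

def framebuffer_bytes(width, height, color_bits, depth_bits=0, color_buffers=2):
    """Approximate bytes used by the color buffers plus an optional depth buffer."""
    color = width * height * (color_bits // 8) * color_buffers
    depth = width * height * (depth_bits // 8)
    return color + depth

def texture_budget(vram_bytes, width, height, color_bits, depth_bits=0):
    """Video memory left over for textures once the framebuffer is allocated."""
    return vram_bytes - framebuffer_bytes(width, height, color_bits, depth_bits)

vram = 16 * MB  # hypothetical card, not a figure from the talk
for width, height, bits in [(1024, 768, 32), (1024, 768, 16), (800, 600, 16)]:
    left = texture_budget(vram, width, height, bits, depth_bits=bits)
    print(width, height, bits, round(left / MB, 2), "MB left for textures")
```

Under these assumptions, dropping 1024 by 768 from 32-bit to 16-bit color frees about 4.5 MB for textures on the same card.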
So the other thing you'll want to do is to find out what CPU you're on. One thing that we find very useful, internally to OpenGL obviously, is using AltiVec. And AltiVec can give you substantial performance boosts if you utilize it. So finding out if you're on a G3 or a G4 is very useful.
And tuning to that condition can be very beneficial. The other thing to remember about that is that typically the difference between a G3 and G4 is that G4 systems are going to be faster and you may want to think about adjusting your data set size to accommodate the faster systems.
So quickly talking about state management. So state management, again, is the process of switching what mode OpenGL is running in to get your proper configuration for drawing your graphics. The thing to remember with state changes is you want to minimize those. What we have found is that in a lot of applications, the amount of time that's actually spent in OpenGL, a considerable portion of that is actually doing state management.
And if you unnecessarily change state, you can cause a lot of thrashing down on the card because OpenGL has to go through a lot of setup to properly configure the graphics card for each state change. Some state changes are obviously more expensive than others and we'll go through a few of those, which ones to avoid. But in general, you want to group your data to minimize state changes and that will have a significant impact on what performance you can ultimately achieve. So, some general calls you want to avoid. glFlush. So you want to avoid glFlush.
Because what it actually does is, if you, again, think about OpenGL as a command stream going to the graphics card, glFlush says, terminate the current command stream, send it to the graphics card and start me a new one. So you've just chopped that command stream and sent it on its way.
And the reason that you don't want to do this necessarily is because there's only so many command buffers that you can have allocated to your application at any given time. So if you sit there and call glFlush a lot, you will use up the buffers that you have available to your application and your application may be starved for available space to put data on the stream.
So unless you have to, don't call glFlush. And there's actually very few reasons you would call it. Usually you can find some other way to do what you're looking to do. If you want the user to see something immediately, usually you just call swap buffers to get the data swapped to the screen. And swap buffers implicitly calls a flush. So when you call swap, it terminates the stream, sends it to the card and so you don't necessarily have to call glFlush yourself. Another call that's even more expensive is glFinish.
So glFinish is like a glFlush, except that it sends the data to the card and it actually will block there waiting for the graphics card to finish its drawing. So once all the commands have gone to the card and it has finished, then glFinish will actually return to your application. So an important performance thing to keep in mind is that glFinish is very expensive. It can be a blocking call that can take quite a while to return.
So you want to avoid reading data back from OpenGL. And when it comes to state management, typically what you want to do is you want to keep the data in the application that you will need later and not ask OpenGL for it back. Depending on the driver and what you're reading back, it can get very expensive.
Reading back data can actually be the same cost as calling glFinish because if you're reading pixels back, for instance, the pixels actually have to represent the current state that you are expecting. And that is you've drawn all these commands, you've issued all these drawing commands, you're expecting the pixels to be in the buffer. Well, so OpenGL realizes this, and when you try to read some pixels back, it's going to have to call finish, wait till all the commands are finished, wait till it's drawn everything, before it can give you the valid pixels back.
So you don't want to read the frame buffer unless you have to, you don't want to be reading state. Some state can be expensive to read, and you don't want to read textures unless you have to. They all can have varying penalties depending on what mode you're running in.
So what you also want to think about when you're writing OpenGL is avoid complex state settings. If you don't know what a state setting does, usually it's a bad idea to just arbitrarily throw state changes in there. What you'll want to do is keep the state as simple as possible, because this will help the graphics card run in its most optimal mode.
This will also usually lead to less state thrashing when you are trying to transition from one drawing routine to another, you won't have to do as much state setup and teardown, so it'll lead to less state transition. So keeping it simple is obviously a simplistic concept, but it's something to keep in mind.
So some basic complex states that you want to avoid are lighting, user clipping planes, and anti-aliasing, like anti-aliased lines, anti-aliased points, polygons. And the reason you want to avoid those is because they can be very expensive to do. With modern hardware, lighting and user clipping planes and even anti-aliased lines are pretty fast.
So again, it may depend on the particular graphics card you're running on. But in general, lighting is very computationally expensive and unless you have a real need for it, you'll want to keep that disabled. Even on the high power graphics cards today, if you start enabling lighting, you will cause a graphics card to do more processing and you will ultimately lower the performance. Now whether you actually see that will depend on what kind of demands your application is putting on the graphics card. But those are very complex operations for the graphics card to perform.
So texture management. So this is a very important topic because a lot of games nowadays or applications in general are using a lot of textures and how to properly manage those can make a big difference in the application's performance. So several things to remember. Avoid uploading the texture more than once.
Ideally what you want to do is you want to give OpenGL the texture and not keep handing it to OpenGL. Don't delete it and then give it back to it later if at all possible. And instead let OpenGL do the management, the bookkeeping of whether the texture should be in video memory or not. So again, avoid keeping a copy.
Avoiding keeping a copy in the application will save system memory. The thing to remember here is that OpenGL will keep a copy. So if you keep one in the application and one is going to be kept in OpenGL, you're going to have two times that texture size. So it's best if you delete yours if possible. So ways to get textures, data, into the graphics card or into the driver fast. There is an extension called Apple packed pixels.
So this is the fastest way to get pixel data into OpenGL. And it's a very flexible format. It will support all the standard OpenGL pixel types, Apple pixel types. It will also support a number of rather complex formats. So you can see here, I'm going to show you a few examples.
There are a few odd types that may be useful for you. You know, like 565 or 332, depending on what your quality requirement is or whether you need a deep bit depth per component. You can get away with some of the formats with fewer bits per pixel.
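As an illustration of what the 565 and 332 layouts mean, the following sketch packs 8-bit-per-channel RGB into those formats and compares texture sizes. It assumes red sits in the most significant bits, which is my understanding of the common packed 5-6-5 and 3-3-2 types; the helper names are hypothetical and this is not the extension's own code.

```python
def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B into one 16-bit value: 5 bits red, 6 green, 5 blue."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def pack_rgb332(r, g, b):
    """Pack 8-bit R, G, B into one 8-bit value: 3 bits red, 3 green, 2 blue."""
    return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)

def texture_bytes(width, height, bits_per_pixel):
    """Size of an uncompressed texture at the given pixel depth."""
    return width * height * bits_per_pixel // 8

print(hex(pack_rgb565(255, 0, 0)))      # 0xf800
print(hex(pack_rgb332(255, 255, 255)))  # 0xff
for bits in (32, 16, 8):
    print(bits, texture_bytes(1024, 1024, bits))
```

A 1024 by 1024 texture drops from 4 MB at 32 bits per pixel to 2 MB at 16 and 1 MB at 8, which is the saving being traded against color precision.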
Minimize how often you change your current texture. So changing your current texture is actually one of the most expensive operations you can do. And what that means is that changing your current texture is a glBindTexture call. And when you bind from one texture to another, you're basically just causing OpenGL to potentially reconfigure all of its texture combiners in the hardware for the new texture.
Because the new texture is going to require different blending modes. And it can be fairly expensive to do that setup. So typically what you want to do in the application if you have a lot of data is you want to group your data in groups with common texture types. So that's the best way to group the data, such that you minimize your texture changes.
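Grouping by texture can be sketched as a sort over the draw list. This is a toy model with made-up texture and mesh names; it only counts how many binds would be issued, and it assumes draw order does not otherwise matter (for example, opaque geometry with depth testing), which is an assumption of this sketch rather than something stated in the talk.

```python
def count_binds(draw_list):
    """Count texture binds if we bind only when the texture actually changes."""
    binds = 0
    current = None
    for texture, _mesh in draw_list:
        if texture != current:
            binds += 1
            current = texture
    return binds

def group_by_texture(draw_list):
    """Reorder draws so that meshes sharing a texture are adjacent."""
    return sorted(draw_list, key=lambda item: item[0])

draws = [("brick", "wall1"), ("grass", "lawn"), ("brick", "wall2"),
         ("metal", "door"), ("grass", "hill"), ("brick", "wall3")]
print(count_binds(draws))                    # 6
print(count_binds(group_by_texture(draws)))  # 3
```

Python's sort is stable, so meshes that share a texture keep their original relative order after grouping.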
So scale textures to your hardware size. So again, earlier we looked at finding the VRAM size. So what you'll want to do is you can do some basic rudimentary math in your application, and just fundamentally try to scale your application to fit on the graphics card. So if you have a lot of textures, you'll need to calculate how many you're going to need necessarily on the graphics card at any time. It's not terribly important to get it exact, but you would like to keep it within a reasonable bounds. OpenGL is very efficient at paging.
So what you'll not want to do is try to keep OpenGL always out of a paging mode. You don't want to try to second guess the exact size of the video memory available, and when exactly OpenGL is going to go into paging mode. Because if you do that, you're not going to let OpenGL grow and utilize some of the mechanisms internally to the driver that will try to optimally page textures on and off.
So what OpenGL uses internally for paging textures on and off is called the LRU/MRU algorithm. That stands for least recently used, most recently used. So depending on how many textures are committed per scene, whether you've overcommitted that scene, it will actually switch to different mechanisms for paging textures on and off, trying to optimally keep the right set on the graphics card, and not unduly page them. So that algorithm actually works pretty well. Also, particular to OS X, we built a mechanism that causes almost no CPU work to page a texture.
So once a texture's in OpenGL, and it had to get paged off, back into system memory, let's say, it costs very little CPU work to get it back into the stream and uploaded again. So while it will cost a little bit of memory bandwidth while it's getting read, and it's going to cost some AGP traffic, the CPU cycles spent are going to be pretty minimal.
So we find that letting OpenGL do the paging isn't expensive for the CPU. The CPU can keep on going. And as long as you're not causing too much bandwidth across the AGP, you can get away with a fair amount of paging.
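The talk does not spell out how the driver switches between LRU and MRU, so the following is only a toy model of why such a switch can help. It simulates a scene that touches the same textures in the same order every frame and counts uploads under each eviction policy; the capacity, the access pattern, and the function name are all illustrative assumptions.

```python
def simulate(accesses, capacity, policy):
    """Return how many uploads (misses) an eviction policy causes.

    policy is "lru" (evict least recently used) or "mru" (evict most recently used).
    """
    resident = []  # ordered from least to most recently used
    uploads = 0
    for tex in accesses:
        if tex in resident:
            resident.remove(tex)
        else:
            uploads += 1
            if len(resident) >= capacity:
                resident.pop(0 if policy == "lru" else -1)
        resident.append(tex)
    return uploads

frame = [0, 1, 2, 3, 4]   # five textures used in the same order every frame
accesses = frame * 10     # ten frames
print(simulate(accesses, 4, "lru"))  # 50: overcommitted by one texture, every access uploads
print(simulate(accesses, 4, "mru"))  # far fewer uploads for the same scene
print(simulate(accesses, 5, "lru"))  # 5: everything fits, only the first frame uploads
```

When the scene fits, LRU is fine. When it is overcommitted and accessed cyclically, pure LRU always evicts the texture that is needed next, which is the situation where evicting the most recently used texture does better.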
So depending on what you're doing, you're going to also want to split your textures into tiles. And I've got a demo of this in a bit. That's for when you're wanting to do smooth animations of some sort, trying to amortize the data stream as it goes to the card and trying to keep the drawing moving while large images are moving up the stream.
So again if you look at the whole process of OpenGL as a big data stream, if you have a four megabyte texture, that's a big block of data in the middle of your stream. So you can envision that under some circumstances it would be good to interlace that upload with some polygon drawing, maybe a frame here and there so that you can amortize the texture upload time going across the bus and keep animations flowing.
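The tiling idea can be sketched as splitting the image into rectangles and scheduling a few of them per frame. This is a minimal illustration with hypothetical helper names; the talk does not say which call the demo uses for each piece, though a sub-image upload such as glTexSubImage2D would be one way to send a rectangle.

```python
def tile_rects(width, height, tile):
    """Split a width x height image into (x, y, w, h) tiles, row by row."""
    rects = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            rects.append((x, y, min(tile, width - x), min(tile, height - y)))
    return rects

def upload_schedule(rects, tiles_per_frame):
    """Group tiles into per-frame batches so each frame uploads only a small piece."""
    return [rects[i:i + tiles_per_frame] for i in range(0, len(rects), tiles_per_frame)]

rects = tile_rects(1024, 1024, 256)
schedule = upload_schedule(rects, 1)
print(len(rects), len(schedule))       # 16 16
print(1024 * 1024 * 4 // len(rects))   # 262144 bytes per tile at 32 bits per pixel
```

With one tile per frame, the four megabyte texture becomes sixteen uploads of 256 KB each, interleaved with drawing, instead of one large block in the stream.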
So here's a little diagram for texture management. One thing that we recommend on OS X is to split your texture loading off to a separate thread. If you're going to be spooling through a lot of textures, it's a good idea to maybe spawn a thread that will do that work for you. And the reason for that is that--there's a couple reasons.
One is you can utilize a second CPU, and two, you can utilize preemptive multitasking to balance out the loading, the act of maybe reading a texture from disk, the cost of loading it in OpenGL. You can use the preemptive capabilities of OS X to spread that cost out so you don't end up with a single point in your rendering stream or in your CPU cycles that are blocked trying to get this texture uploaded and processed.
So it's a good idea. So if we look at this, this is a basic diagram of what happens when you set up a two-threaded application, one thread loading the OpenGL textures and one doing the drawing. So what happens is the first thread is loading the textures, and those textures will get processed and put into the driver, into the kernel driver. So the kernel will have them at this point, and they will be sitting in the kernel waiting to be uploaded to the card. So you'll have done most of the work of CPU cycles on the primary thread of getting the data into the kernel.
So you'll have the first thread load the textures, and then you can have your second thread come along and issue the drawing commands. And as long as you have your thread synchronization correctly organized, then your data will be there by the time you need it, and everything will just flow much smoother. I've got a demo of this. And this demo shows this basic concept that the diagram had there.
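The two-thread arrangement is a producer and consumer pattern. The sketch below shows only the synchronization, using a bounded queue as one possible mechanism; the file names are made up, and the image decoding and OpenGL calls are replaced with stand-ins, so none of this is the demo's actual code.

```python
import queue
import threading

def loader(names, ready):
    """Loader thread: 'decode' each texture and hand it to the drawing side."""
    for name in names:
        pixels = bytes(16)         # stand-in for reading and decompressing an image
        ready.put((name, pixels))  # blocks if the queue is full
    ready.put(None)                # sentinel: nothing more to load

def draw_loop(ready):
    """Drawing side: consume textures in the order the loader finished them."""
    drawn = []
    while True:
        item = ready.get()         # blocks until the loader has produced something
        if item is None:
            break
        name, _pixels = item
        drawn.append(name)         # stand-in for binding the texture and drawing
    return drawn

ready = queue.Queue(maxsize=2)     # bounded, so the loader cannot run far ahead
names = ["a.jpg", "b.jpg", "c.jpg"]
worker = threading.Thread(target=loader, args=(names, ready))
worker.start()
result = draw_loop(ready)
worker.join()
print(result)                      # ['a.jpg', 'b.jpg', 'c.jpg']
```

A real drawing loop would not block on the queue. It would poll with `get_nowait()` each frame and keep animating with the textures it already has, which is what keeps the animation smooth while loading continues.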
So what this demo tries to show is a couple concepts. One is how to balance the-- requirements of your application with quality and smoothness of frame rate. So what we have here on the left is we have a slider that will adjust the quality of these images. So for instance, down here at the bottom I can get 64 by 64 textures and up on the top I get 1024 by 1024 and everything in between. So what's interesting to look at here is if you're trying to say write a screen saver, for instance, and you're trying to get these images up to the graphics card while maintaining smooth animation, you'll see that we get a hesitation.
And that hesitation is because one, we only have one thread doing the loading and the animation. So we get a large hesitation while we spool the texture off disk. We decompress that JPEG and we load it into OpenGL and give it to the driver. So we can see that this isn't going to lead to a very nice screen saver. So we start looking for techniques to smooth that out.
And one thing we can do is we can spawn a thread and we can give that thread the job of spooling the texture off of disk and loading it into OpenGL. So what we see now is we see that it's a lot smoother but it's not perfect.
So here's where you can start deciding whether frame rate or quality is important. One thing you can do is obviously, if you're not needing to achieve those kinds of rates of uploading and animating, you can slow it down, and the hiccups are almost gone. The other thing you can do, obviously, if you want to stay with relatively fast animations, is you can lower your image quality. So, we're still going a little bit too fast to get absolutely smooth animation.
But so you can see with this technique, we've basically eliminated the pauses in the animation stream and we're able to get smooth animations while we're spooling through a large quantity of textures. This demo actually will spool through 200 megabytes of textures, simulating a fairly large scenario. And then the third thing we can do, after we decide a frame rate quality, we can also go to a tiled mode. So a tiled mode is an attempt to split the texture into many pieces and to amortize the cost of uploading that across the bus. Now I've had a little bit of a problem with the tiling mode, so we're going to give it a shot though.
So the tiling mode theoretically now is using the primary thread to load the images. And then the drawing thread is-- well, there we go. So I've got some thread synchronization issues. It's an attempt to try to amortize the cost of moving the data across to the card. So with the MP case, as soon as we went to the multi-threaded case, we offloaded the main thread, its job of loading all of the data from disk and then giving it to OpenGL. But what we were not able to do in the multi-threaded case is we're not able to amortize the cost of moving that image across the bus, across the AGP up to the video memory.
So we still see a small hiccup in the MP case. So as soon as we go to tile mode, what I've done here is I've taken a small piece of the texture, and I've uploaded one small piece at a time. So I'm able to upload one small piece per frame and that way not see a big four megabyte chunk of data in the data stream as it goes to the graphics card. And done correctly, you can get a lot of data up to the system with very smooth animations. So again, if you look at the different scenarios, looking at the stream case, so there it is. Multi-threaded, it's a lot smoother.
[Transcript missing]
The deficit is that if you are passing small amounts of data, there's a little bit of overhead of logic to get into the routines. So you're not going to want to call a compiled vertex array with three vertices because you're better off going to glBegin/glEnd because that's lower overhead for a small amount of data. So if you have large arrays of data, let's say greater than 16, 16 may be pushing the smaller end of it, but say greater than 16 vertices per array, try using compiled vertex array. It could probably get you some benefit.
So looking at a chart here that shows you primitives along the x-axis and number of triangles that you can render per second along the y, you can see that the type of primitive can make a large impact in the number of triangles that you can send through the system. So down at the very bottom is polygons. Polygons is the most rudimentary way to send data to OpenGL. And then near the upper end of the spectrum is triangle strips. So triangle strips is the best way to send data through the begin/end immediate mode path.
And then at the very far right is compiled vertex array. So you can see that compiled vertex array, if fed correctly, can give you substantial boost in performance. Now the green bar shows what you can do on a G3 and the blue and the orange bar shows you on a G4. There's not a huge difference, but it can make a big difference ultimately in your performance.
[Transcript missing]
Okay, so looking at how to potentially optimize OpenGL, I've got a number of slides here to just basically walk through the process that everyone should look at when they're trying to figure out how to simplify their code and how to make it more optimal. So we start off with a basic loop that is going through setting up a smooth shaded color mode, setting up a color, and then drawing a triangle.
So we're doing this every time through the loop. So we're drawing one triangle, we're doing a state change per triangle, and obviously we're not going to get a lot of data through this because it breaks every rule we have, and that is you're doing state changes and you're not passing a lot of data per begin end.
So the first thing we do is remove the state changes out of the loop and that will obviously give you the benefit that now we're passing a lot of data and we're not changing the state and we're not causing OpenGL to have to do a lot of state management below. But we still haven't done any optimizations with how we're passing vertex data.
So the next thing we do then is, well, actually we simplify the state. And we simplify it by just going to flat shading. So we notice that we're not passing a color per vertex, meaning these are flat shaded triangles. So we're going to change that to flat.
But then we pull the begin/end out of the loop and that's an attempt to maximize the amount of data per begin/end. And by doing this we can increase the performance by quite a bit. And in fact after this I have another demonstration to show you the effect of this. So you can see the effect of that. It can be pretty dramatic just doing that step alone.
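The effect of these first steps can be seen by counting calls. The sketch below uses a hypothetical stand-in object that only records call names; its method names loosely mirror glShadeModel, glColor, glBegin, glVertex and glEnd, but nothing here touches OpenGL, and the slide code itself is not reproduced.

```python
class CallCounter:
    """Stand-in for OpenGL that only counts calls by name."""
    def __init__(self):
        self.calls = {}

    def __getattr__(self, name):
        # Any unknown method name becomes a function that records one call.
        def record(*args):
            self.calls[name] = self.calls.get(name, 0) + 1
        return record

    def total(self):
        return sum(self.calls.values())

def naive(gl, triangles):
    for tri in triangles:
        gl.shade_model("smooth")  # state change for every triangle
        gl.color(1.0, 0.0, 0.0)
        gl.begin("triangles")
        for v in tri:
            gl.vertex(*v)
        gl.end()

def hoisted(gl, triangles):
    gl.shade_model("flat")        # state set once, outside the loop
    gl.color(1.0, 0.0, 0.0)
    gl.begin("triangles")         # one begin/end around all the data
    for tri in triangles:
        for v in tri:
            gl.vertex(*v)
    gl.end()

triangles = [((0, 0, 0), (1, 0, 0), (0, 1, 0))] * 100
a, b = CallCounter(), CallCounter()
naive(a, triangles)
hoisted(b, triangles)
print(a.total(), b.total())       # 700 304
```

The call count is only a proxy. The larger win described in the talk is that the driver no longer has to revalidate state between every triangle.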
Then what we do is we try to simplify the API that we're utilizing. Instead of passing all the data through registers, we pass a vertex--a pointer to the data, and it allows OpenGL to potentially optimize how it's copying the data. You're not doing a lot of register setup to get the data through.
Then we take the step of realizing that what we're actually passing is a triangle strip. So we change the primitive type to a triangle strip and we reorganize how we're passing it. And so now we've just reduced the amount of data going to OpenGL by a factor of three. Again, we get a big performance boost out of doing a step like that.
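The factor of three follows from simple vertex counting, sketched here in C (the function names are made up for illustration). Independent triangles send three vertices each, while a strip sends three for the first triangle and one for each triangle after that.

```c
#include <stddef.h>

/* Vertices sent when every triangle is submitted independently. */
size_t vertices_as_triangles(size_t tri_count)
{
    return tri_count * 3;
}

/* Vertices sent for the same triangles as one connected strip:
   the first triangle costs three vertices, each later one costs one. */
size_t vertices_as_strip(size_t tri_count)
{
    return tri_count == 0 ? 0 : tri_count + 2;
}
```

For 1,000 connected triangles that is 3,000 vertices against 1,002, so the reduction approaches a factor of three as the strip gets longer.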
And then what we do is we realize that we have all the data actually in an array, so we start using a vertex array and using draw elements to draw out of that array. So now we've eliminated the loop altogether, and we are simply making five function calls to handle all the drawing, whereas if we looked at the beginning of the slides, we were probably making hundreds or thousands. So we've eliminated all the function call overhead, and we have given OpenGL an opportunity to try to optimize internally for how it's going to want to get the data into the command stream. So now I have another demo showing some of that effect.
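As a rough illustration of drawing out of an array, the C sketch below builds the index list for one band of a grid mesh as a triangle strip. The grid layout (vertices stored row by row) and the function name are assumptions; the session does not show the demo's data layout. An index list like this is what could be passed to glDrawElements with GL_TRIANGLE_STRIP once the vertex array pointer has been set.

```c
#include <stddef.h>

/* Fills `out` with triangle-strip indices for the band between grid rows
   `row` and `row + 1` of a mesh stored row by row, `cols` vertices per row.
   Returns the number of indices written (2 * cols); `out` must hold that many.
   Indices are 16-bit, so the mesh may have at most 65,536 vertices. */
size_t build_row_strip(unsigned short cols, unsigned short row,
                       unsigned short *out)
{
    size_t n = 0;
    for (unsigned short c = 0; c < cols; c++) {
        out[n++] = (unsigned short)(row * cols + c);        /* vertex in row     */
        out[n++] = (unsigned short)((row + 1) * cols + c);  /* vertex in row + 1 */
    }
    return n;
}
```

For a grid three vertices wide, the first band comes out as 0, 3, 1, 4, 2, 5: six indices describing four triangles, with no per-vertex function calls from the application.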
[Transcript missing]
So what this demo is, is a spherical mapped mesh that is being animated with a wave motion. And where we start this application right now is in a mode where the application hasn't been tuned and the rendering hasn't been tuned. And the way we can tell that is that the red bar represents the time being spent in the application. The green bar simulates the time being spent calculating the wave motion. And the blue bar is the time being spent in OpenGL. So we can see we're spending quite a bit of time in all of these, but we're spending most of the time in the application.
So a little experiment that's interesting to run is if I take this application tuning slider and I bump it all the way up so that the application becomes tuned, we can see we get about-- we go from 20 frames per second almost to 40. So we almost double our frame rate by doing that.
So now if I move this slider over here, which simulates optimizing OpenGL through the basic steps I just went through, the first one is individual triangles. The second one here has now moved the begin/end outside the loop, and it's passing as much data as possible per begin/end. So we can see that we immediately get some performance out of that. We can see the blue bar is changing by about a factor of two.
But our overall performance hasn't changed a whole lot. It only went up about five frames per second. So by doing that step, we didn't get a whole lot. Now if we go to the top one, that's using vertex arrays. And again, it didn't change a whole lot. The interesting thing to learn about this is that if we take the slider and we move it up for the application, now we realize that we have gotten 100% improvement on just optimizing the application.
So optimizing one or the other only got us a marginal improvement to a 2x improvement, but if I optimize both, I go from 20 frames a second to 60, so I get a 3x improvement. So the combined effect is very important. So it's important to realize that where the time is being spent can be the application or OpenGL.
So the second thing we can do then is, like we've been showing here, is to spawn a thread. Now if we spawn a thread and we move the green bar onto the thread, we can see that now we are utilizing both CPUs in this machine. This machine is a dual 500. So now we are animating at 200 frames a second, and we started off at 20. So we got a 10x improvement out of this.
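The idea of moving the wave calculation onto a second thread can be sketched with POSIX threads in C. Everything here is an assumption made for illustration: the demo's real code is not shown, `compute_wave` is a stand-in for the wave motion, and `draw_mesh` is a stand-in for submitting the mesh to OpenGL (it only sums the data so the sketch runs without a GL context). Creating a thread per frame keeps the sketch short; a long-lived worker would avoid that cost.

```c
#include <pthread.h>
#include <stddef.h>

enum { MESH_POINTS = 1000 };

/* Hypothetical job: which frame to compute and where to store it. */
typedef struct {
    int    frame;
    float *heights;
} WaveJob;

/* Stand-in for the wave calculation (the demo's green bar). */
static void compute_wave(int frame, float *heights)
{
    for (size_t i = 0; i < MESH_POINTS; i++)
        heights[i] = (float)((i + (size_t)frame) % 16) / 16.0f;
}

static void *wave_thread(void *arg)
{
    WaveJob *job = arg;
    compute_wave(job->frame, job->heights);
    return NULL;
}

/* Stand-in for drawing the mesh with OpenGL: just sums the heights. */
static float draw_mesh(const float *heights)
{
    float sum = 0.0f;
    for (size_t i = 0; i < MESH_POINTS; i++)
        sum += heights[i];
    return sum;
}

/* Draws `frames` frames. While frame f is drawn from one buffer, a second
   thread computes frame f + 1 into the other buffer. Returns the value
   produced for the last frame drawn, or -1.0f if a thread could not start. */
float run_frames(int frames)
{
    static float buffers[2][MESH_POINTS];
    float last = 0.0f;
    int front = 0;

    compute_wave(0, buffers[front]);
    for (int f = 0; f < frames; f++) {
        WaveJob job = { f + 1, buffers[1 - front] };
        pthread_t worker;
        if (pthread_create(&worker, NULL, wave_thread, &job) != 0)
            return -1.0f;
        last = draw_mesh(buffers[front]);  /* overlaps with the worker */
        pthread_join(worker, NULL);        /* next frame is now complete */
        front = 1 - front;                 /* swap the two buffers */
    }
    return last;
}
```

The two threads never touch the same buffer at the same time, and the join before each swap is the only synchronization needed. All drawing stays on the calling thread; only the calculation moves.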
And so now it's animating silky smooth, whereas before it was barely crawling along at 20 frames a second. So this is a basic example of where you can start from pretty dismal performance, do some simple things, and all of a sudden the whole application comes alive: you're getting 1.5 million triangles a second and are able to deliver a much better application.
Okay, so now we've kind of talked about the application setup, basically how to drive OpenGL. And all those things are fundamentally CPU-oriented operations. That's a process of optimizing how effectively you're utilizing the CPU. So now we're going to talk about per-fragment operations a little bit. Per-fragment operations are fundamentally what the graphics card is going to have to do to convert your data from a triangle into the image in the frame buffer, and what types of blending or texturing operations need to be done.
And there's a few things just to keep in mind while you're programming. One is to utilize multi-texture instead of multi-pass. So basically all the graphics cards that are accelerated on OS X today have multiple texture units. So if you want to apply two textures, let's say, you can do it in one pass. You can load two textures, one in texture unit zero and one in texture unit one, and you can apply both textures simultaneously. And this actually has two benefits.
One is that it, again, lowers the CPU overhead because your application's not having to loop through the rendering twice and reissue drawing commands to do the second pass. But the second one is that it helps the graphics card optimize its memory traffic, because you're not writing to the frame buffer on one pass and then having to come back on a second pass and write the pixel again.
Instead, you're allowing the graphics card to read two texels out of the textures you've defined, combine them, and write the result once out to the frame buffer. So it lowers the ultimate bandwidth that you're consuming on the graphics card.
So you can get performance a couple of ways by going to multi-texture instead of multi-pass. Now, obviously anything I'm gonna say here is just a suggestion of things to try to work your way around, and sometimes you're looking for an effect and you have to do these operations.
But when possible, avoid read-modify-write operations on the frame buffer. Read-modify-write operations are anything that requires the frame buffer value to participate in the color calculation whose result is finally put back into the frame buffer.
So things that do that are blending. If you're blending with the final destination in the frame buffer, it's gonna have to read the frame buffer, it's gonna have to read the textures, it's gonna have to combine it all and then write it back out. So you're gonna get 2x the bandwidth.
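A deliberately simplified model in C makes the frame buffer traffic concrete. It assumes that the first pass only writes each pixel and that every later pass blends with what is already there, so it reads and then writes. Texture reads, depth traffic, and caching are ignored, and the function name is made up for illustration.

```c
#include <stddef.h>

/* Frame buffer accesses per pixel for `passes` rendering passes.
   The first pass writes the pixel once; each later pass blends with the
   existing value, which costs one read plus one write. */
size_t framebuffer_accesses(size_t passes)
{
    if (passes == 0)
        return 0;
    return 1 + 2 * (passes - 1);
}
```

Under this model, applying two textures in two passes costs three frame buffer accesses per pixel, while applying both in one multi-textured pass costs one.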
[Transcript missing]
Okay, another thing that's important to keep in mind is that modern graphics cards have the ability to do high-level culling for you. So if you're drawing lots of triangles, it can, ahead of time, cull those triangles out. And it's called Hierarchical Z. And Hierarchical Z will
[Transcript missing]
So, some of this is a review. We're going to talk about OpenGL extensions and which OpenGL extensions can help you.
We already talked about compiled vertex array, but just to review really quickly, this is good for a large number of vertices. It reduces the number of transformations. It reduces memory traffic. It allows OpenGL to pre-compile data into an AGP buffer ready for transmitting to the card. So, whenever possible, again, use compiled vertex array.
Texture compression. Texture compression is also a very good extension to use. It allows you to minimize the system memory bandwidth of moving that texture around. It saves you system memory itself. It saves the bandwidth of moving that texture up to the graphics card. And it can also benefit the graphics card itself by lowering the bandwidth it takes to read a texel out and to render with it, because it will do on-the-fly texture decompression and it will better utilize the cache memory on the graphics card. So, texture compression can be effective. It's really going to depend on where your performance bottlenecks are in your application, but it's a good one to keep in mind.
Multitexture is the extension for doing multitexturing, like we'd mentioned, utilizing more than one texture unit. Apple packed pixels, again, is the extension for the best way to pass pixel data to OpenGL, and it allows you to get the best bandwidth utilization when moving pixel data to the OpenGL system.
And the other thing it does actually is it saves system memory. So if you're able to store a texture in a more optimal format for your application, let's say 1555, obviously that's going to be half the memory utilization of an eight-bits-per-component texture. So that will give you some system memory savings, some bandwidth savings, and it'll also save video memory on the graphics card.
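The memory saving is easy to check with a little arithmetic, sketched here in C. The helper names are hypothetical, and the bit order used for the 1555 pixel (alpha in the top bit, then red, green, blue) is an assumption of this sketch; the layout a given packed pixel type expects should be taken from the extension documentation.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a width x height image at the given bits per pixel
   (assumes tightly packed rows and a bit count that is a multiple of 8). */
size_t image_bytes(size_t width, size_t height, size_t bits_per_pixel)
{
    return width * height * (bits_per_pixel / 8);
}

/* Packs one pixel into 16 bits: 1 bit of alpha, then 5 bits each of red,
   green and blue. Inputs are 8-bit values; the low 3 bits of each color
   are dropped and alpha becomes a single on/off bit.
   The bit order is an assumption made for this sketch. */
uint16_t pack_1555(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((a >= 128 ? 1u : 0u) << 15) |
                      ((unsigned)(r >> 3) << 10) |
                      ((unsigned)(g >> 3) << 5) |
                      (unsigned)(b >> 3));
}
```

A 256 x 256 texture takes 262,144 bytes at 32 bits per pixel and 131,072 bytes at 16 bits per pixel, which is the halving described above, at the cost of color precision.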
So, quick summary. So, again, going over some of the priorities. So, the thing we always tell people is they, again, need to optimize their application because 75% of the time is typically spent in the application and 25% in OpenGL. So, optimizing your application is going to be important and you won't get good performance until you've gotten that to an acceptable level.
Scale your application to the target platform. Try to determine how much video memory is available. Determine how much system memory is available. Try to stay within acceptable bounds that won't cause the system to go into paging. Determine the number of texture units on the graphics card so that you can do multi-texturing instead of multi-pass. Look at your CPU type. Try to utilize a number of OpenGL extensions that will help simplify how the data is being passed to OpenGL, as well as potentially give you better effects.
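One small piece of that scaling, choosing between multi-texture and multi-pass from the number of texture units, might look like the C sketch below. The function name is hypothetical. In a real application the unit count would come from the OpenGL implementation, for example by querying GL_MAX_TEXTURE_UNITS_ARB with glGetIntegerv once a context is current; here it is passed in so the sketch runs without one.

```c
/* Passes needed to apply `layers` textures when the card can apply
   `units` textures per pass (rounds up). Returns 0 when `units` is 0,
   which this sketch treats as invalid input. */
unsigned passes_needed(unsigned layers, unsigned units)
{
    if (units == 0)
        return 0;
    return (layers + units - 1) / units;
}
```

With two texture layers, a card with two units needs one pass and a card with a single unit needs two, so the same content can fall back to multi-pass only where it has to.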
Allow the user to adjust the graphics settings. If the user is experiencing problems on a particular platform for one reason or another, allowing the user to vary the quality settings so they can get the performance they're looking for is obviously going to be a friendly thing to do. It will give the user control over some of the aspects of how the application runs.
So for more information, there's two good books on OpenGL, and anybody that's doing OpenGL programming should have these books. One is the OpenGL Programming Guide, and the other one is the OpenGL Reference Manual. These books are invaluable. They're very well written. And whether you're just starting OpenGL or you're an expert, these books are worth keeping close; they're always sitting right next to me on my desk.
So for online help, there's some good resources. You'll want to go to www.opengl.org. This is the official OpenGL web page, and it's got all kinds of neat news announcements, resources. It has lists of applications that are utilizing OpenGL, documentation. It's got all the resources you'll need for finding out what's the latest in OpenGL.
And then there's lists.apple.com, where you can join the Apple OpenGL list. There's lots of Macintosh-specific discussions going on in that list, where you can participate or learn from some of the discussions that are going on there, or send an email of your own and ask about a particularly difficult problem that you need answered.
And lastly, we met Sergio at the beginning here, so if you have any questions about OpenGL at Apple, you can contact Sergio. And here's his contact information. Sergio is our product representative at Apple, and he can direct you to somebody in Apple if he's not the right person or help you with some of your product needs.
And lastly, we have--after this session, we have advanced OpenGL rendering techniques. It's a very interesting presentation that will go into utilizing some OpenGL extensions for doing advanced rendering. I highly suggest it for people that are looking for new techniques and some capabilities of graphics cards today. It'll show you some interesting demos and some nice effects.