
WWDC07 • Session 420

Tune Your OpenGL Application

Graphics and Imaging • 1:02:23

Discover the latest techniques to maximize the performance of OpenGL. Learn not only the methods to increase frame rates, but how to ensure that your application makes the most efficient use of the OpenGL pipeline. See how to best submit vertex and texture data, eliminate redundant data copies and use optimal pixel formats. Find out how you can enable multi-threaded OpenGL execution to dramatically take advantage of multi-core CPUs. This is a great session for application and game developers alike.

Speakers: Kent Miller, Dave Springer

Unlisted on Apple Developer site

Transcript

This transcript has potential transcription errors. We are working on an improved version.

Good morning; this is the Tune Your OpenGL Application session. My name is Kent Miller; I'm the manager of the Desktop OpenGL Software team at Apple, and I'm joined here by Dave. Our idea for this session is that we're going to loop back over some of the topics that have been covered in the other OpenGL sessions, but our take is to show you how we work. A lot of what we do is trying to help people get the best performance out of their applications; they'll come in with a specific application performance problem and we try to help them diagnose it, and we're going to show you the tools that we develop and use. Shark and Profiler are the two main ones we turn to first, and we also have Driver Monitor and Shader Builder to help us do some other specific things to applications. So I'm going to introduce Dave Springer, awesome graphics dude, and he's going to take us on a little tour of Profiler.

Thanks Kent. As Kent said these are tools that we've developed that we use every day and they are shipped, you have them on your DVD; so I'm going to just give you a quick tour of OpenGL Profiler.

Profiler is an app that's made specifically for debugging and profiling GL apps; it hooks directly into the GL library entry points. I'm going to start by launching an app here, Cylinder, and what we're going to do is take a look at some of the things that we can see in the app using Profiler.

So the first thing I'm going to do is set a breakpoint by going to the breakpoints window, and I'll set a breakpoint here on CGLFlushDrawable. This is the kind of knothole function that, if you're a Cocoa programmer, you're used to calling as NSOpenGLView's flushBuffer, or if you're a Carbon programmer it's aglSwapBuffers; both of those routines come down into CGLFlushDrawable.

This is actually the frame end marker, really, of your OpenGL app, so generally speaking what we like to do is capture a single frame of information and see what's going on in the app, and this is where we usually break; that's why it's set out in bold, to make it easier to find.

Okay, so here I can see that I've got a call stack of where I am stopping and some simple stuff like that; then what I can also do is like I said I can capture a frame and the way I'm going to do that right now is get up the trace view here and what I can do is go back and continue; now what I've done is I've run one frame of the app and when I go back and look in the trace view here what it's giving me is all the OpenGL calls that are made in that one frame and I can just browse through that like this.

You can see it makes about 3,000 calls to do that one frame, which sounds like kind of a lot just to make a few quads and some textures; so right away I've already gathered some information about this app that tells me, "You know what, maybe just looking at the kind of graphic that I've got here on the screen, there are possibly some ways that I can optimize this already to reduce my function call overhead or something like that," and that's just with a few operations within OpenGL Profiler.

( Pause )

Another thing I can do in Profiler is look at the state changes. Kent is going to go over state changes a little more later on.

Generally speaking you want to keep your state changes to a minimum. They're usually fairly expensive, so what I can do here again is continue on another frame and take a look at the state; so again I'm stopped at a breakpoint, at the frame end breakpoint, and this pane over here on the, well, I guess it's my left and your right, is all the GL state that's in the engine. So every single GL get call you can possibly make is listed in here, and there's two ways you can look at this.

When Profiler starts your application the first thing it does is it goes and grabs all the default state that's there so as you run it's doing differences from that state to where you're at; so this is going to tell you all the state you've changed since the genesis of OpenGL in your app and it shows those in blue over here so you can see that, for example, I've changed transfers. I mean there's quite a few things that have been changed. I've changed the transforms, the view port of course is changed; you have to do that; projection matrix things like that, the model view matrix.

Another thing that we can do with state is look at the difference from my last breakpoint to another breakpoint. So I want to see, for example, how much state I'm changing between when I end a frame and when I call my first glBegin, because that's interesting to me; I want to know how much, and do I really have to make all these state changes? So I'm going to find out how much state I really am changing, and the way I'm going to do that is by scrolling down in this list; you can see in this list here, by the way, it has every single GL entry point; there's about 900 of them; there's quite a few.

They're alphabetized so it's at least a little easier to find them but, whoops I don't want to break there, I want to break on glBegin, and then what I'm going to do is hit continue and...

( Pause )

when I hit continue it says, "Oh, since the start of the frame, or the end of the last frame, in glBegin I've changed some pieces of state."

Here I've changed GL_TEXTURE0, and I've changed the width and height of some textures and some things like that, and then I can combine the two and see that where there's blue, that's state I've changed from default; where there's red only, it's state changed since the last breakpoint; and where they're purple, that's where both of them have changed; right, because purple is the (inaudible) red and blue; everybody knows that. Okay, what you want to do in here is try and get everything to be blue, the idea being that you set up all your state once from default and then you run your frames and keep your per-frame state changes to a minimum.

Okay, another thing I can do with Profiler is gather stats; let me get rid of all these breakpoints here first...I had glBegin, and then I'm just going to double check; there's a...break after none and a break before none, so I've cleared all the breakpoints. I can also, if I want to, ignore all breakpoints, which is handy if you have a bunch of breakpoints set and you just want to run a few loops of your app and then turn the breakpoints back on; you can do that with that checkbox. Now what I'm going to do here with this app, whoops I should have of course got it running again, right.

I'm going to go to the statistics window now and watch what's happening with my GL calls; so here I'm sorting on the number of calls being made and you can see there's quite a few; there's 2 million to glVertex3f and glColor4f. I also notice that glBegin and glEnd are being called; earlier I set a breakpoint on it and stopped there, so obviously it's being called, and that means this app is running in immediate mode; so another thing...a little orange flag there, take a look at that; generally speaking immediate mode is great for getting started, for prototyping, and actually good for certain apps, but not all. In here you have things like percent of GL time, which means the amount of time that you're actually spending in the engine; so this CGLFlushDrawable, relative to all the rest of GL, is taking 88% of your engine time.

Percent app time is the amount of time that this GL call takes relative to the whole app, so about half your time is spent in CGLFlushDrawable in this particular application; then there's other stats you can gather down here, like the amount of time spent in GL; of course if you add this and the app time together they should be equal to about 100, which they do.

Okay one last thing to take a look at here quickly, I'm going to set another breakpoint. You see that breakpoints are pretty, pretty important; you use them all the time. I'm going to stop again at the frame end and what I want to do is take a look at some buffers; I actually use this feature myself just last night to debug some threading issues.

You can look at the back buffer and one of the things that happens often is we get...developers that have the wrong pixel format and they think they have double buffered but don't so there is no back buffer; well that'll cause flickering and all kinds of other issues. Well here you can say, "Well what is my back buffer and is there a back buffer?" So this is a buffer that's about to be flushed forward. Another view that we can look at that's actually really useful is the depth buffer and okay this looks not very interesting; in fact it doesn't look like much of anything, right but notice this orange bar up here.

These are actually the Z values, scaled Z values, that are in the Z buffer right now for this image, and if you have an application that's doing a lot of Z fighting, which means that you have polygons that get close together and they start flickering on top of each other; I'm sure you've all run into that. What's happening is that the precision in your Z buffer is starting to get too low, so the compares are not working.

Well there is a way to fix that; if you click on this magnifying glass, what it's going to show you is a kind of Z min, Z max bound on the Z buffer, and so Profiler has gone in here and says, "You know what, you're really just using this little tiny squashed up area of the Z buffer;" so to fix your Z fighting problems, what you can do is go back to your GL (inaudible) call or your gluPerspective call and change the Z near and Z far clipping planes; the closer you put those to your model, the closer together you can get those, the better Z precision you're going to end up having.

So that's how you would use that tool and the idea of course is to get this orange bar automatically to be as close together as you can by using that magnifying glass. Okay, so to sum up this, you want to use this tool to get in and take a look at your app, don't guess at things.

We get a lot of folks saying, "My app does this or that because, you know, your graphics do so on and so forth." Well that's not always the case; so you definitely want to get in there, take a look at what your graphics are doing with the library, make sure that you're using the library correctly.

You can use break on GL error, you can use break on thread conflict and all kinds of other things in Profiler, all the stuff we're going to look at later and really understand what your usage of the library is, of the GL engine, of the GL library before you make decisions; it's just like using a good debugger and then you have a lot more information; so don't guess, measure first and then make your app perform. With that we'll turn it back to Kent.

( Pause and applause )

- Could we go back to the slides please?

( Pause )

Okay, so let's talk for a second about just the basics. Dave showed you how to use Profiler to examine how your app is calling into GL, and one of the first things that you ought to look at is minimizing the number of times you call into GL.

Immediate mode, as you saw, adds up fast; it doesn't take very many frames of that application to get you up into the millions of function calls. Instead, a better strategy is to use the functions that allow you to move a lot of data at one time, which means one function call submits lots of things, and vertex arrays and vertex buffer objects are a good way to do that. If you're using programs or shaders, there are these two extensions that allow you to change the program parameters or uniforms all at one time so you don't have to make multiple function calls to do that.
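To make the vertex-batching side of that concrete before we get to the uniform extensions, here is a minimal sketch in C, assuming a current GL context; the function names and the three-floats-per-vertex layout are illustrative, not code from the session. The vertices are uploaded once into a vertex buffer object, then the whole batch is drawn with a single call instead of per-vertex immediate-mode calls.

    #include <OpenGL/gl.h>

    static GLuint gBatchVBO;

    void createBatch(const GLfloat *vertices, GLsizei vertexCount)
    {
        glGenBuffers(1, &gBatchVBO);
        glBindBuffer(GL_ARRAY_BUFFER, gBatchVBO);
        /* Three floats (x, y, z) per vertex; GL_STATIC_DRAW since we won't rewrite it. */
        glBufferData(GL_ARRAY_BUFFER, vertexCount * 3 * sizeof(GLfloat),
                     vertices, GL_STATIC_DRAW);
    }

    void drawBatch(GLsizei vertexCount)
    {
        glBindBuffer(GL_ARRAY_BUFFER, gBatchVBO);
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);
        glDrawArrays(GL_TRIANGLES, 0, vertexCount);   /* one call submits the whole batch */
        glDisableClientState(GL_VERTEX_ARRAY);
    }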

Bindable uniforms is new for Leopard, for shaders; the other one we've had out for a while. Dave and I, in preparation for this talk, ran Profiler and gathered some statistics from several of the latest popular games, and I thought that might be interesting to take a look at. We see that they spend anywhere from 35% to 60% of their time actually calling into OpenGL for different things, some of them for various reasons.

Some of them call a lot of program parameters to put inputs into their vertex programs. Some of them call matrix mode to change the matrix a lot, but they spend an awful lot of time calling into OpenGL. We also noticed that they don't call glGet and they never call glBegin, so that means high performance applications aren't doing those things. These applications spend the majority of their time in these routines as indicated by Profiler, but really what's going on here is that when you call something that actually physically does drawing, that's when all the state changes that you've made in between get resolved; so if you see these calls show up, it may not necessarily be that one call that's causing you the problem; it could be the things that you did before that, and that's where examining the trace comes in handy. So it's said a lot that OpenGL is a state machine; so what's a state machine?

It's just a concept for something that takes a fixed number of inputs; you know, does certain operations to them and adds a fixed number of outputs. Well your graphics card only has one current state so when you flip back and forth between states it gets expensive so a strategy is to group your similar drawing together to minimize the number of times you change state.

Another great thing that's easy to find with Profiler is to watch the trace for times when you continue to set the state to the same thing over and over; we see that a lot and it's just overhead; it doesn't really buy you anything. Instead, a better strategy is to use objects that encapsulate state; these can be switched between with one function call and change a lot of things at one time. Vertex_array_object will do that for vertex array state; framebuffer_object does that for drawables, so instead of changing OpenGL contexts you just set a new drawable and continue to draw.

It's lighter weight.
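As a rough illustration of switching drawables with framebuffer_object instead of switching contexts, here's a minimal sketch using the EXT_framebuffer_object entry points; the texture argument is assumed to be a complete 2D color texture, and error handling is trimmed for brevity.

    #include <OpenGL/gl.h>
    #include <OpenGL/glext.h>

    /* Redirect rendering into a texture by binding a framebuffer object,
       then go back to the window drawable; no context switch involved. */
    void renderToTexture(GLuint colorTexture)
    {
        GLuint fbo;
        glGenFramebuffersEXT(1, &fbo);
        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_2D, colorTexture, 0);

        if (glCheckFramebufferStatusEXT(GL_FRAMEBUFFER_EXT) ==
            GL_FRAMEBUFFER_COMPLETE_EXT) {
            /* ... draw the off-screen pass here ... */
        }

        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);   /* back to the window drawable */
        glDeleteFramebuffersEXT(1, &fbo);
    }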

( Pause )

So, when you're programming, try not to stall the engine. Stalling the engine, the best way to do it is to start asking OpenGL what you told it to do: query the state, ask it what you set it to. In all these sessions we've talked about sync points; a sync point is somewhere where the execution of the graphics pipe has to stop and reset itself, or give back some information, to figure out something you asked it to do, and these are all classic reasons to sync. glGet and glReadPixels sync; sometimes glGets sync for some things and they have to stop and figure some things out, maybe finish some rendering, before they can give you the answer back. Instead, a better strategy is to let it keep working as much as you can. So wait to get your results as long as you can, and if you are using fences, delay the test of the fences as long as you can; you don't want to be either polling the fence or blocking, waiting on it to finish.

Occlusion queries are the same: if you can wait until as late as possible to get the result from them, that's going to allow the graphics to keep going as much as possible. And if you need to use ReadPixels, if you use pixel buffer objects then you can start the ReadPixels, do more work, and then right up until the last second when you have to have the results of the pixels you call map buffer, and then you'll block until they come back; but hopefully they'll already be back by then.
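Here's a minimal sketch of that asynchronous readback pattern with a pixel buffer object; the buffer size, pixel format, and function names are illustrative assumptions rather than code from the session.

    #include <OpenGL/gl.h>
    #include <OpenGL/glext.h>

    /* Kick off a readback into a pixel buffer object; glReadPixels can return
       before the transfer finishes because the destination is a buffer object. */
    GLuint startAsyncReadback(GLsizei width, GLsizei height)
    {
        GLuint pbo;
        glGenBuffers(1, &pbo);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
        glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL, GL_STREAM_READ);
        glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, 0);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        return pbo;                /* go do other work before touching the data */
    }

    void useReadbackResults(GLuint pbo)
    {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
        /* This is the call that may block, so make it as late as possible. */
        const void *pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        if (pixels) {
            /* ... consume the pixel data ... */
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }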

Okay, so some techniques to optimize your state: inspect your trace, look to see if it makes sense, and look out for redundant state setting. I think a lot of people are surprised; I know I'm surprised when I write a program and open it up with Profiler and I go, "Oh, what did I do," and you see, "Well gee, you know," it's really easy to see sometimes that you're making some stupid mistakes as far as redundancy, and when you look at your immediate mode program you think, "Well boy, do I have to set the matrix mode for every quad I draw?" Probably not. The state inspector also helps you find changes in state easily because of the color coding, and that's a good debugging tool, by the way; so if you get drawing results you're not expecting, or you're asking "why isn't this drawing, why isn't this drawing," you can look at the state and see, "Oh gee, I've got blending enabled; okay, yeah, I didn't want that." Okay, so let's talk for a minute about glFlush; we say don't call flush, don't call finish; so what does flush actually do? When the graphics system is executing, it's buffering up commands to send to the graphics card, and it's filling buffer 1 up as you issue commands. When you call flush it sends that buffer on to the card and the GPU starts to chew on it, and then you start filling up another buffer with the next commands you call.

So, what happens if you call flush a lot? These queues are relatively big, and if you flush often you only fill them a tiny amount; then when you submit one, well, first of all you didn't fill it, and second of all, each application only gets a fixed number of them, so if you do a few things, flush, do a few things, flush, do a few things, flush, eventually you're going to block; the graphics card will not finish working on a buffer before you need a new one back, it doesn't have any free, and so then you're blocked on flush waiting to get a new buffer.

So instead, the strategy you should be aiming for is to fill the buffers. If you never call flush, what happens is the buffer gets full, and when it's full it submits it to the card, the card starts working, and you get started on a new empty one and then you get to fill that completely. If you have too much flushing then you're making inefficient use of those resources; aim for full command buffers.

There are some times you have to use glFlush, and these are two. If you're single buffered, for whatever reason you might want to be, you have to call glFlush to get your results to appear on the screen; otherwise they won't appear until the command buffer fills up, and that can give you kind of arbitrary rendering results; so if you're single buffered you're going to have to call glFlush to get the results you want. Another case is if you have shared OpenGL contexts: then you have to use flush to make sure that some rendering is finished before you can use it from the other context. An example of this is if you are using pbuffers and you're going to use one as a texture in another context; you'll have to call glFlush on the first context before the resource is guaranteed to be available in the second context.
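A minimal sketch of that shared-context handoff, with hypothetical context objects and texture; the key point is the glFlush on the producing context before the consumer uses the result.

    #include <OpenGL/OpenGL.h>
    #include <OpenGL/gl.h>

    /* ctxA and ctxB were created as shared contexts; sharedTexture is visible to both. */
    void produceThenConsume(CGLContextObj ctxA, CGLContextObj ctxB, GLuint sharedTexture)
    {
        CGLSetCurrentContext(ctxA);
        /* ... render into sharedTexture, for example via a pbuffer or framebuffer object ... */
        glFlush();   /* required before the results are guaranteed visible to ctxB */

        CGLSetCurrentContext(ctxB);
        glBindTexture(GL_TEXTURE_2D, sharedTexture);
        /* ... draw using the texture ... */
    }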

( Pause )

So, Dave...a lot of times we have developers come in and I think that you saw this just yesterday, someone that said, "Oh well I was working on my app and all of a sudden I fell into software" and can you show us how to use the tools to work on some of the common cases for that?

Sure, if I can get the demo machine back, please. I did get this on the list just yesterday.

A developer wrote in and said, "I have this shader and it's flying along in hardware." He added one line to his shader and it tanked, and he opened up Profiler and found out that glVertex was going a thousand times slower all of a sudden; so this is one of these "I fell off a cliff, what happened" kind of scenarios; not a graceful decline at all. This definitely pushed you over the edge. That's usually a symptom of what we call software fallback, which means that you were hardware accelerated and now you're not; so that's what we want to take a look at.

Okay, we have an app here, this spinner app, and I'm going to go ahead and launch that; now this is a Gouraud-shaded quad; it's four vertices with this rainbow pattern on it, and it's running at about 200 frames a second. Okay, I've got some really smoking hardware in here, and 200 frames a second for a quad really sucks; I mean this thing should be going at about 4,000 and this should be a blur so you can't see it. Why is it only going 200 frames a second for a quad? It seems way too slow.

So, here's what I'm going to do. I'm going to check and see if we are in fact just falling off the hardware, and it's really easy to do in Profiler; I just say "break on software fallback" and it stops and says, "Well, yeah, you did; you fell off the hardware," and you can see up in here it says the fragment processing is in fact GL_FALSE; so this is...if you were in Jeff's session, there's a CGL call you can make to get the fragment processing and also the vertex processing; these are those same calls.

Now what OpenGL Profiler is basically doing is making those calls for you every single time you issue a GL command; so using this feature is not super speedy, but it will definitely tell you if that's your problem, if you in fact fell off the fast hardware path.
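For reference, checking the same thing from your own code looks roughly like this, assuming the kCGLCPGPUVertexProcessing and kCGLCPGPUFragmentProcessing context parameters CGL exposes; remember this forces a sync, so treat it as a development-time check rather than something you call every frame.

    #include <OpenGL/OpenGL.h>
    #include <OpenGL/gl.h>
    #include <stdio.h>

    /* Ask CGL whether vertex and fragment processing are currently on the GPU. */
    void checkHardwareAcceleration(CGLContextObj ctx)
    {
        GLint gpuVertex = 0, gpuFragment = 0;
        CGLGetParameter(ctx, kCGLCPGPUVertexProcessing, &gpuVertex);
        CGLGetParameter(ctx, kCGLCPGPUFragmentProcessing, &gpuFragment);

        if (!gpuVertex)
            printf("vertex processing fell back to software\n");
        if (!gpuFragment)
            printf("fragment processing fell back to software\n");
    }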

So, okay, let's take a look more at this problem, namely how we're going to solve it. One of the things that we've noticed is that typically one of the main suspects in falling off the hardware path is shaders. There are other ways to fall off the hardware path too, but usually those are subtler and harder to hit. It's real easy to fall off hardware when you're using shaders.

So, I'm going to take a look at the Shaders that are in here; I'm sitting at a breakpoint so I'll go and look at the resources panel in Profiler and I want to take a look at Shaders; so Profiler has gone and reached into the application and said, "Well its got this vertex shader and this fragment shader and they're bound to this program." So, let me take a look at the vertex shader here...so I've got that right up there and nothing really jumps out as causing any major problems.

It's doing some pretty straightforward stuff. I have transform, it's building a color, reading a...putting a value into a varying, pretty lightweight. Let's take a look at the fragment shader and well here we are. There's the noise 3 function; that's not implemented in hardware at all. So no hardware that we ship has noise 3 functions; that is definitely software fallback, guaranteed.

So I'm going to say, "Well okay what if I just don't do that then, you know, doctor it hurts when I do this, well don't do that." So I'll take that out of the shader, say compile; now what Profiler has done is taken that fragment shader and replaced the old one in the app with this new one.

Let me go down to my breakpoints here...and continue and get on the app

( Pause )

Now what happened? I got it...excuse me, there we go and now you can see that I'm in fact running about 5,000 frames a second...not bad. I left the break on software fallback...break...whoops, after that whole thing about how Profiler is invasive I do it myself. You'd think I'd know, I wrote the program.

So there you go. So now from this stage you can take that shader code that you fixed in Profiler and paste it back into the original app, and presto, you're now performing...or at least you know, so you can decide and make better decisions; so this is a good way to use Profiler to look into why you might be falling off of the fast hardware path and onto the slower software path.

Okay, with that, back to you Kent.

The newer hardware is much better about falling off, but a lot of the developer problems that come to us are from people who are trying to make their applications run across the whole product line, and the lower end hardware is more difficult to make your stuff fit into.

Okay, so what are some reasons you fall off the hardware? For vertex processing, on older hardware some of the fixed function features will make you fall off hardware; two-sided lighting is an example, along with shaders that use too many resources, so they get too long or use too many texture lookups, or I guess any texture lookups currently. Also for vertex processing, some hardware is always in software vertex processing: the Intel integrated graphics that's in the MacBook and the Mac mini is always in software vertex processing. For fragment processing, on some hardware some features, like non-power-of-2 textures, will make you fall back into software fragment processing, and shaders that exceed the hardware limits or use features without hardware support make you fall back. Fragment fallback is much more expensive than vertex fallback. I think it would probably be relatively easy to be falling back with your vertex processing sometimes and not be able to tell.

Now fragment processing, you can usually tell immediately if you do any kind of significant rendering; and even though the fragment fallback is much improved in Leopard, it's still relatively easy to tell. It's interesting that when we were coming up with the contrived program and we were developing on our laptops, when we fell back it would be two frames per second, and then we'd get up here on the super, super demo machines and it's 200 frames per second when we fall back; on the lower end hardware it's really easy to tell.

So how can you tell? Use the break on fallback feature like Dave showed you, and if you're using Shark you can look for the software renderer library to show up in your Shark trace; it's called GLRendererFloat. Inside your program, as Dave mentioned, there are these calls you can use, and just like any glGet they're expensive; they actually alter the flow of the system when you make them, so it's a complete and total sync when you call these. It's not something you want to call every time you render; it should be used more like a development feature, or you might run it once when a fragment shader gets loaded up to see if it works in hardware; that would be another time you could check, but not every time you draw. There's an application on the disk already, in the developer examples under OpenGL, called GLSL Editor Sample that allows you to simply develop shaders, and it shows you if you're in software or hardware vertex processing as you edit the shaders; so you can be editing a shader, you add something that makes you fall back, whoops, it turns red, you see it fell back and you can do something different. The new Shader Builder also has that feature along with a ton of others, and that's going to be available in Leopard.

Okay, so Dave, this brings us up to multithreading. I had this application the other day that I was debugging for Leopard; it came up and it had 37 threads and I was like, "Oh my gosh;" so these apps are really starting to use multithreading, and it's a great way, every machine we ship has multiple processors, multiple cores, and it's a great way to get some performance, but it can be difficult; so why don't you show us the features of the tools that allow you to more easily develop multithreaded applications.

Okay, first I want to say that we feel your pain. Multithreaded programming is just hard and there's no way around it, especially when you've got multiple cores and the threads really are assigned to run on different processors; so if you're not really careful with your locking and things like that, then your thread contention problems and bugs are compounded a lot.

So, we've endeavored to come up with some ways and some tools, and actually added things to Profiler, to make the GL-specific stuff in multithreaded programming easier. I have to give my usual lecture, right, which is: when you are going multithreaded in OpenGL, generally speaking you want to think of one thread per drawing context; so if you've got an OpenGL context, a CGL context, that's doing drawing, you want to have just one thread that's talking to it. It's difficult to have many threads talking to that one drawing context.

You can have your other threads in you app definitely doing other things, in your game, physics, all kinds of stuff like that or database access, that's a great way to use threads but generally speaking one thread per drawing context; so if you have more than one drawing context which is also a good idea especially when you've got very heavyweight drawing going on then try to match those two things up. I'm going to show you in a little bit how you can have many threads going to one context; there's some locking mechanisms that we provide and also just generally how to do that and I'm actually going to write code right here live which is, yea.

Okay, that's the end of the lecture; there are two kinds of multithreaded programming that we cover: one is yours, which you're doing in your application, and the other is that the OpenGL engine is multithreaded as well; so last year at WWDC we showed, we took Doom and turned on multithreading and immediately got some, I forget the percentage exactly but it was...

Some.

Some. It was significant, just by multithreading the OpenGL engine underneath it, and we get that for free. You can get that for free in Profiler by going here in the breakpoints view, and down here you can turn the multithreading on and off regardless of what the application has set, okay.

So let me show that really fast with...I've got Cylinder here and I'm just going to turn it on...so there's Cylinder running, and again if I stop at my frame end I can say, "Well, I want to see what it looks like with the multithreading turned on," so I turn the breakpoint on, then I force multithreading on, turn the breakpoint off so it'll keep running, hit continue, and now go back to Cylinder and there I've got it running; now it's running on the multithreaded OpenGL engine and I didn't change any code at all, nothing.

So it just automatically flipped that switch for me; so that's something you might want to try first and just see, because it's not always true that running on the multithreaded OpenGL engine makes your app run faster. There's a lot of factors that go into that. Multithreading was kind of a buzzword for a while, and it's "Well, you know, just because we've got all these processors we're going to spread our job out," but you've got to be careful about things like hidden serialization and stuff like that, and are you really CPU bound, are you GPU bound; these are all things you need to look at with Profiler and our other tools before you just jump into multithreading, because multithreading is difficult to program and I'm going to show you why in a second.

Okay and again if I want to I can stop at the frame end, force multithreading off and say, "Well what really happened here...not a great deal of difference; it is running a little bit slower but there's not a great deal of difference." So here's an example where I'm not specifically bound up in any way that the multithreaded engine is going to help me; so, but let's take a look at some multi threading issues when you're writing them in your app and how Profiler can help you with that. I did this myself yesterday...here's a piece of code and what it's going to do is draw the famous Atlantis whales.

Everybody knows Atlantis, right, being the demo that we shipped; so...if I just run this app by itself it'll lock this machine up, because I have all kinds of threading errors going on in here. I've got three threads, one for each whale you'll see, and they're all trying to talk to the same context; well, when I do that I've got a command buffer that's partially full and then another thread trying to come in and overwrite it, and the whole thing just falls apart; so I tried it; I'm not going to do it here because, you know, why would you want to see me reboot my computer, you can get that at home, right? So what I'm going to do with Profiler is, first of all, let me build the app and make sure; I opened this in Xcode and I'm going to build this, so, it's done, it's built, and let me go into Profiler here and I'm going to say I want to break on thread conflicts, so Profiler is going to tell me if this app has a problem with threading, and there it is right there.

Now it didn't give me a call stack because the call stack was probably pretty messed up. When you have two threads smashing into each other it's pretty hard to get state out of the machine; so even with a gdb and tools like that it's pretty hard to see what's actually going on once you have the error but at least I know that yes I do have two threads competing for the same resource and I've done something wrong.

So, and I've done it in a safe way; I didn't lock my machine up; that's a good thing, right? So now I can say, "Okay, look at where it's stopped, it's in glColor3f." Well, that's sort of a weird place; I just called it, how is it possible that it's...well, because my thread conflict happened as soon as I made that OpenGL call; that means that one thread was in the middle of filling a command buffer and another thread came in and tried to make a color call, boom.

So, okay, kill that. What we have is...let me find it here...CGL_LOCK...this is a way to lock the CGL context, okay? So what we've provided is some API that you can use, and it's in Leopard, right; I don't think it's earlier than Leopard: CGLLockContext and CGLUnlockContext. These pieces of API allow you to lock the CGL portions of the context, and I'm going to turn those on and see what happens with my app; I've bracketed my draw calls with these locks in an effort to make it so those threads don't compete; so let's see what happens there. I'll just turn this to 1, which hopefully will turn all that stuff on, build, and then go ahead and make sure that I'm still breaking on the thread conflicts, I am, and launch and uh-oh.

I got a little further but I'm still having a problem. Remember these are my threads now, okay; I'm not in OpenGL's threads; so this is my threading problem. I'm making the programming error. I take full responsibility, you know. I think we're doing some good work here, but okay, so the next step then is to say, "Alright, I've just tried flat out making GL calls from the different threads; that obviously didn't work." Now I've tried using the CGLLockContext routines around my drawing calls and that got me a little farther, but I'm still having problems, so that means okay, I need to reinvestigate what's going on here and really think hard about what I'm doing with my threads, and it turns out that I do still in fact need to lock more stuff myself, and what I'm doing there is...let's find where this is...not only my draw calls but...this is an NSOpenGLView subclass, and you'll notice its reshape method, which gets called automatically for you every time the window changes size and when a window first runs; Cocoa programmers are familiar with this.

It's going to go ahead and make some GL calls; well I need to be aware of that and I need to lock those and again in my drawRect as well where I'm going to do some other stuff, flush (inaudible), things like that. Okay, so I'm going to build this and let me see what happens; now I'll get my...again I'm going to make sure the thread conflict is turned on just in case this didn't solve the problem and run and okay. So now my whales are swimming.

Well, they're swimming kind of slow, but the reason again is because for every single GL call that's being made, Profiler is going in and looking and doing either a pthread trylock and all kinds of stuff to see if there's a thread conflict; so I'm in there really examining what's going on with your calls, it's very invasive; so if I just go ahead and turn that off, come back to here, there's my whales swimming and...we did that, we got rid of the thread conflicts.

( Applause )

Okay, so...with that back to you Kent.

Thanks Dave. So threading in OpenGL and switch back to slides please.

( Pause )

If you're a frequent attendee of WWDC, you know we go over this every year. OpenGL contexts are not thread safe, but you can use multithreading; you just have to take care of your own locking, and as usual with multithreaded programming it can be difficult to find a good division of labor so the app actually gets any speedup after you take care of that kind of thing.

One popular way that people do use threading with OpenGL is, for instance, to have one thread that has a context and another thread that has another context, and they're shared; then the first thread will be loading resources, textures and things, and the other thread will be drawing, and that seems to provide a pretty good speedup for people doing their own threading. And as Dave mentioned, and I'm not sure it's clear, NSOpenGLView may call you when you're in the middle of doing something else; so if you have spawned a thread that's doing drawing, your NSOpenGLView might call you at any time on that context; if that's the case then CGLLockContext will allow you to prevent that from causing a crash. Note that CGLLockContext only locks the CGL calls; it doesn't lock the GL calls.
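Here's a minimal sketch of that locking pattern; the function names are hypothetical, and the idea is simply that every code path touching the context, including the drawing thread and the Cocoa reshape/drawRect paths, brackets its GL work with the same per-context lock.

    #include <OpenGL/OpenGL.h>
    #include <OpenGL/gl.h>

    void drawFrame(CGLContextObj ctx)
    {
        CGLLockContext(ctx);
        CGLSetCurrentContext(ctx);
        /* ... issue GL drawing calls ... */
        CGLFlushDrawable(ctx);
        CGLUnlockContext(ctx);
    }

    void handleReshape(CGLContextObj ctx, int width, int height)
    {
        CGLLockContext(ctx);             /* the view's reshape can fire at any time */
        glViewport(0, 0, width, height);
        CGLUnlockContext(ctx);
    }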

So when that situation is appropriate you can use CGLLockContext; otherwise, if you're doing your own threading, you're kind of on your own for locking...or we also provide the multithreaded OpenGL engine, and it's been talked about in some of the other sessions this week, and like Dave said we showed it last year. It's great for a certain type of app: apps that have done a lot of work to minimize the sync points in their application.

One thing to keep in mind is that it does add some overhead; the way the multithreading works is it adds another buffer and when you call OpenGL with the command it just copies it and then the processing of that command buffer is done on the second CPU. So you're doing some more work and if your division of labor is not right or you have lots of sync points and things you can slow yourself down. One example of how there's more sync points is that if you ask OpenGL if lighting is enabled for instance; okay, if you're single threaded it goes in and checks the state, returns to you if lighting is enabled.

With the multithreaded engine, if you ask OpenGL if lighting is enabled, it has to finish the command buffer before it is up to date enough to tell you whether lighting is enabled, so that adds additional overhead. That being said, it's simple to turn this on in your app: you can use Profiler, or you can turn it on in your app with this simple CGLEnable call.
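That enable looks roughly like the following; kCGLCEMPEngine is the documented switch for the multithreaded engine, and if the enable fails the context just keeps running single-threaded.

    #include <OpenGL/OpenGL.h>

    void enableMultithreadedEngine(void)
    {
        CGLContextObj ctx = CGLGetCurrentContext();
        CGLError err = CGLEnable(ctx, kCGLCEMPEngine);
        if (err != kCGLNoError) {
            /* Multithreading not available; continue on the single-threaded engine. */
        }
    }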

( Pause )

So, we have another topic to discuss and that's paging, so in the past 6 months we've had a lot of developers come in and say that, you know, they're not getting the performance that they think they ought to or you know their performance is different from other platforms or whatever and a lot of times it's due to paging so Dave why don't you show us the tools that you can use to diagnose this type of situation.

Okay...what we're talking about is specifically texture paging and this is where we see apps gracefully degrade or slowly degrade instead of just fall off a cliff like we saw before in the software fallback case.

So if you notice that your app, as you start to load more textures especially, or more resources into it, starts to kind of get slower and slower and chug and chug and chug, we've got some tools here to look at that. The one I'm going to show is Driver Monitor, and I'm...up, good; so this is OpenGL Driver Monitor right here, and what I'm looking at are two parameters. One is called the current free video memory, and I've got quite a lot of it, and what that's telling me is how much memory is available on the video card for my textures and other resources.

The other one is the texture page on data, and this is going to show me live pages as they go onto the card; so these two kind of relate to each other, obviously, right, because as I put texture pages on the card I should see the amount of free memory gradually decline; so let me launch that...spinning cylinder app again...okay, and over here we see that right as soon as I launched it there was a little dip...right there in the amount of free memory that I have, as I loaded a bunch of textures on the card, and a spike in the page on data; so there was a bunch of pages sent over to the card.

Now what I can do with this app is add a bunch of textures to it...so every time I add a texture...you'll see I just put in a bunch more...

Why don't you change the scale to linear? I think it might be easier to see.

That's a good idea.

( Pause )

Okay.

At the top.

You know what else...

Yea.

There we go. So you can see here this is when I first added some, now I've added more...and then let me add a bunch more and see if I can't get that down to zero...it's getting closer...it's kind of my challenge; there we go and you see the app is really chugging now, look at the amount of paging that's happening here; it's spiking up; so we're getting a lot of texture pages swapped on and off the card and I've basically used up all the card's memory now. In fact you can see it's starting to affect the rest of the window server.

( Pause )

( Laughter )

All it takes is fifty 2,000 x 2,000 pictures and the video card just...

That's it, yea and the video card is pretty done. Now I'm going to try and reduce the number of textures on this and see if it'll come back up...and you should start to see that green line rise back up. There we go...

Yea, throw more away. I think you've got the perfect working set now.

Yea...see, so now we've got to where the card is full but we're not doing a lot of paging activity.

( Pause )

So that's where you can use Driver Monitor to go in and look at what's going on with your app when it starts to chug like that.

So there's a lot of different aspects to where performance is going and...you know, this is one aspect of it; definitely the falling off the hardware fast path is an aspect but usually the symptom of that is that you get this instant change; you're going along at a very high frame rate and then instantly you're several orders of magnitude less.

This kind of problem has symptoms where you slowly degrade and then chug, chug, chug, and then kind of the whole system starts to bog, because you've got the one graphics card and you're using it all up. So that's where you'd want to pull out Driver Monitor and look at that.

( Pause )

Okay, with that back to you Kent.

Thanks. Okay, switch back to slides please. So let's spend a second talking about paging, video memory, and Driver Monitor. One thing that's important to remember is that Driver Monitor is shared through the whole system, so it's showing you the stats from the hardware; it's not showing stats from just your application. The free video memory is shared between everything, and there's a limited amount of it, and what other apps do can affect how much you have available to you at any particular time. One of the things that can get you into a paging situation is using a bigger working set of textures than can fit at one time.

Dirty objects is what we call objects that have been created and modified on the video card; one piece of information we can give you is that sometimes those get treated differently when you're paging. If you have a lot of modified objects, dirty objects, and you need more video memory, those have to be copied back down to main memory before that video memory is available; it takes longer to page, and sometimes other objects are preferred over dirty objects, and that can affect your performance. A way to combat that, if you have these modified objects, is to make sure that you delete them as soon as you're finished using them. With textures it's not too big a deal to just abandon them in video memory, because you have a copy back in main memory, the CPU side we call it.

So you have a copy but with these dirty, dirty objects they have to get copied back so sometimes it'll say, "It's faster if we just boot this out;" So it affects your paging and so you have to be aware that if you use a lot of dirty objects that can affect that.

You can use Driver Monitor, and Xray also has the same Driver Monitor stats in it. We like Driver Monitor because it monitors the whole system all the time and you can run multiple applications, but there's a time to use Xray: when you're trying to profile your app and you want to see these stats versus the other things that your application's doing at the same time. It's also interesting from that point of view.

So, to reduce your paging: getting into a paging situation is not too hard to do on a video card that only has, for instance, 32 or 64 MB of memory. I don't know what the lowest configuration supported by Leopard is, but I bet there's some 64s and there might be some 32s.

So, reduce your texture size; with these textures, an obvious thing I should have done with this application is take the textures and, you know, use your favorite method to scale them down; they're being drawn this big on the screen, and there's no sense for them to be 2,000 x 2,000 on the card, so you shrink them down when you create them; that would be easy.

I can compress them, using the texture compression features of OpenGL, and as I mentioned before you want to get rid of those dirty objects as soon as you can. For a usage pattern where you create something and use it once, then modify it, use it again, modify it and use it again, or where you're only going to use it once ever, we provide a lot of control for the different types of objects so you can say, "Oh well, this one is only used once, therefore just use it in place and don't upload it to the video card;" for textures that's our client storage extension, and for vertex and pixel buffer objects you can mark them with the stream draw hint.
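As a rough sketch of both hints, with hypothetical data pointers and sizes: the texture side is the APPLE_client_storage pixel-store flag, which tells GL to keep using your copy of the data as the backing store (so you must keep it alive and unmodified), and the buffer side is the GL_STREAM_DRAW usage hint for data you respecify often.

    #include <OpenGL/gl.h>
    #include <OpenGL/glext.h>

    void createClientStorageTexture(GLuint tex, GLsizei w, GLsizei h, const GLvoid *pixels)
    {
        glBindTexture(GL_TEXTURE_2D, tex);
        glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0,
                     GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
        glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_FALSE);
    }

    void uploadStreamingVertices(GLuint vbo, const GLvoid *data, GLsizeiptr bytes)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, bytes, data, GL_STREAM_DRAW);   /* used once, respecified often */
    }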

There's a lot of documentation about this that's really good on the developer.apple.com website. There's a document called Optimizing OpenGL Data Throughput on Mac OS X and then in the OpenGL programming guide for Mac OS X that's also up there, it goes into extensive detail about client storage textures so if you're interested that's a great way to go check that out.

So, when you're faced with a performance problem, and when we're faced with a performance problem in house, what do we do? Well, you have to use a process of elimination. There are several places that your system can get bottlenecked: the CPU, if you're doing too much work on the CPU and the graphics card is bored, or vice versa; you might be doing little work with the CPU but the graphics card is extraordinarily busy.

Texture creation is another bottleneck that we spend a lot of time addressing, and another problem is that once you fix one thing, the bottleneck can move and you're bottlenecked at a different point. You may be CPU bound and you fix that, and all of a sudden you're fragment bound.

So then you have to iterate on that. Diagnosing CPU bound is probably pretty simple. You know, you can launch top; we're command line guys so we use top a lot, but Activity Monitor works just as well. Some of the people that work in my group have the Activity Monitor gauges on their menu bar and they're just watching them all the time to see what the CPU is doing when they're running their code. You can use OpenGL Profiler to examine the percentage of time that you're spending in OpenGL versus the percentage of time you're spending in your application.

Shark is also a good way to do that; you can look and see where the samples fall, and that gives you a good indication where to start. To address CPU bound applications, multiple CPUs, multithreading is a great way to get some more. So we're going to touch on texture problems; I think it's been touched on in some of the previous OpenGL sessions.

In a paging scenario like we showed, you can diagnose that using Driver Monitor; it's about the only way to do it. I mean, you might suspect it, but that can tell you for sure; if your green line, the free video memory line, gets close to zero, you probably are going to start noticing some lagging performance, and maybe if not with your application, then when you start switching applications, switching to other things, you're definitely going to start to notice the system lagging.

So, sometimes people have problems when they're trying to create their textures or read them back, and these problems we call format conversion problems; so if you run Shark like this and you notice that you're spending a lot of time in the GLImage library, that means that for whatever reason OpenGL is having to convert data from one format to another, or copy it; so when it's spending a lot of time there, it's converting.

When it's spending some time there, it's probably copying; copying can be addressed by using extensions like client storage, and for the conversions you have to understand what your target hardware can natively process; there are a few formats that are well supported, and most people stick to those formats.

There's also other formats that are supported but sometimes it varies across configurations, unfortunately, but it's the capabilities, the native capabilities of the hardware; so if you see this come up then that's a clue that you're converting data and you probably don't want to.

So what happens when you get GPU bound? Well, fragment bound is actually kind of easy to figure out; if you take your window, you shrink it down, and your frame rate goes up, then you're spending a lot of time doing fragment processing.

So that's an easy one to diagnose; fixing it is another story, but it's easy to diagnose. Some things that can make you fragment bound: blending, for instance. It's probably 3X the amount of fragment work, because when you're not blending you just write; when you're blending, you calculate one value, you read what's there, then you combine them and you write back; so blending adds a lot of fragment processing overhead.

Depth complexity also has an effect on your fragment bound performance; obviously, if you can use the depth test or something to discard things early, that's a good thing to do; it reduces the back end of your pipe. Vertex bound on the GPU is probably a little bit more difficult to diagnose, because the number of vertices you issue has a real effect on the number of fragments you're processing, but there are some things that you can do. You can try simplifying your geometry: feed fewer triangles in and see what happens, see if your performance goes back up.

You can simplify your vertex processing if you have a complicated fixed function setup or a complicated shader; you can try using a simpler shader. I mean, your drawing will be wrong, but you can see if your performance goes up and see if that's your problem. One strategy is to reuse shared vertex elements, and by that we mean vertices that are shared between triangles; you can use strips and fans to combat that situation, I guess.

DrawElements: if you have a big set of geometry and you're picking parts of it out, using DrawElements reduces the amount of data that you submit to OpenGL, reducing the amount that has to be copied up to the card every time you submit geometry.

So to sum up here: think in terms of stalling the engine and try not to do it; submit your data in big chunks, with as few function calls to OpenGL as possible; don't flush unless you're in one of the situations that requires flushing. Shadow your GL state: if you keep a copy of what the state is in your application, you can avoid redundantly calling OpenGL with the same state and you can avoid calling glGet. It's a win-win situation.
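A minimal sketch of shadowing one piece of state; the wrapper name and the choice of GL_BLEND are just for illustration.

    #include <OpenGL/gl.h>
    #include <stdbool.h>

    /* Keep a CPU-side copy of the blend enable so we never call glGet/glIsEnabled
       and never re-set the state to the value it already has. */
    static bool gBlendEnabled = false;   /* GL's default for GL_BLEND */

    void setBlendEnabled(bool enable)
    {
        if (enable == gBlendEnabled)
            return;                      /* redundant change: skip the GL call */
        if (enable)
            glEnable(GL_BLEND);
        else
            glDisable(GL_BLEND);
        gBlendEnabled = enable;
    }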

So, and also use these tools we showed you and today if you're here and you have time we'll be in the lab from 2:00 to 6:00. We'd be happy to sit down, there's going to be a ton of us there, probably a dozen; we can sit down with you, run the tools on your apps; you know, we can just talk about how you're using OpenGL if you want to do that or we're there for specific problems or issues that you want to talk over.