Apple Performance Tools - WWDC 2002

Tools • 1:05:25

Want to improve your application's performance in Mac OS X? The Mac OS X Developer Tools CD includes a broad suite of GUI applications and command-line tools for analyzing performance. Learn how to use the tools to inspect and manage memory usage, object allocation, threading, and other factors that affect performance.

Speakers: Robert Bowdidge, Eric Miller, Nathan Slingerland

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Ladies and gentlemen, please welcome Technology Manager of Development Tools, Godfrey DiGiorgi. Good afternoon. It's very good to see you all here again. We have a nice full house. I hope you've been having a great show. Fantastic. (Applause.) In this session, we're going to show you what's probably considered the third major portion of our Development Tools package.

Number one, the integrated development environment, which we showed you yesterday. Number two, the interface design program, Interface Builder, which was in the session just preceding this. And in this session, the package of performance tools that we supply. And to do that, I'm going to introduce Robert Bowdidge, the engineer for Performance Tools.

Thank you, Godfrey. Okay. So, before we start off, let's come down to the point of exactly why you're here. And we all know you're here because you want to learn about the tools. But why are you doing that? Well, the first reason that you want to learn about the tools, or why you should care about them, is that performance should actually be a selling point for your applications.

If you have an application that's slower than a competitor's but has the same features, one that forces users to work slower or doesn't allow them to work with as big a data set, people are going to go over to other applications. And so you want to make sure your application is performing as well as it possibly can. Secondly, most of the performance problems you're going to run into are going to be invisible.

Apart from a few really egregious bugs, you're probably not going to notice that your application is using much more memory than it should. And until you actually start diving in with the performance tools, you might not even notice that you have performance problems. And so you need to care about this, even if you don't think there's an issue.

The third one, which may not be a reason that you actually came here, is that your application is not the entire world. If your customers are anything like me, when they're using your application, they're not just going to be using your application. What else are they going to be running?

Okay, I think I know who's in the front of the audience here. OmniWeb. They're going to be running, well, maybe they'll be running Internet Explorer, though I hope not. Maybe they're going to be running Mail. Maybe they're going to be playing iTunes. They're certainly going to be playing iTunes. And the problem is that your application is going to have to fight with all of those applications for memory.

And so you can't just judge how good your app is by how it's performing on its own, but you need to make sure it's as lean as possible, that it's using as little memory as possible, that it's as efficient with the CPU as possible, so that when I'm listening to iTunes, my music doesn't skip. This is very important, at least to me.

So how do you find these performance problems? How do you actually find out that there's a problem? And that's what I'm here to tell you about today. What you're going to learn is that we at Apple actually have tools that will help you identify that you may have a performance problem and will help you track down exactly what that is.

I'm going to go through some very quick examples of how you might use some of these tools. The tools that you're going to see, first from me, are going to be the tools that are on the Developer Tools CD. After that, we're going to have some folks from the hardware group tell you about some tools for doing even lower-level analysis, for getting information from the PowerPC's performance register so that you can actually optimize every little bit of performance out of your application.

These are tools that are available for download, so these are all tools that you can go off and get. There are a couple of things you're not going to hear here. First of all, you're not going to get a lot of details on, well, if you use this API, you'll be in much better shape than if you use that API. If you want that information, you need to go to the framework talks. You should have been going to them all week. And they can give you those sorts of details.

Secondly, this is just going to be a teaser. I'm not going to give you tutorials. You're probably not going to get enough information here to really become experts at the performance tools. But hopefully you'll know that they exist and you'll have a chance to understand how they might work so that you can try using them on your own. Hopefully some of you will actually be doing that as we go along today.

So what are some of the causes of poor performance? Well, one case is that you have to worry about memory use. Your application may be using more memory than it should. You could be executing too much code, either the wrong code or code that turns out to be more expensive than it needs to be. You might be waiting for other devices. You may be waiting for the disk to come back with events. You may be trying to draw and talking with the window server, which is a separate process that actually does the drawing.

And one of the key issues on any system that has virtual memory is that pretty much all of these problems eventually become memory problems. Because the code that you execute is going to try to sit in physical memory. When you do drawing, you're going to be using physical memory.

When you allocate memory, you're certainly using physical memory. And as a result, when you use too much, when you get to the point where you don't actually have enough RAM in your system, the operating system tries to be very nice to you. What it does is called paging. It'll take some of those blocks out of memory and write them out onto disk, and bring in something else that's needed right now.

And when your application needs that memory again, it'll go back off to disk, copy your stuff in, copy whatever else is out there. And as a result, as soon as your application starts swapping, as soon as it needs to page, your application will start running at the speed of the disk, not at the speed of memory. Because to get that memory that you carefully placed in RAM, you're actually going to have to go off and do several disk accesses. So you want to minimize memory use in all possible ways.

So there's a number of tools available. Here I've divided them into two categories. First, the tools that are used for monitoring, for understanding that you may have a problem, and the tools for doing deep analysis. And what I'm going to do is, for this talk, go through a few of the tools in each of the categories in terms of understanding memory use, in terms of understanding execution behavior, in terms of understanding resource use. Now, to do a good performance talk, we need a victim.

We need some program to analyze to actually look for performance problems. And this year, we are using a program called SimPhysics. And this is actually a project that was written by one of the people in our Developer Tools group, Scott Tooker. And I'd like to bring Scott up on stage right now to actually demonstrate this.

The two things to remember as you see this demo: first of all, this is a relatively realistic example. This is a program that shows you electric field interactions. It was intended to teach students physics. Can we actually switch to the demo machine? Thank you. This is something that Scott actually developed for classroom use. And second, it was actually an app that was ported from Java to Cocoa, and so there are a number of inefficiencies caused by that.

Okay, so here's the view. What you see here are three particles: a positively charged, a negatively charged, and a normal, or neutral, particle. Scott can display the information about these particles in several ways. Here we're showing intensity according to color. It can also show a vector graph, where you can see at each point which way the charge is going.

And one thing that we can see here is that there may be a performance problem, that as Scott drags that particle, the redraw rate tends to be relatively slow. And so this may imply that we actually have something that we need to deal with. Okay. Can we switch back to the slides, please?

Oh, and one other thing that you should know about this. Although this is a realistic example, and some of the problems in here are things that you might actually find in your own code, not all the problems are actually Scott's. There were a number that we salted into the program to make a better demo. So if you see anything that looks really nasty, it's probably my fault because I put it in.

Okay. So how do we actually find performance problems? Well, we found one, we had one way there. We looked at SimPhysics, we dragged and we said, gee, that drawing rate seems a little slow. So one way that you can track down performance problems is just notice things that are egregious, that are just too ugly to live, that you need to deal with.

But that's not the only way. In fact, what you probably need to do in order to really do a good job of performance analysis is measure. You need to actually find out how much memory your application needs, how quickly it needs to redraw, and so on, so that you can at least compare.

So in the case of SimPhysics, one of the things we did which was not displayed there was that we actually kept track of the frame rate. So we could see, as Scott dragged a particle, how many frames per second he could draw. We could then write down that number and actually compare before against after and decide whether we did a good job. Now, the way that I actually prefer to start analysis, though, to at least find the nastier bugs, is to use a command-line tool called top. Could we switch to the demo machine again?

Okay, so here we have Scott actually running top. This is a command-line tool that's available on the system. The output from top can be divided into two parts. At the top is the information about the system as a whole. It tells us how many processes are currently running and how much of the CPU is actually being used. The most interesting line I find here is the bottom line, where we see the number of page-ins and page-outs.

That represents the number of pages that are having to be taken out of memory and written to disk because other stuff is needed to be put into that memory, and the number of pages that are being brought off of disk and into memory because we need that memory again.

So this number gives you an idea about how much paging you're doing. The main number, such as the 4785, represents the number of pages that have been paged out since the system was booted. The number in parentheses represents the number of pages per second that changed. And so if you ever see that number in parentheses go, let's say, above 100 or 200 for an extended period of time, that usually implies that your system is doing nothing but paging.

All it's doing is writing out pages and then immediately running some code that causes another page to be brought in; it's just writing pages out to disk and bringing them back in off of disk. And that usually implies that your application just plain doesn't have enough memory for all the things you're trying to do.
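As a rough illustration of reading those paging counters, here is a minimal Python sketch. It assumes the header line has the form `10547(0) pageins, 4785(0) pageouts`; that layout, the function names, and the sample line are assumptions for illustration, and the 100-pages-per-second threshold simply encodes the rule of thumb mentioned above.

```python
import re

# Assumed layout of the paging fields in top's header; adjust the pattern
# if your version of top prints them differently.
PAGING_RE = re.compile(r"(\d+)\((\d+)\)\s+pageins,\s+(\d+)\((\d+)\)\s+pageouts")

# Hypothetical threshold based on the "above 100 or 200" rule of thumb.
THRASHING_PAGES_PER_SEC = 100


def parse_paging(line):
    """Return (pageins_total, pageins_rate, pageouts_total, pageouts_rate)."""
    match = PAGING_RE.search(line)
    if match is None:
        raise ValueError("no paging counters found in: %r" % line)
    return tuple(int(group) for group in match.groups())


def is_thrashing(rates, threshold=THRASHING_PAGES_PER_SEC, min_samples=5):
    """True if the last `min_samples` rates all exceed the threshold."""
    recent = rates[-min_samples:]
    return len(recent) == min_samples and all(rate > threshold for rate in recent)


# Made-up sample line in the assumed format.
sample = "VM: 2.53G + 47.1M   10547(0) pageins, 4785(0) pageouts"
print(parse_paging(sample))
```

Requiring several consecutive high readings reflects the "for an extended period of time" qualifier: a single burst of paging is normal, sustained paging is the warning sign.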

The second part of top that's interesting is the bottom part, which is focused on each individual process. We see here that there's a column that shows the percentage of CPU used by each process, so we can see which ones are particularly expensive. The second column that's interesting is RPRVT, two-thirds of the way over.

RPRVT stands for the amount of resident private memory used by that application. Resident meaning it's actually physically in memory, so it gives you an idea about the footprint of the application in memory. And private because that memory is only for use by that application. It's not used by any other applications. So we can see here that SimPhysics currently is not doing anything, so it's using 0.0 percent of the CPU.

That's very good. We would prefer that SimPhysics not be doing stuff while my iTunes is playing. And we see that its private memory is about 1.3 megabytes. And Scott, can you actually move something around, just so we see it active? Actually, do the vector graph if you could. Okay.

Interesting. So we can see here that the private memory size is about two megabytes, which seems a little high. I mean, calculators shouldn't take more than like about a megabyte, so that seems a little odd. The second thing we can see is as we're redrawing, we're actually pegging the CPU. We're using 100 percent of the CPU to actually do that drawing, which implies that drawing is pretty expensive. Thank you. Can we go back to slides?

Okay, so the most important things: look at the CPU usage for each process, look at the swapping and paging rate, and look at RPRVT in top. Try to keep it open. It's very interesting. And as you heard, our vice president actually uses it on our applications, so he must be seeing something interesting.

Okay, so we looked at SimPhysics there and we saw that it was using about 2.2 megabytes of memory when it was running. This seems a little high. However, we didn't see if it was growing or anything. You may ask, well, you know, was--did we have a leak? Were we actually allocating an extra 100,000 bytes every minute or so?

And you couldn't really see that in top. However, even though we don't quite have a tool for doing that, there are ways that you can do it. A number of the performance tools are actually Unix command-line tools, and that is very advantageous because it means that you can pretty much roll your own little performance tools as you need to for specific tasks.

For example, if we wanted to see if SimPhysics was actually leaking memory, we could write a little script that would get the output from top, it would search for the line for SimPhysics, save that to a file, and then do that every second. And then we could look at that file over time and see whether we actually saw a change in memory use. Now, these command-line tools don't seem Mac-like.

They're very Unix-like. They're very ugly. You're dealing with text. However, they do have the advantage that you're able to do customized data collection. And these tools still work even if you're running them remotely from a terminal, perhaps because you don't want to upset the screen while you're developing a game, or because your system is hung and the window server won't respond and you're trying to find out what's going on. So command-line tools do have a purpose in performance analysis.
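The leak-watching script described above could be sketched roughly as follows. This is only an illustration, not the script used in the session: it assumes a top that supports `-l 1` (print one sample and exit, as on current macOS), and the function names and log file name are made up.

```python
import subprocess
import time


def find_process_line(top_output, name):
    """Return the first line of top's output whose fields include `name`."""
    for line in top_output.splitlines():
        if name in line.split():
            return line
    return None


def log_memory(name, logfile, interval=1.0, count=60):
    """Append one timestamped top line for `name` to `logfile` per interval."""
    with open(logfile, "a") as log:
        for _ in range(count):
            # Assumption: `top -l 1` prints a single sample and exits.
            output = subprocess.run(
                ["top", "-l", "1"], capture_output=True, text=True
            ).stdout
            line = find_process_line(output, name)
            if line is not None:
                log.write("%d %s\n" % (time.time(), line))
                log.flush()
            time.sleep(interval)


# Example (not run here): log_memory("SimPhysics", "simphysics-top.log")
```

Reading the log afterwards and comparing the private-memory column between the first and last lines gives a crude answer to "is it growing?".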

So we think that SimPhysics may be using too much memory. Why could that be? Well, one reason is that SimPhysics could be allocating huge amounts of memory. It could be allocating large structures. A second possible concern for applications is caching. So let's think of a hypothetical case here. I have an application.

It wants to read a bunch of data in off of disk. So it reads the data in off of disk, puts it in memory, and then says this memory is so important to me, this information is so important, that I don't want to have to read it off of disk when I need it. Instead, I'll read it at the start, save it into memory, and then when I actually need it, I'll go to memory and get it, because that'll be faster. Sounds good, right?

So the problem is, let's imagine that iTunes starts up, or Internet Explorer, or OmniWeb, or Mail, and a bunch of those pages get chased off to disk because that memory's needed by one of the current applications. So that memory you've carefully--that data you've carefully put in memory so that you didn't have to do a disk read suddenly gets written out to disk. And when you actually go to access it, you need to go to the disk again.

And so to save a single disk access close to the use, you end up doing three disk accesses throughout the lifetime of the application. That's not very efficient. So you want to make sure that you're not out creating data, loading it in, and keeping it on hand just in case. Try to cut your memory use by getting rid of those caching cases. The third case is that you could be leaking memory. You could be allocating it and forgetting about it, increasing your footprint.

As I said at the intro, any of this excessive memory use will eventually lead to the fact that you're going to be doing disk accesses, and it'll slow your application and you don't want to do it. So keep your memory as small as possible. But how do you do that?

So the tool for doing that is a tool called MallocDebug. Actually, there are two tools, I should say. The first tool, MallocDebug, allows you to look at allocations according to where in your program they were allocated. The second program, called ObjectAlloc, analyzes memory by looking at how many objects of each type were created. And what I'll do is look at each of these in turn. Okay. First, let's switch over to the demo screen. Okay, so Scott has started up MallocDebug here, and he will create a new window and select the SimPhysics program as the app that he wants to launch.

Okay, and he launches the program. Now, what you can see in MallocDebug is, first of all, up in the upper right-hand corner, we can see that we're using about 1.6 megabytes of-- Actually, go back a little. Or actually, no, that's perfect. Never mind. So according to this, SimPhysics is using about 1.6 megabytes of memory that it allocated via malloc.

MallocDebug is only showing us the memory that we're allocating on the heap. The middle section of the window here shows us a call tree. It describes where in the program we called malloc. So in this case, what we found is that 1.6 megabytes of that memory was allocated in main and below, in the things that are called from main.

Now, because this is an Objective-C program, main immediately calls the function NSApplicationMain. And NSApplicationMain calls several functions, each allocating some memory below that point, with most of the memory actually allocated under -[NSApplication run]. So MallocDebug is trying to help us look at our program in terms of its structure, in terms of what it calls.

Now, this is really nice if you've written a program with a functional decomposition. That is, first you read in the data, then you process the data, then you print out the data, then you close the data. In that case, MallocDebug will tell you exactly how much memory is used in each phase of your program.

Now, SimPhysics is actually a Cocoa program. And so the idea of an object-oriented program with lots of objects calling around means that looking from the top of the call tree down may not be so helpful. Instead, Scott can switch from the standard top-down view of the tree to an inverted view of the tree.
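To make the difference between the two views concrete, here is a small Python sketch that builds both summaries from the same allocation records. The record format (a call stack from main outward plus a byte count), the function names, and the sample data are all hypothetical; MallocDebug's own data model is not described in the talk.

```python
from collections import defaultdict


def top_down_totals(allocations):
    """Bytes allocated in each function and everything it calls."""
    totals = defaultdict(int)
    for stack, size in allocations:
        for function in set(stack):  # count a function once per stack
            totals[function] += size
    return dict(totals)


def inverted_totals(allocations, depth=1):
    """Bytes grouped by the `depth` frames closest to the allocation."""
    totals = defaultdict(int)
    for stack, size in allocations:
        callers = tuple(reversed(stack[-depth:]))  # nearest caller first
        totals[callers] += size
    return dict(totals)


# Hypothetical records: (stack from main outward, bytes allocated).
allocations = [
    (["main", "NSApplicationMain", "ColorGraph.init", "make_bitmap"], 452000),
    (["main", "NSApplicationMain", "GeneralGraph.init", "make_bitmap"], 452000),
    (["main", "NSApplicationMain", "load_nib"], 300000),
]

print(top_down_totals(allocations)["main"])   # everything is below main
print(inverted_totals(allocations))           # grouped by immediate caller
print(inverted_totals(allocations, depth=2))  # one level further up
```

In the top-down summary everything collapses into main, which says little; the inverted summary groups the same bytes by who actually asked for them.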

So rather than looking at how much memory was allocated in main and below, we can see how much memory was allocated via malloc, who called malloc, and how much memory was allocated there. And we can see here that about 1.3 megabytes of the memory on the heap was allocated in calls to NXZoneMalloc, which is one of the wrapper functions around malloc. And 904K of that, interestingly, turns out to have been allocated in NSBitmapImageRep, and of that 904K, which is about half of the memory in this program, all of it was allocated by two classes, the general graph and the color graph.

Okay, so when Scott and I looked at this, we said, "Hmm, this is interesting." It turns out that those bitmaps are used to draw the image of the particles on the screen. So this is the bitmap that you actually see drawn on the screen.

And actually, it turns out that the reason why there's more than one here is that SimPhysics, in order to be clever about caching, tried to allocate both of those bitmaps at the launch of the program, even before we'd chosen which view we wanted: the general graph, which shows the simplest view, or the color graph, which shows us that nice shaded view.

So we were allocating structures just in case we needed them. And as a result, we're using half of our memory in memory that we haven't even touched yet, probably. So this is bad. This indicates something that perhaps we should fix. Maybe we should only allocate those when we actually need them.
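Allocating only when needed might look something like the following Python sketch. The class name, the view names used as keys, and the sizes are hypothetical stand-ins for the real SimPhysics bitmaps, which are not shown in the talk.

```python
class GraphBitmaps:
    """Creates each view's bitmap only the first time that view is shown."""

    def __init__(self, width, height, bytes_per_pixel=4):
        self._size = width * height * bytes_per_pixel
        self._bitmaps = {}  # view name -> bytearray, filled in on demand

    def bitmap_for(self, view_name):
        # Allocate on first use instead of at launch.
        if view_name not in self._bitmaps:
            self._bitmaps[view_name] = bytearray(self._size)
        return self._bitmaps[view_name]

    def allocated_bytes(self):
        return sum(len(buf) for buf in self._bitmaps.values())


bitmaps = GraphBitmaps(width=400, height=300)
print(bitmaps.allocated_bytes())   # nothing has been allocated at launch
bitmaps.bitmap_for("color graph")
print(bitmaps.allocated_bytes())   # only the view actually shown
```

The trade-off is a small delay the first time a view is shown, in exchange for never paying for a view the user does not open.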

Okay, MallocDebug has a number of other features. One thing that you can do is set a mark by pressing the Mark button, marking a point in time. And then you can change from showing all nodes to only showing the things that have been created since the mark was pressed. And this gives you a way to identify all the allocations that occur during a specific period of time. And you don't have to do this.

Another mode that MallocDebug has is the ability to track down leaks. And so Scott can actually choose the definite leaks mode: show me only the things that are leaked data structures. What MallocDebug does here is it goes scanning through the memory of the application, looking for things that look like pointers.

And for everything that looks like a pointer, it tries to decide if that pointer actually points to a malloc block. And if so, it marks that malloc block as reachable. And then what it does is it shows us only the things that aren't reachable. And now that it's done, let's see what it found. Well, it finds that we've got about 900K of memory that's leaked.

And it turns out it was both of those data structures that we noticed. So not only were we creating them just in case, but we were then losing them. Because leak detection is not just saying, maybe you don't use this. We don't even have pointers to these data structures anywhere. And so there's no way we could even call free. So this memory is wasted and it will stay around until the application finishes.
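The scanning idea described above can be sketched on a toy model of memory. This is not MallocDebug's implementation: the heap is modeled as a dictionary from block start address to the words stored in the block, only pointers to the start of a block are recognized, and following pointers transitively from the roots is an assumption about how reachability is decided.

```python
def find_leaks(blocks, roots):
    """Return the start addresses of blocks no pointer-like word can reach.

    blocks: {start_address: list of word values stored in that block}
    roots:  word values found outside the heap (globals, stacks, registers)
    """
    reachable = set()
    pending = list(roots)
    while pending:
        word = pending.pop()
        # Anything that looks like the address of a malloc block counts.
        if word in blocks and word not in reachable:
            reachable.add(word)
            pending.extend(blocks[word])  # scan that block's contents too
    return sorted(set(blocks) - reachable)


# Toy heap: 0x1000 is referenced from a root and points at 0x2000.
# Nothing refers to 0x3000, so it and the block it owns are leaked.
blocks = {
    0x1000: [0x2000, 7],
    0x2000: [0, 0],
    0x3000: [0x4000],
    0x4000: [42],
}
roots = [0x1000, 12345]
print([hex(address) for address in find_leaks(blocks, roots)])
```

Because any word that happens to equal a block address keeps that block alive, a scan like this can miss leaks but should not report a block that is genuinely still referenced, which fits the "definite leaks" wording.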

So by tracking down leaks, we can actually cut our memory use, and in this case, we can cut our memory use by half. And that's very nice. Do you want me to show the memory view? Actually, why don't you do that as well? Thank you. You can also take one of the buffers that you found, and if you actually care about the contents, for example, you want to understand exactly what string is being leaked, you can double-click on something in the lower half, which represents the actual allocations at that point in the call tree, and you can get a hex dump showing you details about that. Thank you, Scott. Could we go back to slides? Okay, and one last thing I should mention: MallocDebug is only showing you the current allocations, so as you free things, those will actually disappear.

A third thing that MallocDebug does that's actually very helpful is it can help you with debugging, not just analysis. For example, pointer bugs can be particularly nasty. They tend to be insidious, they tend to be subtle, they tend to only happen in certain cases. And so you end up with cases where your program just crashes because suddenly you're trying to access some variable that's been trashed.

For example, you may have a case where your application frees memory but continues to write to that memory or read from that memory until whoever now owns the memory decides to read the value you wrote, which has corrupted their data, or writes to those values that you thought were so interesting.

Another case where you can have problems with pointer bugs is if you overrun buffers, if you write past the end of an array and keep writing on to the next data structure, trashing memory. MallocDebug gives you some ways to actually track down such problems. The way that it does it is that MallocDebug uses its own malloc library, and it turns on a couple of special features.

One of those is that every time you free a block of memory, it overwrites it with a garbage value, 0x55, so that if a program continues to access that memory, hopefully the program will get garbage values and will crash. MallocDebug also provides guard words. It places a special word on each end of a buffer so that MallocDebug can detect whether that buffer has been overrun or underrun, whether you've written before the beginning of it or after the end. In MallocDebug itself, you can track down the overruns and underruns by looking for trash nodes.

However, the other trick, overwriting freed memory, will actually cause applications to crash when you do stupid things. And so if your program crashes in MallocDebug but doesn't crash normally, you should attach with a debugger and try to check out what's going on, because it generally means that your program has a pointer problem that you should try to track down so that you can make your program more reliable. MallocDebug also prints out a number of extremely helpful warning messages to the console, or to the standard output. So check the standard output when you're using MallocDebug for tracking down memory bugs.
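Both tricks, filling freed memory with 0x55 and placing guard words around each buffer, can be imitated on a simulated heap. In this Python sketch only the 0x55 fill value comes from the talk; the guard pattern, the class, and its method names are hypothetical and are not MallocDebug's.

```python
FREED_BYTE = 0x55            # fill value mentioned in the talk
GUARD = b"\xde\xad\xbe\xef"  # hypothetical guard pattern, not MallocDebug's


class DebugHeap:
    """Toy heap whose blocks are laid out as GUARD + user data + GUARD."""

    def __init__(self):
        self._blocks = {}  # handle -> bytearray including both guards
        self._next = 1

    def malloc(self, size):
        handle = self._next
        self._next += 1
        self._blocks[handle] = bytearray(GUARD + bytes(size) + GUARD)
        return handle

    def write(self, handle, offset, data):
        """Unchecked store, like writing through a raw C pointer."""
        block = self._blocks[handle]
        start = len(GUARD) + offset  # offset is relative to the user data
        for i, value in enumerate(data):
            if 0 <= start + i < len(block):
                block[start + i] = value

    def raw(self, handle):
        return self._blocks[handle]

    def is_trashed(self, handle):
        """True if either guard word has been overwritten."""
        block = self._blocks[handle]
        return block[:len(GUARD)] != GUARD or block[-len(GUARD):] != GUARD

    def free(self, handle):
        # Scribble over the whole block so stale reads see obvious garbage.
        block = self._blocks[handle]
        block[:] = bytes([FREED_BYTE]) * len(block)


heap = DebugHeap()
buf = heap.malloc(8)
heap.write(buf, 0, b"A" * 10)  # two bytes too many: overruns the buffer
print(heap.is_trashed(buf))
```

A stale pointer that reads a scribbled block sees the repeated 0x55 pattern instead of plausible old data, which is what turns a silent bug into a visible one.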

Okay, the second of the tools for analyzing memory use is called ObjectAlloc. And ObjectAlloc is interesting because rather than trying to describe objects according to where they were allocated, it tries to determine how many objects of each type have been allocated. Because sometimes it's easier to reason about certain types of pointer problems by noticing that. So can we switch back to the demo screen, please?

So once again, Scott can launch SimPhysics in ObjectAlloc, and then he can start it running. And what you should see here is that immediately this graph starts appearing. What ObjectAlloc is showing us here is, first of all, how many objects of each type, each Objective-C type, currently exist.

That's what the current column is. It'll also show the peak number of objects of each type that have existed during the lifetime of the program and the total number of objects that have existed. And it also shows a histogram to give us a graphical way to see that, so that we might be able to detect trends. For example, is the number of arrows increasing over time? So one question we could easily ask ourselves in terms of performance is, are we correctly freeing all the data structures when we close windows? So Scott, for example, could create several SimPhysics windows and then destroy those windows.

Now, for each of those windows, we had to create several data structures. And Scott can sort the list by category, and then we can go and take a look at a couple. For example, the camera view, or the camera. There should be exactly one camera and one camera view per window.

So at a peak, we should have seen six objects, and currently we should see one. However, currently, right now, what we see is six objects of each type. This implies that as we were closing those windows, we were forgetting to actually free up those data structures. And so this implies that we actually have a bug in our cleanup routines.
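The current, peak, and total columns are simple bookkeeping, which the following Python sketch imitates. The class and method names are made up for illustration; ObjectAlloc gathers these numbers by watching the running program rather than through explicit calls like these.

```python
from collections import defaultdict


class AllocationCounter:
    """Tracks current, peak, and total instance counts per type name."""

    def __init__(self):
        self.current = defaultdict(int)
        self.peak = defaultdict(int)
        self.total = defaultdict(int)

    def allocated(self, type_name):
        self.current[type_name] += 1
        self.total[type_name] += 1
        self.peak[type_name] = max(self.peak[type_name], self.current[type_name])

    def freed(self, type_name):
        self.current[type_name] -= 1


counter = AllocationCounter()
for _ in range(6):               # six windows open, one Camera each
    counter.allocated("Camera")
# Five windows are then closed, but the buggy cleanup never frees their
# cameras, so freed() is never called and "current" stays at the peak.
print(counter.current["Camera"], counter.peak["Camera"], counter.total["Camera"])
```

With correct cleanup, five calls to `freed("Camera")` would bring the current count down to one while the peak and total stay at six; a current count that never drops is the signature of the bug described here.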

ObjectAlloc also has another neat feature. If you're a beginning Objective-C programmer, you've probably been caught by cases where you release an object too many times. And on the last release, when you're not expecting it, suddenly your program crashes when you go to look for an object that doesn't really exist. ObjectAlloc also gives you a way to log each retain and release.

And then you can go deeper into ObjectAlloc, ask about specific objects, and find every retain and release that's been done on that object so that you can try to understand what the retain count should be and why an object is being freed before you expect. Back to the slides, please.

They really like my program. Could we please go back to the slides? Thank you. Okay, so those are memory problems. The second category of problems is CPU usage. So what are the possible problems you can have where you're executing code? Well, one thing you could be doing is executing code you don't need to.

Well, how do you find that out? Well, you don't find it out because performance is invisible. How do you know that you're actually executing code you shouldn't? So this is one of the reasons you actually need to analyze what your program's doing. You may have an algorithm that you thought was cheap but turns out to be much more expensive than you thought.

For example, when you actually have a total of a thousand objects, it may be very inefficient compared to a hundred objects. You may have code, or an API that you were calling, that you thought should be cheap but turns out to be extremely expensive. You can't actually detect that unless you measure.

And finally, you might be doing things like polling, constantly asking the mouse, "Hey, what's your position? What's your position? What's your position?" instead of waiting for the operating system to tell you that the mouse has moved, which cuts the amount of work you do. In general, with CPU usage, you want to apply the law of diminishing returns. You want to go after the functions that are the most expensive, because those are the ones where improving their performance will make the biggest difference to your application.

So to find those hotspots, those bits of code that are extremely expensive, you need to use a tool called Sampler. Sampler is a tool that stops the program while it's executing, finds out where in the code it's executing, and then lets it continue. Can we switch over to the demo machine?

Scott can launch Sampler here. And we'll again go with SimPhysics. What Sampler does is sampling. It stops the program every, say, 20 milliseconds or 50 milliseconds, finds the code that's executing, and then lets the program continue. And it sums up that data. In this case, Scott can take a look at a specific part of the program, specifically the drawing in that vector graph, which we found was extremely slow. So what he can do is press Start Sampling for the app, do some dragging so we can find out what's going on there, and then press Stop Sampling. And Sampler again gives us a call tree representation of the code that was executing.

In this case, the numbers to the left of the functions, rather than being the amount of memory, represent how often that code was found executing. The 500 represents 500 samples: 500 times the program was stopped, the program was in main. Scott can also switch it to seconds, which should be a little more helpful because that's a unit that we actually care about as humans. And we can see here that we sampled for about 10 seconds, and there were two threads that we found executing. Let's look at that second one, Scott. What we find is that this thread was spending pretty much all of its time--actually, click down a couple of layers.

During all ten seconds that we were looking at this application, we found that the program was running in this Mach message trap. Now, it could be that it was actually calling that several times, but odds are what was happening was it was sitting there waiting for a message and just never came back. So the second thread probably was doing no execution.

So as a result, we just want to pull that out of the graph because we don't care. Scott can press the-- Now we get to the main thread. And we can see here that we spent about ten seconds of time. The information on the left, excuse me, on the right actually shows a sample call stack, so we get some idea of what code was executing. And we can see here that all ten seconds we were in main. That's not surprising. Pretty much all the time we were in CFRunLoopRun, which turns out to be one of the APIs for doing run loops.

And of that time, about three seconds was in Mach message, which is one of the kernel routines for doing inter-process communication. This is probably when the application was asking, "Hey, has anything happened? Has the mouse moved?" And so pretty much this was when the application was just sitting. So we can prune that out as well.
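Pruning the waiting time out of a profile amounts to dropping the samples that pass through the idle routine and looking at what is left. A minimal sketch, assuming samples are stored as tuples of frame names; `prune`, `share`, and the frame spellings are hypothetical, and the 3-to-7 split mirrors the three idle seconds and seven busy seconds described above.

```python
def prune(samples, frame):
    """Drop every sample whose call stack passes through `frame`."""
    return [stack for stack in samples if frame not in stack]


def share(samples, frame):
    """Fraction of samples whose call stack passes through `frame`."""
    if not samples:
        return 0.0
    return sum(frame in stack for stack in samples) / len(samples)


# Hypothetical data: 3 idle samples, 7 samples doing real work.
samples = (
    [("main", "CFRunLoopRun", "mach_msg")] * 3
    + [("main", "CFRunLoopRun", "drawRect")] * 7
)
busy = prune(samples, "mach_msg")
```

After pruning, percentages are relative to the time the application was actually working, which makes the expensive functions stand out more clearly.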

The remaining seven seconds, we were down in CFRunLoop in some of the code for actually handling window displays. And Scott can go up that. And we can see that we were in NSView drawRect, so we were actually drawing the rectangle that represents the view. And all that code was in the camera view, which is actually in the code that is SimPhysics. And we find, of the six seconds that we spent drawing that view, 5.88 seconds of that was drawing vectors, that is, drawing those arrows.

Hmm, that's interesting. Ninety-nine percent of our time we were drawing arrows. And of that 99 percent of the time that we were drawing arrows, three seconds of the time we were in arrow, draw arrow for vector, and then we were in a number of other functions. So this tells us that the reason why we were having so much trouble dragging that particle around was that drawing the arrows was extremely expensive. And there's a few reasons for this. If we actually click down, you can see here that some of the time was actually spent in the NSBezierPath class. That's the AppKit class that you use for just general string--or general line drawing.

The reason why is because the arrows are actually a little more complex. Not only are they just straight lines, but there's also a curved part for the arrowhead. We also fill in the arrow to make the nice arrowheads. And as a result, drawing those arrows tends to be relatively expensive.

So one thing we could do to improve this is we could actually try to simplify the drawing of the arrow. We could try to draw fewer arrows by spacing them out further. Or we could actually try to find the cases where we're drawing, let's say, very small arrows. Where we don't really care about having all that detail and instead just draw a little line showing which direction. So with this information from Sampler, we can get several ideas about how we should improve the algorithm to try to make the screen redraw faster. Thank you.
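Two of those ideas, drawing plain lines for small arrows and spacing the arrows out, could look something like the sketch below. It is not the SimPhysics code; the names `arrow_style` and `thin_grid` and the 8-pixel threshold are made up for illustration.

```python
import math

MIN_DETAILED_LENGTH = 8.0  # pixels; hypothetical cutoff for full arrowheads


def arrow_style(dx, dy, min_length=MIN_DETAILED_LENGTH):
    """Pick a cheap plain line for short vectors, a full arrow otherwise."""
    length = math.hypot(dx, dy)
    return "line" if length < min_length else "arrow"


def thin_grid(points, step):
    """Keep every `step`-th sample point so fewer arrows get drawn."""
    return points[::step]
```

The threshold is a tuning knob: the right value is whatever size the arrowhead stops being visible at, and the profile should be taken again afterward to confirm the change helped.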

Can we go back to slides? Okay. The third category that we can look at is use of resources. And I can think of at least two cases of that. One is that we can go for resources such as devices. For example, are we accessing the disk incorrectly or accessing files?

This can be particularly nasty because when you go out to get a file, such as a preference file, let's say, you can't actually guarantee it's on a local machine anymore. It might be on somebody's iPod. It could be out on the network on somebody's iDisk. And so getting files could be extremely expensive.

And so--and unless you actually know which accesses you're doing, you might not have a way to track that down. The second case of resource use is that your application may not be just your code. There are other parts of the system that are involved with getting your application to run. Two examples of that.

One of those is the ATS server. So in order to actually render fonts, to create the fonts, your application actually talks with a separate process called the ATS server to actually do that work. When you're doing drawing, the window server's actually responsible for doing your drawing. You actually send messages to it and it actually does the work.

And so when you actually look at your program on top, you need to look at those processes as well, because your application may only take 50 percent of the CPU, but if the window server's taking the other 50 percent of the CPU, you may have a problem in your code.

So to track down--I'll tell you about one of those in particular. How do we track down information about how we're using the disk? For example, let's imagine that we did something in SimPhysics to load values on startup. And we noticed that loading all the information off of disk tends to be slow.

One way that we can track that down to decide if that's a disk problem is we can use a utility called fs_usage. This is a command-line tool. What you do with fs_usage is you need to run it as root and name a process, and it will actually give you a running log of all the file system accesses done by that process.

So it'll actually say you did a read, you did a write, here are what the parameters are, here's how long it took, here's whether it was blocking. And with that information, you should be able to tell which files you're opening and how long it was taking to access those files.

Once you've done that, there's actually another mode of Sampler that will help you track down exactly where in your code you were calling that. Because you don't actually get the full call-- you don't actually know where you called it. What you can do with Sampler is use a mode that says look at file system accesses. And every time that you call open, read, write, et cetera, Sampler will automatically get the backtrace to that call from your code so that you can find out where in your code you might be calling open and how many times.
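The idea of capturing a backtrace at every file open can be imitated in a script. The sketch below is only an analogy, not how Sampler or fs_usage work: it temporarily wraps Python's built-in `open` and counts which function asked for which file. The names `trace_opens` and `load_preferences` and the temporary file are hypothetical.

```python
import builtins
import collections
import contextlib
import os
import tempfile
import traceback


@contextlib.contextmanager
def trace_opens():
    """Count (calling function, path) pairs for every open() in the block."""
    callers = collections.Counter()
    real_open = builtins.open

    def traced_open(*args, **kwargs):
        # The last stack entry is traced_open itself; the one before it
        # is the function that called open().
        caller = traceback.extract_stack()[-2]
        callers[(caller.name, str(args[0]) if args else "")] += 1
        return real_open(*args, **kwargs)

    builtins.open = traced_open
    try:
        yield callers
    finally:
        builtins.open = real_open  # always restore the real open()


def load_preferences(path):
    """Hypothetical loader that re-reads its file on every call."""
    with open(path) as f:
        return f.read()


fd, prefs_path = tempfile.mkstemp()
os.close(fd)
try:
    with trace_opens() as callers:
        load_preferences(prefs_path)
        load_preferences(prefs_path)
finally:
    os.remove(prefs_path)
```

A count greater than one for the same caller and path is the kind of finding this is meant to surface: the same file being opened repeatedly when it could be read once and cached.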

The second case of resources is, could we be drawing too much? And how do we actually track that down? Now, drawing is particularly nasty because it affects performance in so many ways. When you're deciding to draw, you end up using CPU time, you end up using memory time, you're dealing with devices, and as a result, it can be extremely costly.

Even just having windows that are off-screen requires the window server to allocate memory for those. And so you're increasing your application's memory use by having the windows. So how do you actually track down how you're doing drawing? Drawing is kind of hard, because although you can see some of what's actually changing, you don't actually know what's being redrawn. And luckily there's a tool that can help. Can we go to the demo machine again?

The tool provided by the Core Graphics team is called Quartz Debug. And Scott here can turn on an option in Quartz Debug called Flash Screen Updates Yellow and then can do some actions. Here we see, as Scott moves the window, we can see that the window server needs to actually redraw the area around the window and parts of the window title, I believe. So redrawing a window, moving it around, is relatively cheap. Now if Scott tries to resize that window, we can see that when we actually resize the window, SimPhysics is constantly redrawing the contents of that window to make up for the size change.

So this shows one of the issues that you need to worry about with your application. You may not have realized how much work was being done when resizing a window, but Quartz Debug makes it obvious. It makes the invisible visible. And in this case, perhaps SimPhysics should be a little more clever about how it does redraws. For example, maybe it shouldn't automatically try to re-lay out the entire window every time that it gets a change event saying that the window size has changed.

We can also find out how much memory we're using for windows by pressing the Show Window List button. And here we get a running total according to each application of which windows exist and how much memory they're using. In this case, we'll actually see several windows for SimPhysics, even though we only have one main window open, because there are other things that count as windows, such as the icon in the dock, such as the menu bar, and so on. Thank you. Can we switch back to slides?

Okay. So those are the three major categories of performance problems. And so we can ask ourselves, after looking at that and looking at some particular problems, what would happen if we actually fixed those? And luckily Scott actually did that. Thank you, Scott. Yeah. One clarification. It looks like I made things worse for the plain case. I'll explain it. Thank you.

So... So what we've done here is we actually measured before and after. And the before case is actually kind of interesting on the plain view. One of the things that we noticed as we were actually redraw--as we were optimizing this is that we were getting these insanely high frame rates when we were actually moving around with the plain view.

In this case, we saw 35 frames a second. On a faster machine, we saw much more than that. And because we're not playing Quake, we really don't care about high frame rates. And so, as a result, there's probably no reason to actually redraw at 35 frames a second for this program. So instead, we throttled it down and we made it so that the plain view was only going at 20 frames a second.
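Capping the redraw rate like this can be done by refusing to draw again until enough time has passed since the last frame. The sketch below is an assumption about one way to do it, not the actual SimPhysics change; `FrameThrottle` is a hypothetical name, and timestamps are passed in as milliseconds so the behavior is easy to check.

```python
class FrameThrottle:
    """Allow at most `max_fps` redraws per second."""

    def __init__(self, max_fps):
        self.min_interval_ms = 1000.0 / max_fps
        self.last_draw_ms = None

    def should_draw(self, now_ms):
        """Return True if a redraw is allowed at time `now_ms`."""
        if (self.last_draw_ms is None
                or now_ms - self.last_draw_ms >= self.min_interval_ms):
            self.last_draw_ms = now_ms
            return True
        return False


# Simulate one second of mouse-drag events arriving every millisecond.
throttle = FrameThrottle(max_fps=20)
frames_drawn = sum(throttle.should_draw(t) for t in range(1000))
```

Events that arrive between allowed frames are simply skipped, so the CPU time that would have gone to extra redraws is left for other work.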

So we weren't saturating the CPU, but we were still giving enough of a frame rate change so that you actually got the feeling of direct manipulation. For the case of the vector graph, the intensity arrows, though, we went from a case where we were doing about two frames a second, according to the measurements we were doing, up to about 15 frames a second. And that was because we changed how we were drawing arrows, changed to drawing only little lines when we were doing small arrows, and so on. For doing that intensive--or the color graph, we ended up going from about 2 frames a second to 7.

And by getting rid of those memory leaks that we found, we ended up--and by getting rid of several other data structures that were a result of porting from Java to Cocoa in a relatively simple way, we managed to cut the memory use of the application from about 2.5 megabytes before to 1.1 megabytes after. So if we could do this with only a few hours of looking at Scott's app, imagine what you could do with your application.

Now, I didn't go through all the possible tools. Here are some of the others that you may want to look at. For looking at the heap, there's three command-line tools that may be interesting: heap, leaks, and malloc_history. All of these have man pages. Go check them out. For looking at process state, there's a program called sample, which is like Sampler, except it works on the command line and puts out a text file describing what it found running. This is really good when your application hangs.

You can go to a terminal, type sample, process ID or process name, ten seconds of sampling, sampling every ten milliseconds, and get a report like that. You can also, if you're doing command-line tools, use all the Unix tools that you may be used to for doing performance analysis, such as time, where you can say time and give it a command line, and it will tell you how long it took to actually get that task to-- For understanding resources, we have tools such as iostat. This actually was not working on 10.1, but works on Jaguar. And this will actually tell you how many file system accesses you're doing and that sort of thing.

Okay, the tools that I have shown you so far will work on any binary. You may need to launch them inside the tool for them to work, but in general they will work on anything. You don't need to recompile them. You don't need to include some special debugging library. It just works. And this makes it very easy to get their performance information. You have no excuses. If you're doing development with CFM or with CodeWarrior, how many people are doing that?

Okay, so for all of you, make sure that when you compile your applications in CodeWarrior, that you turn on the option that says that you want to compile your app with in-line traceback tables. If you do this, then all the performance tools can actually get your symbolic data out and can tell which of your functions was found executing or was found at the bottom of a malloc stack or whatever.

If you are using CodeWarrior's version of the standard library, make sure to compile it with the option set so that you use the system's version of malloc rather than Metrowerks' version of malloc. In that way, you can actually use MallocDebug to analyze stack accesses. Otherwise, we never actually see all the accesses.

There are a number of sources for documentation about these tools. First of all, they were all in the Developer Tools CD, so you can go off and use them. The documentation for the command-line tools is in man pages. The documentation for the graphical tools is within the applications themselves in a help menu. There's also a release notes document in each Developer Tools release so that you can find out about the latest changes.

There's also two books you should look at if you're interested in performance. The first is the Inside Mac OS X Performance Book, which will tell you about the tools and give you some details about understanding how the Mac OS X operating system works. The second book, the System Overview, will give you lots of just sort of general details about Mac OS X and its architecture. So in conclusion, you want to tune your app to make the best impression, so go and do it.

Most of the problems that you're going to encounter are going to be a side effect of memory, just because of the way that systems using virtual memory work. So make sure to cut your memory use in all its ways, whether that's the memory that you allocate via malloc, whether it's the amount of code that you're actually touching and executing, whether it's the amount of drawing you're doing.

Make sure that you actually can compare performance, so you can actually figure out if you're making a difference. Find some metric to measure, measure that on every build, and compare your results across every build. And remember that your application is not just your application, but you also need to watch the other applications that might help it, and you also need to find out if your application behaves well on a system that has lots of other things running. And go out there and create some great Mac OS X apps.
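Comparing a metric across builds can be as simple as keeping the previous build's numbers and flagging anything that got noticeably worse. A minimal sketch, assuming every metric is "lower is better"; the function name `find_regressions`, the metric names, and the 5 percent tolerance are all hypothetical.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return metrics that got worse by more than `tolerance` (a fraction).

    Both arguments map a metric name to a measurement where lower is
    better, such as seconds to launch or megabytes of memory in use.
    """
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is not None and new > old * (1 + tolerance):
            regressions[name] = (old, new)
    return regressions


# Hypothetical measurements from two builds.
previous_build = {"launch_seconds": 2.0, "memory_mb": 1.1}
this_build = {"launch_seconds": 2.5, "memory_mb": 1.1}
worse = find_regressions(previous_build, this_build)
```

The tolerance is there because measurements vary a little from run to run; without it, every build would report noise as a regression.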

Now, if you have used all these tools, and you've gotten all the information you can out of Sampler, the question is where you go from there. And what I'd like to do at this point is bring on Eric Miller from the Performance and Architecture Group in Hardware to tell us about the Computer Hardware Understanding Tools to find out how you can get information from the performance registers to further tune your code. Thank you. Is this on? Thanks, Robert. Good luck. That's a lot of stuff to memorize.

So, Computer Hardware Understanding Development Tools, we like to call them CHUD tools. There's just a couple things we want to go through. So, well, first I want to kind of tell you what you can do with our tools. You can find out what's going on with the hardware through the CHUD libraries. You can find critical code segments with the tools we supply, and once you've found those critical code areas, you can then use other tools that we supply to help you find the right way to tune those segments.

And then also, the CHUD tools are system-wide tools, whereas you saw Sampler sampling a particular process. The CHUD tools can do both a single process and they can also sample the entire system. So you can see if you have issues in your code or perhaps you're limited by some system facility or some other application using resources.

So, in order to find out what the hardware is doing, the engineers at Apple and our processor suppliers like Motorola kindly provide us with performance monitor counters built right into the processors and built into the memory controller. So what we do with the CHUD tools is provide a mechanism to access all that information and bring it to you in a human-readable format.

So you can basically count what are called performance monitor events. Events like instructions completed. So every time an instruction completes, one of the performance counters will count one or three if you set it up to do so. If there's cache misses, you can count the number of misses that occur. And these can be combined in different ways to tell you statistics about the system at the very hardware level. So in effect, you can read your system directly.

You don't have to worry about guessing or having a model or a simulation. You can use your own system. So the CHUD tools provide an overall framework and some tools that bring this all together and make it a bit simpler. Otherwise, you'd have to write your own tools from scratch, a lot of assembly language, and calling out to other kinds of systems. That might be difficult. One of the tools we don't really have time to demonstrate is Monster. Monster is a very straightforward way to access the performance counters.

So you're going to be able to collect data from the counters in several ways. One of the things that CHUD Tools provides throughout all of our tools is a hot key facility, which is part of Cocoa in the core foundation. We wrap it up in CHUD so that you can access it through your tools, so they don't have to have focus in order to start and stop the collection of information.

You can also use a timed interval mechanism that we have in the CHUD tools. We have it down to approximately 100 microseconds level. You can take a sample over time from the performance monitor system. You can also measure event counts in relation to other event counts and create histograms with Monster.

If you're computing performance metrics with Monster, like memory bandwidth, say, you can collect the number of bytes being transferred on the memory bus. Then you can also collect these samples over time. And then you can compute bytes over time, bytes per second, say. You can compute cycles per instruction by collecting cycles on one counter, instructions on another, and doing the math between them. Monster has a very strong capability to do this. It has a built-in calculator that lets you create your own, what we call shortcuts.
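The arithmetic behind those two derived metrics is small enough to write out. This is a sketch of the math only, not Monster's shortcut syntax; the function names and the example counts are made up.

```python
def cycles_per_instruction(cycles, instructions):
    """Cycles per instruction from two counters read over one interval."""
    return cycles / instructions


def bytes_per_second(byte_count, interval_ms):
    """Memory bandwidth from a byte count collected over `interval_ms`."""
    return byte_count * 1000 / interval_ms


# Hypothetical counts from a single 10-millisecond sampling interval.
cpi = cycles_per_instruction(cycles=30_000_000, instructions=20_000_000)
bandwidth = bytes_per_second(byte_count=8_000_000, interval_ms=10)
```

Both values only make sense when the counters involved were read over the same interval, which is why the counts are collected together and combined afterward.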

These are not too hard to do. I'll just go through a couple of slides on it. So what's going on here on this slide is that Monster has been set up to count a number of things on the performance counters. In orange, you can see in the left column that the third and fourth performance monitor counters are measuring floating point unit instructions and instructions completed.

And then below the listing of PMCs, you see a section that says shortcuts. These are calculations that will be performed on the performance monitor counts after they're collected over time. So the second part of the screen on the right shows a single run with lots of intervals, perhaps 10 millisecond samples.

In these intervals, you can see that the third and fourth columns are highlighted in orange. So we've collected the various floating point unit instruction counts and the overall instructions completed counts. You can see at about the sixth sample that it starts to kind of regularize. We have 2 million, 7 million, 8 million FPU counts. Again, 23 million, 35 million, 36 million overall instructions completed.
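A percentage shortcut over those two columns is just one count divided by the other, interval by interval. The sketch below uses the rounded figures quoted above and assumes they pair up in the order they were read out; the name `percent_fpu` is hypothetical.

```python
def percent_fpu(fpu_count, instructions_completed):
    """Floating-point instructions as a percentage of all instructions."""
    return 100.0 * fpu_count / instructions_completed


# Rounded counts as quoted in the talk, assumed to pair up in order.
fpu_counts = [2_000_000, 7_000_000, 8_000_000]
completed = [23_000_000, 35_000_000, 36_000_000]
percentages = [round(percent_fpu(f, c), 1)
               for f, c in zip(fpu_counts, completed)]
```

With these inputs the share of floating-point work rises across the three intervals, from under a tenth to roughly a fifth of all instructions completed.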

So then if you were to just slide over there to the rest of the...

[Transcript missing]

So, I don't have too much to say about Monster, except, oh yeah, you can graph the things that you plot by selecting columns. So what I did was the percent floating point instructions are selected so you can see the little graph over time of the percentage, how it varied.

So you can see there was some work done for a few samples, and then a drop in the amount of work and back up, draw a little drop, back up every four or five intervals. At this time, I want to bring up my colleague, Nathan Slingerland, so we can do a live demo of one of our other tools called Shikari.

Thank you, Eric. So, actually, can we go back to the slide for a second? Talk a little bit about it first. Okay, so Shikari is a system-wide profiling tool. It's similar in purpose to Sampler, except that besides time, you can sample on any of the performance events for those performance monitor counters that Eric was talking about.

So this lets you correlate these events to your source code. So if you want to know where cache misses are coming from or alignment exceptions or, you know, instruction counts, any of those things, you can find that out. And then using Shikari, you can get an annotated list of any function, and it'll tell you exactly how many cache misses come from that piece of source code. So, yeah, let's switch over to the demo machine.

It's probably the best way to see exactly how this is used. So in order to demonstrate Shikari and also to show you how you can solve a real performance problem with it, we're going to look at Flurry. That's a colorful, open source, OpenGL screensaver by Calum Robinson. That's what it does.

So unless you look at this with a sampling tool, you don't really know what's going on in the system. So because it is a screensaver and because it's OpenGL, we might be processor-bound, we might be graphics card-bound, we don't know. So in order to see what's going on, let's bring up Shikari.

So what we see here, this is the main Shikari window. And this list, this table, when you sample, is going to give you a profile listing from most frequently seen to least frequently seen function names. And we also have built in a bunch of very useful sampling configurations. You're free to create your own. There's a facility in here to do that.

But in the most basic way, we just want to see a time profile at first. And, you know, that's really the mantra of performance optimization: profile, profile, and profile again. You want to keep seeing, when you change your code, what really is happening in the system each time. So because Shikari also uses the hotkey facility that Eric mentioned before, it doesn't have to be in the foreground. It can sit in the background. And that works well for a full-screen app. So he's going to use the hot key.

And actually, the sampling itself is pretty low overhead. It's only when it starts looking at symbols and everything that it actually starts to get in the way. So what's interesting here, we see that square root, system square root, is taking up 40% of the time. And that's in the whole system, right?

If we actually -- we see that flurry is 99% of the system, and we saw the other half was .8% there, our other piece. So 50% -- or 40% of the time in square root. Now, Mac OS X now comes with a very fast and very precise square root routine.

But -- so it's IEEE 754 compliant and 53 bits of precision. And that's great, except this is a screen saver. So we might be able to get away with a little less precision there. So if -- let's take a look at the source. So this is the original source code. We can see that we have a little bit of a delay. We're -- we got a division there and a square root based on that result -- or a square root, and then we're dividing with that.

But -- What this actually is is a very good opportunity to use the floating-point reciprocal square-root estimate instruction that's available on any G3 or G4 processor. So there we are, in-line assembly. Now, I should emphasize that this is not always appropriate for your code, and usually you probably want the full result for square root. But in our case, this is approximately five bits of precision, and it turns out to be plenty. It's, you know, visually, that's all we care about. It looks great. So let's see how long it takes.
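To get a feel for how coarse a five-bit result is, the effect can be imitated in software. The sketch below does not execute the PowerPC instruction: it computes the exact reciprocal square root and then throws away all but five mantissa bits, which is one assumed reading of "five bits of precision". The name `coarse_rsqrt` is hypothetical.

```python
import math
import struct


def coarse_rsqrt(x, mantissa_bits=5):
    """Exact 1/sqrt(x), truncated to a few mantissa bits.

    This only mimics the precision of a low-accuracy hardware estimate;
    it is not how the hardware computes its result.
    """
    exact = 1.0 / math.sqrt(x)
    # Reinterpret the value as a 32-bit float and clear the low mantissa bits.
    bits = struct.unpack(">I", struct.pack(">f", exact))[0]
    drop = 23 - mantissa_bits  # a 32-bit float stores 23 mantissa bits
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def relative_error(x):
    """Relative error of the coarse estimate against the exact value."""
    exact = 1.0 / math.sqrt(x)
    return abs(coarse_rsqrt(x) - exact) / exact
```

With five mantissa bits kept, the relative error stays under about 3 percent (2 to the power of -5), which gives a sense of why the result can still look fine on screen while being unsuitable for numerically sensitive code.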

Oh, I should tell you exactly what our metric is here. So it took 12 seconds, and we're measuring how long it takes to render 500 frames. And we've modified the original flurry a little bit so that it's deterministic, and every time you get the same type of pattern on the screen and like that. So let's hit Go here. We'll see how long that takes.

Six seconds. So that's a 2x speedup, right? So not bad. Now, okay, so we want to profile again. We want to see where we're actually spending time now. And this is an iterative process. You keep just banging on it with your profiling tool to see what's happening. And we expect not to see square root in there.

Sure enough, it's not there. So now our functions inside of Flurry, updateSmoke and drawSmoke, are up. So because we know that updateSmoke and drawSmoke are floating-point intensive, we can use another one of Shikari's features, and that is to tell us where floating-point issue stalls are coming from. And that's just another configuration here. Run the demo again. And it's the same thing. So you start with a hotkey.

So, like we might expect, updateSmoke has the majority of the floating-point issue stalls. And what Shikari will let you do is drill down inside of that function and see the disassembly of it. And besides that, it's going to visually show you where your hotspots are. In this case, a hotspot will be orange, and the colder spots, which weren't sampled, will be blue. So this is statistical sampling, so you're not going to see an exact execution sequence here.

So if we go down here near the bottom, we see, wow, floating point move register. Yeah, it's dependent on that floating point reciprocal estimate. Yeah, okay. So most of the floating point issue stalls stem from there. So let's go see what is that. If we can bring it back. Shikari, you gonna... Shikari back. It's line 279 in the source code. So now we can go back to the source.

So that's this loop right here. And so what can we do with this? This is a scalar floating-point loop. One thing we might do is use software pipelining, or if you attended the AltiVec session, it's called data injection. And that is unroll the loop. Give the processor more to do that's not dependent on results, so you're not serially dependent there.

Or maybe we'd like to vectorize it. It turns out that this algorithm is quite vectorizable. So what we did is we did both. We unrolled and we vectorized. Here we are. That's the new source code. So let's bring up Shikari and ask it where--well, let's actually see Flurry first. Let's see how fast it is now.

Okay, so three and a half seconds. So, not bad. Almost another 2x speedup. So, again, maybe we could make Flurry even faster, right? Maybe we would want to look at, there's an AltiVec issue stalls preset. Or maybe we want to try multi-threading this. I mean, if we have a dual processor machine, we might be able to take advantage of that second processor. But before we go forth and invest in all that work, let's take another time profile during the AltiVec rendering.
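The speedups quoted in the demo follow directly from the three timings for rendering 500 frames: 12 seconds, 6 seconds, and 3.5 seconds. A quick check of the arithmetic, with hypothetical function names:

```python
FRAMES = 500  # frames rendered in each timed run, per the demo


def speedup(before_seconds, after_seconds):
    """How many times faster the new run is than the old one."""
    return before_seconds / after_seconds


def frames_per_second(seconds, frames=FRAMES):
    """Average frame rate for a timed run."""
    return frames / seconds


estimate_speedup = speedup(12, 6)                # after the square-root change
vector_speedup = round(speedup(6, 3.5), 2)       # after unrolling and vectorizing
overall_speedup = round(speedup(12, 3.5), 2)     # original versus final
```

The second step is a little short of 2x, at about 1.7x, which is what "almost another 2x" refers to, and the two steps together come to a bit under 3.5x.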

Okay, whoa. Driver. Can't do a lot about that. So the driver is actually dominating. I mean, it's taking about a third of the time. Our code is still important, but it's less so now, right? So the better the job you do at optimization, the less you're going to see that show up in the profile.

But it turns out, maybe now it's time to review how we actually implemented Flurry. Actually, we didn't. Calum Robinson did. But, you know, it'd be time to go back and look at that. Because we've talked with our colleagues in OpenGL, and they've told us, yeah, probably there are better ways to render this than drawing it as is done right here.

So, you know, you'd go and try and fix how you're interacting with OpenGL instead. So at this time, could we go back to the slides? Just to give you a taste of what Shikari can do. And of course, you can use any of the performance events that are available on the memory controller or the CPU.

So underlying Monster and underlying Shikari is the CHUD framework. And this allows you to write your own performance tools. All the functionality is there. We're not hiding anything from you from what we use. You can access the performance monitor counters, special purpose registers on the chip, collect information about the hardware. It's not that you couldn't do this on your own. It's just the framework tries to consolidate this into an easy-to-use, simple C API.

And--or the other thing you can do is to instrument your code. Like, if you know you have an important function, or Sampler or Shikari tells you you have an important function, and you really just want to know how many cache misses are inside of this function or something like that, you can put CHUD calls inside of there. There's also a remote facility that'll allow you to start and stop an application like Monster from within your application. So you can just capture a small bit of what you're interested in.

So here's an example of that, just to see. So call CHUD initialize. Everybody has to do that when they use the framework. Get some information about the processor and the bus frequency and like that. And then next comes the more important part. We're gonna set up the counters. We tell it what event we want and which PMC.

So these are on the CPU. Clear the counters, start the counters, and then do what you're interested in, right? So that's where--whatever you're interested in is there. Stop the counters right after that. And then have some kind of reporting function that's going to list out what was in the PMCs over time.
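The clear, start, work, stop, report sequence is a general instrumentation pattern, and its shape can be shown without the real framework. The sketch below does not use the CHUD API and none of its names come from it: `StandInCounter` is a hypothetical stand-in that counts elapsed nanoseconds instead of a hardware event, and `measure` is a hypothetical helper.

```python
import time


class StandInCounter:
    """Hypothetical stand-in for a hardware counter.

    It accumulates elapsed nanoseconds between start() and stop() rather
    than a real performance monitor event.
    """

    def __init__(self, clock=time.perf_counter_ns):
        self._clock = clock
        self._total = 0
        self._started_at = None

    def clear(self):
        self._total = 0
        self._started_at = None

    def start(self):
        self._started_at = self._clock()

    def stop(self):
        if self._started_at is not None:
            self._total += self._clock() - self._started_at
            self._started_at = None

    def read(self):
        return self._total


def measure(counter, work):
    """Clear and start the counter, run `work`, stop, and report the count."""
    counter.clear()
    counter.start()
    try:
        result = work()
    finally:
        counter.stop()  # stop even if the measured code raises
    return result, counter.read()


result, elapsed_ns = measure(StandInCounter(), lambda: sum(range(1000)))
```

Keeping the start and stop calls tight around the code of interest is the point of the pattern: anything else that runs between them gets counted too.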

Okay, in the CHUD Tools package, there are also a few other useful tools, especially if you're doing things like AltiVec. There's Amber, which is an instruction tracer. It's going to record the exact sequence of instructions that your program goes through. ACID, or SimG4, takes the output of Amber, and that we have a, it's a G4 or 7400 simulator.

Or ACID can give you information about the, like a newer processor like the 7450. It's not a simulator, though, but it can tell you about dependency stalls and how, you know, different kinds of latencies. And then Reggie is a tool to look at and modify special purpose registers.

So where to get it? This is the website where you can download the CHUD Tools package. I should tell you right now that the package that's on the FTP site right now will not run in Jaguar. It won't hurt Jaguar, but the tools won't work. But you can expect within the next few days to see a package. You want the Chud 200 beta package. There's a beta up there right now.

And that has a lot of features that have been added very recently. And like I said, you can expect to see a version you can use with your new Jaguar CDs very soon. And any issues you can report to CHUD Tools feedback. So at this time, I'm going to-- let's bring Godfrey DiGiorgi back up to run-- to have a few more slides.

Thank you, Nathan. And a big hand for our presenters. I think they've done a fantastic job today. So more information. I listed out the two manuals that are available. They're both available on your disk in PDF format in the Developer Documentation Essentials Directory. You can also order them from our website.

If you go to the TechPubs website listed above, you can go to our supplier for print-on-demand. We, of course, have the Connection Tools page and a link to the bug reporter. So if you find problems with any of the tools, we want to hear that feedback. We want to hear the bugs and see them in our system.

[Transcript missing]