
WWDC03 • Session 305

Tuning Software with Performance Tools

Apple Developer Tools • 1:03:12

You've brought your application over to Mac OS X and now it's time to make it fast! The Apple Developer Tools include a broad suite of tools for profiling and analyzing applications. This session explains how to use them in combination with concepts of performance tuning to improve the performance of your Mac OS X application. Developers will take away a checklist of performance concepts and learn how to use the tools to profile application behavior, inspect and optimize memory usage, object allocation, threading, and other factors that affect performance.

Speakers: Dave Payne, Robert Bowdidge, Nathan Slingerland

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.

Thank you. So let's get the clicker working. It just worked when I was last up here. Okay, let's talk about what we're going to cover in this session. First I'm going to talk a bit about some performance analysis concepts. As you're going through the process of thinking about making your application fast, what are some of the things you want to consider in the process of that? Then we'll take a look at some specific examples of uses of two different classes of performance applications on specific test cases.

So we've got some high-level tools: Sampler, MallocDebug, things like that. And then the CHUD tools that have been talked about earlier in this conference. We'll take a look at both of those in this session and where they're applicable. And also a little bit about their integration with Xcode.

Now one thing that this session is not really going to cover in detail is, oh, you should not use this API and you should use that API and use this one here. There's a lot of other sessions like the Carbon performance session, which I think was Friday morning if I recall, where you can get some details on that. And there's a lot of great performance documentation that can cover a lot of those details as well. Now this is going to focus on the tools here. But first let's talk about motivation. Why worry about performance? You know, it's a selling point.

We see it across the board. It's a selling point for Apple with the hardware. It's a selling point with your application. If you've got competitors that do similar things to your product and they're a lot faster than you, you know, that's a big competitive advantage for them, and vice versa.

Performance problems may go unnoticed. I've seen a couple examples of this in recent days where you look at the system and it's just sitting there idle with a window up and maybe an inspector up. Then you look at the CPU usage and notice that 95% of the CPU is being used.

But it's sitting there idle. What is going on? Sometimes you actually have to look for these things and detect them. Or there are issues of scalability that you don't see when you're working with your unit tests, but when it gets out into the field and people are really throwing lots of data at it, then problems occur.

Remember, unlike with Mac OS 9, where you had control at certain points, you are not the only app on the system. There's potentially a lot of other things going on, system daemons in the background, etc. It's not nice to nice your process up to try to fool the Unix scheduling and things like that. You really want to make sure that you play well with the other applications on the system. And finally, you want to start thinking about performance from the get-go with your application.

It can be really hard to finish up your whole development cycle and then say, it's too slow, and then try to come back in and graft performance on at the end. You might want to wait and do tuning later on. Don't obfuscate your code from the beginning with no purpose, but think about performance issues as you go along.

[Transcript missing]

Once you've defined your benchmarks, then you want to add some instrumentation in, maybe specific API calls at the start and the end, so that every time around it's really easy to collect the statistics of how long these operations took. You want to be able to measure this on a precise basis, time after time. And that's the key to tracking results throughout your development process.
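
The instrumentation idea above can be sketched in a few lines. This is a hypothetical Python harness, not anything from the Apple tools; the benchmark name and the operation being timed are invented for illustration:

```python
import json
import time
from contextlib import contextmanager

# Accumulated durations for each named operation, keyed by benchmark name.
_timings = {}

@contextmanager
def instrumented(name):
    """Record the wall-clock duration of one benchmarked operation."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _timings.setdefault(name, []).append(time.perf_counter() - start)

def report():
    """Return per-benchmark statistics you can log and diff across builds."""
    return {
        name: {
            "runs": len(samples),
            "total_s": round(sum(samples), 6),
            "worst_s": round(max(samples), 6),
        }
        for name, samples in _timings.items()
    }

# Time the same operation several times so results are comparable
# from build to build ("copy-paste" is a made-up benchmark name).
for _ in range(3):
    with instrumented("copy-paste"):
        sum(i * i for i in range(100_000))  # stand-in for the real operation

print(json.dumps(report(), indent=2))
```

Because every run goes through the same entry and exit points, the numbers stay comparable across builds, which is exactly what makes regressions visible.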

People ask, how did Safari get so fast? How did they do this? They tracked performance throughout their development. It was a key issue from the beginning. And they never allowed performance regressions to get into their code: "I'm sorry, that makes it slower." Each engineer was required to run the performance analysis test suite in Safari before they could check their code in. And if it made it slower, they weren't allowed to check it in. "That's a great feature, but I'm sorry, performance is the number one feature, so it doesn't get in unless you make it faster."
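
That check-in rule can be sketched as a small gate. Everything below is hypothetical — the benchmark names, baseline numbers, and the 2% tolerance are invented; the point is the policy of comparing every run against a stored baseline before a change goes in:

```python
# Hypothetical pre-checkin gate: run the benchmark suite, compare against a
# stored baseline, and refuse the change if any benchmark regressed beyond
# a small noise tolerance.
BASELINE_MS = {"page-load": 120.0, "history-search": 45.0}  # made-up baselines
TOLERANCE = 1.02  # allow 2% measurement noise

def gate(results_ms):
    """Return the benchmarks that regressed; an empty list means OK to check in."""
    return [
        name for name, ms in results_ms.items()
        if ms > BASELINE_MS.get(name, float("inf")) * TOLERANCE
    ]

# A run where one benchmark got slower:
failures = gate({"page-load": 118.0, "history-search": 51.3})
print(failures)  # → ['history-search']
```

A gate like this only works if the benchmarks are run the same way every time, which is why the instrumentation has to be part of the regular process.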

And finally, once you've gone through all this effort, then out pop the hot spots, and you can go in and start to tune where it really makes a difference. Because, you know, you've probably all observed that we can tend to be notoriously bad at guessing in advance where our performance problems are going to be. Gee, it was fun to spend a week optimizing that particular routine, but it made no difference. So with Safari, it was really a part of their process.

They actually embedded instrumentation into the development versions of their application, in the form of a panel here that their engineers and their QA staff and managers could pop up and run tests at any point through the development process. It really made it an integral part of what they were doing, because it was so important to them. They could do things like check for memory leaks and sample directly from here. So you might want to consider adding that kind of thing to your application.

So when we talk about benchmarking, what kinds of things might you want to look at? There's a wide variety of factors that play a major role in performance on Mac OS X. You've probably heard us talk time and time again about memory use. We have a limited amount of RAM on the system. And once we eat through that, then we're starting to page out to the disk.

And that's a lot slower. So if you're using a lot of static memory or leaking memory, that can be a big problem. So measure that. Maybe you're not actually using that much more static memory over time, but your dynamic memory, how much you used during this particular operation, really spiked up. And that can cause problems. So that's an area for measurement.

CPU use, I mentioned launch time. There's other things that are fairly obvious. Gee, this is one of the major operations of my system. How long does that take? If it's a fast operation, maybe you want to scale it up and run it 10,000 times and measure how long does it take to do 10,000 runs of this. Again, idle time.

You're not the only app on the system. If you're not doing anything in your app, it shouldn't be taking any time. And then the spinning watch cursor, the spinning rainbow cursor. That shouldn't ever come up on our system. Well, it does occasionally in my apps and it might in yours, but let's go fix that. We'll show you some ways to tackle that.

Drawing. It might not be obvious, but sometimes you're drawing too many times to the screen. We've got some great tools to take a look at that. Now that we're doing live resizing of windows or live resizing of split views, are you getting smooth resizing during that? There's a variety of things to consider for benchmarking. So once you've identified your benchmarks, then you need some tools to take a look at the issues.

So we've got a variety of tools on the system for both monitoring what's going on and then for getting in and saying, okay, I see that I've got problems with CPU usage. Where is the time actually being spent? And so we can look at memory use and CPU behavior and resource usage like file systems and system calls and drawing. So what's the best way to do that? So we'll cover a lot of these tools as we go through.

One thing to bear in mind as we think about performance is that there's actually a lot of different levels of performance in the system that can make a huge impact on your overall application. So let's think about layers of design abstraction. Your application architecture. If you're a multi-threaded application, do you get deadlocks between your threads? We have tools like Thread Viewer to take a look at that. Maybe you're multi-process and you're getting network hangs.

If you've got complex object-oriented architecture, are you sending too many messages between the various objects? Or maybe one object is acting as the bottleneck for everything, a god object that everything has to go through. These are sort of architecture-level issues that you might want to consider from the beginning.

Then within a specific module, a specific class, you can think about things like your data structures, your algorithms. Are you allocating too much memory here in this process? Or is the algorithm itself a poor algorithm for scaling up? What's the interaction with the OS? Again, the documentation covers a lot of things like this call in Carbon to enumerate the directory structure is slow. You might want to consider using this instead.

Then bottleneck routines. Once you've isolated it down: okay, we seem to be spending a lot of time in this routine. So on the right of this diagram here, we show that we've got a number of high-level tools that let you look at some of the higher levels of the design abstraction. Once you get down to things like the interaction with the OS and the bottleneck routines, then the Sampler profiling tool and the Shark tool from the CHUD package that we'll talk about later start to overlap in their capabilities.

They both let you do profiling and look at things in somewhat different ways. So both can be helpful. When you really get down to trying to optimize the use of your processor, your memory, Shark is a great tool for that, plus the other CHUD tools. And then Activity Monitor lets you take a look through everything as well.

So we'll be taking a look at a number of new features on the system. On the user CD, there's a new Activity Monitor application that replaces CPU Monitor and Process Viewer and things like that. Really nice application that the Core OS team did. Spin Control, a new application to see what's going on when the watch cursor's spinning. We'll take a look at the integration of the tools with Xcode. And there's a number of new features in Sampler that we'll take a look at.

And then with CHUD, where you can really get in and see what's going on, now with the G5 in addition to the G4, things like that. So, with that, let me go ahead and turn it over to Robert Bowdidge, Performance Engineer, to look at some of the specific tools. Thank you, Dave. Thanks, Robert.

OK. What's the first thing we need in order to actually demonstrate the performance tools? We need a victim, and the victim we've chosen this year is the Sketch application. This is a small Cocoa application that's available on the Developer Tools CD. So, for those of you who've seen us using Carbon apps through all the Xcode demos today, this lets you see that the tools actually do work on Cocoa as well.

Now, if you actually go and look at Sketch, you won't see any performance problems. This is a program that's intended to do simple line drawings. You know, draw a few rectangles, put some text in, maybe do an org chart. But if you look at it, you don't necessarily see any serious performance problems. The guys who wrote it did a pretty good job of making it a typical Cocoa app with no performance problems. So, we need to add some performance problems.

And actually, the way that we did it this time was, rather than salting some bugs in there, we decided to try to increase the scope. So rather than trying to do small drawings, we said, "Well, let's imagine our boss comes into our office and says, 'Hey, you know, that Sketch app, that's really good.

I think we could do architectural software with that.'" And so suddenly, instead of drawing tens of rectangles, five rectangles, we're drawing thousands of rectangles. And the question is, what's going to happen? Are we going to find any performance problems? Are we going to find that our memory use is a heck of a lot more than we ever expected? Are we going to find CPU problems where we're running too much code? And hopefully this is a situation that many of you run into in your own code as you look at applications and find out that on certain data sets it doesn't quite behave as you expected. So let's take a look at that. So I'd like to bring Christy Warren up, who is the performance engineer for the text team, to actually do a demonstration for us. To start, actually, let's go to the slides for a sec.

Thank you. So, one question is how you actually find the performance problems. Dave gave us an idea of some of the processes that you might go through, whether that's looking for regressions or following a certain pattern of measuring certain things every time, but sometimes you don't have that. Sometimes you start with a new application and you're not quite sure where to start looking.

So, the way I like to start, and the way our Vice President likes to start, is to use either the command-line tool top, which hopefully you've seen in previous years, or, thanks to Eric Paton and some of the folks on the Core OS team, we now have a new tool called Activity Monitor, which gives us a way to look at this. If we could switch to the demo machine now.

Thank you. Okay, so we have Activity Monitor over here on the side. And the way Activity Monitor's divided up is that the information at the bottom of the screen represents the system-wide information about your computer. So in this window we're looking at system memory. And one interesting number here is the page ins and page outs down at the bottom, which represent the amount of swapping your virtual memory system is doing: how many pages are being written off to disk. The other things, the wedge and the numbers here, represent how physical memory is divided up on your system.

How much of it's used for user stuff? How much of it's used for the kernel? How much of the memory is wired down because there are structures the kernel doesn't dare page out, like the virtual memory system? The other tabs, for example CPU, give you an idea about how much work the CPU's doing in general, kind of like the CPU Monitor application does. And the other tabs, for disk activity, disk usage, and so on, also give you summary data. The information at the top gives you details about specific processes.

And so we can see Activity Monitor, Sketch, and so on. And we get information not only about what's running, but how much CPU they're using. And we can sort this list according to what's the most CPU intensive, or we can look in terms of process name, or hierarchy in the process groups. So Christy already has Sketch running.

And we can double-click on that entry to get a little more detail on Sketch. And the important numbers here are the percent CPU, as usual, and the private memory size down at the bottom. Now, private memory size is kind of an interesting number. It represents the amount of resident private memory that's being used by this application.

That is the memory that's resident in physical memory and the memory that's only needed by your app. And so this ends up being a nice number because it represents sort of the footprint of your application because that memory is, first of all, only based on what you're using.

And secondly, it's the memory you can control. It's the memory being used for the heap, or the memory that you're allocating via vm_allocate. And so it gives you a good idea of what's your fault and how much you can reduce, as opposed to the other numbers, which tend to include a lot of amounts that you can't actually reduce. So we can see here that just having Sketch up took up about 1.64 megabytes. Not great, not bad.

That'll do. So if Christy can now load one of our architectural drawings. We have a factory here. Okay, we're going to build factories. And when the architect goes to the customer and says, "Here's your factory," the customer says, "Oh, I want six floors, not three." Okay, we can do that.

We can select the entire building, we can copy it, and we can paste it. Now we have six floors. Oh no, that's not enough. Let's make it twice as wide. So let's do it again. We'll select all. We'll copy. And copy is taking a little while. That's not good. And we can paste.

And so we're drawing a couple thousand rectangles here to draw that building, but we're already noticing a couple issues. One was that copy was getting a little slow, and we're going to find out it actually gets a lot slower as we go along. But the other thing is, if we go over and look at Activity Monitor, we find out that we're actually using 7.6 megabytes of memory.

Okay, so 7.6 megabytes, minus the 1.6 megabytes or so we started with: we used about 6 megabytes of memory to do those two copies and pastes. Okay, so we've got a performance problem here. We have a problem in what we're doing in terms of the copy, so in terms of CPU, and we have a memory problem.

Could we switch back to the slides, please? Oh, another interesting thing about Activity Monitor is that, because it's looking at the entire system, you can see what's going on in other processes. And one of the things to remember is that on Mac OS X, your application's work on the system is not just a matter of what your application itself is responsible for.

There are other processes, whether they're little daemons on the side or, more importantly, things like the window server, where if you're doing a lot of drawing, your application may only be taking up 60% of the CPU, but the window server could be taking up the other 40%. So when you're looking at Activity Monitor, you also need to look at the whole system to understand what else your application may be causing, so that you can find other ways that you might be able to optimize.

Okay, so let's attack the first problem. What do we do if memory use seems a little high? Well, why do we care? Why don't we just use as much memory as we can? This will at least make the people who sell SIMMs happy. Well, there's a couple reasons for that. Generally, using too much memory is not a good thing.

One of the reasons is your application is slow, because suddenly all the data that you want the CPU to be processing as fast as possible, especially on one of these G5s that can really race, can't fit in the cache, or gets chased out of it. And so suddenly, you're having to rely on the speed of main memory instead of the cache.

And so you want to keep your application as memory-lean as possible so that you can have as much as possible in the cache. If you're not using the memory, well, then it's just sort of wasting space, because it's sitting in physical memory and maybe you're not touching it.

And if I come along and I start playing iTunes, or I start running iPhoto, or I start using Mail, or I start doing Safari, which every one of your customers is also doing when they're running your app, that means that when Safari needs more memory to put in some big page, some of your pages may have to get forced out of physical memory and written off to disk. And so the computer's going to have to do a lot more work just because you want to keep that memory around.

So you want to keep your memory footprint small for that reason. And if you've forgotten about the memory, if you've allocated it and you've forgotten to get rid of it, it's even worse, because you can't free it at that point and it's just going to get copied around on the disk.

And because of the virtual memory system, you can actually run into some rather interesting problems where you might not have expected things to go as badly as they did. So here's an example. Let's imagine we've got some really large file. You know, it's 10 megabytes or 100 megabytes. And reading it in when we need it seems a little slow.

Well, I know what I'll do. I'll just read it in before I need it so that it's available. I'll read it into memory and that way when I need that file, it's right there. The problem with that is that what happens if I go off and I run iPhoto and I run iTunes and I run Mail and everything else? Those start to need memory. And so some of your pages that you've brought in get chased out to disk.

And then when you actually need that file, or that parsed representation of the file, say, suddenly it has to be brought in off of disk again. And so in order to save that disk read that you did, you've now read it into memory, written it out to disk, and read it back in.

So you don't want to do that. You want to try to keep your memory footprint as low as possible. And you want to do that in terms of both the memory you use and the memory that you've forgotten about and that you're leaking. Now, there's two tools that you can use to do this. One is called ObjectAlloc, and it looks at your memory use in terms of how many objects you have.

And the second one is called MallocDebug, and it refers to allocations in terms of where they are, so that you can see particular places in your code that tend to allocate a lot of memory. And let's take a look at the first of those, ObjectAlloc. And actually, let's switch to the demo screen.

Okay, so here I am. I'm running Sketch in Xcode, because Xcode's really cool. And I want to go and do some performance analysis. Okay, how do you do that? Well, the first step I usually do, or at least the first step I always hear from everybody, is go hunting around on disk trying to figure out where the performance tools are. Actually, who knows where the performance tools are? Okay. The developer tools are in /Developer/Applications. That's nice. The performance tools are there too. But the problem is, you've got to go hunting around for them.

You've got to use the Finder, which was computer-centric and not human-centric, and that wasn't very good. I was going to use another word, but I won't say that. And so we've improved that. So now what you can do is, you're going along and you say, "I want to look at performance." And you can go up to the Debug menu, and there's now an entry called Launch Using Performance Tool.

And it will list the performance tools that are available. Geez, if we'd known this would make people happy. And in fact, if you actually had installed the CHUD tools, which sadly I did not because I wasn't a good person, you'd actually have Shark there too. And I'd suggest you install Shark so that you can actually see it on that list. And so we can launch ObjectAlloc here. And here's the ObjectAlloc window. And let us launch Sketch in it.

And what ObjectAlloc does is it instruments your code, it runs it, you answer a few questions, and it keeps track of how many objects have been created, and it updates that constantly. And it shows not only the current number of objects of that type that exist, but the peak number that have ever existed during the lifetime of your program, and the total number you've allocated during the entire program. And so we can go to our little example.

We can open our factory. And we can see that we're creating huge numbers of CFStrings and all sorts of other things as part of doing this work. And here's our factory. So let's again do our Select All, and Copy, and our Paste. And ObjectAlloc is doing its good work.

And you can see that things are updating. And let's do that again, if we could. Now, you can notice that to the far side of the numbers there's a histogram, some bars there indicating graphically how many objects you have. And that's very nice, because that gives you a way to directly perceive how fast things are changing. So you can see: oh, my God, I'm creating a lot of these objects really quickly.

And the colors actually have meaning because if it's colored yellow, then that implies that the current number of objects of that type is only about 20 percent of what the peak is or less, which implies that you created a whole bunch of them and then you backed off, which may imply that maybe you're not auto-releasing things quickly enough or maybe you're just creating a huge number. But it's hopefully going to make you look at that to try to figure out why you had so many. And the red indicates that you have only 10 percent of peak value.
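
As a sketch of that coloring rule, here is a hypothetical Python version of the threshold logic. The 20% and 10% cutoffs come from the talk; the function itself is illustrative, not ObjectAlloc's actual code:

```python
def histogram_color(current, peak):
    """Flag object types whose live count has fallen far below their peak,
    a hint that you allocated a burst of them and then backed off."""
    if peak == 0:
        return "none"
    ratio = current / peak
    if ratio <= 0.10:
        return "red"      # current count is 10% of peak or less
    if ratio <= 0.20:
        return "yellow"   # current count is 20% of peak or less
    return "normal"

print(histogram_color(350, 4000))   # prints "red"
print(histogram_color(700, 4000))   # prints "yellow"
print(histogram_color(3900, 4000))  # prints "normal"
```

The same idea applies to any allocation statistic: a large gap between peak and current is the visual cue that something was created in bulk and may be worth investigating.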

Okay, so we've now done our copy and we can go over. And now what we do is what we should do in all the performance tools: we look at the data, basically scrubbing our nose against it, looking for something that looks suspicious, because the performance tool can't really say, "Oh, this is the problem. If you fix this piece of code, you'll be happy." In general, it tends to be much more of: you look at it and you say, "Oh, gee, I didn't expect that.

Why is that happening?" And then you go and track down the bug. And what we can see here immediately, because ObjectAlloc will look at mallocs as well as CF objects and Objective-C objects, is that the second most common object, after general block 14, that is, mallocs of size 14, is NSInvocation: we have 4,000 NSInvocation objects.

NSInvocation. That's not in my code. And in fact, you know, not only do we have 4,000 of those things, but if we check Count as Bytes, we find out that out of 2.9 megabytes of memory used for all the objects, 800K of it is used for NSInvocation. So about 25% of my memory is because of these.

That's odd. Well, ObjectAlloc gives us a way to track that down. So what we can do is go over to the Instance Browser, and we can select NSInvocation, and we get a list of all the objects of that type. And if, when we launched the application, we had happened to check the little box that said keep track of retains and releases, for that object we would see all of the times that we did a retain in Objective-C and did a release on that object, so that we could find over-retains. Or we can click on the allocation event, as Christy's done here, and we can take a look at the backtrace indicating exactly where that object was allocated. And we see here that it's allocated in our select-graphic code.

Now, this also shows another feature that's new in the performance tools. In the past, you'd find something suspicious and you wouldn't be able to track down where it came from. Now, we take a look at the STABS information, the debugging information in your binary. And if we can find the location of that function, we actually will highlight it in the performance tools, either with a little file icon or by underlining it. And so you can double-click on that, and Xcode will actually show the code for you.

And what we find out is that the NSInvocation objects are being used for undos. So every time that we select a rectangle, that we copy it, that we paste it, for each of those thousands we end up creating an NSInvocation object to handle undoing that at the end.

Okay, so we're creating thousands of these things. Okay, so this is an interesting problem. We've got some solutions here. We could just decide that the undo support in Cocoa is just such a big win productivity-wise that we just don't care. That's a fine answer. If we really care, because we're going to be doing lots of architecture stuff, then maybe we actually want to change this and create our own undo mechanism. Or a third option is we could say, "Why are we allowing undo of select, when the HI guidelines don't require us to?" And so you could actually get rid of that.

So this is one of the ways that you can step through finding something suspicious and tracking down why that's happening and tracking it down to your code to understand what the problem is. And that's how you can use the performance tools. So can we go back to the slides, please? Okay, second question: What do we do when the CPU use seems too high? First of all, why do we care again? Well, answer: If you're doing something and it's taking too long, it's not only making your application look bad, but you're going to make my iTunes skip. And I don't like that. So you need to worry both about your own application and how it performs and how you're affecting the rest of the system because you're not alone. There's lots of other things running on all our computers.

So there's a few tools you can use to track down CPU use. One of those is Sampler, our profiler. Sampler can also be used to look at what's called dynamic memory footprint, as Christy puts it, which is a way of understanding where your calls to malloc are, and using those as suggestions of where you might be doing too much work. And there's also a tool called Spin Control, new in this release, that gives you a way to automatically sample when the spinning cursor comes up. And I'm going to show all three of these. MallocDebug, look at it on your own.

So, with Sampler. Sampler is a statistical profiler, technically. And what that means is that every 10 milliseconds, 5 milliseconds, 20 milliseconds, Sampler stops your program and says, "Hey, what's going on?" And it goes and gets a backtrace from every thread that's running, and looks to see what work is going on.

So it gathers the backtrace. And then it lets the application run for a little while longer and keeps doing that. And at the end of the sampling, it gathers all those backtraces together and smashes them together into a tree so that you understand the range of ways that your application is behaving.

And then it presents it in a graphical way. Now, there's a couple of things you need to remember about Sampler. Because it's statistical, because it's only stopping the program at intervals, it doesn't know what happened in between. And so that means that it may not catch all the functions, though any function should appear in the samples in proportion to the amount of time it actually spends running.
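
The "smashing backtraces together into a tree" step can be sketched in Python. This is a toy model of what a statistical profiler does with its samples, with made-up frame names; it is not Sampler's actual implementation:

```python
from collections import defaultdict

def merge_samples(samples):
    """Fold a list of backtraces (outermost frame first) into a call tree
    whose node counts say how many samples passed through each call path."""
    def node():
        return {"count": 0, "children": defaultdict(node)}
    root = node()
    for backtrace in samples:
        cursor = root
        for frame in backtrace:
            cursor = cursor["children"][frame]
            cursor["count"] += 1
    return root

# Three hypothetical samples from a run; every one starts in main.
samples = [
    ["main", "runLoop", "drawRect"],
    ["main", "runLoop", "drawRect"],
    ["main", "runLoop", "handleEvent"],
]
tree = merge_samples(samples)
main = tree["children"]["main"]
print(main["count"])  # → 3: all three samples were somewhere under main
print(main["children"]["runLoop"]["children"]["drawRect"]["count"])  # → 2
```

A function's count divided by the total number of samples approximates the fraction of time spent under it, which is why the statistical caveat above matters: rare, short calls may never appear at all.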

Now, if sampling every 10 milliseconds isn't good enough for you, if you need finer resolution, then you should try using Shark, which you'll see after this. And if you need to know about every single call, then you might want to consider actually using gprof, which is a standard Unix profiler, and which requires you to recompile your code.

So, let's take a look at Sampler. Could we switch to the demo machine again? Okay, so we have Sampler up. This is a new UI for this release. Up in the upper left-hand corner you get the type of sampling you're doing. You can sample based on time, or you can get a backtrace every time malloc is called, or you can look for specific function calls. We're just going to do time samples here. And what we'll do is we'll launch Sketch from it.

We're going to look at that copy because copy seemed like it was going a little slow and I don't like that. So we open up the factory again. Let's do what we did before. First of all, we need to start sampling. If you remember how Sampler used to be, you actually had to switch over to the Sampler window and press the button and go back to your application. That was annoying because you often would get lots of garbage because of having to raise the window, data that you didn't care about. Sampler now has a hotkey. Christy can actually hit Command-Option-Control-R to start and stop sampling.

Thank you, thank you. We appreciate it. And she can do the copy and paste. And again. And Christy can stop. Now we can go over to Sampler and we can try to take a look at what's going on. Now this is the way that you used to look at Sampler as well, with a browser.

And the browser has some good points and some bad points. However, there were a lot of people at Apple who actually would write their own tools to parse this data, because they liked to display it in an outline view. And so Christy actually was nice enough to put in an outline view, so make sure to thank her. And so you can actually look at an outline view and turn down the disclosure triangles to see your call tree.

So for example, we can see here that in every one of the samples we were in main, main always called NSApplicationMain, and so on. And we can sort of step down in there, snooping around, and we can find where we're calling into the menu code, which is right about here.

Now, actually one of the things that you saw was that the counts were originally in terms of the samples, how many times the program had been stopped. And Christy actually just switched this so it was showing it in terms of time, which tends to be a better way to understand it, though remember that's statistical, so you can't say that took, you know, exactly 0.01 seconds.

And what we find is in that time that we were doing the sampling, we spent about 2.44 seconds in copy. Okay, that seems a little odd. And we can actually turn down the triangle and see where the time was spent in copying. It turns out that we called four routines there that took all the time.

Gee, that's weird. Well, luckily we still have that way of linking to the source code because I can't understand it from this point of view. And so, Christy can double click on Copy there and we get our source code. And we find out that what's happening is that when we do that copy, we create a PDF file, well, PDF clipping, we create a TIFF clipping, and we create the Sketch internal version of clipping.

Okay? This is because Cocoa has this nice feature where you can say, "Hey, I can give you a clipping in any of these formats." Okay, that's very good because then the application that you're pasting it into can say, "I only work in PDF, I need a TIFF, I need a whatever," and it works. But when we're doing thousands of objects, doing all three tends to be a little wasteful.

And so a better way of doing this would be to use another feature of AppKit, which is to basically say, "Here are the things I can produce, but I'm not going to produce them until you ask for them." And so we could change this code so that we only said, "These are the types of clippings we create." And then only when somebody did the paste would we actually create the clipping for that.

And so that would get rid of this performance problem. We'd make copy a lot faster at the expense of making paste a little slower. Okay, could we go back to the slides please? Now, the performance tools, as I've said, are very good for exploring your data. They're good for looking around and trying to figure out what's going on.

But that's not always the best way to work. Because when you've found a particular performance problem, such as the Safari people found, that they really cared about page load times and nothing else, having to run Sampler every time to gather the amount of time would be wasteful. And so, if you know what you're going to be measuring, try instrumenting your code, putting in print statements to learn how much time was spent, or automatically logging that time.

And this is really good because it means that you can automatically gather statistics so that you can check for regressions. And it means that you're always watching exactly what you want to be watching. And there are many ways that you can do this. There are a number of APIs in Mac OS X for looking at time. Some of the ones that are interesting are UpTime, which tends to have nanosecond resolution; or you can use gettimeofday if you prefer BSD, or the NSDate class if you're in Objective-C.

When I actually did this on copy, it was actually very strange, because I didn't expect to find anything, but I thought, let's do this. I started graphing out the amount of time spent for each of the clippings, and for this task at the beginning, which was called ordering the list, which was sorting the things that you clipped from back to front.

Okay, so first one, you know, the PDF was the longest, second one, sketch was the longest, and then suddenly when I got to about 4,000 objects, suddenly the sort would take forever. And unless I actually measured this and unless I tried it on a bunch of different sizes, I never would have seen this.

And so this is one of the advantages of instrumenting: it makes it very easy for you to check to see when things go wrong, and why. And if you actually look at the code, what you find is that Sketch was basically made for dealing with tens of objects. And the way that it would do the ordered list is it would use the NSArray sort method, for those of you who are Objective-C fans.

And so it would basically say, "Hey, go sort this." And there--and you had to provide a comparison routine to be able to say this is how you compare two of these rectangles. And the way the rectangles would be compared is it would say, "Hey, what's the index of each of these in the big array that lists everything that's being drawn?" Okay, is this the first one, the fifth one, the tenth one? That's pretty efficient.

Except that that code would take your NSArray and it would make a copy over here in a nice static array so that it could do the search really easily. So it would have to malloc huge amounts of memory and then it would have to do a linear search.

So that meant that the comparison was an order n operation, which meant that sort ended up being like an order n squared log n or something like that. And so you end up with this funky thing where sort looked really fine until you got about 4,000 elements and then suddenly it was huge. So this is why you want to instrument.

Now, we've looked at a couple of ways that you can go looking for things that are suspicious. One of the interesting things about big object-oriented systems is that you tend to have a lot of layers, because you've got this thing called information hiding, which is great.

And so you don't really know what the layers below you are doing, but you sort of hope they're doing the right thing. But the problem is that often they make assumptions you don't. You're using the API in ways that they don't expect. And so, some call that you might be making into some layer might go all the way down to the bottom of the system and back up and take huge amounts of effort. Or you may have something that you think is inexpensive, like that sort, that ends up being a very expensive operation.

So, Christy actually came to Apple and suggested that one of the ways that we should be looking at systems is to be looking for this kind of repetition, because object-oriented systems tend to suffer from this. And one of the ways that you can do this is you can look for mallocs, because mallocs tend to be time-consuming and memory-intensive operations, and everybody uses them all over the place. And so if we can poke around and see where mallocs are being called, we might be able to see where we're doing repetitive work we really don't intend to be doing. So, let's switch over to the demo machine.

Okay, so Christy will now switch to watching memory allocations and to using what's called the trace view, which is a way to actually look at these mallocs in a very interesting way. And we can go back over to Sketch and we're going to do a very small example because when you're doing sampling by time, you want to have lots of stuff so that you hopefully find all the functions you're looking at.

Here we're looking at every single call, so you want to make your example relatively small. And we're going to look at what mallocs are being done when we do our copy. And so, Christy has the two rectangles there. She'll hit the hotkey to start recording, do the copy, stop recording, and we find out that to do that we required 6,000 mallocs.

You know, and this probably isn't that unusual, but it's a big system and there's a lot of things going on. And in fact, if we poke around, the idea is that this graph actually shows you the height of the call stack going to each malloc. So how many functions you had to get to before you got to malloc from main.

And if we zoom in on one of those, we'll actually find that you start seeing these repetitive patterns. See how it's kind of like an EKG, you know, so blip, blip, blip, blip, blip, at very regular intervals, which implies that there's actually some very regular operation going on there if we're seeing that signature over and over again.

And in fact, if we go look we find out that we're down in some code that's parsing an XML file. And it turns out that when we do a clipping and it's a PDF file, the PDF file has to get information on the printer because the printer is used for the size of the page.

And the printer ends up going through the CUPS daemon, and the CUPS daemon ends up giving us back XML, and we have to parse the XML, and so we do lots of mallocs. We never would have known this and it might not show up in Sampler, but this is a way to understand sort of what the costs are.

And some of these are cases where you might be able to say, "Oh gee, I shouldn't do that." And a lot of those are cases where Apple needs to say, "Oh gee, we ought to fix that, and we can actually fix it for you so you never run into it." Okay, can we switch back to the slides please? Okay, the final demo I'll do today is Spin Control, which is a new application. So the problem here is that in general, when your application takes too long to do something, when messages coming from the window server don't get responded to within about five seconds, usually the window server puts up the spinning cursor.

So usually this implies that your application is behaving badly. It's not responding quickly enough for the window server. And so these tend to be bugs. You know, you're doing too much work. The problem is you can't sample them because, first of all, they're sort of difficult to catch, because they tend to sort of appear and disappear.

And even if you could get to Sampler, usually your machine's doing other things because there's a spinning cursor up. And so there's not really a chance to actually go and attach to it. And so the idea is that Spin Control automatically samples your application for you. So let's switch back to the demo machine.

So Christy launches Spin Control, which is in Developer Applications; you have to go find that yourself, sadly. And it basically keeps a list of every time that it detects a spin. And you can set it for only one application or all applications. And we can do that copy that we were doing that was causing us all that grief. So we can select all again. We can copy.

And we can paste. We can do that again. And sometimes you actually need to click on the window so that there's a window event for it to notice. That's usually when the spinning cursor comes up. And we can see here, the spinning cursor just came up, because we copied one of those things that takes 800 seconds. Hopefully not. I think I need to get off the stage soon. And it automatically sampled it.

And now we could copy that and paste it into email to send to a developer to say something's wrong. Or we can double click on it and we get a Sampler-like view where we can actually look at the code. And in fact, we can go and see that we're calling--oh boy, that's nice. We're in copy, which turns out to be in NSArray, which ends up calling CFArrayGetValueAtIndex, just like I was explaining. So I wasn't lying.

So, Spin Control gives you a way to see the invisible. It lets you actually see the kinds of things that you otherwise can't sample. So, this is a cool tool. Try running it on your system, leaving it up, and seeing what you catch. Could we go back to the slides, please? And thank you very much, Christy.

There's a number of other tools that you need to check out yourself. We don't have time for everything, sadly. Hopefully you've seen these in previous years if you've been here. If you haven't, take a look at some of these tools. Take a look at the performance book to find out how to use them. They're all valuable and interesting in their own ways.

They might be able to help you on certain types of problems, and you need to explore how to use them and which sorts of problems are best found using each of these. And make sure to watch your applications. And with that, I'd like to bring up Nathan Slingerland to talk about the CHUD tools, which allow you to look at code one level deeper than what we've been looking at so far.

Good luck. Good luck. There we go. Okay, so as Robert said, I'm Nathan Slingerland and I'm going to talk to you today about the CHUD tools, or Computer Hardware Understanding Development Tools. And these are tools written by the Apple Architecture and Performance Group. They're a suite of tools that give you low-level access to the performance monitors. These are counters that are built into our hardware: in the processors, the memory controller, and the operating system.

And using these counters, you can find problems in your code and improve your code. And of course, the CHUD tools are freely available with the Developer Tools CD. You can bring up Shark from Xcode, as you saw. And they're freely available on the web too, so you can check there for updates. If you were here last year, we introduced CHUD.

We're happy to have 3.0 this year with a lot of great improvements. Shark is an instruction-level profiler; if you've ever used Shikari from the older CHUD tools, Shark is the successor to Shikari. Monster is a spreadsheet for performance events, so you can look at these counter results in either spreadsheet or chart form. And Saturn is a new tool for visualizing function call behavior.

And of course we have a set of other lower-level tools that you can use for tuning things like AltiVec code, or very CPU-intensive code that you want to simulate using SimG4 or, soon, SimG5, which will let you see exactly what's happening at the lowest levels on the processor.

And of course we provide the CHUD framework API so you can write your own tools or control the CHUD tools. So the performance counters, as I said, are in our processor and memory controller and operating system. And what they do is they count interesting low-level performance events: things like cache misses on the processor, execution stall cycles, page faults in the operating system. And CHUD lets you control these and view the results.

So the first tool that we're going to talk about that uses these counters is Shark. Shark is a system-wide profiling tool and using Shark you can profile a process, a particular thread or the entire system. And in the most general usage of Shark you can create a time profile.

So this lets you visualize performance hotspots, whether in your code or not. You can see if your hotspot, your bottleneck, is actually in your code using this. You can also use it to create event profiles, so you can relate performance events, things like cache misses, to your code, to find out where the cache misses are coming from. It captures everything: drivers, the kernel, applications. What this means is if you're a driver writer or a kernel extension writer, you can use Shark to see the call stacks and find out where the time is being spent in your driver.

And we're very low overhead because we're handling everything in the kernel. In addition, once you have your sampling session taken, we provide automated analysis. We attempt to annotate your source code and the disassembly of that source code to point out common problems and other things that you can do to optimize your code.

There's a static analysis feature to find suboptimal code. So if you were in the earlier CHUD session, you know that there are some instructions on the G5 that you need to look for and watch out for, and this will help you find them. And we also provide optimization tips, as it says.

So we have a scriptable command line version, so you can script it to sample things. And of course you can save and review sessions and pass those around. So without further ado, the best way to see how to use the CHUD tools and Shark is to have a demo. For that, we're going to use the Noble Ape simulation. This is an open source program written by Tom Barbalet. And to help me demo, I'm going to bring up Sanjay Patel, also of the Architecture and Performance Group.

Okay, so the first thing we'll do is we'll bring up Noble Ape. Okay, so here we are. We're simulating thinking apes on a tropical island. And this map window is showing us an overview of the island, and each little red dot is an ape. And we can focus in on one ape at a time. That's the ape with the red square around him there.

And the brain window to the right here shows what his brain is doing, how the changes are occurring in his brain as he's walking around the island and thinking about things. So, you know, all of our good performance tools are here. The first thing we'll do is we'll use Shark to see what's happening in the system while we run Noble Ape.

So this is the main Shark window. By default we go to the time profile. There are other built-in profiles too, of course, that take advantage of the performance counters. But for now we'll just use the time profile. And we also have a global hotkey; Shark doesn't have to be in the foreground for you to use it, either. So let's sample five or so seconds and see what's happening.

So here's the profile listing the important functions from most sampled to least sampled. And in the lower left here we have the process pop-up and this lists all the things that we sampled during this time period. So at the top is Noble Ape and we kind of expect that.

We know that our simulation is CPU bound, but it's only 50% of the time. And you kind of wonder, well, okay, why is that? Well, if we go to the thread pop-up, we can see that, in fact, this application is single-threaded. And because it's single-threaded, we're not using half of our dual processor machine.

So our first step in optimization was, hey, let's thread this thing. We used the Carbon MP API and threaded Noble Ape. So let's see what the performance improvement was like. Remember we had 1200 thoughts per second before and we're getting almost double that, so that's pretty good. But let's profile again and see what we can do with this code.

So now we can see that we're taking up a much greater portion of the time on the machine and that's reassuring. We want to do that for our simulation. And we can see that we've spawned these threads now. We've got the main thread at 8% and then two other threads that are processing the apes in parallel, 40% apiece.

So the next step we can do is we can double click on any entry in this profile view and it'll show us our source code colored with where the samples were taken. So what this tells you is what lines of source code the most time was spent on.

So if we look here, the scroll bar also gives us a way to jump quickly to the hotspot. So the hotspot is literally just this function, just this piece of the function, this for loop, inside of the Cycle Troop Brain Scaler function. So it turns out that this is about 94% of the time if we highlight this.

Right? So if we look, Shark gives us a hint on how to fix our code or how to make it better. We click on this little exclamation point. It says, okay, this loop contains 8-bit integer operations, and you're spending a lot of time in this loop.

Maybe it would be worth the effort to vectorize this loop. So that was our next step. We went and we vectorized. So let's go back. Remember 2400? Let's turn on vector. Alright, so 10,000. That's nice. But we're still not done yet. Let's look again with Shark and see what else we could do.

So we see the vector function showing up there. We'll double click. And we're in the vector code. That's good. If you're a Shikari user, you probably know that we had this disassembly view that was similar to this. And you can still get this back. This disassembly view is actually set right now to show G5 dispatch groups. And there's more detail on that in the full Chud session. We'll go back to the source code for now.

And if you look closely at the scroll bar, we can see that actually, even though we're spending a lot of time in the vector code that we optimized, now we're spending a relatively bigger portion of the time inside of the scalar code that we didn't optimize before in the first step, that we didn't vectorize before. So our next step is, hey, maybe we should vectorize the rest of this since all these loops are fairly similar and that's what Shark says to do. So let's go back to Noble Ape.

So about 10,000, 9,500, turn on vector optimized, and we're almost at 15,000. So this is around 14 or so times the original performance. And what we were able to do is take advantage of this massive bandwidth we have available on the Power Mac G5 by using AltiVec. Okay, so could we have the slides again, please? Thank you, Sanjay.

Okay, we did that. Oh, wait. Yeah. So, just to summarize, we compared this against the Power Mac G4. So, this is the scalar code running on the current Power Mac G4 top of the line against the Power Mac G5. And you can see that actually they're not all that far apart. In the scalar code, we actually have a longer pipe on the G5, a longer pipeline.

And so, we're not entirely scaling with the higher frequency; we're entirely CPU bound for this. So, when we added the threading, we can see that we get a bigger jump than what the G4 got, right, going from scalar to scalar threaded. Then vector, an even bigger jump.

And vector optimized, an even bigger difference, right? And the reason is that as we improve this code, we become more and more constrained by memory bandwidth on the G4, while on the G5 we're simply not as constrained, right? We have a lot more memory bandwidth available on the system, a lot more memory bandwidth to play with here.

So, by vectorizing your code--you know, if we had just thrown this on the G5, we would see a very marginal improvement. But by putting in the effort to vectorize, we're able to take advantage of a lot more of the system, a lot more of what it has to offer.

So in addition to Shark, we have some other tools. Monster allows you to directly configure the performance monitor counters, collect data based on timed intervals or event counts or a hotkey, and then look at this in spreadsheet or chart form. It also has the ability to compute metrics, things like bandwidth or cycles per instruction; that's actually how we got our bandwidth numbers when we were looking at this. There is a command line version of Monster, and you can also save and review sessions with that.

Saturn is the last tool we're going to talk about. Saturn is similar in some ways to gprof. It gives you an exact profile and allows you to visualize the call tree of an application. It uses GCC to instrument each function in your application at entry and exit, records the function call history to a trace file, and then for each function it can give you the call counts. It can also use the performance monitors to tell you the event counts for each function, as well as the execution times using a low-level timer. So okay, at this point I'd like to bring up Dave Payne again for the session wrap-up.

[Transcript missing]