Using Performance Analysis Tools for Mac OS X - WWDC 2004

Development • 58:58

Xcode includes a broad suite of tools for profiling and analyzing applications. This session will give an overview of all the performance tools that ship with Xcode, including Shark, Spin Control, MallocDebug and others. We explain how developers can take advantage of these powerful tools to identify and fix performance issues. This is an introductory to intermediate-level session.

Speakers: Dave Payne, Christy Warren

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Good morning everyone. Welcome to Using Performance Analysis Tools for Mac OS X. Let's dive in. I'm Dave Payne, working on the performance tools. So this morning we're going to cover some of the concepts of performance analysis, just general concepts for any system, and then look at the performance tools that we have available for Mac OS X, and then dive into a case study of real-world use of the tools, and then show some of the cool new work that we've been doing to integrate some of the features of the Sampler profiling tool into Shark, which is another profiling tool, to combine them, and to add a bunch of powerful new features to help you find performance problems faster in your applications.

So why are we interested in this? After all, those machines are getting faster and faster, right? Just buy beefier hardware. That's really cool. Well, there's a lot of jobs no matter how fast the hardware is. People always need more power. You need it in your development tools, for example, faster compilations, etc.

In general, I don't want to have to go out and buy another gigabyte of RAM to run this application or these three in combination. I'd like my laptop to have the battery last all the way back to Dallas on my flight. A lot of that is how much am I beating on the hardware, the CPU, the memory. When the fan kicks in, what's going on there? That takes power as well on the laptops. In general, we want applications that play well together with others.

That really is important to reduce the amount of overall memory that your application is using. Because I'm running... I'm running more and more applications on my system every day. You guys are creating cool apps. Thanks. But I don't want to have to page tremendously when I'm switching between those applications.

So rather than just diving in and saying, well, I think this routine is going to be slow. I want to go rewrite it because it'll be fun and cool. Humans are notoriously bad at guessing where the performance problems might be. So we need a systematic way to go about looking at this.

You're the expert in using your application, what it's for. So an approach to performance analysis, first, define what the major operations you're interested in having fast for the user are. Then, what are your goals to make sure that that's nice and fast for the user? For example, responsiveness.

If you have slow operations, then you can either speed those operations up, which is the best case, so it just happens like that, or, if it's unavoidable that it's going to be slow, then you want the user to be able to be in control of the application again quickly. So maybe move the slow operation to a separate thread so that the UI can be responsive again. Throughput.
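As an illustration of moving a slow operation off the main thread, here is a minimal Python sketch. It is not from the session: `slow_operation` and the polling loop are hypothetical stand-ins for a long task and a UI event loop.

```python
import threading
import time


def slow_operation(results):
    """Hypothetical stand-in for a slow task, such as scanning a large folder."""
    time.sleep(0.2)  # simulate the slow work
    results.append("scan finished")


results = []
worker = threading.Thread(target=slow_operation, args=(results,))
worker.start()

# The main ("UI") thread stays free to handle events while the worker runs.
ui_ticks = 0
while worker.is_alive():
    ui_ticks += 1        # stand-in for handling one user event
    time.sleep(0.01)

worker.join()
print(f"handled {ui_ticks} UI ticks while waiting; result: {results[0]}")
```

The point of the sketch is only that the loop keeps running while the work is in progress, so the user is never locked out.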

If you're doing a game, you want lots of frames per second. If you're doing a network application, you want a lot of data throughput. Servers, you want lots of transactions per second. What's the goal for your arena? Then, establish precise benchmarks. For that, define what your target hardware for your customer segment is, what operating system version you're testing with, and the specific data that you're going to be passing in and specifically what operations you want to test. Then, add in time measurement code instrumentation so you can time and time again measure that same operation and see how you're doing.

Track that throughout your development. This is what the Safari team was doing to make sure that they had the fastest web browser on Mac OS X. They defined a precise set of benchmarks and then ran them time and time again. And when somebody was going to check in a major piece of new functionality, it couldn't go in if it caused the system to slow down noticeably. So don't allow regressions. And finally, if you do identify some performance problems, then focus your tuning efforts on those hotspots. Well, that's easily said, but how do you actually do that? How do you find what the hotspots are?
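The benchmark-and-no-regressions discipline described above might look something like the following sketch. Everything here is an assumption for illustration: `build_index` is a hypothetical operation under test, and the 10% tolerance is an arbitrary choice, not a figure from the session.

```python
import time


def measure(operation, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        best = min(best, time.perf_counter() - start)
    return best


def is_regression(measured, baseline, tolerance=0.10):
    """True if `measured` is more than `tolerance` slower than `baseline`."""
    return measured > baseline * (1.0 + tolerance)


def build_index():
    # Hypothetical operation under test; replace with a real one.
    return sorted(range(50_000), key=lambda n: -n)


elapsed = measure(build_index)
print(f"build_index: {elapsed:.4f} s")
```

In practice the baseline would be recorded on the target hardware and OS version, and a check-in that makes `is_regression` return true would be held back.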

Got tools to help with that? Yes, absolutely. We've got tools. You've got tools. These have been included with your Mac OS X developer tools for several years at this point. Take a look in /Developer/Applications/Performance Tools. As with all of our developer tools, these tools are free. We really want you to use these things to create great applications. They provide full support for everything you need to do with Mac OS X, including now a lot of support that has been added for Java profiling.

And in addition, our GUI-based tools are integrated with Xcode for a full round trip of the development cycle. You can launch your binary under a performance tool. And then you can get back to the source code directly from the performance tool back into Xcode. So we have a variety of performance tools: for monitoring performance problems, and then, once you've found that you do have a performance problem, for analyzing what the problem is, where it is, and why it's happening, in a variety of areas: memory use, execution time, other types of resources. And I'm going to dive into a bunch of these tools as we go through.

So in the area of high-level monitoring, there's a number of things you can look at to help answer just what in general is happening. One of the primary ones is the command line tool, top, which is always there for you. You can use that for just my system right now or to look at a headless server or remotely log into a system. You can also use that to look at a full-screen game going.

But we've got a bunch of other tools as well that are a little bit more friendly with graphical user interfaces. So for example, a nicer user interface to top is the Activity Monitor application that can help you analyze why is my system slow. Is there some particular process that's taking time? Is its memory growing?

So this one ships on the user CD. If the user calls up your support line and says, "My system's slow," you can say, "Fire up Activity Monitor and tell me what it looks like." Another one you might not be familiar with, this was one of the CHUD tools. We've now elevated it up into the mainstream set of performance tools. It's called Big Top.

So this actually takes the top information and graphs it over time. So you can really see how it's changing over time. I find this very useful to watch and actually see graphically whether the private memory use of my application is growing, or the virtual memory use. Those are two of the best metrics. You don't want to look at shared resident size, because that's like your frameworks that are shared with other apps. But the private space could be growing. And that's important to know.

Oftentimes, I'm sure you've seen it, the spinning cursor comes up. And you wonder, dang, I wish I could capture that. Maybe it's a two second spin. And by the time you're off typing sample on the command line, the spin's over. But it makes your app not feel responsive. So Spin Control is a great way to just capture that. Just have it running in the background all the time. It automatically detects when applications aren't responding to user interface events. And lets you see what's going on there.

Now, I mentioned that sometimes the fan will kick in. I've noticed this a number of times, that my machine is sitting there idle and yet suddenly the fan fires up. Whoa, what's going on there? So I fire up top or Activity Monitor and take a look, and some process is using 50% of the CPU.

Maybe it's drawing too much. Quartz Debug is a great way to look at this and I've actually seen this a number of times in real applications. It's periodically drawing and Quartz Debug flashes the graphics yellow on the screen every time it draws. Or you can also set it up to just see duplicate drawing in just normal operations. Very useful. So once you've used the high-level performance analysis tools to see what's going on overall, then now we want to go and dive into why is this happening.

So we have a variety of profiling tools. We're going to talk a lot about Shark today because we've put a lot of effort into the Shark application. Spin Control, I mentioned. Thread Viewer lets you see the thread activity of your application. What are all the different threads and what are the backtraces of them.

And then for those of you doing OpenGL graphics programming, OpenGL Profiler is a fantastic application to help figure out where the time is going there. Command line tools: sample is excellent for just a quick, basic profile. You give it the process name or process ID and the number of seconds. Very useful. And then we have the venerable Unix gprof tool. But that's the only one that really requires recompiling for profiling. The others don't need any recompilation of your application.

So I mentioned Shark. We have a new version that was shown yesterday in the Development Tools Keynote, Shark 4.0. This tool helps you figure out why is my time being spent in some certain place, where is it going. You can look at specific threads, specific processes, the overall system, and now you can actually do sampling over the network as well. For, again, like doing full-screen games or something.

This captures all the information about the system, both user space and kernel space. There's a full session on Shark this Friday afternoon. I'd encourage you to go to that. We'll see more of this today, but we aren't going to dive into all the full-blown features of it. One of the things I really love about Shark is I don't even have to attach to a process with this. It's always sitting in the background with an option escape hotkey that will sample the entire system.

So if I notice something seeming sluggish, I can just hit option escape right away. Shark fires up and says, "Okay, let's sample," and then I can option escape to stop it and see what's going on in the system and then dive into that specific process. But there's a lot of ways to get more specific information here.

Different sampling methods, memory tracing, function tracing, we'll look at a lot of these. There's three primary views in Shark. A profile view that lets you see a top-down call tree of your function calls, or bottom-up for real profiling information and who's calling the leaf functions.

We've done a lot of work in this to really let you hone in and do filtering and data mining to simplify the complex picture. There's a chart view that helps you really see the patterns of execution of your code. This is excellent for both performance analysis and just understanding what's going on in your application. It's really cool.

And finally, within Shark itself, there's a code browser that you can see source code or assembly, get hints about what the assembly code is, and get directly back to the offending lines of code in Xcode. These are all instrumented here with specific lines of code the time is going into.

So I haven't talked a lot about Sampler. So many of you may be familiar with using Sampler. What's going on with that? We've integrated all of the features of Sampler into Shark and we plan to remove Sampler from the system. So you can see here that Shark has a number of additional features and we'll touch on some of these and then more on Friday.

But please try Shark and the team's been very busy. So Shark is there on your Tiger CDs, but there's actually a newer public beta that's got a number of additional features that the team banged in in the last couple of weeks. So please download the new version of Shark.

It runs on both Panther and Tiger and send us your feedback. I'll give an address later. So what about memory use? I mentioned that it's really important to try to minimize the overall footprint of your application. We have a number of tools to help analyze what's going on with memory use.

So, a very nice one is ObjectAlloc. This is great for looking at dynamic memory use, both how much memory am I using right now and how much was the peak that I used in some particular operation. You can use this with Cocoa applications and this is great for seeing your allocations by allocation type.

What type of objects, so Cocoa objects, core foundation objects like you would also have with Carbon applications, and just general malloc allocations and it says what size they are because a lot of times you'll allocate specific sizes at specific points in your code. So you can see all that. And you can look at information about specific instances of that.

It's not quite as good for pinning down precise memory leaks. MallocDebug is still the best application for that. This shows a full call tree of all the allocated memory, not by type, and it's not so good for dynamic allocation. But it is still the best tool for leaks. There's a command line tool equivalent of this called leaks that can also show you the backtraces of where allocations were occurring if you set the MallocStackLogging environment variable.

So another major function of MallocDebug was to help find corrupt memory operations. But we actually have a better solution for that at this point. The purpose was to help crash your application if you did something bad with memory. But really, you want to be operating within a debugging environment if this happens.

So we have a new Malloc debugging library called GuardMalloc. So this operates within the context of the Xcode debugger. There's a nice switch on the debugger menu item now to say "Enable GuardMalloc." And what this does when you turn it on is every allocation you make goes onto a separate virtual memory page.

Then the end of the buffer is lined up with the end of that memory page, and the next page is not allocated. So if you overrun the buffer, you'll crash immediately, and since you're in the Xcode debugger, you can see immediately where buffer overruns are in your code. If you free the block, then we free the virtual memory page. And so if you go and read or write from that page again after freeing it, then again, you crash immediately.

So this is a great way to find really nasty memory problems. So you can learn more about this in the Xcode debugging session on Thursday morning and the libgmalloc man page. So not so much a performance tool, but a great solution. So again, I've said we're putting a lot of effort into Shark.

We're trying to add a number of these memory analysis features into Shark as well. Shark can now do malloc allocation sampling and show you the size of your allocations and call trees there. There are still different strengths and weaknesses of our memory analysis tools. Again, ObjectAlloc is great for dynamic analysis and looking at specific object types. MallocDebug is good for leaks, and we want to add leaks detection into Shark, but Shark has new capabilities too.

So that's it for a broad brush overview of the performance tools. Now let's dive into a specific case study. Now I'm not actually going to do the requisite planetary motion simulator that seems to be so popular. I'm going to be looking at an application called Disk Inventory X.

This is an open source application. It's kind of cool and actually kind of useful. It uses a concept from Ben Shneiderman at the University of Maryland for representing hierarchical information in a compact two-dimensional space. So it's an open source application, GPL'd. I'll be sending changes back to the author.

He's pretty excited about that. And as we go through, we'll be looking at a number of areas of what might be slow here: time, memory, other resource use. So in your application, as you look at something like this, what might you want to look at? Of course, major operations.

How long does it take to open a large document? If the application is idle, again, you should be taking 0% of the CPU. And again, watch for UI spins and deal with those. Memory size, I've talked about the importance of looking at dynamic memory use. We'll see that.

Leaks. One thing that may not be obvious is autoreleased objects with Cocoa applications. If you create a separate thread, or if you have a Foundation-based tool, it's real easy with a lot of the Cocoa APIs to end up creating an autoreleased object, but maybe not getting back and freeing the autorelease pool very frequently.

Maybe it's a long-running thread, and those objects just build up and get paged out, and that can take-- I've actually seen applications crash due to this problem. The system gets slow, you crash. So also look at disk and network activity. I'm going to be specifically looking at some of this with our sample application here. So what I want to do is switch to--oops--we'll switch to demo one here.

So this is the disk inventory application. What we've done here is-- actually, we can't see the menu bar up there. If we can get the menu bar, that'd be great. So what we've done is taken a look at our applications directory that's got 1.9 gigabytes of space in it. And I'm interested in where is that space going?

So what this application does is graphically show me the size of the files. The larger the rectangle, the bigger the file. And the color represents what kind of file it is. So we can see the blue is a disk image. So wow, I have at least one big file here. OK, that looks like the Adobe Photoshop 7 disk image.

I probably don't need that. Down here, a couple other disk images, application packages. So this is kind of cool. I can click on a directory and see how much space that directory is taking. I can move around with the mouse and see things there. So it's actually kind of useful.

So let's go ahead and quit out of this and bring up the Performance Tools folder. I'm going to launch the Big Top tool that I referred to, and I'm also going to launch Spin Control. I'll just put Spin Control down here in the background. Now, with Big Top, I can look at things like the CPU usage. As I move a window around, we can see that CPU use goes up and down, as expected. Let's go ahead and launch the Disk Inventory application again.

I'm going to look specifically at the Disk Inventory process and watch its memory size as I go through. So what I'm doing is going to Open Recent and reopening the Applications window there, and it's analyzing that. So you can see that memory use is climbing here. I've added a little instrumentation window here and it took a little bit of time to analyze that 1.9 gigabytes.

So that took about 9 seconds to scan that folder and a little less than a second here, so about 10 seconds to look at this. Actually I haven't tried this operation on this machine. We can also show package contents. And note that we actually caught a little spin here as well with Spin Control at this point.

So it took about four seconds to show the package contents. And with the spin, I can come down here and select that and show a text report and just see what was happening in there. So we were making a bunch of recursive calls to determine the file kinds inside of this package that I'm looking at. So that's interesting.

So we saw the memory use climb. That's not totally surprising because we were building up data structures to represent this. But we should look at that and see if we're as efficient as we could be. Let's look at one other thing. If I resize this window here, notice a slight pause before redrawing. And that was interesting with the memory use.

A little spike there, actually. That looks like it was probably over about a megabyte of dynamic memory creation while I was resizing that window. So maybe that dynamic memory use could be a little bit more efficient. I think that the creation and deletion has something to do with why it's not as fast as it could be. So let's go back to slides.

So when we tested this, we don't have such beefy hardware in the labs. We have mere mortal dual G5s. So my results on testing this same directory in the lab were a little slower than that. It was actually almost 20 seconds for scanning the folder and getting the file sizes, and about 10 seconds actually to classify the file kinds.

And showing the package contents was again pretty consistent there at about 4 seconds for a total of almost 33 seconds to scan not quite 2 gigabytes of space. There's a lot of 80 gigabyte disks out there on your personal computer systems. So what, 20 minutes to scan my disk? What if I have a terabyte disk farm and I want to use this technique? That could be nasty. Maybe we can speed this up. Now, I've often heard the question of what are the best timing APIs for instrumentation on the system?

So, mach_absolute_time is a Mach API that's the fundamental call. This goes down in and reads the timebase register out of the CPU. There's a number of other APIs that you can use, depending on what's convenient for you. Like gettimeofday is a nice portable API in the Unix environment. These all end up calling down into mach_absolute_time.

This is the actual code that I used in this application. So I simply call mach_absolute_time in get time, which gives me a 64-bit value back. I guess I could just call it directly. Then, in subtract time, once I have two of these, I just subtract them and apply a conversion to get a double value in seconds, which makes it easy to print. So with that, I've identified that we have some issues. I'd like to bring on one of our experts in analyzing these issues and also then creating tools to help do this process. So it's my pleasure to introduce Christy Warren.
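The slide code itself is not in the transcript. As a rough sketch of the get-time and subtract-time pattern described above, the Python below uses `time.perf_counter_ns()` as a stand-in for `mach_absolute_time`. The function names and the 1/1 ratio are assumptions: on Mac OS X the ratio that converts ticks to nanoseconds would come from `mach_timebase_info`.

```python
import time

# Assumed timebase: the stand-in clock already ticks in nanoseconds, so the
# ticks-to-nanoseconds ratio is 1/1. With mach_absolute_time the ratio would
# be read from mach_timebase_info() instead.
TIMEBASE_NUMER = 1
TIMEBASE_DENOM = 1


def get_time():
    """Return a raw integer tick count (stand-in for mach_absolute_time)."""
    return time.perf_counter_ns()


def subtract_time(end_ticks, start_ticks):
    """Convert the difference of two tick counts to seconds as a float."""
    elapsed_ns = (end_ticks - start_ticks) * TIMEBASE_NUMER / TIMEBASE_DENOM
    return elapsed_ns / 1e9


start = get_time()
total = sum(i * i for i in range(100_000))  # the operation being timed
seconds = subtract_time(get_time(), start)
print(f"operation took {seconds:.6f} s")
```

Keeping the raw ticks as integers and converting only after the subtraction avoids losing precision in the timestamps themselves.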

[Transcript missing]

So going back to this picture, we're going to zoom in on this graph. And it's not just complexity at the high level. Look at it as you zoom in to the finest detail. You see repetition on many different levels. It's like a fractal, like in a Mandelbrot set. You see coarse, grand detail and then finer structure as you zoom in. It's amazing what we run into with software and processors today. It's just incredible.

So to analyze performance, I've come up with a sort of simple formula. The impact of any operation you do is equal to the cost of that operation times the number of places it's used in your code. So like in the Dave and me quicksort example, there's two uses of it that are redundant, so that makes it twice as expensive as it needs to be.

Now traditional profilers make it really easy to understand cost. You know, you sample a program, you see all the leaf functions that you spend time in. But it's hard to see use, because these are complex patterns of usage that often go through not just my library, but ten of your libraries scattered throughout the system.

So to help us analyze use, we have two techniques available to us. One is called Call Stack Data Mining. And this is a new functionality that we're introducing. We're not aware of this being available elsewhere in other programs. And the idea here is you can strip away the stuff that you don't want to see and focus on what you really care about. I'll describe that more in a second.

The other approach is this graphical analysis. The idea here is you visualize your execution trace, as I've shown. And it turns out there's a technique called software fingerprinting, where similar patterns in the picture mean that you're going through the same code path. And if there's repeated patterns that are the same, like heartbeats on an EKG, that means you're going through the same code path ten, a hundred, or a thousand, or a million times.
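To make the fingerprinting idea concrete, here is a deliberately naive sketch that looks for repeated shapes in a trace of call-stack depths. This is not how Shark's chart view works; the trace values and the fixed window length are made up for illustration.

```python
from collections import Counter


def repeated_patterns(trace, length):
    """Count every window of `length` consecutive samples and return the
    windows that occur more than once, most frequent first."""
    windows = Counter(
        tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)
    )
    return [(pattern, n) for pattern, n in windows.most_common() if n > 1]


# Hypothetical trace: call-stack depth at each sample. The same "heartbeat"
# shape (3, 5, 7, 5, 3) repeats four times before something else happens.
trace = [3, 5, 7, 5, 3] * 4 + [2, 9, 2]

top_pattern, occurrences = repeated_patterns(trace, 5)[0]
print(top_pattern, "seen", occurrences, "times")
```

A pattern that repeats many times is a hint to check whether the same work is being redone on the same data.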

And it's at least worth looking to make sure that you aren't just doing it on the same data over and over again. Or even if you are doing it on different data, can you hoist things out? Like in a QuickDraw, I mean a quicksort, sorry, each time you do a compare, the compare may go through a whole bunch of layers of objects.

You know, a whole bunch of overhead that has nothing to do with, you know, just comparing two values. So remove that stuff. Pull it outside of the iterative structure, and your program will run a lot faster. But with this tool, you'll be able to see these things. You know, they'll shout out at you, things that you'd otherwise have to go digging through code and spend countless hours trying to find if you didn't have these tools.
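A small sketch of hoisting that per-compare overhead out of a sort, in Python. The record layout and `expensive_key` are hypothetical; the counter just makes the difference in work visible.

```python
import functools
import random

lookups = 0


def expensive_key(record):
    """Hypothetical stand-in for a compare that walks several object layers."""
    global lookups
    lookups += 1
    return record["owner"]["profile"]["name"]


def compare(a, b):
    # The layered lookup is repeated inside every single comparison.
    ka, kb = expensive_key(a), expensive_key(b)
    return (ka > kb) - (ka < kb)


random.seed(0)
records = [{"owner": {"profile": {"name": random.random()}}} for _ in range(1000)]

lookups = 0
slow = sorted(records, key=functools.cmp_to_key(compare))
lookups_in_compare = lookups

lookups = 0
fast = sorted(records, key=expensive_key)  # lookup hoisted: once per record
lookups_hoisted = lookups

print(lookups_in_compare, "lookups inside compare vs", lookups_hoisted, "hoisted")
```

Both sorts produce the same order; the hoisted version does the layered lookup exactly once per record instead of twice per comparison.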

This is working? So data mining concepts. So I talked about stripping away what you don't want. Now, I have a question for you. How many of you have profiled your program and seen not your code, but like countless system files, system frameworks? Isn't that stuff annoying? Some of you? A lot of you? Yeah. I mean, when I first used a profiler, that was the first problem I ran into when I was new at this. And it was like, well, what good is this? I can't do anything about C-Live. I can't do anything about AppKit.

I can only do things on my own program. So wouldn't it be great to have a button that goes-- you push this button, and all that stuff goes away, and you see your functions as the leaves, and the charge-- the cost of the system libraries all ascribed to various functions in your code. Wouldn't that be a lot better?

I mean, I find that really useful. And that's a very coarse operation. If you want to do finer-grained stuff, you could strip away one library. Now, let's say you're working with the AppKit team or the Foundation team, so you want to see details in those libraries, but you want to get rid of the low levels. So you want to get rid of Core Foundation. You want to get rid of the standard library.

Exclude by library lets you strip out a particular library in the trace. And this is all non-destructive. It gives you a different view on it. And you get rid of those libraries, and it charges the cost up. So by transforming the data, you can focus in on the hotspots that you care about.
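One way to picture the exclude-library transformation is as a filter over sampled call stacks. The sketch below is an illustration only, not Shark's implementation; the sample data and the function names `exclude_libraries` and `leaf_counts` are hypothetical.

```python
from collections import Counter

# Each sample is one call stack, outermost frame first, as (library, symbol).
# All of this is made-up illustration data.
samples = [
    [("MyApp", "main"), ("MyApp", "doExample"), ("MyApp", "bar"),
     ("CoreFoundation", "CFDictionaryGetValue"), ("libSystem", "memcmp")],
    [("MyApp", "main"), ("MyApp", "doExample"), ("MyApp", "bar"),
     ("CoreFoundation", "CFDictionaryGetValue")],
    [("MyApp", "main"), ("MyApp", "doExample"), ("MyApp", "bar")],
    [("MyApp", "main"), ("MyApp", "init"), ("libSystem", "malloc")],
]


def exclude_libraries(stacks, excluded):
    """Drop frames from the excluded libraries. The sample itself is kept,
    so its cost is charged to the nearest remaining caller."""
    return [[f for f in stack if f[0] not in excluded] for stack in stacks]


def leaf_counts(stacks):
    """Bottom-up view: how many samples end in each symbol."""
    return Counter(stack[-1][1] for stack in stacks if stack)


before = leaf_counts(samples)
after = leaf_counts(exclude_libraries(samples, {"CoreFoundation", "libSystem"}))
print("before:", dict(before))
print("after: ", dict(after))
```

Because `exclude_libraries` builds new lists, the original samples are untouched, which mirrors the non-destructive behavior described above.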

Then there's this idea of Flatten Library. Instead of completely eliminating a library, you can flatten it to its entry points. So you see only where your code is calling into these libraries. You don't see all the details of how, say, CFDictionary is implemented, because you don't care about that. You just care that I'm calling CFDictionaryGetValue.
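Flattening to entry points can be sketched in the same spirit: within each call stack, a run of consecutive frames from the flattened library collapses to its first frame. This is an assumed reading of the operation, and the internal symbol names below are invented.

```python
def flatten_libraries(stacks, flattened):
    """Collapse each run of consecutive frames from a flattened library
    down to the first frame of the run (the entry point)."""
    result = []
    for stack in stacks:
        out = []
        for lib, sym in stack:
            if lib in flattened and out and out[-1][0] == lib:
                continue  # interior frame of the library: hide it
            out.append((lib, sym))
        result.append(out)
    return result


# Hypothetical stack, outermost frame first. The cf_internal_* names are
# invented placeholders for library internals.
stack = [
    ("MyApp", "main"),
    ("MyApp", "bar"),
    ("CoreFoundation", "CFDictionaryGetValue"),
    ("CoreFoundation", "cf_internal_find_bucket"),
    ("CoreFoundation", "cf_internal_hash"),
    ("libSystem", "memcmp"),
]

flat = flatten_libraries([stack], {"CoreFoundation"})[0]
print([sym for _, sym in flat])
```

The entry point stays visible, so you still see that `bar` calls `CFDictionaryGetValue`, but the library's internals no longer clutter the tree.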

So, to help you see what you want, use the thing called Focus Symbol. In Focus Symbol, you choose a particular call tree that you want to look at. And when you do that, you strip away everything that's above it or to the sides of it. So, I just told you a lot, and it's kind of thick, so I'm going to give you some pictures to help illustrate this. So, here we have a main, and it calls an init function, you know, do example, which does your work, and some cleanup.

And then you have this bar function that's called four times by, you know, our function. In real programming, it's probably more like a hundred or a thousand times, but I made it simple here. And this, in turn, calls Core Foundation, and it's using CFDictionaryGetValue. Now, if I just profile this, I'll see functions mostly from these ones in yellow.

These are leaf functions, and they'll be far removed from what you care about. So, if we do the exclude library, those go away, and now bar becomes our leaf function. So, by doing that operation now, instead of seeing things that are far removed from what we care about, we're in what we care about.

So now, Flatten Library is similar. Let's go through this quickly. But it replaces the library with the entry point. Focus. Do example. Strip those away. And boom. So by doing these transformations, you can manipulate your qualities. And this is also really cool because you can make really good performance arguments.
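The focus operation from the diagram can be sketched as keeping only the stacks that pass through a chosen symbol and trimming everything above it. The stacks below loosely follow the main / init / do example / cleanup picture; the symbol names are illustrative, and this is not Shark's code.

```python
def focus_on_symbol(stacks, symbol):
    """Keep only samples that pass through `symbol`, and trim every frame
    above it so that `symbol` becomes the root of the call tree."""
    focused = []
    for stack in stacks:
        if symbol in stack:
            focused.append(stack[stack.index(symbol):])
    return focused


# Hypothetical call stacks, outermost frame first.
stacks = [
    ["main", "init", "malloc"],
    ["main", "doExample", "bar", "CFDictionaryGetValue"],
    ["main", "doExample", "bar", "CFDictionaryGetValue"],
    ["main", "cleanup", "free"],
]

focused = focus_on_symbol(stacks, "doExample")
for stack in focused:
    print(" > ".join(stack))
```

Samples from `init` and `cleanup` drop out entirely, so the remaining counts describe only the work under `doExample`.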

When you strip these things away, you're no longer trying to point out something here, something there, something there, and maybe it makes sense. You actually see the counts and the places that matter. And you can make really good arguments to other people that we need to work on this stuff. We need to fix it. So with that, let's go do a little demo. Bye now.

We're going to launch our application, and we're going to launch Shark. Now, how many of you have used some version of Shark before? So about most of you? So this is Shark 4. And you're going to see a lot more of it in the Shark session later this week on Friday at 3:30. But you're going to get a little preview today.

The UI is a little bit different. The original Shark was a time-based sampler that lets you sample the entire system. That's really cool. It's good to have it around in the background and whatnot. But in this case, we want to focus on particular processes. And there's a number of different things we can do. We can trace memory allocations, function traces.

We can trace various Java things if you do Java. And for this application, we're going to use what's called a sampler time profile. We choose this because the program uses file system operations. And this involves a lot of waiting on the kernel. And this trace does the best job of attributing those costs to the user calls. So let's go over to our application.

And do Open Recent. And Shark gives you this Option Escape hotkey, which is really handy. So I'm in the middle of the UI manipulating, and I can hit Option Escape and do Start. So we're going to start our scan. And this stops every thread and records a sample, whether it's doing something or it's waiting on a kernel call. Now that it's done, we stop.

And we have a heavy view. This is a view of all the leaf functions, as I've been talking about, and the relative percentage of the counts that we're in. So we click on this syscall_thread_switch, and on the right here, this is one of the nice new features, you can see a backtrace of that particular sample. So we see that we're in a heartbeat thread. Well, that's not very interesting. It's sleeping, in sleepUntilDate.

So in this case, let's get rid of that thread. So we're going to go down to the thread pop-up here and choose one of the threads. And it's going to give us the one that we're interested in. So the topmost function here is getattrlist. And this gives you a very large call stack to look at. Now, to help us navigate things a little bit, there's a nice little thing over here called color by library. And when you click that, some of you may remember this from Sampler. That feature was in there.

And now we see that we've colored things by different colors. So lib system is lavender, comp page is red, disk inventory is brown. And this just helps us make a little more sense of it visually without spending a lot of time. Let me make this a little bigger.

So we click on getattrlist. We see that, yes, we're in user code. This FSItem loadChilds is our own thing. But let's use the exclude library operation. We're going to exclude libSystem.B.dylib. And when that happens, that goes away and we see that a fair amount of time is spent in CarbonCore. So we're going to do this again. We're going to exclude library CarbonCore. And we'll do this a few more times. So there, one piece of user code comes up. Let's do Core Foundation and Launch Services.

And you see that this FSItem loadProperties and this loadChilds are all floating up as pretty major players in this profile. Before we go in and look at those in a little more detail, let's go over to the heavy and tree view. This is another new feature in Shark.

It lets you see both the heavy and the tree view, the top-down view, simultaneously. So in the top-down view, we start at the start of our program, kind of like that diagram I showed you. And you walk down through your program until you get to our code. But there's still a problem.

You've probably seen this before, too, that there are all these AppKit calls. There are all these system calls that make it kind of hard to look around. If you expand one of these trees, the outline will be awfully big and hard to keep track of. So we're in luck. There's a button over here called Flatten System Libraries, which does that flatten operation in all the system libraries.

And when we do that now, this guy simplifies, and it's a lot more manageable. There's only a few layers of these calls, so it gives us context. You could also exclude them if you wanted to. But in this case, I just thought it was useful to help me keep track of where I was.

But then we'll notice another problem, which is we just expanded this recursive call to load. This is a file system application, so it's very natural to write it in a recursive style. But if any of you have tried to analyze performance on a recursive function, it's rather difficult.

Because at each layer of the recursion, you may call out to a branch function. And each of them individually will show up as relatively small contributors. But there's no way to kind of gather them together and focus them. So for example, this FSItem path shows up as 0.1% here, 0.1% here at different levels. But it's kind of hard to determine if that means anything or not. But luckily, there's another option here called flatten recursion.

We click on that and look what happens. loadChilds becomes a single thing, and look at that: initWithName:parent: suddenly pops up to 43% of the overall time. So by using this data mining, we're getting past the obstacles and getting to the parts that are interesting. And by the way, with the Shark 4 download that Dave mentioned, there's actually a nice tutorial that you can go through that will walk you through these things. You don't have to remember everything I'm going through today. So let's double click on initWithName:parent:.
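The effect of flatten recursion can be sketched in a few lines of Python. This is a simplified model, not Shark's algorithm: it only collapses directly recursive frames, and the sample stacks and function names (`flatten_recursion`, `weights_by_path`) are assumptions for illustration.

```python
from collections import Counter

# Call stacks, outermost caller first. The same callee is reached at three
# different recursion depths, so each path carries only a small weight.
SAMPLES = [
    ["main", "loadChilds", "initWithName:parent:"],
    ["main", "loadChilds", "loadChilds", "initWithName:parent:"],
    ["main", "loadChilds", "loadChilds", "loadChilds", "initWithName:parent:"],
]


def flatten_recursion(stack):
    """Collapse runs of the same frame so every recursion depth looks alike."""
    flattened = []
    for frame in stack:
        if not flattened or flattened[-1] != frame:
            flattened.append(frame)
    return flattened


def weights_by_path(samples, flatten=False):
    """Count samples per distinct call path, optionally flattening recursion."""
    paths = (flatten_recursion(s) if flatten else s for s in samples)
    return Counter(tuple(p) for p in paths)


if __name__ == "__main__":
    print(weights_by_path(SAMPLES))                # three paths, one sample each
    print(weights_by_path(SAMPLES, flatten=True))  # one path, three samples
```

Once the depths are merged, the small per-level contributions add up into a single entry that is big enough to notice.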

And pray to the demo gods. There we go. So this shows you source now, annotated with percentages of time. Those of you who've seen Shark have seen this before. But there's a couple new things that are really cool. You notice that various symbols are underlined. So that means you can follow that link, just like in a web browser. So if we double-click on self loadProperties, which is our heaviest line, we go to another source file.

And you can navigate forward and back. And this way you can move around and explore your performance problem in a way that's much more concrete, at least it is to me. I mean, isn't it better to deal with source than to deal with these trees of symbols? So I found this really a nice feature. So now we can look at our problem. This class is an FSItem. Even its name suggests that it's something you do on every file system item that you encounter as you iterate through these directories.

And if you look at the details here, it does an FSPathMakeRef. It does fileAttributesAtPath. It does FSGetCatalogInfo. So there's a bunch of these. It's only about five operations. And if I look elsewhere, there'd be even a sixth operation that we're doing for every file in the directory.

Now, Dave, there are bulk file system operations that we support. They're really cool. And you can reduce this from doing this for every file to just doing it for every directory. And this should give you a really nice speedup. So please consider using that, you know, in optimizing your program. So while Dave's off working on that, I'm going to show you some function tracing.

You know, I just showed you data mining and how to analyze your program using data mining. Now I'm going to show you graphical analysis. You know, using this feature called function tracing, you can specify a list of functions that will let you do an exact trace of the functions that are called. So I can choose function trace, and there's some presets down here. And you can also enter your own if you have a set that you particularly like. So I'm going to do file I/O.

And this gives you a list of file I/O functions. It's a little bit hard to read, but these are just the standard Unix file calls. And I already made a preset here called file I/O. So we're going to choose that. And go back to our program. Open recent. Start recording. And this time... Oh.

Okay, we'll just have to do this again. That's the nice thing about Shark is it's pretty forgiving. And when you do an exact trace, you want to do it for a relatively short time, or you might wind up with hundreds of thousands or even millions of samples. Even in that short time, we got 60,000 samples.
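To make the difference between sampling and an exact trace concrete, here is a small Python analogy. It has nothing to do with how Shark hooks functions. It uses Python's own `sys.setprofile` hook, and the `FunctionTrace` class and the traced function names are hypothetical. Every call to a listed function produces one record, which is why even a short run can generate a very large trace.

```python
import sys


class FunctionTrace:
    """Record every call to the named Python functions while active.

    Functions are matched by bare name only, which is a simplification.
    """

    def __init__(self, names):
        self.names = set(names)
        self.events = []

    def _profile(self, frame, event, arg):
        # "call" fires once per Python-level function call.
        if event == "call" and frame.f_code.co_name in self.names:
            self.events.append(frame.f_code.co_name)

    def __enter__(self):
        self._previous = sys.getprofile()
        sys.setprofile(self._profile)
        return self

    def __exit__(self, *exc_info):
        sys.setprofile(self._previous)
        return False


def read_item():
    pass


def write_item():
    pass


if __name__ == "__main__":
    with FunctionTrace(["read_item", "write_item"]) as trace:
        for _ in range(3):
            read_item()
        write_item()
    print(trace.events)
```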

And this is kind of a cool view. You get a distribution of different file system calls and the percentage of time that you've used them. So it gives you a hint of what your program is doing. But there's an even better thing we can look at here. Let's go to the chart.

And in the chart here-- let me do one thing. So I get selections out of the way. Here in the chart, you see that there's this kind of wavy pattern. And let's just zoom into it a little bit, like here. And this is a new feature. This is a really nice zoom control. As you drag along, you can zoom in and out, just like we did in that movie. That movie wasn't fake. It was just filmed from the actual live program.

So we go in here, and we see this thing that looks like we're iterating over files. It's kind of different levels. If you look in a Finder outline view, you'll see that it's a similar kind of pattern. And you'll see that loadChilds shows up in the stack here. So let's just do flatten recursion.

And look what it does: it completely flattens out our trace. And we come down here and we find a fingerprint. This little shape here is very redundant. It occurs over and over again, even in this little thing. And that happens to be in loadChilds, initWithName:parent:, and then loadProperties.

So we found our culprit with graphical analysis very quickly. So use both techniques. If you have an idea of what functions are expensive already, you can do a function trace. You know, if you need to figure out what areas are expensive, do time tracing and use call stack data mining. Okay, back to you, Dave.

Okay, praying to the audio gods. Excellent, excellent. Good job, thanks. Okay, moving on. So I did my homework while Christy was speaking. And what I've learned here, I studied the app. In each directory, we're making a directoryContentsAtPath call to say, enumerate all the files and folders in this directory. Then for each one of those items, let's go through and do a number of things to gather the information that the program wants to display.

So again, I'm getting an FSRef for each item so that I can make additional calls with that. I want to know whether it's a folder or a file or a symlink because I don't want to navigate the symlink to duplicate the representation of the space taken by the file. So I make a fileAttributesAtPath call on that.

I want to get the file sizes, so the data fork and the resource fork, and also the parent ID to see if I'm on the same volume. I don't want to walk off to multiple volumes here. So I'm making an FSGetCatalogInfo call on that. And finally, when it's classifying files, it's saying I want to get the kind string as represented in the Finder. So if it's a .nib file, we want to show that as Interface Builder document. So for each FSRef, we end up calling down to Launch Services saying get me the kind string for this.

And then we're going to call the FSRef this file. So having done my homework, I did learn about the FSGetCatalogInfoBulk call. So what this does is it's optimized. I can say for X number of files, I can specify how many I want in a given directory.

I'm looking for a set of information here. I want to get the bit that says is it a directory or is it a file. I want to get the parent directory ID. I want to get the directory ID. I want to get the resource and data fork sizes.

I want to get the type and creator information. And we'll see what I'll do with that in the next slide. And then I want to get just the full array of FSRefs for all the individual items and the full array of entry names. So I get arrays of all this from one call, where before I was making lots of file system calls. So in the classifying files, so again, what we were doing was hitting the file system once for each file to say, give me the kind name for this. And then the way the code was written, it's actually storing that kind string for each individual different file.
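The general shape of this change, one directory-level read that returns entries with their metadata attached instead of several separate queries per file, can be illustrated with Python's `os.scandir`. This is only an analogy and not the Carbon call from the talk. The `folder_size` name is an assumption for the example. On most platforms the type checks below need no extra system call, while `entry.stat()` may still cost one call per entry on Unix-like systems, so this mirrors the bulk call only in part.

```python
import os


def folder_size(path):
    """Total size in bytes of regular files under `path`, skipping symlinks."""
    total = 0
    # One directory read yields entries that already carry their type.
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_symlink():
                # Do not follow links, so no file is counted twice.
                continue
            if entry.is_dir(follow_symlinks=False):
                total += folder_size(entry.path)
            elif entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total


if __name__ == "__main__":
    print(folder_size("."))
```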

But if we step back and think about it, I just don't have that many different kinds of files. And I really don't need to query the file, you know, about the specific file. What I care about is the kind. And the information that specifies the kind is the type, the creator and the extension.

So I can build a dictionary to map this triplet of type, creator, extension to the file kind string. So I actually put all of those into a string and just make it unique, use that as a key into an NSDictionary to do a lookup. Now, if I don't find it in the cache there, I can make a different Launch Services call to say, given this triplet of information, look up the kind string for that. Now, that's not even hitting the file system, right? So I'm going down from one operation for each file to zero file system accesses.
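The caching scheme described here can be sketched as follows. This is a minimal Python model, not the application's Objective-C code. `KindCache` and the `lookup_kind` callback are hypothetical names, with `lookup_kind` standing in for the Launch Services query. The talk builds a single string key for an NSDictionary, and a tuple plays the same role here.

```python
class KindCache:
    """Map (type, creator, extension) to a kind string, one lookup per kind."""

    def __init__(self, lookup_kind):
        # lookup_kind is a stand-in for the expensive system query.
        self._lookup_kind = lookup_kind
        self._kinds = {}

    def kind_for(self, file_type, creator, extension):
        key = (file_type, creator, extension)
        if key not in self._kinds:
            # Cache miss: ask once, then reuse the answer for every
            # later file that has the same triplet.
            self._kinds[key] = self._lookup_kind(file_type, creator, extension)
        return self._kinds[key]

    def __len__(self):
        return len(self._kinds)


if __name__ == "__main__":
    calls = []

    def fake_lookup(file_type, creator, extension):
        calls.append(extension)
        return extension.upper() + " document"

    cache = KindCache(fake_lookup)
    for ext in ["nib", "nib", "txt", "nib"]:
        cache.kind_for("", "", ext)
    print(len(calls), len(cache))
```

The cache also stores one string per distinct kind rather than one per file, which is where the memory saving comes from.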

And I'm also only storing the kind string for each different kind, not once per file. So I'm also significantly reducing my memory use. So before I show the results, let's see if there's anything else that we can determine from the application here. So let's do a memory analysis demo.

So let's go ahead and quit out of the app and bring up our performance tools again. So let me also point out Shark is now up here in the performance tools. It used to be down in the CHUD folder. Now it's really going mainstream here. So let's look at ObjectAlloc, though. So we double click on that. What we do here is we launch our target application from within ObjectAlloc, because it needs to set up some of the environment for it.

So with this, I simply hit Go. And what I want is to keep the backtraces. I could keep reference counts on objects, but I don't need that in this situation. So it goes ahead and launches the app. Hasn't done too much yet. I'm going to change the scale here, because I might have a lot of objects. And we'll see that this application is doing live updates. Let's go ahead and walk that folder hierarchy again. As we go through, we see-- let's do an auto sort.

We see that we're building up a lot of CFStrings. These are currently allocated items. We're building up a lot of FSItems. Now those make sense, kind of. FSItem, I'm getting one for each file system item. The CFString is actually the name for that particular file system item, so that's being stored. So that's kind of useful. I can see the peak amount, how many has been the peak of any particular type that I had, and I can see the total amount. And you saw again, live update and auto sorting.

So if I go to total, this is really interesting. We see the different colors of the bars here. What the red bars indicate, as opposed to blue, is the percentage of objects that you have left remaining. What's the current number of objects versus the total you've allocated? Red means that you have less than 10% of them remaining, so maybe you've got a dynamic memory issue there of creating more of them than you actually need. Yellow means, I believe, it's 25% or a third. So the bright color indicates the number that's currently allocated, so we can see that we have a lot of CFStrings still. We had a little bit more peak, and we had more total that we got rid of.
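The ratio behind those bar colors is simple arithmetic, sketched below in Python. Only the 10% red threshold comes from the description above. The function names and the numbers in the example are assumptions, and the yellow threshold is left out because its exact value is not certain.

```python
def remaining_fraction(current, total):
    """Fraction of all objects ever allocated that are still alive."""
    if total == 0:
        return 1.0  # nothing was allocated, so nothing was churned
    return current / total


def mostly_transient(current, total, threshold=0.10):
    """True when fewer than `threshold` of the allocated objects remain."""
    return remaining_fraction(current, total) < threshold


if __name__ == "__main__":
    # Hypothetical counts: a short-lived class and a long-lived one.
    print(mostly_transient(50, 100_000))
    print(mostly_transient(90_000, 100_000))
```

A class that trips this check is a candidate for the kind of dynamic memory churn worth investigating.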

But what's this NSPathStore2? We can see that we've got 24 of them left, but we allocated 180,000 of them in going through this. What's up with that? I can actually double-click on this and get an allocation chart and see what the dynamic allocation pattern looked like here. So it kind of looks almost like this might have been walking the file system hierarchy and we're doing something here. It looks similar to some of the patterns that Christy showed in Sampler. I can go in and look at specific instances of these objects.

And where they're allocated and what the contents are, we can see this one is a Library/Fonts path, another path there. I can look at call stacks, go down through, descend the maximum path. Another thing I can do is I can set a mark and say I'm only interested in seeing the number of objects since the mark.

If I do the show package contents, we see that I can just watch how many objects are created during that operation there. So you can get a lot of information about your application here through this. What I want to do here is go back in and look at-- now, you'll notice that it actually took a lot longer to run under ObjectAlloc because of the amount of time it was taking. So don't do time analysis while you're doing this.

But let's go back in and do some memory analysis with Shark. So I'm going to switch to the Malloc Trace operation and start up Disk Inventory again and select the Disk Inventory process. So going back to Disk Inventory, let's now look at apps once again. And like Christy did, I'll start the sampling and stop it.

And I'm actually going to jump directly-- well, that's right. First off, I wanted to show that with the value here, if I switch to value, you can actually see in the call tree the amount of memory that was allocated by the various calls here. And I could exclude everything that I don't have source code for and get down to just seeing the stuff that I do and where that allocation is going.

So that's fairly interesting. Let's go directly over to the chart view. Remove the exclude no source. And we can see from the chart here that again this looks like we have some interesting patterns. So let's just click on one of these and that might be potentially interesting there. Let's zoom in a little bit, see what we might see. Zoom, zoom, zoom. Interesting little sawtooth patterns here.

So if I click on this, we see a number of different allocations of paths, and I can just use the cursor keys to move through. So it looks like, in fact, from the code, as I'm walking down through the file system hierarchy, the way the code was written was that when I got to each FSItem, it was making several calls to say, "Well, I need to know the path right here." I'm being a good citizen for memory and I'm not storing the full path with each object. That would be overkill. That's too much memory use. So I'll dynamically ask for it. So I'll get my path by asking what my parent folder is and then appending my name to it.

But my parent says, "Well, what's my name?" Let me ask my parent and then append my name to that. And up we go. So that's dynamically creating lots of autoreleased NSPathStore2 string objects with Cocoa. And then the next thing that we see happening is then we spend a bunch of time actually autoreleasing that.

We can see the autorelease time there. So you can see the impact of too much memory use. So since I'm recursively descending down through the file system, I should be able to, at each level, say, well, this is the path that I'm currently at. And when I go down into the next directory level deeper, just append the path part to that and pass that down through. I don't have to recursively go back up for every file system item.
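The two ways of obtaining a path can be contrasted in a short Python sketch. The `Item` class and `walk` function are hypothetical stand-ins, not the application's code. `Item.path` rebuilds the full path by climbing to the root on every request, which creates one temporary string per ancestor, while `walk` extends the parent's path by one component per level and hands it down.

```python
import posixpath


class Item:
    """A file system item that knows its name and its parent, not its path."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Costly pattern: ask every ancestor in turn, on every request."""
        if self.parent is None:
            return self.name
        return posixpath.join(self.parent.path(), self.name)


def walk(item, path=None):
    """Yield (item, path) pairs, appending one component per level."""
    if path is None:
        path = item.name
    yield item, path
    for child in item.children:
        yield from walk(child, posixpath.join(path, child.name))


if __name__ == "__main__":
    root = Item("/")
    fonts = Item("Fonts", Item("Library", root))
    print(fonts.path())
    print([p for _, p in walk(root)])
```

Both approaches produce the same strings. The difference is that the second builds each path once during the descent instead of rebuilding it from the root for every item.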

So that significantly reduced the amount of memory we're using. So now let's go back through and say, OK, that's all good. Did we have any results? Well, I was busy coding away and slapped a new binary up here. So let me dynamically enable some optimizations. And let's try it again.

So off we go. Boom. OK, so what are my test results here? The folder scanning that before took over 10 seconds in this case, about 9 seconds before, is now a little less than 2 seconds. Remember this is from vastly reducing the number of file system operations. The classifying files that again was asking for the file kind string for each file, that's down to virtually instantaneous because you'll remember I'm doing no file system calls there now. And if I do this show package contents operation, boom again, 0.16 whereas before it was about 4 seconds. So we can see that we've significantly reduced the amount of time the program is taking. Let's switch back to slides please.

So to summarize, what the tools helped me do is figure out that I should use bulk file system calls, and there's documentation about this. I actually copied much of the code from the performance documentation. I used caching of the file kind strings so I could just do rapid lookups and not query the file system.

That helps me reduce my storage for the file kind strings. I talked about reducing the dynamic creation of the path strings as we go through. And then as you go through, you know, optimization is an iterative process, right? You've got a hotspot, so you go in, you tune that, you make that faster, and now you've got a different hotspot.

So it was actually interesting to discover that once we made the file system accesses a lot faster, that the way it was updating the UI for feedback about what it was doing, display this path, display this path, display this path, was actually starting to take place. And it took a fair amount of time.

And so I just display fewer paths because all you want to know is where you are. And so that made things faster also because that's not an important part of my process here. So we made significant improvement here. These are the measurements I got in the lab, again, on somewhat slower hardware.

We ended up making the file system traversal seven times faster. Classifying file kinds, you know, depends on the size of your file system. But that's like infinitely faster. Much faster showing file contents for a total of, call it, ten times faster. So now this starts to get to be a more useful application for me.

So we've covered a lot of things here today, a lot of tools, a lot of techniques. We have a lot of documentation about this on the system for both of these. Plus the tools have documentation in them and with them. There's man pages for the command line tools.

And so in conclusion, we've seen we have some powerful tools that help you both monitor to see if you've got performance problems and then analyze what the problems are. We've put a lot of work into Shark, working with Nathan and Sanjay, doing a very collaborative effort here to try to improve both the power, add more new features, but make it easier to approach and understand at the same time. So we need to know how we're doing.

You know, does this work for you? If I remove Sampler from the systems, is that going to cause you a problem? So download the beta. Please send us your feedback. So I'm going to bring Xavier Legros on stage, who's our Mac OS X technology evangelist for this. This is a feedback list that you can send feedback about this to.