Tools • 1:10:22
Shark is a powerful, lightweight tool for collecting accurate information about the performance of your application. Move beyond the basics to discover how to further interpret what Shark reveals about your Mac OS X application and learn how you can identify and fix difficult performance problems. Finally, join us for a sneak peek at how Shark is evolving for future releases.
Speakers: Eric Miller, Lance Hammond, Rick Altherr
Unlisted on Apple Developer site
Downloads from Apple
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Good afternoon. Welcome to Advanced Performance Analysis with Shark. I'm Eric Miller with the Architecture Performance Group. We're the authors of Shark. We're here today because you care about the efficiency of your programs and their performance, and we care about the efficiency of your programs. Most of you have probably used Shark before, maybe messed with the basic controls, taken a time profile, and reviewed the session and made some optimizations in your applications. Well, we're here today because Shark does a lot more than simple time profiles.
We'll go over the windowed time facility, which changes the way Shark collects data by using a circular buffer for samples. And there are five other ways to actually start and stop Shark besides using the Start button that we'll go over briefly. Most of our talk will be devoted to two other methods for examining performance issues that are not time related. And at the end, we'll give you a sneak peek of Shark's future and what that means for you.
So first, the windowed time facility. In the normal mode of using Shark, you simply start Shark and let it run. The buffer collects samples endlessly until you stop Shark and examine the samples that were collected in between. And that's fine for probably 90% of the time you'll use Shark and take samples.
But there are times when you don't know exactly when you should start and stop. And there are other times when you just can't pin down a problem because it's difficult to reproduce. That's where the windowed time facility comes into play. You simply start Shark in windowed time facility mode and let it run forever.
The sample window will fill full of samples, and as new samples come in and fill the buffer, the oldest samples are dropped. This process continues, always keeping the latest buffer of samples available, until something occurs and you want to catch those samples, or you reach a point in your application where you want to stop and take a session.
So, like I said, you can profile indefinitely with the windowed time facility, because you'll never run out of memory from your sample buffer getting too big; it's a fixed size. You can use the windowed time facility with any of the trigger methods that I mentioned, and the windowed time facility is built into the Time Profile and System Trace configurations.
You can also change the size of the sample window with configuration options. Here you can see that there's a 10,000-sample window by default for Time Profile. The System Trace buffer is a million events. So keep the windowed time facility in mind when you're using Shark. I'm going to talk about some different ways to trigger Shark. But first, a brief word about the Start button.
It's right there on the front of Shark. It's really easy to find and easy to use, but you do actually have to click the button to initiate sampling and to stop sampling. That means the shortest session you're going to be able to collect spans a couple of button clicks, and most of that data is going to be the UI.
So that leaves some accuracy behind, but still, that's fine for most of your work. But when you need to get more precise, that's when these other trigger methods come in. The first method is the hotkey combination, which is listed right above the Start button, and by default it's Option-Esc. Now that's more accurate than just pressing Start and Stop, because Shark can be in the background of your window stack, it can be minimized, and you can still trigger sampling.
But you do, like the Start button, need to be logged into the system in order to use this facility. And if you happen to be using Option-Esc in your program, or maybe you just want the hotkey to be on the other side of the keyboard, you can configure it in Shark's sampling preferences tab.
The next way, more accurate still, is the launch panel in Shark. You can access that by setting Shark's target menu to Process, which adds a new menu whose first entry is Launch. And when you select Launch, you'll notice that the Start button gains an ellipsis.
That means when you actually press Start, you'll get the Launch Panel, where you can fill in your executable path, its working directory, shell variables, and arguments you might need. Then when you press OK, Shark will immediately begin sampling and then launch your application. So you can sample the startup of a program, and you can also sample any short-lived program.
Next is the unresponsive app trigger, because we all know how much we hate the spinning pizza of death. So what you do here is you put Shark in this mode, and Shark automatically begins sampling when the spinning pizza comes on screen. The session will then show you what the applications were doing while that pizza was on screen.
Because it's in everything mode by default, you can also filter on application names: then the only applications Shark samples will be the ones in the filter list. And of course, you can set the minimum amount of time that Shark should wait before it begins sampling when the spinning pizza appears. By default, it's half a second.
Now, this is your most important trigger method for this week: network and iPhone triggering. In this mode, once you select it, Shark uses Bonjour to discover machines: any Mac running Shark in shared mode on your local network, and any iPhones that happen to be tethered to your machine. They show up in the list. Once you connect to them, pressing the Start button on the local machine causes the shared Shark to start and stop sampling on the iPhone or iPod, or on the networked Mac.
Now, one of the differences here is that sessions that are created will be delivered right to your local machine's desktop as opposed to remaining on the other machine. So not only do you want to use this for your iPhone and your iPod and for network Macs, you can log into a Mac on the network and collect samples from a full-screen application. And it's also really useful for clusters of servers and single servers where you don't have access to the display directly.
And lastly is remote control. It says programmatic remote control because the way you use this is you actually add code to your application to start and stop Shark at your convenience. When you use those calls in your code, you of course have to link with the CHUD framework. If that's not a possibility, we also have a command-line utility, called CHUD Remote Control, which has command-line arguments for starting and stopping Shark to sample your application. This is the most precise way to control Shark.
Here's a simple programmatic example that's included as an example when we ship Shark. We're going to instrument the Towers of Hanoi solver every time the Hanoi function is called. So we slip a little code in. These are your five important lines for using programmatic remote sampling. First, initialize the framework.
Then call chudAcquireRemoteAccess; this ensures that Shark is ready for profiling, and essentially that means you've selected programmatic remote sampling in Shark. Then, right before the Hanoi function, we call chudStartRemotePerfMonitor, which takes a label. That label string will be the title of the session that Shark creates. Then we run the function, then we call chudStopRemotePerfMonitor, and every time we do that through the loop, we get a new session. When you finish with remote profiling, remember to call chudReleaseRemoteAccess. That way another program can connect to Shark.
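Roughly, here's what those five lines look like in C. This is a minimal sketch following the calls just described; the hanoi() solver, the iteration count, and the exact header name are stand-ins, so check the sample code that ships with Shark for the authoritative version.

    #include <CHUD/CHUD.h>     /* CHUD framework; link with -framework CHUD (header name may vary) */

    extern void hanoi(int ndisks);        /* the Towers of Hanoi solver being instrumented */

    void profile_hanoi(int iterations)
    {
        chudInitialize();                 /* 1. initialize the CHUD framework                   */
        chudAcquireRemoteAccess();        /* 2. make sure Shark is listening (it must be set    */
                                          /*    to programmatic remote sampling)                */
        for (int i = 0; i < iterations; i++) {
            chudStartRemotePerfMonitor("Hanoi");   /* 3. the label becomes the session title    */
            hanoi(20);                             /* 4. the code we actually want profiled     */
            chudStopRemotePerfMonitor();           /* 5. each stop produces a new session       */
        }
        chudReleaseRemoteAccess();        /* let another program connect to Shark afterwards    */
    }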
Here's a better example, maybe more concrete, that uses the windowed time facility. We did this with Final Cut Pro. Put Shark in programmatic mode, choose a config that has the windowed time facility enabled, and then add some code to your program that stops profiling if you find some kind of performance issue. So you can see here we start sampling before we enter a work loop. And then in the work loop, the only time we call chudStopRemotePerfMonitor is if there's some kind of reason to.
If a slowdown is detected, or in the case of Final Cut Pro a dropped frame, you stop. This way you can let Shark run all night, and at some point it may collect a session. You come back in the morning and, if any frames were dropped, you have sessions waiting for you. Hopefully you don't have sessions, because that means no frames were dropped.
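In code, that overnight work loop might look something like the sketch below. The drop-detection call is hypothetical, and the CHUD calls are the same ones from the Hanoi example; the windowed time facility is what makes the single stop call deliver a session containing the samples leading up to the problem.

    #include <CHUD/CHUD.h>
    #include <stdbool.h>

    extern bool render_next_frame(void);   /* hypothetical: returns false if this frame was dropped */

    void run_all_night(void)
    {
        chudInitialize();
        chudAcquireRemoteAccess();
        chudStartRemotePerfMonitor("Dropped frame");   /* start once, before the work loop */

        for (;;) {
            if (!render_next_frame()) {
                /* A frame was dropped: stop now, and the windowed time facility gives us */
                /* a session holding the most recent samples leading up to the drop.      */
                chudStopRemotePerfMonitor();
                break;
            }
        }
        chudReleaseRemoteAccess();
    }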
So I want to mention batch mode. Batch mode isn't a trigger mode, but it really enhances the other modes: instead of sessions popping up on your screen, the sessions are listed in order as they're created. This allows you to then choose which of the sessions you want to open instead of having a lasagna of windows on your desktop from dozens of sessions coming in. And that will happen.
Lastly, I'd just like to mention the command-line version of Shark, which is really quite similar to the Shark application, except it adds a couple of text-file output reports and some comma-separated-value reports depending on the configuration. You can use the same hotkey as the application. However, if you run the application and the command line at the same time on the same machine, you can never be sure which of the two will get the hotkey first. So if you use Shark and you log into a remote system with command-line Shark, you can use Control-Backslash to toggle sampling.
With command-line Shark, there are a host of command-line options to start sampling and launch your application. Command-line Shark can also attach to your running application and sample that. In network mode (the iPhone and network triggering I mentioned, where the application is the controlling side), the command line is always the shared version. It's the other side, the client, if you will.
And lastly, if you're using the remote triggering that I mentioned with the programmatic method, you can avoid using the CHUD calls at all, because the command-line version of Shark actually responds to the Unix signals SIGUSR1 to start sampling and SIGUSR2 to stop it. So after that brief run-through, I'd like to turn things over to Lance to talk about a couple of ways to use Shark.
Thank you, Eric. So now that we know many ways to start and stop Shark, let's look at how we can use Shark to move beyond the basic time profiling of your CPU-bound applications and go in and analyze the performance of your multi-threaded applications, which, with today's Macs, you'll inevitably be writing.
So the first part of this, we're going to go in and look at a configuration in Shark that you may not have used before, which helps you track down the idle time in your application and figure out where it's coming from. We call it time profiling all thread states.
So, if you're here in this session, we assume you've probably used Shark before. Otherwise, you might want to consider our session in two days at the same time. And hence, you've probably seen this tabular browser window pop up on your screen many times. And, well, you know, this is great.
It summarizes what exactly the CPUs in your program are doing in a very simple, concise format. And moreover, right up to the top, just screaming, hey, optimize me, look here, are the things that are executing the most in your application and telling you, hey, that's where you need to go optimize.
That's wonderful and all. But unfortunately, in today's multi-threaded applications, it's not the whole story. Instead, our application performance can really suffer because of what the CPUs are not doing and when nothing is happening. You know, it can be as simple as being blocked waiting for I/O, timers and things like that, or even page faults. But moreover, in your actual multi-threaded applications, it's more likely that your threads are synchronizing with one another. And they're waiting for locks, barriers, and other synchronization events to occur.
And unfortunately, if you just take a time profile of the whole system, all you'll see is this idle time pop up to the top. And that really doesn't tell you what you need to optimize. Moreover, if you take a time profile of your process, you'd simply get no or few samples.
So therefore, we're going to suggest that you use Time Profile All Thread States, which is a lot like Time Profile. It does almost the exact same thing in that it comes in and samples your application very frequently, or it can sample the entire system if you prefer. But unlike Time Profile, which just simply records what's happening on the CPU, All Thread States records not only the CPU-bound threads, but also all those threads that are just sitting around waiting to run and records, well, what they're not doing.
And so, you know, a really simple example of how we could use this is let's say your program is running along and it completely deadlocks and nothing appears to be happening. If you would take a Time Profile of this, well, you'd probably get no samples or very few samples.
With a Time Profile All Thread States, we can actually see and get samples of what all these threads are not doing, see where they are deadlocked, and get a feeling for how our program managed to get all pretzeled up in this way. And moreover, if you have one thread that's running out there all by itself while the other threads aren't responding, you'll actually get samples from that thread being live.
So, this is nice as a debugging tool. But let's see how we can use this for performance analysis. For that, we'll give a simple little example here. Let's say we have this small loop in our program, which happens to be a very important loop. And we want to make our program thread safe. And the simplest way to make serial code thread safe is, well, let's just slap a lock around it and say, okay, only one thread at a time going through here. And this is often actually sufficient.
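As a sketch, the "slap a lock around it" version of such a loop might look like this with pthreads; the loop body here is just a stand-in for the real floating-point work.

    #include <pthread.h>

    static pthread_mutex_t loop_lock = PTHREAD_MUTEX_INITIALIZER;

    void process_block(float *data, int n)
    {
        pthread_mutex_lock(&loop_lock);      /* only one thread at a time runs the loop */
        for (int i = 0; i < n; i++) {
            data[i] = data[i] * 1.5f + 2.0f; /* stand-in for the important work         */
        }
        pthread_mutex_unlock(&loop_lock);
    }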
So, let's say we have enough of these locks and they're covering small enough regions. But what you might find is that when you put this in and try to measure your performance, well, it doesn't seem to be improving at all. So, what can we do? Well, if you just simply use a time profile on this, you'll get this hot spot in the middle of the loop. That's pretty much what you'd expect. You know, that's the code that's executing the most. It's doing a lot of floating point math and the like. But unfortunately, it's not really telling you where your performance problem is coming from.
For that, you instead need to know that the real problem is all the time the threads spend piled up on this lock at the entry point, trying to get in here and just wasting their time waiting. And to see this, you need to use Time Profile All Thread States.
Well, you know, heck, you could have probably just taken a time profile and gone, well, yeah, that's the hotspot in my program. You know, the lock right in front of it is going to be where everything is waiting. I can kind of figure that out by inspection. Well, you know, that works on an example that's trivial.
But in a real program, of course, this is only going to be one little spot in your big, huge program. And instead, you're going to have a lot of other blocking points scattered around as well. And, well, you know, those might not have any correlation whatsoever with the hotspots identified by the time profile. But all thread states let you go and identify all these other hotspots as well.
So, hey, this sounds wonderful. It shows both the hotspots where we're executing and where we're not executing. Hey, let's just forget time profile and use this instead. That's not always the best idea either, because there are a couple of caveats here. First off, if you think about it, we're recording the state of all the threads in your system, whether they're running or not.
So there is inevitably higher overhead associated with taking each sample. And if you're taking a high enough sampling rate, this therefore can cause Shark to actually eat up a fair number of CPU cycles and may have some impact on the performance of your program. So that might not always be desirable.
But the more important part is that unlike with the basic time profile where all this important stuff comes right up to the top and says, hey, optimize me, well, with all thread states, unfortunately, that doesn't always happen. Because we're actually sampling from all the threads, whether they're blocked or not.
And, well, there's usually a bunch of blocked threads in Mac OS X processes that are, you know, just sitting around minding their own business, and they're supposed to be sitting there blocked. You know, typically things like waiting for UI events to occur and the like. And, frankly, you know, you could care less about those.
[Transcript missing]
So let's give you an example of the kind of thinking process you'll have to use whenever you're trying to parse one of these all thread states displays. Up here on screen, we've got a simple little example of a multi-threaded application: a master thread, which is running along and forking off work to a variety of worker threads.
So if we look at how this executes over the course of time, what's going to happen is the master thread will execute for a while. And then when it reaches a point where it can fork off some parallel work, it will, at the start line barrier where all the workers are waiting, it'll say, hey, everyone, go.
They'll go and start executing in parallel for a while until they finish their parallel region. And then they'll all pile up at this finish line barrier until all of them are done with their work. And then they'll let the master start up its serial execution again. This is a really simple and common parallel programming paradigm.
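In pthreads terms, that paradigm looks roughly like the sketch below. The barrier here is hand-rolled from a mutex and condition variable (Mac OS X doesn't ship pthread_barrier_t), and the work functions are placeholders; the point is where the two barriers sit.

    #include <pthread.h>

    #define NWORKERS 4

    /* A minimal counting barrier built from a mutex and a condition variable. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  all_here;
        int             waiting, generation, count;
    } barrier_t;

    static void barrier_init(barrier_t *b, int count)
    {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->all_here, NULL);
        b->waiting = 0;
        b->generation = 0;
        b->count = count;
    }

    static void barrier_wait(barrier_t *b)
    {
        pthread_mutex_lock(&b->lock);
        int gen = b->generation;
        if (++b->waiting == b->count) {       /* last thread to arrive wakes everyone */
            b->waiting = 0;
            b->generation++;
            pthread_cond_broadcast(&b->all_here);
        } else {
            while (gen == b->generation)      /* everyone else blocks right here      */
                pthread_cond_wait(&b->all_here, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }

    /* Initialize each with barrier_init(&b, NWORKERS + 1): the workers plus the master. */
    static barrier_t start_line, finish_line;

    extern void do_serial_work(void);         /* placeholder: master-only work    */
    extern void do_parallel_work(int id);     /* placeholder: one worker's share  */

    void *worker(void *arg)
    {
        int id = (int)(long)arg;
        for (;;) {
            barrier_wait(&start_line);        /* time here = waiting for work      */
            do_parallel_work(id);
            barrier_wait(&finish_line);       /* time here = load imbalance        */
        }
        return NULL;
    }

    void master_loop(void)
    {
        for (;;) {
            do_serial_work();
            barrier_wait(&start_line);        /* "hey, everyone, go"               */
            barrier_wait(&finish_line);       /* wait for every worker to finish   */
        }
    }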
How can we optimize this? First off, you'll definitely want to go in and optimize all this active time that you see up here in blue. Just the stuff that time profile was telling you to do in multi-threaded domain, that's still very important. You still want to go in and optimize it, no question about it. But with the block time here, it's important, but it's not quite as straightforward.
So, for example, the time that the workers spend waiting at the finish line barrier, this is caused by load imbalance among threads, some of them finishing too early. And frankly, by waiting around for some parallel threads longer than others, we're actually wasting our time. If we could take some of the work being done by that third, longer thread and move it over to one of the fast ones, we could actually reach that finish line barrier sooner with all four threads. So we really want to minimize this load imbalance.
Secondly, we have the time that the workers spend at their start line barrier, just simply waiting for work to be given to them. Well, you know, we do want to minimize this, but, frankly, the information we're getting here from Shark is maybe not as useful because, really, it's pretty redundant. It's just simply the same time that the master spends working, multiplied by four here. So it's nice, but it's not really any new information.
And then we had the time that the master spends waiting and blocked. And frankly, who cares about this? It doesn't make any difference. We want the parallel workers to be working as much as possible. We want the master to be sitting around waiting for them. So, you know, you can ignore this. Going back and looking at our original code, therefore, well, these hotspots will appear when we do a time profile all thread states, but we don't actually care about all of them.
If we go in and think about the analysis we just did and apply it to what we see here, you can actually realize that these top two hotspots, poof, we can just ignore those altogether because those are things which we really don't care about. Instead, we care about the time that the workers are actually working and about the time that they're waiting at the finish line barrier caused by load imbalance. And those are what's really important for us.
Okay, so now that I've kind of given you a view as to what time profile threat states can do, let's make this a little more concrete. So last year in our session, we demonstrated how we took the reference code from the internet for the MPEG-2 decoder. And just doing serial optimizations, we were able to speed it up by about 6x. Pretty decent. But we were only using one core, and today's Macs have several cores to use.
So unfortunately we found going beyond that, the parallelizing was either easy or pretty hard depending on which loops we were looking at. We found that the pixel operations, which we took entire frames of pixels and did operations over the entire frames, those were pretty easy to parallelize. And that we can just simply take the frames and, you know, pass them out, pass out chunks into the cores and use a synchronization routine just like you saw in a couple of slides back with barriers at the beginning and end of each frame.
Unfortunately, the actual decoding loops were much harder because we had in between the actual decoding operations, we're actually accessing the movie file itself on disk. And that's a serial operation because we're reading variable length chunks for each bit of picture. So, you know, we wondered, well, do we really have to do this? That looks like a lot of work. Let's see if we can just simply do the easy parallelization and leave it at that.
[Transcript missing]
So this is an eight-core machine. One thing, since we get a lot of requests for this: if you want to use the Processor preference pane, it's actually buried nowadays off in the developer extras preference panes. It's not installed by default when you install Shark.
So this is a really good trick to know if you want to actually play with different numbers of processors on the same computer and see how the number of cores actually affects your work. I'm going to look at this right now with just four cores, just to see how this goes.
So first off, if we go back to look at our decoder, well, it started out, this is running about real time with these seals swimming around in circles. And so, you know, I think, well, that's not too bad. But this is, of course, using an entire processor on a high-end Mac Pro machine.
So, you know, we really want to use as little of the processor as we can, so we can do high-definition video (this is just a standard-definition video), or multiple streams at once, or, hey, other things like shutting down the processor and letting it save power.
So we first took this and, like I said, last year we managed to speed it up pretty well just by doing serial optimizations. Just by working with serial optimizations, we're able to get these seals going considerably faster, now to about 140-ish frames a second. Pretty good. But we wondered, can we do better with multiple cores? So let me change this to use four cores, with just the easy parallelization. Let's try this out.
Well, it's definitely faster, but that's not very much faster. That's only about 10 frames per second faster. So let's use Shark to go and analyze this. So first I'll switch from time profile to time profile all thread states. And I'm going to look at my MPEG-2 decode process. So I'll start that out. And let these guys do a couple laps. And then I'll use the hotkey that Eric was discussing to go ahead and quit that. And clear this out of the way to let Shark analyze faster.
So what we're going to see here in a moment is a browser display that looks much like you just saw from any normal time profile that you might do. And it looks pretty familiar here. The one big difference is that you can see up here at the top, are several symbols that you may not be familiar with. A lot of semaphore stuff and so on.
And these are symbols from threads that are blocked. And you can see, when we look down here at all the threads, that where on a time profile we'd see a couple of decoding threads, here we have a lot of threads all piled up. 'Cause a lot of these threads were just simply sitting around doing nothing. But we took samples from them.
And as you can see here, most all the threads had the same number of samples because we were taking a sample from every single thread every single time. And this is sort of a hallmark of a Time Profile All Thread States. Except you see this one guy down here; well, that was a thread that was actually created or destroyed while we were sampling. And hence it doesn't have the same number of samples.
So going in and looking at these, we can pop some of these open, and if you dig down a few levels through this list, well, eventually we reach a point where, well, let's see, this looks, oh, man, that's like some UI garbage or something. I don't really want to know about that.
You can keep digging through here. This can take a while. So what I really recommend you do is if you hold down the Option key at any time, either with all thread states or with a time profile one as well, you click, boom, these are just all open at once, sometimes giving you more information than you really want.
But it makes it, when you're digging through these really deep stacks, it makes it a lot easier. And we can just see at a quick glance here, that's from a bunch of, like, UI threads and so on. So I really don't care about that either. And really, we have to dig down through a few of these before, oh, okay, now we're getting into weight points that our code actually called.
And in fact, we can see these are various points in our program. And to kind of try to reduce the clutter somewhat, I'm going to go in and look at one of the various worker threads that I use to actually do parallel work in this program. And we can now see at a glance that, well, these worker threads are spending 88% of the time just waiting for work to do.
Hmm. Well, let's see, why are they waiting? If I pop this open, We can now -- I've actually labeled the routines here so you can see here's the start line barrier and here's the finish line barrier. So the finish line barrier, we're having a couple percent there. That's, you know, that's not zero, but it's not horrible either.
Most of our time, unfortunately, is here at the start line barrier. We're waiting there ridiculous amounts of time. So unfortunately what we're seeing here is that we probably just can't get away with doing those easy parallelizations. This is really dropping way too much time on the floor. Also I'll point out that when your symbol names don't tell you this really clearly right off the bat, you can go -- if you have, say, multiple synchronization points in a single function, you can just double click on these just like you can with time profile and it will toss you into the source code.
As you see here, I just clicked somewhere at random and there aren't any highlights. That's because only the Self column was being used for the highlighting. So I'm going to go over here, show the Advanced Settings drawer, and make sure the Show Total Column checkbox is clicked. Now we've added the Total column. And if we move around, now we can actually see the synchronization points where we were calling into routines and blocking. They've now been highlighted, so we can see exactly where those were and where we were spending all our time waiting.
Okay, so going back to the slides, what we learned here was that we're probably going to need to do some more work. So, well, you know, we'll look at that more in a little bit. But for now, I want to look at trying to simplify down some of that clutter.
You saw with just that little session looking at this, well, fairly simple example application, there was a whole lot of clutter to dig through in that display because there was a lot of blocking calls in any real application. And if you have, you know, a full large-scale application, it can be just a nightmare to dig through it.
So, to try to dig through some of that, we're going to use options we call data mining in Shark to actually go and look just at what you're interested in by doing things like hiding symbols or, in fact, entire libraries. which are not of interest to us at any point in time.
[Transcript missing]
So just to give you a little reminder of what Shark is doing under the covers, what it's actually doing is recording call stacks over time at each sample point. And it then goes and takes these call stacks and processes them, starting from the root, going up to the top, and building up a tree of what all functions are being called and who's calling each other.
These trees are then what's used to display the browsers and get the various trees you see there. Moreover, you may have noticed the total and self columns in your browser. Well, these are actually done by the-- Shark is doing accounting under the covers as it's counting up all these functions. And what we have here is self is the actual number of samples that fall within a particular function. While total is the number of samples that fall either within that function, the self samples, or in any of the functions that it calls underneath.
So with this kind of understanding, keep an eye on this and you'll see how it changes as we data mine in different ways. First, and probably foremost (it's one of the first things I almost always do when looking at a complicated Shark session), is charging libraries. So, for example, up here in this tree, we have some code from system libraries provided by Apple.
And that's code you can't actually touch and modify yourself. So, you know, you'd really rather just not worry about the fact that the samples fell deep in there. You're instead worried about your own code that you can actually change that's calling into these libraries. So let's try to get rid of some of those.
What we can do is charge the libSystem calls under here, and get rid of them. But since we do want to know which of our functions are spending time calling into those libraries, because we may want to optimize those functions, we can have Shark take all those samples and coalesce them up into the functions that are making the calls. So we can still see that those functions are important. And they may, in fact, then pop up to the top of the list and say, hey, look at me, I'm the most important part here.
And as you can see here, what we have is that up in the baz function, after we've done this coalescing, all the samples that have been brought up from below are actually mixed in with the ones that are already there. So now the self number in baz is pretty high.
Well, in fact, you may not actually want that at all times. What you may want is something a little less aggressive. And so for that, you might want to do flattening of your libraries. So in this case, like, let's say, you know, we don't really care about all the guts of the library, but we do want to know what samples we're having that are calling lock and unlock.
So let's simplify by getting rid of everything below the lock and unlock, but keeping the actual calls themselves so we know when we're making them. So now we just take those samples and bring them up into the first level of callers.
Another thing you might want to do is say, you know, I don't want to worry about this portion of my trace profile at this point in time. I want to focus in on this part over here. Well, let's just simply remove it. Like, let's say we don't care about the bar function at all. Let's just get rid of that because I want to focus in on what's happening over at the other side of the tree.
And so we can just simply cross those out, nuke them entirely. And what will happen here is that unlike the previous flattening and so on, where we're coalescing samples, we're going to take the totals and drop them down because these three samples are now removed from the profile, at least from the view you're currently getting with data mining.
Well, okay, that works pretty well, but what if instead of trying to knock out one symbol at a time, we instead want to go in and look at one particular symbol? Let's say bar is actually what we decided was really the most important thing. Let's go in and focus on that.
And once we focus in on bar, well, then all the other samples will disappear. And we'll drop the totals for the remaining callers down appropriately. We remove those four samples. And one thing I really want to point out here is that all the self samples from Baz are gone. All the self samples didn't actually involve bar. So hence, since bar is not in the call stack, those are removed.
Okay, that's pretty good. But focusing actually does more than just simply remove samples. It also takes the tree and hides the part of it above your symbol. So all that stuff like start and main and whatnot that's at the root of the tree, you know, frankly, if you're focusing on that symbol, you probably don't care so much about those.
You want to know about your symbol and everything below it and what it's actually calling. Or at other times, you might want to know how it's being reached, what's going on. So if you do actually want to know about all that start and main and so on, well, you can do focus callers instead. And that will go in and show you just the paths down to your function while leaving out everything below it.
Okay, that's all pretty good. Well, but in a real application, just doing one or two of these options probably won't simplify things down enough. So really, you're going to want to mix and match a lot of them. And Shark's able to do that. It can go and apply many of these operations serially, one after another, as you've selected them.
And so, for example, let's say we want to focus in on foo. That's our area of most interest. So that chops out part of the tree. But, you know, we also don't care about the stuff inside the lib system as well. So let's go ahead and flatten that out as well. And now by applying these two operations at once, we've now taken our big tree there and simplified it down to just a small section that we can really focus our attention in on and figure out how to optimize that part of the program.
Well, okay, so this is a lot of stuff. You know, how can we actually control all this? Well, there's a data mining menu that you may have seen up in the menu bar, potentially ignored. And all these commands are just simply up in that menu. You just need to select your symbol in the browser display and go up and choose the menu. Or if you have a two-button mouse or a control click, you can pull up a contextual menu on any symbol in the profile browser displays and choose to go in and do any of these options on that particular symbol or library.
But, hey, there's more as well. So if you go over here to the Advanced Settings drawer, you'll notice that whenever you have any view in Shark, some sort of profile browser from many different configurations where data mining can occur, well, this pane over here will actually appear so that you can use it.
And it actually provides a lot of useful controls. First off, there's just the on/off control at the very top. It's really great to just simply flick that on and off to kind of see a quick before and after comparison because when you flick it off, it won't forget the sequence of data mining options. So you can actually just go back and forth.
Also, there are several really useful preset techniques. Charging and flattening system libraries, like I mentioned, are right there so you can get them at a click.
Also, you can say, okay, well, kernel samples, well, I can't change the kernel unless I work at Apple all of a sudden. So just forget those and I'll go and worry about those at another time. There are also a couple unique features that are only available through the Advanced Settings drawer here. One is you can flatten recursion.
So if you have a really long list of recursive functions that when you go to the profile browser and you do that option click, they just zip down forever. Well, you can just simply click on Flatten Recursion, and Shark will notice that, hey, that's the same function being called over and over, and it'll flatten that stack down to the first entry.
Also, you can tell Shark to knock out any lightweight symbols, and you can actually select the number of samples that you want to allow as the cutoff point in order to get rid of symbols where you only have a couple of samples, so it's really not that important. It may even be statistically insignificant.
Okay, now going back to our demo. Okay, so we went in and based on the information we saw before, we went, well, okay, we've got to bite the bullet. We need to parallelize that decode. And so we went in and found the picture slice loop in it and went in and parallelized on that.
In order to do that, we had to go and find all of the reading of the file and concentrate it all at the beginning of the slice, to keep that serial code from preventing parallel execution. That was a fair amount of work, but ultimately it did seem to work out. But we were really concerned that, because different slices cover different portions of the picture, and hence may show different things, some slices may require more or less decoding work depending on how complicated the picture is there.
We were worried about load imbalance between these different slices. So we actually made two slightly different versions. The sequenced version takes the slices of the pictures and assigns them to threads in our parallel loop in order. In the order that they were in the original file, they get assigned to our threads, 1, 2, 3, 4. And this allows very simple synchronization.
The best analogy is relay runners at the Olympics. You know, they take the baton, they read their portion of the file, and they pass the baton off to the next runner. And then they go off and do their parallel decoding work. And the threads just keep passing the baton as they need to do their file access.
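A sketch of that sequenced baton, assuming a mutex, a condition variable, and a turn counter; the file-reading and decoding calls are placeholders for the real decoder.

    #include <pthread.h>

    #define NWORKERS 4

    static pthread_mutex_t baton_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  baton_cond = PTHREAD_COND_INITIALIZER;
    static int             baton = 0;          /* whose turn it is to read the file */

    extern void read_slice_from_file(int id);  /* placeholder: the serial file access   */
    extern void decode_slice(int id);          /* placeholder: the parallel decode work */

    void *sequenced_worker(void *arg)
    {
        int id = (int)(long)arg;
        for (;;) {
            /* Wait for the baton: only one thread touches the file at a time,
               and they take turns in the order the slices appear in the file. */
            pthread_mutex_lock(&baton_lock);
            while (baton != id)
                pthread_cond_wait(&baton_cond, &baton_lock);
            read_slice_from_file(id);

            /* Hand the baton to the next runner...                            */
            baton = (baton + 1) % NWORKERS;
            pthread_cond_broadcast(&baton_cond);
            pthread_mutex_unlock(&baton_lock);

            /* ...and do our decoding in parallel with everyone else.          */
            decode_slice(id);
        }
        return NULL;
    }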
Simple enough. We also wanted to look at the unordered case: as soon as a thread finishes reading its portion of the file, it says, hey, whoever is free, come on in and grab the next slice here; it's ready for you to go. And this should in theory tolerate load imbalance better, because if we have a slice which is really short, that thread can zip through its short slice, then be free and immediately go grab a new slice. So we wondered which is better, and also a little bit about why, once we saw that. So going back over here to the demo machine.
Here we have our application again, and we went and did the all-parallel case. And here's the unordered case. So let's see it play. And now those seals are zipping around considerably faster. They've gone from about 150 frames per second up to about 300 by going with four threads. A 2x performance increase with four threads.
That's actually not bad if you look at a lot of parallel programs. But let's see if we can figure out a little bit more about why. So let's use our hotkey again, start sampling, and let it run for a lap or two.
And now what we're going to do is go through and analyze how the parallel decode loop was affecting our performance. So the browser here looks a lot like what we just saw before, with a lot of these kind of strange blocking calls up at the top.
And again, we could go in and click on them and go, well, let's see, that's a whole bunch of UI stuff. No, I don't really care about that. But let's do something about it. Let's use some data mining to try to clear out this clutter. So the first thing I like to do is get rid of the system libraries. You can charge them in some cases, but I do actually care about, you know, when I'm in the various locking and waiting calls, I do want to know that that's blocking, not running.
I'm just going to flatten those libraries instead. And now you'll see that the view changes a little bit here. The beach ball comes up as Shark thinks a bit about what all these system libraries are. And now you can see that instead of the names of library routines deep inside the libraries, we're instead seeing the names of the routines we were actually calling: pthread_cond_wait, select, pthread_cond_timedwait, and the like. We can now dig into these. And, well, we're still seeing that -- hmm, yeah, that's like UI stuff, okay. Well, I'm going to go and just simply tell Shark to remove that symbol.
It's clearly stuff I don't care about. Pop this down. That's a bunch of UI stuff, too. Okay. And then if I hold down the control key here or right click, we can go in and say remove that as well.
[Transcript missing]
Okay, there's also a restore button down here if you want to undo all of those. Okay, well, we could continue removing symbols here for a while. But I happen to know I'm really interested in that parallel decode, and I really don't care about anything else. So I'd rather go in and focus on parallel decode. Well, first I have to find it, and for a high-level function like this, that's usually easier if I switch to the tree view.
So now I'm starting at the top and working down. And indeed, if I go at thread start here, parallel worker shell, parallel worker, aha, there's parallel decode just a few levels down. So I'm going to go and right click on this and do focus in on parallel decode. Okay, so now everything above parallel decode, as I said, is chopped off and thrown away. And we're just looking at it from parallel decode on down.
So let's see. To actually see where we're spending time, I'm going to flip back to heavy view. And now we can see that, within parallel decode, we're spending about 27% of the time waiting here, about 1% in locking, and most of the rest actually executing. And, frankly, this is not too bad. There's a little bit of spin lock time and so on.
So we're getting something like 30% of the time waiting. And it would be interesting to compare that against what we can do in the ordered case. So let's switch to that and do a quick session here. That's pretty similar: maybe not quite as fast, maybe a few frames per second slower, but it's in the same ballpark. So I'm going to start that.
We'll get another session here, which should look pretty much the same. And Shark also remembers what I've been doing here before, so I can apply the same operations really quickly. Here, Flatten System Libraries has already been checked for me now that Shark has seen that's what I like to do. So I'm going to flip over to tree view and pop this guy open. There's parallel decode. I'll go in and focus on that.
and flip back over again. And we can see here that, well, you know, these are actually looking a lot alike. What we see with the sequence case is we're getting just a few percent more time spent waiting. You know, it's only a few frames per second, but it is noticeable.
So that's kind of an interesting finding. And we were able to find that and narrow it down and isolate it exactly for that loop, just with a few quick data mining options here. Instead of having this buried in among all sorts of data from all of our various loops and operations throughout.
So what did we learn? Going back to slides, what we've learned really is that with data mining, instead of having to dig through all this information painfully, routine by routine, we were able to go in and have Shark isolate this parallel decode function really easily. So we could precisely examine just that loop while excluding everything else.
And also what we saw looking at the application is unordered is a bit faster on four cores. Almost the same, just a bit faster. So, you know, it kind of leads us to think, well, maybe load imbalance is the bigger problem. But to actually figure out exactly why that's the case, we're going to need a new tool.
Because while time profile on all thread states is great for going in and finding which blocking operations are important, and it's great because you can go in and you can lower the sample rate and sample over large portions of your application and get a real overview as to where your application is working and synchronizing, it can't really give you enough information to go in and debug individual locking and blocking operations in your code.
You often need to know how and why the synchronization and serialization is occurring between your threads, and so you need to see the time sequencing of how these threads are passing off from one thread to another. And for that, we need another tool, System Trace, which allows you to see the system and threading interaction very precisely.
So what a system trace is, it's an exact trace of all the OS entry points in your program. And this includes all the blocking, because whenever your threads block, they go into the kernel. So instead of with the time profile view where we get these evenly spaced samples throughout, system trace will instead record data whenever we go in and out of the kernel. So like this very first entry point into the kernel, this small red square on the left, we'll actually see that instead of having it fall between samples.
This is a lot better than, say, taking your time profile and cranking up the sample rate really high. We've encountered several developers who've cranked the sample rate to 20 microseconds, which, while you can see a lot of information, really blows through memory fast, so you can't take very long sessions. And it gives you a whole lot of data, which is hard to dig through. System Trace gives you this a lot more efficiently and presents it a lot better.
So you can go in and examine these multi-threaded applications' behavior really close up. You can see all the lock contention events that occur. Anytime a lock is contended, it'll pop up. If a lock doesn't have contention, it actually won't be visible on a system trace, because in Mac OS X you don't go into the kernel if there's no contention for a lock, only if there's actually contention. But just with this, we can go in and see how our semaphores are communicating, how threads are waiting and releasing and blocking, and see the patterns of how we're communicating.
Back and forth from one thread to another and running from one thread to another. You can also see some other things here, too. How your thread is going in and interacting with Mac OS X. System calls, page faults, and so on. And moreover, just like time profile or time profile thread states, you don't have to use any modifications to your source code in order to use system trace.
You can just run it out of the box, and users can use it. But if you like, you can go in and add tweaks to your code to do things like starting and stopping Shark with the programmatic remote control. That still works.
Okay, well, once you've actually used System Trace, what will pop up is there are several views that are possible. But in general, for really detailed analysis, I usually prefer to use the timeline view shown here, which shows the threads executing over the course of time and allows you to actually go in and click on the various events and examine them in detail.
If you're interested in learning more about the other displays, I recommend you come to our introductory session on Thursday because the second half of that talk is going to go in and give you a very detailed tour of all the aspects of System Trace. But given the limited time we have today, I'm going to go in and focus in on just a few elements.
So in particular, the events are of most interest to me. So various system call events are indicated by these little telephones, and we can see both non-blocking system calls and blocking system calls when we bridge between two thread-run intervals. Most importantly for any multi-threaded program, you're going to want to probably note the locks, which show up as these little lock icons and indicate that you're doing various multi-threaded operations and you're actually locking and having lock contention.
You can also see virtual memory events. Page faults appear here. Most significantly, these are usually page in and out operations where you have to actually go and wait for the disk. But in modern multi-threaded applications, a lot of these other faults can actually come up in places where you might not think about them.
Because oftentimes the OS communicates between different threads by actually page faulting data and moving entire pages from one thread's address space to another. So you can actually get sequences of these faults when you're doing a lot of inter-process communication. Also, Leopard has added a couple of new fault types, guard and failed, which indicate when we're actually having problems in our program and VM events are simply failing for some reason, such as, say, a segmentation fault.
You can also see things like interrupts, any hardware interrupts that come in from things like IO operations, complete DMAs, timer interrupts and the like. And if you like, you can actually go in and add events to your code using what we call signposts. These come in two big varieties. First are the point signposts, which just simply say, "Hey, here's where I am.
Look here." So that way, if you have any area in your code which doesn't normally do system calls, you can actually have a record of that on the system trace showing, "Here's where I am. Here's where I am. Here's where I am." Also, if you want to measure sequences of time, you can use interval signposts, which, just like the system calls, have a beginning and end with a little underbar.
With each signpost, you can record up to four integer auxiliary values, which can then be displayed in Shark when you click on the event. And you can pretty much put these in anywhere you want, but keeping in mind that there is some time penalty involved. If you put it in user code, it's about 20 to 50 microseconds or so per signpost.
So you have to be a little careful about these. In the kernel, where you can put them in KEXT as well if you're a KEXT developer, the overhead is pretty insignificant because you're already in the kernel, but there is a little bit. So you can't just put a bazillion of them in.
How can we use this? Let's say we have a key loop in our program that we're really interested in seeing on the system trace, but we're not doing any system calls in it, so normally no events appear. Well, let's add some. What you can do is let's put in a couple of signposts here. So first off, let's put in one which is simply a timer for the whole loop.
So we can put in a beginning and an interval signpost around the entire loop, and we can see how long that loop takes very precisely. Also, then we can put point signposts in at the beginning of each iteration, so we can at a glance tell where each loop iteration is starting, and this will all appear on the timeline.
[Transcript missing]
and any auxiliary values that we want to supply. Or just a zero if you don't want one. That's all there is to it. Of course, in some cases, you may not necessarily want to link with the CHUD framework. For example, if it doesn't exist on your system intentionally for one reason or another, well, you can do this without linking to the CHUD framework by using direct syscalls here. These are a little uglier looking, but they do the exact same thing. No linking with the CHUD framework necessary.
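Here's a sketch of the direct-syscall flavor. SYS_kdebug_trace and the DBG_FUNC_START/END/NONE macros come from the system headers, but the signpost codes themselves (the DBG_APPS class and the subclass/code values) are assumptions for illustration; whatever codes you emit have to match the KDebug Codes file described below.

    #include <sys/syscall.h>
    #include <sys/kdebug.h>
    #include <unistd.h>
    #include <stdint.h>

    /* Hypothetical signpost codes; the class/subclass/code choices here are assumptions. */
    #define LOOP_TIMER_CODE  KDBG_CODE(DBG_APPS, 0, 0)   /* interval around the whole loop */
    #define LOOP_ITER_CODE   KDBG_CODE(DBG_APPS, 0, 1)   /* point signpost per iteration   */

    static void signpost(unsigned int debugid,
                         uintptr_t a0, uintptr_t a1, uintptr_t a2, uintptr_t a3)
    {
        /* Drops an event into the same kernel trace buffer that System Trace reads,
           with up to four auxiliary values; no CHUD linkage required.              */
        syscall(SYS_kdebug_trace, debugid, a0, a1, a2, a3);
    }

    void key_loop(int n)
    {
        signpost(LOOP_TIMER_CODE | DBG_FUNC_START, 0, 0, 0, 0);    /* interval begins     */
        for (int i = 0; i < n; i++) {
            signpost(LOOP_ITER_CODE | DBG_FUNC_NONE, i, 0, 0, 0);  /* "here's where I am" */
            /* ... the iteration's real work goes here ... */
        }
        signpost(LOOP_TIMER_CODE | DBG_FUNC_END, 0, 0, 0, 0);      /* interval ends       */
    }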
Or if you're down in a KEXT (you can't link with the CHUD framework in a KEXT of any kind), you just use the kernel debug facility directly down there. It's almost exactly like doing the syscalls up in user space. And I should point out that for this session we have sample code associated with it, which has these exact samples from the slides and also the two that Eric had. So you can go in and download this and use it as a starting point: put your own code in place of the do-work functions and start working with these directly.
Of course, there is one more little thing we need to do. We need to actually tell Shark which signposts are important. So when Shark starts, it automatically scans the /Library/Application Support/Shark/KDebug Codes folder and looks for any text files in a particular format, which consists of hex numbers and the names of your signposts.
And when it reads those in, it will go, "Okay, these are the signposts that you're interested in." And when you're sampling, if it gets any signposts with those values, it will record them and then present them on the screen by name when you click on them. If it gets any signposts with different values, it will assume that you're not interested in those particular signposts and will discard them. So this way, you can actually filter which signposts are of interest if you have a lot of them that you start adding to your program.
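For instance, continuing the hypothetical codes from the syscall sketch above, such a text file might pair each hex debug code with a name, one per line. This is illustrative only; the hex values have to match exactly what your code emits.

    0x21000000   LoopTimer
    0x21000004   LoopIteration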
Okay, to kind of show you how you can actually use this for reality, let's go in and look at how we did our MPEG-2 decode program and look at it on a microscopic level. Well, so we could see just from what you saw in the demo, the unordered parallel decoding is a little bit better, just a hair better. And this is probably because of load imbalance.
But, you know, we want to verify that our intuition here is exactly what's really happening on the real system. Because, of course, as I'm sure all of you have determined, you know, everybody's intuition is always exactly right when it comes to performance problems. So, well, let's go and use System Trace to examine this.
As it happens, in our parallel decode program, we actually do make several system calls in any parallel decoding segment. So we actually just went in and looked around in the timeline until we found these segments where when we clicked on the events, up in the call stacks, parallel decode was in the call stack.
And we could see where we were and, oh, yes, this is actually a parallel decoding slice illustrated here on screen. Now, if we didn't have all those conveniently placed system calls, well, this would have been a perfect spot to go in and add signposts to say, hey, we're beginning a parallel decode slice here, we're ending a parallel decode slice here, and the like.
And then once we can see this, well, we can now look back at our code and figure out exactly what it was doing here on the timeline. In particular, the baton was being handed to us. We were told that, hey, the file is ready for this slice to read. Then we did a little bit of serial file access. Then we handed off the baton to the next thread and said, okay, I'm done with the file. It's your turn. Well, we went on and did some parallel decoding work for the slice.
Pretty simple. Let's look and see how that actually works. Well, what we can see here, just from a quick view of this, and I went in and picked this out and put it up on a slide because, unfortunately, it does take a few minutes to go through and find just the right spot in the timeline, and I figured you probably didn't want to watch me move back and forth for several minutes.
But feel free to come to the lab, and I'll be happy to demonstrate in person. What we can see here is that the sequential version's timeline actually looks almost better, because it's just smoothly passing from one thread to the next as we do the serial region, highlighted here in orange.
The unordered version, on the other hand, comes in and you have the threads popping up and saying, "Hey, I want to try to--" Oh, no, you got it first. Oops, oh well. So there's this fighting action, which is making it look a little messier, even though the unordered version is the one that ends up slightly faster. Well, this is looking at one spot on the timeline. If we scroll a little further, you can actually see that OS X is pulling a little trick on us, and that it's occasionally stealing a processor away to do other stuff.
And so what can happen here is that, when we go and see where the serial regions are occurring, now the unordered case looks a lot better and a lot smoother than the sequential case. Because the sequential one has four threads that are all fighting over three processors and continually knocking one another off of the processors, while in the unordered case, three threads are smoothly using the processors and the fourth thread just sits there and goes, "Oh, well, there's no space for me.
Okay, I'll just lay back and let you guys handle the work until I actually have some room to get on a processor." And that way, no threads have to keep hopping from processor to processor. And with this, we're able to go a small percentage faster, because while the unordered version is messier in general, it deals better with the fraction of time when OS X steals a processor out from under us.
Now, there's another thing we could tell from the system trace, if you were looking closely at this display. And that is that at the four-core level, our parallelism is really limited by the serial code. This is why I reduced the number of cores down to four before. Because what we're seeing here with four cores is that these serial regions, if we pull them down and line them up, actually form a continuous line across the timeline.
This is now the critical path. And if we go to more cores than this, this serial code is simply going to be the bottleneck, and we're not going to get any more speedup. In fact, it slows down, as the synchronization gets more fighting and battling going on.
Now, to give you another example, we were given a case where we found signal handling was a problem. There was a video conferencing application we were given a trace for, and it was noted that the CPU utilization was much higher than expected, which caused the power consumption to go through the roof. So what was happening here? Well, with our system trace, we could see that you had lots of threads running, and they were running a lot.
And this is kind of unusual for a video application, where you tend to have some decoding occur, then a wait until the next frame, then some more decoding, another wait, and so on. So we went in and zoomed in on this. And what you can see here is that we have almost entire solid lines of failed page faults occurring.
That surely wasn't what we were expecting. And when you think about this, these are coming in at hundreds of thousands per second, and each one takes maybe a microsecond or so. You can do the math: a few hundred thousand faults a second at roughly a microsecond apiece is on the order of a few tenths of a second of CPU time every second, a pretty big chunk of time spent handling these failed faults and basically doing nothing. This kind of puzzled us a bit.
So we went and switched to the trace view, which gives a complete listing of all these events. It's normally not the first place you want to go to look at your system trace, but once you've figured out a sequence of interest, it lets you go in and examine those sequences in detail and see the whole layout. And what we can see here in this big table, if we look at the address column, is that all these faults are coming in on the same address. One after another, boom, boom, boom, boom, boom.
So what we saw here is that basically all this power consumption is happening because something was almost always running. There's no idle time as a result, so the processor could never power down and cool off between frames. And it's all caused by these failing page faults: basically, instead of sleeping between frames, our threads were getting stuck in this cycle of spinning on these failing page faults until another thread or a network event finally came in and knocked them out.
And so effectively, what should have been idle time was spent spinning here in the fault handler. And really, it wasn't necessarily the fault of the guys who wrote this. What we think happened was that a change was made in the OS between Tiger and Leopard, and all of a sudden their signal handler, which had been catching an occasional mistake now and then, came to hide this big, huge problem, which they otherwise wouldn't have noticed. But still, the signal handling was causing them a lot of grief until we came along.
In another case, the QuickTime team came to us and asked us to help them with their high-definition video. They found that with 720p playback on one of the early MacBook prototypes, they were dropping frames left and right, and given the processor speeds and everything we were supporting, this didn't make any sense. The power consumption was also going through the roof, and even standard definition was pretty power hungry, much more so than expected. So the QuickTime guys had been using time profiles, going in, banging their heads, optimizing their decoder like crazy.
It wasn't helping. Well, we pulled out System Trace and went in and looked at it. The problem jumped out really fast: all their effort on the decoder had actually paid off. They did a good job. It was decoding and getting out of there right away.
The problem was instead all the rest of the time, where we were trying to take that information from the frame and move it up on screen so you could actually see it. Even in the standard definition case, it was taking something like half of each frame's time just to move the video frame up to the screen. And in the high definition case, it was actually dropping frames because it was taking longer than an entire frame time just to move one frame up on screen.
And this seemed a little weird. You'll notice how this is kind of lightly colored here, too. If you actually go in and magnify it and zoom in on it, well, that's all that's happening: a little bit, a little bit, a little bit. And needless to say, that's not very efficient.
So, basically, just with a couple of quick system traces, we were able to figure out that decoding was not the problem. Instead, it actually turned out to be a GPU driver problem with a new GPU; it was having trouble just allocating VRAM.
And so not only was that process slow, it was also tying the CPU up in knots, because the CPU was busy polling it, asking, are you done yet? Are you done yet? Are you done yet? the whole time. And those little gaps in between were so close together that the CPU could never get anything else done or go idle.
So we just sat there spinning along, burning power like crazy. Well, once we pointed this out to them, that really helped them a lot with their debugging. And it didn't just help their debugging by knowing that fact; it also told them which thread to go and look at, the thread that was actually moving the data.
They could go back and look at their original time profiles, and take Time Profile (All Thread States) sessions, and instead of looking at the decoder thread that they'd been spending all their time on, they looked at those other threads and went, oh, hey, we can get a whole lot more information out of this.
And so we were able to get a lot more information out of these profiles than we actually knew before, and it really helped them a lot. So with that, I'm going to pass the baton over to our esteemed Rick Altherr, who's going to talk to you about the future of CHUD and what to expect.
Thank you, Lance. So before we talk about where CHUD is headed, let's take a brief look at CHUD's history and where this all originated. CHUD was originally created as a collection of internal tools to aid in the initial bring-up and performance analysis of G3 systems.
At the time, a lot of performance improvements revolved around making low-level, machine-specific optimizations within the OS and its frameworks. And so we wrote a collection of tools that collected low-level, machine-specific information, aggregated it, and showed it in a variety of ways. And this was really useful. But over time, things have changed a bit.
We gained a couple of new architectures, most recently the Intel architecture. CHUD was enhanced for these new architectures and then used extensively within Apple to ensure that performance on these new platforms would be phenomenal. We also gained support for both Tiger and Leopard, for 64-bit applications, and for the change to the DWARF debugging format.
Today, CHUD has been extended for the new platform, the iPhone, including the ARM and Thumb architectures, and it has gained a lot of enhancements for dealing with multi-core processors. Looking towards the future, we're also going to be gaining a 64-bit kernel and a huge increase in the use of multithreading within applications. So once again, CHUD will adapt and gain support for providing performance analysis in these new directions. But why stop there? We could do a little bit better.
Today I'm introducing CHUD 5, codenamed Hawaii. Hawaii has been designed from the ground up to support multiple architectures. This way, that same low-level, machine-specific detail that we've had in Shark all along now extends uniformly to all of the different architectures supported by the OS X and iPhone platforms.
Hawaii has also been written exclusively for Snow Leopard. We're going to use many of the same features that you as developers will be using, like enhanced multi-core support in Snow Leopard. Hawaii also gains support for coexistence. Now, this was introduced with Xcode last year, and it basically lets you install multiple versions of the tools on the same OS installation.
Further, you can also relocate Hawaii onto non-OS partitions, like external drives, so you can use it across OS installations or even move the external drive to a completely different system and use it over there. We're also going to consolidate the functionality of CHUD into two applications: Shark for all the profiling needs, and Reggie for direct hardware access. Well, this is all great, but we could still add a little bit more.
We're also going to fundamentally change how we collect performance information. We're going to separate the "when to record" from the "what to record." The when becomes known as triggers: these are events that occur in the system, things like OS events, timers, and hardware events. And the what to record, we're going to call actions.
These are very specific types of information we want to collect: call stacks, event counts, what process was running. And what we want to let you do is hook multiple actions to a single trigger, so every time an event comes in, you'll collect multiple types of data. Okay, let's look at a few examples of how this might work.
An L2 cache miss profile of everything. This exists in CHUD 4 today and works reasonably well. Let's look at how we would do this in Hawaii: we simply take a timer trigger and attach a record-L2-cache-miss-count action to it. Really simple, really easy. Let's make it a little bit more challenging: a time profile of Safari, just Safari.
Well, again, we take a timer and we take a call stack action and we link them up, but we add this other piece in the middle. We add a conditional. We say, only record a sample if Safari is actually running on the processor when the timer event occurs.
Okay, we can do this in CHUD 4 already, so let's take it one step further. What if you wanted to take a system trace of just the main thread in your application? Again, we simply take a system call trigger, add two conditionals to it. Is it my app? Is it the main thread in my app? And if so, and only if so, do we actually record the call stack? This is a bit different from how we do things in CHUD 4.
Now, instead of recording everything and filling up your sample buffer very quickly and then filtering it out at the end, we're just not going to record it in the first place. So you'll get even lower overhead when doing things like system trace. Of course, this is all great, but let's go one step further. How about we do all three of them at the same time? Yes, we can finally do time profile and system trace at the same time and display the results together.
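To make the trigger, conditional, and action composition a bit more concrete, here is a purely conceptual C sketch of the model being described. These types and names are not Hawaii's actual API, which had not been published at the time; they just illustrate the idea that conditions gate whether anything is recorded at all, and that multiple actions can hang off a single trigger.

    /* Conceptual sketch only; not the Hawaii API. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct sample_context sample_context_t;          /* whatever the trigger captured */
    typedef bool (*condition_fn)(const sample_context_t *);  /* e.g. "is Safari on this CPU?" */
    typedef void (*action_fn)(const sample_context_t *);     /* e.g. "record a call stack"    */

    typedef struct {
        condition_fn conditions[4];  size_t n_conditions;
        action_fn    actions[4];     size_t n_actions;
    } trigger_config_t;

    /* Called whenever the trigger fires (timer tick, syscall, hardware event...). */
    static void trigger_fired(const trigger_config_t *cfg, const sample_context_t *ctx)
    {
        for (size_t i = 0; i < cfg->n_conditions; i++)
            if (!cfg->conditions[i](ctx))
                return;                   /* a condition failed: record nothing at all */
        for (size_t i = 0; i < cfg->n_actions; i++)
            cfg->actions[i](ctx);         /* e.g. call stack plus L2 miss count together */
    }

In this picture, the "time profile of just Safari" example is a timer trigger whose configuration has one condition (is Safari running on the processor?) and one action (record the call stack), and the system-trace-of-my-main-thread example simply adds a second condition.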
Now, Hawaii is still under development, so expect more to come out of this; I just wanted to give you a brief introduction. So let's do a quick summary of what we've learned today. We talked mainly about a variety of ways to trigger recording in Shark, so you can look more specifically at what you're interested in and have more accurate control over when you record. We also talked about how you can use Shark to look at multi-threaded programs, using Time Profile (All Thread States) and System Trace to get both an overview and a very detailed look at what's actually happening on the system.
And we introduced Hawaii. Please look forward to many improvements in performance analysis on the OS X platform. And with that, please feel free to contact Michael Jurewitz, our evangelist, with any questions. His email address is [email protected]. You can also send us feedback directly at [email protected]. We like both good and bad feedback.
We'd also like to ask you to invite your coworkers, people that might not be familiar with Shark, to our introductory session on Thursday at 5:00. We'll also be in a variety of labs: the OS X and iPhone Performance Labs, and we'll be kind of milling around during the open hours in the various labs as well.