Tools • 1:06:18
Shark is a powerful performance analysis tool for discovering where your application is spending time in iPhone OS and Mac OS X. Learn practical ways to find performance bottlenecks, obtain tips on optimization, and understand how your software interacts with the system. Discover how Shark can help your application perform at its best.
Speakers: Wade Tregaskis, Ryan Du Bois
Unlisted on Apple Developer site
Downloads from Apple
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Welcome to session 932, Performance Tuning Your Application with Shark. My name is Wade Tregaskis. I work in the Architecture and Performance group here at Apple. In this session, we're going to show you how you can use Shark to improve the performance of your applications. But before we get right down into how to use Shark, I'd like to lay the groundwork a bit, starting with why you're here.
Hopefully, you're all here for the same reason. You care about performance. You want your applications to be faster and more responsive. This makes end users happier, of course, and makes them more productive. It also gives you room to add additional features and enhancements to your application. Another way of looking at performance, though, is in terms of power and economy and efficiency.
Typically, a program that's more optimized, that runs faster, uses less power to do the same amount of work. This is, of course, really important today with so many of our users on portable devices running off batteries, where they want to get the most life out of those batteries for each charge. In a similar vein, but at the other end of the spectrum, you also have your server people who care about power much the same because it goes right to their bottom line in terms of cooling and electricity costs. So these are the reasons we care about performance.
So let's look at when you should start optimizing your application. If we just consider your development workflow to start with, you start out by writing your code, of course. Now, for this, you probably use some high-level tools like Xcode, Interface Builder, so forth. And these allow you to write your code, get it compiling, and get an application built. But this isn't an application you'd probably want to release to end users, right? Because it's possibly going to have some bugs still left in it.
So, of course, you don't. You first test it and you debug it. Now, debugging you can do using just printf, right? But hopefully no one here is just using printf. Hopefully you're using the high-level tools that we provide that allow you to do this much more easily. Things like Xcode with its GDB integration, Instruments, MallocDebug, leaks, and so forth.
What these tools are about is telling you, "Okay, not just that a problem occurred, but where it occurred, what code was executing, what your code was doing when it occurred, what was leading up to that point." They basically let you go in and see what your program was up to and figure out what's going on.
[Transcript missing]
And they, much the same, let you get in there and see what your program is actually doing and understand what's happening. And having used these tools, they give your program that real final polish, that real gloss and finish that users are going to love.
All right, so let's get down to it. What do you need to do in order to use Shark with your program? Technically nothing. I mean, you can fire up Shark whenever you feel like it and start using it on your program, but there's some things that you should do in order to get the most out of it. First up, debug your program. It's hard enough at times improving your performance, let alone if it's not even working properly to start with. So worry about getting it working, and then worry about the speed.
When you do get to the stage where you want to improve the performance, you want to be profiling what end users are going to see, because that's ultimately what you care about. So you want to be profiling your release build. Turn on all your optimizations, turn on any flags, do any extra work that you would normally do. With one exception, you want to keep your symbol information in, so don't strip those. Simply so that when you see your results in Shark, you can still see all your symbol names and your source code and that sort of information.
Now, when you're looking at a performance problem, you want to start out by picking a metric, some way to quantify its performance so that after every code change, you can measure it and see if you actually did get a performance benefit. What this metric is will, of course, depend on what problem you're looking at and what your application does and various factors, but typical things are, of course, response time, throughput, frame rate. Having picked one of these, you then want to go and take a baseline measurement.
Now, this step's really about coming up with a benchmark, some kind of sequence of steps or data that you use with your program that you can go and run again in a repeatable manner so that every time you make a change, you can rerun this little benchmark and see how the performance varies.
I want to point out here, although we're not going to really cover it in this session, that Shark can be automated in a lot of different ways. So you can integrate it with your scripts and with your programs and do all sorts of automated testing and profiling. If you want to know more about that, check out the Shark manual.
There's a whole section on it. Also, we talked about this in more detail in the advanced session on Tuesday. Of course, if you missed that, you can check it out on ADC on iTunes after the conference. So these are the things that you should do first. There's also some things that you're going to need to know in order to really use Shark effectively. Now, it's true that Shark is actually really easy to use. We go to a lot of effort to make this true. But it does produce a lot of data. So knowing how to refine it, to focus in on just your particular issue, is really important.
And this really boils down to three things. First up is knowing which configuration to use for a given problem. Second is knowing how to focus Shark at just the part of the system that's of interest. And the third thing is, having gotten your results, using Shark's data mining tools to focus in on just the parts of your program and the execution that are actually relevant and that you care about.
Now, we're going to cover all of these in this session. So you'll have this down pat by the end. But there's one thing that you also need to know that we can't really fit into one session, which is how to interpret your results. This basically boils down to experience. It's like debugging or programming in general. It's just a matter of learning and having dealt with similar problems and building on those.
So the best way, of course, to do this is to experiment. You know, go into Shark. Try different configurations. Try different programs. Look at different problems. Change your code. See how it affects your performance. You know, play around. You can also talk to other people. And we're going to do some demos later in this session to effectively show you some common problems and how you're going to fix them, and give you a bit of a jump start.
All right, so that's what you should do first and what you need to know. So let's go in and I'll introduce you to Shark. So we'll just switch to the demo machine. So, the other demo machine. Thank you, Ryan.
So, fire up Shark. I'll just hide that for the moment. So, when you launch Shark, this is the main window. This is what you see first. Now, this is where you choose what profiling technique you're going to use and the scope of it, and also, of course, control profiling with the start and stop button here. The first menu is the configuration. This is the profiling technique and analysis techniques you're going to use. You can see here that there's a whole bunch of these. We'll just go with time profile for now.
The second menu is the scope. You can see there's three options here. You can choose to profile everything, which is the whole system, so you can see everything running at, you know, maybe the expense of a little more overhead. You can choose to profile a specific process, which is really good. It's lower overhead, so if you're just interested in your program, then you can do that.
And for some configurations, you can even profile a binary file on disk without even having to run it. Now, if you choose to profile a particular process, the window expands and you get a list of processes to target. And I'd like to point out just quickly that this is how you would go about launching a process as well.
Then, when you click the start button, having chosen launch, it'll put up a dialog and you can set your executable path, your arguments, and all that sort of stuff. Now, I've got a demo application here that I want to use, and we'll get into it in detail a little later, but first up, I'll just show it to you. So what it is is an MPEG-2 decoder that we just downloaded off the Internet.
Thank you. And I'll just build it and run it so you can see it. And we just slapped a real simple Cocoa GUI on it so it would display movies. And we want to use this to display HD MPEG-2 movies. So I have a test movie here. I'll just pull that up and we'll play that so you can see it.
So this is one of the astronauts up on the International Space Station. He's cruising around. He's having fun. Unfortunately, he's cruising a little slower than I'd like, because this is a 30-frame-per-second video. And you can see here, we're only getting 22 frames a second. So that's obviously our performance problem, and that's also our metric we're going to use. We want to get up to 30 frames a second or beyond.
Now, this is where we fire up Shark. We choose our process, our MPEG-2 decoder, and we start profiling. So we click Start. Shark's doing its sampling. We give it a few seconds to collect some samples. That's plenty. So we'll stop it, and we can stop our movie now.
And now it's doing its analysis. This is where it goes through and crunches the numbers and looks at the data, and it pulls symbol information out of your binaries, so it can display those to you later. And when it's done this, it'll produce a session window, which is where you spend your time looking at what your program's doing and making your optimizations.
So when it does that, in a few seconds, please. Thank you. So here we have a session window. Now, before I go in and we start optimizing this program, I want to jump back to slides and talk about what a time profile actually is, how it works, and how you can use it.
So, I mentioned briefly, I mean, you saw in the configuration menu, there's a whole bunch of them that you can use. I'm only going to cover, well, we're only going to cover the main three today, the three most common ones that will get most of what you need done.
We're also going to do a little call out to some of our Java tracing options towards the end. But if you want to know more, as I said, play around with Shark or read the manual or check out the advanced session where we covered some additional configurations and system trace in more detail.
[Transcript missing]
All right, so that's time profile and how you use it. What are the advantages and why would you want to use time profile as compared to other profiling techniques? Well, in its favor, time profile is really easy. It requires no work on your part, no code changes, no recompilation, nothing at all. In fact, you don't even need the developer tools installed, just Shark, so even end users can use time profile. So that's one good thing, it's really easy. The second thing is those call stacks we were talking about, they're accurate right down to the instruction level.
So you can see exactly where you were at that particular moment in time. And the third good thing, which is really the most important, is that it's really low overhead. On a modern machine, less than a percent of CPU, a few megabytes of RAM, really nothing. And this has the important consequence that it doesn't generally affect your profiling.
And this is a really important point. I can't overstress this enough. You really want to be profiling, as I said before, what the end user is going to see. Now, if your profiling technique that you're using has really high overhead, that's going to impact your program. And so you're no longer really seeing what the end user is seeing. You're kind of seeing something else. And you can go in and make changes then that look good, but then aren't actually what you want when you go and run them on a standalone end user system. So low overhead is a really, really good thing.
Of course, you know, all these things don't come without a trade-off. There is a con. And that brings us to the main disadvantage of time profiling, which is that it's not exact. It's not telling you every single function call, every single line of code executed. It's just a statistical method. So let's take an example. Here we have a program running, and the white lines here are where Shark has come in and taken a sample. So from Shark's point of view, what the program was doing was actually this.
Now, you can see that this is quite different. There's two real problems here. First, in the first half, Shark has come in and sampled on either side of these functions. So it didn't see them at all. It doesn't think they happened. On the other hand, these functions happen to fall on a sample, but they may not have actually been running for very long. But Shark doesn't know that, so it just assumes they were running for the whole interval. And you can see that the result is kind of like, well, that's not what my program's doing.
I mean, how is this going to be helpful, right? Luckily, because these are opposites, when you get enough samples, they average out, and time profile still gives you an accurate picture of what your program is actually doing. Now, the default sampling rate for time profile is 1 kilohertz, 1,000 times a second.
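To put rough numbers on that: at 1,000 samples a second, even a five-second recording gives you around 5,000 samples, so a function responsible for just 1% of your time should still be hit about 50 times, which is plenty for those over- and under-counting errors to average out.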
Now, that's actually pretty high. And for most programs, nearly all of them, that's going to give you an accurate picture of what's going on. If you find yourself sort of cranking that up, you know, thinking that you need to make it higher, and especially if you're going right up to the limit, you probably don't actually want to use time profile.
You probably want to try different techniques, such as function tracing or system trace. And we'll get to system trace in the second half of this talk. But let's focus on time profile. So you saw before at the end of our little brief demo that this is what a Shark session looks like, and this is a time profile. So you have these tabs at the top, and we're looking at profile by default. And this shows you your call stacks. Now, when it first opens, it's in heavy mode, and this is showing you the tops of your call stacks, the leaves of the call tree.
And this is a really great place to start, because straightaway, it's answering the question, what code is executing? What is my program actually doing? Furthermore, since it's sorted, you can see that the functions that are executing most go straight to the top. And this is where you play a game of whack-a-mole. You focus in on the functions at the top and just ignore, for the moment at least, everything below, because they don't matter.
This is a great way of looking from the bottom up. You're going straight to where you're actually executing, and by clicking on the disclosure triangles, you can trace back through the callers to see how you got there, you know, all the way back from your main function. However, sometimes you kind of want to look at it the other way. You want to start from your main function and sort of work down from there, you know, see what main called, where its time was divided, and then look down to the next level, where the time went there, and sort of work your way down instead of up.
This is the standard tree view, which we just call tree view. This is a really good way of looking at things for a couple of reasons, but mainly because it shows you different ways, different code paths you can use to reach a given function. For example, the square root function here. You can see it's called in two completely different code paths. This is important from an optimization perspective because how you optimize that function may depend or may be completely different for one code path as opposed to another.
In Shark, of course, the tree view looks like this, just the outline view, like a heavy view. Now, we got here by going to the View menu down here and choosing Tree. You can also, from there, choose to view both tree and heavy view together. If you want to look at both, look at it from both angles.
Now, this is all good and well for a simple program with all of three functions, right? But in a real program, these views get really busy. You have a lot of different functions, your call tree is very large, there's all different code paths, and it can quickly get to a point where the parts that are interesting are obscured by parts that really don't matter.
This is where Shark's data mining comes in. For example, we had this call tree before, and we've got these trig functions here, cos and square root, which are from the system's math library. This is a library that comes with the system, so it's something that you as third-party developers can't actually change, and it's already pretty well optimized anyway, right? So you really don't care about that.
What you're interested in is how you're using that library. So you'd rather get rid of these functions and just take their samples and push them up into their callers in your code. And then you have a much simpler call tree, and you can still see where the time was going, but you don't have to deal with all the details.
Now, that's just an example of one data mining technique called charging to callers. Shark has a whole host of other ones you can use, but we don't have time, unfortunately, in this session to go through them. They were covered in the advanced session on Tuesday, and they're also in the manual.
So that's one good way of looking at your time profile results using call stacks. Another way is using the chart view. This simply shows your program's execution over time where the y-axis is the call stack depth. And this actually turns out to be a really good way of seeing how your execution changes over time, the different phases of execution your program goes through. For example, you can see here on the left, the first 400 samples, your program's doing something. I mean, it has some sort of pattern. But over here, it's doing something else entirely. It's got a completely different pattern.
And in between, there's these spires where something else crazy is going on. And the red here actually denotes kernel time. So this is probably a system call we made, and it's doing some work on our behalf. Of course, you can zoom in on these, so you can get a closer look and a better feel for what's going on using either the zoom slider or simply by drag selecting the area of interest.
And that can take you right down into the individual samples themselves. If you click on one of those, it highlights, as you can see. And over here on the right, you can see the call stack for that particular sample. This is a great way to see what your program is doing at that particular moment in time.
Now, so far we've shown that you can get down to the individual functions. Now, that's really good for finding out where your time is going. But knowing which function is executing may not be enough to help you figure out why it's taking so long. For that, what you can do is just double click on a function anywhere you see it in a profile view or a call stack somewhere, and that will open a tab with a source code for that function and go straight to the lines in that function where you're spending most of your time.
You can see on the left here that we have the breakdown by line of how much time is being spent on each one, and there's also the color coding. So straight away you can see what the hottest code is, where you're spending all your time. This color coding also extends to the scroll bar over on the right, so you can navigate around large functions or large files with lots of different hotspots.
At this point, you may have figured out why your function is running so slow and you want to go in and fix it. You can simply click the Edit button to open it in Xcode or whatever your favorite editor is, and then go and make your changes. Alternatively, you may still not know at this point why it's so slow. You know now which line is slow, but you might still think, well, okay, say here we have some floating point. Well, that looks sort of okay.
I mean, what's slow about it, right? To get a closer look, you can click on the Assembly button to toggle to the Assembly view, and now you can see down to the instruction level where your time is going. Again, color-coded with the percentages broken down. And even if you're not an Assembly person, this can still be a great way to figure out exactly why a particular line is slow.
I'll point this out also because you may find yourself double-clicking on a function, and it goes straight to this. You don't see the source code at all. This typically happens if you're looking at something you don't have the source code for. You know, a kernel function, or a system library, or you're an end user, and you simply don't have the source code for the application. Now, if that's the case, or in any other case, and you find yourself in this assembly view, and you're not really sure, you know, what it all means, that's cool, because Shark's going to try and help you.
You can see here that there's a column there with an exclamation mark and this little button in it. This is Shark saying, hey, I've noticed that here you're doing something which could be a performance problem. If you click on it, it'll tell you what it's seeing, and it'll suggest a fix for you. In this case, we've got a loop doing some floating point code, but using scalars, just the normal FPU. So this is maybe something where we could go and look at vectorizing it.
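To give a concrete sense of what that suggestion means, here's a minimal sketch (hypothetical names and data layout, not code from the session's demo) comparing a scalar loop to an SSE2 equivalent that processes four floats per iteration:

    #include <emmintrin.h>  /* SSE/SSE2 intrinsics */

    /* Scalar: the FPU multiplies one element per iteration. */
    void scale_scalar(float *data, float factor, int n) {
        for (int i = 0; i < n; i++)
            data[i] *= factor;
    }

    /* Vectorized: four elements per iteration.
       Assumes n is a multiple of 4 and data is 16-byte aligned. */
    void scale_vector(float *data, float factor, int n) {
        __m128 f = _mm_set1_ps(factor);
        for (int i = 0; i < n; i += 4)
            _mm_store_ps(&data[i], _mm_mul_ps(_mm_load_ps(&data[i]), f));
    }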
Now, the second thing that Shark does that's helpful in this view, if you're not familiar with it, is if you see an instruction that seems to be important, but you're not sure what it does, just select it. And then click on the Help button down the bottom right there. And that'll take you straight to the part of the instruction set reference describing that instruction.
So you can read up on it. All right, so that's time profile from the very top right down to the very bottom. So let's go and use all this in a real example to see how we can optimize our MPEG-2 decoder. So let's jump over to the demo machine.
Okay, so this is where we left off. And we've got our little time profile here. And now we know how to use it. So straightaway we can see that we want to focus in on this reference IDCT. That's simply where we're spending most of our time. But while we're here, I'd just like to point out that floor is also showing up.
This is the standard floor in the math library. And it's kind of odd that it would be showing up so much. And Shark kind of agrees with this intuition. It's offering us a tip here. If we click on this, it's saying that because we're using floor so much, maybe we're interested in this alternative.
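Purely as an illustration of the kind of trade-off involved (this isn't necessarily the routine the tip suggests), a cheaper floor for doubles that fit in an int can be written as a truncation plus a fix-up for negative values:

    /* Hypothetical fast floor: only valid for values that fit in an int. */
    static inline int fast_floor(double x) {
        int i = (int)x;        /* C truncation rounds toward zero */
        return i - (x < i);    /* step down one for negative non-integers */
    }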
So we're going to use the alternative version, which isn't quite as accurate, but it's a little bit faster. So we thought, well, all right, we'll give that a shot, see if it works. So we copied that into our program, took nearly 30 seconds to do. And that gave us this version of our decoder. Well, that's not bad.
I mean, you can see now we're getting 25, 26 frames a second, up from, what was it, 22 before. So that's pretty good. That's a 15, 20% improvement from a 30-second code change. I was pretty happy with that. But it's still not playing the way we want. We want to play in real time.
We want 30 frames a second. So we're going to have to do some more work. Let's go back to our profile and go into this Reference IDCT function and see what's going on. So I just double-clicked on it. And oops. We'll go to the source view. So now you can see -- hopefully you can see. I'll just increase the font size for you.
Now you can see what this function is doing. It's just spending all this time inside these loops doing all this floating-point arithmetic. We looked at this for a minute or two and we came to -- we realized two things. The first thing is it's taking as its input and its output -- it's the same thing -- an array of shorts, integers. But it's converting these to doubles, doing its arithmetic using the floating-point unit, and then converting it back to integers as output. Now, this is slow in a lot of ways. The conversion is slow. The floating-point is slower than the equivalent integer arithmetic.
And when we actually look at this, there are no actual floating-point values here. These are all integers, whole values. So we could get exactly the same result using integer arithmetic. We figured we'd give that a shot and see if it actually made a performance difference. And that gave us this version.
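In spirit, the change looks something like this (a sketch with made-up names, not the decoder's actual code):

    /* Before (sketch): every element takes a short -> double -> short trip. */
    void weight_fp(short *block, const double *W) {
        for (int i = 0; i < 64; i++)
            block[i] = (short)((double)block[i] * W[i]);
    }

    /* After (sketch): the weights are whole values anyway, so plain integer
       arithmetic gives the same result with no conversions at all. */
    void weight_int(short *block, const int *W) {
        for (int i = 0; i < 64; i++)
            block[i] = (short)(block[i] * W[i]);
    }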
Well, now you can see that that did make a big performance difference. We finally got more than 30 frames a second. In fact, we got 35 frames a second, 34. So that's really cool. Now our movie's playing back in better than real time, and we can watch it all nice and smoothly. Unfortunately, I cheated a bit.
I thought we were going to have a smaller screen on our projector, so I actually shrank the movie down, so it's not even 720p. It turns out, if we go and play it in 720p, it's still not running quite smoothly enough. So let's go back and see what other changes we can make.
I said there were two things before. The second thing we noticed, and Shark is even pointing it out for us, is that here we have some loops doing some math. This is a classical sort of case of where you want to look at vectorizing something. We had one of our engineers go in. He spent about a day on it. He looked into it.
He figured out how to vectorize it, and we thought, well, this is going to be great. You know, vectorization is awesome. Everyone tells you, you vectorize things, it'll make it four times faster. It'll be fantastic. So we went in and we did that, and that gave us this version.
Huh. 40 frames a second, I mean, it's better. It's faster. It's not four times faster. So that didn't really work the way we expected. And this is the point where our expectations fell apart. We made a change we thought was going to have some really good benefit, and it didn't. So we need to go in and reorientate ourselves, get our bearings again, and try and figure out what's going on and why that change didn't work as expected. So you take another time profile.
So I'll play a movie. I'll jump over to Shark again, and we'll take another time profile. Again, just a few seconds is plenty for something like this. So we'll stop that. Shark will do its analysis again. It'll be a bit faster this time because we've already got our symbol information handy. And when that pops up, hopefully we'll be able to figure out why that optimization apparently didn't work.
When it pops up, please. Any moment now. We run Shark on Shark as well, so it works on itself. There we go. All right. I'll just crank up the volume -- font size again. So this isn't obvious because we renamed the function, but what used to be reference IDCT, our integer vectorized version, is this iweight IDCT. So you can see that this is actually now taking up 1.8% from 20-something before. So it's actually an order of magnitude faster.
So it turns out our vectorization did work. It worked really, really well, except it worked so well that now this function is not taking up any time, so the benefit is comparatively tiny. What we need to do instead is refocus on these functions up here, which are now taking up most of our time. Now, these first two in particular are very similar, and if I just double-click on one and go into it, you can see that, like our previous function, this is just some math in a loop. And again, this is something that could probably be vectorized really well.
It turns out in this particular decoder, there are a lot of different places where you can vectorize it. So our vectorization was working. We just didn't do enough of it. So we had an engineer, he sat down, he spent another three or four days on it. He went through and he vectorized functions one after another, you know, starting with the slowest ones. And long story short, that gave us this version.
Now that's more like it, right? We're getting over 100 frames a second. This is the kind of benefit that everyone promises with vectorization. So that's really cool. We were really happy with that. We can now play our huge 1080p movies really smoothly. That was great. All right, so that's how we optimize that particular program.
What I'd like to do while I'm also here is show you how you can do the same sort of thing on an iPhone. Now, we don't have time for a full demo, but I just want to show you the important bits in order to profile on an iPhone.
So first up, in order to actually get to the phone, you go up to the sampling menu and choose Network/iPhone Profiling. This adds to Shark's main window, this little list here you can see, which will show up any connected iPhones or iPod Touches, and also any Macs you have on the network in remote mode.
So to use the phone for profiling, you click on Use. This will go to the phone, launch Shark, get it ready for profiling, get it configured, and then you can choose which configuration you want to use, which profiling technique, and also the scope of the profiling, much like these pop-up menus from before. And you can control profiling on the phone using the Start and Stop button.
Now, I mentioned we didn't have time to go through this, so I just have a session that I prepared earlier, so we can open it straight away. I took a profile of a particular application I was working on, which is just drawing a sine wave. So I can actually show it to you here. I'll launch it in the simulator. It's a really simple app. I was just interested in getting an idea of OpenGL performance on the phone. So we'll build that.
When it runs, there it is. So very, very simple. But on the phone, it was running a little slower than I thought it should be. So I took a time profile to see why it was, you know, what it was doing, why it was so slow. And that gave me this. Now, you can see this is okay. We've got some symbols for our system libraries. But for our application, we've got this. No symbols, just instruction addresses. That's not very helpful.
This is because when Xcode installs your application on the phone, it removes all the symbol information. It's trying to save space because the phone doesn't have much space. That's not really convenient for us. But luckily, you can always add it in afterwards. So you go up to the file menu and choose symbolicate.
That brings up a dialog where you can select the original binary on your Mac, the one that still has the symbol info in it. And you just open that. And Shark can pull the symbol information out of that. And you can see now what's actually going on inside our application.
[Transcript missing]
And I'll just search for it here. Compile for Thumb. So this is telling Xcode that it's allowed to generate Thumb code. I turned this off, rebuilt it, and put it back on the phone, and went from about 30 frames a second to 40 frames per second. I was pretty happy with that.
33% for unticking a checkbox. I like these kinds of optimizations. Now, there's a lot more to it than that. There's a lot more optimizations that I made to this program. Unfortunately, as I mentioned, we don't have time to go into them now. This is a demo for another time. So let's just jump back to our slides and wrap up what we've done so far.
So in the first half of that demo, I showed you how we used Time Profile to optimize our MPEG-2 decoder. And just by following that iterative process, focusing in on one function at a time and optimizing them in turn, we ended up running, what was it, nearly five times faster. So that was pretty cool. We were very happy with that.
In the second half of the demo, you saw how I used Shark to Time Profile on an iPhone. Now, these are the takeaway points from that demo. Some of them I explained in the demo. You know, you need to add search paths so Shark can see your symbol information. There's also a couple I just want to talk to briefly. The second point is kind of important. On a Mac, Shark's overhead is tiny, like I mentioned. But on a phone, it's a much less powerful system.
But Shark still needs to do the same amount of work, so its overhead is significantly higher. This is just important to remember in terms of, you know, try to profile just your application. Don't profile the whole system if you can avoid it. Don't crank up the sample rate to crazy levels, things like that.
The third point is also important. I mentioned before that you should always be profiling your release build so you're seeing what the end user is seeing. On the phone with the ARM processor, this is particularly important because GCC's code generation between unoptimized and optimized varies very widely, to put it politely. So you really want to be using your optimized version. Otherwise, you're probably not going to be seeing what's actually going to be happening.
The fourth point, I mentioned the ARM versus Thumb thing. I just unchecked that box, which changed my entire program from using Thumb code to using ARM. You can also go in using these compiler flags and change this on a per file basis using the info inspector for a file.
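For example, with the GCC toolchain of this era, the per-file override amounts to adding one flag in the file's Additional Compiler Flags (flag spellings as I recall them; check your compiler's documentation):

    -mno-thumb    build just this file as ARM code
    -mthumb       build just this file as Thumb code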
So that's a great way to selectively choose which instruction set you use for different parts of your program. The last point is simply, as you saw, the phone is kind of a new device. It's a new architecture. It's a new type of system. The performance problems you have on it are going to be quite different to what you may be familiar with on a Mac. And they're probably going to require some original solutions. So this is just an acknowledgment of the fact that even if you're familiar with performance work on a Mac, this might be like going back to school a little bit once you move to the iPhone.
Now, the other thing we saw in the demo, just as a last point, is you saw me open a saved session file. So, yes, Shark sessions can be saved, and they can be copied between any machine, Intel to PowerPC to ARM, whatever. There's no endianness issues or anything like that. They just work. Now, they include, of course, the sample information you took, obviously. They also include any data mining settings you had active when you saved the session. All right. But they can also include your symbol information and your source code.
Now, this is really cool because it means the session file is self-contained. You can delete your source code, rebuild your application, move to another machine, whatever. When you open the session, it's still going to look exactly the same as when you took it. It'll still have all the information.
Of course, if you're going to send the session file to someone and you don't want them to have this information, you need to not include it to begin with. Luckily, this is easy to remember. When you go to save a session, Shark will pop down a sheet, which will give you the choice of whether you want to include that information or not.
So if you're going to send this to someone and you don't want them having the info, remember to click strip. As we saw also in the iPhone demo, even if you get a session file that doesn't have that information, you can always add it in again later. So that's not really an issue.
Now, obviously you can use Shark session files just so you can save them and come back later and continue working on them. But there's another two good uses for them. The first is that, whether you're an end user or a developer sending them to someone else, you can send them in alongside your performance problem reports. That way, even if the person who gets your bug report can't reproduce the problem (it might be an intermittent issue or require some special configuration), they'll still be able to open the session, see what the program was doing when the problem does occur, and that may be enough for them to figure out what's going on and how to improve it.
In fact, that's what we at Apple recommend if you're sending performance bugs into us. Grab a Shark session if you can, because that helps us immensely when trying to figure out what's going on. And the last point is going back to that whole methodical approach to performance analysis.
Use your Shark session files to document your progress. Every time you make a change, take a new time profile or whatever you're using and record that. So you can go back and see each step of the way, okay, we made this change, this is what the performance change was, and so on and so forth.
All right. That's time profile, and that's how to use Shark. But that's, while very useful, certainly not the only thing that Shark can do. I mentioned one other thing before, system trace, that is also very useful. And to talk about that, I'd like to welcome on stage my colleague, Ryan Du Bois.
Thank you, Wade. As he said, my name is Ryan Du Bois, and I'm here to talk about system trace. So by now, you've gained a general familiarity with time profile and how it can be useful in tracking down your CPU cycles. Where are they going? Oftentimes, however, that's not the complete picture you're looking for. That tells you one part of the picture. But what if, for example, you just added multithreading to your application? What kind of tools do we have that allow you to look into that? That's where System Trace comes in.
System Trace allows you to get a new understanding of the multi-threaded behavior of your application. It gives you visibility into how your application interacts with OS X. It even lets you view things like virtual memory behavior in a very visual way, so you can begin to understand that.
So let's take a quick look at how it's different from Time Profile. And a Time Profile, as Wade said, will come in periodically and ask your program, hey, what are you doing? Now in this example here, you see that that red chunk of system code is completely missed by Time Profile.
That's that sampling error that Wade was talking about. System Trace, on the other hand, is an exact trace of every single operating system event that happens during the course of profiling. So we don't miss a thing. And this happens for every single thread running in the system all at once.
The really beautiful part about this is you don't need to alter your programs in any way. You don't have to change any instrumentation, do any of that stuff. It just works. So when you take a system trace, you'll be presented with three main views. The first one here is the summary view.
We also have the Trace View and the Timeline View. So let's talk about these a little more in depth. The Summary View, as the name implies, is just kind of an overall summary of what was going on during that profiling session. In the upper right here, you see kind of a general breakdown of your CPU time, how much was spent in user code, how much was spent in system code, how much was doing actual work, and how much was idle. In the upper left here, we break it down even further with a nifty little pie chart, breaking it into system calls, VM faults, driver time, interrupts, and idle time.
Down below, you'll see three tabs. These are the more detailed views of the summary. The first one here is the scheduler tab. This gives you kind of summary statistics for all the thread scheduling decisions made during the course of profiling. So as you see here, we can organize these statistics by busy time, user time, system time, and even dynamic priority of your threads as they're running.
The next tab over is the system call profile. This looks very much like a time profile, with the exception that it's a profile of every single system call made during the course of profiling. Now, just like a time profile, you can switch between heavy and tree view using a little pop-up here.
And the real power here is using these disclosure triangles, you can walk back from each and every system call back to the line of code in your application where that system call came from. So, for example, you make a couple OpenGL calls, you're not sure if they're going to result in a system call. System trace now gives you that visibility into your application.
And the third tab is the VM fault tab. This is just like the system call tab. It's an exact profile of every VM fault that happened. And again, using the little disclosure triangles, you can walk back into your function and your line of code, which causes VM faults.
The next main view I want to talk about is the trace view. As the name implies, this is a complete list of all the traced events that happen during the course of profiling. Now, if you're looking at system calls, for example, we give you all the arguments to every system call made right here. You poke this little button there, and the little pane will slide open, and you can see the full call stack for each and every system call as you're walking through the trace.
Now, that's really cool, and that gives you a lot of detail, but it doesn't really give you a visual picture. So for that, if you double-click on any one of these system calls, it'll take you right to that in the timeline view, which looks like this. As the name implies, it's a graph of all your threads over time and what they're doing.
Now you can navigate this view using the interactive zoomer up here, which will zoom in on the timeline view, and the scroll bar below. And as I said before, we record trace events for every single thread in the system at once. So down at the bottom here, we give you some options for filtering that down to just your process.
So I'd like to take a brief interlude here and talk about some advanced settings when using System Trace. I find it very helpful when doing a System Trace to pop open the Advanced Settings drawer. That gives you a couple extra options for your visualizations. First one I'd like to talk about is the thread coloring options up at the top here. Before I move on, however, you can get to this by going to Window, Show Advanced Settings, just so you know.
So here's that timeline view. Each one of these little colored rectangles represents the time that any given thread was running on a CPU. By default, as you can see here, they're all colored the same yellow. And that doesn't really tell us what was running where. We have three different ways to add colors to these thread run intervals to make them more visually understandable.
The first one I'd like to talk about is color by CPU. Now this is the one that I find myself using the most. So it's the default setting on my machine. Let's see how this can help us visualize multi-threaded behavior of our application. So if we focus in on these four Noble Ape threads here, these are worker threads for an application that's supposed to be doing some CPU intensive calculations. And we went and added multi-threading and wanted to make sure that they were sharing all four cores, just kind of do a sanity check. And that's what this system trace is.
Now, if we look right here, using these coloring options, we can see at a glance that all four of these worker threads are sharing all four cores in this machine pretty well. That's good news. That's what we expected. However, if we look a little bit to the right, we notice the color green disappears from those threads. And furthermore, we notice two of them are fighting for the same core. And that's maybe going to mess up some cache locality, maybe cause a big performance hit. And just by using that color down here, we can identify which thread came in and started running.
The next option I'd like to talk about is color by priority. In case you weren't aware, the priority of each thread in your system gets adjusted dynamically over time according to what it's doing. Using the timeline view and color by priority, you can see how those adjustments were made for each one of your threads. So for example, here, these bottom three threads are high priority, and that's denoted by kind of a pinkish-red color. And just above it, you can see three hourglass threads here running at low priority, blue color.
The final option I'd like to talk about is color by context switch reason. So every thread, when it stops running, stops running for one of these five reasons. And just at a glance, you can kind of get a handle on, are these threads making a lot of blocking system calls? Are they getting preempted? Are they just running out of time? Just at a glance with the timeline view.
The next set of options I want to talk about is the event selection here. So here we have a list of the icons that will show in the timeline view, and you can turn each one of these on and off. So let's go back to that timeline view and let's see what else we can get out of it.
Remember each one of those run intervals? If you click on one of those, you'll get a little inspector that'll give you a lot more information about it. How much time was spent in user mode? How much in system mode? Why did it stop running? What was its priority when it stopped running? Now, if you turn on some of those event icons, they will show up in these run intervals, telling you what those threads were doing. Here we see a normal system call. The little underbar represents how long it was running on the CPU.
Just above you can see a blocking system call, and I think you'll notice the underbar extends beyond the end of one run interval on into the next one. Now, if you were to click on any one of these, you'll get another inspector. It looks slightly different. This gives you the full call stack of where that system call came from.
Down below there, you can see all the arguments to it and its return value even. So if you have an error case, you can click on it in the timeline view and see the return value. Also, in the bottom left there, you can see how much time was spent blocked waiting on whatever resource that system call was called for.
Now we have three main types of system calls we display icons for. You can see them here, BSD, Mach, and MIG messages. One thing I'd like to call out is the lock icon. Now this happens any time your Pthread mutex operations result in a system call to the kernel.
So that's not going to be every single Pthread mutex lock or unlock, but just the ones that happen as a result of lock contention. So if you see that show up in your timeline view, you know you've got some lock contention going on that you need to track down.
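To make the pattern concrete, here's a minimal sketch: an uncontended pthread mutex is taken and released entirely in user space, and it's only when another thread already holds it that the lock call traps into the kernel and earns that icon.

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_count;

    void bump(void) {
        pthread_mutex_lock(&lock);    /* enters the kernel only under contention */
        shared_count++;               /* keep critical sections this short */
        pthread_mutex_unlock(&lock);
    }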
So again, here's those page faults. Here you see a couple of zero fills. Again, more information there. If you click on them, you'll get a little inspector that'll show you the full call stack, in this case very short. It'll show you the address and the size of the page. And it'll even tell you if it was a code fault by telling you which library was faulting a page in there. As you can see, we display icons for a number of different VM fault types. My personal favorite is copy-on-write, with its little cow-skin icon.
We also show icons for interrupts that happen on the processors. These happen for a number of reasons. I/O operations, DMAs, the timer interrupts used to enforce a time quantum. As I said before, you don't need to add any instrumentation to your application, but you can if you're interested.
We have two ways to do that. They're both signposts. The first one is called a point signpost. You can think of this as kind of like a printf. Your program kind of raises its hand and says, "Hey, this is what I'm doing right now." The other kind is an interval signpost, which again shows up with that little underbar that tells you how long any given operation took.
And with each one of these, you can record four auxiliary values, whatever you like, and you can put these in your code anywhere, in your user application, in your driver even. To see how to do that, check out the Shark manual. There's a couple of really good examples in there. Or after the conference, check out the video of the advanced session that happened on Tuesday.
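For a flavor of what a signpost looks like in code, here's a minimal sketch using the kdebug trace syscall that signposts ride on. The class, subclass, and code values here are hypothetical; the Shark manual documents the exact macros and codes to use.

    #include <sys/syscall.h>
    #include <sys/kdebug.h>
    #include <unistd.h>

    /* Hypothetical signpost code; see the Shark manual for the real values. */
    #define MY_SIGNPOST KDBG_CODE(DBG_APPS, 0, 7)

    void do_interesting_work(void) {
        /* Interval signpost: START ... END, with four auxiliary values. */
        syscall(SYS_kdebug_trace, MY_SIGNPOST | DBG_FUNC_START, 1, 2, 3, 4);
        /* ... the operation you want to see as an interval in the timeline ... */
        syscall(SYS_kdebug_trace, MY_SIGNPOST | DBG_FUNC_END, 0, 0, 0, 0);
    }

    /* A point signpost is the same call with DBG_FUNC_NONE instead. */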
So with that, let's go into a quick demo of System Trace. See how we can use it to visualize the multi-threaded behavior of an application. So here's the Shark window here. You can find system trace at about the fourth position over here in the config pop-up. Let me go ahead and get my demo set up here.
All right, so I have this video camera that unfortunately records direct-to-DVD format. It's kind of a pain for me because I want to watch these on my home computer, maybe even take them on my iPod. So I'm using an application here some of you may be familiar with called Handbrake.
And what it does is it transcodes video from one format to another. In this case, I'm going to be looking at the x264 encoder, which, if you go read the documentation online, says that it's multi-threaded. So on these eight-core machines, it should go pretty fast. So what I'm going to do is just start to encode this video that I've taken.
And again, using that whole performance workflow, I'm going to write down the metric, which is going to be the frames per second right over here. And it looks like it reaches steady state right around 126, 127. Now I'm going to go ahead and take a system trace using the hotkey here, Option-Escape. And that starts; you'll notice the icon down there changes to red. I'm going to go ahead and pause this while Shark is processing. In fact, I'll even quit it.
Shark gives you a little progress bar to tell you how far along it is in its processing. All right, so here's our system trace. This is the summary tab I was talking about. And immediately, something should jump out at you here. You see that this is supposed to be a CPU-intensive application. It's supposed to be using as many cycles as possible to transcode this video from one format to another. The problem I see here is we have 45% idle time. That's a lot of time sitting around doing nothing when we could be working on this video.
So let's go see if we can kind of get a handle on what's going on. If we click that little disclosure triangle, it'll show us all the threads in this application. You notice there's quite a few. Nothing too exciting there. Let's go over and check out the system calls.
All right, the first one here, these are kind of interesting. They don't really have a name, but if you open up their call stack, you see that they're coming from Thread Start. Another way to see that is by clicking this button here. It'll pull out the full call stack.
So what this is telling me here is our x264 encoder is kicking off threads and maybe joining them later, and it's doing this really fast, probably once for each frame. The problem is, here on OS X, it takes about 90 microseconds of time to make a new thread. So we're bleeding all that time for every single frame of encoding. So realizing this, I went in and tried to optimize that. Now I'm just going to go ahead and give you a quick tour of this over here in the trace view too.
Switch over to the system calls here. There's one of those without a name, BSD system call. Again, you can poke this button to see the full call stack. Now what I'm going to do is double click on this and go to the timeline view. Now again, you'll notice that's really busy, and it's all the same color.
It's not telling us a whole lot of information. So let's go to Window, Show Advanced Settings. Let's scroll all the way up here, and we'll enable thread coloring. There we go. So now at a glance, we can see which threads were running on which CPUs at which time. That's very useful.
Now again, I said it would take me right to that event, but I'm not exactly sure where it is. So I'm going to press Enter. There it is. And it gives me the pop-up, that little inspector for the event that is highlighted. That pops right up to the front there. So let's go ahead and zoom in on that. And let's use these filtering options here to just look at Handbrake. And let's see if we can find this new thread. So here's our call-- oh, this is an exit. All right, let's find the other one.
Here it is. So there's the thread creation system call. Let's see if we can find where that thread comes in and how long it took. So there's the whole call stack. Here's how much CPU time it took, 77 microseconds there. And there it looks like the first thread. So you can see the big delay chunk here between where we asked for a new thread to be created and where it actually starts running. And with System Trace, we didn't have to do anything to our application to measure this amount of time.
All right. So that's all well and good. I spent about a week working with the source code here, mainly because I'd never seen it before. I wasn't too familiar with how it was working. And I made it so that these worker threads that it's using to share all the CPUs, they stick around when they're done with the work, and they wait for the main thread to say, "Hey, here's another frame to work on." So I was thinking, if I can get rid of this 80-ish microseconds of overhead, maybe I can get some more frames per second.
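The shape of that change, as a minimal sketch (hypothetical names, a single worker and a one-slot queue; the real encoder manages a pool of these):

    #include <pthread.h>
    #include <stddef.h>

    extern void encode_frame(void *frame);   /* hypothetical per-frame work */

    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
    static void *pending;                    /* one-slot queue for the sketch */
    static int   shutting_down;

    /* One long-lived worker replaces a pthread_create per frame. */
    static void *worker(void *unused) {
        pthread_mutex_lock(&q_lock);
        for (;;) {
            while (pending == NULL && !shutting_down)
                pthread_cond_wait(&q_cond, &q_lock);
            if (shutting_down)
                break;
            void *frame = pending;
            pending = NULL;
            pthread_mutex_unlock(&q_lock);
            encode_frame(frame);             /* do the work unlocked */
            pthread_mutex_lock(&q_lock);
        }
        pthread_mutex_unlock(&q_lock);
        return NULL;
    }

    /* The main thread hands off a frame instead of spawning a thread. */
    void submit_frame(void *frame) {
        pthread_mutex_lock(&q_lock);
        pending = frame;
        pthread_cond_signal(&q_cond);
        pthread_mutex_unlock(&q_lock);
    }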
So let's go ahead and do another encoding here. So I'm going to start it. Let's overwrite that. And again, I'm going to start it with that hot key. There it goes. Let's go ahead and quit this application, give Shark all of the CPU power it needs. We'll wait for that session to come up.
Oh, yeah. Forgot to show you this. One main thing I wanted to show you here was-- There we go. After spending all that time working with it, what did I get out of it with the frames per second? Not a whole lot. It's still, in fact, it even looks about a frame per second worse. So let's see if we can figure out why that is.
So I'm just going to go ahead and jump right to the timeline here. Again, you can see that idle time is still pretty high. The thing that jumps out at me at the timeline here was, whoa, look at that. Over here on the left, we have all of these worker threads sharing the CPUs reasonably well. But over on the right there, we've got that one thread just doing whatever it's doing for such a long amount of time. It's blocking all the other threads, preventing them from doing their parallel work.
So just at a glance here, we can get an idea of where our performance issue is. It's over in this thread that's doing all this serial execution. So if we can figure out why that is, then we can go in and address that issue. Unfortunately, I didn't have time to do that, so I'm gonna show you another demo as well.
So here's an application that some of you who've been here before are probably familiar with, called Noble Ape. And what it does is just simulate these apes running around the island. It does a bunch of calculations and measures how many thoughts a second it gets. When we originally got it, we got about 2,000 ape thoughts a second. And we spent a lot of time with time profile to go ahead and optimize that. And we got it up to about 14,000, almost 15,000 ape thoughts per second.
We also went in and multithreaded it. And we're getting up to, what are we at there? Topping out about 60,000 ape thoughts per second. That's pretty good. Now my theory here is we've done kind of a first pass on multithreading here, so there's probably some performance we can still pull out of it. So let's take a quick system trace and see what's going on.
Go, and I'm just going to quit that. Let Shark do its processing. Here we go. Again, you see that high idle time number. That's probably not very good. Let's go focus in on Noble Ape. Now, if we click this open, we can see all of the threads in the application.
These here are the worker threads that are supposed to be sharing all eight of the cores in the machine. And one thing that jumps out at me here is, see this number here, the average run interval, the average amount of busy time each one of these threads is taking? That's really low. It's on the order of 500 microseconds. Now, the time quantum for any thread in OS X is topped out at 10 milliseconds. So these could be using a whole lot more CPU time before blocking.
So my theory is we can probably improve the performance by giving them more work to do. Now, if we look at the timeline view, we see kind of a regular pattern: the main thread doing the UI update here, and the worker threads doing some work and synchronizing.
So if we go zoom in, we can see a couple of things jump out at us. First of all, we've got all this lock contention here. Second of all,
[Transcript missing]
wait-on-queue calls coming from Noble Ape. So that tells me they're using the work queue model, which is pretty good in my opinion.
You have a work queue with a thread attached to it; you give it some work, it wakes up and does it, and then tells the queue that it's done. That's great. The problem is, it looks like they're not getting enough work to do, because they run for a very short time here and then go back to sleep.
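To sketch what "more work per wakeup" might look like, and this is an assumption about the fix rather than Noble Ape's actual code, each worker could claim a whole chunk of apes per trip through the queue, so it runs for milliseconds instead of microseconds before blocking. The think() function, NUM_APES, and CHUNK are all hypothetical names.

    /* Illustrative sketch: batch the work so each wakeup lasts longer. */
    #include <pthread.h>

    extern void think(int ape);                /* assumed per-ape simulation step */

    enum { NUM_APES = 4096, CHUNK = 512 };     /* CHUNK is the tunable knob */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_ape = 0;

    static void *worker(void *arg) {
        for (;;) {
            pthread_mutex_lock(&lock);
            int start = next_ape;              /* claim a whole chunk at once */
            next_ape += CHUNK;
            pthread_mutex_unlock(&lock);
            if (start >= NUM_APES) break;
            int end = start + CHUNK < NUM_APES ? start + CHUNK : NUM_APES;
            for (int i = start; i < end; i++)
                think(i);                      /* milliseconds of work per wakeup */
        }
        return NULL;
    }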
So I'm going to close this session. Long story short, I spent some time trying to increase the amount of work these threads are doing, and I actually had to add an extra digit here to the ape thoughts per second display. So I've got it up to 102,000 ape thoughts per second. Now let's go take a System Trace and see what that looks like real quick. Looks like I pressed the wrong button. There we go. All right, quit that.
All right, so remember, the idle time previously was up at 40-something. It's now down to 20, so we've improved that quite a bit. If we look at Noble Ape here and we look at its worker threads, we notice their average run interval is up to 5 milliseconds, that order of magnitude, and that's a really good improvement. All right, so let's go back to the slides. All right.
So just a quick recap, what did we see there? When we were working with Handbrake, we saw the worker threads were being created and destroyed for every single frame, and that's kind of a lot of overhead. So we went ahead and removed all that overhead, but we really didn't get much boost out of it. But using System Trace, we could immediately identify why we didn't get much boost out of it, and that's because there was that large chunk of serial execution masking any performance gain that we got out of the worker threads.
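There's a name for that effect: Amdahl's law. The serial fraction caps the total speedup no matter how cheap the parallel part gets. As a purely illustrative worked example (the 50% figure here is assumed, not measured from HandBrake):

    \text{speedup} = \frac{1}{(1 - P) + P/N}

    With an assumed parallel fraction P = 0.5 on N = 8 cores:
    speedup = 1 / (0.5 + 0.5/8) = 1 / 0.5625 ≈ 1.78x

So shaving 80 microseconds of thread startup barely moves the total until that serial chunk itself shrinks.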
Now with Noble Ape, however, we were able to get a 1.8x speedup in roughly 45 minutes of work. And that's not bad, not having a strong familiarity with the code. So by now you've seen Time Profile and you've seen System Trace, a couple of very powerful ways to optimize your application. But what happens if your problem isn't on the CPU and you're not strictly having multithreading issues? Maybe your performance issue is more centered around memory usage. Well, for that we offer Malloc Trace.
Now, this is easily found in the configuration pop-up, very cleverly disguised under the name Malloc Trace. What this does is take an exact trace of every allocation and every destruction of memory in your application. So any time you call malloc or new, free or delete, it'll show up in this.
The view looks exactly like a time profile, and all the data mining options work in much the same way. A couple of things to notice here. We've got the Total column, just like you do in a time profile, so this allows you to find excessive numbers of small allocations, perhaps. We also have the Alloc Size column here. This is the new column we added for Malloc Trace, which will allow you to find maybe small numbers of really large allocations.
Now if you flip over to the chart view, we've got this extra chart here, which represents the change in heap size over time. So any bar above the midline represents an allocation. Any bar below represents a free. So just by looking at this chart here, we can get two pieces of really key information out of it. First, it does not look like we're leaking any memory, because for every bar above the midline, we have an equally sized one below later.
The second thing is, we see this repeating pattern of allocate a bunch of stuff, free a bunch of stuff. Allocate, free. And that makes me think, maybe we can be reusing these buffers instead of releasing them back to the system and then getting them again. If we do that, we'd probably see a performance speedup in this application.
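As a hedged sketch of that buffer-reuse idea (get_scratch and everything around it is invented for illustration, not this application's code), the change is to grow one scratch buffer on demand instead of paying a malloc/free pair per frame:

    /* Illustrative sketch: reuse one scratch buffer across iterations
     * instead of the per-frame malloc/free sawtooth the chart shows. */
    #include <stdlib.h>

    static void  *scratch = NULL;
    static size_t scratch_size = 0;

    static void *get_scratch(size_t size) {
        if (size > scratch_size) {             /* grow only when needed */
            free(scratch);
            scratch = malloc(size);
            scratch_size = scratch ? size : 0;
        }
        return scratch;
    }

    void process_frames(int count, size_t size) {
        for (int i = 0; i < count; i++) {
            void *buf = get_scratch(size);     /* was: malloc(size) per frame */
            if (!buf) return;
            /* ... fill and process buf ... */
        }                                      /* was: free(buf) per frame */
        /* scratch is released once at shutdown, not once per frame */
    }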
The other really powerful thing here is that, just like with the System Trace system call profile, sometimes you don't know that you're going to be making allocations. Maybe you call a library method and you're not sure. Well, those will show up here. And using the disclosure triangles, you can walk back to the function in your application that called that library, which in turn made the allocation. Now, if that isn't quite enough detail, you can double-click on that and it'll take you right to the line of source code in your application where that allocation was made.
All right, so you've seen time profile. You've seen system trace. You've seen malloc trace. All the Java guys in the audience are going, what about Java? Well, let's talk about Java real quick. As you're probably beginning to understand, Shark is a native machine profiler, so it tells you right down to the hardware what the hardware is doing over time.
Now, if you were to do this for a Java application, it would kind of give you an idea of what the Java VM was doing, which isn't very useful for your Java application. Shark's Java profiling lifts us one level up, into the Java application, so now we can give you meaningful feedback on your Java classes and what they're doing over time.
Now, we've got three main analysis techniques that we support for Java. We have the Java time profile, which looks and behaves exactly like a native time profile. We've got the call trace, which is an exact trace of every method called in your application. And we also have the allocation trace; again, you'll notice its Alloc Size column looks just like Malloc Trace's.
So let's wrap things up here. What have we learned today in this session? First of all, Shark is a really powerful tool for giving you performance profiling feedback right down to the instruction level. You've seen three main types of profiling, and you've seen their analogs in the Java VM and how they can be useful.
If you're interested in more, check out the video of the advanced session on ADC on iTunes after the conference. That's session 911. Or feel free to send us an email; we always love hearing from our developers. For more information, contact the developer tools and performance evangelist, Michael Jurewitz. His email is jurewitz at apple.com. Also, feel free to send us feedback directly. You can send that to perftools-feedback at group.apple.com.
Our whole team is on that list, and we always love hearing from developers, whether the feedback is good or bad. Again, check out the sample code associated with these sessions, and check out the attendee website. For any more advanced questions that we don't get to in Q&A, feel free to come hit us up in the optimization labs tomorrow. We'll be there all day.