Developer Tools • 1:15:59
Shark is a very powerful, yet easy-to-use, tool for discovering where your application is spending its time. Learn practical ways to find performance bottlenecks, obtain tips on optimization, and understand how your software interacts with the system. Gain an understanding of how Shark can help your application perform at its best in today's multi-core, 64-bit world.
Speakers: Rick Altherr, Ryan Du Bois, Lance Hammond
Unlisted on Apple Developer site
Transcript
This transcript has potential transcription errors. We are working on an improved version.
First off, since you have chosen to come to this session today instead of one of the other very interesting ones happening simultaneously, we suspect that you, like us, believe that the performance of your applications is a very important and key part of how you actually create those applications. First and foremost because, well, nobody likes to sit around and wait for their computer to compute; they actually want to be doing productive work. There are also a lot of smaller and more subtle details, like the fact that humans actually work more efficiently and better if, after they click on a button, it responds immediately to their actions, like a real button in the real world would, and doesn't have a noticeable delay.
That's also a performance issue. Also, when you think about it, as you optimize your code and make it run faster, you sort of squish down the amount of time it takes to execute, and you give yourself additional headroom where the processor isn't doing anything, which you can use to provide additional features in your application.
To give a basic example, several years back everyone started doing background spell checking in word processors because they had the headroom to do that. And finally, if you are running an application like a game, well, you can always use more performance, because you can use that additional performance to provide higher frame rates, additional quality and detail in your graphics, and so on.
There's another way to look at performance as well, in that it's very economical. Once you've optimized your software and crunched it down, doing a certain fixed amount of work requires less time and fewer instructions. If you think about it, this actually means it's taking less power to do that same amount of work, and if you have to do that work in a finite amount of time, you'll be able to use less hardware (a smaller system, fewer systems, and so on) to do that same amount of work in your fixed time. In the real world this means that, say, with portables, you'll be able to do more work on a battery charge if you're running optimized code. Or in the server space, to do that fixed amount of work and provide your customers with a certain level of service, you'll be able to do it with fewer servers in your rack, which take up less space and less power in your expensive server farms.
Okay, so how do we get optimized code? Well, as developers I'm sure you're all quite familiar with the programming cycle. You write your code, you build it, you test it, and, well, if you're like me you haven't done it perfectly the first time, so something goes wrong, you go back around, you fix it, and you go through again.
Of course, in reality we don't just simply say, "Well, it failed; oh gee, maybe I should try doing something." No, what we really do is go in and apply a debugger, such as GDB built into Xcode, so it can tell us exactly where we failed: we failed at, say, function foo, line 25, and not just "well, it crashed somewhere in general." We can also do auxiliary things like displaying the program's state around where we have our crash, and so on.
So, the performance tuning cycle is actually very similar to this process. We're still going around the same loop of writing our code, building it, and testing it, only now instead of crashing and failing, our users will come back and say, "This is too slow; I really want it to be faster; I'm getting tired of waiting for my computer." We could in theory go in and manually tune and optimize this, but unfortunately it's even more difficult to close this loop manually than it is with the original bug-fixing loop. First off, do you even have a clue as to where you might want to go in and do your performance tuning? You have to have pretty intimate knowledge of the code, and probably only the person who wrote it, or someone who's worked with them very closely, would have even a clue as to where to start. Secondly, it's real easy to become sidetracked: your intuition tells you to go in and look at one part of the code, but actually it's something you weren't even thinking about, in another part of the code, that is the real problem, and it's way too easy to get caught in that situation.
You can spend a couple of weeks trying to optimize and really beat the heck out of one portion of your code that you think is really the problem, only to have your coworker come in and go, "Sorry, these were not the functions you're looking for." So as a result, what you want to do is use a tool to actually help you focus your activity, and that's what we have in Shark. Shark is effectively a performance debugger that tells you which part of your program is slowest: foo, line 25, is the slow part of your program; that's where you should go in and optimize.
So, effectively, what Shark helps you do is find the performance bottlenecks in your code. It does this by measuring and analyzing your code as it runs; it pokes and prods your program to see what it's doing, gathers this information, and presents it back to you. As it's presenting it back, it might also give you some hints on how to optimize and what you might want to look at in your program, and we really think this is a great performance tool solution.
First off, because it's really easy. I'll be able to show you basically how to use Shark in just a few slides coming up next, and once it has analyzed your program, it presents the information acquired with very simple and straightforward graphical displays, instead of, say, making you go through reams and reams of textual output like you would with something like gprof.
Once you've used Shark for a while, you realize that it's also quite powerful. It offers many different analysis and profiling techniques you can use to look at your application in a wide variety of ways, in order to examine the different types of performance issues you may encounter.
Each one of these techniques has many different configuration options to play with, in order to fine-tune it to look at just the issues you may have. Also, it works seamlessly with both 32-bit and 64-bit applications, so you can look at any application you might write for Leopard.
So, I just said Shark was easy to use; well, here, let me try to prove it to you. In order to actually use Shark, the first thing you want to do is debug your program, because trying to fix performance problems while your program is still crashing is usually kind of pointless.
Then, once you've got your program working, you want to go in and set the build options as if you're planning to release: set all the optimization flags to full optimization and so on, so you're looking at your code just as your users will, with one big exception. You want to make sure you have all the symbols enabled in the code and you're not stripping anything out, because Shark uses that information to tell you where to go in and look.
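To make that concrete, a compile line for a release-style build that keeps symbols might look something like this (the exact flags are an illustration; match them to your own release configuration):

    cc -O3 -g -o MyApp main.c    # full optimization, but keep debug symbols so Shark can map samples back to source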
So, now that we've got this, you want to get ready to run your program, and when you do, you want to choose which part of your program you want to optimize. Do you want to speed up the response to button clicks in your interface? Do you want to improve the wait times when you're doing a long computation? If you're a game, do you want to increase the frame rate, or do you want to go for higher-resolution graphics? What is your measurement? Once you've got this, go take a baseline measurement so you have a place to start: this is where we're starting, and we want to try and improve it. A key part of this is that you need to take this measurement in a repeatable manner, which can sometimes be very difficult if it's something that's very user-interface intensive, because you don't want to take different measurements just depending on how fast a person actually clicked their mouse. Once you've got this picked out, you can go ahead and use Shark on your program while it's running to analyze it. When you start Shark up, this little window, which I can fit on a slide and actually have to magnify to make it look like it's filling up a slide, is really all you need to use to get going. What you'll need to do first is pick a way for Shark to look at your program; there are several methods, and we'll talk about several of these here today. Once you've picked a sample configuration from this list, you figure out how you want to aim Shark at your program.
Do you want to just have Shark slurp up what's happening in the entire system while it's executing? It can do this, and it's actually surprisingly useful in many cases; you can see how you're interacting with other applications and with the system. But you might also want to focus it on, say, just your application, or have it actually launch an executable file and watch your program starting from the binary on disk.
Now, once you've told Shark what you want to look at, pull the trigger: press Start. It'll be off and running and doing measurements of your application. Alternatively, you can use Option-Esc, the hotkey listed right here above Start, which allows you to do this without actually touching your mouse, and you can change this hotkey to be anything you want, should this be a problem with your particular program.
So you're on your way, and a little while later you can just press this button again (it changes conveniently to Stop), or similarly use Option-Esc, and you'll stop Shark's measurement so it can start its analysis phase. Alternatively, Shark will also stop automatically after a preset time limit by default.
Now once you've got all this information collected and ready to go well, boom, up pops this window all at once containing a very concise summary of what happened in your application during the time it was measured and pointing you right at the area you want to optimize first. We'll talk more on how to interpret this in just a moment.
So, what you can see is that in just a few slides I was able to show you how to use Shark. Basically, it's real easy to use, but it can give you a lot of data really fast, so the real key is getting Shark to focus in on the data of importance to you, so the data it produces is really useful. For the rest of this talk, we're going to give you a basic tutorial on how you might choose the analysis options from that configuration menu, and how to focus Shark in on the part of your program that's of most interest to you, in order to allow you to see and measure just those items of interest.
So, expanding upon this a bit, we're going to show you how to use Shark to examine a variety of issues. First is your basic CPU-bound job: you just need to speed up something that's processor bound, and for this we use the time profile technique.
Another common thing you may face is how your application deals with Mac OS X and how it interacts with it, and for this we suggest the system trace tool. If you have a multithreaded application and are concerned with how the different threads are interacting with each other, we would also have you consider system trace, or you might want to consider a slight variation on time profile called all thread states.
We have Malloc trace for memory allocation issues and a variety of options for Java tracing, and if you are encountering problems with how your application interacts with the hardware below it, we allow access to the actual hardware and OS performance counters, in order to see what's really happening at a low level. Then, to wrap up, we're going to look at how to aim Shark at your code, going beyond just the start and stop buttons I showed you a moment ago, and at the several other ways you can point Shark directly at issues that may be of interest to you.
So, for the first part here, I'm going to talk about how we can use Shark to look at CPU-bound jobs using the time profiling technique. First off, just how would we really want to address a CPU-bound job, for those of you who don't have experience with performance tuning? To give a quick example, let's say we have an application where we're currently spending all of our time in the key area of interest doing some activity; we say that's our time baseline, and however many seconds it is, we're just going to call it 100%.
The first thing you really want for performance optimization is to not simply see this block of time; we want to see what was actually happening. This is where Shark comes in. It takes this time, breaks it down, and shows what code was actually executing over the duration, so we can see all the different parts that time actually consisted of. The key thing you should take from this graph is: "Hey, we're spending half of our time in function foo; this is probably the best place to spend our optimization effort, because we'll get the most bang for our buck." So, to give a quick example, let's say we go in and look at the inner loop of function foo, where we're spending a long time.
So we can go in and look at that loop, and if we optimize it, let's say we manage to get a 5x speedup on this little bit of code in foo. Just by tweaking one little bit of code, we can cut the amount of time our application takes during this area of interest down almost in half, with one little application of Shark.
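To make the arithmetic concrete, using the example's numbers: if foo accounts for 50% of the baseline and we speed it up 5x, the new total is 50% + 50%/5 = 60% of the baseline, an overall speedup of 1/0.6, or about 1.7x. This is just Amdahl's law applied at the function level, and it's why Shark steers you toward the biggest bars first.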
Now, of course, what'll typically happen here is that your boss will say, "Oh great, but it's still not fast enough; you should optimize some more." So then you look at your application again, and what you'll see is that the second bar has popped up and is now the biggest, because we've pushed the first one down so far. I actually like to compare this to the game of whack-a-mole.
If you've ever gone to the arcade and you know, seen the little moles popping out of the hole and you take the big huge hammer and you go and beat them down and of course as soon as you beat one down the next one pops up and you go and beat it down and so on and so forth. Well performance optimization is the same way; you beat one down the next one pops up and is your next challenge to go after.
So let's say we're able to go in and tune an inner loop here in bar, and we manage to beat that one down as well. You can see that now, just by tuning a couple of inner loops in our program, which Shark basically pointed us at, we're able to get a two-and-a-half-times performance speedup on this application without even looking at the rest of the code or working with it in any way.
Now, you can continue on this path ad infinitum, trying to get better performance, but at some point either your product will ship or you'll get tired of this; your boss will say, "Hey, that's good enough; we're fast enough." Usually by this point you have a distribution more like this, where you have a variety of functions that are more or less equal, because you've beaten down all the big heavy hitters, and so it becomes more a matter of picking out which ones are the easiest to optimize. But even then, picking one from among three functions to maybe look at optimizing is much better than having to look over the hundreds you may have in your entire program.
So, how does Shark actually allow us to see inside our program and get this information about where we want to optimize? What Shark does is take periodic samples of our program's execution. Over the course of time, as our application executes, Shark peeks in at an even rate and sees what we're doing, and it's going to see everything our application is doing; there's no hiding from it here. It's going to see the kernel time, drivers, you name it; whatever is happening, it's going to peek in there and see it, and it can do this with a configurable, variable sample rate.
By default we start with a thousand samples a second, because that works pretty well for most applications, but you can change it if your application needs higher- or lower-resolution sampling. And at each sample point we actually don't do very much work.
All we do is slurp out the call stack of what your program is doing at that particular point in time, plus a few other bits of information, such as the program counter of where the processor is executing, the process and thread IDs, and a couple of other numbers, and we do this on all the processors in the system, for today's multiprocessor systems.
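Conceptually, each sample is just a small record, something along these lines (a hypothetical sketch, not Shark's actual internal format):

    #include <stdint.h>

    /* Hypothetical sketch of what one profiler sample might record. */
    #define MAX_STACK_DEPTH 64
    struct sample {
        uint64_t  timestamp;                   /* when the sample was taken */
        uint32_t  cpu, pid, tid;               /* which CPU, process, and thread */
        uintptr_t pc;                          /* program counter at that instant */
        uintptr_t callstack[MAX_STACK_DEPTH];  /* return addresses up the stack */
        uint32_t  depth;                       /* number of valid callstack entries */
    };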
So, this has a lot of key advantages. First and most importantly, it has very low system overhead for the amount of information we're actually getting. If you look at a whole-system trace with Shark, you can see that its impact on the CPU at typical sample rates is usually less than 1% of the CPU load; a very small factor.
Similarly, to record typical-length traces of maybe 30 seconds or a minute, it actually only requires a few megabytes of RAM to store all this information, and in today's multi-gigabyte systems that's kind of a drop in the bucket and not really a big deal.
So a key factor in all this is that Shark does not affect your results, and with any sort of performance optimization tool this is very important, because, much as with particle physics, where Heisenberg's uncertainty principle says our measuring tool can actually affect what we're measuring and make our measurement invalid, this is also very applicable to performance analysis tools.
You don't want to end up measuring what your performance analysis tool did; you want to measure your application, and so Shark gets out of the way and lets you measure your application and not Shark. Going a step further, sort of a corollary is that Shark doesn't require you to instrument or adjust your code, or screw it up in any way, in order to take your measurements.
No code modifications are required to use Shark; there are a few optional ones you can add, but in most cases you won't need to modify your code in any way. This even goes for the compilation stage: you don't need to recompile with any special options, unlike, say, gprof, where you have to add additional code to get your information. As a result, an interesting thing about Shark is that even end users can Shark your application with their own data sets after it's out in the field, and subsequently send this data back to you. I'll talk a little more about this later.
One final point here, which is of key importance, and it's one of our key differences from Xray if you went to that talk yesterday: Shark can get the execution profile down to the program counter level. We can tell you the exact instructions you're working on, and we're able to use that information to aim you exactly where you want to optimize.
Now, of course, all these great advantages do come with some disadvantages, but really they're not that bad when you think about them. So if we have some execution going along here, our main function is running and it's (inaudible) these very small foo and bar functions popping up, and the white lines are where we're actually sampling. What Shark sees when it samples this is actually this measurement instead, which looks quite a bit different, for a couple of reasons. Over here on the left side, the foos and bars are falling between our sample points, so we're just missing them completely; we're not getting them at all.
Over on the right side, in contrast, the foo and bar functions are falling right on the sample lines, so we're getting them and thinking they're larger than they actually are. The nice thing is that, statistically speaking, when you look over a large enough set of samples (for a typical function, getting more than 10 or 20 samples is usually statistically significant), these two effects will average out over time. Therefore, for all the important heavy-hitter functions in your program, these problems will essentially cancel out and be meaningless.
Okay, so we've established how Shark can get this information; now how can we break that time down and feed it back to you? Shark is going to take all these samples and grab the functions that were executing at the top of the stack, the leaf functions, and with just those it's going to compile the exact graph I showed you before, showing exactly where we were executing over the span of time while we were taking these samples, broken down into the various functions we were executing.
Now, this looks really good for 5 samples. In the real world, of course, you're going to have a lot more samples and probably a whole lot more functions than just 3, so what we do to make this readable is take this graph, tilt it over on its side, and present it in this textual format. But it's the same thing; instead of the big heavy hitters going over to the left, they're just coming up to the top and being right there, obvious. So effectively, just with this one window, we're telling you, "Hey, optimize this; you can ignore these down at the bottom." Now, this view is what we call the heavy view; it shows you the leaf functions, the functions that are actually executing code.
It's usually the best place to start your analysis of your program, for the simple reason that it answers the question: what code is actually executing? What code should I look at? Effectively, if you step back and think about it, we're providing sort of a bottom-up view of your code.
We're starting from the code that's actually executing at the bottom of the call stack, the leaf functions, and then going back up. You can see up here in the upper part I've actually opened up some of these little disclosure triangles that appear next to the function names, so we can see the call stacks backing up.
Now, this is really great for finding, say, an inner loop that's executing all the time, but it's not always what you want. In other cases you may actually want to look at your code and at the samples in more of a top-down fashion. Shark can handle this as well. What we can do is take your call stacks and walk them down starting from the bottom, and build up a whole tree of what is being executed and what is being called by other functions.
Once we've built up this call stack tree, we can see patterns in how these different calls are related to each other. For example, in this case we're calling square root in 2 completely different ways, through 2 completely different code paths, and as a result we may want to optimize those 2 code paths in 2 completely different ways, even assuming we can't optimize inside the square root function itself. So basically what we're answering here is: how did we reach the executing code? That matters in case we want to consider backing up a level or 2 in the tree and optimizing a higher-level function, instead of going clear down to the bottom.
Again, we present this in a textual, columnar chart format, and it looks exactly the same as what you saw before except everything is flipped over; now the start function is up at the top, we go from _start to main, and then we branch off into the various functions that are occurring. This is a very simple little program (but a real one) that nicely fits on a slide; with your programs it will look very similar.
Using the little menu in the lower right-hand corner there, we can choose between those 2, or we can split the difference and look at them both side by side, comparing and contrasting, with the heavy view up on top and the tree view down at the bottom; we can look at our code both ways at once.
Okay, for a little simple toy example like what I can show you in this particular session, this looks great: we can just look at the whole tree and see all the patterns right there, no problem. Of course, with a real application, which may have hundreds or thousands of functions, the tree can get so big that it can be really hard to follow. So we need to simplify it down.
Shark can help us out here too. What we can do is simply eliminate parts of the tree. For example, let's say we want to eliminate all the math library calls; this is code supplied to you by Apple, and it's not like you're going to be able to optimize the internal guts of square root, because we haven't provided you with the source.
When eliminating those, what you'll typically want to do is not just throw them out completely, but take their samples and move them up into your function that's calling them. This basically says, "I want to look at my function that's calling them, because what I can actually adjust in my program is, say, the number of times I'm calling those functions." Once we do that, you'll see we have a much simpler and easier-to-read tree, without the stuff we can't control, and we can now see the appropriate weighting of all the functions we do have control over. This is just one simple example of what we call Shark's data mining options, which allow you to simplify these large, complicated call trees by doing things like the example here, charging the library to the caller; it can also completely chop off tree branches which you know are not of interest to you, and there are other possibilities. In keeping with Shark's ethos of a fairly simple GUI, all you need to do is select a function in a browser window, go up to the data mining menu at the top, and choose one of these options, and it'll immediately start chopping off the functions that aren't of interest to you.
There's a second way as well: at any point you can go up to the Window menu and choose Show Advanced Settings, and this little drawer will pop out of the side of the window. In this little drawer, I have highlighted the area containing all the call stack data mining controls, where you can do pretty much the same operations you can from the menu.
There are also other controls in here too; basically they control the various ways Shark displays information in the browser, allowing you to customize that display to highlight the areas that are of interest to you. The controls in this drawer actually vary over the course of time, depending on what you're looking at, in order to provide you with the options that could be helpful for that particular view, and we'll talk about a few more of these options over the course of the talk.
Okay, so the profile browser's heavy and tree views are the most common ways to look at your code, but there are other ways too; for example, we have what is called the chart view. It gives you an overview of the code, graphically displaying all the samples on the basis of call stack depth (that's what the height of each bar is), and what you can see here is very much how your program changes over time. Many programs are in a sort of steady state for the time of measurement, doing pretty much the same thing, but if you're actually doing multiple different things over the course of time, the chart view can make them very obvious.
For example, we can see over here on the left we're doing what looks like one thing, and over here on the right, well, this looks completely different, and then in the middle we have these big funny red spikes popping up; so maybe we want to know what the heck those are. We can magnify the interesting sections using this little slider at the bottom, which controls the magnification factor of the display, and zero in on the interesting parts we want to look at.
In this case we want to zero in on this big red spike, the left one of the two you just saw, to see what it really is. So we can magnify it, and if you click on the graph, over here on the right side we'll show you the call stack from that particular sample. As a result, just from examining this call stack, we can see: "Oh, okay, these 2 big red spikes are hitting a lot of vm_ functions, virtual memory functions; we were allocating a whole bunch of memory here, and the kernel was actually mapping those pages in for us so we could use them." You can also double-click on the various samples here to go back to the profile exactly where you were.
Now, everything I've shown you so far has let you see down to the function level, but as I mentioned before, sampling allows you to see down to the instruction level: where you were executing and what you might want to optimize. How can we show you that? It's really simple. To go beyond those function names, just come in and double-click anywhere on the function name, and the window changes. It shows you the code that is executing, and not only are we going to show you the code here, we've also color-coded it.
So you can see that the code which is executing the most has been color-coded in these hot colors, so you can see, "Hey, this is where you should look; this particular line of code is what you should optimize." And if you happen to have multiple of these spots scattered about your function, over here in the scroll bar you can see this isn't quite your typical run-of-the-mill scroll bar: the coloring is extended out onto the scroll bar itself, so you can grab the thumb and simply move to these different yellow points to see all of the different hot spots in your function.
Once you've picked out the part you want and thought, "Hey, I think I can optimize this in a new way," you can pick that spot and press the Edit button, which takes you right back to your editor so you can start the cycle of optimization again.
So, that's the way the code browser usually works. Now, there are some cases where you'll try double-clicking and this will come up instead: you'll see a bunch of assembly code from your program. This will happen if you, say, double-click on a library function or any sort of kernel function; or, if your end users Shark your code and double-click on anything, this is what they'll tend to see. Or you may simply switch over; you can actually bounce back and forth and look at both source and assembly if you like. Even in this view we can offer a lot of useful information. We can offer you optimization hints simply by clicking on these little exclamation point buttons, which can tell you how you might want to consider optimizing that code, even on an instruction-by-instruction level, or potentially give hints to the compiler on how it might want to generate code.
In fact, this is really just a good way to check on the compiler: to make sure, first off, that you actually did enable optimization for that particular function, because unoptimized code looks a lot different than optimized code once you've seen it a couple of times, and also to make sure that the particular way you wrote your code wasn't a pathological case for the compiler that caused it to generate something really ugly and slow, just because of the way you happened to write your syntax.
Now, of course, assembly view is great, but I for one don't bother to memorize every last one of those assembly language mnemonics, and I actually do use assembly on a regular basis. So for most of us, what you're going to want to do is pick an instruction of interest and click on this little ASM Help button down in the lower right corner. This window will pop up, with the actual pages from the architecture reference manual opened to the page describing the instruction you're looking at, so you can tell what that MOVSS really means in real life.
Now, I'm going to take a small tangent here for a moment to mention one other key feature of Shark: once you have these session windows, as we call them, you can actually save them at any time. The most obvious use is to simply save all the samples, but optionally we can also embed all the symbols and source code from your program into the file, right along with the sample information.
Now, this is optional; if you're worried about security, when you try to do this save, Shark will actually prompt you, "Do you really want to save the symbols with it?", and you can say no if you're concerned about someone else seeing your code or symbols or whatnot. If you do put them in, you can examine them later; or, if you don't, you can decide later on that, "Yeah, I actually did want to see my symbols; that person is legitimate; they can see the code," and go up to the File menu, choose the Symbolicate command, and add them back in later; Shark will take them from your binary. Beyond the basic, obvious use of saving a session away so you can look at it next week, this has many other very useful applications.
For example, you may find a performance problem in code written by someone else halfway across your company; well, you can take the Shark session, package it up, and send it to them in email, saying, "Hey, this is your code; could you tell me why this is so slow, and maybe tweak it and optimize it for me?" Or, if you have an issue that you don't have time to deal with right now, but will have time for in the next dot release of your product, you can save this as sort of documentation of a future thing you might want to do.
Another great application is that your users may go and use Shark on your application after they've run their own data set with it, a data set which of course you didn't have at the time you were developing the application. They may use Shark and find a performance problem that you missed, and they can take the Shark sessions, package them up, and send them in to you; you can add the symbols afterwards and figure out your performance problem using their data set.
Also, Shark sessions are great to put in bug reports. One last note is that these sessions are cross-platform, so if you have a mixed house with both x86 and PowerPC Macs, anybody can look at a session made on any of those Macs, no problem; they go back and forth seamlessly.
To illustrate how time profile works, I'm going to give you a quick demo here on how we can use it to optimize a very simple MPEG decode application. To make this, I simply went and got the reference MPEG-2 decoder off of the Internet.
So, going here to the first demo machine: here's the simple application, and here's Shark's window up here above it. What I'm going to do with my little reference application is start up a little movie of a guy pedaling around his lab, and, hey, this is going pretty fast. One thing I did, unlike most video decoders, is simply let the processor go as fast as it possibly can; it's not stopping at a fixed frame rate, so we're seeing how fast the processor can actually decode this information at full blast, and he's going pretty fast.
You might think, "Oh great, no problem, we don't really need to optimize this," but you need to really think about it: one, this is a pretty small movie that fits in a small portion of the screen, and two, this is a really high-end 3 GHz Mac Pro down here, and a lot of people out there in the field are going to have laptops or older machines and so on. This is occupying an entire processor on this 3 GHz machine just to get this guy riding around like this. So in reality we probably want to optimize this to account for those other systems, or for potentially larger movies.
So we're going to want to use Shark on this. I'll go up here to the Shark window; I've chosen the time profile option, the first one on the menu, and I'm going to choose to look at just my process, the MPEG-2 decode process, which I can choose from this menu off to the right. I've chosen that, and I press Start.
Shark makes this little beep, and you can see down here at the bottom the icon turns bright red when it's sampling, and...this is a pretty simple application...I think I've probably sampled enough, so I'm going to go ahead and stop it. You saw that bar go up; in order to avoid impacting the performance of your application, Shark only does the analysis, actually sorting out what was executed when, after we press Stop. It simply takes the samples while we're doing the measurement, and all the heavy analysis work is saved until afterwards; this is purely to keep the overhead on your program as low as possible.
So now we've got this time profile window up, and we can zoom in on it. What we can see here is that we've got an obvious heavy-hitter routine, this reference IDCT routine that came right up to the top. So it looks like that's probably where we're going to want to spend most of our optimization time. But before I do that, notice that Shark put up this little button here with an exclamation point on it; it's offering a hint, and it's in a function that's pretty high up, only the fourth one down, taking almost 11% of the time; so we might want to check out this hint. Basically, what Shark is pointing out is that this is the floor function from the math library, and here's a short bit of code you might consider inlining in your program to avoid calling the floor function so much, with all the overhead associated with making a function call. It's pretty simple, quick, and easy for us to cut and paste a little bit of code into our application, so, heck, it's not the top function, but why not.
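The exact snippet from the hint isn't reproduced in the session, but a common fast-floor replacement looks something like this (an illustrative sketch, not necessarily the code Shark suggests):

    /* Illustrative fast floor: truncate toward zero, then correct negatives. */
    static inline int fast_floor(double x)
    {
        int i = (int)x;              /* C truncation rounds toward zero */
        return (x < i) ? i - 1 : i;  /* step down for negative non-integers */
    }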
So we go back to our application, where we've already done this, and I've got several copies of the same movie so we can show them simultaneously. Showing these two, you can see that when we put in this fast floor, replacing the floor command, we're getting a little performance boost, you know, 10 or 15%, somewhere in that vicinity. That's not too impressive, but on the other hand, all we had to do was copy and paste a little bit of code, so for just a few minutes of work this is pretty good. But if we go back and look at that reference (inaudible), I think we can do better. So let's go ahead and close this.
Going back and looking at our profile again, let's actually look at the reference IDCT routine. I'll double-click on that, and we'll go to the code browser and look at it. What we're seeing here in this routine is our code, and you can see that in the whole function there's one area over here in the scroll bar where we're having our various hot lines; it's all in this function, reference IDCT. It has a couple of hot loops where we're spending our time, but it's all in this one very small function.
So we went in and spent a day or so looking at this, and what is happening is it's taking a bunch of short integers in, converting them to floating-point doubles, doing a whole bunch of math on those doubles, and then outputting back to short integers in the same variable they came in.
So there's a whole lot of overhead involved with converting back and forth between integer and floating point, and floating-point math is also slower than integer math. What if we convert this to integer math in order to avoid all this excess overhead? This is the really hot place in the program, where a little bit of overhead goes a long way.
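The converted code isn't shown in the session, but the general fixed-point idea is to scale the fractional constants up to integers and shift the products back down; a hypothetical fragment:

    /* Hypothetical fixed-point sketch: represent a fractional IDCT constant
       in 2.14 fixed point so the math stays in integer registers. */
    #define FIX14(x)  ((int)((x) * (1 << 14) + 0.5))   /* e.g. 0.70710678 -> 11585 */

    short scale_coeff(short coeff)
    {
        return (short)(((int)coeff * FIX14(0.70710678)) >> 14);
    }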
So what we did here is go in and change this code to integer code instead. The code is a little uglier with integers, because with integer math you have to do a little bit more work, but after changing just this one function to use integers all the way through, you can see that we're now getting a frame rate which is nearly twice as fast. He's now boogying around that lab with pretty good abandon, and that's just from changing one function in the whole program; this is code I didn't write, so the rest of it I don't even know what it does.
I didn't pay any attention to it. That's pretty good for tweaking one function, but if we look at this a little more...let's see, if we go to the code browser, here is the actual assembly code, and...on this particular Mac it's not wanting to show it to me, but back at the office when I did this, right up here on this function it had a hint, and I clicked on it and it said, "Hey, you might want to consider using SIMD instructions in this particular function," because it's doing a lot of very regular mathematics.
So I took that hint and said, "Well, this does look like a good opportunity for SIMD instructions; they're kind of designed for video coding work; sure, let's go ahead and put those in," and so let's make a third copy of him with this converted to SIMD instructions.
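The vectorized source isn't shown either, but on an Intel Mac the flavor is SSE intrinsics, along these hypothetical lines (the function and its role are assumptions for illustration):

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Hypothetical fragment in the spirit of a decoder inner loop: add a
       block of residuals to predicted pixels, eight 16-bit values per
       instruction, with saturation. */
    void add_block8(short *dst, const short *src, const short *residual)
    {
        __m128i a = _mm_loadu_si128((const __m128i *)src);
        __m128i b = _mm_loadu_si128((const __m128i *)residual);
        _mm_storeu_si128((__m128i *)dst, _mm_adds_epi16(a, b));
    }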
Now, in the third version you can see, well, he's going a little bit faster, but it's really not that much; maybe another 10% or something. Considering everything we've heard about SIMD instructions, that they do four operations at once, sometimes even 16 operations at once, shouldn't we be seeing something like a 4x or 16x improvement, not just 10%?
This is where our intuition starts to fail us again, and when intuition starts to fail, it's a good time to go back and use Shark. So we'll do another Shark profile...yeah, that's probably enough...Shark will analyze our samples for us now and toss up a new window.
So, I zoom in on this...well, the first problem I'm seeing is that I actually left all those windows open, so when I sampled my decoder I actually saw everything; I was leaving those up there for the demo. Actually, no, Shark has a solution to this: down here at the bottom you'll see we have these menus, so as well as focusing in before we start, we can do some post-run focusing. I can choose both the process and, in this case, a thread of interest.
We can see we have our main thread, which is doing the GUI application, and we have a thread for each of the different decoders, each of which is getting a balanced amount of time, about a third of the execution time. So I can go in and look at these: that is the integer version...this is the vectorized version, and here is the...oh yeah, here is our original one. You can see that the original one, if we just look at that thread, has a breakdown almost exactly identical to what we saw before with the original analysis, with the reference IDCT routine running up at the top. Let's go back and look at that vectorized one.
If we look at just the vectorized one, the problem immediately rears its ugly head: here is the IDCT down here; it's number 6 in line now. So basically we're playing whack-a-mole again; we've beaten down this operation a whole bunch, and it's now nice and short and efficient.
Instead, the problem is these other routines, which have popped up and are now the new heavy hitters in our program since we optimized the IDCT. So in order to actually optimize our program and get it to go faster, we're going to have to go in and optimize these instead.
I'll skip forward through all the steps of actually optimizing those; as it turns out, most of these were also amenable to vectorization, just like the IDCT was. So I applied vectorization to pretty much all of these heavy-hitter routines, about 4 of them I believe: motion compensation, color space conversion, pixel interpolation, and so on, all vectorized. Now we can see that with just these top 4 or 5 routines optimized, he is really boogying around that lab; someone clearly lit a fire under him. Just by using Shark to find, isolate, and give us hints on how to optimize a handful of functions in this actually fairly large program, most of which consists of code that executes at startup and never again, we were able, with one engineer for a few days, to optimize this program to nearly a 6x performance boost. And if we could switch back to the slides: you can see here that from our baseline of 1x, the fast floor operation got us an additional 15% in just a few minutes.
When we switched to the integer IDCT, with about a day's work we were able to get nearly a 2x performance boost. The vector IDCT added a little bit, but as it turns out we had already gotten most of the benefit just by converting to integer. Then, taking that and going to just a few more functions, about 4 or so, we were able to get an almost 6x performance improvement with just a few days and one person, thanks to the fact that Shark was able to point us right at the issues.
So that's how to use Shark to look at your CPU-bound problems. Next, we're going to look at how you can use it to examine system and threading interaction, in order to speed up how you're dealing with Mac OS X or your multithreaded applications, using a tool we call system trace. For this, we're going to bring up Mr. Ryan Du Bois.
( Applause )
Thank you, Lance. As he said, my name is Ryan Du Bois, and I'm here to talk about system trace. Let me get out of your way here. So you just saw time profile, you saw all about how it helps you boost the performance of your CPU-bound jobs, and that's all well and good.
What if your application is more I/O intensive, or maybe you just added multithreading after you optimized the CPU work? How about if you want to understand the virtual memory behavior of your application; what do you use then? You could probably use time profile, but it might not be as helpful as system trace; let's find out why.
Time profile, as Lance showed you, interrupts your application at a predetermined interval and figures out what it's doing. Let's say, for example, this blue block of execution on the left there: what if that was a VM call or a system call? You'll notice that time profile completely missed it.
To this end we've created system trace, which is an exact trace of operating system entry and exit. This will catch everything: system calls, VM faults, interrupts, and it'll catch all the threads running on your system. The really beautiful part is that you don't have to add any instrumentation to your application, just like with time profile, but if you did want to, you could add signposts, and we'll talk a little bit about those later.
When you first take a system trace, you have 3 views: the first one here is the summary view, and there are also the trace view and the timeline view. Let's dive into each one of these. Looking at the summary view, the top right there gives you kind of an overall breakdown: how many CPUs are on the system? How much total CPU time? Was it user, was it system, was it busy, was it idle? Going even further, the pie chart in the upper left breaks it down more, into user code, system calls, VM faults, interrupts, and idle time, and below that you'll notice there are 3 more tabs. These are more detailed summaries. The first one here is the scheduler summary, a basic overall statistical summary of the thread scheduling behavior.
Down on the bottom right you can pick the different metrics you're interested in: there's busy time, user time, system time, and priority. This will give you a min/average/max breakdown of how much time your threads are getting, in whichever metric. The next tab over is the system call summary; you'll notice this looks a lot like a time profile, and in fact it kind of is: it's a profile of your system calls.
Down at the bottom just like with the time profile you can switch the view between tree and heavy and opening the disclosure triangles allows you to tie the system call time directly to your user space application. The same thing for VM faults over here; again open the disclosure triangles, tie your VM fault time directly to your user space code.
The next main view we talked about is the trace view, the complete list of traced events. It allows you to inspect the call stack, arguments, and return values of every system call made during the system trace, and similarly, for VM faults, it allows you to inspect the address, the size, and the call stack where they happened. If you were to double-click on any one of these events, it will take you to that event in the timeline view, which looks like this.
The most obvious part of the timeline view is the timeline itself: basically, every row corresponds to one thread of execution. The colored bars are time that that thread was actually executing on a CPU, and within these colored bars we'll draw icons for system calls, VM faults, and interrupts.
Speaking of system calls, if you were to click on one, like this one, you'd get an inspector that looks like this, listing the system call name, the call stack of how you got there, the first 4 or 5 arguments, and the return value. In the lower left of that inspector you'll see how much time was actually spent executing on the CPU and how much time was spent waiting, possibly on resources.
You can zoom in on this using the top slider there just like you can with the chart view in time profile and the bottom scroll bar behaves just like you would expect, just scroll through time. You can filter this display just like with time profile based on the process, the thread, the CPU and you can even turn on and off the different classes of icons. In case you're not interested in system calls you can turn them off.
Now, we offer 3 different ways to color this view, allowing a more instantly, visually interpretable way to look at it. The first one, seen here, is colored by CPU, so each one of the colored rectangles is colored according to the CPU on which it was executing. What you can do with this is identify your worker threads at a glance and make sure they're sharing the processors correctly.
So you'll notice these top 4 threads are NobleApe's; the process is called NobleApe. Just by looking over here in this section and seeing the 4 different colors, we can verify visually that all 4 worker threads are sharing the CPUs as we would expect. Now, if you look a little bit further to the right, you notice they're not sharing as well anymore; 2 of them are kind of arguing over which CPU they want, and just by looking down and finding the color that's missing, we can see that one of the hourglass threads has stolen that CPU.
The other way you can color this is by priority. This is that same system trace; we've just switched the coloring to priority. The red down here, for the iTunes and NobleApe threads, denotes high priority; for these hourglass threads, the blue denotes low priority. And finally, you can color by the context switch reason, which allows you to visually interpret why these threads were switched out.
Let's break down this timeline view even more. Like I said, these colored rectangles are run intervals. The little gray lines on the ends of a run interval represent a context switch to a different thread on the same CPU. Within these run intervals we draw the icons, like I said; here you can see a normal system call, where the little black underline represents how much time that system call actually took. If you look up above, you can see a system call that causes a thread to block; again, the black underline represents how much time that system call took. Similarly, we'll draw icons for page faults; again, the black underline shows how much time the page fault took.
If you were to click on one of these run intervals, you'd see an inspector much like this, listing the thread, the process to which it belongs, how much time was user time, how much time was system time, and, importantly, the context switch reasons. You can click on each one of these and inspect it individually.
Now we'll add icons to this; like I said, you saw some system calls, but what do they really mean? Well, there are 4 system call icons you'll see in a timeline. The red phone denotes BSD system calls; these are your POSIX APIs, things that allow you to talk to devices in an abstract way, that sort of thing.
The blue phones are Mach system calls, basic services like ports, locks, tasks, and threads. Speaking of locks, a subset of these will be represented with the lock icon; this happens for any pthread mutex lock call that ends up in a trap to the kernel, so you can visually, at a glance, identify lock contention in your system. And the green phones represent MIG messages; these are Mach Interface Generator routines, more low-level stuff.
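To give a sense of what produces that lock icon: an uncontended mutex acquire is satisfied entirely in user space, but a contended one, as in this hypothetical fragment, traps into the kernel and shows up in the timeline:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        /* If another thread already holds the mutex, this call blocks in
           the kernel, and Shark draws a lock icon for it. */
        pthread_mutex_lock(&lock);
        /* ... touch shared state ... */
        pthread_mutex_unlock(&lock);
        return NULL;
    }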
We'll also draw events for virtual memory. For those of you not familiar with virtual memory, it takes your physical memory, divides it into small 4-kilobyte chunks called pages, and moves these in and out of physical memory on demand, in a process called faulting. There are 5 types of faults on OS X.
Page in, where a page is brought from disk into physical memory, is represented by a green up arrow. Page out, with a red down arrow, is a page taken from physical memory and pushed out to disk because it's no longer being used. Zero fill has a little 0 there: a new page you get is marked zero-fill-on-demand, and the first time you touch it, it's filled with zeros.
Copy-on-write fault, with the clever cow icon: these are shared pages, so say you do a process fork; all your pages are marked copy-on-write, and when the new process goes to write onto those pages, it gets a whole new copy. And finally, the check mark is the page cache hit.
This means that the page is actually resident in physical memory, but it's just not mapped into your process. We'll also display icons for hardware interrupts from anything: I/O operations, DMA operations, even the clock interrupts that enforce the time quantum. And, as I said before, you can add signposts to your system trace; they would show up like this.
There are 2 main types of signposts. The first type is a point signpost; you can think of this like a tracer bullet or a printf: your application kind of raises its hand and says, "Hey, this is what I'm doing." The other type is an interval signpost; these have a beginning and an end record, and they'll be represented with the little black underbar, just like the system calls and VM faults you just saw.
With each one of these events you can record auxiliary data values, anything you want and you can add these directly to your code both user land and driver code anywhere you want. You can come find us after the talk in the labs and check out the Shark manual to find out how to do that.
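As a rough idea of how that might look in practice, here's a sketch with placeholder wrapper names; the real signpost calls live in the CHUD framework and are spelled out in the Shark manual, so everything here except the pattern itself is an assumption.

    #include <stdint.h>
    #include <stdio.h>

    /* Placeholder signpost wrappers -- the real calls are in the CHUD
       framework; see the Shark manual for the exact names and arguments.
       Swap these stub bodies for the real API. */
    static void my_point_signpost(int code, uintptr_t aux)  { (void)code; (void)aux; }
    static void my_begin_interval(int code, uintptr_t aux)  { (void)code; (void)aux; }
    static void my_end_interval(int code, uintptr_t aux)    { (void)code; (void)aux; }

    enum { kFrameStart = 1, kDecodeBlock = 2 };

    static void decode(int frame) { printf("decoding frame %d\n", frame); }

    static void process_frame(int frame)
    {
        /* Point signpost: the one-shot "tracer bullet", carrying an
           auxiliary data value (the frame number). */
        my_point_signpost(kFrameStart, (uintptr_t)frame);

        /* Interval signpost: a begin/end pair, drawn with the black
           underbar just like system calls and VM faults. */
        my_begin_interval(kDecodeBlock, (uintptr_t)frame);
        decode(frame);
        my_end_interval(kDecodeBlock, (uintptr_t)frame);
    }

    int main(void)
    {
        for (int f = 0; f < 3; f++)
            process_frame(f);
        return 0;
    }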
So now I'd like to expand upon Lance's demo, where we already optimized an application using time profile and added some multithreading as kind of a first pass. Let's take it one step further and see what system trace tells us; see if we can squeeze out a little bit more performance.
So we're going to be working with an application today called NobleApe; it looks like this. It's just a basic simulation; each one of the little red dots represents an ape running around on an island. Up here at the right you have a little simulation of the ape brain; you'll notice it keeps track of ape thoughts per second, and that's the metric we're going to focus on. So when we originally looked at this code it was a completely single-threaded scalar implementation, as some of you may be familiar with already; it gets about 2,000 ape thoughts per second.
So in the past we've gone through and optimized it with vector code using time profile, just as Lance showed you, and that gets us up to around 12,000 ape thoughts per second. And we even went one step further; these are 4-core machines up here, so the other 3 cores right now are just kind of idle.
So adding multithreading gets us up around the neighborhood of 25,000 ape thoughts per second; that's pretty good coming from 2,000, but I have a suspicion that we might be able to take it a little bit further. So what we're going to do now is just leave this running; we're going to take a system trace with Shark. Again, all I do is poke the start button, let it go for a little bit, poke the stop button.
We'll go ahead and pause NobleApe here. So this system trace comes up; it's got the same things I just walked you through. It has the basic summary up here on the right; on the left you have the pie chart. You notice most of the time is in the user code; this is expected, and you know we want it to be calculating and not doing a bunch of blocking stuff. But you also notice the second major chunk there: it's just sitting idle. That's the time we're going to try to break into; we're going to try and eat into some of that idle time. Let's go look at the timeline.
Here's the timeline; the Window menu's Show Advanced Settings allows you to play with the coloring, so you can see the different coloring schemes here. You can zoom in just by clicking and dragging. So we're going to color by CPU. Now these are the worker threads here, these top 4, and we can look, like I said, at a glance right here and determine that they're sharing all 4 CPUs as we would expect. So that's good; that's not the problem; they're not blocking on each other. Let's go back to the summary view here and focus in on NobleApe. You'll notice down on the right here we've selected busy time.
These 4 here are the worker threads that are doing the calculation, and upon first inspection one thing that really jumps out at me is that the average amount of busy time each one of these gets is less than a millisecond. Now, the time slicing quantum on OS X is 10 milliseconds, and these are CPU-bound threads, yet they're using less than one-tenth of the time they've been given. So if we can break into that and get some more busy time on average for each one of these threads, we could probably get some more ape thoughts per second.
So we're going to use that, the average busy time as one of our metrics. Let's go figure out why they're blocking. We're going to go to system calls here; we're going to open this, this little icon in the bottom right allows us to see the heaviest call stack there.
I'm a little bit familiar with this code because I've looked at it now, but once you get down to NobleApe you'll get to something called NPtaskWrapper. So I went in, I found that code over in NobleApe, and this was the original implementation; kind of like what you'd expect, a basic worker thread model, you know: wait on a request queue, do some stuff, and post to a results queue. Makes sense.
Inspect it a little bit further and walk through the code a little bit, and you find this variable here for each one of these worker threads: there are 8 troops globally and there are 4 worker threads, and we just divide the work up evenly. So each one of these worker threads is only cycling through 2 troops before it's posting to the queue and going to sleep again. That's not a lot of work for these guys to do, and obviously, as you just saw, it's not filling up the time quantum.
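Here's a minimal sketch of that pattern, with made-up names standing in for NobleApe's real queue code: each worker wakes up, touches its 2 troops, posts a result, and goes right back to sleep, which is exactly what produces those sub-millisecond run intervals.

    #include <pthread.h>

    enum { kTroops = 8, kThreads = 4, kSteps = 100 };

    /* A toy counting queue guarded by a mutex/condvar, standing in for
       the real request/results queues. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             count;
    } queue_t;

    static queue_t requests = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };
    static queue_t results  = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };

    static void post(queue_t *q)
    {
        pthread_mutex_lock(&q->lock);
        q->count++;
        pthread_cond_signal(&q->cond);
        pthread_mutex_unlock(&q->lock);
    }

    static void wait_on(queue_t *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->cond, &q->lock);
        q->count--;
        pthread_mutex_unlock(&q->lock);
    }

    static void think_about_troop(int t) { (void)t; /* the real per-troop work */ }

    /* Each worker owns kTroops / kThreads = 2 troops.  Note how little it
       does per wakeup: wait, touch 2 troops, post, sleep. */
    static void *task_wrapper(void *arg)
    {
        int first = (int)(long)arg * (kTroops / kThreads);
        for (;;) {
            wait_on(&requests);
            for (int t = first; t < first + kTroops / kThreads; t++)
                think_about_troop(t);
            post(&results);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[kThreads];
        for (long i = 0; i < kThreads; i++)
            pthread_create(&t[i], NULL, task_wrapper, (void *)i);
        for (int step = 0; step < kSteps; step++) {
            for (int i = 0; i < kThreads; i++) post(&requests);   /* hand out work */
            for (int i = 0; i < kThreads; i++) wait_on(&results); /* synchronize   */
        }
        return 0;
    }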
So for the first pass, all I did was go ahead and double the amount of apes on the island and see what that does. Get it back into vector threaded mode...that's a little bit higher now; it's breaking into the 28s, hanging around 26 mostly. We got a little bit of a boost there. Let's go ahead and take another system trace. ( Pause ) Let Shark do its thing; again, we're going to focus on NobleApe, open up these, and look at the worker threads.
Look at that; we took it from around 700 or 800 microseconds on average up to 1.3 milliseconds, so we're breaking into it; this is a good start. Another thing to try: if you remember this loop here, we can go through that loop twice before we post the results; it's just calculating thoughts and changing some parameters on the ape. So let's go ahead and do that; let's walk through this loop a couple of times over the amount of apes that we've been given and then post the results after that.
So I went ahead and did that; went back to the original amount of apes, doubled the number of times we go through that loop; let's see what that does to our ape thoughts per second.
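Continuing the hypothetical sketch from before and reusing its names, the batching change would look something like this; kCyclesPerSync is a made-up name for illustration.

    /* Run kCyclesPerSync thought cycles per wakeup before posting results,
       so each run interval does more work per synchronization.  The demo
       used 2 here, and later 8. */
    enum { kCyclesPerSync = 2 };

    static void *task_wrapper_batched(void *arg)
    {
        int first = (int)(long)arg * (kTroops / kThreads);
        for (;;) {
            wait_on(&requests);
            for (int c = 0; c < kCyclesPerSync; c++)            /* batch the cycles */
                for (int t = first; t < first + kTroops / kThreads; t++)
                    think_about_troop(t);
            post(&results);                                     /* synchronize once */
        }
        return NULL;
    }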
( Pause )
It looks like it's getting a little bit higher there, hovering around 28, 29 breaking up into the 30s; that's not bad.
( Pause )
Let it process the samples and let's go look at that other metric again. Still about the same; okay, so that's not too bad, but we could break into this even further. So I went through this cycle a bit more and found out that I could double the number of apes and double the number of cycles and do them simultaneously and get a big boost, but then I just went ahead and went crazy with it. I said, "You know, why don't we just do 8 of these thought cycles per synchronization, and let's double the number of apes, and let's see how many ape thoughts per second we can get."
( Pause )
So we went ahead and did that; let this number average out for a while...there we go...and we've broken into the 40s now; we're hovering around 42,000, 41,000 ape thoughts per second, up from about 25,000. Let's go ahead and take one final system trace and see what our other metric is doing; let's see how those run intervals are going.
Focus in on NobleApe, go check the worker threads, and look at that; on average we're up in the 5 to 6 millisecond range, up from 600 microseconds. So here what you've seen is how to build upon what you've done with time profile in your multithreaded applications. You've gone in and you've optimized all your CPU work loops to make them as fast as possible; system trace allows you to do the same for your multithreading. It allows you to identify where the overhead is, see how much time your worker threads are getting on average, and increase that. Let's go ahead and go back to the slides.
( Pause ) So in review, when I did this back at work, these are the results that I got. So what did we do? Well, first we ran NobleApe and identified the one metric we wanted to improve: ape thoughts per second. We went in with system trace, we saw what it was doing, and we identified the short run intervals for the worker threads. From there we used the system call summary view to identify key points in our code to rethink, and attempted a few really simple, really quick optimizations. We got around a 1.7X speedup in about 15 minutes of actual work; that's not bad.
So now you've seen time profile for optimizing your CPU-bound jobs, and you've seen system trace for optimizing your multithreading and system interaction. These are just 2 of the many options Shark gives you for profiling your applications, and here to talk about some of the other options for profiling with Shark is Rick Altherr.
( Applause )
Thanks Ryan.
So, like Ryan said, we've talked about system trace, we've talked about time profile; let's talk about a few others. So what's a common theme in applications today? Well, they start using a lot of memory. This can have some pretty adverse effects; you can, you know, run out of memory, or you can be using enough memory that the operating system has a hard time managing it.
Well, how can we take a look and see what's going on memory-wise? Well, for Shark we have Malloc tracing. We can actually trace all the allocation events that happen, both allocations and frees, so we trace both C and C++: malloc and free, new and delete. And we actually create an exact trace of all these events.
This way we can look at a variety of statistics. Now what you get when you collect one of these is what looks like a time profile. In fact we even sort it by the same type of statistics; the number of allocation events that happen in each function. You can also go to the source level and look at allocations per source line but we also added an extra column and this tells you the actual size of the allocations. Now this is a summation of all of the allocations that happened in this function.
This is really useful to find runaway functions where you're allocating things in a loop and you can say, "Well, you know, I allocated 50 megs in one function; perhaps I should go through and think about that for a second." You can also go to the chart view and look at this in more of a behavioral sense; instead of looking at "well where am I doing lots of allocations" you can look at what's the overall behavior of my program. So you get the normal chart view that time profile gives you but you also get a secondary view that illustrates the change in heap size.
So bars going above the zero line are allocations and lines going below the zero line are frees. So you can look and, at a glance, see that the pattern of my application is allocate, allocate, allocate, free. And you can also see "I allocated a whole bunch and freed very little," which indicates that you have leaks.
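Here's a small, made-up example of the kind of code this catches; the function name and sizes are invented for illustration. In a Malloc trace it would dominate both the event count and the total-size column, and the chart view would show the heap climbing with few matching frees.

    #include <stdlib.h>
    #include <string.h>

    /* A runaway pattern: allocating in a loop, and freeing only some of
       what gets allocated. */
    char *build_rows(int n)
    {
        char *last = NULL;
        for (int i = 0; i < n; i++) {
            char *row = malloc(4096);  /* one allocation event per pass */
            memset(row, 0, 4096);
            free(last);                /* frees all but the final row...   */
            last = row;                /* ...which the caller never frees */
        }
        return last;
    }

    int main(void)
    {
        for (int i = 0; i < 1000; i++)
            build_rows(50);            /* leaks 1,000 rows over the run */
        return 0;
    }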
Now another area that has become more interesting over time is Java and virtual machines. Traditional profilers like time profile and system trace look at compiled code. They look at what is the native machine actually executing; this is great for lots of things; in fact it's great for looking at the virtual machine itself but what if you want to look at what the virtual machine is running? Well with Shark we've actually implemented hooks into the VM so that we can actually lift you out of the native machine into the virtual machine and see this is what my code is doing.
So what do we offer in terms of Java support? Well, the time profile looks exactly like an old time profile: it has the same features, same display, same data mining capabilities; it just happens to show you, "Hey, this is the Java class that I'm actually working with, this is the method I'm in, this is the package it's in." It also takes you directly to the source code; you get all the exact same features, just for Java.
There's another type that we haven't talked about before: what we call a call trace. Now, this is an exact trace of every method entry and exit that happens while you take a trace of your program. This is quite invasive and will make your application considerably slower while you do it, but if you really want to know what your application does, this will tell you. It shows it in the exact same time profile style format, and you can look and say, "Well, I make lots of calls to this function, I spend lots of time in these functions, and I know it exactly." We also offer allocation tracing.
We watch your memory allocations by catching object allocations in Java. We also catch garbage collection events to catch the frees, and we display it in the exact same way that you're used to with the Malloc trace. We add the allocation size, so you can even see, "I'm allocating lots in this particular method."
We also offer even more techniques than what I just talked about and we're going to cover these really briefly. Sometimes you want to deal with early testing of binaries. You know, perhaps you really haven't debugged your application completely; it's really hard to get into a case where you actually exhibit a performance issue but you want to take a look anyway.
Well, you can point Shark right at your binary on disk, without even running it, and Shark can come back and give you a lot of those hints about what you might be able to do, such as vectorizing your code or things like that. And you can do this without ever running the application; it does it all statically.
Now, in today's world you have lots of threads doing lots of different things; sometimes you think they're doing everything in parallel, and sometimes they're not, like you saw in system trace. Well, sometimes you're interested in what they're not doing, specifically what they're blocked on, and system trace can be very useful for this. But another way of looking at this is saying, "Well okay, it took 10 seconds on my watch, and of that 10 seconds how long was I blocked, and where was I blocked?" Well, for that there's a profile called time profile (all thread states). It's just like a normal time profile except that we record not only execution time, time where you were actually executing on the CPU, but also time where you were not executing, in every thread on the system simultaneously.
This is displayed as a normal time profile, and you can look at this for things like figuring out: why is my launch time slow for my application? What am I actually waiting on to happen during my application launch? We also offer performance counter recording; we can get OS X events from the operating system, and we can get counts from the processor, and from memory controllers on PowerPC systems, and we can actually look at things like L2 cache misses. We can look at memory bandwidth; this way, if you suspect that you're actually running into a hardware limitation, you can go in and look at this.
You can look at this in both a time profile type format and also in a spreadsheet type format with charting, so that you can actually look at bandwidth over time. For more information about any of these techniques, please come see us in the lab after the session and tomorrow, or look at the Shark documentation, which has full descriptions of all these different techniques. Now, for some other ways to start and stop Shark, back to Ryan.
( Applause )
Thank you, Rick. Alright, so by now hopefully you're all familiar with the normal start button, the way to start and stop Shark; super easy, you've got a configurable hot key, really awesome. The only downside here is it requires UI interaction, and it might show up in a really short time profile. So let's talk about some other ways to use Shark, some other ways to point it at different processes, and ways to trigger it.
( Pause )
An extension of the process attach I just showed you is process launch; say you want to profile the launch time of your application, let it profile, say, preference loading, plug-in loading, that sort of a thing. Go to the process menu as you just saw, pick launch, and you'll get this menu here. It has many options; you can start sampling immediately, and this will start time profile before the first instruction is ever executed.
You can give it user-supplied arguments, a working directory, and environment variables. It's really perfect for really short-lived applications like command-line tools. It's also great for really large applications like Photoshop; say you want to profile the plug-in loading and the preference checking at startup time, that sort of a thing.
Some other things we can do: what if the machine you're interested in is halfway around the world? No problem there; run Shark on that machine, put it in network mode, and you can point another Shark at it from a different computer using Bonjour or TCP/IP. This also allows you to simultaneously profile multiple machines, say in a parallel cluster; you can have Shark running on all the different machines and change the remote configurations on the fly. It's great for games, headless servers, parallel clusters, that sort of thing. We're all sadly familiar with the spinning pizza of death.
Why does it happen, and how do you get rid of it? A lot of times that's hard to answer. The combination of time profile (all thread states) and unresponsive application profiling is here to solve that question for you. Anytime the spinning pizza of death pops up, Shark will start taking a session for you. A couple of awesome features: you can give it application name filters, with partial application name matching, or a minimum hang threshold; say you're only interested in hangs over a second.
What if you've got it down to one function or one hot loop in your program? That's cool too; go in and add a couple of lines of API to your code, and then you can start and stop Shark directly from within your application. There's a full example of this in the Shark manual, or you can find us in the lab.
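From memory, the pattern looks roughly like the sketch below; the exact CHUD call names are an assumption on my part and should be checked against the Shark manual, and run_hot_loop is just a stand-in for your own code.

    #include <CHUD/CHUD.h>   /* link against CHUD.framework */

    static void run_hot_loop(void) { /* the code you want to sample */ }

    /* Sketch of programmatic start/stop, loosely following the pattern
       in the Shark manual; verify the call names there before relying
       on them. */
    void profile_hot_loop(void)
    {
        chudInitialize();
        chudAcquireRemoteAccess();

        chudStartRemotePerfMonitor("hot-loop");   /* Shark starts sampling here */
        run_hot_loop();
        chudStopRemotePerfMonitor();              /* ...and stops here */

        chudReleaseRemoteAccess();
    }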
This is really great for really fine-grained control of sampling, and it's got an easy API. The only downside here is you have to link with the CHUD framework, so it's not always the best for release builds. Similarly, we have the same kind of API on the command line: put Shark in remote mode and run these from the command line; it's good for scripting stuff in remote logins. And what if you don't have any UI? There's also a command-line version of Shark.
Quickly, since we're running out of time, I want to talk about something we introduced last year called the windowed time facility. Your normal profiling: you start, you stop, you get all the samples in between; it's great for normal stuff. But what happens if you don't know when your problem is going to occur? To that end we've made the windowed time facility. It allows you to press start, pick a sampling window, and forget; go about trying to reproduce your bug, and as soon as it happens, wait until it's done, press stop, and Shark gives you that last window of samples.
This is selectable with time profile; it's marked in the config menu as you see here, and it works with any of the other methods that we've talked about before. So, in summary: Shark is really easy; it allows you to quickly identify exactly what code you need to optimize. There are many different options available for different types of tasks, including hardware performance counters, and as you've just seen, there are many different ways to trigger and target these operations. For more information, send us an e-mail.