iPhone • 1:06:16
Shark is a powerful, low-overhead tool for collecting accurate information about the performance of your Mac or iPhone application. Move beyond the basics to discover how to interpret what Shark reveals about the inner workings of your application and to identify and fix difficult performance problems.
Speaker: Wade Tregaskis
Unlisted on Apple Developer site
Transcript
This transcript has potential transcription errors. We are working on an improved version.
My name is Wade Tregaskis. I work in the Developer Tools team, working on performance tools and primarily Shark, which is of course the subject of today's talk. And how you can use that on Mac and iPhone applications. So this is the bouquet of things, hopefully you're all here for at least one of these things. If you're not, you should be, but last chance to run off to a different session. And I'm not going to go through them, it's pretty self-explanatory. Let's go straight to what Shark can do, how Shark can address these needs. Shark has a bunch of different tools.
We're going to talk about only a few today, but we're going to cover the main ones. The first is of course time profiling. This simply shows you where your CPU time is going when your programs are executing. It's that simple. It's the most general tool. It applies in pretty much all situations. It's usually where you start profiling.
And it's going to be the focus of the first part of our presentation today. In the second part of our presentation, we'll look at system trace, which is also a very useful tool generally, but it particularly excels at multithreading problems, locking, anything you can kind of come across dealing with lots of threads and system interactions.
System calls, VM faults, all those sorts of things. Towards the end of our talk, we'll just briefly touch on another two tools which together are basically about showing how you're using the memory in the machine, how you're using the processor cache, whether you're fitting inside it effectively. And how you're using the processor bus, whether you're bandwidth limited for example. Now the last tool that I want to mention is a tool by itself that you can run on your binary, but it's actually more than that.
It's throughout Shark, it's a static analysis engine. This looks at the machine code of your program and does some calculations and looks at it, looks for common problems, patterns that we recognize, things we know about. And it'll provide you with things like these tips, which will tell you hey, there's something here, maybe you want to look at it, think about it.
And often it'll even tell you what you can do to work around that or try something else that might be faster. So that all said, let's look at just a few things we should do before we actually go to the demo and start firing up Shark. First up, debug your applications. It's of course a pain to be trying to profile something that's crashing or just generally misbehaving.
But the more important thing is that if your program is not actually doing what it's supposed to be doing, if it's not finished, it's not quite, you know doing exactly what it'll do when you ship, then profiling it could be pointless or worse it could be detrimental because what you see and what you optimize may not actually be meaningful, it may not be something that's a problem when you finish the program.
So just debug it first, get it doing what it's supposed to be doing and then start worrying about performance and optimization. That all said and done, when you do go to profile it, you need to ensure that you're profiling the release build. You want to have all your optimizations on, any additional compile settings, flags, whatever it is that you do before you ship your application to the end user.
Again, you want to be profiling what the end user is seeing. With two exceptions. Note here: if you've turned on -fomit-frame-pointer (and if you don't know what that is, don't worry about it, it's off by default), you want to create a separate configuration for Shark that simply turns that one setting alone off.
Simply because although it's a valid optimization, the tradeoff is that it prevents collection of call stacks, which is a real problem when you're trying to analyze the performance. Secondly, you need to ensure that Shark can see your source files and your symbol information, so you'll be able to see your function names and whatnot. Now again, if you're using Xcode and you've created a project file with Xcode 3 or later, then this is automatic. You don't have to worry about this.
However, if you're using an Xcode project file from a previous version of Xcode, or just for whatever reason you're not using DWARF with dSYM as your debug format, you'll need to go in and tweak the settings appropriately to make sure you're not stripping your symbol information, you're actually generating debug info, that sort of thing. Now with that all done, you can then fire up Shark and start profiling.
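To make those build-setting notes concrete, here's a sketch of the kind of overrides a separate profiling configuration might contain. These are standard Xcode build settings, but treat the particular combination as an illustration rather than a prescription:

    // Hypothetical "Shark" configuration, based on your Release settings
    OTHER_CFLAGS = $(inherited) -fno-omit-frame-pointer   // keep frame pointers so Shark can collect call stacks
    DEBUG_INFORMATION_FORMAT = dwarf-with-dsym            // DWARF with dSYM, so symbol info stays available
    COPY_PHASE_STRIP = NO                                 // don't strip the symbols you just generated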
But although you can just profile sort of speculatively and just start looking around, the best way to do it is to have in mind what you want. And that really boils down to some kind of metric, something that you want to improve. This could be frames per second, time it takes to complete some task, throughput. Whatever is appropriate for your program and whatever seems to be the performance issue of course.
Now having done that, you then of course want to take a measurement, a baseline that you can compare against after you make changes to ensure not just that you did actually make an improvement, but sort of quantify that and get an idea of your progress. And this is really to put it simply, this is about coming up with a benchmark. Something that you can run reliably and reproducibly on your application to get numbers after each change.
And I just want to point out here, although we're not going to go into how to do it in this session because we don't have time, you can automate Shark in a wide variety of ways. So if you come up with a benchmark that's a shell script or an AppleScript or some kind of, you know, automated benchmark, you can actually have Shark collect profiles automatically as part of that script. If you're interested in that, check out the Shark manual. There's a whole section on it. But I'm afraid we don't have time for it today. But let's go straight to the demo machine.
[ Background noise ]
So I'm going to launch Shark. Now when you first open Shark, this is what you get. This is the main window. This is where you can start and stop Shark control profiling using this button. Choose which tool you want to use, which configuration. And this is where you choose the target of your profiling, either the whole system or a particular process or you know you can launch your process, specify command line arguments, all that sort of thing.
Now, if I wanted to for example profile mdworker, I'd be set to go now. I just click start and off I go. In this case though for the first demo, what I'm going to do is profile a little iPhone application. Now I apologize that I don't have a projector here.
This is what it looks like, if you're at the back. It looks more interesting on the phone, which has a microphone of course. It's just displaying the waveform. So that's going to be my demo app. Now to profile that, I go to the Sampling menu and choose Network/iPhone Profiling.
Now it disables the local settings because these apply only to the current machine. Instead I have a list of all the connected iPhones and any Macs that are on the network, running Shark in remote mode. And I can just click use on whichever one or more of those I want to control.
It launches Shark on the device, gets it ready for profiling and when it does that, in a second, I didn't appease the demo gods apparently.
[ Period of silence ]
[ Laughter ]
[ Period of silence ]
[ Period of silence ]
[ Period of silence ]
Oh, it figures. All right, this is why we take sessions ahead of time, so we can just skip that whole step. Anyway, imagine that this was working, you'd see the same thing here that you see up here. Under the configuration you can choose which tool you want to use. Under the target you could choose which process or the whole system.
And then you could still use the start and stop button to then control profiling. So let me go grab a session that I took earlier. Now this is a session I took on an iPhone 3GS. And we probably won't have time, but I want to point out just while this is loading, oops, when I relaunch Shark, that the 3GS has a different processor. It has more memory. It's a very different phone. So if you are doing phone stuff and you're worried about performance, you really want to grab a 3GS phone. Oh, come on.
[ Period of silence ]
Ah.
[ Period of silence ]
[ Laughter ]
Ok, so this is a session that I took earlier on the one in a million chance that this would go wrong. And this is something like what you're going to see when you first profile your application.
So this is, this is ready. We can now go through and we can start looking at this, seeing what it's telling us, start optimizing. But before we do, I want to jump back to the slides because I want to go through what time profiling is, just really quickly. Just so you understand what it is that you're seeing and where this data is coming from.
So we jump back to the slides. So the way time profiling works is Shark comes along periodically, every millisecond, ten milliseconds, something like that. And it takes a sample. And in a nutshell a sample is just the call stacks of the target threads at that moment in time. Now this is quite similar to GDB for example, if you stop your program periodically to see what it's doing and look at the back trace.
However, a little different. Shark will also record any kernel stacks or driver stacks. This is time spent in the kernel or the drivers directly on behalf of your program. So that's something you'll see that you might not be used to. But more than that, what makes this a profiling tool is that we take all these call stacks and we combine them into what we call a call tree.
And this is basically a union merge of all the call stacks. And this lets you see at a higher level where you're spending your time, what functions are executing while you're profiling and of course how you got to them through the call tree. Now along with that of course, we record some statistics based on how many times we see that function while we're profiling. Now this is probably pretty similar or pretty familiar I should say to a lot of people if you've used other profiling tools, CPU sampler, tools in other platforms.
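In case the idea of a union merge is unfamiliar, here's a minimal sketch in C of folding sampled call stacks into a call tree; it's just an illustration of the concept, not Shark's actual implementation:

    #include <stdlib.h>
    #include <string.h>

    typedef struct Node {
        const char  *symbol;          /* function name for this frame        */
        unsigned     samples;         /* how many samples included this frame */
        struct Node *child, *sibling; /* first child and next sibling        */
    } Node;

    /* Merge one sampled call stack (outermost caller first) into the tree. */
    static void add_stack(Node *root, const char *stack[], int depth)
    {
        Node *parent = root;
        for (int i = 0; i < depth; i++) {
            Node *n = parent->child;
            while (n && strcmp(n->symbol, stack[i]) != 0)
                n = n->sibling;
            if (!n) {                          /* first time this frame appears here */
                n = calloc(1, sizeof *n);
                if (!n)
                    return;                    /* out of memory: give up on this stack */
                n->symbol = stack[i];
                n->sibling = parent->child;
                parent->child = n;
            }
            n->samples++;                      /* contributes to this node's total */
            parent = n;
        }
    }

    /* e.g. add_stack(&tree, (const char *[]){ "main", "prepare", "malloc" }, 3); */

Each node's sample count, divided by the total number of samples taken, is the kind of percentage you see in these columns.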
But one thing you may not be familiar with is these two different numbers, these two different columns we have here, self and total. Now this is showing slightly different things that are useful in different contexts. So let me just go over what they are, just to make it clear. Because it's not necessarily intuitive.
Self includes only the time in the function itself, not including any functions that it calls, just in that function itself. Total on the other hand includes not just the time self, but also all that time in the sub functions. So let me just go through just an example, just to make this a bit more concrete.
If this is our sample code that we took this profile from, then if you're looking at the prepare function here, the 20% of self time corresponds roughly to this loop here. This is inside prepare itself doing math or something, whatever it's doing. The other 20% of prepare's total time is actually in the malloc function that it calls.
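In source form, the slide's example looks roughly like this toy reconstruction (not the actual demo code): the malloc call contributes only to prepare's total time, while the loop contributes to both its self and total time.

    #include <stdlib.h>

    double *prepare(size_t count)
    {
        double *buf = malloc(count * sizeof *buf);   /* shows up under total only       */
        if (!buf)
            return NULL;
        for (size_t i = 0; i < count; i++)           /* shows up under self (and total) */
            buf[i] = (double)i * 0.5;
        return buf;
    }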
So that corresponds to this bit of code just here. Now I wanted to point this out really clearly because it's, as I said, it's really important to understand what the difference is because you use these different columns in different contexts. And sometimes you use them together to nut out some really tricky problems. Now that all said, that's pretty much how time profiling works. But there are some caveats, now the most important thing to understand about time profiling in Shark is that it's not a trace.
It is not a function trace. If you used other tools, they might present data that looks very similar, but they could be using function tracing. It's a different method. And there's tradeoffs, there's pros and cons to each one. But what this means really in Shark is that for example if you have this execution here on the top, the top one here, that's what your program is really doing over time. Now the white lines we're using to denote where Shark comes in and takes a sample. What Shark is going to see as a result is actually this timeline on the bottom. Now that looks different obviously. So why is that?
What is happening here? In the first case, you've got these functions that are running for a relatively brief period. So they actually fall between two sample points. So Shark just doesn't see them. It doesn't know they existed, it doesn't account for them at all. On the other hand, you have these functions here on the other side, which may be quite short functions, may have only taken microseconds, yet they just happen to fall coincidentally on a sample point.
Now Shark can only assume when it sees them that they were running for the entire interval and that's why it spreads them out like that. So in this case, it's over accounting for them. Now the good thing is of course that these things are complete opposites and when you take a number of samples, they actually average out. And this is the basis of you know statistical sampling.
If you get enough samples, a statistically significant number of samples, then it's still showing you what's actually a very accurate picture of what's going on. Nonetheless, there is you know one down side that we just can't avoid with doing this statistically as opposed to actual function tracing. And that is that we don't know in Shark how often you call a function, how many times.
We can tell you how long you spend in it in total but not whether that was in all one call or spread out over a million calls. And that sort of thing can be very useful, but it's not something that Shark's time profiling will tell you. So, just something to be aware of.
The second thing, the second caveat is that time profiling, it's nice and simple, it's very general, but it's really just showing you where you're spending your time. It's not going to necessarily tell you straight out why you're spending your time there, why your code is slow, why your algorithm, you know is doing what it's doing. So of course knowing where you're spending your time, that's half the battle.
But the other half is interpreting those results and knowing how to fix them. And that just comes down to experience and a bit of wisdom and having dealt with similar problems before. So we can't instill that in you in just one short session, but in the demos we'll run through a bunch of common problems that you might see and also hopefully give you a kick start, let you know what you're seeing in a typical time profile. And along those lines, I just want to lay down some rules of thumb.
These aren't without exception, but they're general things to keep in mind, sort of goals to have when you're time profiling. The first one is just intuitive. You want to make any given process, any given task take less time. That's you know obvious. An extension of that, if you're going for throughput in particular, you want to be using the CPUs as much as possible to get as much work as physically possible done.
Then you want to avoid blocking. Blocking is where you are waiting on something, so you're sitting there, you're not using the CPU. It's time wasted. Now a third rule and this is a little more nuanced, it's really important to prioritize the changes you make, the optimizations you attempt based not just on how much time is going into a particular part of the code, but how much effort it's going to take to optimize that.
Meaning if you get, if it's going to take you a month to optimize one bit of code, but in that same month you could optimize ten other bits of code that all told will have a greater benefit, then obviously you do the ten little bits of code. So you know fairly intuitive, but it's something that takes a bit of practice to recognize different types of problems, know which ones are going to take the most amount of time and which ones are likely to give you the most benefit.
But with these rules in mind, now let's go back to the demo machine. And let's start looking at the results that we talked about just a moment ago. Or at least that I took a while ago, but let's pretend that they came from a moment ago. So this is our call tree, this is actually looking at the call tree from the bottom up.
So we're seeing all of the leaf functions in the tree, which is where we're actually sitting there spinning and spending our time. And if we expand these, we can see how we got there. So drawView calls drawOscilloscope and we can go right back up the tree, all the way right back to our main function where everything originates. Now, let's look at this particular example. So we've got drawOscilloscope, that's taking up nearly 17% of our time. So that's obviously something that we want to look at, that's right up here at the top.
But above that we have this gliDestroyContext. Now this is part of the GL engine. I don't particularly know what it is, I'm not an OpenGL expert. But it's taking up 20% of my time, so I really want to know what's going on. Now if I look at what called it, supposedly it called itself.
Seems strange, and it called itself again and again. I'm pretty sure this function's not recursive. And obviously this is a bit of a canned example because I know that this function is not recursive. So this is clearly not accurate. Now, in situations like this where I'm a bit suspect about what I'm seeing, a really good way to actually verify this is to go over to the advanced settings drawer here.
Now if this isn't visible, you can bring it up by going to the Window menu and choosing Show Advanced Settings. And I'm going to turn on show symbol length. Now this adds a length column and this is showing me the length in bytes of the actual machine code for this function. Not the source code, but the machine code. So this is telling me that gliDestroyContext is 1.1 megabytes for one function.
Yeah. I'm going to call shenanigans on that. I don't believe it for one second. I happen to know from digging into this deeper, so this is just a tip-off that this is bogus, this is completely wrong, because this library doesn't export all of its symbol information.
So Shark looks at it, it sees a sample in it and it just picks the nearest function that it can see. Now in this case, because there's so little symbol info, it's picked one that's probably completely off. It's probably not calling this function at all. So this is pretty useless. We can't see what's going on inside this library. The time spent is accurate. But this call stack is meaningless. Now that's a bummer. That's you know not great.
But in any case, this is a system library. This is something that you can't even look at the source code, let alone make any changes to. So there's not necessarily a lot of merit in being able to see what's going on inside it anyway. What we really care about is how we're calling it. What API we're using, because that's something that we can change in our application to optimize.
So what I'm going to do is make use of Shark's data mining. Now there's a lot of different things you can do, little data mining things. Now in this case, what I want to do is charge, sorry I should say this is a contextual menu obviously so I just right click on this, on the library or the function name. In this case, I want to charge this library to its cause.
Now what that means is it's going to take all the samples that we're seeing in this library and instead account for them in whatever other library or whatever other function in our code actually called into there. So when I do that, you can see that there's no more GL engine entries here and that function is gone. Instead, we have this new glDrawArrays which wasn't even visible before, and our drawOscilloscope method is now showing that it's taking 17% when we account for that GL engine time.
Now in this case I can see that glDrawArrays, see, it reckons it's 36 bytes long, that's much more plausible, so I'm going to believe this. But again, in a similar sort of way, I can't do much about this. I don't really care that glDrawArrays is taking 19% itself. I care more about the fact that I'm calling it.
So what I'm going to do is charge this again and now you can see that this drawOscilloscope function, that is taking up 40% of my time. So clearly that's where I was calling glDrawArrays. And now I've sort of narrowed it down: of the code that I can change, 40% of my time is spent here.
This is obviously the place I'm going to look first to really try and optimize. So let's optimize it. How do we know what's going on? Why is this slow? So I'm going to double-click on it. And this takes me to the source of the function. And you can see that the source is annotated, it's highlighted, showing me the breakdown of which lines within that function are taking the most time. Now these percentages here are of the total for just this one function. So if 37% of this function's time is going into glDrawArrays, we kind of already figured that.
But here we have 38% of our time on the conditional part of a for loop. Now that should strike most people here as pretty weird, I would imagine, because you don't normally expect your for loop's condition to take time. You'd expect this math to take time. This is doing some floating-point calculations. And we see that it's taking some time but not nearly as much as this.
So what's going on here? In this case, just looking at the source isn't enough. I don't, I can't tell just from looking at this what is going on. So what I'm going to do is dig down another, another level and actually look at the assembly as well. You can see that the highlighted line here corresponds to these highlighted symbols here. If I choose a different line, it highlights different ones. So this is our loop here, this indented block just here.
And this is the condition part. And you can see now that there's one instruction here that is taking up nearly 36% of this function's time. Now, this may be a bit scary because this is all strange mnemonics, and lots of these I don't even know, I don't know ARM assembly myself to be honest. So I don't expect you to, and the good thing is you don't necessarily need to. In this case, I was able to figure out what was going on without ever looking up an ARM reference manual.
I looked at this instruction which I guessed was some kind of branch instruction, presumably this is what loops back to the start of our loop here. And I saw over here that Shark had quite a few things to say about this, this little loop. These exclamation marks are Shark's static analysis engine that we talked about before, telling me that it's noticed something. Something that might be interesting.
So for this particular problem, I noticed these ones just here; it's not quite on this line, but it's nearby. So I click on that. And don't worry about reading through the details of this, this is just specific to this exact problem. But the important thing is that this is telling me that this takes a minimum of 20 cycles to complete.
So this is probably accounting for some of that time. There's like a delay here after we execute this before we can execute this. Even more useful, even more telling was when I expand this so you can see, this stall thing, you'll see a lot of these throughout your code.
They're not necessarily bad but they are when you get to big numbers in the, you know double digits here. So this is telling me that before this instruction could execute, every single time through the loop, it had to wait 51 clock cycles. Now that's 51 clock cycles where nothing else was happening. It's just wasted time. The reason for this is very esoteric and it's related through the particular chip that's in the 3GS. And I'm not going to go through the details here.
But suffice to say this indicates to me why it's happening, what's going on at least in a broad sense and if I go through and read this in detail, I can actually dig down and I figured out what's going on. Again it's specific to this architecture and I was able to figure out how to fix it. Now I don't want to go through the details of how to fix this one problem. But that's the general idea.
You find out what functions you're spending most of your time in and then you figure out why by looking at the source code and, if necessary, going right down to the assembly. I'll point out though, just before I leave this, if you do need to know what the assembly means, what these mnemonics are, you can highlight the instruction you're interested in and click the ASM Help button down here. And this pops up the instruction set reference manual for that instruction.
[ Applause ]
So, in fact, perhaps you've realized, this is actually new in Shark, the ARM instruction set reference. We've had the Intel and PowerPC ones for a while. So that's a good way to just sort of ease yourself into it, if you have to. Alright, so let's assume that we are either going to optimize that or we did optimize it or whatever. But we no longer want to look at drawOscilloscope. We want to look at the next biggest thing, the next hot spot.
What I'm going to do is use data mining again, this time I'm going to remove call stacks with that function in them. This is going to remove those entirely, just like charging did except it doesn't bother to account for them anywhere, they're just filtered out completely. So when I do that, that function disappears and all our percentages sort of just scale up to fill in that space.
And now I can move on to the next thing. Now, having said that, the next thing is this unknown library. Now you may see this when Shark doesn't have symbol information, that's what it means, and it may be that Shark couldn't find the symbols for your application or something like that.
However, on the phone, there's one particular library as such that you don't have symbol information for at all and that's the kernel and all the drivers. So I can see here I have symbols for my application, I have symbols for all these other system libraries. It's a pretty good guess, a pretty good bet that this is the kernel time that I talked about. And if you expand this call tree, it's just unknowns all the way down.
And then its totals. So that's not useful to us. It would be nice to be able to see that, but we can't. So I'm pointing it out simply so when you see this in your own profiles, you know kind of what it is and that there's not a lot you can do about it. Just ignore it; it's not necessarily code you can change directly anyway. But you can influence it by looking at other things. And there's some other tools actually that we'll use later that can help you out with this. Now, so ignoring that one. You see that there's a whole bunch of other things here that we're doing.
There's a lot of audio sort of stuff happening here. And going through every single one of these, one by one, looking at the source, it's one way to do it. It's a little tedious though. I don't want to have to analyze every single function. Instead I could step back up through and see how I got there, where I'm calling this from, try and get kind of a bigger picture. But a much simpler way to do that is to just switch over to tree view.
So I go down to the popup menu here and choose tree view. This is viewing the exact same call tree except now I'm looking at it from the top down. So I'm starting at the, the main function or your pthread body and drilling down through all the functions progressively. And if I fold some of these up, this gives me a much better high-level idea of what's going on.
So for algorithms that encompass more than just one function, this is a great way to see how much the total time is all told.
[ Period of silence ]
And I mean that's pretty much it. That's, that's how you use heavy view, how you use tree view, you've seen how to drill down into things. I want to move on to a second demo application we have.
Now this is called Image Aware. It's a demo app you may have seen in some other sessions. We've been bandying it around a bit this year. If you haven't seen it, let me just run it once. All it does is it has a bunch of PDF files on disk, which it reads in, it goes through every single page of them, generates a little thumbnail for them and then sticks them into this view. That's all it does. So this is something that you would think would be very parallelizable. You could make it concurrent and we tried a few different methods.
Ultimately settling on GCD, of course, the new and shiny. Now, notice that this is taking about 6 seconds with just one thread. That's the original single threaded implementation. So GCD, this is a Mac Pro by the way, top of the line Mac Pro. So 16 virtual cores. So this should be 16 times faster, right?
I wouldn't have asked the question if it was going to be, would I? 2.2 seconds, so it's over twice as fast. It's an improvement. It was definitely a big improvement, definitely worth doing, but it's not 16 times. And this is our motivation for profiling it now. We want to know why we didn't get nearly as much speed up as we expected.
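For reference, the shape of the GCD version is roughly the sketch below; render_thumbnail_for_page is a hypothetical stand-in, and this is not Image Aware's actual source:

    #include <dispatch/dispatch.h>
    #include <stddef.h>

    extern void render_thumbnail_for_page(size_t page);

    void make_thumbnails(size_t page_count)
    {
        dispatch_queue_t queue =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* One block per page; GCD fans the blocks out across the available
           cores, and dispatch_apply waits until they have all completed.   */
        dispatch_apply(page_count, queue, ^(size_t page) {
            render_thumbnail_for_page(page);
        });
    }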
So what I'm going to do again is I'm going to take a time profile of this, just so I can see where our time is going. And I'm going to use the keyboard shortcut for this, Option-Esc, to start and stop profiling, just so that I can use the mouse instead to control this application at the same time. So I'll start profiling, start it running, let it run to completion.
My bad, I missed a step. I would like to profile my application, not mdworker, in this case. Try that again.
[ Period of silence ]
Alright. So now Shark will do its analysis, pull in all the symbol info, all that kind of stuff. And here's our session. So let me just switch back to heavy view initially. Now the process is the same as before.
You can go through this and look at different things. But I'm picking this particular example because it's a very interesting one and this is a very common thing that you'll see, especially as you go into multithreaded applications. Now we've got some time here spent in inflate_fast and this lex scan thing. This is actually reading the PDF, decompressing it. This is good in a sense. This is actually doing the work we intend to do. And we've got some more stuff down here and whatever.
But of course the biggest two things are these two here. Now these are kernel functions, they're not anything that we're calling intentionally. And I want to explain what they are because you'll see them sometimes. This ml_set_interrupts_enabled function typically indicates that you're context switching a lot. And context switching isn't necessarily bad, but it has overhead. So if you do it a lot, that overhead adds up and it slows you down. So that's the first hint that something might be going on that we could optimize.
The second thing is this vm_map stuff. Now this is virtual memory, VM. It's to do with VM faults. But this is not particularly helpful because this is the end of the call stack, this is all kernel stuff. So alright, we've got 10% of our time spent here, but time profiles aren't really telling us a whole lot. Now this is actually leading on to system trace. But before we go to that, there is something else we can do just with time profiling and I want to take a step back to explain why I decided to do it before I do.
I look down here at this number here, this is how many samples Shark took while it was profiling. So this is a 16-core machine; by default Shark samples 1,000 times a second, so if we're running flat out, using the CPUs to the limit, we should get 16,000 samples a second. This profile was for about two or three seconds, so I'd expect at least 50,000 samples. Instead we're getting 14,000. This simply means that we're not actually running for most of the time.
A normal time profile only records a call stack, a sample, if the thread is running. If it's idle, if it's blocked, if it's doing nothing, then it doesn't record it. This is telling me that while this is valid, this is what we're doing when we are running, our biggest problem here is that we're not running most of the time. So for that, I instead want to do a variation of the basic time profile called Time Profile (All Thread States).
Now this produces data that looks much the same. I'm just going to record it while I talk. Except that it records whether we're running or not. Oops. Ignore that. And this shows us something very different. And this is actually much more useful because now we're seeing over all our total time, not just the time we are running, but when we were blocked, and if I can find it here, here's our inflate methods that we saw before. They were 6% of the time profile, but in the grand scheme of things, in wall clock time if you like, they're a percent, they're insignificant. They're completely dwarfed by these first five functions here. Now these are all system things. They're not things you generally call directly.
But I want to explain all five of them because again these are things that you're going to see in your code and it's really helpful to know what they mean because they each indicate particular types of problems. The first one is workq_kernreturn. If I pull this back, there's not a lot to it. This is a workqueue thread, as you might find underlying things like GCD and certain other things. It's used a lot throughout the system.
They tend to pop up automatically, system libraries use them a lot. workq_kernreturn is a system call that these threads kind of sit in when they're waiting for more work. So this is telling me that 40% or 35% of my time is spent waiting for something to do. So that's the first indication of a problem, that we're not perhaps enqueueing enough work or we have some kind of bottleneck that's actually preventing us from getting a full 16 cores' worth of things to do. This next thing, semaphore_wait_signal_trap, if I dig down, is perhaps more self-explanatory.
This is called from within the locking functions in the pthreads library. And it basically says that there's lock contention. Multiple threads are trying to take the same lock at the same time. And all the time spent in here is waiting for that lock to be released by another thread so we can move forward. Again this is dead time. This means that we're waiting, we're doing nothing useful. And this is bad. You pretty much never want to see this in your programs.
This indicates lock contention, which we don't want. And it's just wasted time. Similarly this one is a slightly different pthread, kind of mutual exclusion thing, pthread condition variables. Same deal. It represents time that we're waiting for that condition variable to signal us. We don't really want to be spending any time in that.
kevent, much the same kind of thing. Again you can probably guess that this underlies GCD as well. This is again, it's sitting in the kernel when it's doing this, waiting for an event to come in. So this is idle time. And there's another semaphore wait function. So I think you get the idea at this point.
All of these things are telling me that most of our time we're waiting and, more importantly, most of our time we're waiting on a lock or a pthread condition variable. So I would say, at a guess, that the big problem with this program is lock contention. Now in this time profile, I can drill back through these functions to see what's using that lock and in this case, it's this rendering library where most of the time is. Now the time profile here is useful. I can drill all the way back to my own code, right down here. So you can see this is, you know, somewhere underneath the code that actually creates the pages in memory.
And I can perhaps optimize this. I can look at what I'm doing here, try and find a different API, maybe there's a more efficient way to do this. However, this is kind of a limitation much like it doesn't tell me how many times I call a function. Time profiler doesn't tell me how many times I try and take a lock. It's not telling me if I try to take this lock once and it's simply blocked for seconds at a time. Or if I tried to take it a million times and it was actually only blocking for a microsecond, but all those microseconds added up.
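To make that concrete, here's a contrived sketch, not the demo's code, of the kind of pattern that produces these samples: many threads hammering one shared lock. A time profile shows you the waiting, but not whether it came from a few long waits or millions of tiny ones.

    #include <pthread.h>

    #define NTHREADS 16

    static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long   shared_total;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&shared_lock);    /* threads pile up waiting here */
            shared_total++;                      /* tiny critical section        */
            pthread_mutex_unlock(&shared_lock);
        }
        return arg;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }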
So while I could go ahead right now and start optimizing this or trying to optimize it, what I would instead do is go over to system trace because it might show me something more indicative, more useful, giving a better idea of what I need to optimize. Now to do that, to talk to you about system trace, I want to introduce Eric Miller, one of my coworkers, worked on Shark. And we'll just jump back to the slides, so he can take you through what System Trace is and what the results mean.
Eric.
[ Applause ]
Thanks Wade, appreciate it. Is this on? Yeah, I'm just going to dive right into system trace. I've got plenty of time. So, System Trace is all about what's going on in the system with your threads. Which threads are running, where they're running, when they're running, what core they're running on.
And in particular, what priority they have over time and why they change course or why they don't change course. Now there's several ways to review that data. And some of them are tables much like system, much like the time profile. However the easiest way to look at this is the System Trace timeline. In the System Trace timeline, each row in the table represents one of the threads in the system.
Each of the colored blocks represents some amount of time a thread was running on a particular core. In this particular view, the colored blocks represent the different processors that a given thread was running on over time. There are four kinds of events, major kinds of events that System Trace can show you. The first of which, oh I skipped ahead of myself. With regard to this stuff, try to keep your threads running on their cores as long as possible. The nice long blue line would be an excellent example of a long tenure for a thread.
And et cetera. So when we, when we're working with an application we want a lot of throughput, we want to have the long tenure on our core for our thread. If you have an application on the other hand that is meant to be responsive to external input, you would want your threads to block and have short tenures when they're waiting for input and then have a, a tenure that's just long enough to get the work done before the next input comes.
So, like I mentioned, there's these four kinds of events, the first of which is system calls. System calls are basically requests from your source code to the kernel to do work on your behalf or to make resources available. Some of the things that you might do quite regularly that are system calls are read calls and select calls and IPC operations and the Mach messages which are pervasive throughout Mac OS X. System calls underlie locks, so that pthread_mutex_lock you saw in Wade's trace is a form of a system call.
And because these system calls underlie a number of resources and you have to go and use protected things in the kernel, you can definitely incur what we call a context switch, which is to say your thread can leave the core it was running on, be blocked or set aside by the scheduler while your system call is serviced.
So when things block, they can serialize your otherwise parallelized thread program. When we look at system calls, there are a number of little icons that'll show up in the timeline. The most common are the BFD system calls which are the red telephones. The blue telephones are mock messages and the little happy Macintosh lock icons will show you lock contention.
When you make one of those pthread_mutex_lock calls or unlock calls. And the thing that's interesting about system calls, as we said, is they can block, so when that blocking shows up in the trace timeline, you'll see the little black underbar leaving the icon on the left and proceeding to the right. What that means is that this is time that that system call is blocking the progress of the thread, and some other thread is executing, most likely in the kernel, on behalf of your program.
So with regard to these system calls, what we want to do is try not to use them in time critical areas and there are a number of slow and blocking system calls. So we want to know what these are, learn them through experience. They change with the platform you're on.
Some platforms are better at servicing some system calls than others. So it may vary from, say, a MacBook Pro or an iPhone-type device or a Mac Pro. Now the other type of System Trace event is a virtual memory fault, as Wade alluded to. These occur when you have new pages of memory being accessed by your program. There's a couple of ways that these come into play.
When you allocate a block of memory, the first use of a new allocation will cause that, pages of that data, 4 kilobytes at a time, to be brought into your application space. If you share memory with another thread or with another process, the Virtual Memory Manager will make a copy of a page for you and bring it into your space so that you can use that one and not overwrite the other thread's access to the same memory space. And again, one page at a time. So there's, there's one called a pagein and a pageout.
And these are the most painful virtual memory faults because what this means is physical memory is over committed. And the Virtual Memory Manager has to take some pages from an application, write them out to the hard disk and bring in pages from another application that comes to the foreground. Virtual memory faults look like this.
They look like little pages. And they have different little icons. The most obvious one here is the little pink zeroes. The zeroes are what's called a zero page fill operation. Most of the time when you allocate a new piece of memory and begin to use it, the Virtual Memory Manager will zero out that page. That takes a little bit of time, but you certainly want a clean slate to work with.
You see little check boxes. The little check page is a page that was in the Virtual Memory Manager's data cache but not in your process yet. So bringing that page back to your process is a page cache hit. And of course there's a page cache miss as well. The third type of virtual memory fault we see here is a little black-and-white spotted page. That's a copy-on-write.
[ Background noise ]
Thank you.
[ Laughter ]
So, virtual memory rules of thumb. Try to minimize the amount of memory your application uses, to minimize the number of page faults you take. I suppose that's fairly obvious.
A good way to avoid this is to reuse memory allocations if you can, allocate a page of memory, a block of memory and use it for work and then recycle that same block to use again for more work as opposed to releasing it and allocating yet another new block of memory.
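As a minimal sketch of that advice, with process_item as a hypothetical stand-in for the real work: fault the buffer's pages in once and recycle the buffer, instead of paying zero-fill faults on a fresh allocation for every item.

    #include <stdlib.h>

    extern void process_item(unsigned char *buf, size_t size, int item);

    void process_all(int item_count, size_t buf_size)
    {
        unsigned char *buf = malloc(buf_size);   /* pages are faulted in on first use   */
        if (!buf)
            return;
        for (int i = 0; i < item_count; i++)
            process_item(buf, buf_size, i);      /* the same pages are reused each time */
        free(buf);
    }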
And also, for audio and video type applications, anything that's very sensitive to time, preallocate memory or even wire pages down. The fourth type of event that I'm going to talk about is the interrupt. Interrupts are signals to the processor cores from other devices in the system, including other processor cores. They're very fast. But they can take some time if you have heavy access to a device, like your network interface or your disk.
PCI devices also generate lots of interrupts. The most popular of which is the graphics processor which sends lots of interrupts to their host core for service. The cores themselves actually use interrupts to help the scheduling software manage thread tenures on each core, every 10 milliseconds a timer is fired on every core in the system.
The interrupts look like little alarm bells. And there's not a lot of data about them other than they occurred. You typically find them at the beginning and end of tenures of threads, as the system has to change what it's doing when interrupts happen. So with regard to interrupts, if you've never heard of interrupts, they probably won't impact your day much.
But if you're a device driver writer or other sort of developer of a KEXT for software purposes, your device driver KEXT is going to generate interrupts to get service from the core. So you want to try to minimize those and kind of plan them strategically around what you're trying to get accomplished. So what I'd like to do now is show a demo. And hopefully Wade used up all the demo demons. Let's see. Letter C. See, get the right mouse here. Alright, let's put that away. I'm going to restart Shark, just for safety purposes.
[ Background noise ]
Although it may help, it may or may not help. So I'm going to switch over to System Trace. What I want to show you with System Trace is the mini configuration editor. The mini configuration editor for any of the configurations will show you the popular controls that you might want to modify in order to collect the information you're interested in. You can set a start delay to prevent Shark from starting before you're ready, so Shark actually starts after a delay of some seconds. You can set a time limit; here it says 10 seconds.
I'm going to change that to 2. And the sample limit, that is a per-core limit; System Trace information is collected on every core. And with 16 cores, you're looking at 16 million events collected, which could take a while to process. And while I like a coffee break as much as the next guy, not all the time. One other thing I'm going to do is activate signposts, which is a fifth type of System Trace kernel event that you can record.
But signposts are different in that I inject them into the program on purpose so that I can see them in the time line. And I'll describe those later. So let me start the System Trace and start the program in GCD mode and let it run and grab, I grabbed some samples. And I only grabbed a few, but that should be sufficient.
[ Period of silence ]
So we see here, this is the summary view in system trace. And the first thing to look at is the little pie chart in the corner. And the little table that goes with it. That gray area in the pie chart is the entire system as the name suggests. 93% idle with 16 cores. And we look, there's a few things going, even Shark took a little bit of time. We see this table here has an average, a minimum and maximum. This is the length of time each thread in the process spent on a core.
Averaging 45.7 microseconds out of a possible 10 milliseconds. A minimum of 883 nanoseconds with a maximum, we got a couple of good ones, 13.8 milliseconds. So that's interesting, but it's not particularly illustrative. So I'm going to switch over to the timeline as I said which is the most popular way to view the application's results in system trace. So it's on all processes, so I'm just going to switch to image aware to kind of clean things up a bit.
Now having done that, let me just grow this. Wade mentioned the start_wqthread threads. And as we said, we use Grand Central Dispatch in order to achieve, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 working threads. And then there's this one up here.
This was the master thread that spawned the GCD queues. This is the graphics thread. This one is the thread that refreshes the drawing of the images in the view that you saw, those pages popping up. And lastly, the extra thread that is spawned to monitor some of the work we're doing. And of course if you wrote this application, you would know which threads were which. Interestingly, there wasn't a lot happening because I kind of started Shark and then I pushed the button.
But what you saw in those sample images from the timeline view were very thickly colored bars. And here they're not; there's a few thick bits, but for the most part it's very sparse. Very sparse indeed. Now by default, System Trace comes up and just shows you all the tenures in one color. So we can enable thread coloring as you saw in those samples, and now we can taste the rainbow. We do have 16 cores, all pretty. So that's why threading can be relaxing.
[ Laughter ]
Maybe the only aspect. You can see that the top, the top bar was, had some thick bits, but for the most part we have this very faint pastel looking aspect here. So I'm going to zoom in on this by using the mouse and just dragging a thin little sliver.
Then this little sliver will now fill this, the timeline. See the little gray lines going in between everything? The little gray lines are context switch marks which means that this little bit of, of execution was time switched out and we can trace it down and it picked up here at this little bit.
You can see that's, it is quite confusing because it's very busy. And as Wade alluded to, there's something going on here. I'm going to zoom in a little more, let me grab another sliver. Now I'm grabbing a very small sliver. Let me point this out. This fat section here actually has a tenure of only 2.7 milliseconds.
So that gives you an idea of the scale of these little slivers of execution. This sliver here is 7.6, or 35.3 microseconds. And what I'm doing is I'm reading the results here in this little info tip, total time, 35.3 microseconds. Priority, idle, reason blocked. There's another way to visualize that. We were coloring by CPU.
If we color by reason why each of these threads is being context switched away, everything turns hot pink. Everything's blocked with a few blue exceptions which is an explicit yield, which generally means that that thread was at the end of a quantum and had to be switched off the processor anyway. So let me just go back to the CPU and again I'm going to zoom in a bit more.
We'll just use this section here because we can get the control thread as well. OK, now we're getting somewhere. We have a lot more white space than we have colored space. So at this point, I'm going to turn on the system call icons we were talking about. Remember we talked about, we showed these earlier.
You can see here in the top thread, lots of system calls. Now when you click on a system call, you're going to get the call stack for that system call. Each of these events is generated by the kernel in order. So this is a trace. It's very precise.
So we can see, we started down here in our application layer, in ImageAwareController's createImageFromPDF method, and we work all the way back up to the same thing Wade showed, these font operations while we're trying to create a PDF. And down at the bottom of this, you can see the total time for the system call, 5.9 microseconds. And then there's arguments to the system calls that are meaningful in some cases, although not this case.
Locks, here's the lock contention. For the same reason, we get to a certain point trying to work with the fonts to create a PDF, and to create glyphs actually, letters in the PDF. It appears that each letter has to take a lock. So something's not quite right with that. But it's what we have to work with. So now we're pretty sure of the reason we have to waste so much time even though we've got 16 cores.
Now I just turned on the signposts. I mentioned that there was a fifth type of event that I can inject, and I did. So we can read these signposts. And before we go on, I'm going to click this just to show you, it has information. This information says arg1, 2, 3, 4, 5.
These arguments are numbers that I injected into the system call. The arg2 in particular is the particular instance of a call to, oh I missed it, get over one more. There's so many things here, I really need to zoom in some more. OK, there.
So these are arguments. Here is my signpost, which is called addItem. Every time one of the documents showed up in the UI, I have a signpost that tells me when that happened. So this is 22, 23, and so on. Now, I kind of knew where to drill down because I put the signposts in, and the way to find them without just guessing is to use the trace tab, which shows you all the events in order as they were collected. And then when you inject signposts, there'll be a signpost tab. Don't worry about these gray lines. That's because I only took a partial trace and so some of them are incomplete.
It's just telling me that. So here's dispatchBlock and addItem. Those are the two signposts I put in. So I'm going to double-click on dispatchBlock and somewhere in this page, Shark has marked that system call for me. I think. Let me go back one more. We'll do it again.
[ Period of silence ]
There he is. dispatchBlock, signpost number 39. This tells me where it came from, which is just at the very beginning of a GCD dispatch queue, and it has a tenure; remember I mentioned the little black lines that go underneath? Let me turn off system calls for a moment.
So this guy has a tenure of 10.6 milliseconds. So the dispatched block began and ran along for 10.6 milliseconds. And in that time you can see all the little bits that occurred. So what we know now is that the problem with our Image Aware application is the fact that drawing all those little glyphs in the PDFs we're processing is quite expensive. So we aren't getting our full time share for each of these threads. We're just getting bits.
We're getting microseconds. And that is the reason why we have the problem we do with our Image Aware program. But it's not something we can fix. This is a system library. So we had to contact the guys that developed that library and maintain it, and we'll be working with them to speed this up in the future.
Alright, so what I would like to do is show another application running and grab a System Trace from one of our tried and true demos from the past. This here is Noble Ape and these are apes running around on an island, simulating ape thoughts. And by default you can see what we get, let me set this benchmark up, it's the single-threaded scalar version: 20,008 thoughts per second.
But if I go ahead and thread this and optimize it: if I optimize it, now it goes way up to a million and change, and then if I thread that on top of that, it's vastly improved. So if I System Trace this guy, I'm going to make this a little bit longer.
[ Period of silence ]
Alright. When we look at the system when it's running, we see that in this case we have 69% usage and Noble Ape itself is that 69% usage. Noble Ape has many threads. And these are done by hand with pthreads. So there are 16 of those and then the main thread.
Let's just take a quick look at that timeline. And you can see there's a rainbow here, but the tenures of each thread are much thicker. And we go in and we see a very nice regimented flow where each of the slave threads works for a time and then stops and the master control thread down here does its work.
So once again we turn on the system calls, and we see that at the end of each of these there's, you know, a lock taken, and that lock tenure runs over here until it starts again, because the command thread told it to via these system calls here. So this again is a very regular, very regimented worker relationship, a master/slave relationship. Which is what we want to try to achieve.
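That hand-off can be as simple as a condition variable that the master signals each round; here's a minimal sketch of the pattern, with do_slice_of_work as a hypothetical stand-in, not Noble Ape's actual source:

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  go   = PTHREAD_COND_INITIALIZER;
    static unsigned        round_number;          /* bumped by the master each round */

    extern void do_slice_of_work(void);

    static void *worker(void *arg)
    {
        unsigned seen = 0;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (round_number == seen)          /* blocked here between rounds */
                pthread_cond_wait(&go, &lock);
            seen = round_number;
            pthread_mutex_unlock(&lock);
            do_slice_of_work();                   /* the long tenure on the core */
        }
        return arg;
    }

    void master_start_round(void)
    {
        pthread_mutex_lock(&lock);
        round_number++;                           /* publish a new round of work  */
        pthread_cond_broadcast(&go);              /* wake all the workers at once */
        pthread_mutex_unlock(&lock);
    }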
Since we had total control over all this code, we actually didn't have to spend much time worrying about the various effects of our various system calls, because we implemented them all ourselves. So with that I want to go back to the slides. I think that's right. Did I get it?
Ah, sweet. And talk about those signposts. So here's Wade's demonstration software and we want to add signposts to it. And we add a couple of #include files to get the macros and constants that we need. And then we can make a system call with the number SYS_kdebug_trace as the first argument.
The second argument is the APPSDBG_CODE macro. And what's important to notice here is the little kPrepare constant. That's just a number, like 42. That number is used by the APPSDBG_CODE macro; if it has a DBG_FUNC_START modifier and a DBG_FUNC_END modifier, those are used by Shark to place a signpost and then draw the tenure between start and end.
The other kind of signpost you can make is a point signpost, which you saw in my addItem. It just simply has a different number than the tenured signposts. Then you can see, again, the DBG_FUNC_START and DBG_FUNC_END I mentioned. There are those arguments at the bottom of the system calls.
In the case of signposts, you as the developer can insert four 32-bit values into the signpost. In this example, in the middle, we put the units, the increment, the average of the results, and the contents of the particular result we were working on in that loop iteration. So that's in user space, like I did in the Image Aware application.
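Putting that together, a minimal user-space sketch might look like the following; the kPrepare and kAddItem codes, the argument names, and the process_one_image function are made up for this example, and the DBG_MACH_CHUD subclass is an assumption, so check the Shark User Guide for the exact code class to use:

    #include <unistd.h>        /* syscall() */
    #include <sys/syscall.h>   /* SYS_kdebug_trace */
    #include <sys/kdebug.h>    /* APPSDBG_CODE, DBG_FUNC_START/END/NONE */

    enum { kPrepare = 42, kAddItem = 43 };   /* arbitrary codes you choose */

    static void process_one_image(unsigned units, unsigned increment,
                                  unsigned average, unsigned result)
    {
        /* Tenured signpost: Shark draws a bar from START to END on this thread. */
        syscall(SYS_kdebug_trace,
                APPSDBG_CODE(DBG_MACH_CHUD, kPrepare) | DBG_FUNC_START,
                units, increment, average, result);

        /* ... the work you want to see in the System Trace ... */

        syscall(SYS_kdebug_trace,
                APPSDBG_CODE(DBG_MACH_CHUD, kPrepare) | DBG_FUNC_END,
                units, increment, average, result);

        /* Point signpost: a single marker in the timeline, no tenure. */
        syscall(SYS_kdebug_trace,
                APPSDBG_CODE(DBG_MACH_CHUD, kAddItem) | DBG_FUNC_NONE,
                0, 0, 0, 0);
    }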
If you're a kernel developer, you can just invoke the KERNEL_DEBUG_CONSTANT macro directly. Again you see the APPSDBG_CODE macro with its kPrepare code and DBG_FUNC_START and DBG_FUNC_END. But what I really want to point out is the arguments: where you had four in user space, there are five in the kernel, and the fifth one is always zero.
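In a kernel extension the equivalent sketch, using the same hypothetical kPrepare code, would look roughly like this:

    #include <sys/types.h>
    #include <sys/kdebug.h>

    enum { kPrepare = 42 };    /* same hypothetical code as the user-space example */

    static void prepare_in_kernel(uintptr_t units, uintptr_t increment,
                                  uintptr_t average, uintptr_t result)
    {
        /* Five arguments after the code; the fifth is always zero. */
        KERNEL_DEBUG_CONSTANT(APPSDBG_CODE(DBG_MACH_CHUD, kPrepare) | DBG_FUNC_START,
                              units, increment, average, result, 0);

        /* ... kernel-side work being measured ... */

        KERNEL_DEBUG_CONSTANT(APPSDBG_CODE(DBG_MACH_CHUD, kPrepare) | DBG_FUNC_END,
                              units, increment, average, result, 0);
    }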
New in Snow Leopard, you can actually add signposts from your DTrace scripts by invoking the core profile function. This will give you...
[ Cheering ]
[ Laughter ]
You're welcome. This will give you the same results in the System Trace as you would get from user space or a kernel extension, so that you can relate what DTrace is collecting with what Shark collects.
So instead of the DBG_FUNC_START and DBG_FUNC_END constants, you use Core Profile Signpost Start and Core Profile Signpost End. The third type, which isn't on the slide, is Core Profile Signpost Point. Other than that, the arguments are the same, and the four user arguments are optional in DTrace scripts.
They'll be set to zero if you don't supply them. So that's how signposts are implemented in your code, but Shark can't quite see them yet; Shark needs to know what those codes mean when they come in. So we create a signpost name file at this location in the user home directory. You can name that file anything you want; in fact, you can have multiple files. Each file has a very simple syntax: the hexadecimal number you used in your code, followed by the name you want that signpost to have in Shark's display.
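For example, a hypothetical name file pairing the codes above with display names might contain lines like the following (the hex code first, then the name; check the Shark User Guide for the exact file location and format):

    0x2A  Prepare Image
    0x2B  Add Item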
That's about all I had to say about System Tracing, so we'll briefly go over the other two profiling types that Wade mentioned: bandwidth and L2 Cache Miss profiling. Processor bandwidth profiling basically measures how much you are utilizing the processor's front-side bus.
Memory bandwidth is not usually as high as front-side bus bandwidth, primarily because of the frequency of the memory versus the frequency of the front-side bus. You will also find that the maximum bandwidth you can achieve varies by platform; obviously a 16-core Mac Pro has much more bandwidth than a 2-core laptop. And you can't really just take the bytes per cache line and multiply by the frequency of the bus. That would be the electrical maximum, but you won't be able to reach it in practice.
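To make that concrete with a hypothetical example: a 64-bit-wide front-side bus running at 1.6 gigatransfers per second could move at most 8 bytes times 1.6 billion transfers, roughly 12.8 gigabytes per second, and that electrical figure is the ceiling you are discounting from.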
All those interrupts and timer pops and that sort of thing will limit it. So a couple of things to remember about processor bandwidth: Intel machines are typically capable of about 90% of that theoretical limit you can calculate. And oftentimes your application will only use a few hundred megabytes per second of bandwidth. That can indicate one of two things.
The first is that your application is not bandwidth limited; the second is that your application is limited in some other way that's preventing it from achieving bandwidth. So I just have a couple of pictures from a bandwidth profile. This is Image Aware running on a single thread, and if you look closely you can see that the first hard line is 500 megabytes per second.
Most of the samples were below that. Whereas after we used GCD on it and threaded it, we were able to increase the average bandwidth to over a gigabyte per second, with some good spikes into the 2500 megabytes per second range and a couple of spikes way up at 4 gigabytes a second. Interestingly, there are some gaps in the bandwidth, but you can see they kind of line up; that's just the nature of that application. In an application like this, bigger is better in terms of bandwidth.
Now, I showed you the Noble Ape System Trace really so I could show you this slide, which is the bandwidth from single-threaded Noble Ape, running around 250 megabytes a second. Whereas once we optimized and threaded Noble Ape, you can see we've got some nice spikes that use most of the bandwidth, up around 5 gigabytes a second, and the average is up around 2 gigabytes a second.
So you can see that's one of the reasons Noble Ape goes so much faster, like 40 times faster: it has about 12 times as much bandwidth on average as it does when it's just a single thread. L2 Cache Miss Profiling is the last thing we'd like to talk about briefly, and it's very similar in appearance to time profiling.
But instead of showing you where you spend time on the processor, it shows you where you miss the processor cache. And of course missing the cache means you have to go out to memory, fetch that cache line, and bring it back. There are a couple of common reasons for that.
One of the most common is that the data layout in your application is inefficient or incompatible with the amount of cache a processor core has access to, or perhaps the pattern in which you access that data is inefficient. Time profiles won't really show this; what they tend to show is time spent in your code that's a little bit mysterious. It looks like your code is executing, albeit slowly.
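As a made-up illustration of an inefficient access pattern (the grid array and the two functions are hypothetical, not from the demo): the same summation can either stride across a row-major array, missing the cache on nearly every access, or walk it in memory order and reuse each cache line it fetches.

    enum { N = 4096 };
    static float grid[N][N];          /* laid out row by row in memory */

    float sum_column_first(void)      /* cache-hostile: large strides */
    {
        float sum = 0.0f;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += grid[row][col];    /* jumps N * sizeof(float) bytes per step */
        return sum;
    }

    float sum_row_first(void)         /* cache-friendly: sequential accesses */
    {
        float sum = 0.0f;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += grid[row][col];    /* walks one cache line at a time */
        return sum;
    }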
However, the tooltips in Shark can definitely help rule that out, or point you in the direction of taking an L2 Cache Miss Profile. The most important thing is to keep your working set within the cache. I can't stress this enough, and I'll put my own personal spin on it: when I try to fit a working set in cache, I budget for half of the reported cache size, because most modern systems have two cores sharing the L2 cache. So if you have a threaded application, your own threads will be contending for cache lines.
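A minimal sketch of that rule of thumb, assuming the hw.l2cachesize sysctl key and a made-up working_set_budget helper:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/sysctl.h>

    /* Budget a per-thread working set at half the reported L2 cache,
       since two cores typically share it. */
    size_t working_set_budget(void)
    {
        int64_t l2 = 0;
        size_t len = sizeof(l2);
        if (sysctlbyname("hw.l2cachesize", &l2, &len, NULL, 0) != 0 || l2 <= 0)
            return 256 * 1024;    /* conservative fallback if the query fails */
        return (size_t)(l2 / 2);
    }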
You don't want the working set of each thread to be the whole cache, otherwise the threads will simply evict each other constantly and battle for use of the cache; the whole point is to keep things in cache. Lastly, just a brief note: garbage collection is an automated process that comes in and cleans up unused data. It arrives sort of unannounced, and it can evict data from the cache and cause your application to miss once collection is complete. A couple of quick images. Here is the Image Aware program in GCD mode.
What we see at the top of the screen is a big chunk, almost 40% of the samples in our code, from the L2 Cache Miss Profile. And when we look down below, we see that darn ml_set_interrupts_enabled, which, as Wade said, is an indication that you're spending time in the kernel.
And he knows that because Shark already told him, ahead of time, that this function disables or enables interrupts, et cetera, et cetera. The last sentence of that tip says that to get a better idea of what's happening, try using System Trace. So Shark will point you in the right direction. On the other hand, if you look at a time profile of the same application, you see slightly different numbers: 51% in our code and only 9% in the interrupt function.
But that's where time was spent, as opposed to where the cache was missed, and you can see that cache misses make up the bulk of the problem in our method. So with that I'm just going to wrap things up. Time Profiling is very useful in general, and System Trace is really good for these application types: multithreaded, I/O-heavy, and real-time.
I was going to make a joke about hyphens, but they left some out on the slides: multithreaded, I/O-heavy. If it has a hyphen in it, use System Trace. Bandwidth and Cache Miss profiles, we saw those; they'll help you characterize your data sets and see whether you're on the right track in terms of using the bandwidth and cache available in the system. And Shark does a lot more than we've talked about today.
I can't stress enough that you should take a look through our wonderful user guide. Lastly, I just want to point out that we're all here to help you. Mike Jurewitz is our evangelist and he's always online. You can reach Wade and me and the other developers at [email protected]. And the documentation is available in the Help menu in Shark and online at developer.apple.com.