Development • 36:47
Want to get the best performance out of your Mac OS X application? This session is all about performance: How to think about it, How to measure it, and How to improve it. We go in depth on how developers can use Shark—one of the performance tools that ships with Xcode—to identify performance problems related to memory bandwidth and algorithmic issues, and we describe some new features that make Shark an even more powerful performance tool.
Speakers: Nathan Slingerland, Sanjay Patel
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Ladies and gentlemen, please welcome performance engineer Nathan Slingerland. [Transcript missing] ...we've got in Shark 4. And then, of course, some of you may have seen the demo on Monday in the DevTools keynote; we want to go a lot more in depth on what we did there to optimize Celestia using Shark.
So let's go. So what is Shark? Well, it's been called Apple's best-kept secret, but not anymore. Starting with this year's WWDC, we've had a big push for it, and we want you all to know that this is a great way to find and fix your performance problems.
So, what is Shark? Well, it's a really easy-to-use profiling tool. There's a GUI version, as well as a command line version. The command line version is very useful for scripting performance regression testing, so you can archive your performance analyses, review them later, and see how you've made progress as you continually optimize your application.
And it runs on Panther and Tiger, so that means, yes, you can use it on your development machines right now. And of course we support basically any kind of executable: Mach-O or CFM, compatible with all the compilers that you can use on Mac OS X, and both the xSYM and STABS debug formats. So whatever you can throw at it, Shark can show you what's going on.
So let's see how easy it is. What do you do to get started with Shark? It's just one click. There's just this main window with one button: Start and Stop. I mean, this is the typical way that you start profiling and stop it with Shark, and you're on your way. By default, we select a time profile, but we also supply a rich set of other preset configurations. You can also create your own configurations for doing customized performance analysis.
By default, we're going to profile the entire system, but some profiling configs perhaps make more sense to target a specific process. So, for example, malloc tracing, which we'll talk more about later, you want to pick a target process. Or static analysis, you want to pick a particular executable file to look at. But for the most part, you don't even need to go here. You can just click the start button and go. So once you've selected your configuration or just left it at the default, probably you're going to do a time profile, right? So what is that?
This is the most common workflow. Basically, your time is limited, so when you're optimizing application performance, you want to spend it where you can affect performance the most: whatever functions, or whatever pieces of your code, are taking the most wall-clock time.
For time profiling, we're using a statistical form of sampling: we take, by default, a thousand samples per second, and each sample contains a backtrace (how you got to a particular location in your code), plus which process, which thread, et cetera. We collect all that information, and when we combine it, we can create a profile. This is going to show you, over the sampling period, where you were spending time in a given process or thread, or overall in the system.
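Conceptually, each sample carries something like the record below; this is just an illustrative sketch of the information described above, not Shark's actual internal format.

    // Illustrative sketch only; Shark's real sample layout is not public.
    #include <cstdint>
    #include <vector>

    struct TimeProfileSample {
        uint64_t timestamp;               // when the sample was taken
        int32_t  processID;               // which process was running
        uint64_t threadID;                // which thread within that process
        std::vector<uint64_t> backtrace;  // return addresses: how execution got here
    };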
And we're capturing everything. So this includes any drivers, any kernel code that's executing, and applications. So no matter what kind of developer you are, this can help you profile your code. And we're able to do this because we also have a kernel extension, right? So we're sitting in the kernel capturing these samples, and this gives us very low overhead. We hardly perturb the code or the system at all.
So once you've hit start, run whatever it is you're interested in, and hit stop, you've got a profile, and you're presented with this by default: the heavy view. And this is really the bottom line. It's showing you, from the top, the most sampled to the least sampled function in the selected process. By default, we select the heaviest process on the system, so most of the time this is your code, the application you're trying to look at.
So the heavy view showed you where the hotspots are. The tree view is going to show you where the hot code paths are: how did you get to those important functions? That's just this tree button here at the bottom. And new in Shark 4, you can show the heavy and tree views simultaneously, so you can use one to help you navigate the other.
So those are two ways we use very commonly to look at high-level problems in our code. The third way is the enhanced chart view. So this chart view has been there a while, but it's been greatly enhanced in Shark 4. Basically, what we're looking at is call stack depth—that's on the y-axis—versus time on the x-axis. And this can give you a really good idea of a couple things. First of all, it can show you patterns in your code, right?
So patterns in the execution of your code. And this can be good for spotting repetitive behavior, and we'll talk more about that later in relation to malloc tracing and function tracing, which we'll get to. It's also a good way to look for function call overhead: if you see thin, narrow spikes in this chart view, that usually indicates you're spending a lot of time in call overhead, because you had to call many function levels deep, but you're not actually doing all that much work once you get there. We see this a lot in modern, object-oriented, layered code, where everything is hidden in layers.
And so sometimes that can hurt performance. As for the colors here: blue is user space. You may see a few maroon-colored stacks; those are kernel. And the yellow call stack is what's selected; that's what's being shown at the right-hand side there, the current call stack. Lightness doesn't mean anything except to separate layers: as you see these gradients in color, each color change indicates a new function vertically.
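As a rough illustration of that thin-spike pattern, here is a hypothetical example (not from any particular framework) where each layer just forwards to the next, so the call stack gets deep while very little real work happens at the bottom:

    // Hypothetical layered wrappers: most of the time goes into call/return
    // overhead rather than into table[i] itself, which shows up in the chart
    // view as tall, narrow spikes.
    double fetchRaw(const double *table, int i)     { return table[i]; }
    double fetchChecked(const double *table, int i) { return fetchRaw(table, i); }
    double fetchLogged(const double *table, int i)  { return fetchChecked(table, i); }
    double fetchValue(const double *table, int i)   { return fetchLogged(table, i); }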
So those are the three main tools that Shark provides for looking at your application. But a lot of you are probably working on some pretty sophisticated code bases with some complex function call behavior, and you really want to understand the performance of that application.
So, Shark 4 has introduced a lot of new features to try and help with high-level analysis and understanding these more complex application architectures. The first is color by library: it colors each different module that your program is using (different frameworks or libraries) a different color. It remembers the colors, and they're consistent between the different views. This is a really quick visual way to see where you're spending time in which modules.
Okay, so here's another common situation you'll probably see in your code. This is a tree view, and you may have an important intermediate function, an interior function like B in this example, that shows up in two different spots in the tree, right?
And you'd really like to know how much time in total is being spent in B and its children, because we tend to subdivide work. In this example, B calls C1, C2, and C3, and not a whole lot of time is being spent in B itself. So what we've added is a canonical view. Normally, in the tree view, we see main calls foo, and foo calls bar.
Well, with the canonical view, we're going to see every symbol that shows up, including all the interior symbols, along with its total contribution. So in this case, B and what it calls total up to 100 percent of the time being spent. This is a good way to understand how much time you're spending in a function and its children.
So that kind of raises this question, right? Traditional profiling tools tend to look at performance the same way that the program is broken down: into functions. And that works pretty well when you have a few functions where you spend a lot of time, right? A few heavy functions.
But if you have just a lot of functions in your code, you've subdivided the work a lot and you're reusing code everywhere, or you're spending a lot of time in system libraries, that really isn't the best presentation. And that's what data mining allows you to do. It's going to allow you to customize how you're looking at a profile. And by customizing it, you can help ease your understanding of it.
So Shark provides a couple of basic data mining operations. The first is exclude symbol or library. And what this is gonna do, it's gonna hide a symbol or library that you select. And so we're not just taking the samples out. It's important to remember this. We're not just removing these samples, right?
So if you hide the function foo, any samples that landed in foo are just assigned to whatever called it, right? So if you have a piece of code that calls square root, for example, and you exclude square root, it's going to assign that time to your function, the caller. And really, time spent in our own code is a lot easier for most of us to understand, because we typically can't do much about system libraries.
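For example, in a hypothetical function like this, most of the samples would actually land inside sqrt in libSystem; excluding libSystem charges that time back to vectorLength, which is the code you can actually change:

    #include <cmath>

    // Hypothetical example: with libSystem excluded, the samples that landed
    // in sqrt() are attributed to vectorLength() instead.
    double vectorLength(double x, double y, double z)
    {
        return std::sqrt(x * x + y * y + z * z);
    }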
So, along those same lines is Flatten Library. We are familiar with the pieces of AppKit or Core Foundation or the other system libraries that we call, those function entry points. But a lot of times you'll see interior functions inside of those libraries that you really don't know much about, and you really don't care much about the implementation interior to that.
So that's what Flatten Library does: it hides everything inside of a library except for the entry points. You're just going to see all the time show up in the entry point, which is what you understand, right, because that's what you called from your code.
And finally, Focus Symbol. If you typically navigate very deep call trees, or if you know you have a certain symbol that you only want to look at and what it calls, that's what Focus does. It's going to just start the display of the various call paths from that symbol. So these are different ways to really customize how you're looking at the profile. And internally at Apple, we've found this very helpful in looking at performance analysis for Tiger.
So those are the detailed data mining operations. There are more macro-level operations that are also very useful. Flatten Recursion: you've probably seen this if you've ever looked at a profile. You see a recursive call, maybe in NSView or some other code, or even your own code. And seeing each level of that recursion separately really isn't the most informative way to look at the function, right?
What you'd actually like is to see all those recursive calls collapsed into one, with all the time together and everything the recursive calls actually do collapsed together. So this Flatten Recursion checkbox is in the Advanced Settings drawer that we've been looking at. You check this checkbox, and all the recursive calls anywhere in the profile just get collapsed into one.
So we talked about flattening a single library. Flatten System Libraries will flatten all of the system libraries that occur in the profile. So this is just a really quick way to hide all that interior code that you don't really want to see. Another very useful one is Exclude Entries Without Source Info.
Usually when you're working on an application on your system, you've turned on debugging information, because you want to see symbols in Shark and you want to get all the source information. Well, the good news is that your application is almost the only thing on the system with that information in it. So if you exclude everything without that information, you're going to see only your code: it performs that exclude operation on any symbol that doesn't have source information associated with it. So again, this just focuses in on what you know, your source code.
I encourage you to play with all these different operations. You'll see very quickly what they do and how they can help you. Finally, Exclude Supervisor. A lot of us aren't very familiar with the kernel and what it's doing inside, and how did I get these red samples in my profile? This is just a quick way to get rid of those samples and keep them from being a distraction while you're looking at your code.
Okay, so those are the high-level optimization tools that Shark provides. Once you've kind of gotten those out of the way, you should be able to at least understand your code at a level where you want to optimize a specific function, right? So that's where Shark shines for function-level optimization.
The most basic thing that Shark does for you is it shows you your source code. If you double-click on any symbol in those profile views that we talked about, heavy or tree, it's going to show the source code. And what it's doing here is coloring the source according to how many samples landed on a particular source line. So the brighter the yellow, the more samples landed there.
It's a really quick visual way to see where you're spending time in your source. You'll also note that the coloring in the background of the scroll bar matches the coloring in the background of the table. So this is a very easy way to navigate quickly between hotspots in your code.
And finally, to make the profile and optimization cycle complete, there's an edit button. And this is going to show your source code inside of Xcode. It's going to bring you back to Xcode. So the typical optimization cycle, right? You profile with Shark, you go to the code browser, you see a problem, you click "Edit," go back to Xcode, recompile, and so on. So it's this iterative process where you continually go back and optimize it again and again until you're happy with the code.
Another great thing that Shark will do for you is it will provide you with optimization tips. So if you see one of these exclamation points in the code browser, in your code, that means Shark has a tip for you. So it has an idea about what might be wrong or why you might be spending extra time on that particular line of source. And so if you click on the exclamation point, it will bring up this help balloon.
And finally, in Shark 4, we've added this ability to look at both source and assembly side by side, right? So now, rather than having to jump back and forth or trying to figure out what the compiler generated for a particular line of source code, when you navigate around in the source viewer, you'll see very quickly, OK, the highlighted line, if you select here, we have a line of source selected on the left. And there are maybe three or four instructions there that are selected on the right. And those are the instructions that go with that source line. So this is a real quick way to see what instructions were being generated.
Okay, well that's great, but I don't know any PowerPC assembly. Well, here you go, right? There's this online help. It will follow along as you navigate through the assembly, and it'll tell you in English, you know, here's what this instruction does. And, you know, assembly language is just like any other programming language. It has these simple operations, and you can read about it here in this online help guide.
[Transcript missing]
That's time profiling. Those are the tools that we use with time profiling. But you'll see that we also use these same tools over and over, but with more sophisticated types of profiling, which Shark also provides. So in Shark 4, we added malloc tracing. And what this is doing is every time an allocation happens in a target program, it's going to take a sample. It's going to remember the call stack and also how much memory was allocated.
So it's not shown here, but you can look at this in the profile view and say, "Okay, this function over the sampling runtime allocated this much memory." And that's useful for tracking down, you know, overall how much memory was allocated in various functions. But another interesting thing to do is to use the chart view to see patterns in the program behavior, because this is not a statistical profile. This is an exact trace of the malloc functions. And most interesting program behavior does some sort of allocation.
You can see patterns in your code, in the execution of your code. So here, what we're looking at is the Dock as we mouse over it and it magnifies different icons. We're looking at three icons here, right? And so you can see this repeated behavior. Now sometimes, like in the case of the Dock, maybe we expect that; we expect repetition. But this can also be a good clue when you're doing extra operations that aren't necessary. So malloc tracing is more general than just looking at where memory was allocated.
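A hedged example of the kind of unnecessary repetition malloc tracing can expose; this is hypothetical code, not the Dock's. Allocating and freeing a scratch buffer on every iteration shows up as a perfectly regular pattern in the chart view, and hoisting the allocation out of the loop removes it:

    #include <cstdlib>
    #include <cstddef>

    void processFrames(size_t frameCount, size_t frameSize)
    {
        // Pattern malloc tracing would reveal: one malloc/free pair per frame.
        // for (size_t i = 0; i < frameCount; ++i) {
        //     unsigned char *scratch = (unsigned char *)malloc(frameSize);
        //     ... work on frame i using scratch ...
        //     free(scratch);
        // }

        // Better: allocate the scratch buffer once, outside the loop.
        unsigned char *scratch = (unsigned char *)malloc(frameSize);
        for (size_t i = 0; i < frameCount; ++i) {
            (void)scratch;  // placeholder for the real per-frame work
        }
        free(scratch);
    }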
A more general form of malloc tracing is function tracing. Rather than just tracing the allocation functions, you can trace any functions in your code. And again, this is an exact trace. There are some predefined presets for file I/O APIs, typical network APIs, and locking APIs, but you can also add your own functions, right?
So if you have a set of functions that you want to watch in your code and you want to measure exact function count and the amount of time that the function took, this is a great way to do it. You just attach to your application and select the functions that you want to watch. And this is measuring each function call's time very accurately with the PowerPC decrementer, but without having to write any instrumentation code in your application.
And finally, event profiling. For you long-time Shark users, we've had this for a while, but this is also a very good way to correlate hardware and operating system events, so things like cache misses in the hardware or virtual memory faults in the operating system, with your code. Here, in this example, we have L1 and L2 data cache misses that we're event profiling along with our normal time profile.
We see two columns have been added to the profile for each of those counts, L1 and L2 cache misses. Also, in the chart view, we can see there's a chart above, so you can track the L2 cache miss over time. How is that changing over time in your application? And there are literally hundreds and hundreds of these performance events available.
Finally in Shark 4, Java profiling. Everything that you could do with CFM and Mach-O applications is also available for Java now: time profiling, allocation tracing, and function tracing. If you double-click on a Java function and the source code is available, it's going to show you the Java source in the code browser.
So those are the built-in configurations that we've been talking about. You can create your own using this configuration editor. One of the hard things in Shark 3 was finding where performance counter events lived: which PMC, which performance monitor counter, does an event that counts cache misses live on?
Well, you'd have to either go to the processor user guide and look it up, or you'd just kind of blindly search around. Well, now there's a text search for the performance event. So you can just type in cache miss or cache, and it'll come up with a list of which PMCs have an event with that substring in it.
Okay, in Shark 4, we've also added network profiling. This uses Rendezvous. You can set up your Xserve in a closet, or you have an HPC cluster, and you want to monitor what's going on on a machine that doesn't have a monitor, keyboard, and mouse in front of it.
Well, this will let you monitor one or more of those machines that you've set it up on, so you don't have to sit in front of them. It works very similarly to Xcode's distributed builds: you pick the target machines and hit start and stop on your client machine. So let's have a demo of that. Can we switch to the demo machine, please?
Great. All right, so over here I have a machine—this is Sanjay's machine that's running some code on it—but I'll launch Shark here, right? And I'll go to the network profiling window, and I'll say, "Control network profiling of shared computers." And there I can say, "Oh, somebody has shared this WWDC6. Okay, let's use that."
And it's running something over here. Let's hit start. So again, Shark is very simple. Just hit the start button. Stop. And now we're processing on the other machine here. This is the profile taken from this machine. We can see, "Oh, that's Celestia. I wonder what that is." Maybe Sanjay can tell us.
Sanjay Patel, can we have you come up here and explain it? So, for those of you here on Monday, you probably saw our Celestia demo. We'd like to show you a few more details about how we optimize that code. So, if we go back to slides for just a second.
Okay, so Celestia is this open source program we found on the web, and it's really cool. It's a 3D space simulation of all the large bodies in the universe. For more information, you can check them out on this web link. They actually use real data from Pasadena down at JPL to calculate where the planets and the stars are—how they're orbiting. So let's go back to the demo machine.
And so this is Celestia's main window. You can see it's hovering around the earth right now. And we wanted to find out how this thing performs on Mac OS X. So the first thing you have to do if you're going to try to optimize performance is put down a good metric. In our case, that would be how long it takes using a timer, as you can see in the corner, to run our script. And so right now, we're running and timing, so we can bring up Shark.
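A minimal sketch of that kind of wall-clock metric, assuming you roll your own timer with mach_absolute_time; Celestia's on-screen timer may well be implemented differently:

    #include <mach/mach_time.h>
    #include <cstdint>
    #include <cstdio>

    // Measure elapsed wall-clock time around a workload.
    void timeWorkload(void (*runScript)(void))
    {
        mach_timebase_info_data_t tb;
        mach_timebase_info(&tb);              // conversion factor from ticks to nanoseconds

        uint64_t start = mach_absolute_time();
        runScript();                          // the workload being measured
        uint64_t end = mach_absolute_time();

        double seconds = (double)(end - start) * tb.numer / tb.denom / 1e9;
        printf("script took %.2f seconds\n", seconds);
    }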
And you can see, as Nathan said, Shark is pretty simple. There's a start button in the middle here. We can choose our configuration, but by default, most of us are going to use a time profile. Let's just hit the start button. So right now we're actually taking a thousand samples per second of whatever's running on the system beyond just Celestia. If you hit stop, Shark is now processing the samples it took. And it's going to show us by default what we call the heavy view. So this is a list in order of importance of the functions that actually ran on this system.
Now, at the bottom here you can see Celestia's taking up less than half of the processor cycles. And if you pop that open, you'll notice the first thing is the Mach kernel. New for Shark 4, we actually filter out the idle thread, so you won't land there by default. That's a nice change for old Shark fans. And you might wonder: why is Celestia getting less than half the CPU cycles on this system? If you pop open the thread pop-up, you'll see that it's a single-threaded app, and we're running on a dual G5, so it can't possibly take all of the CPU cycles on the machine.
Now another way to look at this code is, as Nathan mentioned, the tree view. So this is the traditional top-down view of code: you start at start or main, and you work your way down to the leaf functions. This is great for understanding the structure of your code: how did you get there? And new for Shark 4, we can view both heavy and tree simultaneously, so you can get a really good feel for the behavior of your program. And of course, as Nathan mentioned, we have the new chart view.
So this is interesting. We have a dual processor system, so we get two charts. And on the x-axis, we have time for each CPU. Now, we know that Celestia's a single-threaded program, so let's just look at its one thread in one chart. And new for Shark 4, we have this zoom slider. This was a big upgrade compared to previous zooming.
And you can go back and forth. The nice part is if you click on one of these call stacks, you'll get the backtrace for that point in time. So you can see chronologically how your program behaved over time. And you'll notice that there are some of these maroon or red samples.
Those are calls into supervisor space. And one thing I noticed was, for the Celestia profile, we had a lot of these calls into the supervisor space. It turns out we're trying to draw about a thousand times per second. And for those of us who are familiar with graphics applications, you don't really need to try to draw that often because you can't. So we limited it to 30 frames per second. And that's what I call Warp 2. We can go to that.
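A hedged sketch of that kind of frame limiter, not Celestia's actual change: skip the redraw unless roughly a thirtieth of a second has passed since the last one.

    #include <sys/time.h>

    // Returns true when it's time to draw another frame (about 30 fps).
    static bool shouldDrawFrame(void)
    {
        static double lastDraw = 0.0;
        const double minInterval = 1.0 / 30.0;   // target: 30 frames per second

        struct timeval tv;
        gettimeofday(&tv, 0);
        double now = tv.tv_sec + tv.tv_usec / 1e6;

        if (now - lastDraw < minInterval)
            return false;                        // too soon; skip this draw
        lastDraw = now;
        return true;
    }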
So you can see things have sped up immensely by not doing all that work of trying to draw too often. We can also go to the threading model, which would be Warp 3. So it turns out you can do drawing in parallel with some of the computations for orbits. And we can take another sample here.
So, I'll put on my hardware hat for a second and say that threading is probably the most important thing you guys can do if you have any serious computation going on in your programs, because no matter what architecture we're talking about, we're going to end up being multi-core, multi-processor, multi-threaded, multi-something. So, if you're not taking advantage of that parallelism in hardware, you're going to lose out.
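A minimal sketch of that split using pthreads; the function names are hypothetical, and the real Celestia change was presumably more involved. The orbit computation runs on a worker thread while the main thread draws, so a dual-processor machine can use both CPUs:

    #include <pthread.h>

    void computeOrbits(void);   // hypothetical: update body positions for the next frame
    void drawScene(void);       // hypothetical: render the current frame

    static void *orbitThread(void *arg)
    {
        (void)arg;
        computeOrbits();        // orbit math runs on the second CPU
        return 0;
    }

    void renderFrame(void)
    {
        pthread_t worker;
        pthread_create(&worker, 0, orbitThread, 0);  // kick off the orbit work
        drawScene();                                 // draw on the main thread meanwhile
        pthread_join(worker, 0);                     // synchronize before the next frame
    }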
But now, if we compare the first session to our second session, you'll notice that it's very different. In the first session, we had this call to log at the top, but now that we've limited drawing and added threading, we have a totally different profile. So this brings up an important point: you shouldn't try to micro-optimize right off the bat. You should look for high-level problems and figure out what your code is actually doing before you start diving into function-level tuning.
But now we have a new profile, and you'll see that cosine from libSystem is actually at the top of the chart. There's not much we can do about cosine itself, but we should figure out who's calling it. So if you open up the disclosure triangle, you can see our main caller is this orbit function.
Now, new in Shark 4, we have what we call data mining operations, and as Nathan mentioned, if we want to hide, say, a whole library, for example libSystem, we can just exclude it, and now all of the samples that used to be in cosine are attributed to its callers. So you see our orbit function is now at the top of the chart.
And of course, if you double-click, you get your source code, and it's highlighted according to the hotspots. You can see that in this function, almost all of the time is going into one line, calling cosine. And of course, it's in a for loop. What we should do is open up Xcode and see what we can do about that, so you just hit the edit button.
And you can see we're calling cosine in a loop, and we're actually just iterating over a whole array of data and getting the cosine of each value. Now, new in Tiger in the Accelerate framework, there just happens to be a cosine function that does that. It just iterates over an entire array.
And if we jump over to that line, you can see I just made a call here. It's called vvcosf. It's available in Tiger, and what it does is take an entire array, compute the cosine across the whole array, and give you a result array. So we made that optimization. We call that Warp 4.
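A minimal sketch of that change, assuming the call in question is vvcosf from the Accelerate framework's vForce library (the single-precision, whole-array cosine); the surrounding names are hypothetical, not Celestia's actual code:

    #include <Accelerate/Accelerate.h>

    void cosineOfAngles(const float *angles, float *results, int count)
    {
        // Before: one libm call per element.
        // for (int i = 0; i < count; ++i)
        //     results[i] = cosf(angles[i]);

        // After: one vectorized call over the entire array.
        vvcosf(results, angles, &count);
    }

Link against the Accelerate framework (-framework Accelerate) to use it.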
You'll see it took about 20 seconds to run through the demo before, with threading and limited drawing. Using the Accelerate framework, we doubled the performance. But there is a downside: I kind of cheated. That cosine function is single precision, and the inputs and outputs in Celestia were double precision, and I didn't want to cause any kind of astronomical catastrophe or global warming or anything. So we decided to do our own cosine approximation function in double precision. I called that Warp 5, and it ended up being a little slower than just calling vForce.
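A rough sketch of what a double-precision array cosine might look like; the real Celestia change presumably used a more carefully tuned (for example, minimax) polynomial, so treat this as illustration only:

    #include <cmath>
    #include <cstddef>

    // Truncated-series approximation after range reduction; accuracy degrades
    // near the ends of the range, so a production version would use more terms
    // or a minimax fit.
    static inline double approxCos(double x)
    {
        x = std::fmod(x, 2.0 * M_PI);            // reduce into (-2*pi, 2*pi)
        if (x >  M_PI) x -= 2.0 * M_PI;          // then into [-pi, pi]
        if (x < -M_PI) x += 2.0 * M_PI;

        const double x2 = x * x;
        return 1.0 + x2 * (-1.0 / 2.0
                   + x2 * ( 1.0 / 24.0
                   + x2 * (-1.0 / 720.0
                   + x2 * ( 1.0 / 40320.0))));
    }

    void cosArray(const double *in, double *out, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            out[i] = approxCos(in[i]);
    }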
So let's take another profile and figure out what our code is doing after we've made these optimizations; performance tuning is, of course, an iterative process. Again, you can see the profile has changed. Cosine's no longer there; we have our approximation function on top. And what I want to do is take a look at this function called BigFix. It's a C++ constructor. And you look here and you say, "Whoa, Shark is angry about something."
It's got a lot of what we call the callouts there, and if you click on one of those, you get some friendly advice—or not-so-friendly advice, depending on how bad your code was. In this case, we're doing type conversions from floating point to int, so this is one of the big bottlenecks for a G5 or even a G4 processor.
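A hedged illustration of the pattern that tip is pointing at; this is not Celestia's actual BigFix code. On the G4 and G5 there is no direct move between floating-point and integer registers, so each conversion costs a store to memory plus a dependent reload:

    #include <cstdint>
    #include <cstddef>

    // Each (int32_t) cast in this hot loop is the kind of floating-point to
    // integer conversion the Shark callout is flagging.
    void quantize(const double *in, int32_t *out, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            out[i] = (int32_t)in[i];
    }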
And what we want to do is try to optimize that. But first, let's take a look at what the compiler generates underneath this. We can now look at source and assembly side by side. Of course, we're trying to sell those big monitors; this looks really great on a 30-inch.
Now the really nice part here is you can see Nathan scrolling around, and for each line of source, we're actually in sync with the assembly view. That's going to follow along as you scroll around in the source, so you can see exactly which lines of assembly correspond to which lines of source. Now, if we want to look at the assembly but don't really understand assembly, we have this new handy PPC help button, as Nathan showed. And the cool part here is, as you click on your assembly, the right page of the manual pops open for you.
So, lots of information right at your fingertips. You don't have to go fumbling around for manuals. That's a pretty big improvement for someone like me. And of course, if you're really tuning for G5 or PowerPC 970, we have this extra help here. If we click this "Show Dispatch Groups," your source will now be highlighted according to dispatch groups for the G5.
So for those of you familiar with the G5 architecture, you can actually tune down to the hardware. The other nice feature is this 970 details drawer. If you highlight code, you can see we're dynamically updating the charts at the bottom, so you can see where every assembly instruction is going.
So on the left we have the functional units, and on the right we have the dispatch slots. For those of you who really want to tune, this is great stuff. Okay, so let's go back to Celestia.
Over a period of about two weeks or so, with a lot of interruptions doing real work, I used Shark to try to optimize this code. I ended up writing some AltiVec code, I did some scheduling for the G5, I did some G5-specific optimizations, and the sum result of all that is what I call Warp 9.
And if we run through that, here's the result of a couple of weeks of tuning effort. We started out, if you run this demo as is, at over 20 or 25 minutes on a G5. When we were done, we got it to five seconds, so around 300 times the performance. That was pretty nice.
So let's just summarize what we've talked about today. The main thing we want to say is that, hopefully as you can see, it's a really quick process to start using Shark. You don't have to do any kind of special profiling build; you can take your optimized binary and simply profile that. In order to see your source, all you have to do is add the -g flag so you don't strip out your debugging symbols.
And of course we provide a lot of high-level views, so if you're used to trees, you can see it in tree view, you can use the heavy view, you can look at your code over time as a chart. And of course, once you've identified functions that are hotspots, you want to tune the source code or tune the assembly, and Shark gives you optimization tips.
And hopefully you can tell we've tried really hard to make Shark easy. There's no magic here: you just push the start button, and that should get you going. But there's a lot of machinery underneath that makes it powerful enough for even the most expert users who want to tune all the way down to the metal.