
WWDC05 • Session 407

Swimming with Shark

Development Tools • 55:52

Shark 4 is an extremely powerful performance analysis tool you can't afford to ignore. In this hands-on session, you will learn how to use Shark's capabilities to obtain maximum performance from your application. Bring your laptop and walk through key Shark workflows with us and learn how to use useful features such as time profiling, memory and function tracing, data mining, and VM fault profiling.

Speakers: Nathan Slingerland, Sanjay Patel

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper; it may contain transcription errors.

Good afternoon. Let's see. So I'm Nathan Slingerland. I'm with the Architecture and Performance group. And welcome to Swimming with Shark. Hopefully, during the sessions this week, you've seen Shark in use. But today, we want to go over the basics first and kind of tell you how to use Shark on your code. But we also have some new features that we're excited to introduce to you today. And so let's get started.

So a few of you may still be wondering, what is Shark? Well, it's a really simple and fast profiling tool for Mac OS X. It's going to help you find your performance bottlenecks. You know, it works with the language of your choice -- Java, Objective-C, C, whatever compiler you're on. And there are two flavors, GUI and command line. The command line version you can use to script, to look for performance regressions and automatically profile your code. And today we have the CHUD 4.2 preview. So hopefully you've had a chance to download it. It's available on the website here. And of course, it's free, as always.

So Shark has a lot of powerful workflows, the most basic being time profiling. There's also malloc tracing, and new in Shark 4.2, a counter spreadsheet; we'll talk about all of these. You can also create your own custom configurations. And of course, any of the configurations that we support, you can use with network profiling. So you can look at the performance of networked machines.

So time profiling. This is just the most common workflow we use in performance analysis. Because you want to spend the time optimizing the code that affects performance the most. So that's the code that takes the most time on your system. So how does that work? Well, we interrupt the system periodically. By default, that's once every millisecond. So 1,000 times a second, we record what was running, what process, what thread, all that good stuff, and a back trace.

And it's all very low overhead, because we have a kernel extension that we're using to collect all this data. And that also lets us see everything. So no matter what you're working on, if it's a driver or an application, command line, daemon, whatever, we're going to pick that up and be able to show you something about it in Shark.

So another important workflow, besides the time profile, is malloc tracing. Of course, many large applications have memory usage as a real performance limiter. So this is going to help you track that by recording a sample on every malloc and every free. And so in simplest terms, you can understand where you're using memory. But it's also useful for visualizing program behavior, because most sophisticated software is going to allocate memory whenever it performs an operation. So you can look for this pattern of allocation. And if you see repeated patterns, you can hopefully try and remove potentially redundant operations from your code. So this is a good way to get a high-level overview of what's going on in your program.
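The record-a-sample-per-malloc idea can be modeled with a small wrapper. The names `traced_malloc`/`traced_free` are hypothetical -- Shark hooks the real allocator system-wide and records a backtrace per event -- but the bookkeeping is the same:

```c
#include <stdlib.h>

/* Each allocation carries a small header so the matching free knows its size. */
typedef struct { size_t size; } alloc_header_t;

static long live_bytes   = 0;  /* memory currently in use */
static long alloc_events = 0;  /* one trace record per malloc */
static long free_events  = 0;  /* one trace record per free */

void *traced_malloc(size_t size)
{
    alloc_header_t *h = malloc(sizeof *h + size);
    if (!h) return 0;
    h->size = size;
    live_bytes += (long)size;
    alloc_events++;            /* a real tracer also stores the call stack here */
    return h + 1;              /* caller sees the memory after the header */
}

void traced_free(void *p)
{
    if (!p) return;
    alloc_header_t *h = (alloc_header_t *)p - 1;
    live_bytes -= (long)h->size;
    free_events++;
    free(h);
}
```

Plotting `live_bytes` over time is what exposes the repeated allocation patterns the talk mentions.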

So new in Shark 4.2 is a counter spreadsheet. And if you've ever used the MONster application, this is similar functionality. It's going to let you look at the actual performance counter values. So we have, of course, in our hardware, in our processors and our PowerPC memory controllers, performance monitoring counters. And using these values, you can use this counter spreadsheet to compute metrics, for example, memory bandwidth or CPI. The memory bandwidth config comes built in, and that's useful if you want to see how your memory bandwidth usage compares to the actual hardware limits on the system. That's a good check. You can also use it to design your own metrics and do your own customized performance studies.
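The counter-spreadsheet idea is just arithmetic over raw counter values. As a sketch, the two metrics named above look like this; the counter inputs and scaling are illustrative, not actual PowerPC counter definitions:

```c
/* Cycles per instruction: lower is better for the same work. */
double cpi(unsigned long long cycles, unsigned long long instructions)
{
    return instructions ? (double)cycles / (double)instructions : 0.0;
}

/* Memory bandwidth in GB/s, computed from hypothetical bus counters:
 * beats observed on the memory bus times bytes moved per beat, over the
 * sampling interval.  Compare this against the machine's peak. */
double memory_bandwidth_gbs(unsigned long long bus_beats,
                            unsigned long long bytes_per_beat,
                            double seconds)
{
    return seconds > 0 ? (double)(bus_beats * bytes_per_beat) / seconds / 1e9
                       : 0.0;
}
```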

So, of course, you know, those are just a few basic ones. We have more built-ins, but you can also create your own configurations. You know, for the performance monitoring counters, you can search by event name. So this is really useful because, you know, each bit of hardware, each processor is a little different and has different performance events. So this will let you just quickly do a text search and find the events that you're interested in without worrying about, you know, which counter it's on and so on. And, of course, if you want to, you can go to advanced mode here, and you can control absolutely every aspect of profiling in Shark and manage every little detail.

So in conjunction with any of these configurations, you can use network profiling. So this is using Bonjour over TCP/IP. You can control and collect profile sessions over the network. So this is useful if you want to monitor your distributed network computing or your cluster or your headless machine that you have someplace, or a machine you don't have immediate access to. This is useful for doing that.

So we've talked about the basic workflows, but it's probably a pretty good idea to just review the basics. I mean, what are we doing with Shark? And what are we doing when we do performance optimization? Well, the most basic thing is to establish a baseline. So no matter what study you do and no matter what you're looking at, you need to have something that's repeatable and something that you're going to be able to measure the performance of easily. So for example, the time that an important operation takes. So you can instrument your code with a timer, for example, just to have a baseline. Or you can even use a stopwatch to do that. And step two, you want to profile your optimized deployment code, since that's what your end users are going to have. And that's where it's going to include the compiler optimizations. So you don't want to waste time optimizing development code.

But you do want to have debug symbols on, because that's how Shark is going to tell you about your code. And finally, step three, of course, Shark it. Use Shark and see where the time is going. Make sure that makes sense compared to where you think time should be going. So Shark is real simple. There's a Start button, and you're on your way to begin profiling. There's also a global hotkey, so Shark doesn't have to be in the foreground to start and stop profiling. Option Escape is the default hotkey. You can configure that if it conflicts with your app.

And, you know, there are a bunch of built-in sampling configurations. The most basic, of course, time profile, the most commonly used. There are a bunch of these other handy ones that we've also included. And by default, we like to profile everything usually. So we don't miss any important behavior. But sometimes for certain types of workflows, it may be useful to select just a single process or a single file. And that's what this process pop-up does.

So Shark, once you take a profiling session, it's going to give you three ways to look at it. And the first is heavy view. And that's really the most basic and most important, is you just want to first look at which functions were sampled most in your code. So it's going to list them from most sampled to least sampled. The most sampled would be taking the most time. So if you use this pop-up in the bottom right-hand corner, you can switch to a tree view. So rather than seeing hot spots in the heavy view, you can see hot paths. Or we also have-- you can see them both at once.

So that's great. That's going to show you the breakdown over time for all the functions in your code. But if you double click on any of these functions, you're going to get a code browser. And this is great because this shows you your source code highlighted according to where the samples landed. So lines that took more time are sampled more often, and they're brighter yellow. So in this example, you can see there are some bright yellow lines and some dimmer yellow lines. So the brighter yellow, the more time was spent there. So you want to focus on those.

And we have a custom scroll bar that's going to help you quickly jump to hot spots in your code. So you can use that. You can see that here in this example. And of course, the full cycle is to-- you profile your code, and then you want to go back to Xcode, edit it, make a change, and then reprofile. So you click the Edit button, and you're going to go back to Xcode, and you can make whatever change it is that you want to.

So OK, those are the basic workflows. Shark also provides several tools for helping you better understand what's going on in your code. So many of you probably have complex call trees with lots and lots of branches and lots of things going on. If you look at this example, the symbol B is showing up twice in this call tree. It's called from two different points, and it's calling some other subfunctions to do some work for it. And really what we'd like to know is not just how much time one little branch of B is taking, but how much time all the usages of B take together. So if you click this Show All Branches-- this is in the Advanced Settings drawer, which you can show in Shark-- it's going to give you a top-level entry for each symbol that occurred in the call tree. So it's pretty useful for finding out how much time you spent in a symbol and everything it calls. So that's one way to look at it. We'll have a few other things to look at, too. Also, color by library. So this is pretty useful for large applications with a lot of modules. It'll let you quickly visually distinguish where all the time is being spent based on module, so based on which framework you were in.

And of course, Shark gives you data mining. So what this allows you to do is customize the display of the profile so that you can understand it better. By default, we're showing the profile just based on the symbols that are in the application. And that may not necessarily be the easiest way to understand it. So there are a bunch of operators here. First, charge symbol or library. What that's going to do is take a symbol that you select-- you either right-click, if you have a right mouse button, or Control-click in the profile on a particular symbol or library-- and hide any samples that landed in that symbol or library. So we're not actually removing these samples. They're just being assigned to the caller. So the profile overall is staying constant. Focus symbol: you can show only paths starting from a symbol, foo in this example. So this is useful when you have really deep call stacks and you just want to focus your analysis on the part that you understand.
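The charge operator's bookkeeping -- a symbol's samples get reassigned to its callers while the total sample count stays constant -- can be modeled on a toy profile. This representation (a flat array of call stacks, one per sample) is made up for the example, not Shark's internals:

```c
#define MAX_DEPTH 8

typedef struct {
    int frames[MAX_DEPTH];  /* frames[0] = outermost caller */
    int depth;              /* frames[depth-1] = leaf, where the sample landed */
} sample_t;

/* "Charge to callers": delete the symbol from every stack.  Samples whose
 * leaf was that symbol now end at the caller; no sample is discarded. */
void charge_to_callers(sample_t *samples, int n, int symbol)
{
    for (int s = 0; s < n; s++) {
        int w = 0;
        for (int r = 0; r < samples[s].depth; r++)
            if (samples[s].frames[r] != symbol)
                samples[s].frames[w++] = samples[s].frames[r];
        samples[s].depth = w;   /* still one sample: the total is unchanged */
    }
}

/* Heavy-view count for a symbol: samples whose leaf is that symbol. */
int leaf_count(const sample_t *samples, int n, int symbol)
{
    int c = 0;
    for (int s = 0; s < n; s++)
        if (samples[s].depth > 0 &&
            samples[s].frames[samples[s].depth - 1] == symbol)
            c++;
    return c;
}
```

After charging symbol 3 in a profile of three samples, its count drops to zero and its callers' counts grow by the same amount.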

And Flatten Library is another very useful operator, because for a lot of us, if you call a framework function, you don't so much care about the interior symbols that that function may be calling. I mean, we don't have any control over those things. So this is going to hide all those things, except for the entry point, which is what you'd be familiar with.

So data mining is going to help you in cases where there isn't a function that pops right out in the heavy view where you're spending a lot of time. And it's going to help you with high level analysis. If you spend a lot of time in frameworks, you can't really change the framework code, but you could possibly change how you call that framework.

So in the previous data mining example we showed you, you pick a specific symbol or library. Well, there are also some useful macro-level data mining operations, where it'll just automatically apply an operation to the whole profile. And one of the most useful is charge code without debug symbols to callers. So usually as a developer, the code on your system that has debug information is only your code, right? System code doesn't have any of that. So if you do that, it's going to apply that to everything in the profile automatically and just show you the code that you have control of, you know, your source code.

So that's some of the high-level analysis tools in Shark. Of course, Shark also provides tools for tuning your code. In the code browser, there's automated code analysis. So Shark will show this little exclamation point button when it has an optimization tip available. If you click that, it's going to offer some help: maybe you could do this, maybe you could try this. So this is all based on problems we've run across, or different performance optimizations we've run across, that we've tried to automate and put into Shark.

Shark also, in the code browser, you can see source code and assembly side by side. So this is really useful for seeing what the compiler generated with your source line. So it's all integrated. If you click a line of source, it will highlight the corresponding instructions and vice versa. So it's all integrated between the two views.

And of course, although you may not be an expert at assembly, there's an online assembly reference guide that's integrated with this. So if you click on an instruction, it's going to bring up the definition of that instruction, in this case, fmadd. So you don't have to memorize every single instruction definition and little variation between the instructions. So those are the basics of how to use Shark. So to take Shark for a test drive, at this time, let's bring up Sanjay Patel, one of my colleagues from the Architecture and Performance group. Sanjay Patel: Thank you. Can we get Demo Machine 1?

So this has become kind of an annual tradition for us. We go out and try to find an application that's running on Mac that we think we can help the performance of. In this case, we found this application. It's actually a screensaver in its base form. It's called Einstein at Home. And if you want to launch that. Sure. What it's doing is it's collecting data from different observatories, both in the United States and in Germany. And it's trying to detect gravitational waves. These were predicted by Albert Einstein in 1915. So it's collecting data from satellites and so on and processing them as a screensaver. So it's similar to SETI at home, which you might be familiar with. In this case, what we're looking at is the celestial sphere. And what we see are the purple dots represent pulsars. You'll see red dots. Those are the remnants of supernovas. You can drag around. If you guys have this on your laptops as a download, you can also launch that from the disk image.

And you'll actually see the green L, blue L, and there should be a red L somewhere. Those are the actual points of the observatories on Earth that are collecting data. If you Control-click, you can actually zoom in and out of this application. And so we got the screensaver. We actually got the source from Dr. Khanna to see what we could do with it. And so the first thing, as Nathan mentioned, when you want to do a performance optimization is set a baseline. And you can see down here we added a little timer. And we're going to iterate over a fixed set of data and process that data. And if you hit the space bar, you'll actually see a little orange target. It's on the other side of the sphere right now. It's actually processing data at that point on the celestial sphere.

And you can see here, we're running on a dual G5, and we finish in about 18 seconds. So this is a very repeatable data set, which we can run over and over again. And the next step, of course, since we've already built this optimized and with debugging symbols, is to shark it. So you can launch Shark. And again, it's pretty easy. You've got a Start button, which you can either click or use a hotkey for. Since Einstein is an OpenGL app, we don't want to interfere with that. Let's make it the foreground app.

and hit the space bar. That'll get it going. And then you can use Option Escape to get Shark going. So right now, Shark is actually sampling at one millisecond. So this is our default configuration for time profiling. And it's stopped, and we're going to process the samples that we just took over that period of time.

And here you see the default view, the heavy view. And what you're looking at is a list of the functions that actually ran during that sampling period, in order of the number of samples that landed in each particular function. And you'll notice there's one function that's clearly the most important here, this LALDemod.

But before we even get to that, we should always look at the process pop-up. In this case, it's showing us that only a little more than half of the CPU cycles on our system are going into CFS Boink, which is actually the process name for this Einstein at Home app. And if we look at the thread pop-up, you'll notice there are two threads. One of them is eating up basically one whole CPU, and that's the one that's doing the calculations. And the other one is just taking up a little time, and you'll notice that it's pretty much all in graphics drivers and OpenGL code. So that's the graphics thread that's actually drawing stuff on screen. Well, we looked at that and we said, OK, we've got all these empty cycles. We should see if we can thread this app, because clearly that's the direction all future hardware is going to take: more multi-threaded, more multi-core, more multiprocessor. So the place to look, we figured, was let's go to the top function and let's see what it's doing and let's see if that code is threadable. So if you double click on that function, you get what we call the code browser. And again, this is showing you your code annotated and highlighted in terms of hot spots. And we have this little Edit button down at the bottom. If you click that, it'll bring up your source code in Xcode, where you can go and edit. It should jump right to the highlighted line that you had in Shark.

And what we noticed in the case of Einstein here is that all the time is going into this single for loop, and it's got a fixed boundary, k equals 0 to the k limit. And we said, well, what if we just split that for loop n ways for n CPUs? In this particular case, of course, it's going to be two CPUs.

And if you look a little bit further down in the source code, you'll see how we implemented that using pthreads. So there's a pthread_create, which is highlighted there. So that's actually spinning off threads. And then there's a pthread_join shortly thereafter. That's so when threads are done with their work, they actually get destroyed there. And pthread_create takes a function pointer. So this is this LALDemodCore. And if we jump over to that function, you'll notice that, well, it's just that loop, except that we've changed the iteration bounds to take parameters for the beginning and ending points of the array. So we wrote this up with pthreads. This didn't take very long. And we implemented that back in Einstein. And we called that "optimization level 1."
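A minimal version of the loop split described here -- each worker gets a `[begin, end)` range of a fixed-bound loop, spun off with `pthread_create` and collected with `pthread_join` -- might look like this. The names (`worker`, `range_t`, `parallel_sum`) are illustrative, not the Einstein code:

```c
#include <pthread.h>

#define N_THREADS 2
#define N_ELEMS   1000000

static double data[N_ELEMS];
static double partial[N_THREADS];

typedef struct { int id; long begin, end; } range_t;

/* The "core" loop body: same loop as before, but with parameterized bounds. */
static void *worker(void *arg)
{
    range_t *r = arg;
    double sum = 0;
    for (long k = r->begin; k < r->end; k++)
        sum += data[k];
    partial[r->id] = sum;
    return NULL;
}

double parallel_sum(void)
{
    pthread_t tid[N_THREADS];
    range_t rng[N_THREADS];
    long chunk = N_ELEMS / N_THREADS;

    for (long i = 0; i < N_ELEMS; i++)
        data[i] = 1.0;

    for (int t = 0; t < N_THREADS; t++) {
        rng[t] = (range_t){ t, t * chunk,
                            (t == N_THREADS - 1) ? N_ELEMS : (t + 1) * chunk };
        pthread_create(&tid[t], NULL, worker, &rng[t]);   /* spin off threads */
    }
    double total = 0;
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(tid[t], NULL);     /* workers are destroyed here */
        total += partial[t];
    }
    return total;
}
```

Note this is exactly the pattern that gets the demo into trouble next: if the create/join pair sits inside a hot loop, the threading overhead can swamp the work.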

You can hit the number 1. That'll switch it over to the newly threaded code if you hit the spacebar after that. Well, you'll see it is running, but time isn't ticking, or at least not very fast right now. And what happened is-- well, let's see if it even goes anywhere. We know we have a problem.

Well, it's running, but it's running really, really slow. So we're only updating the screen every so often after we complete a chunk of work, and we haven't even gotten through the first chunk of work. So something must be very wrong. We finally got an update after over 30 seconds. Got half a percent done of our workload. Well, that's not good. So-- Our optimization wasn't really a very good optimization, was it? We should take Shark again. Performance optimization is an iterative process. Sometimes you hit, sometimes you miss. Looks like we missed pretty bad.

Well, we took another time profile, and well, things are definitely different. And what we see here is that a lot of the functions are this maroon or red color, and those signify supervisor space samples. So these are things inside the kernel. And we have a problem with the time profile. We can't actually link a call stack from user space into the kernel and through the kernel, because there's a break point there when you enter the kernel with a normal time profile. So we want to figure out-- what's going on here, right? How do we figure out how we're getting into the kernel so often and what's taking time? Well, that leads us to a new feature in Shark, which I'll let Nathan explain to you. Thank you, Sanjay. Can we have the slides again, please?

So with Shark 4.2, we're introducing a new way to use Shark, a new configuration, and that is system trace. And so what this does, this is going to give you a way to understand how your code is interacting with Mac OS X, to look at multi-threaded behavior. And so the goal here is just to get an eye into these things that we can't normally see with time profile.

So the methodology is an exact trace of boundary crossings. So that's a lot different than a time profile, which is statistical, right? It's taking-- every millisecond it's recording a sample. This is going to record a call stack every time we enter the kernel from user space. And it doesn't require any instrumentation. It's all going to just happen automatically. It's system-wide.

And it provides three new views. The first is a summary view. And we'll go over each of these in more detail. But the basic view is a summary view that's going to show you overall what's going on in the system. Trace view, which is going to give you a list of all the trace records. So things we're looking at here are system calls, VM faults, when your threads were scheduled. And timeline view. And this is a visualization of the threads in your system. So you can see the threads running over time. And we'll talk about each of these. So the summary view. This is all about the biggest piece of the pie.

So here we have a pie chart. Where were we spending time? Was it in user space? Was it calling into system calls, handling VM faults, interrupts, and so on? Underneath the pie chart, we have a summary view, summarizing each aspect of the system. So first is the scheduler. A lot of times, it'd be nice to know how your threads are running: how often, how long, at what priority, how much time they get to actually run on the CPU before they're suspended and something else is run by the scheduler. Well, that's what the summary view is going to show you. And you can use this as a good check: if I'm expecting a CPU-bound thread and it's only getting scheduled for a few microseconds, something is wrong. On Mac OS X, I'll tell you, you should expect to be on the CPU in the range of tens of milliseconds before you get kicked off if you're actually CPU bound. And we'll talk more about how that works. System calls. So this is going to provide you a profile of the system calls in the system. It gives you the ability to attribute system call time to user space code. So that means it records the user space call stack, and it links it to a specific system call. So if you spend a lot of time in a system call, you can see, oh, this piece of code in my app actually called that.

Similarly, for VM faults, when you're taking virtual memory faults, these are otherwise transparent. They're handled by the kernel. And this will allow you to say, oh, this piece of code was actually triggering all those VM faults that maybe you see in VM stat or some other tool. You can actually pin these back to your source code.

And of course, trace view. This is just the complete list of all the trace records. You can look at all the call stacks, the arguments to the system calls. But best of all, this supports hierarchical sorting. So that's a useful technique for finding, say, the specific type of system call that took the most time. You can double-click on one of those, and you can go to that event in the timeline view. And this view gives you a way to navigate the trace. So you can interactively zoom and scroll. You can click on an event.

In this case, this is a BSD system call. And it's going to give you an inspector to look at all the details about that particular system call. You can filter it based on a variety of things. System, process, thread, a lot like the normal time profile per CPU, per event type. You can also color these threads based on various parameters, the CPU they ran on, priority, the reason that they were switched off the CPU, just so you can get a good overview of why things are happening the way they are in the system.

So what are all these little pictures? Well, these rectangles, these are the thread run intervals. So this is the duration of time that a thread ran for. And also we can see within the thread, there are system calls, these little telephone icons. And there's a black line underneath the system call. And that indicates the time that the system call took within that thread. Similarly for VM faults, these page icons: this is the time that it took to handle this virtual memory fault. So those are the system events that we're looking at.

So why-- a lot of you may wonder, well, why would my thread not run for the full amount of time? Well, if it's a CPU bound thread, truly CPU bound, well, it will probably run for the full allotted time slice or quantum before the kernel will kick it off and run something else, assuming it has something else to run.

Another possibility is that it's waiting on a resource, so disk or some other thing that isn't available right away. The kernel is going to pick another thread to run while this one is waiting, so that's blocked. It's also possible that while you were running, a higher-priority thread, like a driver's interrupt handler thread, was made runnable, and so the currently running thread will be preempted, and that higher-priority thread will be run.

So a little more about system calls. This is work done by the kernel on behalf of the calling process, and it's really just a safe way for user space programs to call into these various kernel APIs-- BSD, Mach, MIG messages. The distinction between these isn't all terribly important, but there's a variety of APIs here, just to keep that in mind. Virtual memory faults-- so Mac OS X, obviously, has a powerful virtual memory system. And what that's doing, it's dividing the overall address space into pages, so into equal sized chunks. And these are brought into physical memory on demand, of course, because we want to be able to have more applications running than we actually have physical memory to contain them all, potentially, or for the data that an application is operating on. And there are a wide variety of these faults.

The simplest, page in and page out, are just shuffling things around for the currently used pages. Zero fill: these occur when you malloc or allocate a large chunk of memory. The kernel is going to mark those pages to be zero filled on demand. And so the first time you touch those pages, you'll trap into the kernel, and the kernel will zero that page out automatically. Copy on write happens when you modify a shared page. So if you're sharing a page between two processes and you modify it, it's going to make you your own personal copy of that page.

And page cache hit. The page is resident in the buffer cache, but it's not yet mapped into your application. So it's going to take a fault. And a fault, you know, takes some time to get in and out of the kernel. So it's good to be aware of these. So this is going to give you a way to actually see these things. So for a demo of that, let's bring Sanjay up again, and we'll see how this works.

OK, so when we last left off with Einstein, we had this whole time profile full of system calls and supervisor space time that we can't account for. So what we want to do is-- and you can see that Einstein is still running. It's horribly slow. If we go to the Config menu in Shark, we can just select System Trace and then either hit Start or Option Escape, and that'll start taking a trace.

OK, so it's going. And you can see it actually stopped rather quickly. That's because a system trace is truly a trace. It's not statistical sampling. So by default, we're going to collect up to a million different events, which is probably all you can handle at any given time. But if you're some kind of super programmer, you can actually up that limit inside the config editor.

In this case, so what we're seeing is the summary view, which Nathan explained. And it gives you a pie chart. And it's breaking down time in terms of user space versus system space versus idle. And it's further breaking down supervisor space into faults, kernel, and interrupts. And in our case with Einstein here, you can see we're getting about 45% of the time in user space, so that's executing Einstein code. But at this point, we're getting almost a quarter of the time inside of system calls. And that's this orange section of the pie. And down here in the scheduler, you can see these are the Einstein threads underneath CFS Boink. You can see that the average thread is running for a few microseconds, in this case about 60, 63 microseconds. And if we look at system calls-- So remember, this is an exact trace of all the system calls made during our sampling period. You'll notice there are a lot of VM deallocates. So that's virtual memory deallocating. There's virtual memory mapping and thread create running. You can option click on one of these disclosure triangles, and it'll pop open the back trace to tell you how you got into that system call.

So in this case, you can see all the way from pthread_body up, we end up in that LALDemod function, which, remember, when we added our threading, that was the name of the function that we were calling. And it's calling pthread_create, which we showed you in the source, and then ending up, of course, inside of a system call.

Now let's take a look at this inside the timeline view as well. So by default, what you get is on the horizontal axis, the x-axis is time. And you have a full list of all the threads along the y-axis that are going on in the system. And at this level, well, it's just a bunch of yellow rectangles. You can zoom in either by rubber banding or by using the zoom slider up there at the top. And if you zoom in far enough, about two-thirds of the way through probably, the icons will start to appear. And in this case, we see a whole bunch of little telephones.

Those are all system calls. If you click on a phone, you get a backtrace for that system call. So you can figure out what caused the system call from your code. And in this case, again, you'll see your code is actually a hyperlink, which we'll click on in just a second. But in this case, what we're showing is a system call that created a thread. And then if you look here on the timeline, you can see a new thread was created. It's going to do quite a few of its own system calls. And eventually, it's going to end up at thread terminate. So what's going on here? Well, this thread was fired up, it did its work, and then it terminated. There's nothing abnormal about it, except that-- and if you click on the yellow, you'll see it ran for only, in this case, about 75 microseconds. So the problem here is that we just didn't give enough work to these threads to amortize the cost of thread creation and destruction. We threaded at too fine a level. Now if we go back to Xcode-- and you can do this by double-clicking on a link here; it takes you back to that particular system call-- you'll see we're at our pthread_create. And what we realized is, well, we're doing our thread creation and destruction inside of a couple of sets of nested loops. And if you scroll up to where this function is actually defined, you'll see here's one for loop for the arguments and another for the parameters. What we should try to do is move this thread creation and destruction outside of these loops so we can actually do a lot more work per thread. That'll help amortize all the cost of the thread creation and destruction. And if we go up to-- or sorry, down to TestLALDemod, you'll see how we implemented this inside of the Einstein code.

So again, there's a pthread create call and a pthread join call. But in this case, we're calling a different function, LALDMOD_OPTIMIZED. And if you jump over to that function, you'll again see basically what we just showed earlier. Now this is the body of the thread, except it has far more work to do, because the for loops are actually inside the thread rather than around the thread. So we implemented this. And again, you have this if you're following along on your laptops. If you go to Einstein and go to optimization level two, you'll see an immediate pickup. And that's because, well, we're doing the same work, but we're doing a lot more work every time we create a thread.

And so you can see, if we run it again-- if you hit the spacebar again-- if you remember, before, using a single thread of execution to calculate these data points, we were running at about 18 seconds. And in this case, we've shaved it down to about 10.7 seconds. So that's a nice speed up from threading. But we're not done, of course. We're performance guys; we can't stop now. What we want to do is take another time profile. We've threaded this thing. Let's see how it looks now, and let's see if we can attack another problem.

So again, if you hit the spacebar in Einstein, we'll take a time profile. And then you start up Shark with a hotkey. It's taken its samples, and once again, we get a heavy view. And you'll notice this time it looks similar to our first session, except you can now see in the process pop-up that we're taking up most of the CPU cycles on this system.

That's what we want to do, right? In all fairness to Einstein@Home, you can actually run this as a multiprocess app when it's a screen saver, so this wouldn't be a real optimization that the developers would probably want to use. But in our case, this was a nice educational example. So now we've got most of the CPU cycles. You can open up the thread pop-up, and you'll see we're not lying. We have two computation threads, a graphics thread, and that last thread is actually the one that's spinning up the other two. And so now what we want to do is look at our heavy view. And you'll notice, well, we've got a top function, but we also have cosine and sine. And developers look at that and they go, well, that's Apple's code. There's nothing I can really do about that. What would be nice? It would be nice if you could figure out who's calling those functions and charge it back to the parent, right-- whichever function you wrote that's actually making those calls. So if you do a control click, or if you're lucky, a right click, you can see our data mining contextual menu. And you can just charge cosine to whoever is calling it. So bam, cosine disappears. There's a data mining active notification up there at the top to show you that, okay, you've done something to your profile. And you can go on to sine as well. Now you can either charge sine, or, since there's not much you can do about libSystem, you can just charge the whole libSystem library back to the callers. And so you can see now sine has disappeared, and a bunch of other little functions have as well. And what we noticed was our top function was in the 80% range, and now it's taking up even more time. And so that's clearly where we want to focus our optimization efforts. So if we double click on that, you'll see again, well, we've seen this hot loop before. So you can see again, Shark is going to highlight each line according to time.
And whenever it can, it's going to offer us advice via this exclamation point. And if you click on that, you'll see, well, it's got quite a bit of advice for this loop-- everything from using reciprocal estimates rather than division, to unrolling the loop, to using Altivec to vectorize the loop. We decided to break this out into a couple of steps because there's going to be quite a bit of work here. The first thing we want to do is probably unroll the loop and try to get rid of that division, because we know floating point division is always slow. And if you scroll down a bit in the source, you'll see how we implemented that.

You can see we've just stamped out the code four times to help fill the pipeline slots. And rather than using division, we're using this FRES intrinsic to generate a reciprocal estimate. And if we go back to Einstein, and we go to optimization level three-- we call this unroll, no division-- and kick it off, what we'll get is-- well, we should be about 20% faster. We were at somewhere around 10 to 11 seconds, and now we're at nine seconds. So that's a pretty good optimization. But we're not done yet. We should take another profile and see what's going on.

So as expected, our top function is still our hot loop. And what we really want to do, if we double click on that again, is see its hot spots. OK, so now you'll see our new code is being highlighted. It's unrolled. Shark still has some advice there: well, you're doing some single precision, why don't you try using Altivec? Now, the interesting thing about this loop and about the Einstein app is that it's using some doubles and some singles within the loop. And we wanted to preserve correctness, so we decided to attack this loop in a couple of different steps-- because we don't want to detect gravitational waves where there are none. So we looked at this and we said, OK, what we need to do is break out all the single precision computation that we can so we can use vector instructions.

Now, some of you may be saying that's a little crazy at this point. No, it really isn't, because we're still shipping these Macs. These Macs are going to ship for a long time. And even after we stop shipping G5s or G4s with Altivec, those machines are going to be out there for years. And beyond that, if you attended the Altivec to SSE session earlier today, you'll know that once you make the step to vectorization, the porting from one instruction set to the other is not really that hard. The hard work is getting there in the first place-- making sure your data is in the right order and so on. But what we had to do in order to write any vector code, first of all, was get it down to single precision in a block. And if we look at our source back in Xcode, you'll see we started to implement that. What I needed to do was create a temporary buffer up there, called local tempf, to store some intermediate results. And I decided I'd just malloc that space up there, and then I'd free it down below after I was done using it. So we implemented that as optimization level four. And if you hit the space bar, we hit another big oops. Again, time is ticking off and not updating very quickly, because we're not getting a lot of work done. So again, something went wrong. And what we should do is take a Shark time profile.

And once again, we have a whole window full of red stuff. What do you do about that? Well, time profile isn't too informative in this case, because again, you can't track all the way through from user space to kernel space. What you want to do in this case is a system trace. So Nathan switched over to system trace, and we can take another profile.

And in this case, you'll notice, OK, Einstein is getting a little less than half of the time in user space. It's spending about a third of the time in system calls. And now we have an additional 15% in VM faults-- that's virtual memory faults. That's what happens when you allocate new memory and touch it, basically. If we go to the timeline view, again, we'll zoom in until we can see a whole bunch of icons.

And what we noticed was, OK, if you click on the first telephone there, or one of those telephones, we're doing a VM allocate. And you can see from the backtrace, it corresponds to the malloc call, which we just added for my temporary buffer. Then it's followed up by a VM fault, a zero fill-- so that's when I actually touch that allocated memory, and Mac OS X is then going to fault that page in. And that's followed by a VM deallocate, which corresponds exactly to my free. And so this made Einstein brutally slow.

If we go back to the source, however, we notice what we could do is, again, move that malloc and free operation outside of the loops-- to basically where the thread is created and destroyed-- essentially creating a static buffer for this operation rather than an overly dynamic buffer. So we made that optimization. And if we go back to Einstein and call that level 5, you'll see that, again, things have picked up pretty well.

If we run it again, we'll get a feel for how much things have changed from before. Now, we're not expecting a lot over optimization level 3-- this is really just scaffolding in place so we can use vector instructions at the next step. Now, if we profile once more with time profile, you'll notice again, we're back in user space. So we've eliminated all that problem of making system calls and faulting in pages. And we can once again click on our top function. You'll notice now we're in this new function. We still have the Altivec advice from Shark.

And we decided, yeah, we could write Altivec for this loop. It's pretty small. It's a good investment in time for what we're going to achieve here, because we expect Einstein to be running on a few million computers as screen savers for several man years. We could save the world a few gigawatts by writing the most efficient code.

And if you scroll down there, you'll see Altivec code. We've seen this in our demos before. It looks like regular C code, but it's actually very close to the machine-- just intrinsics that operate on four floating point variables simultaneously. So we implemented that, and we called that optimization level six.

We call it vector hybrid because it's still doing some calculations in double precision, and we needed that to pass all the correctness checks. So I'm hoping that we'll at least get this one into the shipping Einstein soon. If any of the Einstein guys are out there, come talk to me. But we went one step further, and I'm hoping we can get this one in too, which is a pure Altivec loop. And if we run that, you'll see we ended up down at about five seconds. So remember, our scalar code started at about nine seconds. So we got it to be about 80% faster using vector instructions.

And if we go back to our slides, let's summarize what we experienced here with Einstein. So this-- I'm not going to lie about this-- this porting, or this optimization process, went on for a few days, probably a few more days than my management knows about. But it wasn't too bad. You can see here that optimization level one was where we screwed things up. We had our threads that were just operating on a very small chunk of data, and we were paying a heavy, heavy price-- way too much for what we were getting. When we threaded with a little better model, we ended up with a nice dual processor speed up on the G5, 76%.

We unrolled and took out some division, and picked up about 20% more than step two. At optimization level four, again, we blew it. We added that malloc, and we ended up spending tons of time inside of VM faults and system calls. We fixed that, we got to vectors, and then we did even more vectorization work. And we ended up about 3 and 1/2 times faster than the original code. Which is not to say that you can't do any more-- we just kind of came up against the deadline for WWDC. But there is one more thing. I'll do my best SJ impersonation.

So we made a little announcement on Monday. And you may be wondering, what's the state of Shark running on the developer transition boxes and on future boxes? Well, you'll be happy to know it is running. If you've been down to the labs, you may have used it already. We have basic time profiling, because that's obviously the most common workflow for most of you. We also have performance event profiling. So that means we can look at performance monitor counters on the chip-- special hardware registers inside the Pentium. You can configure those and use those for generating hardware events like cache misses and so on. It's cross-platform, so you can take a session on one of those developer boxes, save it, and then open it back up on a non-developer-transition system, or vice versa.

We do have code analysis and optimization tips for the x86, or IA-32, architecture. And probably most important for those of you who are getting down to the low level and coming from PowerPC, we have a disassembler which lets you choose your syntax, either Intel syntax or AT&T syntax. A little explanation there: AT&T is the syntax that GCC uses; Intel syntax is the official Intel syntax. Don't ask me why they're so different, but they are. But you can choose whichever one you like.
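As a quick illustration of the difference (a generic example, not code from the demo): the same three IA-32 instructions written both ways. AT&T syntax puts the source operand first, prefixes registers with % and immediates with $, and encodes the operand size in a suffix; Intel syntax puts the destination first and uses bare names with bracketed memory operands.

```asm
# AT&T syntax (what GCC emits): source first, %-registers, $-immediates
movl    $42, %eax          # eax = 42
addl    %ebx, %eax         # eax = eax + ebx
movl    8(%esp), %ecx      # ecx = *(esp + 8)

; Intel syntax (the same three instructions): destination first
mov     eax, 42
add     eax, ebx
mov     ecx, [esp+8]
```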

And perhaps best of all, we have an integrated instruction reference guide. So we recently got approval from Intel Legal to get this done. And probably the best way to show this off is to go to Demo Machine 3, which is our-- developer transition system. And you can see Shark is already launched. We have Xcode here. If any of you saw our show back in 2002, you know the work we did with Flurry. That's a screen saver. We'll launch that.

We've added a little timer again and put it into its own window so we could get repeatable results. You can see that's the Flurry you know and love and have seen over and over, too much. And in this case, this was the base code again. And it's all scalar code-- it's portable C with OpenGL and a little bit of Cocoa to get the windowing. It takes about 10 seconds to calculate 4,000 frames. And what we want to do is, again, we'll use Shark to figure out what's going on during that time. We start it up again. You hit Start on Shark.

It's just like a Mac, isn't it? You can then double-click on the top function, in this case, "Update Smoke." There's your source code. There's some advice here-- in this case, integer division; in another case, floating point division. You can look at assembly, or both. If you have a wide enough monitor, this makes more sense. But there you see Intel instructions running on Mac OS. Everything's highlighted. If you don't understand Intel instructions, you click on ASM Help.

And if you highlight a particular line of your ASM, the documentation updates with you. So hopefully that'll help you get up to speed on speaking x86 if you're tuning down at this level. It was certainly a big help for us. But it wouldn't be enough just to take a profile. We actually wrote SSE2 code to optimize Flurry. And again, remember, we had written Altivec code. And you can get this-- this is a fat binary. It comes with our package, which you download. It's got both the Altivec code and the SSE2 code, so you can see exactly how we did this. I can tell you that the porting process from Altivec to SSE2 was probably about 1/10 of the time that it took us to write the initial vector implementation.

But if we go back to Flurry and run with vector optimizations in place, things go a bit quicker. So this is SSE2. It can do four single precision floats, just like Altivec-- it's a 128-bit vector. So we got about a 4x speed up here, from 10 seconds down to about 2.6. So it can be done. And believe me, I've written a lot of Altivec, so this was a big change for me. But it wasn't that bad.

If we can go back to slides now. So let's summarize what we've talked about here. Hopefully you can see that Shark is not too hard to get used to, to get running. You hit the start button, and for most of you, that's probably all you'll ever have to do with it-- except fix your own code. We have a lot of different workflows, and Nathan walked through a lot of those different paths. Basically, time profile and system trace, we think, are going to be your major weapons of choice to solve performance problems. But we have a whole host of other configs. And you can obviously create your own custom configs. We go down to the metal, whatever metal that might be.

And again, CHUD 4.2 preview is available right now off the developer.apple.com website. We just put out a package during WWDC, so please download that, give it a whirl, particularly if you're on the developer transition box. We'd love to hear what your feedback is. And of course, this doesn't cost you anything.

For more information, of course, the WWDC web page will give you links to our sample code. Everything we've shown you here is available. And there are some other sessions coming up right after this one: performance analysis of your memory code; a DevTools feedback forum at 5:00; and debugging parallel applications, for those of you writing large networked applications-- that should be a fun session, also at 5:00. If you need to contact us, the best thing you can do is use PerfTools feedback. That goes to Nathan, myself, and several other team members. Or Xavier Legros is our evangelist and chief marketeer-- bug him for features.