
WWDC05 • Session 407

Swimming with Shark

Development Tools • 55:52

Shark 4 is an extremely powerful performance analysis tool you can't afford to ignore. In this hands-on session, you will learn how to use Shark's capabilities to obtain maximum performance from your application. Bring your laptop and walk through key Shark workflows with us and learn how to use useful features such as time profiling, memory and function tracing, data mining, and VM fault profiling.

Speakers: Nathan Slingerland, Sanjay Patel

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper; it may contain transcription errors.

Good afternoon. Let's see. So I'm Nathan Slingerland. I'm with the Architecture and Performance group. And welcome to Swimming with Shark. Hopefully, during the sessions this week, you've seen Shark in use. But today, we want to go over the basics first and kind of tell you how to use Shark on your code. But we also have some new features that we're excited to introduce to you today. And so let's get started.

So a few of you may still be wondering, what is Shark? Well, it's a really simple and fast profiling tool for Mac OS X. It's going to help you find your performance bottlenecks. You know, it works with the language of your choice -- Java, Objective-C, C, whatever compiler you're on. And there are two flavors, GUI and command line. The command line version you can use to script, to look for performance regressions and automatically profile your code. And today we have the CHUD 4.2 preview. So hopefully you've had a chance to download it. It's available on the website here. And of course, it's free, as always.

So Shark has a lot of powerful workflows, the most basic being time profiling. There's also malloc tracing, and new in Shark 4.2, a counter spreadsheet; we'll talk about all of these. You can also create your own custom configurations. And of course, any of the configurations that we support, you can use with network profiling. So you can look at the performance of networked machines.

So time profiling. This is just the most common workflow we use in performance analysis. Because you want to spend the time optimizing the code that affects performance the most. So that's the code that takes the most time on your system. So how does that work? Well, we interrupt the system periodically. By default, that's once every millisecond. So 1,000 times a second, we record what was running, what process, what thread, all that good stuff, and a back trace.

And it's all very low overhead, because we have a kernel extension that we're using to collect all this data. And that also lets us see everything. So no matter what you're working on, if it's a driver or an application, command line, daemon, whatever, we're going to pick that up and be able to show you something about it in Shark.

So another important workflow, besides the time profile, is malloc tracing. Of course, many large applications have memory usage as a real performance limiter. So this is going to help you track that by recording a sample on every malloc and every free. And so in simplest terms, you can understand where you're using memory. But it's also useful for visualizing program behavior, because most sophisticated software is going to allocate memory whenever it performs an operation. So you can look for this pattern of allocation. And if you see repeated patterns, you can hopefully try and remove potentially redundant operations from your code. So this is a good way to get a high-level overview of what's going on in your program.
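The record-a-sample-per-malloc idea can be modeled with a small wrapper. The names `traced_malloc`/`traced_free` are hypothetical -- Shark hooks the real allocator system-wide and records a backtrace per event -- but the bookkeeping is the same:

```c
#include <stdlib.h>

/* Each allocation carries a small header so the matching free knows its size. */
typedef struct { size_t size; } alloc_header_t;

static long live_bytes   = 0;  /* memory currently in use */
static long alloc_events = 0;  /* one trace record per malloc */
static long free_events  = 0;  /* one trace record per free */

void *traced_malloc(size_t size)
{
    alloc_header_t *h = malloc(sizeof *h + size);
    if (!h) return 0;
    h->size = size;
    live_bytes += (long)size;
    alloc_events++;            /* a real tracer also stores the call stack here */
    return h + 1;              /* caller sees the memory after the header */
}

void traced_free(void *p)
{
    if (!p) return;
    alloc_header_t *h = (alloc_header_t *)p - 1;
    live_bytes -= (long)h->size;
    free_events++;
    free(h);
}
```

Plotting `live_bytes` over time is what exposes the repeated allocation patterns the talk mentions.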

So new in Shark 4.2 is a counter spreadsheet. And if you've ever used the MONster application, this is similar functionality. It's going to let you look at the actual performance counter values. So we have, of course, in our hardware, in our processors and our PowerPC memory controllers, performance monitoring counters. And using these values, you can use this counter spreadsheet to compute metrics, for example, memory bandwidth or CPI. The memory bandwidth config comes built in, and that's useful if you want to see how your memory bandwidth usage compares to the actual hardware limits on the system. That's a good check. You can also use it to design your own metrics and do your own customized performance studies.
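The counter-spreadsheet idea is just arithmetic over raw counter values. As a sketch, the two metrics named above look like this; the counter inputs and scaling are illustrative, not actual PowerPC counter definitions:

```c
/* Cycles per instruction: lower is better for the same work. */
double cpi(unsigned long long cycles, unsigned long long instructions)
{
    return instructions ? (double)cycles / (double)instructions : 0.0;
}

/* Memory bandwidth in GB/s, computed from hypothetical bus counters:
 * beats observed on the memory bus times bytes moved per beat, over the
 * sampling interval.  Compare this against the machine's peak. */
double memory_bandwidth_gbs(unsigned long long bus_beats,
                            unsigned long long bytes_per_beat,
                            double seconds)
{
    return seconds > 0 ? (double)(bus_beats * bytes_per_beat) / seconds / 1e9
                       : 0.0;
}
```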

So, of course, you know, those are just a few basic ones. We have more built-ins, but you can also create your own configurations. You know, for the performance monitoring counters, you can search by event name. So this is really useful because, you know, each bit of hardware, each processor is a little different and has different performance events. So this will let you just quickly do a text search and find the events that you're interested in without worrying about, you know, which counter it's on and so on. And, of course, if you want to, you can go to advanced mode here, and you can control absolutely every aspect of profiling in Shark and manage every little detail.

So in conjunction with any of these configurations, you can use network profiling. So this is using Bonjour over TCP/IP. You can control and collect profile sessions over the network. So this is useful if you want to monitor your distributed network computing or your cluster or your headless machine that you have someplace, or a machine you don't have immediate access to. This is useful for doing that.

So we've talked about the basic workflows, but it's probably a pretty good idea to just review the basics. I mean, what are we doing with Shark? And what are we doing when we do performance optimization? Well, the most basic thing is to establish a baseline. So no matter what study you do and no matter what you're looking at, you need to have something that's repeatable and something that you're going to be able to measure the performance of easily. So for example, the time that an important operation takes. So you can instrument your code with a timer, for example, just to have a baseline. Or you can even use a stopwatch to do that. And step two, you want to profile your optimized deployment code, since that's what your end users are going to have. And that's where it's going to include the compiler optimizations. So you don't want to waste time optimizing development code.

But you do want to have debug symbols on, because that's how Shark is going to tell you about your code. And finally, step three, of course, Shark it. Use Shark and see where the time is going. Make sure that makes sense compared to where you think time should be going. So Shark is real simple. There's a Start button, and you're on your way to begin profiling. There's also a global hotkey, so Shark doesn't have to be in the foreground to start and stop profiling. Option Escape is the default hotkey. You can configure that if it conflicts with your app.

And, you know, there are a bunch of built-in sampling configurations. The most basic, of course, time profile, the most commonly used. There are a bunch of these other handy ones that we've also included. And by default, we like to profile everything usually. So we don't miss any important behavior. But sometimes for certain types of workflows, it may be useful to select just a single process or a single file. And that's what this process pop-up does.

So Shark, once you take a profiling session, it's going to give you three ways to look at it. And the first is heavy view. And that's really the most basic and most important, is you just want to first look at which functions were sampled most in your code. So it's going to list them from most sampled to least sampled. The most sampled would be taking the most time. So if you use this pop-up in the bottom right-hand corner, you can switch to a tree view. So rather than seeing hot spots in the heavy view, you can see hot paths. Or we also have-- you can see them both at once.

So that's great. That's going to show you the breakdown over time for all the functions in your code. But if you double click on any of these functions, you're going to get a code browser. And this is great because this shows you your source code highlighted according to where the samples landed. So lines that took more time are sampled more often, and they're brighter yellow. So in this example, you can see there are some bright yellow lines and some dimmer yellow lines. So the brighter yellow, the more time was spent there. So you want to focus on those.

And we have a custom scroll bar that's going to help you quickly jump to hot spots in your code. So you can use that. You can see that here in this example. And of course, the full cycle is to-- you profile your code, and then you want to go back to Xcode, edit it, make a change, and then reprofile. So you click the Edit button, and you're going to go back to Xcode, and you can make whatever change it is that you want to.

So OK, those are the basic workflows. Shark also provides several tools for helping you better understand what's going on in your code. So many of you probably have complex call trees with lots and lots of branches and lots of things going on. If you look at this example, the symbol B is showing up twice in this call tree. It's called from two different points, and it's calling some other subfunctions to do some work for it. And really what we'd like to know is not just how much time one little branch of B is taking, but how much time all the usages of B take together. So if you click this Show All Branches-- this is in the Advanced Settings drawer, which you can show in Shark-- it's going to give you a top-level entry for each symbol that occurred in the call tree. So it's pretty useful for finding out how much time you spent in a symbol and everything it calls. So that's one way to look at it. We'll have a few other things to look at, too. Also, color by library. So this is pretty useful for large applications with a lot of modules. It'll let you quickly visually distinguish where all the time is being spent based on module, so based on which framework you were in.

And of course, Shark gives you data mining. So what this allows you to do is customize the display of the profile so that you can understand it better. By default, we're showing the profile just based on the symbols that are in the application. And that may not necessarily be the easiest way to understand it. So there are a bunch of operators here. First, charge symbol or library. What that's going to do is take a symbol that you select-- you either right-click, if you have a right mouse button, or Control-click in the profile on a particular symbol or library-- and hide any samples that landed in that symbol or library. So we're not actually removing these samples. They're just being assigned to the caller. So the profile overall is staying constant. Focus symbol: you can show only paths starting from a symbol, foo in this example. So this is useful when you have really deep call stacks and you just want to focus your analysis on the part that you understand.
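The charge operator's bookkeeping -- a symbol's samples get reassigned to its callers while the total sample count stays constant -- can be modeled on a toy profile. This representation (a flat array of call stacks, one per sample) is made up for the example, not Shark's internals:

```c
#define MAX_DEPTH 8

typedef struct {
    int frames[MAX_DEPTH];  /* frames[0] = outermost caller */
    int depth;              /* frames[depth-1] = leaf, where the sample landed */
} sample_t;

/* "Charge to callers": delete the symbol from every stack.  Samples whose
 * leaf was that symbol now end at the caller; no sample is discarded. */
void charge_to_callers(sample_t *samples, int n, int symbol)
{
    for (int s = 0; s < n; s++) {
        int w = 0;
        for (int r = 0; r < samples[s].depth; r++)
            if (samples[s].frames[r] != symbol)
                samples[s].frames[w++] = samples[s].frames[r];
        samples[s].depth = w;   /* still one sample: the total is unchanged */
    }
}

/* Heavy-view count for a symbol: samples whose leaf is that symbol. */
int leaf_count(const sample_t *samples, int n, int symbol)
{
    int c = 0;
    for (int s = 0; s < n; s++)
        if (samples[s].depth > 0 &&
            samples[s].frames[samples[s].depth - 1] == symbol)
            c++;
    return c;
}
```

After charging symbol 3 in a profile of three samples, its count drops to zero and its callers' counts grow by the same amount.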

And Flatten Library is another very useful operator, because for a lot of us, if you call a framework function, you don't so much care about the interior symbols that that function may be calling. I mean, we don't have any control over those things. So this is going to hide all those things, except for the entry point, which is what you'd be familiar with.

So data mining is going to help you in cases where there isn't a function that pops right out in the heavy view where you're spending a lot of time. And it's going to help you with high level analysis. If you spend a lot of time in frameworks, you can't really change the framework code, but you could possibly change how you call that framework.

So in the previous data mining example we showed you, you pick a specific symbol or library. Well, there are also some useful macro-level data mining operations, where it'll just automatically apply an operation to the whole profile. And one of the most useful is charge code without debug symbols to callers. So usually as a developer, the code on your system that has debug information is only your code, right? System code doesn't have any of that. So if you do that, it's going to apply that to everything in the profile automatically and just show you the code that you have control of, you know, your source code.

So that's some of the high-level analysis tools in Shark. Of course, Shark also provides tools for tuning your code. In the code browser, there's automated code analysis. So Shark will show this little exclamation point button when it has an optimization tip available. If you click that, it's going to offer some help: maybe you could do this, maybe you could try this. So this is all based on problems we've run across, or different performance optimizations we've run across, that we've tried to automate and put into Shark.

Shark also, in the code browser, you can see source code and assembly side by side. So this is really useful for seeing what the compiler generated with your source line. So it's all integrated. If you click a line of source, it will highlight the corresponding instructions and vice versa. So it's all integrated between the two views.

And of course, although you may not be an expert at assembly, there's an online assembly reference guide that's integrated with this. So if you click on an instruction, it's going to bring up the definition of that instruction, in this case, fmadd. So you don't have to memorize every single instruction definition and little variation between the instructions. So those are the basics of how to use Shark. So to take Shark for a test drive, at this time, let's bring up Sanjay Patel, one of my colleagues from the Architecture and Performance group. Sanjay Patel: Thank you. Can we get Demo Machine 1?

So this has become kind of an annual tradition for us. We go out and try to find an application that's running on Mac that we think we can help the performance of. In this case, we found this application. It's actually a screensaver in its base form. It's called Einstein at Home. And if you want to launch that. Sure. What it's doing is it's collecting data from different observatories, both in the United States and in Germany. And it's trying to detect gravitational waves. These were predicted by Albert Einstein in 1915. So it's collecting data from satellites and so on and processing them as a screensaver. So it's similar to SETI at home, which you might be familiar with. In this case, what we're looking at is the celestial sphere. And what we see are the purple dots represent pulsars. You'll see red dots. Those are the remnants of supernovas. You can drag around. If you guys have this on your laptops as a download, you can also launch that from the disk image.

And you'll actually see the green L, blue L, and there should be a red L somewhere. Those are the actual points of the observatories on Earth that are collecting data. If you Control-click, you can actually zoom in and out of this application. And so we got the screensaver. We actually got the source from Dr. Khanna to see what we could do with it. And so the first thing, as Nathan mentioned, when you want to do a performance optimization is set a baseline. And you can see down here we added a little timer. And we're going to iterate over a fixed set of data and process that data. And if you hit the space bar, you'll actually see a little orange target. It's on the other side of the sphere right now. It's actually processing data at that point on the celestial sphere.

And you can see here, we're running on a dual G5, and we finish in about 18 seconds. So this is a very repeatable data set, which we can run over and over again. And the next step, of course, since we've already built this optimized and with debugging symbols, is to shark it. So you can launch Shark. And again, it's pretty easy. You've got a Start button, which you can either click or use a hotkey for. Since Einstein is an OpenGL app, we don't want to interfere with that. Let's make it the foreground app.

and hit the space bar. That'll get it going. And then you can use Option Escape to get Shark going. So right now, Shark is actually sampling at one millisecond. So this is our default configuration for time profiling. And it's stopped, and we're going to process the samples that we just took over that period of time.

And here you see the default view, the heavy view. And what you're looking at is a list of the functions that actually ran during that sampling period, in order of the number of samples that landed in each particular function. And you'll notice there's one function that's clearly the most important here, this LALDemod.

But before we even get to that, we should always look at the process pop-up. In this case, it's showing us that only a little more than half of the CPU cycles on our system are going into CFS Boink, which is actually the process name for this Einstein at Home app. And if we look at the thread pop-up, you'll notice there are two threads. One of them is eating up basically one whole CPU, and that's the one that's doing the calculations. And the other one is just taking up a little time, and you'll notice that it's pretty much all in graphics drivers and OpenGL code. So that's the graphics thread that's actually drawing stuff on screen. Well, we looked at that and we said, OK, we've got all these empty cycles. We should see if we can thread this app, because clearly that's the direction all future hardware is going to take: more multi-threaded, more multi-core, more multiprocessor. So the place to look, we figured, was let's go to the top function and let's see what it's doing and let's see if that code is threadable. So if you double click on that function, you get what we call the code browser. And again, this is showing you your code annotated and highlighted in terms of hot spots. And we have this little Edit button down at the bottom. If you click that, it'll bring up your source code in Xcode, where you can go and edit. It should jump right to the highlighted line that you had in Shark.

And what we noticed in the case of Einstein here is that all the time is going into this single for loop, and it's got a fixed boundary, k equals 0 to the k limit. And we said, well, what if we just split that for loop n ways for n CPUs? In this particular case, of course, it's going to be two CPUs.

And if you look a little bit further down in the source code, you'll see how we implemented that using pthreads. So there's a pthread_create, which is highlighted there. So that's actually spinning off threads. And then there's a pthread_join shortly thereafter. That's so when threads are done with their work, they actually get destroyed there. And pthread_create takes a function pointer. So this is this LALDemodCore. And if we jump over to that function, you'll notice that, well, it's just that loop, except that we've changed the iteration bounds to take parameters for the beginning and ending points of the array. So we wrote this up with pthreads. This didn't take very long. And we implemented that back in Einstein. And we called that "optimization level 1."
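A minimal version of the loop split described here -- each worker gets a `[begin, end)` range of a fixed-bound loop, spun off with `pthread_create` and collected with `pthread_join` -- might look like this. The names (`worker`, `range_t`, `parallel_sum`) are illustrative, not the Einstein code:

```c
#include <pthread.h>

#define N_THREADS 2
#define N_ELEMS   1000000

static double data[N_ELEMS];
static double partial[N_THREADS];

typedef struct { int id; long begin, end; } range_t;

/* The "core" loop body: same loop as before, but with parameterized bounds. */
static void *worker(void *arg)
{
    range_t *r = arg;
    double sum = 0;
    for (long k = r->begin; k < r->end; k++)
        sum += data[k];
    partial[r->id] = sum;
    return NULL;
}

double parallel_sum(void)
{
    pthread_t tid[N_THREADS];
    range_t rng[N_THREADS];
    long chunk = N_ELEMS / N_THREADS;

    for (long i = 0; i < N_ELEMS; i++)
        data[i] = 1.0;

    for (int t = 0; t < N_THREADS; t++) {
        rng[t] = (range_t){ t, t * chunk,
                            (t == N_THREADS - 1) ? N_ELEMS : (t + 1) * chunk };
        pthread_create(&tid[t], NULL, worker, &rng[t]);   /* spin off threads */
    }
    double total = 0;
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(tid[t], NULL);     /* workers are destroyed here */
        total += partial[t];
    }
    return total;
}
```

Note this is exactly the pattern that gets the demo into trouble next: if the create/join pair sits inside a hot loop, the threading overhead can swamp the work.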

You can hit the number 1. That'll switch it over to the newly threaded code if you hit the spacebar after that. Well, you'll see it is running, but time isn't ticking, or at least not very fast right now. And what happened is-- well, let's see if it even goes anywhere. We know we have a problem.

Well, it's running, but it's running really, really slow. So we're only updating the screen every so often after we complete a chunk of work, and we haven't even gotten through the first chunk of work. So something must be very wrong. We finally got an update after over 30 seconds. Got half a percent done of our workload. Well, that's not good. So-- Our optimization wasn't really a very good optimization, was it? We should take Shark again. Performance optimization is an iterative process. Sometimes you hit, sometimes you miss. Looks like we missed pretty bad.

Well, we took another time profile, and well, things are definitely different. And what we see here is that a lot of the functions are this maroon or red color, and those signify supervisor space samples. So these are things inside the kernel. And we have a problem with the time profile. We can't actually link a call stack from user space into the kernel and through the kernel, because there's a break point there when you enter the kernel with a normal time profile. So we want to figure out-- what's going on here, right? How do we figure out how we're getting into the kernel so often and what's taking time? Well, that leads us to a new feature in Shark, which I'll let Nathan explain to you. Thank you, Sanjay. Can we have the slides again, please?

So with Shark 4.2, we're introducing a new way to use Shark, a new configuration, and that is system trace. And so what this does, this is going to give you a way to understand how your code is interacting with Mac OS X, to look at multi-threaded behavior. And so the goal here is just to get an eye into these things that we can't normally see with time profile.

So the methodology is an exact trace of boundary crossings. So that's a lot different than a time profile, which is statistical, right? It's taking-- every millisecond it's recording a sample. This is going to record a call stack every time we enter the kernel from user space. And it doesn't require any instrumentation. It's all going to just happen automatically. It's system-wide.

And it provides three new views. The first is a summary view. And we'll go over each of these in more detail. But the basic view is a summary view that's going to show you overall what's going on in the system. Trace view, which is going to give you a list of all the trace records. So things we're looking at here are system calls, VM faults, when your threads were scheduled. And timeline view. And this is a visualization of the threads in your system. So you can see the threads running over time. And we'll talk about each of these. So the summary view. This is all about the biggest piece of the pie.

So here we have a pie chart. Where were we spending time? Was it in user space? Was it calling into system calls, handling VM faults, interrupts, and so on? Underneath the pie chart, we have a summary view, summarizing each aspect of the system. So first is the scheduler. A lot of times, it'd be nice to know how your threads are running: how often, how long, at what priority, how much time they get to actually run on the CPU before they're suspended and something else is run by the scheduler. Well, that's what the summary view is going to show you. And you can use this as a good check: if I'm expecting a CPU-bound thread and it's only getting scheduled for a few microseconds, something is wrong. On Mac OS X, I'll tell you, you should expect to be on the CPU in the range of tens of milliseconds before you get kicked off if you're actually CPU bound. And we'll talk more about how that works. System calls. So this is going to provide you a profile of the system calls in the system. It gives you the ability to attribute system call time to user space code. So that means it records the user space call stack, and it links it to a specific system call. So if you spend a lot of time in a system call, you can see, oh, this piece of code in my app actually called that.

Similarly, for VM faults, when you're taking virtual memory faults, these are otherwise transparent. They're handled by the kernel. And this will allow you to say, oh, this piece of code was actually triggering all those VM faults that maybe you see in VM stat or some other tool. You can actually pin these back to your source code.

And of course, trace view. This is just the complete list of all the trace records. You can look at all the call stacks, the arguments to the system calls. But best of all, this supports hierarchical sorting. So that's a useful technique for finding, say, the specific type of system call that took the most time. You can double-click on one of those, and you can go to that event in the timeline view. And this view gives you a way to navigate the trace. So you can interactively zoom and scroll. You can click on an event.

In this case, this is a BSD system call. And it's going to give you an inspector to look at all the details about that particular system call. You can filter it based on a variety of things. System, process, thread, a lot like the normal time profile per CPU, per event type. You can also color these threads based on various parameters, the CPU they ran on, priority, the reason that they were switched off the CPU, just so you can get a good overview of why things are happening the way they are in the system.

So what are all these little pictures? Well, these rectangles, these are the thread run intervals. So this is the duration of time that a thread ran for. And also we can see within the thread, there are system calls, these little telephone icons. And there's a black line underneath the system call. And that indicates the time that the system call took within that thread. Similarly for VM faults, these page icons: this is the time that it took to handle this virtual memory fault. So those are the system events that we're looking at.

So why-- a lot of you may wonder, well, why would my thread not run for the full amount of time? Well, if it's a CPU bound thread, truly CPU bound, well, it will probably run for the full allotted time slice or quantum before the kernel will kick it off and run something else, assuming it has something else to run.

Another possibility is that it's waiting on a resource, so disk or some other thing that isn't available right away. The kernel is going to pick another thread to run while this one is waiting, so that's blocked. It's also possible that while you were running, a higher-priority thread, like a driver's interrupt handler thread, was made runnable, and so the currently running thread will be preempted, and that higher-priority thread will be run.

So a little more about system calls. This is work done by the kernel on behalf of the calling process, and it's really just a safe way for user space programs to call into these various kernel APIs-- BSD, Mach, MIG messages. The distinction between these isn't all terribly important, but there's a variety of APIs here, just to keep that in mind. Virtual memory faults-- so Mac OS X, obviously, has a powerful virtual memory system. And what that's doing, it's dividing the overall address space into pages, so into equal sized chunks. And these are brought into physical memory on demand, of course, because we want to be able to have more applications running than we actually have physical memory to contain them all, potentially, or for the data that an application is operating on. And there are a wide variety of these faults.

The simplest, page in and page out, are just shuffling things around for the currently used pages. Zero fill: these occur when you malloc or allocate a large chunk of memory. The kernel is going to mark those pages to be zero filled on demand. And so the first time you touch those pages, you'll trap into the kernel, and the kernel will zero that page out automatically. Copy on write happens when you modify a shared page. So if you're sharing a page between two processes and you modify it, it's going to make you your own personal copy of that page.

And page cache hit. The page is resident in the buffer cache, but it's not yet mapped into your application. So it's going to take a fault. And a fault, you know, takes some time to get in and out of the kernel. So it's good to be aware of these. So this is going to give you a way to actually see these things. So for a demo of that, let's bring Sanjay up again, and we'll see how this works.

OK, so when we last left off with Einstein, we had this whole time profile full of system calls and supervisor space time that we can't account for. So what we want to do is-- and you can see that Einstein is still running. It's horribly slow. If we go to the Config menu in Shark, we can just select System Trace and then either hit Start or Option Escape, and that'll start taking a trace.

OK, so it's going. And you can see it actually stopped rather quickly. That's because a system trace is truly a trace. It's not statistical sampling. So by default, we're going to collect up to a million different events, which is probably all you can handle at any given time. But if you're some kind of super programmer, you can actually up that limit inside the config editor.

In this case, so what we're seeing is the summary view, which Nathan explained. And it gives you a pie chart. And it's breaking down time in terms of user space versus system space versus idle. And it's further breaking down supervisor space into faults, kernel, and interrupts. And in our case with Einstein here, you can see we're getting about 45% of the time in user space, so that's executing Einstein code. But at this point, we're getting almost a quarter of the time inside of system calls. And that's this orange section of the pie. And down here in the scheduler, you can see these are the Einstein threads underneath CFS Boink. You can see that the average thread is running for a few microseconds, in this case about 60, 63 microseconds. And if we look at system calls-- So remember, this is an exact trace of all the system calls made during our sampling period. You'll notice there are a lot of VM deallocates. So that's virtual memory deallocating. There's virtual memory mapping and thread create running. You can option click on one of these disclosure triangles, and it'll pop open the back trace to tell you how you got into that system call.

So in this case, you can see all the way from pthread_body up, we end up in that LALDemod function, which, remember, when we added our threading, that was the name of the function that we were calling. And it's calling pthread_create, which we showed you in the source, and then ending up, of course, inside of a system call.

Now let's take a look at this inside the timeline view as well. So by default, what you get is on the horizontal axis, the x-axis is time. And you have a full list of all the threads along the y-axis that are going on in the system. And at this level, well, it's just a bunch of yellow rectangles. You can zoom in either by rubber banding or by using the zoom slider up there at the top. And if you zoom in far enough, about two-thirds of the way through probably, the icons will start to appear. And in this case, we see a whole bunch of little telephones.

Those are all system calls. If you click on a phone, you get a backtrace for that system call. So you can figure out what caused the system call from your code. And in this case, again, you'll see your code is actually a hyperlink, which we'll click on in just a second. But in this case, what we're showing is a system call that created a thread. And then if you look here on the timeline, you can see a new thread was created. It's going to do quite a few of its own system calls. And eventually, it's going to end up at thread terminate. So what's going on here? Well, this thread was fired up, it did its work, and then it terminated. There's nothing abnormal about it, except that-- and if you click on the yellow, you'll see it ran for only, in this case, about 75 microseconds. So the problem here is that we just didn't give enough work to these threads to amortize the cost of thread creation and destruction. We threaded at too fine a level. Now if we go back to Xcode-- and you can do this by double-clicking on a link here; it takes you back to that particular system call-- you'll see we're at our pthread_create. And what we realized is, well, we're doing our thread creation and destruction inside of a couple of sets of nested loops. And if you scroll up to where this function is actually defined, you'll see here's one for loop for the arguments and another for the parameters. What we should try to do is move this thread creation and destruction outside of these loops so we can actually do a lot more work per thread. That'll help amortize all the cost of the thread creation and destruction. And if we go up to-- or sorry, down to TestLALDemod, you'll see how we implemented this inside of the Einstein code.

So again, there's a pthread create call and a pthread join call. But in this case, we're calling a different function, LALDMOD_OPTIMIZED. And if you jump over to that function, you'll again see basically what we just showed earlier. Now this is the body of the thread, except it has far more work to do, because the for loops are actually inside the thread rather than around the thread. So we implemented this. And again, you have this if you're following along on your laptops. If you go to Einstein and go to optimization level two, you'll see an immediate pickup. And that's because, well, we're doing the same work, but we're doing a lot more work every time we create a thread.

And so you can see, if we run it again-- if you hit the spacebar again-- if you remember, before, using a single thread of execution to calculate these data points, we were running at about 18 seconds. And in this case, we've shaved it down to about 10.7 seconds. So that's a nice speed up from threading. But we're not done, of course. We're performance guys; we can't stop now. What we want to do is take another time profile. We've threaded this thing. Let's see how it looks now, and let's see if we can attack another problem.

So again, if you hit the spacebar in Einstein, we'll take a time profile. And then you start up Shark with a hotkey. It's taken its samples, and once again, we get a heavy view. And you'll notice this time it looks similar to our first session, except you can now see in the process pop-up that we're taking up most of the CPU cycles on this system.

That's what we want to do, right? In all fairness to Einstein@Home, you can actually run this as a multiprocess app when it's a screen saver, so this wouldn't be a real optimization that the developers would probably want to use. But in our case, this was a nice educational example. So now we've got most of the CPU cycles. You can open up the thread pop-up, and you'll see we're not lying. We have two computation threads, a graphics thread, and that last thread is actually the one that's spinning up the other two. And so now what we want to do is look at our heavy view. And you'll notice, well, we've got a top function, but we also have cosine and sine. And developers look at that and they go, well, that's Apple's code. There's nothing I can really do about that. What would be nice? It would be nice if you could figure out who's calling those functions and charge it back to the parent, right-- whichever function you wrote that's actually making those calls. So if you do a control click, or if you're lucky, a right click, you can see our data mining contextual menu. And you can just charge cosine to whoever is calling it. So bam, cosine disappears. There's a data mining active notification up there at the top to show you that, okay, you've done something to your profile. And you can go on to sine as well. Now you can either charge sine, or, since there's not much you can do about libSystem, you can just charge the whole libSystem library back to the callers. And so you can see now sine has disappeared, and a bunch of other little functions have as well. And what we noticed was our top function was in the 80% range, and now it's taking up even more time. And so that's clearly where we want to focus our optimization efforts. So if we double click on that, you'll see again, well, we've seen this hot loop before. So you can see again, Shark is going to highlight each line according to time.
And whenever it can, it's going to offer us advice via this exclamation point. And if you click on that, you'll see, well, it's got quite a bit of advice for this loop-- everything from using reciprocal estimates rather than division, to unrolling the loop, to using Altivec to vectorize the loop. We decided to break this out into a couple of steps because there's going to be quite a bit of work here. The first thing we want to do is probably unroll the loop and try to get rid of that division, because we know floating point division is always slow. And if you scroll down a bit in the source, you'll see how we implemented that.

You can see we've just stamped out the code four times to help fill the pipeline slots. And rather than using division, we're using this FRES intrinsic to generate a reciprocal estimate. And if we go back to Einstein, and we go to optimization level three-- we call this unroll, no division-- and kick it off, what we'll get is-- well, we should be about 20% faster. We were at somewhere around 10 to 11 seconds, and now we're at nine seconds. So that's a pretty good optimization. But we're not done yet. We should take another profile and see what's going on.

So as expected, our top function is still our hot loop. And what we really want to do, if we double click on that again, is see its hot spots. OK, so now you'll see our new code is being highlighted. It's unrolled. Shark still has some advice there: well, you're doing some single precision, why don't you try using Altivec? Now, the interesting thing about this loop and about the Einstein app is that it's using some doubles and some singles within the loop. And we wanted to preserve correctness, so we decided to attack this loop in a couple of different steps-- because we don't want to detect gravitational waves where there are none. So we looked at this and we said, OK, what we need to do is break out all the single precision computation that we can so we can use vector instructions.

Now, some of you may be saying that's a little crazy at this point. No, it really isn't, because we're still shipping these Macs. These Macs are going to ship for a long time. And even after we stop shipping G5s or G4s with Altivec, those machines are going to be out there for years. And beyond that, if you attended the Altivec to SSE session earlier today, you'll know that once you make the step to vectorization, the porting from one instruction set to the other is not really that hard. The hard work is getting there in the first place-- making sure your data is in the right order and so on. But what we had to do in order to write any vector code, first of all, was get it down to single precision in a block. And if we look at our source back in Xcode, you'll see we started to implement that. What I needed to do was create a temporary buffer up there, called local tempf, to store some intermediate results. And I decided I'd just malloc that space up there, and then I'd free it down below after I was done using it. So we implemented that as optimization level four. And if you hit the space bar, we hit another big oops. Again, time is ticking off and not updating very quickly, because we're not getting a lot of work done. So again, something went wrong. And what we should do is take a Shark time profile.

And once again, we have a whole window full of red stuff. What do you do about that? Well, time profile isn't too informative in this case, because again, you can't track all the way through from user space to kernel space. What you want to do in this case is a system trace. So Nathan switched over to system trace, and we can take another profile.

And in this case, you'll notice, OK, Einstein is getting a little less than half of the time in user space. It's spending about a third of the time in system calls. And now we have an additional 15% in VM faults-- that's virtual memory faults. That's what happens when you allocate new memory and touch it, basically. If we go to the timeline view, again, we'll zoom in until we can see a whole bunch of icons.

And what we noticed was, OK, if you click on the first telephone there, or one of those telephones, we're doing a VM allocate. And you can see from the backtrace, it corresponds to the malloc call, which we just added for my temporary buffer. Then it's followed up by a VM fault, a zero fill-- so that's when I actually touch that allocated memory, and Mac OS X is then going to fault that page in. And that's followed by a VM deallocate, which corresponds exactly to my free. And so this made Einstein brutally slow.

If we go back to the source, however, we notice what we could do is, again, move that malloc and free operation outside of the loops-- to basically where the thread is created and destroyed-- essentially creating a static buffer for this operation rather than an overly dynamic buffer. So we made that optimization. And if we go back to Einstein and call that level 5, you'll see that, again, things have picked up pretty well.

If we run it again, we'll get a feel for how much things have changed from before. Now, we're not expecting a lot over optimization level 3-- this is really just scaffolding in place so we can use vector instructions at the next step. Now, if we profile once more with time profile, you'll notice again, we're back in user space. So we've eliminated all that problem of making system calls and faulting in pages. And we can once again click on our top function. You'll notice now we're in this new function. We still have the Altivec advice from Shark.

And we decided, yeah, we could write Altivec for this loop. It's pretty small. It's a good investment in time for what we're going to achieve here, because we expect Einstein to be running on a few million computers as screen savers for several man years. We could save the world a few gigawatts by writing the most efficient code.

And if you scroll down there, you'll see Altivec code. We've seen this in our demos before. It looks like regular C code, but it's actually very close to the machine-- just intrinsics that operate on four floating point variables simultaneously. So we implemented that, and we called that optimization level six.

We call it vector hybrid because it's still doing some calculations in double precision, and we needed that to pass all the correctness checks. So I'm hoping that we'll at least get this one into the shipping Einstein soon. If any of the Einstein guys are out there, come talk to me. But we went one step further, and I'm hoping we can get this one in too, which is a pure Altivec loop. And if we run that, you'll see we ended up down at about five seconds. So remember, our scalar code started at about nine seconds. So we got it to be about 80% faster using vector instructions.

And if we go back to our slides, let's summarize what we experienced here with Einstein. So this-- I'm not going to lie about this-- this porting, or this optimization process, went on for a few days, probably a few more days than my management knows about. But it wasn't too bad. You can see here that optimization level one was where we screwed things up. We had our threads that were just operating on a very small chunk of data, and we were paying a heavy, heavy price-- way too much for what we were getting. When we threaded with a little better model, we ended up with a nice dual processor speed up on the G5, 76%.

We unrolled and took out some division, and picked up about 20% more than step two. At optimization level four, again, we blew it. We added that malloc, and we ended up spending tons of time inside of VM faults and system calls. We fixed that, we got to vectors, and then we did even more vectorization work. And we ended up about 3 and 1/2 times faster than the original code. Which is not to say that you can't do any more-- we just kind of came up against the deadline for WWDC. But there is one more thing. I'll do my best SJ impersonation.

So we made a little announcement on Monday. And you may be wondering, what's the state of Shark running on the developer transition boxes and on future boxes? Well, you'll be happy to know it is running. If you've been down to the labs, you may have used it already. We have basic time profiling, because that's obviously the most common workflow for most of you. We also have performance event profiling. So that means we can look at performance monitor counters on the chip-- special hardware registers inside the Pentium. You can configure those and use those for generating hardware events like cache misses and so on. It's cross-platform, so you can take a session on one of those developer boxes, save it, and then open it back up on a non-developer-transition system, or vice versa.

We do have code analysis and optimization tips for the x86, or IA-32, architecture. And probably most important for those of you who are getting down to the low level and coming from PowerPC, we have a disassembler which lets you choose your syntax, either Intel syntax or AT&T syntax. A little explanation there: AT&T is the syntax that GCC uses; Intel syntax is the official Intel syntax. Don't ask me why they're so different, but they are. But you can choose whichever one you like.
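As a quick illustration of the difference (a generic example, not code from the demo): the same three IA-32 instructions written both ways. AT&T syntax puts the source operand first, prefixes registers with % and immediates with $, and encodes the operand size in a suffix; Intel syntax puts the destination first and uses bare names with bracketed memory operands.

```asm
# AT&T syntax (what GCC emits): source first, %-registers, $-immediates
movl    $42, %eax          # eax = 42
addl    %ebx, %eax         # eax = eax + ebx
movl    8(%esp), %ecx      # ecx = *(esp + 8)

; Intel syntax (the same three instructions): destination first
mov     eax, 42
add     eax, ebx
mov     ecx, [esp+8]
```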

And perhaps best of all, we have an integrated instruction reference guide. So we recently got approval from Intel Legal to get this done. And probably the best way to show this off is to go to Demo Machine 3, which is our-- developer transition system. And you can see Shark is already launched. We have Xcode here. If any of you saw our show back in 2002, you know the work we did with Flurry. That's a screen saver. We'll launch that.

We've added a little timer again and put it into its own window so we could get repeatable results. You can see that's the Flurry you know and love and have seen over and over, too much. And in this case, this was the base code again. And it's all scalar code-- it's portable C with OpenGL and a little bit of Cocoa to get the windowing. It takes about 10 seconds to calculate 4,000 frames. And what we want to do is, again, we'll use Shark to figure out what's going on during that time. We start it up again. You hit Start on Shark.

It's just like a Mac, isn't it? You can then double-click on the top function, in this case, "Update Smoke." There's your source code. There's some advice here-- in this case, integer division; in another case, floating point division. You can look at assembly, or both. If you have a wide enough monitor, this makes more sense. But there you see Intel instructions running on Mac OS. Everything's highlighted. If you don't understand Intel instructions, you click on ASM Help.

And if you highlight a particular line of your ASM, the documentation updates with you. So hopefully that'll help you get up to speed on speaking x86 if you're tuning down at this level. It was certainly a big help for us. But it wouldn't be enough just to take a profile. We actually wrote SSE2 code to optimize Flurry. And again, remember, we had written Altivec code. And you can get this-- this is a fat binary. It comes with our package, which you download. It's got both the Altivec code and the SSE2 code, so you can see exactly how we did this. I can tell you that the porting process from Altivec to SSE2 was probably about 1/10 of the time that it took us to write the initial vector implementation.

But if we go back to Flurry and run with vector optimizations in place, things go a bit quicker. So this is SSE2. It can do four single precision floats, just like Altivec-- it's a 128-bit vector. So we got about a 4x speed up here, from 10 seconds down to about 2.6. So it can be done. And believe me, I've written a lot of Altivec, so this was a big change for me. But it wasn't that bad.

If we can go back to slides now. So let's summarize what we've talked about here. Hopefully you can see that Shark is not too hard to get used to, to get running. You hit the start button, and for most of you, that's probably all you'll ever have to do with it-- except fix your own code. We have a lot of different workflows, and Nathan walked through a lot of those different paths. Basically, time profile and system trace, we think, are going to be your major weapons of choice to solve performance problems. But we have a whole host of other configs. And you can obviously create your own custom configs. We go down to the metal, whatever metal that might be.

And again, CHUD 4.2 preview is available right now off the developer.apple.com website. We just put out a package during WWDC, so please download that, give it a whirl, particularly if you're on the developer transition box. We'd love to hear what your feedback is. And of course, this doesn't cost you anything.

For more information, of course, the WWDC web page will give you links to our sample code. Everything we've shown you here is available. And there are some other sessions coming up right after this one: performance analysis of your memory code; a DevTools feedback forum at 5:00; and debugging parallel applications, for those of you writing large networked applications-- that should be a fun session, also at 5:00. If you need to contact us, the best thing you can do is use PerfTools feedback. That goes to Nathan, myself, and several other team members. Or Xavier Legros is our evangelist and chief marketeer-- bug him for features.