Hardware • 46:24
The Computer Hardware Understanding Developer Tools (CHUD Tools) are designed to help hardware and software developers collect data on and measure the performance of PowerPC Macintosh systems running their products under Mac OS X. This session walks you through using the tools to look for and fix problems in your code, with the aid of the hardware and OS performance counters.
Speakers: Sanjay Patel, Nathan Slingerland
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.
Good afternoon. My name is Mark Tozer-Vilches. I'm the Desktop Hardware Evangelist at Apple Computer in Worldwide Developer Relations. Welcome to Session 506, Performance Optimization Tools in Depth. Now, optimization means a lot of things to a lot of different people. It can be anything from trying to get your application to launch faster, access the network faster, or get higher frame rates. Bottom line is, it's about speed, it's about performance. It's about getting your application to run faster than it currently does today, or faster than maybe your competitor's application does. Bottom line is, there's a common denominator.
You're looking to increase performance, and you need to know where that performance can be increased. In order to do that, you need tools that allow you to understand where that work can be done. Apple has created a set of tools for developers, shipped freely with the Developer Tools, that you'll hear about today. I'd like to introduce a member of the team that created these tools, Mr. Sanjay Patel of the Architecture and Performance Group.
Thanks, Mark. So my name is Sanjay Patel. I'm in the Architecture and Performance group. We're going to start off today by talking a little bit about the G5 from a programmer's perspective--some issues you may run into as you're moving code from G3 and G4 over to the new systems. So to start off with: the PowerPC 970--that's the official chip name--is a very superscalar, very wide, and very deep machine. It's based on IBM's POWER4 architecture. It's a true 64-bit implementation of the PowerPC architecture.
It has the full AltiVec instruction set--all 162 instructions--implemented in hardware. It also has a high-bandwidth point-to-point interface, so this is a little different than a bus. What we actually have is direct connections between the processor and the memory controller. And we also supply automated hardware prefetch engines; what these do is start detecting patterns of memory accesses and prefetching those accesses into the local caches for you.
So here's a picture of the die. You might have seen this in the keynote or at the 970 presentation yesterday. What we have here is two independent load/store units, two fixed-point units, two independent IEEE-compliant floating-point units, the full set of four AltiVec subunits--the ALU as well as the permute--a branch unit, and a unit to handle condition register logical operations.
So here's another view, from the top. Instructions start in the L1 cache, go into fetch, and sit in some queues. Then they go to dispatch, where up to four instructions plus one branch can be dispatched on every clock. So this is a really wide machine. From there they're fed into 10 issue queues and on to the 12 execution units. So again, this is probably the widest machine you've ever dealt with.
How does this all compare against G4? Well, to keep everything--all these units flowing, the core can actually have over 200 instructions in flight, versus a little over 30 in the G4 if you count the completion buffers as well as the various queues. The pipeline stages have been expanded, so we're at 16 stages for a simple instruction versus 7 in the G4.
As I mentioned, we have two load store units versus one, as well as two floating point units versus one in the G4. There are two general purpose fixed point units, where in the G4 there were three dedicated simple units and one complex unit. Vector is similar. There's the ALU, which includes floating point, complex integer, simple integer, and the permute unit.
If we talk about the caches, there's quite a few differences here. First and foremost for programmers is the cache line size has changed. It's 128 bytes where it used to be 32 bytes. The L1 data cache is the same size, but it's a two-way associative design and write-through versus eight-way and write-back on the G4. The instruction cache has been doubled, so we're at 64K now, although it's a direct map design versus eight-way associative on the G4.
L2 cache is also doubled, so now we're at a full half meg. Also an eight-way associative in both G4 and G5. The replacement algorithm is LRU versus random on the G4. There is no L3 cache on the G5, whereas on the G4 you had up to two megabytes. Now that's partially made up for by the fact that processor bandwidth is just tremendously higher on the G5.
It's up to 3.5 gigabytes a second effective in each direction, both to and from memory simultaneously, versus a 1.3 gigabyte per second bus for the G4. On the other side of the memory controller, we've doubled the width of the DDR interface, as well as increased the clock frequency. So more than twice the bandwidth is here, available from the DDR chips. 6.4 gigs versus 2.7 gigs on the G4.
So, what does this all mean from a programmer's perspective? Well, there are going to be some things you're going to have to look out for as you're porting your code and optimizing it on this chip. And so the first thing you'll notice is that there are more pipeline stages here, which means instruction latencies have grown from G4.
So, how do you work around that in your code? Well, you should do more in parallel, right? So manually unroll important loops, or try compiler flags such as -funroll-loops with GCC. You can also schedule your code for this chip using -mtune=970 with the new GCC 3.3.
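As a rough sketch of the manual unrolling being described here (not code from the session's slides--the array and function names are made up), independent accumulators give this wide core more independent work to overlap:

```c
/* Hypothetical 4-way manual unroll with independent accumulators, so the
 * two FPUs and the long pipelines stay busy.
 * Compile with something like: gcc-3.3 -O3 -funroll-loops -mtune=970 ... */
double sum_array(const double *a, long n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;

    for (i = 0; i + 3 < n; i += 4) {   /* main unrolled body */
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)                 /* cleanup for leftover elements */
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```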
Now similarly, because the pipeline is longer, branch mispredictions are going to cost more. It's just going to take longer to recover from a mispredict. So there are several solutions you can use here. If you're coding in C, GCC offers __builtin_expect. That's for a very highly predictable branch, such as maybe exception code that you expect not to be taken very often--you can use __builtin_expect, typically wrapped in a macro. If you're coding in assembly, we have the + and - suffixes for all branches, to mark them as either highly taken or highly not taken.
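A minimal sketch of the __builtin_expect usage mentioned above; the likely/unlikely wrapper macros are a common convention, not something shown in the session:

```c
/* Common wrapper macros around GCC's __builtin_expect. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int *p)
{
    /* Tell the compiler the error path is rarely taken, so it lays out
     * the code to favor the fall-through (hot) path. */
    if (unlikely(p == 0))
        return -1;          /* rare, exception-style path */

    return *p * 2;          /* hot path */
}
```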
The best solution is to just not do branches at all, right? So in floating point, you have fsel--that's enabled with -ffast-math--and what that allows you to do is a conditional move in floating-point registers. In the vector domain, you have vsel, a very similar operation that you use with masks. In the integer domain, you have the carry bit, which can be used for min- and max-type operations. You can also use masks to avoid branches when you're doing integer code--what would effectively be conditional moves.
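For the integer case, here is a small sketch of the mask trick being alluded to--a generic branchless idiom, not code from the session:

```c
/* Branchless min/max of two signed ints using a mask.
 * (a < b) evaluates to 0 or 1; negating it gives all zeros or all ones. */
static inline int branchless_min(int a, int b)
{
    int mask = -(a < b);               /* 0xFFFFFFFF if a < b, else 0 */
    return (a & mask) | (b & ~mask);
}

static inline int branchless_max(int a, int b)
{
    int mask = -(a > b);
    return (a & mask) | (b & ~mask);
}
```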
And then feedback-directed optimization is something most programmers don't try, but it can be very effective on G5. If you can do a representative run of your program, let the compiler annotate that run, and then mark all branches as highly taken or highly not taken, that can really improve performance on this long pipeline.
So as I said, the data cache is quite different than it was on G4, and the most important thing here is that it has a 128-byte line. What can you do to work around that? Well, that's either a win or a loss for you, depending on your code, right? If you have a lot of locality, you're probably going to incur one miss where you would have had four misses on a G4 system. So you should design your algorithms and your data structures to move through memory sequentially, contiguously if possible. Okay? That's also going to trigger the hardware prefetcher, and this is very powerful, because it will amortize all of the latency out to memory.
So that's the next topic. Because it is a point-to-point interface to the memory controller, latency--effective latency may be higher than what you've seen on a G4 system. And that's because to maintain coherency, you have to go to the memory controller and then bounce back to another processor. What can you do to avoid any of those penalties? Well, software prefetching--so there are several instructions. And of course, the hardware prefetcher is the best solution because it's self-paced, and it'll be synchronous with your code.
As you miss, the hardware detects those misses, detects the pattern, and prefetches lines for you. You can also batch your loads. Say you need to access several pieces of data and you know you're going to need them in advance: try to group those loads together, because the bus can support several misses simultaneously.
So the Data Stream Touch (DST) instruction from the AltiVec instruction set is execution-serializing on the G5, because it's mapped onto the existing hardware prefetch mechanism. So what can you do to avoid DST? Well, first of all, you can probably just remove it. It is only a hint, so there's no guarantee that DST is going to be effective for you in the first place.
The preferred solution is to rely on the hardware prefetcher, so assuming you have contiguous memory accesses, that's going to automatically work for you. Now, if you have non-contiguous accesses, we recommend that you replace a single DST with several DCBT instructions. That's a Data Cache Block Touch. So you would issue one of those for each line.
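As a sketch of what replacing one DST with per-line touches might look like--this assumes GCC's generic __builtin_prefetch, which on PowerPC typically lowers to a dcbt; the buffer and function here are hypothetical:

```c
/* Touch each 128-byte cache line of a buffer before using it,
 * instead of issuing a single DST over the whole region. */
void prefetch_buffer(const char *buf, long bytes)
{
    const long line = 128;                 /* G5 cache line size */
    long off;

    for (off = 0; off < bytes; off += line)
        __builtin_prefetch(buf + off);     /* emits a dcbt-style touch */
}
```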
So, legacy code that uses DCBZ, which is the zeroing of a cache block, or DCBA, the allocation of a cache block, is going to perform very poorly on G5. Why is that? Well, DCBZ is emulated effectively to only work on 32 bytes, and we had to do this to ensure backwards compatibility with existing code.
DCBA is not implemented on G5, so that's just going to be an illegal instruction. You'll end up in an exception handler and then bounce back to your code. So this is just going to be tremendously bad for any performance-critical code, and the only reason you would have used these instructions is because it's performance-critical.
So the solution is: get the DCBZs and DCBAs out of the code. Again, DCBA is just a performance hint, so removing it shouldn't affect any kind of functionality. If you do need to zero cache lines, we would recommend that you use memset or bzero rather than trying to roll your own zeroing functions. But if you do need a DCBZ type of function, or an allocate of a cache line, we have this new mnemonic called DCBZL, and that's going to zero out whatever the native cache line length is for any system, whether it's G3, G4, or G5.
So here's an example of how to use DCBZL. Now, for those of you who have used DCBZ, you'd say, "Well, the original definition of DCBZ was simply to zero out the native cache line length, so what have we changed?" Well, the reason we have to have this new mnemonic is because most programmers ignored that warning. They coded for 32 bytes, and now they're going to get bitten.
So what we'd much rather have you do is code based on line size--effectively stride through memory based on the line size, which you can get from the operating system, whatever the current line size is on the system. And of course, if you're just doing a memset operation, we'd much prefer that you use memset or bzero.
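A hedged sketch of the line-size-aware approach: hw.cachelinesize is the sysctl Mac OS X exposes for the current line size, and the inline dcbzl assumes both a cache-line-aligned pointer and an assembler that accepts the new mnemonic--treat the details as illustrative rather than as the session's slide code:

```c
#include <sys/types.h>
#include <sys/sysctl.h>
#include <string.h>

/* Zero a cache-line-aligned block one line at a time, using whatever
 * line size the running system reports (32 on G3/G4, 128 on G5). */
void zero_block(char *p, size_t bytes)
{
    int    linesize = 32;                  /* safe fallback if sysctl fails */
    size_t len = sizeof(linesize);
    size_t off;

    sysctlbyname("hw.cachelinesize", &linesize, &len, NULL, 0);

    for (off = 0; off + linesize <= bytes; off += linesize) {
        /* dcbzl zeroes the full native cache line on any CPU; p must be
         * line-aligned so we don't clobber neighboring data. */
        __asm__ __volatile__("dcbzl 0, %0" : : "r" (p + off) : "memory");
    }
    if (off < bytes)
        memset(p + off, 0, bytes - off);   /* tail that isn't a full line */
}
```

For a plain zero-fill, of course, a single memset or bzero over the whole buffer is the simpler and recommended route.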
So, synchronization primitives--this would be locks, syncs, isyncs--are going to be more costly on this chip than they were on G4. And that's for two reasons: one, the longer pipelines, and two, the longer latencies to memory. So this is a tough one, but what you have to do is just make sure all your locking is absolutely necessary. Minimize the lock hold time so you're not contending for locks as much. And of course, ensure that each lock is in its own cache line so you don't have fighting between processors.
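One common way to give each lock its own cache line is to pad and align the structure holding it; a rough sketch (the 128-byte figure is the G5 line size discussed earlier, and the struct itself is hypothetical):

```c
#include <pthread.h>

#define CACHE_LINE 128   /* G5 cache line size */

/* Pad each lock out to a full cache line and align it, so two processors
 * contending on different locks never share (and ping-pong) a line. */
typedef struct {
    pthread_mutex_t lock;
    char pad[CACHE_LINE - sizeof(pthread_mutex_t)];
} __attribute__((aligned(CACHE_LINE))) padded_lock_t;

static padded_lock_t queue_lock;   /* hypothetical per-subsystem locks */
static padded_lock_t stats_lock;
```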
So scheduling is crucial for this chip, and it's going to require recompiling or even hand scheduling for optimal performance. So what we recommend is that you use GCC 3.3, which has a pipeline and scheduling model for the 970. And the other thing you can do for really performance-critical code is understand dispatch group formation using Shark. And for those of you who don't know what Shark is, we'll get to that in just a minute.
So in summary, this is a very parallel core. You have basically two of each unit--LSUs, FPUs, FXUs--lots of renames, lots of instructions in flight. So if you have very serial code, it's simply not going to take advantage of this core. So what you want to do is, of course, unroll and schedule. You can also use AltiVec to calculate up to a theoretical peak of 32 gigaflops on a dual 2 gigahertz system.
The 970 has full-precision hardware square root, so you don't need to make calls to any libm functions for square root anymore. If you're using GCC, we offer the -mpowerpc-gpopt flag for that. We also have native long long support, because this is a 64-bit chip: it can natively do doubleword arithmetic in leaf functions using -mpowerpc64.
So again, the system and the chip are all designed for high bandwidth. There's incredible bandwidth to the L1 cache, between the caches, and out to memory. 32, 64, and effectively 3.5 gigabytes per second in each direction on the bus. Take advantage of that by using streaming, using software streaming and hardware streaming prefetch.
So again, the optimal cache control instruction, rather than a DST, is DCBT to prefetch. If you have a DST that covered a lot of ground--multiple cache lines--then issue multiple DCBTs in its place. Don't use DCBZ, because that's emulated. Use the DCBZL instruction, but be careful if you're using it: make sure you account for cache line size.
And again, DCBA and DCBI are illegal, so those just need to be removed from code. Okay, so we've talked a lot of theory here. How do you actually get down and dirty with your code and figure out what's going on? Well, that's where CHUD comes in, so I'd like to introduce Nathan Slingerland.
Okay, thank you Sanjay. So hopefully a lot of you were introduced to CHUD tools last year at WWDC, at least the version 2 tools. And this year we're happy to give you the version 3 of the tools, and we have a lot of enhancements and improvements to that. But basically the CHUD tools are a suite of low-level performance analysis tools written by Apple's Architecture and Performance Group.
And they give you access to the performance counters in the processor, the memory controller, and the operating system, and using these counters with CHUD you can find problems in your code and improve your code. And of course it's free. It's on the Developer Tools CD, and it's also on the website.
So in 3.0, we have several classes of tools, profiling tools, so tools to find out where things are happening. These include Shark. So this is the successor to Shikari, if you've used that. And we'll get to all the great new features it has. Monster is a spreadsheet for performance events, and it has a lot of great new features, too. Saturn is a new tool for visualizing function calling behavior.
For tracing--so if you've ever done AltiVec or very processor-critical code, sometimes it's useful to see how things are actually happening on the processor--we have Amber, to take an instruction-level trace of a particular program, and then acid, a program to analyze this trace; or SimG4, a PowerPC 7400 cycle-accurate simulator; and soon SimG5, so you'll be able to simulate for the PowerPC 970.
And of course, we provide the CHUD framework. So this is the API we use in all our tools. And you can use this to make your own tools or call into the CHUD tools and have them do what you need. Okay, so the performance counters. These are in the processor and memory controller and operating system, as I mentioned.
And what these do is count interesting low-level performance events--things such as cache misses or instruction stall cycles that you'd otherwise have to use a simulator to find out about at this level. Page faults in the operating system--you can find out when those happen. And the CHUD tools let you configure these counters, tell them what to count, record the counts, and then you can use the tools to look at the results.
Okay, so the first tool we're going to talk about is Shark. Shark is a system-wide profiling tool. So you can use it to profile a process, a thread, or the entire system if you want to look at that. The most basic usage is just a time profile. So this will show you where the hotspots are in the system, where the system is spending its time. You can also use any of the performance counter events. So you can get an event profile to see where hardware events relate to your source code. For example, where cache misses might be coming from in your code.
We capture everything: drivers, kernel, applications. So if you're a driver or kernel extension writer, you can also use CHUD to see the call stack and find out where things are coming from. And of course, we're very low overhead--this is all handled inside of our own kernel extension.
Once you've gathered a session that you're interested in--taken the samples that you want to look at--we provide automated analysis: we annotate your source code and its disassembly, and give you optimization tips about how to improve your code. And there's also static analysis. You can use this to just look for, for example, DCBA instructions in your code, if you want to make sure you catch every instance of that. There's a command line version, so this is scriptable, and you can also telnet into a machine and use it from the command line.
And of course we can save sessions and you can give them to your colleagues, pass them around, whatever you'd like to do. So Monster is a more direct interface to the counters. This lets you look directly at the results of the counters. You can configure them using Monster, collect the PMC data based on timed intervals, hotkey, event counts, every 10,000 cache misses, for example. And then you can view the result in a spreadsheet or a chart.
In addition to just the raw performance counts, there is a built-in shortcut language--an infix computational language in which you can program your own metrics--or you can use the built-in ones: things like memory bandwidth over the memory bus, or cycles per instruction, a variety of things. There's a command line version of Monster provided as well for your scripting and remote sessions, and you can also save and review these sessions.
Okay, so the best way to see how to use these tools is with a demonstration. What we're going to look at is a program called the Noble Ape Simulation. This is written by Tom Barbalet, and he's simulating apes on a tropical island. These apes can think, and he's simulating the biological environment--the food and the other animals on the island--as well as the cognitive processes of the apes. I mean, obviously, simple cognitive processes such as desire and fear and those kinds of things. So this is open source, and for more information, please check out his website at nobleape.com. So let's switch to the demo machine
and see Noble Ape in action. Okay, so this is the map window. This is the island, right? And each red dot represents an ape running around the island doing its thing. And we can select one ape at a time--that's the ape with the red box around him there.
And for this ape we can see his brain--what's happening in his brain--in the brain window, right? And of course, any good performance study requires a performance metric, and our metric is ape thoughts per second. For the original code, we're getting around 1,200 or 1,300 or so. Okay, so the first thing we'd like to do is launch Shark, and we'll see what's happening in the system.
Okay, so this is the main Shark window, and it's really pared down and simple, just to let you start your work. By default, we come up with the time profile, so this would be the most common thing you'd use it for. We provide a bunch of other built-in shortcuts and configurations. You can, of course, you can create your own using any of the performance counters. But for now, we're going to use time profile.
There's a start button here for starting sampling, but there's also a global hotkey, so Shark doesn't have to be in the foreground--it can be in the background, and you can still start it. So we'll use that hotkey. We'll take a 5- or 10-second sample and see what's happening.
So here's the profile. What we've done here, this is just listing the functions that it sampled inside Noble Ape from most sampled to least sampled. So when you're optimizing, you wanna work on what's running the most of the time, then you're gonna get the most benefit out of optimizing that code.
So we see that Noble Ape is 50% of the system. This is the process pop-up, right? And this is like top--it lists what was running in the system. It's kind of strange that it's only 50% of the time, even though we know that we're CPU bound. Well, if we go to the thread pop-up here, you can see that, in fact, it's single-threaded. And we're running Noble Ape on a Power Mac G5 that has dual two-gigahertz processors.
So, all right, next step: we want to thread this thing, right, since we want to take advantage of both processors. So this is the heavy view that we're looking at--there's a heavy view and a tree view. In the heavy view, we can open up these disclosure triangles and see how we got to this heavy function. So we started in main, which called plat cycle, which called control cycle, then cycle ape troop, and then this important function, cycle troop brain scaler.
So we know our code, and we know that we can't really split the processing of things between simulation cycles, right? The way this app works, there's a, you know, it does a simulation cycle, and, you know, within each simulation cycle, it's processing a bunch of ape simulation cycles. Well, the simulation cycles themselves are not independent. They depend on one another, right? But we know that the apes are independent. They're independent thinking apes, so, you know, we can parallelize it at that level. We can process the apes in parallel.
For each simulation cycle. So that's what we did: we threaded it to split up the apes--we have 64 apes--evenly between two threads. So you'll note that the brain rate was originally around 1,200; when we do this, we get almost 2,400--not quite. That's pretty good. We've gotten a nice speedup just from threading, from taking advantage of that second processor. So let's profile again and see what's going on. Okay.
So now we can see that in the process pop-up, we can see that Noble Ape is taking up a much more significant amount of the system. That's a good thing. And we can see from the thread pop-up that we have our main thread, that's the 9.2%, which spun up two computational threads, each about 40% of the time.
So, the next thing we'd like to do is actually optimize this function. This function is important to us, right? It's almost 90% of the time is spent in here. So, if we double click on this, Shark will present us with our source that's been highlighted where we've sampled. So, this tells us where in our source code we're spending time. So, it's actually inside of this Cycle Troop Brain Scaler, it's just--it's this for loop.
So we can see that this for loop actually represents about 94% of the time in this function. Okay. So I should probably talk about a couple of these other things here. Oh yeah, well so the scroll bar at the side is like an overview. You can easily jump to the hot spots in your code. Right. So it's colored accordingly, right. A brighter yellow means more samples. At the top we have a source file list. This is, you know, sometimes you can have more than one source file contributing to a particular function with header files and like that.
And this function popup is like what you have in Project Builder. You can easily jump to different functions. And then we have the edit button. So what this allows you to do is it will jump into Project Builder at the same selected line. Right. So you can easily go to where you want to edit and change something once you know where the problem is. Okay. So let's go back to Shark.
and what Shark does is it provides us with advice. That's what this little exclamation point button is--it's advice for us, calling something out. There are two bits of advice here, but we'll focus just on the first one. So this loop contains 8-bit integer computation, and obviously, if you're spending a lot of time in this 8-bit integer computation, it might be a good idea to use AltiVec to really improve the speed of this code. So that was what we did; that was our next step. So let's go try that out.
So, that's a nice speedup, but we're not done yet. That's good. So let's profile again and see where we're spending time now. We'll double-click again on this. We see the vector function shows up at the top of the profile, and this is the vector code. So a lot of you have probably--if you've used Shikari, you saw the assembly view.
If you double-click on any of these--any source line, it's going to jump to the assembly view that you're familiar with. And if you double-click on the assembly, it'll jump back. And it's going to highlight the line--the instruction or instructions that correspond to that source line. So you can see--this can give you an idea of how good the code gen is for your compiler, right? right, how many, what kind of instructions it's generating for each source line.
And, okay, so if you've seen this before, the columns here: we have samples, how many times we sampled each instruction; the address, obviously; the instructions themselves--and you can switch between various views of the address. Cycles is the latency and throughput of a particular instruction--these are for the 970, right; that's the CPU model down at the lower right-hand corner there that tells you that. And in the comment column, various things about this code, and of course the source file at the end. Now, one of the nice things that we give you is the ability to visualize dispatch groups. So if we go to this option, we can turn this on.
If you remember the diagram that Sanjay showed earlier with dispatch, we can see here how they dispatch--usually between four and five instructions in each group, each cycle. So this can give you a good idea about how things are actually behaving on the machine. The other thing we provide, too, is this functional unit and dispatch slot utilization graph.
So you want to talk a little bit to that? Right. So on 970, as a programmer, the key bottleneck you'll have to face is maximizing dispatch group width, because that's one of the narrower points in the core, because it's four instructions wide plus a branch. So what we offer you here in dispatch slot utilization over here on the right is the average group size. You can see how effectively is your code taking advantage of this really wide issue machine, wide dispatch machine. And dispatch defines where instructions are issued to, which functional units. So here you see a map of the 12 functional units that I talked about.
And you can see that the units are symmetrical, like the two LSUs here. If there's a big imbalance between, say, one of these, the LSUs-- one is doing a lot and one's not-- well, that's something that you could probably correct with scheduling or reordering your code, because what you want to do is balance the execution units.
Right? You don't want half the chip doing all the work and the other half sitting idle. All of that is defined by dispatch groups. So that's why we've put dispatch group modeling into Shark. - Right, so you can select a few, right? I mean, you can do the dynamic selection, right? - So this is dynamic. You select a few instructions and it'll tell you where they got mapped to. The charts and the numbers will update with it.
Great. So this can help you, obviously, on the Power Mac G5, tuning your code. But let's go back to the source view. And if we look a little closer at our vector code, you remember we vectorized this inner loop that we saw was taking up 94% of the time, right? Well now that loop is still important, but it's taking up a smaller portion of the time in this function. We can see also that up at the top and bottom of this function, we're spending more time-- more time relatively speaking-- in the scalar code, right? The two loops that we didn't touch.
So this code is very similar to the other loop--it's almost exactly the same--and I believe Shark will point this out. Right, it's saying, yeah, hey, vectorize this loop too; this is important now. So that was the next step: vectorize the rest of this, the entire function, as well as a few other optimizations. So let's try that out.
Starting around ten thousand or so. So that's another 40 to 50% we can eke out by vectorizing the rest of that function. You can see some of the gorillas have gone off into the water. You can bring them back to life by dragging them back to land. They're a bit suicidal. Yeah, they just like the beach.
All right, so we got a 14x or 15x speedup--that's pretty decent. We hope you can all do that well in your code too. So the next step--okay, no, don't do that--we have a few more things to show. There are a couple of other things we didn't really talk about. Shark allows you to manage the sampling sessions; we've taken about four sampling sessions here.
You can either look at these in parallel, right, in a multi-window mode, or you can put them back and deal with them one window at a time. The multi-window mode is nice because you can put them side by side. There's also a session drawer. You can quickly switch between them in single, in the single window mode. And of course, as we mentioned, you can save sessions.
There's also ancillary information included. So whenever you take a session, it records what kind of machine it was and gives you some space to write notes to yourself about what was happening on this. So this is archival. You can keep it around, remember what happened. There's also an extensive user guide included. So it's just online here. Please read it. There's lots more information and features that are covered in there. Okay.
So, one other thing we wanted to look at, we want to use the Monster Tool, and we want to look at some more--some of these performance counters in depth. So, this is the main Monster Window, and this is the spreadsheet, right? So, on the left-hand side, we can see the various performance counters that are on this system, and on the right is the spreadsheet itself. And the shortcut pop-up is similar to the sampling config selection on Shark. Same thing. But what we can do is we can edit these shortcuts. So, if we go to the shortcut tab.
And we're going to look at DDR memory bandwidth. So this takes a few of the U3 counters--the memory controller counters on this Power Mac G5--and calculates the number of megabytes transferred over the memory bus. The way it does this is it counts the number of beats, and each beat is 16 bytes on the data bus, so it can multiply that out and figure out, for every 10-millisecond sample, how many megabytes that means. So we have a session--let's open that.
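The arithmetic behind that shortcut is simple; as a sketch (the function is illustrative, not the actual U3 counter plumbing):

```c
/* Convert a beat count from the memory controller into bandwidth.
 * Each beat moves 16 bytes; Monster here samples every 10 ms (0.010 s). */
double beats_to_mb_per_sec(unsigned long long beats, double sample_sec)
{
    double bytes = (double)beats * 16.0;
    return (bytes / (1024.0 * 1024.0)) / sample_sec;   /* MB per second */
}
```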
[Transcript missing]
So you can see how as we optimize the code, we were able to take better and better advantage of this bandwidth, the massive bandwidth that's available on the Power Mac G5.
So let's go back to the slides, I think. Right. So you might wonder--I'll let Sanjay talk--you might wonder, how does this compare against the G4? So we started out with the regular scalar code. On the G4, you get about 1,200 ape thoughts per second; we were getting closer to 1,300 on the G5.
Well, you say, the G5 is running at a much higher frequency, and yet we're barely getting a little more than 10% faster performance here. So what is the bottleneck? Well, the initial bottleneck was all integer performance. With the longer-latency, longer-pipeline instructions, you're just not going to see the full frequency increase show up as a performance increase. So when we went to threaded, we started to expose the better bandwidth on the bus.
So because we have two processors, and they each have independent point-to-point connections to the memory controller, we're running much faster than the G4. And then we really break this open when we go to vector. The G4 does well--we get a 2.5x or so speedup from using vector. But on the G5, we don't hit any bandwidth limit yet, so we get a full 4x improvement from going to vector.
And then with the vector-optimized code, you can see again the G4 does pretty well--a nice speedup--and the G5 gets a 60% speedup. And if you look back at that Monster chart, you can see we were getting peak bandwidths of 2.5 gigabytes a second on the bus.
So we're still not done yet, but we didn't have more time to optimize before this demo. But clearly, you know, there are a lot of resources there. And if you start with basic code, well, you might get a decent speedup over a G4. But if you put a little effort into it, you can get very big speedups if you take advantage of AltiVec, take advantage of all the bandwidth that's available to you.
Okay, so a third tool we'll talk a little bit about is Saturn--another profiling tool. Shark, which we've talked about, provides a statistical profile, right? It's periodically interrupting the system, recording where you are, and then going on, and afterwards we say, well, wherever we got the most samples, that's where the most time was spent. Well, Saturn is going to instrument every function in your source code to give you an exact profile. This allows you to visualize the call tree; it uses GCC to instrument each function at entry and exit.
And it records this function call history to a trace, and with this we can get call counts, so how many times each function was called, performance monitor counts, it can use those, as well as the execution time of each function, right, added up. So if we look at this, this, you know, at the top we have the familiar call stack, you know, call tree view that says, you know, how--where we spent time in each function and its descendants and like that. And at the bottom we have something that's--it's viewing the same data but in a different way.
It's plotting call stack depth vertically versus time on the horizontal axis. And what you can use this for: if you see a very sharp, narrow spike, that means you're spending a lot of time in calling overhead, right? You're going through many, many function calls and not getting a lot of work done if it's not a wide call stack. So, okay. That's it for Saturn.
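For the curious, GCC's entry/exit instrumentation--which is presumably what Saturn builds on, though Saturn supplies its own hook implementations--works roughly like this sketch, assuming the measured code is compiled with -finstrument-functions:

```c
/* Compile the code under measurement with:  gcc -finstrument-functions ...
 * GCC then calls these hooks on every function entry and exit; a tool like
 * Saturn links in real implementations that record a trace. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    /* record (this_fn, call_site, timestamp) -- an entry event */
}

void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    /* record the matching exit event, so the call tree and per-function
     * times can be reconstructed afterwards */
}
```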
And of course the CHUD framework. You can use the CHUD framework to instrument your source code. You can use it to start and stop Monster or Shark. You can also write your own performance tools. A lot of the functionality--almost all of what's in Shark and Monster--is exposed in this framework.
So you can set up and start and stop the PMCs, collect information about the hardware, a lot of things that otherwise you'd have to go through I/O Kit and it might be some extra--a lot of code to get at. And of course an extensive HTML reference guide is provided.
Okay, so here's an example of using the framework to remotely control either Shark or Monster. What you do is you pick the profile that you're interested in and then place either of those tools in remote mode. That means allow other tools that want to connect and control the start and stop of the counters to do so.
So first we initialize and acquire remote access--that means make sure that another tool is actually waiting for us to do something; this will block if the other tool isn't currently waiting. Start the RemotePerfMonitor--for this call you can give it a label that's going to appear in the tool. Do whatever it is you're interested in, whatever the code of interest is. Stop the RemotePerfMonitor, and of course release, to be a good citizen.
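A hedged sketch of that sequence using the CHUD framework calls as I understand them--check the exact names and header against the framework's HTML reference mentioned earlier; do_work() is a placeholder for your own code:

```c
#include <CHUD/chud.h>   /* CHUD framework header; adjust to your SDK */

extern void do_work(void);   /* hypothetical: the code you want measured */

void measure_code_of_interest(void)
{
    chudInitialize();

    /* Blocks until Shark or Monster is waiting in remote mode. */
    chudAcquireRemoteAccess();

    /* The label shows up in the tool's recorded session. */
    chudStartRemotePerfMonitor("code of interest");

    do_work();

    chudStopRemotePerfMonitor();

    /* Be a good citizen and give remote access back. */
    chudReleaseRemoteAccess();
}
```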
So that's one way to use the framework--to instrument. The other way is more direct: you can set up the counters directly and read them directly. You initialize and acquire the sampling facility--there's only one set of performance counters in the system, right, because there's one set of hardware, one physical device, so you have to acquire the sampling facility; the kernel extension that we have manages access to it. Set up the counter events, clear the counters, start the counters, do whatever it is you're interested in, stop the counters, and then process the result.
We also provide some lower-level tools. So if you've ever done, as I mentioned, AltiVec programming, or any kind of really intense tuning, you'd like to know what's happening on the processor core. Why is it slower than you expect? What's happening? So with Amber, this is a command line tool to record an instruction trace to disk.
So this is for all the threads in a given process; record that to disk. And then you can run that trace file through acid--this is a trace analyzer. It gives you some interesting trace statistics, can plot the memory footprint that the trace walks through, and can point out problematic instruction sequences.
And then you can also run this trace through SimG4 for the 7400 processor, or, when it's available soon, SimG5, the PowerPC 970 simulator, to know exactly what's happening. Okay, so at this time, we'll turn it back over to Mark for the session wrap-up. Thank you.
So to give you a little bit of a roadmap of other sessions that are going to be valuable to you: Tuning Software with Performance Tools, session 305, this afternoon at 5:00 in Presidio, and then Mac OS X High Performance Libraries. Again, another set of tools--really, libraries are not tools, but venues here for you to be able to eke performance out of the operating system. Again, optimization--just to go back to my introductory statements--should not be an afterthought. Optimization should begin when you first start writing your code.
It should be part of the process of how you want your code to be written, so you're not going back after the application is written and thinking, well, maybe I should thread my application, I should utilize threading. An application like Noble Ape--we can add threading because it's not a lot of code. But if you're talking about a much larger application--a word processor, a graphic imaging and editing application--then you're looking at a whole redesign, possibly, and then it becomes more frustrating.
So again, optimization should be something you do both at the beginning of your project and again once you've finished it--how do you get more performance out of it? You know, the other thing I wanted to mention is that the G5 PowerPC processor is a unique architecture, much different from the G4, as Sanjay pointed out in his presentation. And for that reason, we want to make sure that you have as many resources available as possible to understand what those differences are and how to take advantage of them. All week, we have been running a G5 optimization lab on the first floor in the California room.
I urge you to visit, talk to the many engineers that have made themselves available, spending countless hours. Monday and Tuesday, we were there until midnight. You'll be able to talk to Sanjay, Nathan, and several other engineers, both from Apple and IBM, throughout the week. As well, we'll have follow-on kitchens available to you as developers in the developer program at Cupertino, following the developers conference.