WWDC02 • Session 102

Mac OS X Performance Optimization with Velocity Engine

Darwin • 1:02:03

Apple offers supercomputing performance to both consumers and professionals in its entire line of desktop computers with Velocity Engine and the PowerPC G4 processor. Processor-bound operations such as memory copies, string compares, and page clears can be offloaded to the vector engines of the G4. This session provides an overview of the Velocity Engine vector libraries, including recently optimized double-precision FFTs, MDCT, and BLAST. Apple's numerics engineers will discuss techniques for general vectorization and for identifying code that can be vectorized.

Speakers: Ali Sazegari, Ralph Brunner, Robert Murley, Bill Kincaid, Nathan Slingerland, Sanjay Patel

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

Good morning. My name is Mark Tozer-Vilches. I'm the hardware technology manager for developer relations. This morning is session 102, Performance Optimization with Velocity Engine. For those of you who ever find yourselves doodling, doing square roots, or factoring polynomials while you're on the phone, this is the session for you. So quite a way to start the morning with some cool math. We've got some great content, some cool demos, and I'd like to bring up Dr. Ali Sazegari, the head of the Vector and Numerics group.

Hi, I'm Ali Sazegari, and welcome to our session. Today, we're going to talk about a couple of different things. The first topic that I'm going to talk about is what we've done for Jaguar on the numerics side. Then, we're going to dive into the benefits of using the vector units, either by itself or using our libraries that we have in Jaguar. And in the last section of our talk, we are going to show you how to use the new performance tools that are absolutely crucial to your success. So, let's get started with this, if I can figure out what to do.

All right, so numerics libraries in Jaguar. You know that our numerics library, the LibM, that you've been using up to 10.1.3 was based on a freely distributable library from Sun, FD LibM. We changed that completely, and we're going to go through that. Then we're going to go talk about the benefits of vectorization.

And if you don't want to vectorize your code, vector libraries in Jaguar are going to be very helpful. Then some optimization techniques, details of what to do about your code, how to make it a little bit faster. Hopefully you're going to use those in your code. And then profiling tools.

Okay, what you will learn. Available libraries. We have expanded the available libraries in Jaguar, and as we go along in Mac OS X we'll have a richer and richer set of libraries. Some optimization techniques. The data injection method that we're going to talk about. Vector versus scalar algorithms. We are going to draw your attention to the LSU bottleneck on our machine, and to using the instruction set architecture details of the processor if you really want to squeeze a lot of performance out of this machine. The whole idea here is that increased parallelism is really key for better throughput on PowerPC. And again, at the end, we're going to talk about profiling tools. Thanks.

What's new in Jaguar? On the numerics front, we have revamped LibM. It is now based on MathLib V5. For those of you who come from the OS 9 platform, you are familiar with MathLib V3. We went to MathLib V4, and I'm going to talk about that a little bit later.

And now, MathLib V5. We also have brought FixMath up to snuff again on Jaguar. On the vector domain, we have an expanded signal processing library. We also have a full standard basic linear algebra subroutines, which you can build your algorithms on top of. And then we're going to show you some vectorized technologies that are going to come to Jaguar.

New LibM in Jaguar. MathLib V4 shipped in 10.1.3. We had pretty bad performance before that because of the FD LibM that was used. We switched that completely and went to what was the core of OS 9. We ported that. So 10.1.3 and beyond basically has parity with OS 9's performance.

Coming in Jaguar is MathLib V5. MathLib V5 is faster than V3, which is in Mac OS 9. And the important part to remember about V5 is that it is completely IEEE 754 and C99 compliant on double precision. The implementations which are available are on PowerPC and IA-32 for Darwin users. This is a no-holds-barred, high-performance, extremely accurate, and very robust mathematical library that we're pretty proud of.

New fixed math in Jaguar. For those of you who continue to use fixed math, we have ported what we had initially done for OS 9. Now in Jaguar, fixed math is going to be floating point based as it used to be in Mac OS 9. I would like to emphasize that even though fixed math is extremely fast now, even a little bit faster than what OS 9 was, I highly recommend that you don't use fixed math, that you migrate your code to using floating point because floating point on PowerPC is significantly faster than what we do in here.

Okay, now it's a good time to talk about some general guidelines for vectorization. What we do is, basically, this page here describes what we do every day at Apple for everything we just come across. And my first and most important suggestion here is to profile your code, profile and profile, and then when you're done, profile some more.

Because you really want to know where your bottlenecks are and then move on from there. Use CHUD to profile. CHUD is the Computer Hardware Understanding Development tools that we're going to talk about in detail later on in this session. It has a lot of tools; it will show you where the bottlenecks are and where you should go to make your code fly. A lot of times, developers may think that a particular routine is really the cause of the problem. And after working on it for several days and making it a lot faster, they find out that they really haven't budged the performance at all.

And this has happened to us before. Code has been brought to us when they thought that this was the bottleneck, and after working on it for about a week or two, we've given it back. And what we've done is that we've basically taken 2 or 3% of the entire processing time down to 1% or maybe half a percent. But the overall application is not that much faster. So CHUD is extremely important in that regard. The next thing is to rewrite. So you rewrite the scalar code using data parallelism, which I will go over in detail.

Then after you've done that and after you've found out where your bottlenecks are, you should rewrite select parts in vector. And then use the vector libraries that we have as much as possible. In our game, basically you need to take charge of what the compiler is doing as much as you can.

And one of the things to worry about is that because we have a large, very rich set of registers, and we want to make sure that we use them all the time, you need to be aware of the register spillage inside your compiler and see if this is not working out and do something else instead.

At this time, what I'd like to do is to go over an example. This example is the IMDCT for MP3. For some of you who are not familiar with this particular thing, this is a transformation not unlike FFTs, the Inverse Modified Discrete Cosine Transform. And this chart here shows what happens to this particular piece of code as we go along and optimize it.

I'd like to mention that the results of all these are exactly the same. We have not changed the results of any of this stuff here. There's no shortcut in the results. The results are still correct towards the bottom. The first IMDCT code, of course, calls cosines.

And if you put the cosines in a loop, it has a loop, and you call our libm, which is highly accurate, 53 bits of significand, you'll get 583. And this is units of time on any machine, let's say. The smaller numbers in this chart are better. Just taking this cosine out and making it a table lookup, which becomes the original ISO code, so if you paid your money and you got your original ISO manual and you just typed it in, you get 20.1. So the moral of the story in here is that don't call libm 5 million times inside of a loop if you can help it, especially if you're not interested in very high accuracy.

The next thing is an algorithmic optimization in here. And we have not moved into the vector domain right now. We're still sitting on top of the scalar domain.
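
To make the table-lookup point concrete, here is a small hedged C sketch (hypothetical functions, not Apple's IMDCT code) contrasting a loop that calls the libm cosine on every iteration with one that pays for the 2048 cosines once and indexes a precomputed table:

```c
#include <math.h>

#define N 2048   /* size of the ISO-style cosine table mentioned in the talk */

/* Naive version: a full-accuracy libm cosine call inside the hot loop. */
double sum_cos_libm(const int *idx, int count) {
    double s = 0.0;
    for (int i = 0; i < count; i++)
        s += cos(2.0 * M_PI * idx[i] / N);
    return s;
}

/* Table version: pay for the 2048 cosines once, then index the table. */
static double cos_table[N];

void init_cos_table(void) {
    for (int k = 0; k < N; k++)
        cos_table[k] = cos(2.0 * M_PI * k / N);
}

double sum_cos_table(const int *idx, int count) {
    double s = 0.0;
    for (int i = 0; i < count; i++)
        s += cos_table[idx[i]];   /* no libm call per iteration */
    return s;
}
```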

The Lie optimization is a technique which will bring the size of the 2048 cosine table lookup down to 384. It turns out that table lookup is fast, but it's not as fast as doing actual computation on the machine. And bringing down the size of the table lookup will give an improvement in performance.

After the Lie optimization is done, we found a way, and this is actually a standard way of doing this, of making our IMDCT call our FFT. We make the IMDCT based on the FFT. This will bring the cosine table even further down. And now the scalar code is running at 1.23 units of time, when we started with 583, or if you just look at the original ISO code, it's at 20.1.

So it's a fair amount of improvement already. And going from the scalar, which we're still sitting in the scalar domain here, to the vector domain, we get almost three times the performance just moving from scalar to vector. The moral of the story in here is that you should not be calling libraries inside loops if you don't have to.

You should be minimizing your table lookups if that's possible. And the second thing is that work on your scalar code before you move on to vector to make sure where your bottlenecks are and how to squeeze the performance out of it and then move on to the vector code.

What are the vectorization benefits? And this is true for all platforms, including ours. On ours, the PowerPC vector unit has an orthogonal design. It really is not a second-class citizen to anybody. As opposed to the design of the MMX, which sits on top of the floating-point register file and has a granularity which is larger than one instruction. So in order to amortize the use of the vector engine, you have to use a few of the instructions before you move on so that you'll see the benefits. You can actually use this in conjunction with anything else.

There is no penalty. There is no overhead in using the vector engine. The engine is completely pipelined, with single-cycle instruction execution, unlike some other parts of the PowerPC, which I will go through. So this is extremely important because if you want to have a lot of performance, you do not want to have stalls in the pipeline.

The other benefits of vectorization are some of the libraries that we offer in Jaguar. The first one is the digital signal processing library, which we have enhanced quite a bit. The BLAS, which is a standard BLAS. I will go into detail on that. The basic operations, and that is the add, subtract, multiply, divide for anything which does not exist on the processor. So that you can use, let's say, 32-bit multiplies.

We don't have 32-bit multiplies on AltiVec. And vMathLib, which is a counterpart of the regular library that we have in the scalar domain that you access with LibM. You can access that with vMathLib. We promise you better results and better performance in vMathLib as we go on, and Jaguar will see some of that benefit. At this time, what I would like to do is show you what would happen if you actually took your code and vectorized it.

And spent some time going over it, profiled it, and found out where the bottlenecks are. And what I would like to do is introduce my friend Ralph Brunner of Core Graphics. And he is going to show us vectorization work that we have done for portions of Quartz. I want to use that microphone there. So what do we have here? The higher the speedometer, the better the frame rate, and this is what we want, just like the speed, okay? Okay.

I get a frame rate of about 20 frames per second. Go and turn on all the vector code we have in Quartz and do the same. You get about 30 frames per second. You get the idea. Okay, so this is the Quartz compositor that has been vectorized. So the Quartz compositor is the piece of code that takes all the windows' content and mixes it together to present the final on-screen result and does all the translucency and genie effects and stuff like that. So this is part of Jaguar, right? This is part of Jaguar. So if you bought a G4 and you put Jaguar on it, you'll see the benefits right away, at least on resizing.

So, the other little demo I have is live resizing. We identified that as one of the performance bottlenecks in using Mac OS X, so we worked on that a bit. And the first thing I'm going to show again is live-resizing a window.

[Transcript missing]

Okay, and here what has been vectorized is the compositing part, like in the translucent terminal demo before, but also pattern fills like the pinstripes and the scrollbars, text drawing, shadow generation, a set of about eight functions that we vectorized. And each of these functions measured by themselves gave us something like a three to five times performance improvement, and the overall performance improvement is then quite noticeable. Okay.

The third demo I have is a bit of an oddity. So I have an image here which is stored in an unusual format. It doesn't have red, green, and blue at every pixel. It has a polynomial for red, green, and blue for every pixel. And so you can actually evaluate the entire image at different points. And what that means here, we can set the focus distance in that image.

So this is an example of an algorithm that you probably wouldn't even try if you didn't have a SIMD unit. So how many polynomials are you evaluating here? So that's 1.3 million polynomials for a single frame. 1.3 million polynomials for a single frame. Don't try this at home. And the data backing this, just the coefficients for these polynomials, is 8 megabytes.

So, let me just wiggle this around a bit to get the performance numbers here. - And we see we run about 15 to 18 frames per second. And just for kicks, I implemented a scalar version of that. Which is rather sad. It's like two frames per second. Okay.

So that's it for the demos. All right. Thank you, Ralph. So as you see... Thank you very much. So as you see, just spending some time vectorizing your code, and this is not calling our libraries, just finding out where the bottlenecks are and spending time and vectorizing will have a huge effect on the performance. Of course, Quartz is an extremely important component of Mac OS X, and everything you saw in here besides the 1.3 million polynomials per frame is going to be included in Jaguar, and Jaguar will see the benefits of that.

The next thing I would like to show you again is, again, some more benefits of vectorization, what happens if we vectorize some other portions, and I would like to ask Robert Murley to come up. We have vectorized the RGB to RGB color transformation for ColorSync, and Bob is going to show us what happens with the vectorization of that component.

[Transcript missing]

The one on your right is running without any vectorization. The vectorized one you will see is the one on your left. The part that has been vectorized is ColorSync, the color correction software that is built into the Mac. So I wanted to show you the speed difference between scalar and vector for ColorSync.

Looks like as we can see the... Oops, that wasn't fair. All right. He really handicapped the scalar one this time. Let me try that again. Okay. The scalar is slow, but not that slow. All right. Is that clicked here? All right. OK, here we go. I got it this time. All right. The gods of demos are with us again.

We can see that the vector one is faster, and the vector one was done before the scalar, so we can take it back up also. I just wanted to mention also that roughly 60% of the processing going on here is ColorSync, and the rest is other things. The ColorSync vector code, if you saw it by itself, is about 2.2 times faster than the scalar.

And that is an overall speed improvement of the ColorSync scalar code, right? Correct. Okay. All right. Thank you, Bob. Again, this is... So again, this is a major component of Mac OS X, and Jaguar will have this. And if you have a G4, you will get a better RGB to RGB transformation.

So let's talk in detail about some of the libraries that we have for the new Jaguar. The new one, which is quite dear to my heart, is the signal processing one. The DSP library, vDSP, we've had this for a while. It was a single-precision-based, very fast signal processing library. And what we've done for Jaguar, we've added double-precision DSP functions.

And we've also added radix-3 and radix-5 FFTs because there was some demand for radix-3 and radix-5. These are very high performance real and complex 1D and 2D FFTs. For those of you who are using FFTs in here, these numbers should be fairly familiar. The 1024 complex FFT takes only 9.8 microseconds on a 1GHz PowerPC.

That's only on one processor. And the real one takes 5.2 microseconds. I'd like to tell you that on PowerPC, we were the first ones to make available 1024-point complex FFTs that were processed in under 20 microseconds, and now under 10 microseconds. So going over some of the details of vDSP, I'd like to show you the performance and compare that to what used to be the highest-clocked Pentium machine.

What we have here is various sizes of the FFT from 256 to 4096. The blue one is our vDSP library and the yellow one is the IPP. IPP is Intel's Integrated Performance Primitives library, which many of you may be familiar with. These are in-cache transformations that we are doing.

And as you can see, the 1GHz G4 compares fairly favorably with the 2.2GHz P4. If you go to a 2.4GHz P4, then you'll get a little bit better performance, about 4 or 5%, but the numbers look pretty much the same. We're still faster, on a 1GHz machine. Here is the real FFT.

We have a slightly better algorithm and we do better than the IPP, which is interesting. I'd like to follow this up for the image processing folks. We have a 2D FFT, also the same thing. The 2D FFT is a fairly tough guy to implement and to get right and make fast.

But again, vDSP is doing a better job than Intel's primitives. I'd like to follow this up with the complex one. And again, the complex one also is faster. We only have four of these graphs in here, so I don't want to bore you, but I want to tell you that the signal processing library that we have in Jaguar is really second to none.

It is extremely high performance on the vector engine. It's also very high performance on the floating point unit now, if you wanted to choose the double precision instead of the single precision. Typically, the difference between the vector performance and the scalar performance right now is anywhere from 4 to 5 times. So if you use vDSP on the vector engine, you'll get 4 to 5 times better performance than you would on the scalar engine, and the scalar engine is already hand tuned.
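
As a rough illustration of calling these FFTs, here is a minimal sketch of a 1024-point in-place complex FFT through vDSP. It assumes the modern prefixed names (vDSP_create_fftsetup, vDSP_fft_zip) and the vecLib umbrella header mentioned later in the talk; the Jaguar-era calls may have been spelled without the vDSP_ prefix, and current systems expose the same routines through the Accelerate umbrella framework.

```c
#include <stdio.h>
#include <vecLib/vecLib.h>   /* current systems: <Accelerate/Accelerate.h> */

int main(void) {
    const vDSP_Length log2n = 10;            /* 1024-point FFT, as in the talk */
    const vDSP_Length n = 1 << log2n;

    float real[1024], imag[1024];
    for (vDSP_Length i = 0; i < n; i++) {    /* toy input: a unit impulse */
        real[i] = (i == 0) ? 1.0f : 0.0f;
        imag[i] = 0.0f;
    }

    DSPSplitComplex z = { real, imag };
    FFTSetup setup = vDSP_create_fftsetup(log2n, kFFTRadix2);
    if (setup == NULL) return 1;

    /* In-place, split-complex, forward FFT. */
    vDSP_fft_zip(setup, &z, 1, log2n, kFFTDirection_Forward);

    printf("bin 0: %f + %fi\n", z.realp[0], z.imagp[0]);
    vDSP_destroy_fftsetup(setup);
    return 0;
}
```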

What I would like to do here is now we've talked about the benefits of vectorization if you just went and vectorized. And what I would like to now show is what happens if you use our libraries like the FFTs and other things. I'd like to introduce my friend here, Bill Kincaid, who is going to show what iTunes does with our library. All right, so all those other demos are cool, but we all know what really matters is MP3. Okay. Here's an unbiased view of some guy who actually works on iTunes.

A couple of different builds of iTunes here, and I was going to show them to you on identical machines running side-by-side, but we don't have identical machines, so I'm not going to do that. Instead, I'm going to ask you to... Watch the two versions, one running after the other, and remember what the performance of the first one was as you're watching the second one. It should be pretty obvious. Just before I actually run the demo for you, let me give you some figures on the sort of demands that we're making on the processor and vector unit here.

What I'm going to show you is MP3 encoding, during the course of which at sort of typical settings, we'll run through on the order of half a million FFTs, complex FFTs, to encode an album. And we'll run through a similar number of MDCTs. The MDCT is essentially identical to the IMDCT that Ali was talking about. It's the forward transform instead of the inverse transform. So there's an awful lot of computation going on here.

In fact, a couple of years ago, it was considered...

[Transcript missing]

The first thing we want to talk about is the special build of iTunes that is not vectorized. So, whereas the normal shipping iTunes uses Ali's vector library for the FFTs and the MDCTs, this one does not. It uses, in fact, a very fast scalar FFT, but it is scalar.

So as not to add another variable to the mix here, we're not going to encode off a CD as you would typically do at home. Instead, we've got an AIFF file. It turns out I actually tried in preparing this demo, I tried encoding off a CD, and the problem is we're I.O. bound instead of CPU bound, so it didn't make a very good demo.

So here's our AIFF file that we're going to convert to an MP3. And here it goes. So the number that I'd like you to notice will appear here in a second. It's the X factor here, 9.6X, so that means we're roughly 10 times real-time encoding from AIFF to MP3. So this is using a scalar FFT and a scalar MDCT at this time, okay? As impressive as 10X is, you can do just one heck of a lot better if you use the vector engine. And I'll show you that in a second. Or 29 seconds.

We could probably hear the tune that it's trying to do. Yeah, I didn't handicap it by asking it to play while it was encoding. Oh, okay, I see. All right. So he's putting all his energies into that. He handicapped the other one. Okay. That's always fair. At roughly about 9x. So, 9 times real-time. Okay? All right. That's that. And this is the vector version.

[Transcript missing]

So if any of you guys are doing MP3 applications, you should really look into the stuff that's available in the vector libraries, down in the vDSP library. There's good stuff there. All right. Okay, thank you, Bill. So this is... This is a very good reason to use our vDSP.

It's extremely fast, and it does the job right, and I think you will see the benefits of it right away. The important thing also to remember is that vDSP also has a scalar component to it on the single-precision engine, and you don't have to worry about which machine you're running on. Just call it. It will make the right decision for you. It will run the correct thing on the scalar machines. It will run the correct thing on the vector one.

The next topic I'd like to talk about is the basic linear algebra subroutines. We've been asked about this many times. I'm happy to announce here that we have a full suite of the basic linear algebra subroutines in Jaguar. This is an important addition to Jaguar, and vDSP, vMathLib, and the BLAS are a cornerstone of a lot of the engineering and scientific work that is required.

I think people using Jaguar are going to be happy about that. The BLAS that we have is an industry-standard BLAS. It comes in single and double precision, real and complex, vector and scalar. It transparently selects between scalar and vector, so you don't have to worry about which machine you're running on. We maintain multi-gigaflop performance, even coming from DRAM, which I'll show you in a little bit.

We go through a lot of machinery to make sure whether we're running on a scalar or a vector machine. We also have included some specific calls for very small, tiny matrix multiplies, or complex matrix multiplies, or vector-matrix multiplies that you can use if you're not worried about BLAS compatibility from one platform to another. There's a bunch of those calls in there as well, and they're extremely fast. You have to know whether you're running on a vector machine or a scalar machine. You have to know what you're doing, but you get the raw speed.

What I would like to show you now is the performance of SGEMM. SGEMM, for those of you who know, is the multiplication of a matrix times another one plus another matrix, and there are two scalars involved in here that have to be multiplied in. And what we have here is the comparison of the Mac OS X BLAS SGEMM versus the MKL, the Math Kernel Library from Intel.

As you can see, we do quite well in the crucial and very important portion, up to 64 by 64. It turns out that most of the applications that we deal with are on the tiny side of things, and we like to give them as much boost as possible.

And we maintain performance after that. The Intel crowd does better because of the way they handle I.O., but that's beyond our reach. There is nothing we can do about this at this time. But up to 64 by 64, we can give you an enormous amount of performance. And it's pretty important to see, it's a 1 GHz machine.

This is per processor again, so if you multi-thread your application and you run it on two processors, you're going to get two of these guys. You get 5 GFLOPS per matrix multiply per processor. So SGEMM again is one of those things, like a 1024 complex FFT. Everybody has it in their mind, at least I do. And I want to make sure what my gigaflop rate is on that one.
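
For reference, here is a hedged sketch of what calling the industry-standard interface looks like: cblas_sgemm computing C = alpha*A*B + beta*C on tiny 2x2 matrices. This is a hypothetical toy example, assuming the CBLAS interface exposed through the vecLib umbrella header, not code from the talk.

```c
#include <stdio.h>
#include <vecLib/vecLib.h>   /* pulls in the CBLAS declarations */

int main(void) {
    /* C = alpha*A*B + beta*C on 2x2 row-major matrices. */
    float A[4] = { 1.0f, 2.0f,
                   3.0f, 4.0f };
    float B[4] = { 5.0f, 6.0f,
                   7.0f, 8.0f };
    float C[4] = { 0.0f, 0.0f,
                   0.0f, 0.0f };

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* M, N, K        */
                1.0f, A, 2,     /* alpha, A, lda  */
                B, 2,           /* B, ldb         */
                0.0f, C, 2);    /* beta, C, ldc   */

    printf("%6.1f %6.1f\n%6.1f %6.1f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```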

We also have had traditionally another library called VectorOps in our OS for those of you who have used previous versions of OS X and some of the OS 9 before. The differences between the BLAS and the VectorOps is that the BLAS is an industry standard. If you have calls to BLAS, it will just compile on our machine now. The BLAS that we have is based on a patented algorithm, and it's very high performance.

VectorOps now is relegated more to an educational tool. The source is available. You can download it from our website. You can look at it and see how we have vectorized it. It has still pretty good performance, but it's not as good as the BLAS. VectorOps is good for small arrays. If you really need source and you don't want to worry about the BLAS and you have small matrices or small vectors, this is for you.

It also is only in the vector domain, and it uses the vector registers as opposed to regular pointers to arrays. So if you're sitting on the vector registers and you want to do vector multiplies or matrix multiplies, this is what you want. What I would recommend is that the BLAS is the preferred way of doing work now, as opposed to VectorOps.

At this point, what I want to go into is the performance issues, and I have a few of these performance issues that I will talk about, and how you can do better with your applications. The first one is the data injection technique, and this is a term I came up with as I was writing this slide here because "starving the processor" just didn't sound too good.

So this is a software-only approach. It minimizes the pipeline stalls. It really takes the execution pipeline into account. I will have a detailed example of that. Remember, as we go further and further into processor land here, and more and more progress in gigahertz is happening, these processors are getting deeper and deeper pipelines, and you have to worry about that.

So the deeper pipelines need more independent sets of data, and this is exactly what I mean by data injection techniques. I also want to tell you that dynamic dependency will slow a machine down. If you have, let's say, an expression, and the result of that is required for the next expression, and the result of that is required for the next expression, that's what I call dynamic dependency, and it's hard to avoid stalls there. Because the machine just has to wait until the first one is done before we can dive into the other one.

So this is something that we would like to avoid. The next performance issue that comes to mind is vector algorithms. It is very important for people who are coming from the scalar domain to remember that some very fast scalar algorithms are just not vectorizable. They don't have a parallel grain in their lines. I mean, there's just nothing you can do about that.

You have to rethink your algorithms. Different algorithms may be required. Perhaps the different algorithm that you chucked because it wasn't good enough for scalar is good enough for vector. So this is important. Again, the dynamic dependency of the instructions causes slowdowns. Keep that in mind. And the data injection techniques will not work in the presence of dynamic dependencies. I will follow this up with an example.

The next issue is the LSU bottleneck, the load store unit on the PowerPC. We typically see two or three loads and stores per FPU or vector instruction; this is the way the compiler generates things. Some of the remedy for that would be to inline or remove trivial functions. We're in the habit of compartmentalizing things and putting things away. We've got one or two lines of code, we wrap it up in a function, put it away for further use, and then we call it.

What I'm advocating here is not to do that. Try to inline as much as possible. Try to alleviate the problem of going from one function to another. Minimize memory movement. And this is part of that. By inlining, you would minimize memory movement. Keep values in registers as much as possible. This is somewhat in your discretion and somewhat it's not.

Using the compiler, it's somewhat not in your discretion. And then if you wrote it in assembly, of course, you can do whatever you want. Over-reliance on the LSU frequently causes two to three times slowdowns in the code that we have seen. And this is one major problem that we generally have to deal with: removing some of these LSU instructions from the instruction stream.
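
As a rough illustration of the inlining advice (hypothetical functions, not code from the talk), compare a trivial helper called out of line with an inlined version where the running value stays in a register across the loop:

```c
/* Out-of-line trivial helper: each use costs a call, and values tend to be
   pushed through memory instead of staying in registers. */
float scale_elem(const float *p, int i, float s) { return p[i] * s; }

/* Inlined version: the trivial work disappears into the caller and the
   running sum can live in a register across the whole loop. */
static inline float scale(float x, float s) { return x * s; }

float sum_scaled(const float *p, int n, float s) {
    float sum = 0.0f;              /* stays in a register, no reloads */
    for (int i = 0; i < n; i++)
        sum += scale(p[i], s);     /* no call/return, fewer loads and stores */
    return sum;
}
```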

The next performance issue is that you really have to be aware of the ISA details, the instruction set architecture of the PowerPC. Be aware of that. Know what the processor is doing under your feet. Remember, the vector engine is pipelined, single-cycle, for all of the 162 instructions that we have. This is not true for the FPU. The FPU produces stalls. As an example, I have the FP divide. The FP divide is not pipelined but sequenced.

The FP divide on the early G4s would take about 33 cycles, would not allow anything else to be dispatched, and it will give you the result after that. And the new one, the new G4s in the 1 GHz machines, takes 37 cycles. So be aware of that. Also remember that the reciprocal estimate is not pipelined. The reciprocal estimate takes anywhere from 14 to 21 cycles, depending on which flavor of PowerPC you use.
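
To make the divide point concrete, here is a small hedged C sketch (a hypothetical function, not from the talk) of the usual workaround: hoist a single divide out of the loop and multiply by the reciprocal, since the multiplies pipeline while the divide does not. Note that multiplying by the reciprocal can differ from a true divide in the last bit.

```c
/* Dividing inside the loop issues one long, non-pipelined divide per element. */
void scale_by_divisor_slow(float *x, int n, float s) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] / s;
}

/* Hoist a single divide and multiply by the reciprocal instead; the
   multiplies pipeline cleanly. (Results may differ in the last bit.) */
void scale_by_divisor(float *x, int n, float s) {
    const float inv = 1.0f / s;    /* one divide, outside the loop */
    for (int i = 0; i < n; i++)
        x[i] *= inv;
}
```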

One other detail of the ISA that you should keep in mind is that we only have dual issue on the older G4s. These are, of course, superscalar, out-of-order cores, but the more issues you have, the more work you can do. But the old ones only have two, and the new ones have triple issue, so we can issue more of these instructions and do more work.

Just have to remember that even though we have a triple issue, the vector engine can take only two at the same time. On the older designs, there was a restriction of which subunit would be able to process that, but on the newer designs, all of the VALU and the VPERM unit inside the vector engine is up for grabs for the dual issue.

Now I'd like to go over an example of what happens if we use the data injection technique that I alluded to. Here is just one run-of-the-mill example of a floating point program. And this floating point program is doing something, let's say. I don't know what it's doing, but it's doing something. And what we like to do is to find out what happens to the processor.

I mentioned CHUD. This is output from a CHUD tool that you will see later. It's showing us what the machine is doing, the various sections of the processor pipeline, for the older-style G4s at this time here. There's an instruction stream on the left side, and on the right side is what the pipeline looks like.

The D letters are the stalls that you see, and of course the compiler for me has generated an add immediate and a load floating point single that may or may not be useful or needed at this time. It takes 25 cycles to do this particular little program here. What we want to do is find out what we can do to enhance this.

What I've done here is that I've taken exactly that algorithm, and now instead of using one set of data, I'm using three sets of data. Because what I saw was that my pipeline had bubbles in it. I had a dynamic dependency, and that dynamic dependency would not allow me to go as fast as I would like to do. Here what I've done is, taken the same thing, and now I'm doing three things at the same time, basically parallelizing my algorithm.

And looking one more time at the CHUD tool, again the instruction stream is on the left side, the pipeline is on the right side. This is as perfect as you can get on this machine's pipeline. There are no stalls. There's a little nitty-gritty thing happening here, but we're not going to talk about that. But what happens is that this is the best. If all of your code were like this, your machine would be flying at 10 gigahertz.

And you won't need to worry about the frequency anymore. This is as good as it gets. And basically now we're doing three sets of this for the same program in 32 cycles. So basically, instead of the 25 cycles that we had, we've now brought it down to about 10 cycles per set. Extremely powerful technique, this data injection, which basically parallelizes the sets of data that you're sending through the program. We use this all the time, and it addresses one of the basic causes of slower performance on the PowerPC.
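
Here is a minimal C sketch of the same idea (hypothetical functions, not the code from the slide): a single dependent chain that stalls the pipeline versus three independent chains interleaved so there is always work that does not wait on an in-flight result.

```c
/* Dependent chain: every iteration waits for the previous result, so the
   FPU pipeline stalls on the in-flight multiply-add. */
float chain_one(const float *x, int n) {
    float acc = 1.0f;
    for (int i = 0; i < n; i++)
        acc = acc * x[i] + 1.0f;   /* result feeds the next iteration */
    return acc;
}

/* "Data injection": run three independent chains through the same code so
   the pipeline always has work that does not depend on an in-flight result. */
void chain_three(const float *x0, const float *x1, const float *x2,
                 int n, float out[3]) {
    float a0 = 1.0f, a1 = 1.0f, a2 = 1.0f;
    for (int i = 0; i < n; i++) {
        a0 = a0 * x0[i] + 1.0f;    /* three chains interleave in the pipeline */
        a1 = a1 * x1[i] + 1.0f;
        a2 = a2 * x2[i] + 1.0f;
    }
    out[0] = a0; out[1] = a1; out[2] = a2;
}
```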

I'd like to wrap up this section of the talk by telling you where to go to learn more. I'm not expecting you to write all this stuff down and go and apply it right away after this session here. So where can you browse it? We have a new spiffy website which has a lot of information for you, and you can find it at developer.apple.com/hardware/ve.

You can code to it by going to the vecLib framework. This is where all of our frameworks are. If you wanted to talk to vDSP, to the BLAS, to VectorOps or vBasicOps or vMathLib, this is where you would find it. To actually use it in your program, you would include vecLib as a hierarchical include here, vecLib/vecLib.h.
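
A minimal sketch of what that looks like in practice, assuming the header and framework names as given in the talk (vecLib/vecLib.h, -framework vecLib); the vDSP_ prefix on the routine is the modern spelling, and on current systems the Accelerate umbrella framework exposes the same calls:

```c
/* sum_example.c
   Build, per the talk: cc sum_example.c -framework vecLib -o sum_example */
#include <stdio.h>
#include <vecLib/vecLib.h>

int main(void) {
    float x[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float sum = 0.0f;

    /* vDSP_sve: vector sum of elements, one of the vDSP routines the
       framework exposes. */
    vDSP_sve(x, 1, &sum, 4);

    printf("sum = %f\n", sum);
    return 0;
}
```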

And using it is as easy as putting -framework vecLib as a compile option, and then you're set to go. And you'll get the benefits of the vectorization whether you like it or not. Well, you have to call the vector library. All right. So I'd like to turn the podium over to my friends from the hardware performance group, and they're going to tell us a little bit more about what they've done to show us where the bottlenecks are. I'd like to introduce Sanjay Patel and Nathan Slingerland.

Hi, my name is Sanjay Patel. I'm in the Architecture and Performance group. This is Nathan Slingerland. So we live over in hardware, so we like to optimize down to the metal. And that's why we've created the CHUD tools, Computer Hardware Understanding Development tools. The first couple of tools are used to actually get down to PowerPC machine-level instruction profiling. So the first tool is amber. It's a command-line tracer. And what that allows you to do is get the machine code for your program and actually see it in a pipeline view through a cycle-accurate simulator called SimG4, which models the PPC 7400.

So as Ali was showing earlier, we have our example of division where we use a reciprocal estimate in this case. And so all you have to do to instrument your code is throw in these little start/stop functions, which we'll take a look at in a second. There are actually many different ways we can start and stop tracing.

So at this point, I guess we'll go to... Well, we'll take a trace in a second. But as we saw earlier, this is the output from SimG4, which is the cycle-accurate simulator. And what we see here is that the first instruction, the reciprocal estimate, is a long latency instruction, and it's not pipelined. So what you see there in the E's, the processor is actually executing over that, but none of the following instructions can proceed ahead of it.

And of course you see the other data dependency stalls with the highlighted red D's. So what we want to do is, we'll take a trace that shows this in a second, and show how we can break some of these pipeline stalls. So if we can switch over to the demo computer.

There we go. So let's first take a look at the start/stop function that we introduced. In this case, it's just a single machine instruction, which is actually an illegal instruction from user-level code. And what we've done is we've overridden the illegal instruction handler in amber to use this as a trigger to say, "Start or stop tracing my program." So it's as simple as that. There are a number of different ways you could do this as well. You could either hot-key if your program is long-running to take a massive trace.

Then you're just depending on how fast you can hit the keys to start and stop. You can actually start and stop based on a symbol or an instruction address as well. So these are the parameters for amber you see on screen. So since we have an instrumented function, we're going to use the dash I option and take a trace.

So you can see it's really a quick process; since our program is so short, we're just going to take a trace of 16 instructions. And if we look here, we'll see that amber has created a directory called trace with some offset. And if you go in there, you'll see there's a thread file. So if your program is multi-threaded, it will create multiple files, one for each thread.

And so now what we want to do is, this file is actually binary, so to make sense of it you need to input it to SIMG4, and you can get a pipeline view as we saw earlier. Okay, so there's our pipeline view as we saw in the slide. So I guess let's go back to the presentation computer so we can take a look at that output.

Can we go back to the presentation computer? Okay, so here we are again. So this is the same output we just saw from SimG4. And as we said, it takes a long time to execute because of that reciprocal estimate. So as Ali showed us, we can break that bottleneck by using a reciprocal square root and then multiplying it, because reciprocal square root is a pipelined instruction and it's low latency.

So again, we'd have our start and stop calls to Amber to take a trace, and if we took that trace and input it to SIMG4, we'd see that the pipeline output, now we've eliminated the long delay at the beginning, we still have all of our data dependency stalls. But we've already gotten this program down to 24 cycles, so we're 29% faster than the base case.

So again, if we were to use the data injection technique, where we do multiple sets of data rather than a single one to match the length of the pipeline of the processor we're interested in, we'll see that the SimG4 output shows that there are no more stalls. And what we've actually done now is that throughput is 2.9 times as fast as what we started with in our base case. This program is only one cycle longer than what it would take to do the single division.

Okay, so in summary, once you're ready to get down to optimizing at machine level, you want to take traces and use SIMG4 to analyze your output so you can identify all the bottlenecks that you might be encountering on the processor. So in this case, we saw we got a 29% improvement just by switching one instruction, one machine instruction. And then by using data injection, we can actually be almost three times as fast as our base case. So I'll now turn it over to Nathan to talk about a couple of our other tools.

Thank you, Sanjay. Okay, so I'm going to talk about a few other tools, and they rely on the performance counters, which are built into the hardware on the PowerPC chip and also in the memory controller on some of our machines. And with these performance counters, you can count interesting performance events like cache misses or instruction counts, execution stalls and the like. And the CHUD toolkit gives you the ability to actually ask those counters what's going on inside your program.

So the first tool we're going to talk about and I'll demo for you is Shikari. And Shikari is a system-wide profiling tool. Okay, so it's going to show you over time what's happening in the whole system. So you can use, you can sample based on time, like I said, so it would be similar to a program like Sampler.

Or you can sample on events, so any of those performance events. So you can find out where cache misses are coming from. And then what it does is it will correlate those events with your source code. And it'll show you performance hotspots based on those events, and you can also look at an annotated disassembly of any function that is a hotspot.

Now let's go to the demo machine. So in order to demonstrate how you can use Shikari to track down real performance bottlenecks and then fix them, we have this converted screensaver. It's called Flurry by Calum Robinson. And right now this is the original Flurry code running. And what we're doing is we're timing how long it takes to render 500 frames. It's an OpenGL screensaver. And in this machine, it's almost 12 seconds, 11.6 seconds to do that.

Okay, so the first step in understanding any kind of performance issues would be to profile this, especially because this is a screensaver. We don't know if we're CPU-bound or graphics card-bound or what. So this is the main Shikari window. What we have here first is the profile that's going to list from top to bottom the most frequently sampled functions.

And below that we have a list of sampling configurations. This pop-up here is basically the defaults that come with it, but you're free to add configurations. You can take advantage of all the performance counters that are available in your system. So, let's actually use the time profile and see where Flurry is spending its time.

So it's taken a second's worth of samples. And we can see the first thing that pops out to us here is that the libSystem square root is taking a significant portion of time. Now, as Ali told you, the Mac OS X square root is actually quite fast and very precise. But because this is a screensaver, maybe we don't want to be getting back the IEEE 754 compliant, 53 bits of precision, all that goodness, for a screensaver.

So if we go and look at where we're calling this, we can see that this is actually a pretty good candidate for replacement by what we have as the floating point reciprocal square root estimate instruction. And now, this isn't appropriate for every algorithm, of course. A lot of times you will need to use the libSystem math library square root function. But in our case, we know that we're okay with that and that the 5 bits of precision that this provides is sufficient. So let's see how much that helps us.
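
The actual change in the demo used the PowerPC reciprocal square root estimate instruction; as a portable stand-in, here is a hedged C sketch of the same precision-for-speed tradeoff, a crude bit-level estimate refined by one Newton-Raphson step. This is an illustration only, not the Flurry code.

```c
#include <stdint.h>

/* Low-precision reciprocal square root: a crude bit-level estimate plus one
   Newton-Raphson refinement. Cheap, only a handful of good bits, nothing
   like the fully accurate libSystem sqrt(). */
static inline float approx_rsqrt(float x) {
    union { float f; uint32_t u; } v = { x };
    v.u = 0x5f3759dfu - (v.u >> 1);        /* rough initial estimate   */
    float y = v.f;
    y = y * (1.5f - 0.5f * x * y * y);     /* one Newton-Raphson step  */
    return y;
}

static inline float approx_sqrt(float x) {
    return x * approx_rsqrt(x);            /* sqrt(x) = x * 1/sqrt(x)  */
}
```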

So almost a 2x speedup. That's pretty good for, you know, it's still scalar. So the question now is, again, as Ali said, profile, profile, profile. Let's profile again and see where we're spending our time now that we've eliminated the call to libSystem. So we can see that now the system square root function is gone from the profile. That's what we expect. And now we're spending a lot of time in update smoke and draw smoke. That's our code. That's inside of Flurry.

Okay, so we know that, because this is our code, these two functions are dominated by floating point computation. And so what we're going to do next is look at what Shikari can tell you about floating point issue stalls. These are the dependency stalls that we talked about earlier. And there's a preset for that. So this will tell you where in your code the most floating-point dependency stalls are coming from.

Okay, so we can see now that most of our floating point issue stalls are coming from update smoke. So with Shikari, you can drill down and see the actual disassembled code behind that. And it's going to visually point out to you where the hot spot is. So blue means not hot, or cold; it wasn't referenced or wasn't recorded in the sample. And then we'll see yellow or orange when we get to a hot spot. So here it is, in the file smoke.c, around line 279 or so.

If we go back to our source code here. So that's this loop. So, okay, we already know what to do about this, right? We have dependency stalls. We could use data injection, unroll the loop, have it try to do more things in parallel. Or, if we're going to spend the time, maybe we should vectorize it and unroll it. So do both at the same time. And that's what this code is here. So, let's see how that performs.
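
Here is a hedged AltiVec-style sketch of that "vectorize and unroll" step: a hypothetical saxpy-like loop, not the actual smoke.c code. It assumes a PowerPC compiler with -maltivec, 16-byte-aligned arrays, and a length that is a multiple of 8.

```c
#include <altivec.h>

/* y = a*x + y, four floats per vec_madd, unrolled by two so the two result
   vectors are independent and can overlap in the pipeline. */
void saxpy_vec(float *y, const float *x, float a, int n) {
    float a4[4] __attribute__((aligned(16))) = { a, a, a, a };
    vector float va = vec_ld(0, a4);           /* splat the scalar */

    for (int i = 0; i < n; i += 8) {
        vector float x0 = vec_ld(0,  &x[i]);
        vector float x1 = vec_ld(16, &x[i]);
        vector float y0 = vec_ld(0,  &y[i]);
        vector float y1 = vec_ld(16, &y[i]);
        y0 = vec_madd(va, x0, y0);             /* fused multiply-add, 4 lanes */
        y1 = vec_madd(va, x1, y1);             /* second, independent chain   */
        vec_st(y0, 0,  &y[i]);
        vec_st(y1, 16, &y[i]);
    }
}
```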

So almost another 2x speedup for using AltiVec. Okay, so where would we go next, right? Well, we might think, let's try and find AltiVec issue stalls, or maybe we should MP it or something, you know, multi-thread this app or something like that. So let's profile again and let's see where Shikari says we're spending time.

So what's interesting here is now that we're not, most of the time we're actually not in our code at all. We're actually in the driver for the GeForce3, this is the GeForce3 card on this machine. And actually, what we found is on other machines with lesser graphics cards you hit this much quicker.

So you may think that in order to optimize this application you need to keep beating on the CPU aspect, that we're CPU bound. But in actuality, our OpenGL colleagues have told us there are probably better ways to be rendering this, and that would be the next step. Okay, let's go back to the slides again, the presentation.

So in addition to the tools Shikari and Amber and SimG4, that's the instruction simulator, the cycle-accurate simulator that we've been showing you the output from, these are the other tools, some of the other tools that you get in the CHUD toolkit. Monster, which is probably the most direct way you'd think about accessing the performance counters. It's going to give you a tabular view, like an Excel spreadsheet more or less, of the performance counter data, and it can also give you a chart of that data over time.

ACID is another way to filter traces for performance statistics. And then Reggie is a way to look at special purpose registers that are normally supervisor-only, so you can't normally manipulate those from user space. And finally, all of our tools, Monster and Shikari and others, are built on top of the CHUD framework. And what you can do with this framework is you can make your own performance tools, or you can actually instrument your code to start and stop our tools, or actually instrument your code and use the counters directly.

So all these tools are currently available on the web at this address. They will at some point also be available on the developer CDs. And any issues you can report to [email protected]. And at this time, I'm going to turn the presentation over to Mark Tozer for a few more slides. Thank you.

Thank you. So I just want to clarify my earlier comments at the opening that if you also understand or remember how to do the square root function or symbol, you also qualify to be in this session, not just being able to actually perform one. So roadmap, let's see, other places where you can go since this is the beginning of the week. Some more performance tool sessions that you should visit and learn more about the CHUD tools.

Also more sessions on how to optimize your application for multi-threading, specifically for MP systems themselves. One thing to know: with the introduction of the iMac, every system Apple is shipping, except for one product line, the iBook, ships with a G4 processor. So the G4 and the Velocity Engine are available for you on all of our systems that are available to our customers, in the consumer, the portable, and the Power Mac lines.

So who to contact for further information? Here's my email, and feel free to contact me. Documentation, again, is at developer.apple.com/hardware/ve. There's some more information with regards to our scientific computing group. There's also some more information on the BLAST work that we've done with Genentech. That's also available off that same page as a link.