WWDC03 • Session 507

Mac OS X High Performance Math Libraries

Hardware • 51:46

Mac OS X unleashes the power of the PowerPC processor with highly tuned libraries, making fast and accurate math computations available to your application. This session gives an overview of the math libraries Apple delivers to these platforms in Mac OS X. The rich content of the standard Libm math library and the broad coverage of the high-performance vecLib framework are reviewed. Apple's engineers provide insight into the techniques and tools available for performance tuning.

Speakers: Steve Peters, Eric Miller

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.

Good morning. My name is Mark Tozer-Vilches. I'm the desktop hardware evangelist at Apple in Worldwide Developer Relations. This is session 507, Mac OS X High Performance Math Libraries. The Vector Numerics Group, if you're not familiar with them already through the different sessions this year as well as previous years, has been doing a lot of work in optimizing performance in the libraries that we ship in our operating system, as well as working with several of our developers and our own application groups. If you've used iTunes, a lot of the performance that we achieve in that application is a result of this group.

Today we offer some more and new information. Yesterday the group also introduced a new library, vImage, for vector image processing. Today I'd like to introduce Steve Peters, who will go over these new libraries and, in particular, how the 970, the G5 processor, takes advantage of them. Thank you.

Thank you, Mark. Welcome. Good morning. It's a wonderful time to be doing mathematics on the Macintosh, and I'd like to tell you a little bit about why I think so. One more time. So by way of introduction, I wanted to lay out what I think our charter is in the numerics and vectorization group. First of all, we want your apps to achieve optimum floating point performance on our platform. We're in a great position to help you with that today.

We also want to make sure that your apps deliver robust, high-quality numerics. It's been a tradition at Apple since the SANE days back on the 680x0 to deliver really the best in numerical algorithms and the numerical science. We continue to keep the bar high and aim to set it even higher.

We want your apps to readily port to industry standard APIs in our Mac OS X frameworks. This is a leverage point. We take advantage of open source software in building these frameworks, and we pay a great deal of attention to optimization in these frameworks. And finally, we want your apps to enjoy easy access to the Apple value-added features such as AltiVec in our platform, and where we can transparently give you access to those features, we do, and you'll see a little later how that works out.

So what you'll learn today. We'd like to inform you about floating point performance, the floating point performance work we're delivering for G5 in Panther. We'll show you how to leverage that work in your development. We'll walk you through a regimen that I've been using to squeeze out the last bit of floating point performance in the libraries. You might find that helpful in your applications. And we'll recommend some learning resources.

So this is where in many presentations you'll see the obligatory layer cake diagram, with sort of one piece cut out and put in: the technology framework assessment. Being a bit of a contrarian, I decided, well, you know, we actually touch in more than one place on that layer cake. Math performance depends on the silicon, the floating point cores; it depends on the frameworks that we deliver to you; and it relies ever more on the performance tools that help you optimize your apps. And that's in the CHUD toolkit.

So let's review what's been done over the last year since we last met at WWDC. What do we deliver in Jaguar? Libm, our standard C math library, is centered on the following design principles. First, that we should be standards conforming. You should see the C99 floating point function entry points on our platform just the way you do on everybody else's. And that's certainly true for doubles in Jaguar.

Our algorithms are numerically robust in Libm. It's really the central point that our core approximations should deliver correctly rounded results. We bring you best-of-breed algorithms. These are modern algorithms, nothing from Numerical Recipes here folks, and matched to the hardware.

How do you code to this stuff if you're writing in C? Couldn't be simpler. Include math.h. If you have special needs, and you know who you are, to touch the floating point environment, you can include fenv.h, and otherwise compile and go. Libm is part of the libSystem umbrella that's automatically linked on the cc line, so you need say no more.
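For instance, a minimal sketch of that (the particular functions chosen here are just for illustration):

    /* sketch: C99 math entry points from Libm; nothing extra to link */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 0.5;
        printf("sin(x)    = %g\n", sin(x));
        printf("lgamma(x) = %g\n", lgamma(x));          /* a C99 entry point */
        printf("fma       = %g\n", fma(2.0, 3.0, 1.0)); /* fused multiply-add */
        return 0;
    }
    /* build: cc example.c -o example
       Libm lives in the libSystem umbrella, linked automatically by cc. */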

So from time to time, we get questions and phone calls that say, gee, is it possible for me to go even faster than what you've delivered in Libm? The answer is yes, but you have to make some compromises. And it's been our position that we're not making those compromises in our deliverable Libm, but we can clue you in about what you need to do.

Libm emphasizes numerical robustness. If you're willing to compromise on that, you might be able to go a little faster. Libm uses the standard C99 APIs, which turn out to data-starve our FPUs. Typically, a call like sin presents the math library with a single argument, and you're asked to -- the contract is to return a single return value.

That ends up just leaving bubbles all the way through the floating-point pipes. If you can provide more data through an API, it's quite possible to produce results faster and furiously out the other end. Libm is obliged to handle all the special cases -- NaNs, infinities, plus and minus zeros as needed. And Libm is obliged to preserve, protect, and defend the state of the floating-point flags. So there's a fair amount of detail work that goes on in the library. If you're willing to compromise on that stuff, you can go faster.

So here's a homely little example. A very respected developer came to us and said, "Hey, your hypotenuse function is way slow. I can write a one-liner that goes much faster than what you guys do." So here you see the Pythagorean algorithm, a little pithy Pythagorean algorithm, return the square root of X squared plus Y squared. And sure enough, you run this function on a triangle whose one leg is length three, the other leg is length four, and the hypotenuse is five.

Well, let's sort of scale this thing up. I don't know, is this the size of the universe as we understand it these days? In any case, the result of applying this function to these arguments is, hmm, not 5e160, but infinity. Ooh. So there's some intermediate overflow here, right? And this developer made some compromises, probably over the range of arguments that he was interested in using this function. Worked just fine. But Libm takes on all comers and we're obliged by standards to do something different. And here's just for... This is like my clicker at home after the kids have gotten to it.
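Here's a small sketch of the failure mode he's describing; naive_hypot is a made-up name for the developer's one-liner, and the magnitudes are chosen only so the intermediate squares overflow:

    #include <math.h>
    #include <stdio.h>

    /* the pithy one-liner: fine for modest arguments */
    static double naive_hypot(double x, double y) {
        return sqrt(x * x + y * y);       /* x*x overflows for huge x */
    }

    int main(void) {
        printf("%g\n", naive_hypot(3.0, 4.0));       /* 5 */
        printf("%g\n", naive_hypot(3e160, 4e160));   /* inf: 9e320 > DBL_MAX */
        printf("%g\n", hypot(3e160, 4e160));         /* 5e160: Libm rescales */
        return 0;
    }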

There we go. The purpose of this is for gestalt, right? There's a lot more going on here than that simple one-liner. Those with sharp eyes, unlike mine, may notice, hmm, let's see: there are some comparisons up front, a division halfway down, finally that square root, and a rescaling, and some work on the environment. So those are the kinds of details, and the lengths, that Libm goes to to meet standards and robustness requirements. If you can compromise on those, you might be able to go faster. So that's the end of a long aside.

Also delivered in Jaguar is vecLib. We view vecLib as a one-stop shopping place for math performance. We deliver digital signal processing, 1D and 2D, real and complex FFTs, and the BLAS, levels 1, 2, and 3. These are based on the open-source software ATLAS, the Automatically Tuned Linear Algebra Software. It actually is a code generation system. Back at Cupertino, before every release, I grind through ATLAS. It generates code for linear algebra matched to the processor.

We get extraordinary performance from this. And for certain of the entry points, it's actually SMP aware, so you can go multiprocessor transparently. If the processor's there, it goes ahead and uses it. It's a wonderful, wonderful thing. We also delivered LAPACK for solving linear systems and eigenvalue problems, again open source software.

We delivered tuned 4x4, 8x8, 16x16, and 32x32 matrix multiply: matrix-matrix, matrix-vector, vector-matrix. These are for folks who know that they're dealing with very specific sizes and want to go really fast. These are basically completely unrolled loops and on the single precision side hit AltiVec very, very hard. And so that's the next point: wherever AltiVec is appropriate and available to use, vecLib tries to go to AltiVec.

How do you code to vecLib using C? Include vecLib/vecLib.h, that's the framework header. You might want to take a peek in there and see all the other header files that get included, cblas.h, clapack.h, vBigNum.h, et cetera. From there, you can fan out and find the interfaces that you'd like to code to.

And then just add -framework vecLib to the cc line. Well, we announced this week that there's an umbrella framework called Accelerate that will collect all of our high-performance math. And so in Panther, you could just as well, and it's probably the right migration path, link against -framework Accelerate. You'll get the same stuff.
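As a sketch of what calling into vecLib's BLAS looks like (the matrix contents here are arbitrary):

    #include <vecLib/vecLib.h>   /* or <Accelerate/Accelerate.h> on Panther */
    #include <stdio.h>

    int main(void) {
        double A[4] = { 1, 2, 3, 4 };    /* 2x2, row major */
        double B[4] = { 5, 6, 7, 8 };
        double C[4] = { 0, 0, 0, 0 };

        /* C = 1.0 * A * B + 0.0 * C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

        printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);  /* 19 22 43 50 */
        return 0;
    }
    /* build: cc example.c -framework vecLib   (or -framework Accelerate) */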

So again, that same question comes up every once in a while. Can I go faster than the subroutine that I'm calling in vecLib? Well, if you've chosen the appropriate entry point, we think the answer is no, probably not. We've squeezed these things pretty hard and we'll continue to do so.

So let's move to Panther. What's new in Panther for Libm? Well, clearly, we've tuned for 970. We've opened up all the core algorithms, recast them to exploit the two FPUs, the two LSUs, paid very careful attention to the way the instructions are issued to the machine so that we achieve maximum parallelism.

and we watch out for load store issues. You'll see one of those come up a little later. And at long last we have hardware square root. This should be the end of

[Transcript missing]

There's also an opportunity to inline square root. There are compiler flags now. If you're compiling and you know you're on G5, you can inline the square root to great, great advantage.

So it turns out having done all this work, We went back and said, well, now how's this going to go on the older models, the G4s? Are we going to have to maintain two copies of the library? It turns out not. At least for this code, doing these kinds of things to tune for 970 also helped us a little bit. A few percent, not a whole lot, but we're faster these days on G4 as well. Your mileage may vary. It's worth experimenting before you contemplate having multiple modules, multiple plug-ins. Go look and see if the things you do for G5 don't also help you on G4.

So I talked a little bit in the last slide about careful construction of issue groups, dispatch groups, on the 970. I wanted to put up one slide and simply remark that the issue is this. We've got a four-issue machine, five if we've got a branch. We need to be very careful that if we're going to try to get two, let's say, floating-point multiply-adds to issue simultaneously on the two FPUs, we need to make sure that each instruction is fed into the issue queue appropriate to the unit. And that means placing the instruction appropriately as the machine forms dispatch groups.

CHUD is a champ in showing us where we need to line up our instructions a little bit differently. The timers, and the SIMG4 and soon-to-come SIMG5 tools, are also essential to do that. That's probably enough said about that. There's a trace later on where we can return. So how did we do when we opened up our core algorithms, did this work, and moved on to the G5? Here are your favorite Libm functions.

Here are the cycles, the number of machine cycles that invocations of these functions for typical arguments took on the 7455, the G4, the high-end G4 model. Here's what we've done on the 970. Here's what our competition publishes for the P4. And in general, I think you'll see we're dominating the competition even through clock scaling.

We're gonna go faster than the fastest P4 you can buy. The only question is around square root, where it's neck and neck. If you're able to inline, you can get the first square root out in about 40 cycles, subsequent ones in 35 cycles, and you can do that on two FPUs.

If you need to call the subroutine library, we pick up a little dynamic linking overhead and can only go, essentially, through one FPU (it's one of those data-starve situations) in 52 cycles. So let's call that a draw maybe. Unless you can inline two at a time, and then it's not so much of a draw anymore.

So turning to the other big component of our mathematics efforts, vecLib, what's new in Panther for vecLib? Double precision math performance, right? We've all been waiting for real good double precision engines. We've now got two, and vecLib takes great advantage of those now. Our DSP routines have been tuned in double precision for the 970. The BLAS has been tuned and has really, I think, just phenomenal performance on the 970.

The scheme we use, for example, for matrix multiply derives from Atlas. Given a very large matrix, say a thousand by a thousand elements, it gets broken down into much smaller pieces, 64 by 64 in our case. That data gets moved into cache for each of the operands, A times B, resulting in C. And then we cut loose with what's called the matmul kernel on that guy, which takes the data, which is lying in cache, runs it through the floating point units, develops the output argument, and repeat. In that matmul kernel, we're able to, oops.

We're able to sustain about 3.4 floating ops per clock cycle. That's 84% of the peak available on the machine, and it measures out at just about 6.7 gigaflops at 2 gigahertz. When we look at the larger problem, reassembling these 64 by 64 block results and the overhead of pulling all this stuff in and out of cache, we get the DGEMM performance on a 1,000 by 1,000 matrix. That runs from memory, from RAM (a 1,000 by 1,000 matrix is going to sit in RAM, after all), at 2.4 floating ops per clock. That's about 60% of peak, and measures out at about 4.8 gigaflops on a single 2.0 gigahertz processor.
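The structure he's describing is roughly the following sketch. This is a naive illustration of cache blocking, not ATLAS's generated kernel; the 64 is the block edge mentioned above.

    #define NB 64   /* block edge: three NB x NB tiles fit comfortably in cache */

    /* C += A * B for n x n row-major matrices, n assumed a multiple of NB */
    static void blocked_dgemm(int n, const double *A, const double *B, double *C) {
        for (int i0 = 0; i0 < n; i0 += NB)
            for (int j0 = 0; j0 < n; j0 += NB)
                for (int k0 = 0; k0 < n; k0 += NB)
                    /* the "matmul kernel": all three tiles now live in cache */
                    for (int i = i0; i < i0 + NB; i++)
                        for (int k = k0; k < k0 + NB; k++) {
                            double a = A[i * n + k];
                            for (int j = j0; j < j0 + NB; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }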

We've also added double precision 4x4, 8x8 special size matrices, all tuned for the 970. They achieve similar, if not in some cases slightly better, performance than the DGEMM numbers. We've completely unrolled all those loops. The matrices are small enough that they sit in the cache all the time. It just goes like crazy. And finally, we've goosed up LAPACK a bit. If you're a fan of the singular value decomposition, you'll be happy to know we go really fast on SVDs now. And expect in Panther that LAPACK will be thread safe.

The results are in: FFTs are really fast on this box. These are some slides giving the performance of, you'll see, 1D and 2D, real and complex FFTs going to the vector unit. Smaller numbers here, in microseconds, are faster, so we want to be underneath the competition here.

[Transcript missing]

How about our linear algebra performance? The industry standard is the LINPACK 1000, which is a solution of a linear system, a thousand by a thousand. We use the ATLAS-based techniques. Those are blocked in the way I told you before, moving 64 by 64 chunks in and out of the cache and going like mad with the kernel. And let's just focus on this: the first number is the one that probably most folks are familiar with. So our DLP, that's Double Precision LINPACK 1000, that's the supercomputer benchmark: 2.64 gigaflops. We think that's PC industry leading.

So here's a pitch following on from yesterday's vImage session. There's a new umbrella framework in Panther called Accelerate. One-stop shopping for all your math and image processing needs. -framework Accelerate gets you all our digital signal processing stuff, the linear algebra I've described, a vector version of selected entry points in Libm at single precision, some large-number arithmetic support, and the new vImage image processing library.

Moving forward, code -framework Accelerate. It's the right thing to do. And additional math stuff is going to end up in there. So it's the way of the future. Okay, so, actually let me go back a moment. And it's how we leverage, right? Use Accelerate and leverage Apple's work.

Here's the point of leverage for you folks. By selecting the right API, you get the advantage of our efforts getting these things fast. So that's the segue into: well, what if you've got code that you'd like to go faster that really doesn't fit into any of the work we've done? How do you go about doing that?

Profiling on the 970 is very interesting. This is a machine that keeps hundreds of instructions in flight and has long and deep pipelines. Very often I've looked at a trace and had the "aha" experience: this wasn't where I expected the performance to be being spent; let's arrange to tune for that.

For a rough cut, you can use the command line tool called sample. But for the real action, you want to use Shark. That's part of the CHUD toolkit. It will zero in on your hot spots and let you make really rapid progress. It's been key in our development. And so I'd like to introduce Eric Miller, who'll come up and talk a little bit about the CHUD tools.

So anyways, I'm Eric Miller with the Architecture and Performance Group. Just a quick thing about the CHUD tools. There was a big session yesterday, I believe it was number 506. There we go. Microphone. Can everybody hear me now? Quick thing about the CHUD tools: there was a session yesterday, I believe the number was 506, is that correct? And so that'll be on the DVD. A very lengthy demonstration with Shark; I think you should all take advantage of that and use Shark as much as you can. But there are a couple of other CHUD tools. There's a suite, so there are probably nine tools to go through.

We leverage the low-level performance monitor counters inside the hardware and in the operating system. We put the ones in the operating system in there just for this purpose. You can find the problems and improve your code with Shark and Monster. And the best thing about the CHUD tools is they're free.

We have an FTP site. I urge you, when you get your developer tools CD and there's a CHUD package, install that package and immediately update, because we drop an update every several days while we're in our beta period. And once we have a gold master, it'll probably be on a semi-weekly basis.

There'll be fixes and improvements. So we're introducing the CHUD Tools 3.0. Shark was formerly called Shikari in versions 2.0 and 2.5. It's a great tool for performance testing, and it can help you tune for the dispatch grouping that the 970 uses.

Monster is a spreadsheet for performance events, things like cache misses, instructions completed, instructions dispatched, and what kind of utilization you're getting out of the various floating point and vector units, integer units in the processor. Saturn can actually take advantage of the PMCs and do call graph visualization, with additional information about what types of performance events are being utilized in a particular function in your code.

We have several tracing tools and Steve will probably talk to some of this stuff. Amber is the most important tool because it actually collects a trace from your application of every single instruction that's executed on the processor. You take this trace and you use that as input to ACID, SIMG4, and once it comes out SIMG5.

You can also take advantage of the CHUD framework, which allows you to actually instrument your applications to directly monitor performance events or control the profiling tools from inside your code so you can bracket important pieces of code. So I've mentioned performance counters many times already in just a couple of slides. What they are is a set of special purpose registers that exist in the processor and memory controller and off in the OS we have virtual performance counters.

These can be accessed by software, obviously, so we created the CHUD tools to do that for you automatically. Although there are actually user versions of some of the performance monitor counters that you can access from user code, but in general to set them up and drive them requires a supervisor application like the kernel, so we put stuff in there.

As I mentioned earlier, there are performance events in the operating system, and page faults is one of the ones you can measure. There are quite a number of virtual memory counters available in the operating system performance counters that you might want to take advantage of for a very high-level look at what's going on with your disk access and virtual memory and physical memory access in the operating system.
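For a quick, code-level peek at the same kind of numbers from inside your own process, the standard getrusage call reports page faults; this is just a plain POSIX sketch, not the CHUD counters themselves:

    #include <sys/resource.h>
    #include <stdio.h>

    int main(void) {
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0) {
            printf("minor (reclaim) page faults: %ld\n", ru.ru_minflt);
            printf("major (I/O) page faults:     %ld\n", ru.ru_majflt);
        }
        return 0;
    }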

So, Steve has alluded to Shark. That's the icon in the upper right there. The nice thing about Shark is you can profile over time, and with the CHUD tools the profiling interval can be as small as 50 microseconds per sample or as large as a second. Or you can use the events: every so many cycles you can take a sample, every so many instructions completed you can take a sample.

You capture everything with Shark from the kernel up through your application so you can know exactly where time is being spent, in what framework time is being spent. And as I mentioned the overhead is very low. It's about 40 to 50 microseconds per sample you can expect to expend taking data from the performance counters.

The other nice thing about Shark is there's automated analysis. Not only will it show you where you're spending time, it'll try and explain to you why time is being spent there and give you opportunities to try to alleviate that bottleneck. Steve also mentioned that we use static analysis to construct the theoretical dispatch groups on the 970. And this is all demonstrated quite nicely in the 506 session, which unfortunately was yesterday. You can save and review all the sessions, which of course should always be there, but it's new in CHUD 3. We didn't have save and review for our tools before.

[Transcript missing]

The heavy view of the trace, in this particular case, has square root on top. And you can see some of the types of outputs you can get. There's some source code in the right picture, and some of the Shark commentary that comes in the form of these little exclamation points in the column toward the right of the window.

I'll just briefly touch on Monster. Again, same thing: timed intervals, event counts. All the CHUD tools have a hotkey, even the command line tools. If you're on the console, you can hit option-escape and launch Shark's profiler, and option-escape to toggle it off again. Monster uses control-escape. So they can both be on the system at the same time.

So the hotkey's kind of neat. Even in a command line tool, it'll just sit there waiting for you to start profiling. And you don't have to go into the shell and type something. You can just do it from anywhere on your, wherever your app is, you can go ahead and start that.

So the big thing about Monster is shortcuts. You can take these performance monitor event counts and combine them together with a simple four function calculator notation. So you can take cache misses and cycles. And compute cache misses per cycle or cycles per cache miss, whichever way you want to do that.

You can also use the memory controller counters and compute bandwidth, by collecting all the transactions and multiplying by a certain value that represents bytes per transaction for reads and writes, this sort of thing. There are several shortcuts that are predefined for each CPU, but you can also make your own ratios and proportions and percentages to print out.

And they come out in a couple of different ways. You can put it in the tabular columns of Monster and then you can chart that data. Same thing, save review sessions, command line version. There's a picture. So there's some shortcuts that have been highlighted, those purple columns on the left, and then you graph them by pressing the draw chart button. These are percentages of load store instructions with regard to all the instructions that were collected in the trace.

With Saturn, the first step is to record your function call history by instrumenting all your code using the GCC instrumentation flags. It's very similar to CodeWarrior, which has some instrumentation flags that will put a prologue and an epilogue in every function. Then once you have those prologues and epilogues, the data can be collected.
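The GCC flag in question is -finstrument-functions; as a sketch, the prologue and epilogue calls land in hooks like these (Saturn supplies its own collector, so the printing here is only to show the shape):

    /* build: cc -finstrument-functions example.c */
    #include <stdio.h>

    /* keep the hooks themselves uninstrumented to avoid recursion */
    void __cyg_profile_func_enter(void *fn, void *site)
        __attribute__((no_instrument_function));
    void __cyg_profile_func_exit(void *fn, void *site)
        __attribute__((no_instrument_function));

    void __cyg_profile_func_enter(void *fn, void *site) {
        fprintf(stderr, "enter %p (called from %p)\n", fn, site);
    }
    void __cyg_profile_func_exit(void *fn, void *site) {
        fprintf(stderr, "exit  %p\n", fn);
    }

    static int work(int x) { return x * x; }

    int main(void) { return work(3) == 9 ? 0 : 1; }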

You can see a typical call tree in the top half of the screen and then you can get the picture of the call tree. The call depth is vertical and the time the call was executing is horizontal. So if you have long spiky calls, you want to try to alleviate those issues.

You can collect call counts, that's great, but you can also collect PMC event counts and see those things and what kind of duration they had. So, we have these instruction tracing tools that I mentioned. ACID's kind of nice as a quick pass; it's sort of SIMG4/SIMG5 lite.

You can just collect trace statistics. So you use Amber and collect an instruction trace that's accurate, then you run ACID on it. You can get these pieces of data out of ACID very readily in maybe one or two screens in a terminal. Whereas SIMG4 and SIMG5 are cycle-accurate simulators for their respective processors, and it takes some learning, which Steve will get into, to understand their output.

Using the CHUD framework, you can instrument your source code, start and stop the other graphical user interface tools, and also directly read and write the performance counters in your code. There's also an HTML reference guide that is generated every time we do a build of the framework, so the HTML is updated with any new things we put in.

So a quick example: you always call chudInitialize whenever you use this stuff. You acquire remote access, and you tell Shark that you want to use remote access. Then you start Shark and give it a label. Then your important function executes. Then you stop it and release remote access so another thread or another client can use it. Shark will automatically profile your important function, and your important function only, and you'll get the results in the GUI.
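From memory, that sequence looks roughly like the sketch below. The header and entry-point names are recalled rather than taken from the slide, and important_function is a hypothetical stand-in, so check the CHUD framework headers before relying on any of it:

    /* sketch only: CHUD entry-point and header names recalled from memory,
       not verified against the shipping framework; link with -framework CHUD */
    #include <CHUD/CHUD.h>

    extern void important_function(void);   /* hypothetical: your hot code */

    void profile_important_function(void) {
        chudInitialize();
        chudAcquireRemoteAccess();               /* tell Shark we want remote control */

        chudStartRemotePerfMonitor("important_function");  /* label shown in the GUI */
        important_function();                    /* only this region is profiled */
        chudStopRemotePerfMonitor();

        chudReleaseRemoteAccess();               /* let another client have it */
    }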

A slightly longer example is where you actually set the counters explicitly, clear them, and start them; then your important work happens; then you stop the counters, and it returns arrays of double precision floating point values. If there are six counters, there'll be six entries, one for each, zero through five, in the output arrays. You take those output arrays and then do whatever you want to do to present them to yourself. Maybe log them or chart them.

This thing is a little finicky. So, how do you get CHUD? The easiest way is it's on the developer tool CD, but updates will come directly from the web. There's an updater that will run automatically the first time you install CHUD, and thereafter there's preferences. You can check the status of the CHUD package for new updates hourly, daily, weekly, or monthly.

We do have internal guys that do check hourly and they get upset when a couple hours go by and there's no new updates. So the best way to get in contact with our team is to use the [email protected]. And with that I will turn it back over to Steve.

Excellent. Thank you. Thank you, Eric. I think I'm going to go with a keyboard. Eric has covered the astonishing capabilities of CHUD, but don't think you have to do anything really special. The most you need to do, really, is to learn to use the hot key. If you haven't been down to the performance lab, the G5 performance lab, with your app, please do come down.

If you find me there, I'm likely to be a pest and hover over your shoulder, wait until your app is running and grinding the CPU, and say, hey, can I start CHUD? And it's basically start the app, hit the hot key, wait a few seconds, and then look at the samples. People have been dropping jaws at what they see and how simple it is. CHUD comes up. I'll take a question at the end.

Quick question, you don't even need that, so, and we can show you why. So people are quite astonished to see how quickly it happens and sort of the almost immediate insight they get, gee, there's my high runner right at the top of the window. Let's click into that and see which instructions are slowing me down here. It's really a great thing.

So with some background on Chud behind us, I now want to talk a little bit about this regimen I go through to really torque down and get some performance out of floating point intensive code. First concern often is memory. The machine has an enormous amount of bandwidth if you can use it effectively.

You need to load data early so that it's available early to the out-of-order execution cores. If the data ain't there, the cores can't do the instructions, so get the data there early. So in examples, you'll see that I load polynomial coefficients, literal constants very early on in subroutines, even speculatively.

Even if they may not be used in a particular branch of the code, I'll often load them early just to have them available in the case we drop through and go. So load early, load often, harness the two LSUs to drive those two FPUs. If you can load data sequentially, there are hardware-initiated prefetch streams that are really effective at getting the data into the machine.

DST and vec_dst, the Moto entry points, are bad eggs. They're a big help on G4, but they're execution synchronizing on G5, so they probably ought to be avoided. You're better off, if you want to prefetch data, using the DCBT class of instructions. You need to be aware that the cache line size on the 970 is 128 bytes.
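A portable way to experiment with that idea is GCC's generic prefetch builtin rather than hand-written dcbt assembly; a sketch:

    #define CACHE_LINE 128   /* 970 cache line size, per the talk */

    /* walk a big array, touching one cache line ahead of the current work */
    static double sum_with_prefetch(const double *a, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            /* a hint only, so running past the end of the array is harmless */
            __builtin_prefetch(&a[i] + CACHE_LINE / sizeof(double), 0, 1);
            s += a[i];
        }
        return s;
    }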

So did I say that the 970 has two FPU cores? I think so. Use them.

[Transcript missing]

Since there are two FPUs, a simple strategy for making sure that you're getting both FPUs fully utilized is to think that you've got a 12 cycle pipe. That is, start a result, don't plan to use it until 12 cycles later, 12 intervening operations. That often means you need to think a little bit more about parallelizing software pipelining your algorithms to get that kind of distance between uses.
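One concrete way to get that distance is to break a long dependency chain into independent accumulators, so each FPU always has work in flight. A sketch (not Libm code):

    /* dot product with four independent accumulators; n assumed a multiple of 4 */
    static double dot4(const double *x, const double *y, long n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (long i = 0; i < n; i += 4) {
            s0 += x[i]     * y[i];       /* four chains the out-of-order core */
            s1 += x[i + 1] * y[i + 1];   /* can interleave across the two FPUs */
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }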

So here's a little piece on choice of algorithm. Sometimes you need to sort of just pop up a couple of levels and think, is there some way to recast the algorithm I have to be more effective on floating point? So the example is it takes two hands to matrix multiply. You're trying to form the matrix C as the product of A and B.

In high school, or I guess junior high school nowadays, you learn how to take the product of two N by N matrices by using two hands to form the output elements C, I, J. You take the Ith row in one hand and the Jth column in the other hand and you form a dot product.

2n fetches and n multiplies later, you have a single element of the output Cij.

[Transcript missing]

To form the 4x4 output block in matrix C, you grab four elements from A and four from B and form all the possible products, pairwise products. Continue down in rows and in columns.

You've done, and let me, I'm just going to look at this so I remember it exactly right: you're accumulating 16 simultaneous inner products. It requires 8n fetches and 16n operations; that's actually a factor of four reduction in memory bandwidth. You've used nearly all the registers, and it's possible to keep the floating point units bubble free. And basically that's the trick that ATLAS uses, and our matmul kernel, to drive the machine at, in the kernel, 84% of peak. So the take-home message is: think parallel if you can, but small parallel, you know, four by four, is manageable.
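In code, those 16 simultaneous inner products are a 4x4 register block, roughly like this sketch (row-major, leading dimension n; block4x4 is our name, not Apple's):

    /* compute a 4x4 block of C = A * B; the 16 partial sums accumulate in
       registers for the whole k loop, then get stored at the end */
    static void block4x4(const double *A, const double *B, double *C, int n) {
        double c00 = 0, c01 = 0, c02 = 0, c03 = 0,
               c10 = 0, c11 = 0, c12 = 0, c13 = 0,
               c20 = 0, c21 = 0, c22 = 0, c23 = 0,
               c30 = 0, c31 = 0, c32 = 0, c33 = 0;
        for (int k = 0; k < n; k++) {
            double a0 = A[0*n + k], a1 = A[1*n + k],
                   a2 = A[2*n + k], a3 = A[3*n + k];   /* 4 fetches */
            double b0 = B[k*n + 0], b1 = B[k*n + 1],
                   b2 = B[k*n + 2], b3 = B[k*n + 3];   /* 4 fetches */
            c00 += a0*b0; c01 += a0*b1; c02 += a0*b2; c03 += a0*b3;
            c10 += a1*b0; c11 += a1*b1; c12 += a1*b2; c13 += a1*b3;   /* 16 multiply-adds */
            c20 += a2*b0; c21 += a2*b1; c22 += a2*b2; c23 += a2*b3;
            c30 += a3*b0; c31 += a3*b1; c32 += a3*b2; c33 += a3*b3;
        }
        C[0*n+0] = c00; C[0*n+1] = c01; C[0*n+2] = c02; C[0*n+3] = c03;
        C[1*n+0] = c10; C[1*n+1] = c11; C[1*n+2] = c12; C[1*n+3] = c13;
        C[2*n+0] = c20; C[2*n+1] = c21; C[2*n+2] = c22; C[2*n+3] = c23;
        C[3*n+0] = c30; C[3*n+1] = c31; C[3*n+2] = c32; C[3*n+3] = c33;
    }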

So here's a little case study I thought we'd go through that gets a little bit more into the tracing tools. The code is the arm of the Libm sine function for arguments that are smallish, between pi over 4 and minus pi over 4. And just to set some landmarks for orientation, you'll see that there's an absolute value function taken at the top, a comparison here to decide if we're in the right arm, some manipulation of the floating point environment, some arithmetic, and then what looks like a polynomial approximation.

Formation of the final result, some more adjustment of the floating point environment, and out we go. So on the G4 series, we had a tool called SIMG4. And if we look at this segment of code run on, again, a smallish argument to sine, we see this picture. This is actually a very good picture, SIMG4-wise. At the top, let's see if I can get this thing going again.

All right, here we go. For our landmark, there's that FABs instruction, and we read across to see this FABs instruction issued roughly at cycle 1200, 1, 2, 3, 4, 5, 6, 7, 8. Spent two cycles getting instruction fetch, one in dispatch, a couple in execution, and then retired.

Next line, store word with update, additional time. The object here is to fall off the cliff. You're retiring instructions as fast as you can. So this looks pretty good. This code's really quite good on G4. Taking just that code unchanged, bringing it to the G5, the 970, and using the SIMG5 trace, well, first of all, it's a more complicated machine.

And the letters have changed. So we spend some time in fetch, five cycles in dispatch, some time in the mapper, and we finally hit execution unit five for that FABs, which is now on the other side of the screen. We finish that operation, what, six cycles later, and we hang around and wait for the rest of that completion group to all finish up, and then we complete.

This doesn't look too bad either. Again, the object is to fall off the cliff. Well, until we get down to the bottom here; this is trouble. And there's a key for these letters, and if we were to look up the key, we'd see that we had essentially a load-store reject. So somebody is trying to load

from an address that was recently stored. Well, how recently? Turns out that store occurred way up top. This is a machine that's putting hundreds of instructions in flight. These dependencies can stretch out over really large lengths of code. Beware, be forewarned, and watch out for these kind of things. By adjusting for this, it turns out this is the manipulation of the floating point environment.

We can end up doing quite a bit better. You'll notice, first of all, this is shorter, and the fall off the cliff is much more precipitous. And we've eliminated that nastiness down in here. So that's the kind of information you can gain from the SIMG4, SIMG5 class of tools. I think it's an adjunct to CHUD. I think CHUD with Shark is the first place to look. But when you, you know, wanna eke out the last little bit of performance, this is the tool that I turn to.

So here's the final version of that code. Just for landmarks, there's that FABs again. Still have the compares, but we've, turns out we've manipulated the environment by bringing it into register rather than storing. I've adopted this style where I split across a single line in the C code places where I think I can gain parallelism. So here's loading early and often of the polynomial coefficients. Here are operations that I believe go in parallel on the dual FPUs. And then out.
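The shape of that code is a polynomial in x squared, split into independent pieces so the two FPUs can work in parallel. Here is a sketch using plain Taylor coefficients; Libm's actual minimax coefficients and environment handling are not reproduced:

    /* sin(x) for |x| <= pi/4, Taylor coefficients only, for illustration */
    static double sin_small(double x) {
        /* "load early, load often": coefficients up front */
        const double c3 = -1.0 / 6.0,    c5 = 1.0 / 120.0;
        const double c7 = -1.0 / 5040.0, c9 = 1.0 / 362880.0;

        double x2 = x * x;
        double x4 = x2 * x2;

        /* two independent chains, recombined at the end */
        double lo = c3 + c5 * x2;        /* low-order half  */
        double hi = c7 + c9 * x2;        /* high-order half */
        double p  = lo + x4 * hi;        /* c3 + c5 x^2 + c7 x^4 + c9 x^6 */

        return x + (x * x2) * p;         /* x + c3 x^3 + c5 x^5 + c7 x^7 + c9 x^9 */
    }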

So, a quick summary of the regimen I like to use. Start with CHUD, look at Shark, go back to your code, pay attention to load store issues. Think in terms of dual FPUs. Even just organizing the layout of your code can help you think about when you can take advantage of things going in parallel. Use as many registers as possible. Let the hardware-initiated prefetch streams help you get data into the machine early and often.

and when directed by SIMG4, SIMG5 kind of tools, look at dispatch group formation just to make sure that you're not crowding instructions that you think ought to be going to separate FPUs, one on top of the other in a single issue queue. That was the slide that came much earlier in the talk. So, time to wrap up. You can review these sessions on the DVD.

You can contact myself or Ali for questions. Ali does a little skunkworks operation every once in a while, but he can tell you about that. And for more information, there are two really fine tech notes now on the web at the developer site.

[Transcript missing]

There's some other interesting documentation at the, somebody.html here. We'll have to resolve this. And finally, I'm a big fan of the fellow who writes for Ars Technica describing the PowerPC. This is a really lovely introduction to the machine and a good place to start a quiet evening with your laptop.