Hardware • 52:11
The 64-bit G5 processor and high-bandwidth architecture make the Power Mac G5 a breakthrough in processing power. View this session to learn how to unleash all the power of the Power Mac G5 platform. We focus on optimization for the G5: memory I/O, using the Velocity Engine, Mach-O, threading, loop unrolling, loop alignment, gcc G5-specific flags, and other ways you can optimize your code for this powerful platform.
Speakers: Mark Tozer-Vilchez, Sanjay Patel, Steve Hikida, Ron Price
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.
Good morning. Welcome to session 503: Optimizing for Power Mac G5. It's been about a year since we introduced – actually, exactly a year since we introduced the Power Mac G5. And last year, if you were here at WWDC, most of the content that we presented dealt a lot with the architecture of the G5 chip itself.
We had optimization labs running several days into the night, so late night at times, trying to get as much information to you as a developer to understand the differences between the G4 chip architecture and the G5. There are very stark differences there. This year, we want to, again, reemphasize the architectural differences and why it matters to writing optimized code.
But we also want to now make sure that you guys understand how to utilize the tools that we supply in our toolset, as well as the compiler options that you have in terms of helping you optimize your code itself. So we have several speakers this morning. From the compiler team to our performance group, as well as a guest speaker from IBM this morning. So with that, I'd like to start off this morning, introduce Sanjay Patel from our performance group.
Good morning. It's a tough time slot. For those of you who were here last year, we're going to do a bit of review. How many of you were here last year for this talk? Okay, so we're going to go through some G5 architecture and then we want to talk about things you can do to help improve your code on G5 and, in fact, all platforms, really. And then we'll turn it over to some of our compiler guys to help guide you through your process of optimizing your code.
So to start off with, when you're talking about the G5, you have to of course start with the PowerPC 970 chip. And this is of course a super-pipelined, superscalar processor that we teamed up with IBM to make. It's based on the Power4 server architecture.
The big addition that we had to make it an Apple chip was to add what we call the AltiVec engine, also known as the Velocity Engine. This is a 128-bit vector unit, which does floating-point and integer math. And the other big difference is we have this high-bandwidth, point-to-point interface that connects the chips to the memory controller. And to help take advantage of all the bandwidth that we have available in the system in theory are these automatic hardware prefetch engines that help put all that theoretical bandwidth to use.
So this is a die shot of the 970 chip – for hardware engineers, this is kind of like pornography. You know, you look at this and you'll hear them say things like, "Look at the FPUs on that one." For software engineers, what you want to take away from this is that there are lots of execution units available for your program that all operate in parallel. So two independent load/store units, two independent IEEE-compliant floating-point units. You have the full implementation of AltiVec, and two fixed-point units as well. So there's just a lot of space here to get things done in parallel.
Another way to look at this from the software perspective, again, is if you start at the top, you have the L1 cache, which holds your instructions. From the L1 cache, instructions flow into a fetch queue, and from there they get dispatched, up to five instructions on each clock. So at 2.5 GHz, you're dispatching, in theory, over 10 billion instructions per second.
Now that's actually the narrow point of the 970 architecture. So as you can see, this stuff in the kind of greenish-gray is out-of-order execution. So you can feed 12 independent execution units with 10 issue queues once you've dispatched. And then again, your instructions complete at the bottom of this picture in order, up to 5 per clock.
Now, a good way to put this in perspective is to look at how the G5 compares to the G4. And so, we keep talking about the parallelism of this chip. One way to measure that is how many instructions you keep in flight simultaneously. And for the G5, it's over 200 instructions in flight compared to a little over 30 for a G4 architecture.
And we've also increased the pipeline stages, so you can see that it's more than doubled for a simple integer instruction. And usually you'd say, "Well, that's not so good. Why'd you do that?" The reason we increase the pipeline stages is to help increase frequency, so we just announced that we hit 2.5 GHz. In order to hit higher and higher frequency numbers, you increase the pipeline depths.
We talk about some of the execution units again. We've doubled up on the load store units, doubled up on floating point units. You now have two fixed point – two general purpose fixed point units, whereas on a G4 architecture you have up to three simple units that do things like adds and subtracts, and one complex unit to handle things like multiplies and divides.
The vector units are pretty similar. You have an ALU that handles floating point and integer, and you have a permute unit. For any of you who have done any vector programming with AltiVec, you know what the power of the permute unit is in terms of swizzling data out of memory and into the registers.
As we work our way out from the core, the biggest programmer-visible difference that you'll find is that the cache line size is different. It's now 128 bytes, whereas it was 32 for a G4. Now, that can either be a really good thing or a bad thing, and I'll show an example of how that happens. As we work our way again toward memory, the L1 data cache is the same size, with different associativity and write policy. The L2 cache – sorry, the L1 instruction cache is also doubled up in size.
There's the L2 cache. The L2 cache is also doubled in size compared to 7450. You'll notice there's no L3 cache on a G5 system, whereas you had up to 2 megabytes of L3 on a G4 system. Now, we've made up for that by increasing the processor bandwidth substantially – one, by doubling the width of the DDR interface and increasing its frequency, but also the front-side bus. And this slide is actually a little out of date. At 2.5 GHz, we've increased the front-side bus frequency to 1.25 GHz, so you can actually, in theory, get 5 gigabytes per second in each direction.
So I want to talk about some programmer problems I've seen over the last year since we've introed the machine. And the one that comes up most frequently turns out to be a rather simple thing. It's conversions from floating point to integers. And the reason this shows up a lot is because when you write this in C, it looks really cheap, right? You just cast your variable into int or cast it to float.
But it turns out this is not cheap at all. Particularly on a G5, because it has so much going on in parallel, when it hits this condition – because the PowerPC architecture doesn't have a direct register transfer between the integer and floating-point register files – you actually have to go through the L1 cache and come back. So you have a store and a load operation going on.
There are a lot of things you can do to avoid the problem, and the biggest one that I've found that actually turns out to be one of the easier solutions is simply don't do that. It turns out, again, because it looks so cheap and easy, that a lot of people just cast from one to the other without thinking about it. And when you examine that code, you realize you could have stayed in one domain or the other without hampering or affecting the algorithm in any way.
The other cool way you can get around the problem is AltiVec, and of course if you use AltiVec, you're going to get a much greater speedup because of all the parallelism in the AltiVec unit. The AltiVec unit handles floating point and integer in the same register set, so there are no memory operations when you convert between types.
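For illustration, here's a minimal sketch of an in-register conversion using the AltiVec C extensions (this assumes Apple GCC with -faltivec; other compilers may want altivec.h and -maltivec):

    /* Convert four 32-bit ints to floats entirely inside the vector unit;
       vec_ctf does the conversion with no load/store round trip. */
    vector float ints_to_floats(vector signed int vi)
    {
        return vec_ctf(vi, 0);   /* second argument: fixed-point scale of 2^0 */
    }

    /* The reverse direction uses vec_cts (float -> signed int), also in-register. */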
Another potential optimization is to use the gcc compiler flag -fast. This tries to schedule your loads and stores and even inserts no-ops to separate them out, to keep things flowing through the G5. IBM's XL compilers also do this kind of optimization if you specify the G5 architecture.
So I want to show you a quick and really simple example of bad code. So this is a real lame loop. All it does is int to float conversion because the loop counter in this case, i, has been declared an int, but we're adding it into a floating point sum.
So that looks really cheap, but it's not. Every time you do that add, you're going to have to convert the i from the integer to the floating-point domain to add it into the sum. So how would you get around this problem? What you can do is create what I call a shadow of the i variable in the floating-point unit. I've just named it i_fp to denote that it's the floating-point value. And when I initialize i, I also initialize its shadow. And when I increment i, I increment its shadow.
Now inside the loop, we're going to use the floating point value for the sum rather than the integer value. So on a G5, I measured this. It turns out this code is three times faster than the previous code where you're doing the conversions because this code won't have to do all the load and store operations.
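A minimal sketch of that before-and-after (the variable names here are just illustrative):

    /* Before: i is an int, so every iteration converts int -> float,
       which on the G5 means a store and a load through the L1 cache. */
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += i;

    /* After: keep a floating-point "shadow" of i and stay in the FP domain. */
    float sum = 0.0f;
    float i_fp = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += i_fp;        /* no conversion needed              */
        i_fp += 1.0f;       /* increment the shadow alongside i  */
    }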
And the next biggest thing I've seen over the last year is just improvements to try to schedule your code. And like I say, the G5 has a lot of execution units that you can operate in parallel, but if you write dependent code – so one operation depends on the next, depends on the next, it's all serial – you're not going to take advantage of all those units, and furthermore, you're not going to take advantage of these long pipelines. You have to schedule your code so that you're filling in all these pipeline slots instead of causing bubbles in execution.
So, compiler help is here. You have gcc 3.3, which has G5 architecture tuning, so that tries to schedule for the available units and slots. XLC also has the same kind of flag where you specify that you have a G5 architecture. And the other thing you can do is use Shark, which you've probably heard about, and we'll talk about it in more detail tomorrow at 3:30. We have a full session on how to use Shark and what it can do for you.
Now again, I mentioned that we've increased the pipeline stages compared to a G4 on the G5. So what does that mean? Well, it means it takes longer in terms of clocks for a simple instruction to complete. So, for example, an addition instruction may take one cycle on a G4, but it may take two cycles of latency on a G5.
So what you want to do is account for that in your program by kind of grouping a bunch of the similar operations together. So that means you can unroll your important loops or you can use a compiler flag. And as well, you want to schedule your code for the G5, so you're going to fill in all those pipeline slots.
Now, people often ask, "Well, shouldn't the compiler do that for me?" And in all these examples, you can always ask that question, "Shouldn't the compiler do that for me?" In some cases, the compiler can do it for you if you specify the right flags. But there's always a downside if you just lean on the automatic way to work around this problem, because the compiler usually doesn't know whether a loop is important or not. That's something you have to tell it.
So if you choose to unroll all loops or unroll most loops, you're going to have a big increase in code size, which could be detrimental to your performance. That's why as a first pass, you should profile your program. And try to do manually some of these optimizations just in the important spots.
So here's another example of some code that's, again, just a silly example. We're just going to sum a bunch of ones in this case. The 970 architecture of the G5 has two floating-point units, and they're each six stages long. So this code is only going to get approximately one-twelfth efficiency because every instruction is dependent on the previous sum.
So this is a simple example. Now, the code has exploded, right? Because we're trying to fill all the pipeline stages, and here we actually only unrolled to eight-way partial sums, so we don't fill all 12 pipeline slots. You would actually want to do 12 in order to maximize your gains on the G5. So you can think of the floating-point units as either one 12-stage pipeline or 12 single units, but they're all going to operate in parallel. So this code turns out to be 10 times faster just using partial sums instead of one variable.
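A minimal sketch of the transformation, shown four-way for brevity (the talk recommends going as wide as the pipeline depth):

    /* Serial version: every add waits on the previous sum. */
    float sum = 0.0f;
    for (i = 0; i < n; i++)
        sum += 1.0f;

    /* Partial sums: independent accumulators keep the FPU pipelines full. */
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += 1.0f;
        s1 += 1.0f;
        s2 += 1.0f;
        s3 += 1.0f;
    }
    for (; i < n; i++)          /* handle any leftover iterations */
        s0 += 1.0f;
    sum = (s0 + s1) + (s2 + s3);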
The other big thing you have to worry about when you're optimizing code – and this goes for all architectures, but it's particularly bad on a G5 because the G5 core is so good. It makes memory look really, really slow. What you have to do is try to reduce operations where you're waiting on memory. So effectively reduce your latency. There are a couple of ways to do that.
One is to rely on hardware prefetch engines. And I'll show you another example of that. The other thing you can do is use software prefetch instructions to get the data before you actually need to use it for computations. For example, if you're in a loop, you can batch all your loads together at the top of the loop, do a bunch of math, and then do stores at the bottom. That's going to perform better than doing serial operations of load, math, store.
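As a hedged sketch of that idea using GCC's generic __builtin_prefetch (which maps to a PowerPC cache-touch instruction); the prefetch distance here is a guess you would tune with a profiler:

    for (i = 0; i < n; i++) {
        /* ask for data one G5 cache line (128 bytes = 32 floats) ahead */
        __builtin_prefetch(&a[i + 32], 0, 0);
        __builtin_prefetch(&b[i + 32], 0, 0);
        sum += a[i] * b[i];
    }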
I mentioned that the data cache is different on the G5 than the G4, and the biggest difference is the cache line size. So it's four times as big. What does that mean? Well, it means you may get one-fourth the cache misses if your data is organized nicely. It may also mean that you're getting really terrible performance if you're accessing one byte, skipping 127 and accessing another one byte. At that point, you're getting less than 1% efficiency from your cache.
So what you want to do – and this is sort of basic CS, right? Pack your data together to maximize its locality. So as you walk through your array, you want to be stepping sequentially rather than jumping around. This has an additional benefit of triggering the hardware memory – the hardware prefetcher. So the CPU is automatically going to detect that you're walking a straight line either up or down through memory and start prefetching cache lines from memory into the cache.
So again, here's another simple example. This is a classic two-dimensional array where we're walking the wrong way through it. We're iterating down the columns in the inner loop rather than across the rows, so we're skipping large chunks of memory. In this case, what you'd want to do is just switch the for loops so that you sequentially access every element in the array. So any guesses on how much faster this is going to be? It's a big difference. This is simple stuff, but it's 30 times faster if you do the right thing rather than the wrong thing. It highlights how important accessing memory is.
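A minimal sketch of the wrong and right loop orderings for a row-major C array:

    /* Wrong: the inner loop walks down a column, so consecutive accesses
       are a full row apart in memory and the prefetcher can't help. */
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            sum += a[i][j];

    /* Right: swap the for loops so every element is touched sequentially. */
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            sum += a[i][j];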
So I just want to summarize some of the things you should be doing and looking at while you're trying to optimize code. The first thing to do is try to unroll and schedule important loops. Because you have all these execution units – the independent floating-point units, the independent load-store units. You have hundreds of instructions in flight.
This number is actually out of date as well. You can now use AltiVec to calculate more than 36 gigaflops at 2.5 GHz. This is of course the best solution if you have code that's just massively parallel – you can operate on all the elements simultaneously. For those of you writing floating-point code, the G5 has a hardware square-root instruction, which can be enabled in gcc with the -mpowerpc-gpopt flag. XLC will recognize that this instruction is available if you specify the G5 architecture.
This has made a very large difference in some ray tracers and renderers and other programs that I've looked at that have a heavy dependence on square root. If you're using 64-bit integers – long longs in C – you can turn on flags to specify that you have a 64-bit machine, because the G5 truly does have 64-bit integer registers. This can make a huge difference compared to the compiler breaking operations up into 32-bit chunks.
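A small sketch of the square-root case; the compile line in the comment is one plausible set of flags, not necessarily the exact ones shown in the session:

    #include <math.h>

    double magnitude(double x, double y)
    {
        /* with the G5 instruction set enabled, this can compile down to the
           hardware fsqrt instruction instead of a library call */
        return sqrt(x * x + y * y);
    }

    /* e.g.: gcc -O3 -mcpu=G5 -mpowerpc-gpopt -mpowerpc64 magnitude.c
       (-mpowerpc-gpopt allows fsqrt; -mpowerpc64 allows 64-bit integer
        instructions for long long arithmetic) */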
So again, the system and the chip were designed for high bandwidth. They were designed to do lots of things in parallel. It's part of the server heritage coming from the Power 4. You have 40 gigabytes per second to the L1 cache, up to 80 gigabytes per second between the caches, and up to 5 gigabytes per second to and from main memory.
And the way you want to take advantage of all this theoretical throughput and put it into practice is to take advantage of the hardware prefetch engines. These will start scooping data out of main memory and bringing it into the cache before you actually need it. And that's all I have. So with that, I'd like to introduce Steve Hikida from the IBM compiler team.
Good morning, all. Good morning. I'm actually very excited to be here as we have now introduced our XL C/C++ and Fortran compilers for Mac OS X. So, IBM compiler technology: we've been in the business of compilation technology for over 15 years, primarily exploiting PowerPC technology, but we've also been on about nine other platforms, mainly IBM platforms. Among all this technology, we've got numerous optimization patents that truly exploit the PowerPC technology.
Our goal in the IBM compiler team is actually threefold. The first one is to exploit the hardware. Our key here is to drive out the maximum performance we can possibly get out of the G5 processor. Among these things, we have an extensive portfolio of optimizations. These include things like interprocedural analysis, which does whole-program analysis; profile-directed feedback; and loop optimizations for parallelism, instruction scheduling, and data locality.
One of the things that we do regularly is work very closely with the chip architecture team: we've been working with the core team that actually develops the chips, providing them with information about ways they may want to change the chip, and learning what information we can exploit within our own compilers.
The second thing that our compiler group is really focused on is specifications and standards. For our C and C++, we are C99 and C++98 compliant. For Fortran, we support Fortran 77, 90, 95, and partial Fortran 2003. We also have OpenMP support – primarily that was introduced on the AIX platform, and we are bringing it over to Mac OS X as well. Our developers within our C/C++ and Fortran teams are also represented on the standards committees. They're not only on the ISO standards committees, but they're also in the OpenMP consortium.
So being really focused on compatibility and on standards specifications, source code that can be pumped through our compilers is easily portable between numerous platforms – for example, Mac OS X, Linux, AIX, and our mainframes running z/OS. The third thing that we're really focused on is customer care.
We work very closely with various ISVs and also customers on tuning their code. As a matter of fact, we're down in the optimization lab all this week, and some of our engineers have been working very closely with people who have brought in their code. We've actually seen speedups of anywhere between 20% and 200%, even in a short period of time, by using our compilers to exploit their code.
So, the C/C++ and Fortran compilers on Mac OS X are based on our AIX and Linux compilers. On AIX and Linux, we currently call it VisualAge C++, and the VisualAge compilers for those platforms are essentially the same compilers that we have for the Mac OS X platform. So this actually leverages all the proven optimizations and language specifications that we've already introduced on those platforms. Some of the common things between our XL C/C++ and Fortran compilers are, as I mentioned already, the exploitation of the G5 architecture, integration with Xcode, symbolic debugging with GDB, and support for a number of Apple's profiling tools.
Shark in particular is just an outstanding tool for helping tune your code – frankly, we wish it was available on some of IBM's own platforms. Two other things that we have are part of what we call a technology preview. These are features that we are actually looking at trying to bring into our product: although they're there right now, they're not formally supported. In particular, there's OpenMP – our direction is to have full support for OpenMP 2.0 – and the other one is automatic parallelization.
Specifically then for C/C++, as I mentioned, there's the standards compliance for C99 and C++98, and exploitation of AltiVec. Although this compiler right now can actually generate code for the AltiVec instructions, one of the things that we are looking at on an ongoing basis in our research and development is automatic SIMDization, otherwise known as automatic generation of AltiVec instructions. So these are things that we are definitely focused on and looking at for future releases.
Compatibility with GCC 3.3. This is two-fold. One is we are fully binary compatible, so you can intermix GCC objects with our compiler. And the other one is that we have a number of language extensions that are GCC-specific, so you can have source code compatibility. For C++, we also have an Objective-C technology preview.
And then for XL Fortran, as I mentioned, we already have Fortran 77, 90, 95, and partial 2003. We also introduce many IBM and common industry-standard language extensions, and these include some from VM, from z/OS, and other well-known platforms for Fortran. So that concludes just a quick overview of the XL C/C++ and Fortran compilers.
If you have any questions on how to exploit your code, how to gain even more optimization capabilities and performance out of your G5, come on down to the optimization lab where there's a number of us IBMers there to help answer any of your questions. Thanks a lot. And the next person up is Ron Price.
Thank you, Steve. Well, this has been a terrific year for the G5, and I think we all now understand what kind of power lurks in that box. But those of us who have really worked a lot with it understand what it takes to extract that power. And Sanjay covered some of that, and of course, Apple offers a set of tools that really facilitates understanding your program and being able to algorithmically allow it to extract that power.
I want to talk today about what the compiler can do to get you started on that path, because not everyone is ready to step up and start tuning their program and changing the algorithms and so forth. And so Sanjay mentioned a number of compiler options that can help you in certain situations.
And what we have done within the compiler group is actually put together a mode we call -fast, and I want to talk about that today. I also want to talk about feedback-directed optimizations, which is another component of the -fast story that can help you significantly. And then, of course, you have all heard the announcement this week that we're on the path to deliver, with Tiger, our initial cut at auto-vectorization, and I want to talk about exactly what that is.
So, GCC and the -fast mode. Could I just ask in here if anyone's using this mode today? Ah, geez, amazing. We've had so little feedback on how it has been working for people that we've wondered if anyone's using it, and that's why we wanted to talk about it today.
The -fast mode is really a collection of a lot of the compiler options, but in many ways it's more than just a collection of options. We've put them together in a homogeneous fashion, to the best of our ability, to target what I would call typical applications. And of course, we all know there's no such thing as a typical application, but in this case I mean applications that are computationally intensive.
So if you do a lot of mathematical computation, we've tried to target a mode that will give you a first step into getting some of the performance. However, the details of when you use that mode are important, so you can't totally get away with saying, "I don't need to understand my program and what's going on in it."
And so I'll talk a little bit about the details that are important and give you a good feel, at least, for -fast and what it is. And then finally, there is a variant of -fast called -fastf, and that's really what you should be using if you're working with C++. There are some things that we do slightly differently to try and address performance in the C++ world.
So what are some of the specifics about -fast mode? What are we trying to actually attack? Well, Sanjay and others have talked about the deeply pipelined nature of the architecture and the wide functional units.
And so one of the things that you have to really be concerned about to get performance is keeping the pipeline filled, as we call it. And so there are a number of optimizations, some that Sanjay mentioned, that we have brought together to try and keep this pipeline filled. So we're feeding this monster at the speed it would like to be fed.
I want to talk a little bit about standard conformance and some of the things that we do to relax the rules so that the compiler can actually do a better job for you in terms of optimization. And then finally, of course, the G5 instruction set. This is a presentation on the G5.
So, to start off with, I don't know how many of you have ventured into the -O3 optimization level, but I want you to know that's just the starting point for -fast. So you'll get that with -fast. And along with that come a couple of important options. One is inlining functions. Basically – and you may understand this – that says within a compilation unit, the compiler can use some heuristics to determine how to inline functions within that compilation unit.
And the real purpose behind the compiler doing that is that the more code the compiler has an inline view of, the better all of the optimizations can be performed. So the bigger the view that the optimizer has, the better the optimization. The second is the rename-registers option.
And what this simply does is it gives the compiler more freedom in terms of its register allocation. And it does that at the expense of you being able to debug your code, but if you're on this ragged edge of trying to get optimum performance, that is one of the pitfalls you have to deal with.
The second capability that I want to talk about is intermodule inlining, or function inlining across the entire program. Where the previous inlining option looked at one compilation unit, intermodule function inlining looks at the whole program. And so it gives you that many more opportunities to consider inlining throughout your program. And once again, there are heuristics that we have determined are the best when you're making guesses about inlining and you really don't know whether a function is called a lot or not. I'll be mentioning another feature a little later on, though, that deals with that. This is a command line, then, that would represent you implementing intermodule function inlining. Basically, that's triggered by putting all of your compilation units on the same compile line so the compiler can look at them all at once.
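As a rough sketch, with made-up file names, the command line would look something like:

    gcc -fast main.c solver.c io.c -o myapp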
The next – and Sanjay talked about this – has to do with loop unrolling. And the compiler can actually do that loop unrolling for you. Once again, a very simple-minded loop, but it will serve as a representation here; the compiler can actually deal with more complex loops. Unrolling simply means that the compiler reduces the number of iterations of the loop and puts several iterations actually in line.
And so once again, what you're doing is cutting down on the branching operations and trying to give the scheduler more opportunities for scheduling the other operations in the functional unit. There is another form of loop unrolling that the compiler does called loop peeling. And in that situation, you can see here we have an even smaller loop, and these do occur in code. And the compiler will simply unroll the entire loop and eliminate the loop altogether.
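A tiny sketch of loop peeling:

    /* A small fixed-count loop like this ... */
    for (i = 0; i < 3; i++)
        a[i] = b[i];

    /* ... can be peeled completely, eliminating the loop and its branches: */
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[2];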
The next option is loop transposition, and we have several loop transpositions. This is similar to what Sanjay was talking about, and I think he indicated you could turn on this option. But basically, we have a double-nested loop here, and it is stepping through memory in fairly large increments – in this case, 1335.
And that has a terrible effect on the paging within the machine, and so for data locality reasons, we include this loop transpose function. And what it will do is the compiler is able to recognize that situation and actually do the transposition of the loop, so that now we're incrementing in increments of one throughout the memory.
We have a specialized optimization called loop-to-memset. What that is, is if you have initialization loops over arrays where you're initializing things to zero, the compiler will actually transform that into what's called a memset. And memset on each of our architectures, including the G5, has been highly tuned in such a way that you can't beat it with your own code.
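For example, a zero-initialization loop of this shape is the kind of thing that optimization recognizes:

    #include <string.h>

    /* written as a loop ... */
    for (i = 0; i < n; i++)
        buf[i] = 0;

    /* ... effectively becomes a single call to the tuned library routine */
    memset(buf, 0, n * sizeof buf[0]);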
And last, we talked about tuning for G5, and tuning for G5 is really important because it tells the compiler this is a G5 architecture. The compiler understands then how to schedule instructions for the maximum grouping so that we can keep all of the functional units going as much as possible in parallel. So we're really extracting the power of having a wide functional unit set.
Okay, I mentioned standard conformance and so I have a couple of relaxation rules here. One of them is having to do with an option called strict aliasing. And aliasing is a situation where you have two pointers and those pointers are actually pointing to the same object. So those objects are then aliased. And the compiler often can't tell, well, even if they're different data types, if in fact the objects are aliased. And so it has to make the assumption that they are.
Well, you can help the compiler out by, if you know in your program – and this is one of your program knowledge items – if you know that your pointers are never aliased within your program, you can tell the compiler, use strict aliasing assumptions here. And so in the example, basically what this means is – and this is a very simple example – strict aliasing tells the compiler, by the way, that if pointers are a different data type, they will not be aliased.
You can assume they're not aliased. So in this particular case, without strict aliasing, we would actually have to reload the value stored into pi (a 1, in this example) before we return it. By saying strict aliasing, we're able to understand that we don't have to worry about reloading that; in fact, it can be kept in a register and returned that way. So this can have, interestingly enough, a pretty big impact in many programs.
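A minimal sketch of that situation (the names here are illustrative):

    int f(int *pi, float *pf)
    {
        *pi = 1;
        *pf = 2.0f;
        /* Without -fstrict-aliasing, the compiler must assume the store through
           pf may have clobbered *pi, so it reloads *pi from memory here.
           With -fstrict-aliasing, pointers to different types are assumed not
           to alias, so it can simply return the 1 it already has in a register. */
        return *pi;
    }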
The second thing that falls within the area of conformance is our fast-math option. And fast math, you should understand, is not IEEE conformant. But, by the way, almost all code doesn't require IEEE conformance – and you know it if yours does. So by relaxing that rule, we relax things back to where the compiler can assume that the associative, distributive, and commutative principles hold.
And so it can actually rearrange code in the fashion that I show here on the screen to best utilize the scheduling of math and the computation of these operations. This is another one that can really win for you. And if you really don't need to understand whether on the boundary conditions that it's not a number, it is a number, or it's infinity, then you should try using fast math and try that in your program.
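One concrete example of the kind of rearrangement -ffast-math permits (a sketch, not the slide's exact code):

    /* As written, this performs n floating-point divides: */
    for (i = 0; i < n; i++)
        a[i] = b[i] / scale;

    /* Under -ffast-math the compiler is free to rewrite it as one divide
       and n much cheaper multiplies, at the cost of strict IEEE rounding: */
    double inv = 1.0 / scale;
    for (i = 0; i < n; i++)
        a[i] = b[i] * inv;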
Hardware-specific, then: of course, we say -mcpu=G5, and in the gcc compiler that says you're perfectly free to use any G5 instructions that are available. And then inline floor happens to make use of a couple of specialized instructions in the G5 to actually inline the floor intrinsic right in line.
The next three have to do with alignment. And one of the things that we've learned about the G5 is that it's very sensitive to alignment. You can make dramatic improvements in your code's performance if you try to deliver well-aligned data and well-aligned code. In this case, we're aligning loops, jumps, and functions all on 16-byte boundaries. And yes, this does cause some bloat in your program, but our experience is that the performance far outweighs the bloat and the size increase in the program. The last item there, -malign-natural, says: align all data types on their natural boundaries.
So you have to be cognizant of that if you're concerned about data being packed together or things of that nature, because data types will then be aligned with perhaps gaps between them. I would just encourage you, moving out of the -fast part of this, to give that a try. Feel free to give us feedback in terms of problems you have. One of the things it can help you trigger is the need to go to the next step of analysis.
If in fact you use -fast and you don't see a speedup within your code, then you should be thinking that, well, maybe I've got some algorithmic problems or some memory-accessing problems that are the real performance killers. There's no way the compiler is going to optimize you out of those. You will have to go to Shark and try and understand the program and what's causing that to happen.
The next thing I want to talk about is feedback-directed optimization. And feedback-directed optimization is really an optimization that allows you to tell the compiler in more detail exactly how you expect your code to execute. And the compiler will take that knowledge into account and will do a better job of optimizing. It's used, number one, for inlining. The concern about inlining – and it was mentioned by Sanjay – is that, boy, if you over-inline, you can kill performance as well.
Well, using feedback-directed optimization, we actually tell the compiler, from the results of a training run, exactly how many times a function was called at a given call site and how many iterations a loop that has a function call inside it actually executes. And so you can make very good decisions in terms of performance-versus-size trade-offs, as opposed to using guesses, which are the norm. The second thing it's used for is what we call hot and cold partitioning.
And hot and cold partitioning – the best example I have for that is an if statement. You have two branches, and one of those gets executed predominantly, and the other one only occasionally, maybe only in an error condition. So we tag the hot one and start grouping the hot code together, and we take the cold code and move it off together at the end of the program. And so we help compact the program down and keep its footprint small, so that we reduce paging once again.
In operation, there are a couple of flags that you use to do this. So first you would use the create profile flag and you would actually create an executable that is instrumented such that it can gather the profiling information. You run that with a training set of data. Then you rebuild your program, optimizing it using the profile that you just created.
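A sketch of that two-pass workflow using the generic GCC flag spellings (-fprofile-arcs to instrument, -fbranch-probabilities to consume the profile); the "create profile" and "use profile" flags mentioned here may be spelled differently in Apple's compiler:

    gcc -O3 -fprofile-arcs app.c -o app_train
    ./app_train < training_data        # run with representative data; writes profile files
    gcc -O3 -fbranch-probabilities app.c -o app_tuned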
Not all applications, I realize, lend themselves to this type of profiling – maybe it's an interactive type of application. But certainly if you have computationally intensive applications that work on large data sets, training the compiler to optimize the application for that data is really a great thing to do. Well worth the effort.
Then finally I want to talk about AutoVectorization. Just out of curiosity, how many are using the AltiVec processor today? Okay, we have quite a few hearty souls. But you also understand that it doesn't come for free. It takes work to program it today. What we are doing is we're trying to open up the vista of using the AltiVec to a broader scope of folks.
And so there are areas where you may not in fact want to spend the effort tuning it yourself. So what is auto-vectorization? It's simply the compiler being able to transform serial loops into vector loops. And what are vectors? For those who don't know, a vector is 128 bits. It can be operated on in a number of different element sizes for integers, floating point, bit operations, and so forth. All of those operations within that 128 bits actually occur in parallel, and therein lies the speedup.
Just a quick overview: the types of operations are arithmetic, logical, compare, rotate and shift. They're all done within the vector unit. And of course the data types we just talked about. So, in your DVD that you've received, there is a preview compiler, a preview of the 3.5 compiler.
And that is a first introduction for you to the auto-vectorization. It has limitations, and our goal is to really work on those limitations between now and the time it's released with Tiger. But today, what can it handle? It can handle loops with both known and unknown bounds – there's different code that we have to generate to discover the loop iteration counts at runtime if they're not known. Loops with even and odd vector lengths. Loops with conditionals in them, particularly simple conditionals.
And misaligned vectors on loads – so we're able to take unaligned vectors. What I mean by that is, once again, AltiVec operates on vectors aligned to 16-byte boundaries. If your arrays aren't aligned on 16-byte boundaries – and you can get that from malloc'd arrays and of course your own arrays that you allocate – we go through vector operations to align them. And I'll show you a little bit about the performance penalty that can occur when you do that.
Auto-vectorization has difficulties with pointers and aliasing. Well, I talked a little bit about that before in the scalar part of the presentation; that's true here as well. In this particular example, the pointers A and B are certainly not local within this function, so unless they're globals, there's no way the compiler can discover that they are not aliased. It has to make the assumption that they are, in today's world, and not vectorize this loop.
However, you can help the compiler out in a simple way. You can actually use the restrict keyword. The restrict tells the compiler, "Okay, this pointer does not alias with any other pointed-to object." And so that's a simple help: it turns a loop that can't be vectorized today into one that can.
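A minimal sketch of the restrict fix (C99 syntax; the names are illustrative):

    /* restrict promises the compiler that a and b never point at the same
       memory, so the loop below becomes a candidate for auto-vectorization. */
    void add_arrays(float * restrict a, const float * restrict b, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = a[i] + b[i];
    }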
The next thing that it has difficulty with that you need to watch out for is that scalar loops may have data dependencies that work perfectly fine when you're dealing in the scalar mode. But to try and transform that into a vector type of operation where you have a number of elements being computed at the same time, you can't have those dependencies. And so the first illustration of a loop here is one in which we simply couldn't vectorize it because you will have this data dependency. And the second one – the second one looks similar, but in fact there's no data dependency here because this is offset by n.
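Roughly, the two cases look like this (illustrative loops, not the slide's exact code):

    /* Not vectorizable: each element depends on the element just computed. */
    for (i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];

    /* Vectorizable: the reads are offset by n, so the elements computed
       together never depend on one another. */
    for (i = 0; i < n; i++)
        a[i] = a[i+n] + b[i];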
And then misaligned vector stores. We simply can't handle that in the preview. We'll have that available in the GM release, but if you're going to play with the 3.5 compiler, be aware that the vector that you're storing into needs to be correctly aligned in memory. So what is using the auto vectorization all about? Well, it's about performance.
And so I have some initial numbers here, and these are already out of date as we continue to tune the code. But for simple types of operations in loops, you can see speedups here that go all the way to 14 times, and we're now seeing even around 20 times in some of our work. If you have misaligned data, though, the kind of impact you can see is that performance is reduced significantly. Now this, as I said, I expect to improve.
We're in a very early stage with the auto vectorization. We have a limited set of loops that we're able to recognize and vectorize, and so I would encourage you to take a look at this. We are really open to you sending us kernels of code of things that we don't seem to be able to vectorize because we want to build up and mature that ability. And this will be something that we're working on. As you can see, though, the reason we're excited is because this can really offer some speedups.
And particularly if you haven't already been using the vector processor on your system, it's sitting there just wasting away, and you can get some real performance out of it. So the operation here includes enabling a couple of options. I believe that in Xcode today there's actually an option for auto-vectorization that will do that for you and enable the process.
So if you're looking for more information, you can contact Mark Tozer or Matthew Formica. And Mark, do you want to come up? So to add to that, there's the reference library, some documentation that's on Apple's developer website, and some tech notes that were posted since last year's developer conference with a lot of information.
We'll have George Warner, who participated in writing some of that technical documentation, up here in a few seconds for the Q&A. A couple of takeaways I want to make sure you go away with this morning. You know, it's been a year since we introduced the PowerPC G5 chip, as I said earlier.
So you should be looking at transitioning your code to the G5. You should be looking at optimizing the code and making sure that it performs at its best. Optimization is a skill; it is not something that comes for free, although the tools that you heard about today and the compilers – both Apple's as well as IBM's – will provide a lot of assistance. But there are times when you need to get in there, roll up your sleeves, and do the hard work.
For that, we have optimization workshops at Apple. For the past year, we've had over eight workshops – one a month, essentially – helping developers like yourselves work through the problems of optimizing your code. So I encourage you to participate in those workshops. They're posted through the Apple Developer Connection emails, and they'll be continuing on throughout the rest of the year. We'll have the next one starting in August – I believe the first or second week of August, if I remember correctly.
The other thing is that it does take a lot of work to do optimization work, but there are a lot of rewards to it, as Sanjay pointed out in some of the sample code. We're here to help you through those problems. So, as Steve mentioned, from IBM, there is the optimization lab here all week.
Please take advantage of those resources. We're here and committed to helping you guys write the best code for this platform. We feel that the G5 has a lot to offer; it has a lot of headroom to grow. The best applications on the platform are those that take advantage of all the capabilities that the hardware has to offer. So please, again, make sure that's something you keep in mind when you're looking at revving your application, writing a new application, or just taking the time to look at what you've done in the past and maybe improve upon it.