Hardware • 58:27
This session discusses the architectural features and design goals of the newly announced Power Macintosh G5. Starting with the G5 processor, senior engineers from IBM provide an overview of the powerful 64-bit processor and the benefits software developers will be able to exploit to write powerful applications. Apple's hardware engineering team describes the features of the high bandwidth system architecture.
Speaker: Peter Sandon
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.
Good morning. Welcome to session 502, Power Macintosh G5 System Architecture Overview. My name is Mark Tozer-Vilches. I am the desktop hardware evangelist for Apple Computer. So what do you guys think of the Power Macintosh G5? Awesome! Well, great. Steve gave a great, exciting presentation yesterday. Today we're going to follow that up with a little bit more in-depth technical information, both about the CPU as well as the system architecture. So without further ado, I'd like to introduce Mr. Peter Sandon, the IBM Senior PowerPC Processor Architect. Thank you.
Thanks, Mark. Good morning. As Mark said, I want to describe to you this morning the IBM PowerPC 970 microprocessor. Steve covered it pretty well yesterday, but he left me a few details to fit in. So I'm going to do that. What I'd like to do is provide some details that perhaps you'll find useful in your work with the G5. Other details I'm going to also put in that perhaps you may not use directly, but will take advantage of indirectly as you use and work with the G5 processor.
So last fall I gave a high-level overview of the 970 at the microprocessor forum, and I'm going to start with several slides from that presentation to give the high-level overview. That's the first two bullets. Secondly, I'll go into details on the several aspects mentioned here. So let me start with some key aspects of the 970.
First, this design was derived from the high-performance POWER4 microprocessor, which is used in IBM's high-end server systems. So the 970 is also a high-performance design. It runs at 2 gigahertz. It dispatches and issues multiple instructions at a time, and it also executes instructions out of order to a degree that you haven't seen in previous PowerPC processors.
The 970 is a full implementation of the 64-bit PowerPC architecture, but is compatible with, and in fact runs natively, 32-bit code. The G5 includes the vector enhancements called the Velocity Engine, and also includes a prefetch engine to reduce memory latency. And finally, the high-speed bus that Steve mentioned yesterday to off-chip memory and I/O runs at up to 1 GHz, corresponding to 8 GB/s of peak bandwidth.
So this is a block diagram of the 970 showing its major components. I'm going to kind of use this block diagram as a map as we go along and discuss the different components. All the text surrounding I won't go through here, but I'll cover as we go along.
So let me start with the instruction pipeline shown on the left side of the block diagram. The L1 instruction cache is 64 KB; from it, eight instructions per cycle can be fetched by the instruction fetch unit, and up to five instructions per cycle are fed into the instruction decode unit and then on to dispatch. So as a group, up to five instructions can be dispatched, and up to ten instructions per cycle can be issued to the execution units. And in all, over 200 instructions can be in flight at any one time.
The data pipe shown on the right side of the diagram starts with the 32 KB L1 data cache. The two load store units below that L1 D-cache move data between the cache and the three register files shown there, the FPR, GPR, and vector register file. The two L1 caches are backed, as shown at the top, by a half-megabyte L2 cache, which in turn is backed by main memory via the BIU.
And continuing with the same diagram. In the middle of the diagram are shown the memory management arrays that support virtual memory. This is a 64-bit implementation, so effective addresses are 64 bits wide. Real addresses are 42 bits wide for a 4 terabyte memory range. Finally, down at the bottom are the computational execution units, the dual fixed point, the dual floating point, and the dual vector units. Along with the two load store units that I just mentioned and the branch and condition register units, which aren't shown in the diagram, those comprise the 10 execution units of the processor.
So what I want to do now is repeat what I just said, but in a little more detail in certain areas, and starting with instruction processing. So this is a pipeline diagram showing how instructions move through the processor. Each block here represents a stage where an instruction spends a cycle.
Instructions move through the pipeline starting at the top where they're fetched from the cache, move down through decode, dispatch, then to issue and execution, and finally at the bottom come out and complete. The lower part of the diagram shows the individual pipelines of the individual execution units. The upper part just represents the movement of the instructions through fetch and decode. So I'm going to start at the top with instruction fetch.
[Transcript missing]
All of the caches in the 970 are organized as 128-byte cache lines, but the instruction cache lines are further subdivided into four 32-byte sectors. So it's a sector each cycle that gets fetched from the instruction cache, and therefore for maximizing performance, it's important to align your branch targets on these 32-byte boundaries to maximize the fetch bandwidth.
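For instance, here is a minimal C sketch of how you might act on that alignment advice. The alignment flags shown in the comment are standard GCC options, but whether your particular compiler build accepts them, and what alignment it picks by default, is an assumption here.

```c
/*
 * Sketch: a hot loop whose top is a frequently taken branch target.
 * Aligning such targets on 32-byte boundaries lets each 32-byte
 * I-cache sector fetch deliver a full group of useful instructions.
 *
 * One possible way to ask for this (assuming your GCC accepts these
 * flags) is to compile with:
 *   cc -O2 -falign-functions=32 -falign-loops=32 -falign-jumps=32 hot.c
 */
double sum_array(const double *a, long n)
{
    double s = 0.0;
    long i;
    for (i = 0; i < n; i++)   /* the loop top is the branch target */
        s += a[i];
    return s;
}
```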
Once the instructions are fetched, they're put into the 32-instruction fetch buffer shown at the bottom, and then up to five instructions per cycle are removed from the fetch buffer to send off to decode and dispatch. So the goal of this part of the hardware is to keep that pipeline busy below the fetch buffer. And what could prevent that, for example, is a miss in the ICache.
When the IFAR address is not found in the L1 ICache, a request goes to the L2 cache, the data is brought back if it's found in the L2 cache, and the fetch stream continues. But that stream is stopped for 12 cycles while that happens. So when that L1 cache miss occurs, not only does the fetch hardware go after the missed cache line, but it goes after the next sequential cache line as well. It brings it back into one of the four prefetch buffers shown at the top of the diagram, so that the next time an L1 cache miss occurs, if that address is found in the prefetch buffer, only three cycles of fetching are missed.
Similarly, on that table at the bottom are shown these latencies. When a branch is predicted taken, the branch prediction logic updates the IFAR register with the new address, and there's a two-cycle bubble in the fetch stream. Of course, the point of the fetch buffer is that as you're feeding it, it's starting to fill up so that when you get these two or three cycle bubbles in the fetch stream, you're still able to maintain the stream of instructions down into decode.
Branch processing occurs in two places in the 970. First, branches are predicted as they are fetched from the cache, and second, they are resolved when they get down to the branch execution unit. "Why do branch processing?" Steve asked yesterday. Because particularly in a deeply pipelined design like this, we're always fetching well ahead of executing. So if you had to wait until you execute, that is, until you know the conditions of whether a branch will be taken or not, you will miss opportunities to keep the pipeline full.
So what you want to do is predict branches early and predict them accurately to avoid those delays. So as instructions are fetched from the cache, they are scanned for branches, and up to two branches per cycle are predicted. There are two branch mechanisms, one to predict the direction a conditional branch will take, and that mechanism uses three branch history tables, which implement two different algorithms, a local and a global, for predicting the direction of the branch. The second mechanism is for predicting branches to registers. So there's a count cache that's used to predict branch-to-count branch targets, and a link stack to predict branch-to-link targets. Each of those data structures holds previously seen branch target addresses for later predictions.
So, predictions are made as instructions are fetched. The branch now works its way through decode and dispatch. It finally gets to the execution unit, and now it resolves. That is, now it knows whether the condition was true or false, whether the branch should have been taken or not.
If it predicted correctly, life goes on, life is good. If it predicted incorrectly, what the branch execution unit does is it updates the IFAR with the correct branch target address and it flushes, of course, all the instructions that were behind that branch because they now no longer belong to the correct stream.
The delay in that case to fill the pipe and get it going again is 12 cycles. So it's that 12 cycle branch penalty that one wants to avoid. This mechanism, this prediction mechanism over a wide range of applications tends to be accurate in the mid 90% range. So perhaps one out of 20 times a branch will be mispredicted and so very few times will you pay the penalty. The last bullet here simply points out that this dynamic branch prediction facility can be overridden by software using an extended branch conditional instruction in the 970 which allows the compiler or the programmer to statically predict that a branch should always be predicted taken or always not taken.
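As a hedged illustration of that last point, GCC's __builtin_expect extension lets you state which way a branch usually goes; whether the compiler turns that into the 970's static-prediction hint bits is up to the compiler, so treat the mapping as an assumption rather than a guarantee.

```c
/* Sketch: telling the compiler that the error path is rare, so it can
 * arrange the code (and, potentially, the static branch hints) so the
 * common path falls through and stays predicted correctly. */
#define unlikely(x) __builtin_expect(!!(x), 0)

int double_all(int *buf, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (unlikely(buf[i] < 0))   /* rare error case: hint "not taken" */
            return -1;
        buf[i] *= 2;
    }
    return 0;
}
```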
Instruction decode is a multi-stage process here. I'm just going to mention one aspect of instruction decode as it's different from most previous PowerPCs. And it is as follows. The PowerPC architecture is a RISC-type architecture, and therefore each instruction in general corresponds to one simple operation. However, there are exceptions to that.
For instance, a load with update instruction corresponds to two simple operations, a load of one register and update of a separate index register. What the 970 does is it cracks, as we say, that instruction into two internal ops, and those internal ops then flow through the pipeline. And furthermore, there are more complex instructions, like load multiple, that correspond to a sequence of several operations. Those are translated into a microcoded sequence, which then flows through the pipeline.
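To make that concrete, here is a small, hypothetical C fragment of the kind of code that typically compiles down to a load-with-update; the instruction names in the comments are the usual PowerPC mnemonics, but the actual instruction selection is of course the compiler's choice.

```c
/* Sketch: pointer-walking code like this is often compiled to a
 * load-with-update (lwzu/ldu), which the 970 cracks into two internal
 * ops: (1) the load of the value and (2) the update of the pointer
 * register. Both internal ops then flow through the pipeline and
 * occupy dispatch slots, which is worth remembering when you count how
 * many "instructions" fit in a dispatch group. */
long sum_walk(const long *p, long n)
{
    long s = 0;
    long i;
    for (i = 0; i < n; i++)
        s += *p++;          /* load the value, then advance the pointer */
    return s;
}
```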
And finally, in terms of fetch and decode, we get to dispatch. This corresponds to the transition from the fetch and decode stages to the execution stages. It also corresponds to the transition between in-order processing and out-of-order processing of instructions. So when instructions reach the dispatch stage, they can be dispatched as a group of up to five instructions if all of their hardware resources are available. Most instruction types can dispatch out of any of the first four dispatch slots there. The fifth dispatch slot is reserved for branch instructions.
So once dispatched, an instruction will take a place in one of these issue queues. All of the boxes there in the issue queues show how many entries can be in the issue queue. Once in the issue queue, an instruction can issue to be executed if all of its operands are available.
And so if one instruction is waiting on operands from a cache miss, for example, other instructions behind it can continue to be processed. And it's this massive opportunity for out-of-order execution of instructions that allows the G5 to keep processing even in the presence of pipeline and memory delays, which you normally run into in the normal course of processing. Finally, once instructions execute, they wait until all of the instructions in their dispatch group are finished and they complete together in order.
Just briefly on virtual memory and the memory management unit: one of its main features is the support of address translation for virtual memory. Now, virtual memory is something that makes the programmer's job easier, makes the programming model simpler, and makes OS implementations easier, but it actually involves some complexity in the hardware to support it.
So briefly, a segmented, paged virtual memory system like this one requires a two-step address translation process. First, an effective address, what you program in, is mapped to a virtual address using a segment table. And second, a virtual address is mapped to a real address, what the hardware understands, using a page table.
And what's needed to support this two-step process, and then the lookup in the cache, is some sort of hardware optimization to make it efficient. So what's implemented here is the usual TLB, translation lookaside buffer, which caches page table entries, but also a segment lookaside buffer, new to the 64-bit processor, which caches segment table entries and replaces the segment registers of the 32-bit processors.
And still, that two-stage translation could be costly, except that we've implemented another level of caching of address translation. It's called an ERAT, the effective-to-real address translation table. It caches the most recent effective-to-real translations, the result of that two-stage process, in a small, fast cache. So what the diagram shows, then, is that the effective address in the IFAR accesses the L1 cache, the L1 directory, and the ERAT all at the same time. And if all goes well, as it usually does, and those all hit, you get the instructions out on the next cycle. Similarly, there's a D-ERAT to go with data cache accesses.
For data processing, just a couple points to make. One is on the registers. What the programmer sees is a set of 32 general purpose registers, a set of 32 floating point registers, and a set of 32 vector registers. Those are the architected registers. What's implemented in the hardware to support those are more registers for two reasons. There's out-of-order execution and there's multiple execution units.
So to handle out-of-order execution, we need a place to put the results that we've executed out-of-order until they become the official result and go into the architected register. We call those rename registers, and since there is so much capability for out-of-order, there are more rename registers than architected registers. So the 970 has 32 GPRs architected plus 48 renames for a total of 80 registers, all 64 bits wide. The FPRs similarly, 32 architected, 48 renames. The vector registers similarly, 32 architected, 48 renames.
In addition, we've got multiple execution units and to keep up the supply of data operands to those units, we've duplicated those register files. So there's two exact copies of the 80 GPRs, two exact copies of the FPRs and so forth. So the 32 architected registers we've implemented as 160 registers for each of the register files.
The latencies at the bottom just show load to use delays. When you do a load of an operand and then you want to use it, you can issue the load and then you have to wait some number of cycles to issue the dependent operation. In the case of the fixed point unit, for example, it's three. Floating point, it's five, and the other values are shown there.
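One common way to live with those load-to-use latencies is to keep several independent operations in flight so the out-of-order core always has something ready to issue. A minimal sketch, with an accumulator count chosen for illustration rather than tuned for the 970:

```c
/* Sketch: four independent accumulators hide the floating-point
 * load-to-use latency, because while one chain waits on its loads the
 * two FPUs can work on the others. */
double dot(const double *a, const double *b, long n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)        /* clean up any leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```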
The second thing I want to say about the data side is that there is a data prefetch facility that in hardware initiates data stream prefetching. So the idea is that this prefetch hardware monitors the activity of the L1 data cache. When it sees two misses to two adjacent cache lines, it says, oh, there's a pattern. I'll go after, I'll prefetch the third cache line in the sequence.
If it then sees a hit to that third cache line, it'll go after the fourth line and prefetch it into the L1, and so forth. So it's demand-paced, which means it'll keep fetching ahead for as long as the data stream is accessed. Cache lines are brought into the L1, and further ahead they're brought into the L2, using this mechanism.
So in addition to this hardware-initiated prefetch, software can also initiate a data stream prefetch using an extended version of the dcbt (data cache block touch) instruction. The 970 supports this extension of dcbt, which allows it to touch not just one cache line and bring it in, but to start this prefetch mechanism and keep fetching ahead. And a third mechanism for prefetch is the implementation of the data stream touch instruction associated with the vector extensions.
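For a portable flavor of the software-initiated prefetch just described, GCC's __builtin_prefetch generally maps to a cache-touch instruction such as dcbt on PowerPC; whether it uses the 970's extended streaming form, or whether you reach for the AltiVec vec_dst data stream touch instead, is compiler- and code-dependent, so the sketch below is an assumption about how you might express it rather than the definitive method.

```c
/* Sketch: hint the hardware about data we will need soon. Each 970
 * cache line is 128 bytes, so one prefetch per 32 floats touches the
 * next line; the "4 lines ahead" distance is illustrative, not tuned. */
void scale(float *dst, const float *src, long n, float k)
{
    long i;
    for (i = 0; i < n; i++) {
        if ((i & 31) == 0)                      /* once per cache line */
            __builtin_prefetch(&src[i + 128]);  /* about 4 lines ahead */
        dst[i] = src[i] * k;
    }
}
```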
The computation units at the bottom of the block diagram, I just want to cover what gets executed where. There are two fixed point units that are nearly symmetrical. They both execute the usual arithmetic and logical and shift and rotate type instructions. They both also execute multiplies, so you can have two multiplies going at the same time. The difference is that the one unit executes the fixed point divides while the other unit executes the SPR move instructions.
The two floating point units are symmetric. They both execute IEEE single and double precision operations. They both support the IEEE formats for denorms, NaNs (not-a-numbers), infinities, and so forth. They both support precise exceptions. They also both support the optional floating point instructions for square root, select, reciprocal estimate, and reciprocal square root estimate. They do not support a non-IEEE mode.
[Transcript missing]
And finally at the top, the L2 and bus interface, which will segue us into the next segment. The memory subsystem has a few subcomponents itself. The cache interface unit shown at the top takes four types of requests from the core: one from the fetch unit for I-cache misses, one from each of the load store units for D-cache misses, and a fourth one for the TLB hardware table walker and the prefetch hardware. What the CIU does is simply direct those requests to the right place. For instance, an L1 I-cache miss will be directed to the L2 cache, where it will be looked up.
If the data is found, it will be returned. If the data is not found, the L2 cache controller will forward the request on to the BIU and on to memory. The non-cacheable unit on the left side simply handles all of the other activity not associated with the L2 cache, which goes off to the bus.
So this high bandwidth processor bus is what we call the elastic interface. It consists of two buses, two unidirectional buses, each four bytes wide, point to point. It's not a shared bus. Source synchronous. The clocks are sent with the data. And I put in this point about initialization alignment.
At power on reset, there's a procedure that the processor and system controller go through to de-skew all of the bits on a bus and then to center the clock within the eye of those data bits. And my reason for pointing this out is to say that there's a lot of work involved on both the processor and the system controller side to get a bus to run at one gigahertz.
The logical interface here supports pipelined, out-of-order transactions. The address and control information shares the same bus as the data. There are three types of command packets: read, write, and control. Each of those consists of two 4-byte beats on the bus that contain the 42-bit real address, transaction type, size, other control information, and a tag. Data packets come in sizes from two 4-byte beats up to 32 beats. Sending anywhere from 1 to 8 bytes on the bus requires a 2-beat packet, and the 32-beat packet is the cache line size, 128 bytes.
On the right, the diagram shows a little bit more detail about what I called a four-byte wide bus. The bus actually consists of three segments. One is the address/data segment, which is actually 35 bits: the 32 bits of data plus some control bits. Second, there's a single transfer-handshake signal, and third, two signals for snoop responses. And so the outgoing and incoming buses, with respect to the processor, are shown here.
Here are those three segments per direction. Again, just to show an example of a read transaction. The transaction is initiated by the processor by putting a read command packet up in the upper left corner out on the address data out bus. And I'll give the end of the story first. Out on the other side to the right is the data coming back from the memory controller.
What's happening in between, without giving a lot of detail, is that there's handshaking going on to acknowledge transfer of information and also to support memory coherency. So again, this is a point-to-point bus, so one processor can't see directly what the other processor is doing. In order to maintain memory coherency, the system controller has to get involved and reflect commands back to all the processors so they can snoop and stay coherent. And that's what you see, some of that handshaking.
This looks like not very good utilization of the bus. That's because I just isolated the read transaction. Normally, all of this activity would be interleaved with all the other activity on the bus. The other point the numbering shows is that the bus is managed with fixed delays between activity and the responses to that activity. That is how the handshaking is matched up with the original transaction, because things are happening out of order and the snoop responses and the handshakes are not tagged in any way.
Okay, so let me just go over one more time what I've said. This G5 processor is a high performance processor. It achieves its high performance by running at 2 gigahertz, also by its superscalar completion of five instructions per cycle, by its out-of-order execution of instructions. It's an implementation that supports both 64-bit and 32-bit applications and operating systems.
I've mentioned kind of the width of the pipeline that we can fetch eight, dispatch five, issue ten instructions every cycle. Also that the branch prediction scheme is highly accurate across a range of applications so that we avoid that branch penalty that I mentioned. We get high computational throughput.
By using two fixed point, two floating point, and two vector units, as well as two load store units to keep everything busy with data. And also this data prefetch engine, which keeps the latency to memory, the effective latency to memory, low by keeping things as close in to the processor as possible.
And finally, the high-speed bus, which I just mentioned, on the 2 gigahertz processor will run at 1 gigahertz for 8 gigabytes per second of bandwidth to off-chip memory and I/O. So that's all I have to say. I'd like to thank Mark and Jesse Stein from IBM for helping me prepare this presentation, and I'd like to thank you for your attention and your interest in the G5.
Thank you, Peter. And you thought he was going to only answer the branch processing questions Steve had, huh? So to point you to some more information, if you wanted to get some more documents specifically from the IBM PowerPC page, a couple of URLs here available for you. There are several documents posted there.
Later on in the presentation, I'll give you some more pointers to other references on the Apple site. So to continue our journey from where IBM handed off the PowerPC 970, the G5 processor, to Apple and what we did then with the system architecture, I'd like to introduce to you Keith Cox, principal engineer, systems architecture.
Thank you, Mark. So Peter told you a little bit about the G5 processor itself. I'm here to tell you more about the system we wrapped around it, and our vision of bringing that performance out and turning it into real-world performance for your users and your applications.
So this is the general block diagram of the Power Mac G5. The thing I want you to get from this is that we started over with this system. We did not take the Power Mac G4 architecture and say, okay, how do we tweak it? We got to get a little faster. What we said was, we're getting a really cool processor from IBM. It's going to really chew up instructions.
It's going to really need data. We really need to keep this sucker fed. So we started from the ground up. We opened up all the pipes. So what I want you to get from my presentation is that not only is this the next generation Power PC architecture, but in addition to that, we've added high bandwidth buses everywhere.
We've improved the memory system greatly. We've increased the PCI buses in the IO system. And on top of that, we've added an advanced thermal management system because we know the users like their systems to be quiet. They don't like them to be loud and roaring like jet airplanes or anything.
So, this is the general block diagram of the Power Mac G5. It's actually very similar just in blocks to a G4 block diagram, but there are some important differences to note. The first is that the processor bus is not shared in a multiprocessor system. That's a key difference when you get to MP and the kind of performance that we have and the kind of bandwidth that we need to be able to deliver to the user. Another important difference is that the system controller connection to the I/O system is no longer a PCI bus. It's actually a HyperTransport bus that has up to 3.2 gigabytes a second of bandwidth and connects to high bandwidth devices down below the system controller. That's all new.
So if we compare the G4 and the G5 processors, you've just heard from Peter about how the G5 can keep a million things in flight or at least 200 and some odd. It runs at 2 gigahertz and can complete five instructions at a time. It just has a huge appetite and it's a big leap over the G4. The system, similarly, we believe is a big leap over the G4. The front side bus has six times the bandwidth of a G4 system. If you've got a multiprocessor system, it actually has 12 times the bandwidth of a G4 system.
The memory system is more than two times faster and the PCI system is seven times the bandwidth. So we've really tried to open up the inside of the system. Let's dig down in a little more detail on all of that. The front side bus is 8 gigabytes per second. We quote it as double data rate 64-bit. As Peter was just showing you, that's not quite correct. It's actually a pretty complicated bus to describe. That's what we put in the marketing fluff to describe it.
I mean, we really want our users to understand the basic gist of it, which is it's effectively 64 bits wide of data, and it's 8 gigabytes a second of bandwidth. In reality, that's two 4 gigabyte per second channels, 4 gigabytes a second going up, 4 gigabytes a second coming down on each processor.
There's a little bit of overhead for the packet headers and that sort of stuff. So the real achievable bandwidth number is a little smaller than that. But it is close to the 8 gigabytes per second total on that interface. Then if you had two processors, we've got two interfaces. So that's a total of 16 gigabytes a second, four up, four down, times two processors to get the full bandwidth.
In order to deal with that, you really need a really high bandwidth system controller. This was a ground-up redesign at Apple that really intended to achieve these real levels of performance and be able to deliver these kinds of bandwidths. In addition to just moving 16 gigabytes a second of data, there's all the coherency protocol that Peter was just describing where one processor requests something, you've got to check the other processors. It may have it modified in the cache.
So Apple's always delivered cache-coherent systems. We do that here. The G5 implements something called cache intervention as well, which says that if processor one wants a line that processor two has modified in its cache, the system controller actually delivers the data coming out of processor two straight across and back up to processor one without having to go through the memory system. What this does is two things. One, it doesn't chew up your valuable memory bandwidth if you don't need to. The other thing is this:
It takes full advantage of the high bandwidth of the processor interfaces to deliver things fast to the other processor while not really interfering with the other processor. Yes, it takes a few beats of the bandwidth for processor two to deliver the data, but it had to do that anyway. It had it modified. It owned that data. And so it cost it nothing else, and yet we got the lower latency and higher throughput by doing that.
In addition, one of the points you're going to hear throughout my talk is that all these links are point-to-point. We're connecting endpoints directly to get the highest efficiency possible, the lowest latency possible, and really just make the data scream through the system without bottlenecking at any single point.
So you just heard how the G5 processors can talk directly to each other without interacting with any of the rest of the system. In reality, the AGP bus has its own direct port into memory. The I/O system through HyperTransport has a port into memory. Each processor has its own individual read and write queues into memory. If you look inside the system controller, if you could open it up, there are actually direct point-to-point links between all the interfaces as well. So we've really tried to avoid the bottlenecks of some system controller designs, where things really get choked up.
If we move on to the memory system, the first thing we did was we doubled the width. I mean, that's the obvious thing. You need more bandwidth, you go wider, you get more bandwidth. In addition, we pushed it up to 400 megatransfers per second or PC3200 DRAM or whatever label you want to apply.
This gives us a total bandwidth of 6.4 gigabytes a second. That's pretty much state-of-the-art. That's the best you can do with current memory technology without going really extremely wide, which starts to impact your cost in a very negative manner. Going 128 bits wide, you do have to put two DIMMs wide because each DIMM is 64 bits, so two wide to get 128 bits. So you have to install them in pairs. But one thing you'll see in the Power Mac G5 system that you don't see anywhere else is the depth.
Our memory system is two DIMMs wide by four DIMMs deep at 400 megatransfers per second. That, as far as I am aware, is not done anywhere else in the industry. It's actually a great challenge to get 400 megatransfers per second on four DIMMs that are all connected together to the same memory interface.
And that's one of the places where Apple put a lot of engineering: to get the memory speed, the memory width, and the memory depth so that we can have the large memory system, and the customers can get the eight gigabytes of memory on the eight DIMMs that we support.
If we move on to the AGP system, it's pretty much a standard AGP 8X, AGP 3.0, all buzzword compliant or spec compliant interface. AGP Pro is new for us, and it's a great idea. In our case, we support up to 70-watt AGP cards. The AGP Pro spec has different levels, and at those different levels, you can start growing your heat sink into the slot space of the PCI cards. So technically, at a 70-watt card, the card vendor is allowed to take up two of your PCI slots with just heat sink to cool that. So that's something to be aware of. I don't know that there's much more to say about that.
If we move on to the I/O system, coming out of the system controller is the last major bus, which is the HyperTransport bus coming down to the PCI-X bridge. HyperTransport describes that as a 16-bit bus, but it's really two 16-bit point-to-point interfaces, one in each direction, similar to the processor bus. So you've got 16 bits up, 16 bits down, running at 800 megatransfers per second in our implementation.
Connected to that, you've got a PCI-X bridge with two completely independent PCI-X buses. So the PCI-X spec says that if you have one slot, you can run it at 133 megahertz. If you have two slots, you can only run it at 100 megahertz. So that's what we did. We needed three slots. We had two buses. This is the bandwidth we get. It's seven times the 64-bit PCI bandwidth of what we've had in our previous systems.
So one thing you might be aware of is on the two-slot bus, if you plug in two cards and one of them's slow and one of them's fast, the bus has to run at the speed of the slowest card so it can handle the transactions and understand what's going on.
So as a configuration issue, maybe if you're designing cards and documenting how to install them, you should be aware that if you've got two cards that are fast and one that's slow, you might actually want to put the slow card in the single slot, as opposed to slowing down the other two.
Another thing to do with PCI-X: the PCI-X spec drops support for 5-volt PCI cards. That's really just a requirement to get the interface to run at the speeds that it runs at. So what happens is, there are 5-volt cards, but they're mostly very old cards. There are no new 5-volt cards being designed that I'm aware of, and haven't been for a couple of years. Most cards nowadays are 3.3-volt or universal cards, as they're called.
Those cards can exist on a 5-volt bus but only signal at 3.3-volt levels. And then, of course, standard 3.3-volt PCI cards also signal at 3.3-volt levels. Those two flavors, the 3.3-volt and the universal cards, are fully compatible with PCI-X. The bus controller figures out that it's got a PCI card instead of a PCI-X card, or that the card is only capable of running at 33 megahertz, say, and it slows down the clocks on the bus to support that card. Likewise, there are PCI-X cards that only run at 100 megahertz, so even if you plug them into the 133 slot, they won't run at 133 because they've reported the speed that they're capable of.
If we move on to the I/O system, it also hangs off HyperTransport: coming out the far side of the PCI-X bridge is another HyperTransport interface. This one's only 8 bits wide. It's not 16 bits because it doesn't need to be, is the basic answer. The 8-bit HyperTransport has 1.6 gigabytes a second of bandwidth for I/O. Historically, the I/O controllers had about 100 megabytes a second, so this is 16 times that.
So it was sufficient. We did move the gigabit Ethernet interface and the FireWire interface down into the I/O controller, which works just fine because it now has plenty of bandwidth to do that. If any of you remember the G4 block diagram, those two functions were in the north bridge, or the system controller, in the G4 system simply because they couldn't get enough bandwidth off the PCI bus to exist there.
In addition, we've gone to Serial ATA, which is a higher-speed interface. It's actually roughly equivalent to Ultra ATA/100, but the thing is, now you've got two of them, and the disks are completely independent, as opposed to an Ultra ATA master-slave setup where the drives really interact horribly when you're accessing stuff off one versus the other.
You have to wait for one before you get to the other. Here, the drive interfaces are completely independent, so the drives can be run simultaneously at full bandwidth without beating on each other. A note about the USB 2 controller. I've seen lots of comments and confusion out in the technical community as well as the user community about when somebody says USB 2.0, is it really 480 megabit per second or is it just USB 2.0? Which label did they have? High speed or full speed, one of those. They're playing games with names and saying they're USB 2.0 when they really still only run at 12 megabits a second. And just to be clear, this implementation is the full 480 megabit per second USB 2.0.
Also, we added the optical digital audio I/O. We have customers that really like that. Analog audio I/O in and out as usual. This machine supports Bluetooth and it also supports Airport Extreme. Since as you can see, this enclosure is basically a metal box, it's kind of hard to get an antenna out of that.
So there's actually ports on the rear with small antennas that stick out that are installable, that either come with the machine or with the Bluetooth or Airport option when you buy it. In addition, we put some new ports on the front of the machine. In addition to the headphone port, we've added a USB port and a FireWire 400 port.
That's really for connecting those digital hub type devices, you know, when you bring your iPod or your digital camera, something that you plug in and out all the time. It's really just for convenience. And I'm glad to hear that you guys like it, because there was quite a bit of debate about that. It's hard to do, believe me. It sounds simple, but the FCC gets involved and they like things not to interfere with radio stations and such.
Anyhow, now I'm going to talk a little bit about thermal management in the system. This is one of the places where we really put a lot of thought and a lot of effort and really wanted to do a good job. Thermal management, in some sense, is about cooling, but it's really about noise. It's really about you walk into an office, or you walk, much more important, you walk into that bedroom or office in your home where you've got your computer, and if it's roaring away, it's just a horribly noisy, annoying thing.
We implemented sleep a few years ago as one way to help solve that problem because when you put your machine to sleep, it goes virtually silent. For this machine, we wanted it to be virtually silent while running. Now, that's a challenge because you've got two of these G5 processors which have just huge amounts of processing power and it takes electrical power to do that, which generates heat.
You've also got PCI cards that in some people's systems can take huge amounts of power if they're doing video processing and that sort of stuff. So managing all this to a least common denominator type solution just would not work. The thing would roar like an airplane. And we knew that wasn't acceptable. So what we've done is we've broken the machine into separate discrete thermal zones.
You can kind of see them coming in the picture on the left there. Let's start at the bottom. That's the power supply actually hiding under there. It's pretty much hidden from the user. You can't see it below this edge, but there's actually a wall right here in the bottom of the box. The power supply takes in cool air from the front and exhausts hot air out the back. That means it's not preheated by the CPUs, nor does it preheat anything else.
The power supply manages itself and the fact that it's getting cool air means that it does not have to run its fans very fast to keep all its parts within specification, which has been a challenge to us in the past. If we go up to the top of the box, that's where the optical drive is, that's where the hard drives are.
That zone has its separate thermal chamber as well. Air comes in the front, goes through the box, and comes out the back. In this particular case, we have a temperature sensor mounted up in the corner of the box that monitors exhaust temperature constantly. If the machine moves into a hot room, we need to move a little more air to keep those drives cool. If all of a sudden you're hitting your hard drive hard, it's going to be putting off an awful lot of power, heating up the air, we see it get hotter, we turn up the cooling to keep that drive cool.
We maintain that zone within spec, but only to the amount you're using it and only to the amount required by your environment. So if you're in a cold room, your machine's quieter. If you're in a hot room, it has to move the air a little faster to keep the machine cool. But as I say, it's absolutely the minimum required to maintain the machine in its operating state.
If you go down into the next zone right here, you can actually see the kind of dip in the plastic chamber. This guides the airflow over the PCI cards. So rather than all the air running up over the top and out the back as fast as it can, it actually runs through the cards between the cards and keeps them cool individually.
Given the huge variety of placement options and power configuration options there are in PCI cards, there's no way we can predict, you know, that the card in slot two is going to be hot while the card in slot three is cool. And we can't put a temperature sensor anywhere to determine how to cool that zone.
So instead, we went to actually monitoring the power consumed by all your cards. So if you have a graphics card in there and that's it, and it's consuming very little power, the fan's going to run at its minimum speed, which is quiet. It's really quiet. You can't hear it.
If you have a high performance NVIDIA card or ATI card that actually is pretty high power, but you're not gaming right now, you're not using that power and it's not being consumed and the fan still runs low speed. If you're gaming, yeah, the card starts to get hot, but we just start turning up the fan and keeping it cool just to the absolute level required to cool the machine. We've got lots of airflow to work with. We don't have to work incredibly hard to cool most of these cards. Until you get to a full PCI configuration, that fan runs relatively slow.
The most complex zone in the system, of course, is the one that handles the G5s. If you notice, there's actually two fans in the front of the box and two fans in the rear of the box. There's actually these two right here and then two right back here at the back of the box.
Now, I've been watching the web and people are saying, you know, with nine fans, man, the thing's going to roar. Well, it's actually the exact opposite. As I've been explaining about the other fans, you know, we only cool to the minimum possible and since we don't preheat any air going from one device to another, everything's getting cold air so it just takes much less air to cool it.
The CPUs have this same philosophy, and the push-pull nature of those fans actually let us run them slower as well because the heat sink has a resistance to airflow. So as we push air through it, if we didn't have something pulling on the other side, then we'd have to push harder, i.e. we'd have to run the fan faster.
The fan pushing against that pressure actually is what makes it make noise, or a good portion of that noise is actually the back pressure the fan feels. So by putting the two fans in the push-pull configuration for a given amount of airflow, we're actually much quieter than we would be with a single fan.
In addition, they're paired top and bottom to match up with the CPUs. I mean, you can see the lines in the animation. This fan and this fan cool this CPU for the most part. I mean, there is some cross coupling and we call it one zone, but the two pairs of fans are controlled separately.
So say you have a multiprocessor machine and you have one thread that just eats the CPU and then the other CPU is sitting idle. We don't have to turn up the fans on both CPUs even. We just turn up the fans on the CPU that's getting hot. In addition, the fans are controlled by the actual temperature of the CPU.
So we're actually sensing the temperature that's important to keep within specification. So we're once again cooling only the amount required by the CPU. This brings up another trick that we've got in our back pocket, which is on PowerBooks for a few years now, you've seen the options to run them faster or slower or automatic switching.
Today, in the Power Mac G5, we've added that technology to the G5. What the 2 gigahertz machine actually does is when you need it, it runs 2 gigahertz. When you don't need it, it runs at 1.3 gigahertz or two-thirds of its full horsepower. Now, in reality, most of the time, for what you're doing, that's plenty. I mean, a 1.3 gigahertz G5 is a screamer.
But, you know, there are people running Photoshop and Final Cut Pro renders and all sorts of high-end applications that really chew on the processor. And that performance is fully there for them. And it's fully there for you whenever you need it to run your compiles or whatever. But the thing is, when we can drop that performance to 1.3 gigahertz, we can save roughly 60% of the power consumed by the processor itself. And when we put on all our dynamic scaling, we actually get down to about a sixth.
That's a sixth of the maximum processor power. So when your machine's sitting there idle in the Finder, it's consuming a sixth of the power that you have available to you any time you need it. It switches in milliseconds and speeds back up. And there's actually not even a processing latency hit to speed up and slow down. We continue execution as we go from 1.3 to 2 gigahertz and then back down from 2 to 1.3. You just slowly get faster and slowly get slower if you don't need it. So that allows us to save a whole lot of power.
And then we can run the fans incredibly slowly on the processor. In fact, when the machine is idling, we may end up with the fans spinning a little bit just because, but we don't need to turn them at all. That's the efficiency of the cooling system that we put into the G5 machine.
We do not actually have to spin the fans to cool these CPUs when they're sitting there idle in the Finder. So if you leave your machine and you get up to go to the bathroom, it's not doing anything. Or you're just sitting there staring at your mail. It's not doing anything. Well, some mails take time to read and process.
Actually clicking next on your little mail program doesn't take any real horsepower either. So a lot of the stuff that you do, you know, editing source code, for example, doesn't take a lot of the CPU. So when you're in that mode, we're down at a sixth the power and the fan's hardly spinning at all.
If at all. So we think that's really important and it's of real value to our users. And one of the messages I just want to get across is although there's nine fans in there, that's so we can spin them slow. If you only have one fan, something's probably going to be hot just because it's been heated by everything else as the air winds its way through the box and all the different heat sources.
And you've got to run it fast all the time. And by putting all the different fans in that control the different zones independently and only to as much as they need, we can manage to keep all the fans running slowly as much of the time as possible and keep the whole system quiet.
So I guess in summary, I'd just like to point out that the real goal of the G5 architecture was to take the G5 processor and wrap a system around it that could allow you guys to deliver the applications and the performance to your users that really screams and really makes them want to buy more computers, really. That's my only personal take.
But anyhow, so what we did was we just opened up all the pipes in the system. We've got the high bandwidth interfaces from the processor to the system controller, the system controller that connects everything together, and then a high bandwidth memory system, AGP interface, and I/O system to boot, to deliver everything to everybody that they need. Thank you very much.
In terms of reference tech notes that we've posted that went live yesterday, there are two important ones here, Tuning for G5, a practical guide, and the PowerPC G5 Performance Primer. Now, the presentation that you saw today regarding the G5 processor from Peter and the system architecture from Keith, I don't want you to leave here thinking, great, Apple delivered this super fast system, my application is going to run fast. And yes, it will.
But there's a whole lot more performance that you can achieve out of this architecture. And that was the goal. The PowerPC G5 has a lot more to offer than what you see here. We've provided a lot of resources at the developers conference, online, and following the developers conference in terms of developer kitchens that we will have to help you understand how to unlock that performance in your applications. There are several sessions that will cover how you do that. So I want to go and show you just a few of those. of those here on the roadmap.
Wednesday, there's a session entitled "CHUD Performance Optimization Tools in Depth." That's session 506. Highly recommended. If you are not profiling and analyzing the performance of your application, looking at where your function calls are spending the most time, you are leaving a lot of performance on the table. You need to be at this session to understand how to optimize your applications for the G5 processor. Session 507, Mac OS X High Performance Math Libraries.
Our high performance math group has worked extensively to tune these libraries specifically for the G5 processor. These are libraries that come in Mac OS X, available in Jaguar as well as in Panther, that will give you high performance access to arithmetic functions. Session 304, GCC in Depth, will talk about how, using the compiler, you can set flags appropriate for the G5 processor to again unlock that performance.
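As a small taste of what those tuned libraries look like from C, here is a sketch using vDSP's vector add. The header and framework names shown (Accelerate on Panther, vecLib on earlier releases) are assumptions about your target OS, so check the session materials for the exact setup on your system.

```c
/* Sketch: let the tuned vector library do the work instead of a
 * hand-written loop. Link with -framework Accelerate (or vecLib on
 * older systems). */
#include <Accelerate/Accelerate.h>

void add_signals(float *a, float *b, float *out, unsigned long n)
{
    /* out[i] = a[i] + b[i], all strides 1 */
    vDSP_vadd(a, 1, b, 1, out, 1, n);
}
```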
And then finally, throughout this whole week, and today until midnight, we are holding a G5 optimization lab on the first floor in the California room. There are 40 systems set up to enable you developers to bring your source code and work with our engineers, to understand exactly how to use the tools to profile your application for performance and what changes you need to make to unlock that performance.
Again, the main goal of this lab is not to sit down, take a test drive, and see how fast these dual processor systems work. It's really to sit down with an engineer and work on your code. Later on in the week, there is an ADC compatibility lab at the very end of the labs on the first floor. If you want to take a look at the insides and just kind of get a feel for the system itself, I'll have a system there for you.
But again, the lab itself is really there for you to work on source code and work with our engineers. We have engineers from IBM. We have engineers from Apple, several of Apple's engineering groups. So please take advantage of that. Again, the hours will be today, all day through midnight, and Wednesday, Thursday, Friday, 9 a.m. to 6 p.m.
Who to contact? If you have questions, information, follow-up on any of the information that you saw today, please contact me via email, tozer at apple.com, and hopefully you'll be hearing from me shortly after the developers conference on kitchens specifically designed to help you, again, optimize your applications for the G5. Thank you very much. We'll start our Q&A.