HPC Technology Update - WWDC 2004

Enterprise • 1:26:50

Learn how Apple's high performance computing technology is being used to build powerful computational clusters with Xserve G5 – and Xserve RAID. Hear how customers are designing and deploying large installations to solve a variety of complex computing problems. This session is a must for IT managers, project managers, and scientists who are building computational clusters.

Speakers: Douglas Brooks, John Thuotte, Dave Paulmark, Benjamin Singer, John Medeiros

Unlisted on Apple Developer site

Check out Bezel, our iPhone mirroring app →

Transcript

This transcript was generated using Whisper, it may have transcription errors.

Okay, good morning. My name's Doug Brooks. I'm product manager of server hardware at Apple. And I'd like to, uh... This session is entitled "HPC Technology Update." In this session, we'd like to take a look at Apple and HPC, and specifically Apple products and technologies that contribute to HPC deployments. We'll also look at industry-leading third-party products that complement those solutions, and also we'll hear from two customers to talk about their HPC deployments using Apple technology. First, let's take a look at Apple and HPC. Now, in the last year, when you think of Apple and HPC, usually the first thing that comes to your mind is this.

There we go. Technical failure. Virginia Tech. Virginia Tech really was an early leader in that they saw the vision, the power of the G5 and Mac OS X server, and combining over a thousand Power Mac G5s and Mac OS X server software using technology for interconnect and FinnaBand, achieve phenomenal performance, achieving over 10 teraflops of computing power on its debut it ranked number three on the top 500 list and is the number one academic supercomputer in the world. An amazing achievement. And they also really prove the value and the price point that you could build and deliver a very high performance supercomputer system with Apple technology. Now what's interesting is while they may have been the first and definitely the largest G5 cluster deployment, they were definitely not the first Apple cluster deployment. Matter of fact, a lot of the early cluster work done on the Macintosh platform was actually done a number of years earlier, most notably at UCLA with the Appleseeds work. You may remember this from the late 90s. Actually, I believe this is circa 1998. UCLA with the Appleseeds project took, at the time, pretty fast technology, Bayes G3, 233 megahertz systems, 10100 Ethernet as an interconnect. It was running early on Mac OS 8 and was using Apple events as the middleware. Nevertheless, doing high-energy physics work that they were doing, it achieved phenomenal performance at very low cost. And actually, the system you see here on the screen outperformed Cray YMP running similar codes. So again, showing the value, some of the same things Virginia Tech has proved, with the G5. They actually shown some of those same capabilities much earlier. Of course, we've come a long way since then, from the G3, Bayes' G3 days, of course, to the G4, introducing desktop supercomputing with the Velocity Engine, bringing phenomenal vector processing capabilities that many applications have been able to take a tremendous benefit from by leveraging that power and that technology. And of course, most recently, the G5 bringing phenomenal floating point performance, vector processing with a system that delivers very high memory bandwidth and system throughput through this processor, providing a phenomenal foundation for computing.

And of course, the customers have responded. Of course, higher education customers tend to be on the leading edge. They've been some of our most strongest early adopters in the higher education market. And of course, if you heard the last session, you've heard about some of the deployments in the scientific field. Matter of fact, life sciences in particular has been adopters of our technology, primarily because many of the key applications that are run day in and day out in the life sciences arena have been velocity engine optimized and run very high performance on the G4 and the G5 processors. And so we're seeing lots of deployments taking advantage of our tools and our technology, combined with the ease of use that we provide in our systems in the sciences field.

And of course, we've had a phenomenal community of vendors bringing technologies, products. If you're developers, your applications coming to Mac OS X has really grown this community of people and applications that they're able to run on this platform. And so we're seeing more and more solutions come available into the HPC community on Mac OS X and growing more over time.

We've also seen a community of expertise build up in the HPC community, in the cluster community around Mac OS X and around the G5. Matter of fact, a notable one is the BioTeam, a consulting company in the focuses on the life sciences arena, really delivering great solutions for life sciences based in many part by Apple Technologies. And of course, Apple's responded. Apple's responded with products really focused and targeted at this market. Most specifically, the XServe cluster node, a machine tailored and streamlined specifically for clusters and high-performance computing configurations, streamlined for dual processing computer performance without a lot of the extras you don't need when you rack a whole rack full of extras in the system. And we've seen that gone even further. As a matter of fact, most recently, the Apple Workgroup cluster taken a complete solutions approach to this product line, been able to offer not just hardware, but hardware, software, cables, power, rack, interconnect, everything you need to perform a complete solution around bioinformatics. And it's been very, very popular. The response has been phenomenal. We're very proud that it won Best of Show at BioIT World earlier this year. And again, for customers doing bioinformatics, it is the easiest cluster to set up, really bringing Apple performance and ease of use to the cluster space.

And finally, we've had a phenomenal customer response. This is just a small sampling of customers who have recently deployed clusters based on Apple technology, XServe, and Mac OS X Server. Really, we see this market continuing to grow and more very exciting customer deployments. And again, you're gonna hear about two of them later on in this session.

As a matter of fact, we're seeing right now roughly 40% of our XSERV units going into clusters and high-performance computing. And as we see as XSERV continues to grow, we see this continuing to grow as well as an important slice of the XSERV pie, you might say. So... What does it take to actually put a cluster together with Apple technology?

Obviously, it's a lot more than just racking a bunch of XSERVs in a rack. So what we want to do is take a look at the HPC technology stack. What are the components on Mac OS X that it takes to build an HPC deployment, and what technologies and products are available from Apple and third parties to complement this stack?

So this is a view of, again, what we'll say HPC building blocks, the components that are required to build a HPC deployment in technology space. So if we look from the bottom up, from the hardware, the actual hardware platform itself, the operating system, the interconnects, the compiler and optimization tools, the communications middleware, and finally the management tools, to really do a complete cluster deployment, we need components from all of these. Now, if we take a look at what Apple products and technologies provide, you'll see Apple provides products that fit really in about four of the six areas. And if we look at third-party technologies, really industry-leading third-party technologies, we have, again, about four or six of those components that we have to choose from. And as we walk through, you'll be able to see we have a wide selection of products and technologies for deploying clusters on Mac OS X. So let's take a quick review, walking the stack from the bottom up, and let's take a quick review from an Apple perspective at the hardware components. So this is pretty straightforward. First and foremost, we have the G5 processor. The G5 processor really stands out as a processor for high-performance computing. With the 64-bit capabilities, with the massive floating point and velocity engine support in this processor, coupled with the power advantages that it has with the smaller 90 nanometer technology, really delivers a phenomenal bang for your buck and for your power and heat output as well. So obviously that becomes one of the core foundation pieces of the product. As we look up, of course, then we wrap that up in the XSERV G5. Been able to provide dual processor performance and a 1U form factor, a system that delivers peak performance of 16 gigaflops of double precision floating point or 32 gigaflops of single precision floating point with a velocity engine. Again, a very, very powerful system to deliver phenomenal performance. Couple that with the latest of IO technologies, PCIX, ECC memory, integrated hardware monitoring for systems management and monitoring.

And again, very low power when we look at it compared to competitors. You know, roughly a compute node configuration of XSERV at 100% CPU is under 250 watts of total power usage and well under 1,000 BTUs an hour in heat output for the data center that needs to cool these systems. That significantly lower the competitive systems and a great advantage that we have with the G5 processor. We also have XSERV RAID, And kind of the corollary to a lot of computing power is the data that needs to be fed into those systems, and that data needs to be stored somewhere. We see XServeRaid as the ideal storage device for storing data for high-performance computing clusters. With phenomenal performance and phenomenal capacity, you have the ability to, through Fiber Channel, quit quite a large amount of storage online and serve that throughout your cluster. And again, a breakthrough price performance. with XServe RAID. Just, again, ideal storage device for high-performance computing clusters. years.

If we then take a look up the next step, obviously, from an operating system perspective, we have Mac OS X. Mac OS X has really provided a key foundation for cluster applications. What's really unique about Mac OS X is that we have the ability with Mac OS X to combine the power of Unix under the hood, which allows us to bring applications and technologies over that allow us to compile and run those applications on Mac OS X, but combine that with the great ease of use, ease of deployment, ease of management that the Mac OS X server services provide. And, of course, with the G5 optimizations, we've been able to deliver the performance out of the G5 processor. And, of course, if you've been to any of the sessions yesterday, you know that we've introduced Tiger. And with Tiger Server, one of the most important things in the HPC space that we're able to bring with Tiger is true 64-bit user space environment.

been able to break that 4-gigabyte barrier in user space. I mean, we've always been able to access more than 4 gigabytes of memory with the G5 system, even on Panther. But now we have the ability to have applications access large data sets. So especially code coming over from other platforms, from other operating systems, we'll be able to take advantage of that large memory footprint. Of course, we expect to see a number of across-the-board improvements in other areas of Tiger, improved SMP performance, improved network performance, improved NFS performance. things that we think are going to really deliver a phenomenal platform for the future of high-performance computing with Mac OS X.

It's also interesting, I encourage you to attend if you're interested in some of the XSAN sessions later this week. XSAN is Apple's SAN file system for Mac OS X, and XSAN has a role to play in clusters as well, especially on larger clusters where file I.O. bandwidth is a concern. XSAN gives you a file system that has the ability to scale out in file I.O. services. So XAN, again, plays a role in high-performance clusters as well, being able to share a large data pool across a large clustered environment.

Then as we move up the stack, we look at cluster interconnects, and this is an area that we really work with industry-leading third-party vendors to provide phenomenal solutions on Mac OS X. You know, today with interconnects, there are really a couple leading choices. First and foremost, we have gigabit Ethernet, kind of the common denominator for cluster interconnects. And, of course, from an Apple perspective, we provide on XRF G5 two very high-performance gigabit Ethernet ports. So out of the box, you know, you're ready to connect those systems together. But for clusters that really need higher bandwidth at lower latencies, there are two really leading choices available for Mac OS X. The first is mirror net technology, and the second is InfiniBand. And I wanted to touch on both of those. First, I wanted to talk a little bit about mirror net. Mirror net's a technology that's been in the HPC space for many, many years. It's really established itself in this space, And matter of fact, if you look at the top 500 list, you'll see quite a number of clusters that are built on MirrorNet technology. We're really proud that Mac OS X has very mature drivers to really deliver excellent performance with MirrorNet. And matter of fact, MirrorNet's actually one of the cards when we first did the original XServe, which introduced 66 megahertz PCI slots, MirrorNet were the cards that we actually used to tune the bus to verify we're getting maximum performance out of the system for PCI transfers. As a matter of fact, in working with their engineers, they told us we were seeing much higher throughput than even the latest PC chipsets at the time. And so MirrorNet is a really excellent choice. Here's some performance... Oh, sorry, excuse me. Here are some views of their latest PCI-X card and switch components, the key components that are used in the MirrorNet deployment. It is a PCI card that goes in your system. And here's some performance numbers that were provided by Miracom on what Miranet's capable of achieving. And again, you can see here, the latencies are much, much lower than gigabit Ethernet. So for applications where that's really critical, Miranet is an option you can choose from here.

And the way they achieve this is actually the same way many of these interconnects work, is that they are able to bypass the largest section of code that provides a lot of latency in this system, and that's specifically the IP stack. As you can see from the diagram here, applications that call through MPI stacks through MirrorNet bypass the IP stack and go directly into the hardware and are able to achieve that much lower latency performance than what gigabit Ethernet can provide. So again, for applications that require that, that becomes really key. Marinette just actually had a number of recent announcements at the last supercomputing conference earlier this month, and most significantly, what they've debuted are much larger switch form factors enabling you to scale Marinette to much higher cluster node numbers at much lower price points. And so Marinette is definitely a very compelling solution on Mac OS X for people looking for lower latencies at very attractive price points.

What I'd like to do now is introduce John Thouet from Voltaire, who's gonna talk about InfiniBand on Mac OS X. John? Thank you, Doug. All right. So to talk about InfiniBand is certainly to talk about clusters, and Doug just gave a good introduction there. But what we're seeing in the HPC market is a lot of transition of supercomputers, mainframes, SMP machines to a bunch of interconnected servers. Why? The reason, cost is the overriding reason. We'll see if we can put together one of these clusters with a bunch of servers for realistically one-tenth the cost of some of the existing SMP machines. Historically, though, that's been fraught with complexities, underutilization of processors, storage bottlenecks, just the complexity of hooking these up.

So what does it take to build an effective, efficient cluster? You have physical distribution and logical consolidation. On the physical side, you need a high bandwidth. InfiniBand offers 10 gigabits and 30 gigabits per second today. Low latency for interprocessor communications. Low CPU overhead. You don't want the CPU spending all their processing time communicating with the other processors. And the ability to scale. You also need logical consolidation, the ability to logically group nodes and systems into logical sets or groups domains. So you need a high performance interconnect and an intelligent interconnect. What InfiniBand brings to the table is it's an open standard. So it is the first open standard interconnect designed from the ground up for high performance interconnect and RDMA support. So the extraneous features and functions that you might get in other technologies are not there. This was designed for the ground up with high-end clustering in mind. Because of that, it has the significantly, we think, lower cost performance ratio than other options do available for clustering. The latency is about 140 nanoseconds per hop and 5.8 microseconds end-to-end latency.

Key feature of InfiniBand is it supports multiple types of traffic over the same fabric. So whether it's file, block, network, IPC, it's all using a single technology efficiently. Right from the beginning, we built in extensive management monitoring capabilities, so high availability, quality of service, partitioning type capabilities are built in from day one. Again, it already supports 30 gigabits per second today. There's something called DDR double data rate and QDR, quad data rate, for supports up to 120 gigabits per second being worked on today. So three real key points in FinnaBand, high bandwidth, low latency, low CPU utilization, and the ability to scale.

This is a high-level representation of an SMP machine on the left, showing an 8-processor with a proprietary interconnect versus four two-way servers interconnected with InfiniBand. I want to make it clear the SMP, the symmetric multiprocessing systems, will still be a perfect solution for a lot of applications. To use the cluster, you have to be able to parallelize the application. and we're looking at near memory speeds, but the cost performance compared to the SMP is drastic. Thank you.

On the InfiniBand link protocol, just really trying to make one or two points on this overhead. One is with a single event, you can move a large amount of data, two gigabytes of data. And the entire link protocol is handled in hardware. So all reassembly and segmentation and all that is all handled in hardware. And we are in the third generation of ASIC technology for InfiniBand.

Some more InfiniBand link attributes. Each packet is sent with a service level, so there's up to 16 service level supported SLs. There's also something they call a VL, virtual lane. So there's 15 virtual lanes possible over a single physical link. So the SL is mapped to a VL, which is then arbitrated across the physical link. Basically, that is the basis for your quality of service type implementation. which lets you mix and match efficiently different types of traffic across the same InfiniBand connection.

InfiniBand also defines what we call partitions. So this would be similar to, I think, zoning and fiber channel or VLANs in the IP world. So it's a mechanism for defining isolated domains so each port or node can be defined to a certain partition and communicate with only nodes in that partition or given full or limited rights within that group. And that's all defined by the subnet managers, the SM manages by assigning partition keys.

InfiniBand is based on a 2.5 gigahertz signaling rate. So when you hear the rates for InfiniBand, 1x is 2.5. There were really no implementations done at that data rate. 4x is really where all the implementations are, most of the implementations today. 12x is the 30 gigabits per second. So today, Voltaire, we can support the 10 and 30 gigabit per second rates. Over copper cabling, it's a 17 meter distance limitation, one kilometer with multi-mode fiber. And then, you see there's efforts to, in the works, the 5 and 10 gigahertz signaling radar and process.

InfiniBand has a very rich protocol stack to find. On the upper layer, you'll see a bunch of stuff that looks familiar. You know, NFS, RDMA in version 4. NFS will have RDMA, InfiniBand support. MPI is, I think Doug mentioned, message passing interface. Far and away the most popular. MPI is the most popular IPC API in the HPC world. I think that was four three-letter acronyms in a short sentence. So let me see if I can say that again. So message passing interface is the most popular application program and interface for interprocessor communications in the HPC market, and that's supported in the Apple world today. iSCSI is for storage support across the fabric.

SDP is Socket's direct protocol, and any application with a Sockets-level API can utilize that. Of course, TCP and IP over IB, and DAPL is Direct Access Programmer's Library. A lot of acronyms here. And DAPL defines the API to RDMA. And then there's a full suite of InfiniBand services below that for management and monitoring, and in my 10-minute time slot, but we won't be going through those right now. And HCA is terminology used for the host channel adapter. That's sort of an infiniband term just for a network interface or a host bus adapter. We call them HCAs.

So what does InfiniBand, the value it brings to the HPC market? It's the first industry standard to enable server clustering. Doug mentioned Marinette, and there's Quadrix, two other ones out there that are proprietary interconnects that are available. The clustering is the fastest-growing segment in the HPC market, so that's why, as a company, Voltaire and Apple, working together, are very interested in that space. excellent performance advantages over other options. We're currently seeing lots of interest at the universities and labs. The DOE labs specifically are very aggressive in pushing forward the standard and purchasing product and implementing large clusters.

And pricing has dropped quite a bit. We're certainly not at what I would call economies of scale yet, but we've still seen about a 50% price reduction in the last 12 months or so, and we think we'll see more reductions as-- as volume increases. And the Virginia Tech system, Doug mentioned, 1,105 five-note clusters, number three in last November's top 500 list.

Really key, the 5.2 million, which is a lot of money anywhere, but for this type of system, it is literally one-tenth the cost or greater than that of the other systems in the top 10. Want to thank Doug and Apple for having us here and working with them, and that's all I've got there. Thank you. Thanks, John. We're really excited about Voltaire's InfiniBand offerings, because for customers who are looking for a very versatile interconnect with great latency and bandwidth properties, InfiniBand is very attractive and gaining tremendous momentum in the HPC space.

I'd like to move on, go back to walking up our HPC building block stack here, and take a look at compiler and optimization tools. You know, the interesting thing about HPC is that it's really a segment of end users and developers. I've never met an HPC to deployment that's not taking advantage of their own tools or compiling their own programs. And so the compiler and the optimization tools that are needed to really eke out the most performance out of their code, it becomes a very important piece of the technology. From Apple's perspective, of course, we have Xcode. And Xcode is just a phenomenal development environment to be able to leverage the productivity features, be able to deliver great user interface, great tools to build, develop, debug applications. I mean, the fact that I can write a program on my PowerBook and then send it up to my cluster for execution optimized for the G5 is incredibly powerful.

Of course, as we improve Xcode, for example, the betas that you've received, the pre-release versions you've received this week, actually begin to introduce some of the 64-bit capabilities for large memory space. So you can already begin working with those tools. A very important part of Xcode are actually the CHUD tools. If you're not familiar with them, CHUD stands for Computer Hardware Understanding Development, and these are tools originally written internally within Apple to help us optimize and understand implications of code executing on our systems. These tools turned out to be extremely powerful and actually have been made available as part of our developer tool set and now are a standard part of Xcode installation. If you have Xcode installed, you'll find them right in your developer folder. These tools are incredibly important in this space to be able to really understand the performance bottlenecks and implications of your code running on our systems.

I've seen numerous of examples, for example, people who are convinced their code is processor-bound on a G5, and with some simple profiling with Shark, for example, one of the key tools in the Chud set, really find out maybe it's more memory-bound and there's some tuning that can be made to improve throughput.

So these are very, very important tools to our tool set. And if you have an opportunity, really encourage you to go to some of the sessions this week on the Chud tools for better understanding of their capabilities. studies. Another important piece of this space is Fortran. Fortran continues to be one of the top scientific programming languages, and it's an area where Apple works with third parties to really develop and provide great solutions on top of Mac OS X. So it gives me a lot of pleasure to introduce Dave Paulmark from IBM, who's going to talk about Excel Fortran. Dave. Dave.

Thanks, Doug. I'm really happy to be here today. It's great to be able to stand in front of a bunch of Apple developers as an IBMer and talk about our technology. It's not just the processor this time. We're going to talk about some software today and some hardware, too. Let's see. There we go. So what have we brought to the Apple processor and ACCO S10? We've got a compiler that's got a long history behind it. This is the IBM Fortran compiler that's been behind our systems since the very early 90s and even going beyond that. We use this technology inside of our C compilers as well, and we have Excel compilers for both C++ and Fortran on this platform. And it's been used by some very important IBM customers, mainly on AIX, but we're starting to see some movement to Linux and Mac now. people like LL&L, NERSC, NCAR, and a European weather forecasting group. And we deal with these people every day. We understand their problems. We understand the kinds of applications they have to develop, and we built a compiler for them.

Now, when you pick up XL Fortran, you're not just getting performance. You're getting language standards and conformance. So this helps a lot with porting. If you have something that runs somewhere else that's conformant, we're going to be able to handle that. So we're fully Fortran 77, fully Fortran 90, fully Fortran 95, and we've started on Fortran 2003, which we expect to be ratified hopefully the end of this year. But people have asked us for some things early in that, as the standard congealed and got a little bit more stable. We went ahead and did things like IEEE module, allocatable components, stream IO, things that people were asking us for. And we have people on those standards committees so that we know what's coming, and we have a voice in there. We also handle things like OpenMP, which we also have folks on those standards committees. Okay.

We're not just standards conformant, we also have extensions. Surprised. It's Fortran. So we do open MP 2.0, fully compliant to that. On Mac OS X, it's a technology preview as yet. It's a preview of some of the technology that we've deployed on AIX and Linux PowerPC that's been out there for quite a while now. But we do other things from other companies. We've got Cray pointers, 128-bit floating point, 64-bit ints, structure record, union map, and so on and so on. There's way too many options to talk about to try and describe them all in this group. But suffice to say, things like structure record, union map, we've had customers come to us with things like, you know, we would like to buy IBM hardware, but you don't have this. Well, they got it, and now they have IBM hardware. We do that sort of thing all the time. these kinds of requests come in. You know, we want to hear what you need, and we'll talk to you about that. We've also got some very important extensions for the PowerPC in particular. The PowerPC hardware intrinsics functions and directives get you access at a source level to the hardware instructions. So you can code something as a directive or as a function call, and what you're going to get there is a particular instruction that you need point. Something like data prefetch, for example, is very powerful. We also give you an XLF utility module that you can use to get access to some common system services. You don't have to go off and code that yourself.

Now, we're in Xcode, and that's real exciting. But for folks that still like their Unix command line, we're there, too. We make files still work. GDB works with us. We work well with GDB. Now, as you go up the op levels, obviously things start to go down a bit, but we've given sort of the support goes down a bit, but we do have some directives and so on that you can put in your source at certain points to get you the information you need to debug, you know, that stack, that traceback, or whatever it is you're having trouble with at the time. something that isn't here that shouldn't be Shark I love that tool I wish we had that in AIX we work well with Shark and that's the message, use it you can find some amazing things it's one of the most popular things in Toronto for digging down into these problems that we get from our customers when we're analyzing performance problems but it's not always just debuggers and so on we also give you some options to use for finding problems in your code so you can automatically insert checks to find, you know, oh, I went off the bounds of that array. It'll trap and tell you that, stop you from going off and corrupting memory. You know, automatic initialization of variables where you need that to happen, and a rich set of listing information that you can dig through to understand what's going on with your program. Thank you.

The runtime environment-- we have our own Fortran runtime that we ship with the compiler. And the message there is that is something that if you build an application with XL Fortran, you can take that runtime and give it to your customers as well so that they can run that Fortran code on their systems.

We give you a lot of tuning levers and buttons and dials through environment variables. And you can control things such as the characteristics of the I/O that's going on when you're doing that, error reporting, what kinds of messages you want. Do you want to know when you're doing something that isn't Fortran 90 conformant, for example. You can do that sort of thing. But in this space, certainly, thread scheduling models, number of threads, thread profiling environment variables, these are all important things. And of course, all the things that OpenMP defines, we've got those.

Now, binary compatibility is a very important thing in this. You can take our objects, work with other objects from GCC, G++, and, of course, IBM Excel C and C++, take that whole bundle, put it together, and there you go. You've got your applications mixed as many languages as you like. And we've added some things like QFloat, Complex, GCC, minus QX name, just option names. But the message is, where we need them, We've added some things to help out with that binary compatibility.

Now, we exist because of optimization. If we weren't a good optimizing compiler, we wouldn't be there in Toronto doing this every day for the last 14, 15 years. So the optimization components that are in XL4Tran are in all of IBM's core compilers and all our important systems. C, C++, COBOL, PL1, on AIX, Linux, the mainframes, Mac. and now P Series, I Series, and of course G4, G5. The message is we've taken all that that we've built up over those number of years and all those different platforms and brought it down to the Apple platform, and we're seeing some really important success with that.

The Excel compilers are used by IBM on AIX to announce spec performance numbers. So again, the message is we know how to tune for those chips. IBM does the chips, we work with the chip designers, we know what's coming, we know how to tune for those things. And we build our own software with it. AIX, DB2, Lotus, Domino, they're all built with IBM compilers, as you might expect.

Optimization options. We go to five at the base level. So zero all the way to 05. And you can go from basically almost no optimization up to wow, what did this thing do to my code? I can't recognize it anymore. And we've got a whole set of, again, switches, dials, knobs, and levers that you can play with in order to tune the optimization to what you need to have happen on your application. Things like minus QHOT enables the high-order transformation loop optimizer. It was built to understand Fortran 90 array language and syntax. It can take those loops and do some amazing things with them. It'll also work with C code, with C loops as well, when you use it in our C compiler. The QArch option tells you on the Mac OS machine, do you want to target a generic PowerPC, in other words, G4, or do you want to go to G5, which I'm sure most of this group is interested in. And that enables, inside of our optimizer, all the modeling and tuning capabilities that we bring to bear from, that we brought up and specifically done, well, person years of effort tuning to the G5. gives you access to all, well, it doesn't give you, it gives the optimizer.

Using QRHG5 allows the optimizer to precisely model your code as it's going to see it, because it understands the chip, understands how many units are going, and how to keep that processor busy. That's what it's trying to do with the scheduling model. And using QRHG5 also gives you access to those rich set of PowerPC intrinsics, again, that you can use for things like cache control, certain arithmetic operations that you might need. And floating point control, you want to toggle things in the status and control register, for example. And the nice thing about that, if you're interested in moving your code from one IBM kind of system, one IBM chip to another, those same intrinsics work on compatible chips if you're going to AIX or Linux. Same story in the other direction. You have some code up there on AIX you want to bring down to G5, those intrinsics are going to work too. So, IPA is the keystone of our optimization technology and really differentiates us in what we can do with your application. When you've got IPA involved in your compiler and in your compile, and it runs automatically at 04 and 05, what it does for you is when you compile your code, is it inserts information into your object.

which is essentially invisible to the linker. So if you just take those objects and feed them into LD, out comes your ADOT out and you're happy. But if you then use IPA when you link your application, it then extracts that information that's hidden away in the.Os and re-optimizes your code again, this time not on a file-by-file basis, but it's got the entire application there. It's got all the.Os that make up the whole thing. And it understands that called that, and it was called with this, and it was called with that. And so we don't have to worry about this parameter. We'll just stick a 7 in there, that kind of thing. So what it can do is it repartitions your application into more logical units that keep memory together and does massive amounts of inlining where it makes sense. And it can even go across languages.

So if you build your application mixed mode with C, C++, and Fortran, if you build all that with IBM Excel compilers, run the IPA link step, it will do things like take your C code and inline it into your Fortran application. And that's an amazing technology that we've been able to bring down to the G5. And of course, after it does all that, then we go back down into the low level optimizer again, which is the one that really understands the chip and tunes for that.

PDF, profile-directed feedback, is another important technology, especially useful for codes where you may have it instrumented with debug or perhaps some tuning information that you want to use to gather statistics. What PDF will do for you is you build your application once with -qpdf1, run your application with typical sample data, That will write out a statistics file, compile your application again with PDF2, and it will read that statistics file. And that will tell the optimizer, oh, look, 99% of the time you take the branch this way, not this way.

And so we can take your most frequently executed code and put that in line. And the stuff that almost never executes goes off to the side, and you get a much better performance out of that. And of course, again, the message the Excel compilers share the technology, so you can--if you build stuff with our C compilers, use PDF, you can mix that in with the Fortran compiler.

OpenMP and SMP are very important to this space, and we've got a lot of experience with these. Again, technology preview on Mac OS X right now, but again, we're bringing that down from some platforms where we've had a lot of time to work on that. We fully implement the 2.0 standard, and the important thing about OpenMP for us is our optimizer fully understands what OpenMP is and what SMP is, and so we can take things like a minus QSMP auto option and put it in our compiler where it can take a look at your application and automatically parallelize things where it makes sense to do so. So you've got a couple of choices in the way you want to do things. If you want to code to the OpenMP standard, that's great. We'll handle that. but will also automatically parallelize for you where we can. And again, it's another one with dozens of switches that I can't talk about right now.

We give you lots of directives and options on the optimizer, as I said before. There's a couple of variations on this where you can go into your source and say things about your code to say this loop has this characteristic, and that'll give the optimizer even more opportunities to go and do things that it might not be able to recognize otherwise. But in some cases, you want to constrain the optimizer. A lot of older code, especially, may not be 100% standards compliant. So things like -q alias, non-standard, will let you crank up the optimization level and still have your code run correctly, even though it might not be as opportune as if your code was standards conformant.

And of course, things like -q prefetch and --- we'll automatically insert prefetching directives where that's useful. Had a great example of that yesterday where we had a gentleman in the lab across the hall working with us. We brought his code in. And just with some analysis with Shark and looking at things, we stuck in one directive and speeded up the core loop in his application by a factor of two just by doing a prefetch.

So the summary is, IBM XL, Fortran, and XLC bring to the Apple G5 systems technology that's been in the works at IBM since, honestly, the mid '80s. And it's been improved every year. We have a large team in Toronto, and we work closely with the chip folks. We're fully backed by IBM's premier customer service. Doesn't matter if you buy the compiler from Appsoft or you buy it from IBM. It's still the team in Toronto that's gonna be looking after you. And our standards compliance and the large range of extensions that we have let you bring your code down from pretty much anywhere. And we'll help you out with things that you need. Thank you.

Thanks, Dave. Great, thank you, Dave. Okay, continuing along our stack, I wanted to talk a little bit about communications middleware. This is typically what we see as the MPI layers of a cluster. The great thing about Mac OS X, again, leveraging off that Unix foundation, is that just about all the major MPI stacks have been brought over to Mac OS X and run really, really well. Matter of fact, some of them have been really optimized for Mac OS X and are available, for example, LAMMPI as a package installer for really ease of installation right on top of Mac OS X. So great selection of tools. As a matter of fact, if you have experience with a particular MPI stack, hopefully you'll see that the exact same stack is available on Mac OS X and can leverage that familiarity on the platform. So both open source and commercial stacks available for Mac OS X. There are a number of other pieces of middleware. obviously talked about OpenMP, Globus, PVM, Paradise Linda from SCA, and a recent product, Accelerate, from Gridiron Software, all also fall into this communications middleware stack, and of course, all are available on Mac OS X. Thank you.

Finally, I wanted to touch on management tools. This is an area where we think Mac OS X really shines because, again, you have the best-of-breed tools available from Apple to really make managing these systems, particularly head nodes and things where you're providing file services and network services, are able to provide very, very ease of use for system administrators managing, you know, whether it be a small cluster or a large cluster. We also have the benefit of great open-source tools to really provide added value and functionality. So if we drill into this, of course, first of all, we start with Apple's management tools. Server admin, workgroup manager for providing kind of the bread and butter, you know, file services, DNS, DHCP, directory services, things that kind of, you know, you forget about, but, you know, it's a network infrastructure. You need these to support cluster operations. I wanted to highlight server monitor, the tool that's unique to XServe. XServe G5 has over 30 sensors on the logic board. I like to joke it's one of the most instrumented one-use servers in the industry. Server Monitor is the tool that allows you to wrap up that data and provide that status information about the hardware. Temperatures, predictive drive failures, power consumption, all that data is available in Server Monitor. It's a great complement when you're managing a large number of machines. Beyond that, we also have a new piece of technology from Apple introduced not too long ago as a technology preview, which is XGrid. Again, taking that ease of use approach of how do we make deploying clusters easier, how do we make distributed computing easier? XGrid's really a great solution for these class of problems where you want to distribute workloads across a number of machines. What's interesting about it is that not only can it take advantage of dedicated cluster resources such as a rack of XSERVs, you can also bring ad hoc resources through rendezvous technologies out across to desktops and other machines on your network. The recent technology preview, too, added MPI support, which makes running and dispatching MPI jobs across your cluster much easier. And, of course, it provides great user interface all the way down to the tachometer to let you see how much performance you're getting on your jobs. So we're really excited about Xgrid. And, of course, now with it being brought into Tiger, it's going to be very broadly available to Mac OS X systems.

Finally, again, wanted to touch on some of the leading open source and commercial tools in this space, most notably schedulers. You know, again, again, the top schedulers available in the industry are available on Mac OS X. Platform LSF in the commercial space, PBS and OpenPBS, Sun Grid Engine, now called N1 Grid Engine. Even the Maui Scheduler are available for Mac OS X. and also some of the leading cluster management monitoring tools, tools like Gangly and Big Brother, also available for Mac OS X and very valuable resources there.

So in summary, if we look all the way from the hardware up to the management tools, we have a really compelling set of products and technologies, both from Apple and industry-leading third parties, that allow you to build really phenomenal cluster solutions with Mac OS X and PowerPC, G5 at the foundation of this stack. What I'd like to do is now introduce some customers who are going to talk about how they've deployed XServe and Mac OS X Server to solve some of their high-performance computing needs. First customer I'd like to introduce is actually Ben Singer from Princeton University, who's going to talk about his deployment of XServe in their center. Ben? Thank you. Thanks, Doug. It's a delight to be here. I'm here to talk about a little bit about the Princeton Excerpt cluster at the Center for the Study of Brain, Mind, and Behavior that we're still setting up. We got it about a month ago.

We're having fun setting it up. What is the CSBMB? Well, we're a consortium of Princeton faculty interested in the neural basis of cognition. And really what that is is one of the great unanswered questions in science, which is how does all this activity in the brain lead to consciousness and awareness and action and motivation and all the associated behaviors that we do every day and just take for granted. The consortium is made up of faculty from applied mathematics, computer science, chemistry, physics, biology, psychology, and philosophy. And actually, psychology is our home building, and so we have a lot of people in psychology that are working with us. And to point out one of the others in applied mathematics, our biggest collaboration is actually with Ingrid Dobeshi, who is the mother of wavelets and is applying some algorithms to brain imaging analysis.

So really what we are is a place that provides resources for all these faculty. And we have staff and we have resources in the computing and data acquisition area. And on the staff side, there's software engineers, MRI physicists, system administrator, and administrators for running the center. The big data acquisition instrument that I was alluding to is the MRI brain scanner from Siemens that we picked up a few years ago. And it was the first... At the time it was installed, it was the first research-only installation. So most of the time when you use an MRI, it's in a hospital setting, so it has first priority for clinical applications, and you end up doing work at 3 in the morning or something. And one nice thing about our facility is that it's there just a few doors away in the psychology building from the CSBMB staff center.

And that provides all the data that I'm going to be talking about and why we ended up getting an XServe. We already had a file server when I went shopping for a cluster, which was actually the first thing they had me do when I came about six months ago. And that was in place already. It's a BlueArc 9-terabyte file server to store all this data that comes from the MRI brain scanner, and we need to back it up, and we need to process it. So that's how we ended up with 64 XSERV G5 nodes. And I'm going to explain a little bit about how we chose the XSERV. But before I do that, I want to just say what it is from a computing perspective that what we do is motivating us to pursue an XSERV in the first place. We have a whole lot of brain data coming out of the MRI. A single study will produce hundreds of gigabytes of data. You take a single scan from somebody, and if you're doing functional MRI, even though a single slice of the brain is at lower resolution, 64-squared image, you're taking 25 slices, and then you're taking maybe 30 of these a second. And in one experiment recently, we had subjects watch Raiders of the Lost Ark for two hours and recorded their brain for two hours. That produces a lot of data. And so we did that with multiple subjects, too, because we want to see, are their brains doing the same thing when they're watching this movie? That's sort of a fun example. And to crunch through that is going to take some computing power. The other thing is people are moving their head in the scanner. They have a little head rest, so we tell them not to move, but they still do, and that's natural.

And so we need to align every image with the first one or some reference, and that takes a lot of time in the workflow. So does filtering in space and time. There's a lot of noise coming. This data, when you first get it, it's not like suddenly something pops out at you and you know exactly what's happening, except in very simple cases. There's a lot of noise in the data. It needs to be filtered out. There's other machines in the room that will put a signature in the data, maybe some low-frequency noise, maybe high-frequency noise, so you have to do filtering. And then finally, you need to do a statistical analysis, and you're comparing brains where they were just sitting there doing nothing with when they were doing the task that you have them do, and so comparing those two things is a simple statistical test, but you need to do it for every voxel in the brain. So that's thousands, hundreds of thousands of statistical tests. And that can take traditionally days of CPU time to do a single study.

And one problem with that is that when people are, they've got all this data and it takes all this time to analyze it, they don't tend to play with it much. They don't tend to try new things or look at it from a new angle because there's a big cost to doing that. They're gonna tie up the lab resources for a day. They can't just put this data on their portable and run away with it and they have to stay and use up the center resources. And sometimes people won't do it and so it just stifles creativity. That's one thing.

So why did we choose XServe? Well, when I first started looking-- and we all were a group, but I was sort of the one that was doing it at the time-- I got really my head deep into benchmarks. And although the XServe does really well with benchmarks, I think the reason we chose it wasn't just because of benchmarks. But anyway, let me point out the benchmark that I have on the slide. The AFNI Speedo score is from the AFNI package. It's from the National Institutes of Health. It's a free software package for analyzing MRI brain scans. And off the website where they published their single processor 32-bit benchmarks come the bottom three bars here. And then I ran it last week on our XServe, and it came out a little better. This benchmark tests the whole system. So maybe it was I/O or something that caused the XServe to do better than the desktop, even though it has the same speed chip.

Again, like I was saying, it wasn't only benchmarks. There were some benchmarks actually which early on, if we compiled, say, 64-bit Linux Opteron systems, we came out with different results. The XServe was always close. And what we were doing when we were buying this cluster was looking at the whole package and thinking about the future. and being able to actually house this thing and to be able to maintain it. And we also knew that in the future Apple's operating system would be fully 64-bit to do a fair comparison with that compile on 64-bit Linux. So we knew it would get better.

The power and the cooling, as been alluded to earlier, were a great story for us because we are in a small area and we said we wanted to get 64 nodes in the facilities. People laughed at us. So we said, well, how if we put it in that room there? And they said, well, good luck. So we thought about it. We did get them to put in some additional air conditioning. And then we looked at the stats and the specs and the G5 XSERV, which had just come out at the time, time, the specs show that it used about half the power and the cooling, I think, roughly at that time. And we were really happy with that. So we could actually buy it. It was actually a great, great feeling.

And we knew that we'd be able to cool it. And it's also very quiet. I think in the last session, Bud Tribble mentioned that we went into the room with them all on for the first time, and there was this strange high-pitched noise, and we thought, oh, great, this is going to be kind of noisy. It turned out it was the two Dell PowerEdges in the far-- so we were happy with that.

The great thing, we have a whole lot of people that are, we're coming off an SGI system and a bunch of Linux systems. So there were a lot of people that said, well, we're not going to be able to use these open source packages and we're going to have to recompile them. Well, I've been, in the beginning of the process of porting, it's been really pretty easy to do. The G5 is becoming a more and more popular target in GCC make files for the packages that we depend on, including AFNI. And so, in fact, they even have a binary distribution for the G5 already. So that was great. And the administration of this thing has been really straightforward so far. We're really still bringing it up. But the server admin and the server monitor tools that Doug showed in one of his slides have been really helpful. I can just bring up my G4 desktop and look on the screen and see what's going on with the cluster. And I don't have to be a full-time sysadmin. Thank you.

This is our system. What's a little different about us is what I emphasized in this slide. We have all this data, and we have this file server already, so we sort of had to work with it. And so we have this second network. Each G5 node has two built-in Ethernet ports on the back, and so we decided to use both of them, which when we were setting it up, we realized we had to do 256 crimps of network cables, so we were wondering why we did that. But then we realized once we set it all up, And we had a lot of help from Apple doing that. And someone came and did most of the crimps and redid the ones that I did. Uh... And we got all this stuff up and running. And what we have here, we have a foundry switch that we got along with this thing, this XServe cluster.

I shouldn't call it a thing. It's got a few gigabit ports on it, which goes to our existing BlueArc file server, which is an appliance, sort of, has most of its software as in firmware. And we connect out to the world with that into the head node of our cluster. Our cluster in red uses what Apple ships with the cluster, the Asante gigabit switches. And then down below is the foundry connections going. And that's where all the NFS traffic is between, well, it stays on its own network, so it doesn't interfere with what's going on on the other network. And most of our applications are single processor, embarrassingly parallel, so we don't have a need for MPI yet or any of the high-speed interconnect. So we were happy with this. And we just need to move a lot of data on, it for a couple hours and then write it back out. So we didn't have a need to do MPI yet. And there will be a lot of opportunity for that.

Just to conclude, I think we're going We're seeing five to ten times faster results than we had before with the system that we moved offsite, an SGI origin. We haven't even begun to try to optimize, so that's something that we want to do, although a lot of this software isn't really written by us originally, but it's fertile ground because like I said, every voxel in the brain, the analysis is independent.

So you could have a node for every voxel if you wanted. So there's a great opportunity to speed things up. We don't need to hire anyone, at least I hope that's not why I was hired, just to do this. But I'm having so much fun doing it that actually I might spend more time playing with it than what I'm really supposed to do. Our students are happy. They have a feeling that something's coming. They see it there. They see the lights blinking, and they know that they're going to get a chance to play with it soon and they're going to be able to have less than an excuse to say that their work is not done yet because it will be done quicker. But the good thing about that is they'll be able to try new things, do more stuff, and actually probably discover some things that they wouldn't have discovered otherwise, and that's the bottom line.

And like I said, one reason we chose the XServe is just we've had both I in the past, I'm an Apple user and so is the director of the center, and we know that Apple is always innovating. The hardware is solid. The operating system is too. And with Xgrid, for instance, you see that Apple's eye is on this problem and that more great things are going to come. And so we're looking forward to that.

Thank you, Ben. Okay. The second customer I'd like to introduce is Dr. John Medeiros. Now, if you read any news last week, you may have heard about a small little cluster going in down in Huntsville, Alabama, a 1,566-node cluster being put together by Colsa. And I'd like to introduce Dr. John Medeiros, the senior scientist from Colsa, who's going to tell you a little bit more about it. Thank you, Doug.

Thank you. As Doug mentioned, I want to talk to you about our cluster and tell you a bit about it and about why we got it. But to put things in context a little bit, I'd like to tell you a bit about who we are, what kind of computing we do, and why we need so much of it, and what process we go through to pick the cluster system that we did, and why we picked the XServe cluster.

that we did. In any case, who is Colsa? Well, we're a small engineering services contractor, about 800 people based in Huntsville, Alabama, as Doug mentioned, and we have a few offices throughout the U.S. I have to mention our company president, Al Sullivan, and Dr. Tony DiRienzo, our executive vice president, who actually have championed this project for us, providing a lot of corporate support for the vision that we had in terms of bringing that system to bear. My particular project I'm involved in is the hypersonic missile technology program and we have a dedicated corporate facility that we recently renovated for the system called the Research and Operations Center. So that makes us the HMT Rock, which sounds a bit like a radio station, but really the only music there is on the iPods. Anyway, our program manager there is Mr. Mike Whitlock, and I'm the technical lead on the project. Our primary customer is the U.S. Army's Research Development Engineering Command out of Redstone Arsenal, RDECOM. And the principal scientists there on that project are the doctors Billy Walker and Kevin Kennedy.

So what kind of computing requirements are there for the project? Well, supporting their hypersonic aerodynamic analysis of flight in scramjet engines and focuses on the computational fluid dynamic analysis of the hypersonic endo-atmospheric regime, that is the very fast and near-Earth atmosphere. The cartoons on the right show some of the schematic data that comes out, visualized data that comes out of it where you display parameters of the space around an object flying very fast and it's in fact a very complex and difficult problem that we are simply attacking by brute force using a code that's proprietary double precision Fortran code to solve the Navier-Stokes partial differential equations and it explores the full combustion chemistry that goes on in that regime and we explore problem sizes with the space around an object divided into 20 million or more individual points at which the computations are done. That's a lot of points, but the good news is that the blocks of those points can be assigned to a given processor and computations are carried out within that processor and then the results are compared, stepped through, and iterations continue. And as a result, the way that this whole process works, the problem is very CPU intensive and very little time relatively spent in inter-processor communications.

It's the category which you might call almost embarrassingly parallel, which is good from our point of view and in fact drove the design of the kind of cluster that we went after. Now we've been doing some computing in this project for a while and we've done it. Well, systems we have include a traditional supercomputer system. We have an IBM SP Power 3 system with 284 CPUs that when we got it, as pictured there, back in June of 2000, it came in as number 47 on the top 500 list and four years later today, it's completely off the list. It gives you an idea of how things are progressing.

So our goal was, I mean, for this project, like it can't be too rich or too thin or have too much computational power. We needed a lot more than this. And mainframes, as expensive as they were, while they work very well, were too expensive to get to the kind of computational levels that we wanted.

So we began exploring clusters. And in the interim, since we got that mainframe system, we've acquired and put together and played with number of clusters and explored a whole range of architectures from major vendors including AMD, Intel, and Apple. Our first system back in June of 2000 was a 34 processor AMD Athlon system and about the same time frame, a little later, we acquired a G4 system about the same size, which at that time I believe was probably one of the biggest Apple clusters around. That system performed fairly well, but it was only 800 megahertz and we wanted to scale up very substantially. So we want to look at other architectures including rack mounted obviously.

Only a few of the systems that we have now, we've worked with, are shown here. The first up or two for historical reasons were the early days of looking at clustering. So we did tower systems for PCs and apples, and that little apple cluster system we affectionately called Apple Orchard.

We've looked at 64-bit systems very extensively, including the Optron system. And a lot of our computations now are done on an Intel architecture 32-bit system. You can see there our largest cluster now is a 522 processor system. But we, in fact, needed larger, more. So we explored additional possibilities. Now the whole thing is set up in this unpresupposing building in Huntsville that we acquired back late last year and the building was gutted internally, virtually internally, and this shows the computer room being put together back last fall and we renovated it literally from top to bottom, the ceiling and the computer room floor. We've got about 3,000 square feet of computer room floor, and this shows our configuration with that 522 processor Intel architecture system on the left, as you see it, our SP mainframe system on the right, and center is what I'm going to be talking about here, the cluster system that we are acquiring from Apple.

Okay, how do you pick such a system? You have to benchmark it, and from our point of view, the only benchmark that counted was our application. So we ran our code using sort of a simplified geometry, but the full complexity of the problem in terms of a reasonable problem size and full combustion chemistry with a whole range of chemical species.

What we did find among other things in testing across the whole range of processors that the inter-processor communication, as I mentioned, is a small fraction of the total compute time. That is, in a given iteration we found that that might take typically a few seconds. The amount of time that was done in communicating between processors between those iterations was in the range of milliseconds. So there was very little penalty in worrying about the interconnect, which is why in fact we've chosen gigabit Ethernet for the system. Let me go back a minute.

The last point about it, part of the reason for doing that, of course, is that these other interconnects as you've already heard about previously in the session, you can get better performance for a broader range of applications, but the cost difference is not trivial compared to the Gigabit Ethernet switch.

Yes. Some of the data that we used and the kinds of things that drove our decision. What you see in there, and this is a log-log plot of the time to do a given step of the computation as a function of the number of processors you throw at the problem. For all the different processors, five are shown there on this chart. For all the different processors, you see they're actually scaled very well for our kind of problem. That is, as you double the number of processors, the processing time cuts in half.

The grouping there, if you can make it out, this is a log-log plot, but it does break up naturally into two groups. The upper two for the Athlon and Xeon systems are 32-bit systems, and the bottom three, Optron, Titanium II, and the G5s are the 64-bit systems. Lower is better on this chart. Less time it takes to do an iteration, the more of them you can do in a given amount of time, the more processing you can do. And on that basis, you can see that the G5, in fact, performed the best. Now this particular comparison is maybe not a little bit fair because these were different processors with different speeds. So the next chart is the same data.

This time all results were normalized as if each of the processors had the same 2 GHz clock speed. They didn't, but you can normalize the data that way just for demonstration purposes. What you see here is the results are essentially the same, they're not changed very much, and the only difference here now is that the Itanium 2 looks a little bit better than the G5. It's a little bit faster on that chart. But you've got to keep in mind that Itanium 2 is not available at any cost for 2 gigahertz, and at its fastest implementation, about 1.5 gigahertz, a system built with Itanium 2s comparable to the G5 would cost about five times as much. So G5s it is.

Okay, well, process is one thing, but there's a lot of issues in putting together a big cluster. Now, we've put together, as you kind of saw, clusters that are pretty reasonable size, but even for us, this is a big cluster. And there are a lot of issues that come up in terms of scale, but there's a whole laundry list of things that I'm going to show here. I'm not going to go through them in detail, except to highlight a couple things that you've heard before today, and I want to emphasize them yet again. Power and cooling on the bottom. especially at this kind of scale, is very much non-trivial. For example, for the current system that Apple is delivering to us right now, as we're getting it in, We've had to upgrade our power into the building. I'll tell you about that in a minute. But just to give you an idea, something we haven't shared with our corporate executives yet, but just to run the system, our utility bill for the year is going to run about $250,000 just to keep it going.

Cooling is also, of course, a very important issue. And just like power, you can calculate how much cooling you need, and you can get that cooling into your facility. But in addition, you have the added complication that you've got to get the cooling to the right place, and you've got to look at how you distribute that and remove the heat and bring in the cooler air. So that's something that we're playing with, and we expect to actually have to fuss with a fair bit over the next little while.

So how did we do this process? We decided it was going to be G5s, but you can't say that when you're buying on a government contract. You've got to be generic, and we did. We put a quote, a general quote, request for quotes, out to the community at large, and one of the quotes we got back, in fact, was the G5 system, coincidentally. But the requirements that we had included a theoretical performance for the system of at least 25 teraflops. We wanted a processor count in excess of 3,000. We wanted all the fit into a 1,000 square footprint. wanted minimal power and cooling requirements, and we wanted it all delivered by 12 July this year. And you don't want to pay a lot for this cluster.

We didn't share with the vendors what the cost figure we had in mind was. We had to go for the lowest bid. But we wanted the whole thing, including the switch and all the ancillary equipment we need with it, to come in at under $6 million and we're going to make that target.

So the system award exclusive of the network component was done on 17th of May this year. So that's really a three week turnaround, which is, in this business, is a very short time frame for getting that done. But we wanted it. fairly soon as well. So, okay, what about the system itself? Some of the details. We're calling it Mach 5. It stands for multiple advanced computers for hypersonics using G5s.

We've got 1562 dual XRF G5 compute nodes and four head nodes. And these nodes are being delivered as we speak. In fact, there was a lot of complaining back home that I get to play and come here and attend WWDC while they're working on putting the system together back there. And I've got to fly back there tomorrow because I couldn't get much more time off than that. The systems, we've taken delivery about 350 nodes as of yesterday. And it's coming in at a tractor-trailer load worth a day. consists of 25 pallets of a dozen XSERVs, and everybody's got to pull them out, assemble them, rack them, get them hooked up. At that kind of scale, it gets kind of interesting. So the physical configuration is going to be set up in 40 racks with 39 XSERV nodes in each rack. These are 42 URACs. And a 48-port gigabit Ethernet switch in that rack, The switch we're getting from Foundry Networks is actually a very high performance gigabit switch, and will work great for our purposes, we believe. And one rack includes four head nodes, a couple more cluster nodes, and a large 320 port gigi main switch that are trunked from the -- to which the individual 48 port switches at each to the nodes or trunked and access the nexus for the cluster network. The whole thing occupies less than 600 square feet, so it beat the 1,000 square foot limitation that we imposed.

And we're expecting to draw about 400 kilowatts of peak power for the system. And we didn't have that much enough power being brought into the building at that point. And we had Huntsville Utilities bring us in a new transformer rated at over two megawatts. We're planning to build actually a bigger system. But that's another story.

For cooling, we know we require about 110 tons of cooling. For those of you that might not be familiar with that, the ton unit that used to rate these big chillers is an archaic unit in the heating industry that relates to being able to remove the latent heat diffusion of water in one ton of water in one day. That is, make a ton of ice in a day. So we've got 150 tons of that installed. And if we ever get out of the computing business, I guess we could make a lot of ice.

but not at that price. OK, the details of the nodes. OK, the head nodes themselves are, of course, dual 2 gigahertz G5 units with mirrored 80 gigabyte hard drives and 8 gigabytes of RAM installed with a CD rewritable and the video card. The compute nodes themselves, 1,562 of them, also 2 gigahertz XSERV units with a single 80 gigabyte hard drive, 3.5 gigabytes of RAM per node, just under 2 gig per processor, no CD-ROM or video card required on the XSERV nodes. So in total there's 3,132 CPUs and 8 gigaflops per CPU. The theoretical performance of the system comes in at a tad over 25 teraflops as required.

As I mentioned, the system's being delivered, and these are pictures we took last week before we came out here, and those were the first 40 units packed in the high bay being delivered. Some work's still going on in the computer room in terms of getting the rest of the infrastructure set up, and you see some of the guys working on putting in some of the hardware in the racks.

Now, we've got 40 racks, and to mount all these X-serves in the racks, you have to put these little clips to which you can screw into to get the rack in there, You're going to put them in the front and the back. For this many, we calculated the guys have to put in over 14,000 such clips. They did it in an afternoon. We had a bunch of folks working on it.

Okay, so to kind of wind down the story a little bit, to give you some perspective on the progression of computer technology, I'm comparing here that mainframe system we acquired back in 2000 to Mach 5 coming in now. And cost-wise, we're paying a little bit more, $6 million compared to $4 million back then, so it's about 5% more. Floor space is about twice as much. However, for that, we get more than 10 times more processors, and we get more than 60 times more performance.

So in summary, we chose the APPLES XSERV G5 architecture for a major production cluster for computational fluid dynamic analysis and hypersonic flight. The proposal we got from Apple on the XServe G5, in fact, delivered the best bang for the buck, in its essence. Best price performance. Now, as I've kind of mentioned, Mach 5 has been designed for a compute intensive problem with relatively little demand on network. Now that means in terms of standard measures that put systems on the top 500, it will not do as well relatively as a system sort of purpose design with the higher speed network. That being said, we fully expect to achieve something over 12 teraflops of real performance. And we believe we might be able to get up to 15 teraflops of real performance. If we can do that, we'll still be easily in the top five when the November list comes out. Hopefully we can get there.

So as I mentioned, just to finish up, the system is being installed, and we hope to get it into production, actually get it working by the fall. And from the solicitation of the system to actually production work, we're looking for a six-month time frame, which is pretty phenomenal for a system of this kind of scale. And we hope it works out. Thanks.

Thank you. OK, well, we're running a little late. So just to summarize and finish up the session, What I really wanted to say is, you know, Apple is investing in the high-performance computing market. We're doing it through our products, our technologies, and the solutions that we're providing, working very closely to make sure the right solutions are available, both from third parties and the open-source community. And the adoption has really been phenomenal, and momentum continues. So, in summary, you know, Apple has products from the workgroup cluster, turnkey, easy to use in bioinformatics, all the way up to, you know, the top supercomputers. So with that, thank you very much. Unfortunately, we're running out of time for formal Q&A, but if there are any questions, it'll be available up front for any questions you might have. you