Configure player

WWDC Index does not host video files

If you have access to video files, you can configure a URL pattern to be used in a video player.

URL pattern

Use any of these variables in your URL pattern; the pattern is stored in your browser's local storage. An example pattern is shown after the list below.

$id
ID of session: wwdc2004-604
$eventId
ID of event: wwdc2004
$eventContentId
ID of session without event part: 604
$eventShortId
Shortened ID of event: wwdc04
$year
Year of session: 2004
$extension
Extension of original filename: mov
$filenameAlmostEvery
Filename from "(Almost) Every..." gist: ...

WWDC04 • Session 604

HPC System Design on Mac OS X

Enterprise • 1:04:45

Developers will gain an in-depth knowledge of hardware architectural issues including interconnect technologies, storage solutions, file systems, and software frameworks. Learn about the new and emerging hardware architecture technologies including hardware interconnects (InfiniBand, Myrinet, 10GigE, Ethernet Jumbo Frames, Fibre Channel and FireWire), and we'll also cover topics such as available storage solutions, filesystems, and software frameworks. This session is for developers identifying the best hardware/framework design for their cluster solution.

Speakers: Skip Levens, Srinidhi Varadarajan

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

Thank you. Good morning and welcome to session 604, HPC System Design on Mac OS X. I know we want to give as much time as possible to our guest speaker, so let's just dive right in. It seems hard to believe, but we only last year released the Power Mac G5 at this very conference.

The Power Mac G5 marked the introduction of several new technologies to the Apple platform: an incredible system architecture, gigantic memory bandwidth to feed the processors, driving over a gigahertz frontside bus, an expansion bay running at 133 megahertz, PCI-X cards. So a very, very nice machine. And at the heart of it all, of course, was the PowerPC G5, which was the first chip to emerge from an extraordinary partnership between Apple and IBM. But the Power Mac G5 and the G5 chip were about to go places the designers of the Power Mac G5 never intended.

We began getting inquiries from the field sales team that supports Virginia Tech saying that they wanted to buy 1,100 of these machines and they wanted to turn them into a supercomputer. And they had ambitions to actually make the top 10 on the top500.org supercomputer list. The kicker was they had 90 days to complete the first benchmark.

The challenges were absolutely enormous, and it's been compared to mounting an assault. The Power Mac G5 only existed in prototype stage at that time. The only prototypes were in Cupertino. The InfiniBand architecture that was needed to drive enough interconnect bandwidth between the machines had not been ported to Mac OS X.

This is a technology so complex that the specification alone is over a thousand pages. I should point out that the errata to that specification are several hundred pages. The InfiniBand vendor, Mellanox, had never touched Mac OS X, and in fact the chief architect and designer at Virginia Tech had also not touched Mac OS X, so we had to learn everything from the ground up.

The timeline of course was absolutely crushing. Everything had to work perfectly and had to come together perfectly: the physical plant, the optimization of the libraries and code, the drivers for the InfiniBand host channel adapters, the switches had to be debugged, everything had to come together, had to hit that first benchmark in 90 days. At one point we actually were doing follow-the-sun development.

There was a team in Israel who would hand off to a team in Virginia and in Boston, who would then hand off to Cupertino. We had a DTS engineer working out of Ireland. So an incredible amount of effort expended. At one point Dr. Varadarajan actually did several short bursts of 20-hour coding with only cat naps here and there. So it was a truly extraordinary amount of effort expended.

So I don't think I need to tell you the result. The result was academia's fastest supercomputer ever, the second fastest in the United States, and the third in the world. There were large teams to thank at Virginia Tech and at Apple, but we want to point out that this venture was essentially doing everything that we expect of our developers.

A developer has a vision, and they want to execute that vision on our platform. And they engage the same resources that we make available to every developer, our DTS services, who in turn invoke engineering. Worldwide product marketing got involved, technology evangelism got involved. So the same services that are available to every developer are exactly what Virginia Tech used.

So in the wake of this success, we commissioned a video, what we call a success video. And as we interviewed Srinidhi, he makes a statement in this video that says that with $5 million, anyone can build a supercomputer. And everywhere we've shown this, as we've taken our enterprise and HPC story on the road, it's always gotten a good laugh, because it's got to be more complex than that. To explain some of the methodology and how you can incorporate the same type of system design principles in your HPC systems, I would like to introduce Dr. Srinidhi Varadarajan, the Director of the Terascale Computing Facility at Virginia Tech. Srinidhi, thank you.

Thank you very much Skip. Thank you all for this opportunity. So today I just wanted to step back from all the work that we have done on the machine and talk about just building machines. What kind of applications, how do you analyze your applications, how do you come up with an architecture for a machine, how do you modify the place where it's intended to go in order to take a resource that large.

Simply put, a primer. It's an outline of the talk. I'll sort of introduce the need for supercomputing and then go into architectures. How do you build machines like this? What are the various flavors out there? Supercomputing has been around for a good 30, 35 years, so there are a variety of architectures that have been tried out. What's coming up in the future? What kind of software do you need to be able to get performance out of systems like this? How do you tie all of these machines together and make them behave like one? Then the deployment.

So if you decide to go ahead and follow a build-it-yourself approach, how do you go and deploy a resource like this? How do you do it quickly? What kind of resources do you internally need to gather in order to put up a system like this? And finally, a little bit of perspectives on the future.

What's going to be happening a year or two down the line and even looking further out? The first question is: why supercomputing in the first place? There are a lot of places that have built supercomputers, so why do you do it? The typical application for machines like this comes from an area called computational science and engineering. So if you look traditionally back into the workings of science, there were two major ways in which you could go about empirical work. On one side you had theory, and on the other side you had experiment.

And these two had always gone hand in hand, 500 years worth of history in there. About the 50s or so, a third area opened up called computational science, where essentially you start off from first principles and try to derive higher level theories directly from first principles by computing starting from very basic elements.

Very basic elements whose laws are known. Now this requires a phenomenal amount of computing capability, and over the years it's become another mode of science, just like experiment and theory. The big advantage with computational science is it lets you do things that you cannot really do in experiment most of the time.

It lets you see things that are too large or too small, things that cannot be directly observed by experimentation, but can be derived from first principle laws. Most of the high-end work that's going on today is occurring in computational science. And the goal is now you need computers fast enough to start analyzing elements at very, very small scales, and very high levels of compute capability are needed there.

So, in order to support computational science, you need a fairly large scale supercomputer. Now why? In the ideal world really, one processor that can solve the entire problem would be the best solution that you can come up with. You wouldn't need parallel processing, and that one processor itself would be a supercomputer. Why is that interesting? Because it's much easier to program, if nothing else.

It's easier to program because typical sequential logic from von Neumann is very well understood. We tend to think sequentially. There are very few branch points in our thinking, and it's very easy to program, very natural. Parallel systems are not very easy to program. You've got to think about many things happening simultaneously, making sure that all the things that are happening simultaneously are happening at roughly the same speed.

Otherwise, there are some processors doing nothing. Now since you can't get one processor that can handle the power essentially of these high-end applications, you need to harness the power of a lot of processors, all of them working in concert. Question is, how do you do this? So if you look at basic architectures, the oldest of the architecture families is something called shared memory. So what you do in here is you take a bunch of processors and enable them to access a common memory bank.

So all of them access the same memory. The first architectures used to be called UMA, Uniform Memory Access. So essentially, if you try to access data from memory, all your processors could get to memory at the same speed. Now what's the obvious problem with this? If you have a lot of processors trying to access the same memory, you don't have enough bandwidth to get to memory. Then it's a moot point how fast your processors are.

If you can't feed them data, if you can't even feed them their instructions, they're sitting there idling. This would be the equivalent of putting a Ferrari engine in a vehicle with bicycle tires on it. It's not going to do very well. So that's basically the problem. You cannot feed data to it. The next generation of that came up with something called NUMA, non-uniform memory access.

So the memory really was distributed in many banks across the system. Each processor could really access all the memory. But the time to get data from memory varied depending on where the memory was located. So if you have a supercomputer the size of this room, depending upon where exactly the memory was that held your data, your latency would change. Most modern shared memory machines follow the NUMA architecture.

Shared memory machines are custom. You can get small SMPs; basically, the larger machines are blown-up versions of SMPs. So in an Apple Power Mac G5 or Xserve, you have a two-processor SMP. That's about right for the kind of memory bandwidth that you have in the machine. But if you look at larger SMPs, which are shared memory machines, where you can go to 16, 32 or even larger, these are completely custom machines. And their price doesn't scale linearly.

The more processors you add, if you have four processors on a board versus eight processors on a board, eight processors on a board is not twice as expensive. It's significantly more than twice as expensive to do that. So it does not scale very easily. On the other hand, it has a very good advantage: very easy to program.

Threaded programming is one of the standard things taught in, say, junior operating systems courses in computer science. It's been done for many years. There are a lot of other applications that use threaded programming. It's a very simple model. There are even things called parallelizing compilers that, if you give them a piece of sequential code, will attempt to extract parallelism out of it that can run on a shared memory machine.

They don't do very well necessarily. So if you have 32 processors and you let a parallelizing compiler take your code and run it, you're not going to get 32 times the performance of one processor. You'd be lucky if you get about five or six. So the trade-off that you make over there is you don't get the same kind of efficiency that you could possibly get. However, as computational scientists, a lot of us like to spend as little time as possible parallelizing the code. We want some automated system to take it over from us.

Here, I've written this code. It works on my desktop. Make it run on your supercomputer. That's the basic idea there. And in that model, shared memory works quite well. So big advantages are relatively easy to program. Disadvantages, they don't scale. You cannot build very large shared memory machines. They not only get prohibitively expensive, their performance falls pretty rapidly.

The other kind of architecture out there is called distributed memory. So you have no common pool of memory. Every processor has its own memory, and you hook a whole bunch of these processors together, and that's your parallel architecture. So how do you program these machines? If you don't have a common memory, you can't have different parts of your program, different threads, exchange data through memory. Instead you send messages.

So effectively, you have 1,000 processors, all of which are communicating with each other just by sending messages. So one processor would say, processor 1 would say, processor 3, go ahead and do this. Processor 5, go ahead and do this. In return, it might be expecting data from processor 8 in response to which it will do something else. So essentially, each processor runs a copy of the application.

And depending on the data it gets, it does different things. And all the state is maintained just by exchanging messages across processors. The big advantage with distributed memory machines is they're inexpensive. They're also very good at scaling. You can build very, very large distributed memory machines that scale very nicely.

But they're hard to program. The trade-offs that you make when you go to message passing systems, it's very hard to program. There's a fairly steep learning curve, after which things start settling down. And that learning curve might last a year or two. In fact, very simple applications like Hello World, which is probably the first C program everybody writes, writing a parallel version of it invariably generates errors. So I've seen a lot of programmers start off saying, hey, this should be easy. We're just sending one message. And there's always something wrong in there. So MPI, which is the most common message passing interface that's used over here, is somewhat hard to program.
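
As a rough sketch of the kind of parallel "Hello World" being described here (illustrative only, not code from the session), every MPI rank other than 0 sends a short greeting to rank 0, which receives and prints them:

    /* Compile with an MPI wrapper compiler, e.g. mpicc hello.c, and run
     * under mpirun. Every rank except 0 sends a message; rank 0 receives. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char msg[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which processor am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processors?  */

        if (rank != 0) {
            snprintf(msg, sizeof msg, "hello from rank %d", rank);
            MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            for (int src = 1; src < size; src++) {
                MPI_Recv(msg, (int)sizeof msg, MPI_CHAR, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 0 got: %s\n", msg);
            }
        }

        MPI_Finalize();
        return 0;
    }

Even this tiny program has to get ranks, tags, message sizes, and matching sends and receives right, which is exactly the learning curve being described.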

Another kind of architecture which was started off in the 70s is something called SIMD or vector processors, single instruction multiple data. So the idea over here is if you look at scientific applications, they do a lot of very regular operations. So if you know this fact, then you can try to exploit it. Simple example is, let's say I've got two matrices, and I want to multiply them.

So all of us have looked at high school matrix multiplication. You take the first row of the first matrix and the first column of the second matrix, multiply each pair of elements, and add the whole thing together. Now, what are you doing here? You're multiplying one of these elements at a time. The instruction is the same. You're just multiplying. The data is different.

That's what single instruction, multiple data does. It executes the same instruction on multiple units of data simultaneously. So if you know that your application domain has a lot of this kind of parallelism, where the same instruction is being repeated over and over on a lot of different pieces of data, then SIMD architectures or vectors work very well.
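
A minimal sketch of that regularity, in C (names here are illustrative): the inner product at the heart of a matrix multiply applies the same multiply-add to element after element, which is exactly the pattern a vector or SIMD unit can process several elements at a time:

    /* Inner product of one row of the first matrix with one column of the
     * second, both passed as contiguous arrays of n doubles. The loop body
     * is the same instruction repeated over different pieces of data.     */
    double dot(const double *row, const double *col, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += row[i] * col[i];
        return sum;
    }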

And vector processors, this was pioneered first by Cray, the first Cray machines that came out in the '70s and then the '80s. They do very well in scientific computing. They're highly efficient. They deliver numbers that are not that far from peak for optimized applications. On the other hand, they're completely custom designs.

The processor has to be designed ground up. And they don't do very well on scalar codes. So what do you mean by scalar codes? So in vector codes, you have the same instruction executing on different pieces of data. But if you look at a typical program, you don't necessarily have this regularity.

You're executing different things one after another. Then vectors don't do very well. The second thing vectors don't like at all are conditionals. If A is greater than B, do this. Else, do something else. That kind of code breaks your pipeline for the vector and then your performance can be very poor. In fact, it will be much worse than a desktop machine in many cases.

Scalar processors are what typically get deployed in desktops. So they're also called multiple instruction, multiple data. What they basically do is execute independent instructions on independent pieces of data. The first generation of scalar architectures executed instructions one at a time. Modern ones are called superscalar, that is they can actually execute multiple instructions simultaneously inside the processor.

And there are some limitations on what kind of instructions you can put together. For instance, if you look at the PowerPC 970, you can execute about five instructions simultaneously. One of which is a branch, two of which are floating point operations, and two of which are load store operations. So these are superscalar designs. They can execute multiple things simultaneously.

The advantage is their application domain is fairly broad. They run all the way from desktops onwards to servers to many other platforms. And hence, just sheer volume makes them relatively inexpensive. It's volume at the end of the day that gets your prices down. And they can handle a fairly broad variety of applications. On the negative side, they're not necessarily as efficient for very regular operations, because they don't necessarily exploit the patterns that show up in scientific applications.

So if you look at some examples of modern supercomputing architectures, in vector systems you have the Earth Simulator in Japan, which is an NEC SX-6. It's a custom vector processor. There are a total of 5,400, 5,500 odd processors in the system. And they're an interesting mix of shared memory and distributed memory. 64 processors see the same shared memory image.

[Transcript missing]

So why clusters? So if you look at the way processor performance grows, it grows by Moore's Law. This was a sort of axiomatic observation back in the '70s, where they noticed that processor performance (actually VLSI complexity really, which translates to processor performance) doubles every 18 months. So every 18 months, you're going to see a processor twice as fast as the one you're running today.

This is an exponential curve, and it's a fairly deadly curve, in that if you take too long in designing your machine, by the time your machine comes out, there's going to be a processor a lot faster than yours. The problem is, if you're doing a custom design, you need to spend a lot of time on your machine, because you've got everything from board design to processor design that has to occur. All of this has to go to a foundry, get manufactured, and by the time it comes out and your first machine's out, there are processors available twice as fast as yours.

And you can't turn around quickly. Custom designs take a long time to generate.

So essentially, clusters, on the other hand, you can build quickly. So you want to be able to harness the power of Moore's Law, use it to your advantage, essentially. So the goal, basically, is design quickly. Don't waste too much time in the design phase of it, but do it thoroughly.

You'll have to spend a lot of effort in the design phase. But cut the time down, because the longer you wait in your design phase, you cannot do a one-year design or a one-and-a-half-year design cycle. Because during that period, your entire design essentially is worthless by the time it comes out.

And you also have to move very aggressively, because you're going to base your designs on the latest processor, do your design and do your construction in a shorter period of time. Otherwise, you're picking up depreciation. You're going to be paying for your machine on day one. And after that, the longer you take to deploy your solution, the more depreciation you're picking up, essentially. So custom architectures, this is where they face a problem. Their lead times for design and deployment are much longer, which is what hurts you.

So let's look at some architectures. So if you want to design a cluster system now, how do you go about doing it? What variables do you take into account in this design? First thing to do is a very simple question. Look at your applications and figure out if they are tightly coupled or loosely coupled. So what do you mean by this? In loosely coupled applications, you take a parallel application, it runs across a bunch of processors. They don't really exchange that much data among themselves.

They can almost proceed independently. There are some great examples in bioinformatics. BLAST, the genome search engine, is very good at doing things in parallel, simply because it really doesn't need much input. There are other domains of parallel applications where you can use something called data parallelism. So let's say you want to crunch through X amount of data, and you have N processors.

You give X/N amount of data to each processor, and the processor can go and work completely independently on that data. Let me give you a trivial example. Let's say you want to search. You have a very long list of names, and you want to search through that list of names, and you have, say, 10 processors. Take your list, break it up into 10 independent chunks, and let them all go independently. There's very little communication between them.
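
A minimal sketch of that data-parallel split, with hypothetical names: worker number rank out of nworkers searches only its own slice of the list, so the workers proceed independently and only the tiny results need to be combined at the end:

    #include <string.h>

    /* Return the index of target within this worker's slice of the list,
     * or -1 if it is not in this slice. The slices partition the list.  */
    int search_my_chunk(const char **names, int total, const char *target,
                        int rank, int nworkers)
    {
        int chunk = (total + nworkers - 1) / nworkers;   /* ceil(X / N) */
        int begin = rank * chunk;
        int end   = (begin + chunk < total) ? begin + chunk : total;

        for (int i = begin; i < end; i++)
            if (strcmp(names[i], target) == 0)
                return i;
        return -1;
    }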

This is a loosely coupled application. In a tightly coupled application, you've essentially taken something that requires the work of everybody, and they have to keep exchanging data among themselves. In such systems, your network design is quite different. The other thing that you need to understand to identify in your application pool is where are your bottlenecks? Always profile the application, because there's a lot of dependencies.

Do you need a lot of processor performance? Do you need a lot of memory? Do you need very high-speed access to memory, because you move large chunks of memory in and out of the processor? Do you need large caches? Are your applications really cache-friendly? There are examples in scientific computing that are very cache-unfriendly. You have things called sparse matrices. You have a matrix where there are very few data elements scattered everywhere around, so there is no locality. Having a large cache really doesn't help you very much over there.

Do you need a lot of I/O bandwidth? If you have a tightly coupled application, you might want to send and receive a lot of data. So this is a lot of I/O that you're doing, really, over your network. Then your bus design should be capable of handling this level of I/O that you're looking at.

And finally, do your applications work with a lot of data? So what is the disk capacity that you need, and what is the disk bandwidth that you need? And by the way, real life applications don't have just one bottleneck. They have more than one bottleneck. So if you think you've found the solution, you've profiled an application and you've found that processor performance is your biggest problem, you just want a faster processor. If you go and get a faster processor and run your application, you might suddenly find memory bandwidth is an issue. These things always couple. You eliminate one, another one pops up. It keeps happening round and round until finally you get something you call satisfactory performance.

The thing to emphasize over here is it's always nice doing designs against a fixed budget. This seems a little counterintuitive. In the ideal world, we'd like people to give us infinite amount of money and we do a design against a completely unlimited budget. Problem there is you don't necessarily optimize anything.

You go find the vendor that's got the most powerful machine. Essentially you're going to be buying a Ferrari for driving to the grocery store because you can afford to do it. The question is, when you start working against a fixed budget, it forces you to prioritize your needs, and then you might find that with the set of constraints that you're working in, there might be a radically different design that works very well for you, one that your set of applications does very well on, and that's what you're looking for.

Analyze your application mix. The first thing you want to do is figure out how much total computing capacity you need in order to solve the problems that you are faced with. And never fall into the trap of working against peak performance. Machines typically are quoted by peak performance.

You might say this machine has 20 teraflops peak performance or 25 teraflops peak performance. Real applications rarely get peak performance. There are some benchmarks out there, LINPACK being one example of it. These are synthetic benchmarks. They try to exercise the system for an application domain of interest. So when you look at LINPACK, the application domain is scientific applications.

If you're running enterprise applications, LINPACK's not a good measure of how the supercomputer is going to behave in your domain. There is no substitute, really, for running your application on the system. So if you look at large DoD contracts, they give you a set of applications and say, run this application. Tell me how your system does. And that's it. That's the best way of doing it. Run actual applications on the system. The next thing is, first select your processor.

Select a processor that can achieve your desired performance in the smallest number of units. The more units you have, the harder it becomes to program. The ideal world you want to have one processor that solves everything. In the not so ideal world you want to minimize the number of processors that are used to solve the problem. It leads to better scalability and also better reliability.

Then select your node platform. And in selecting your node platform, there are a couple of choices. You can select what are called thin nodes, where a node has a motherboard, one processor, and memory. Or you can select increasing levels of fatter nodes, two or more processors accessing the same memory. The problem with this is there's a trade-off there. The more processors you put on the bus, the more chances you're going to run into contention within the node itself. All the processors put together may not be able to get to memory anymore.

So this kind of contention is something you'll have to think about. On the other hand, in some cases, if you have a poor network, for instance, it might actually be better to put more processors in one node, simply because they can communicate using memory instead of going over the network.

So this is a trade-off that you have to look at from an application perspective. More processors you put in, potentially they'll all run slower simply because they've got memory contention. On the other hand, they don't have to go over the network, so they may run faster. The other thing that you have to be careful about when you pick your node is it's got to meet all your other application needs.

It's got to have enough IO bandwidth, enough memory capacity and bandwidth, potential support maybe even for local disks and things like that. And in the ideal world, you don't want to pick the cheapest price for your node. Reliability is a pretty big concern. There are fairly large gaps in reliability between what you can get at the very low end to what you can get for even moderately higher prices.

The next component is picking your interconnect. This is your network that interconnects all of your processors together. And this is probably the most important design decision in building a cluster-design-based supercomputer. And your application domain should dictate what kind of budget you allocate to your interconnect itself. So if you have a loosely coupled application, you can work with gigabit Ethernet.

Some cases, you might even be able to work with 100 megabit Ethernet. There are applications out there that typically the CS community calls embarrassingly parallel, where you can even probably get away with smoke signals. You can get away with a lot if you have a very loosely coupled application. But here is the kicker.

If you have a very loosely coupled application, why a supercomputer? If you have a large IT organization, or even within your enterprise, if you have a lot of desktop machines, just cycle-steal from them. Instead of running screen savers, run your application; SETI@Home is a great example of that. You don't need a supercomputer to run SETI@Home.

Although it is a high-performance computing application, it can run in loosely coupled fashion across desktops around the world. You can get a lot more bang for your buck by not even putting up an interconnect. On the other hand, for tightly coupled applications, you need a pretty fast interconnect, very tightly coupled system.

So if you have a tightly coupled application and you take a fast processor and put a slow interconnect on it, that's a train wreck waiting to happen, essentially. You cannot get the system to really deliver any reasonable performance for you, because everyone is waiting for data to come in and they are spinning CPU cycles doing nothing. Well, not doing nothing, they are generating heat.

So how do you measure the performance of an interconnect? So typical metrics there: Low latency. So low latency is if you send a very small message, how much time does it take between the point you sent the first bit to when the first bit arrived at the receiver.

So for small messages this is important and there are applications that are latency bound. They only send small messages but they want them to get to the destination as quickly as possible. So low latency, the order of magnitude you are looking at is very low microseconds. The other thing you want is very high bandwidth.

There are a lot of applications that are bandwidth sensitive and you want several gigabits per second of bandwidth, 5, 10, even higher if you can get it. So in a good supercomputer design you have something, a metric called bandwidth to flops ratio and this bandwidth can be either memory bandwidth or interconnect bandwidth.

So your bandwidth is measured in, say, bytes or bits, it doesn't really matter. So let me take bytes as a simple example. So bytes divided by flops. How many bytes per second can your interconnect send, divided by how many flops does your processor have? So, can anyone take a guess at what a good bandwidth-to-flop ratio should be? Four. We have a guess of four. Any other?

[Transcript missing]

If you have all the money in the world and you want to do a good design. One.

Ten. I'm going to claim eight. Just picked a number out of my hat. Why eight? So if you look at scientific applications, they use double precision arithmetic. Okay? So the end of your mathematical computation, you're going to spit out one double precision number. That's eight bytes. Okay, each flop in there is spitting out 8 bytes worth of data.

You need to get that 8 bytes out of the system. Your memory bandwidth also follows the same curve. It depends on your architecture. If your architecture can load two doubles in one shot, then you want actually 16. If you can consume two doubles in one shot, you again want 16.

"One variable, correct, exactly. So essentially you need to be able to get the output of your flop out of the system and that's your B2F ratio. So given that, any guesses what really good interconnects do today?" memory bandwidth of 0.5 is considered very good. As an example the value for a PowerMac G5, you've got 6.4 gigabytes per second of memory bandwidth and you've got 16 gigaflops of peak. So it's little less than 0.5. In the case of a Xeon, a dual Xeon you would get something like 12 gigaflops for 3 gigabytes per second of bandwidth. So that's a quarter. A really good design. You would have much higher memory bandwidth essentially to feed your system.

The next thing that you have to worry about is: fine, I've gone and selected my interconnection network specification, and I want to hook up all my switches that form the backplane of my communication. Now how do I hook this up? What is the topology that I have to hook these things up in? So typically, you need to set up a backplane that can handle your peak bandwidth requirement.

And in a good design, you measure the capacity of your backplane by a quantity called bisection bandwidth. So bisection bandwidth, if you want to look at it intuitively: take your entire cluster. If half the nodes in the cluster want to talk to the other half of the nodes in the cluster simultaneously, can you do that?

How much bandwidth do you have in the backplane? Bisection bandwidth can be expressed as a ratio, or it can be expressed as a full number. In an ideal system, if half your nodes talk to the other half of the nodes simultaneously, you should have no blocking in the network. They should be able to do so. That's called full bisection bandwidth.
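
As a concrete illustration (numbers chosen for the example, not from the talk): in a 1,000-node cluster where each node has one link into the backplane, full bisection bandwidth means any 500 nodes can stream to the other 500 simultaneously with no blocking, so the bisection bandwidth is 500 times the per-link bandwidth.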

There are designs that you can do with half bisection bandwidth, hoping for some statistical multiplexing, where you say, "Okay, well, my scientific application is not going to communicate perfectly synchronously across the entire system. Maybe I'll get lucky." So if you look at modern interconnects, things like InfiniBand, they're connected over something called a fat-tree topology.

In a fat-tree topology, you actually have a mesh, and they provide excellent bisection bandwidth. You can easily build designs with full bisection where, if half the nodes talk to the other half, you have no blocking. In contrast, if you start using things like Fast Ethernet or Gigabit Ethernet networks, all you can do to connect between two switches is eight links. There's something called channel bonding that you use over there, and there's a specification called EtherChannel. EtherChannel lets you bond eight links together and make it look like one link.

Ethernet uses a tree topology. If you do layer two switching, you've got a complete tree topology. And all you can connect between one switch and its parent is one thick link. If you've got Gigabit and you put eight 1G links, you've got 8 Gigabit per second going from one switch to the other. That's your bisection.

If you've got 10 Gbps links, you've got 80 Gbps. That's not very much. As an example, on System X, our bisection bandwidth is something like 11.6 Tbps. So these numbers reach many, many terabits per second very quickly. The overall switching capacity is like 46 terabits per second. So you can do Gigabit Ethernet as long as you have loosely coupled applications and you're not very bandwidth hungry. Next thing that you need to look at is your storage subsystem design.

Typically, if you look at cluster designs, they use a large shared store simply because it's easy to drop software into it, common software that's used across all the systems. And also, it's easy to start up applications. There are some exceptions to it. The Earth simulator, for instance, doesn't use a shared store subsystem at all. In fact, it uses a local store. It's fairly interesting. The design criteria over here is, when you have to come up with a storage subsystem, you probably do it over a set of nodes. And these set of nodes comprise some kind of storage area network.

And the storage nodes that you have in the storage area network, they run a parallel file system across them, and this is the typical design. So what kind of things do you have to look out for? First, you need to worry about the capacity of your storage system. Then you need to worry about how fast you can access your storage system.

So for the capacity of the storage system, the first thing to do is look at your application. What are the needs of your application domain? If you don't have any numbers from your application domain, that is, you're having to blindly design this portion, here's a simple rule of thumb: whatever is the total memory footprint of your cluster, multiply that by 20. That will be a good starter system for a shared store. This is a fairly normal calculation there.
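
As a purely hypothetical illustration of that rule of thumb (the per-node memory figure is an assumption, not from the talk): a 1,100-node cluster with 4 GB of memory per node has an aggregate footprint of about 4.4 TB, so the 20x starting point would be roughly 88 TB of shared store.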

If you have a lot of disk access in there, then you need to worry about a bunch of other parameters. You need to worry about how does your storage area network hook into your cluster? So let's say you built your cluster with InfiniBand, and you've got a lot of bandwidth inside the cluster, and then you put your storage area network over gigabit Ethernet.

That would not be a good design necessarily, particularly if you're trying to get a lot of bandwidth. In your ideal world, you want to hook your storage area network into your cluster at as high a bandwidth as you can possibly get. Now, network bandwidth alone is not the criterion here. At the end of the day, these are nowadays electromechanical devices, right? You've got a motor spinning that's rotating a bunch of platters.

You can only access data in and out of the disk so fast. So the more spindles you have in there, the better access rate you get to disk itself. Your network might not be the bottleneck. You might have a great network, but have very few spindles in there.

You've got only 10 hard disks, for instance. There's only so much you can get out of the disk at the end of the day. You want to increase your capacity, you'll have to go up for more and more spindles in there. And sometimes you have to design for bandwidth, and capacity just comes as a result of that calculation. You don't have control over capacity.

And this is typically how you end up with interconnecting your SAN into your cluster. The thing to always remember is reliability is a key issue. If you've got a set of storage nodes operating, each of the storage nodes should at least be running RAID to recover from disk failures.

In environments where you have a shared store and there is a lot of disk access, failure is a fairly common thing. So you need to have RAID in there. And whatever parallel file system you're running across should also be capable of recovering from entire node failures. RAID will recover as long as you have disk failures. If the whole storage node crashes, you need a parallel file system that can somehow recover gracefully in that condition, instead of shutting off the entire namespace.

Next thing to worry about is software. So first thing there is compilers. C, C++, Fortran. I come from the scientific computing community side of things so these are the typical languages. Java is not very common in this domain. And wherever possible in terms of compiler choices, pick the vendor compiler. The vendor would be somebody who actually knows the processor family very well, maybe even the manufacturer of the processor.

Vendor designs typically for the compiler make use of a lot of the hardware features and they do much better code optimization. And this helps a lot in terms of pulling performance. The money is usually well spent. Think about it this way. If you have two compilers and one can generate code that's 25% faster, on a large scale cluster design, you're saving a couple of million right there. So this kind of calculation you'll have to look at and it might appear that the vendor compiler is expensive, few tens of thousands of dollars, but generally it's worth the money as long as it can show tangible benefits with respect to any other compiler you find.
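
To put rough, hypothetical numbers on that argument (assuming performance scales roughly linearly with node count): matching the throughput of a $5 million cluster whose code runs 25% faster would take on the order of 25% more hardware with the slower compiler, roughly $1.25 million worth, against a vendor compiler license costing a few tens of thousands of dollars.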

The most important communication library that you are going to look at is MPI, Message Passing Interface. This was standardized in the early 90s and this is the most common messaging layer that is used across supercomputer designs. In fact, the design of MPI was probably one of the biggest reasons why we are talking about supercomputing here today. Prior to the arrival of MPI, every machine was programmed differently. Each machine had its own message passing layer and codes were not portable. You couldn't write code on one supercomputer and expect it to run anywhere else unless the anywhere else happened to be the same manufacturer's machine.

Codes were not portable. You had to write to a machine and there were very few people who could do that simply because there were very few people who had access to these machines. So MPI was sort of the first step along the democratization of supercomputing. A lot more people got access to it.

And there are several versions available: MPICH, LAM, MPI Pro, and there is a new one coming out called Open MPI. Sometimes you might have your interconnect vendor give you a specific version of MPI, and this might have hardware acceleration features. They would have drivers inside MPI that very efficiently use the hardware capabilities of your interconnect and might be able to deliver a lot better performance.

The next thing is something called a queuing system. So unlike running applications on typical desktops where you just type the name of the application and it fires off, supercomputers are governed by batch queuing systems. So you have your application, you submit it to a batch queuing system. The batch queuing system then figures out how many resources you need, what kind of resources you need, are the resources available, and then it generates a priority on top of it and then decides at some point to run your application. So applications go into the batch queue, they execute for some time, come out.

Typically, supercomputing applications are sort of autistic. They don't talk to the outside world. They're not interactive applications. They're fired off in batch mode and results come back. That's basically how it operates. Essentially, the scheduling system over here is governed by the administrative policies that govern the use of the supercomputer that you use. And those set of policies are applied on top of the jobs and that tells you exactly when your application is going to run.

Apart from the queuing system you might also have management software, where the job of the management software is to analyze the health of the system, generate logs, in some cases do notifications, call your pager or call your cell phone. There's a lot of notification capability that's built into several pieces of management software available.

The other thing that you might want to consider is if you're running your supercomputer in a data center environment at an IT infrastructure, your IT infrastructure might already be running some form of data center wide management system, which might have its own notification. You might want to hook these two things together so you have a common access platform. You don't want to run multiple management systems across domains that you don't need to.

The last choice that you have here in terms of software and a major one at that is what kind of operating system do you plan to run? And this choice can make quite a bit of a difference. OS performance is critical in running HPC applications. So in the ideal world, an HPC operating system should be so thin that it disappears. Your application directly gets to the hardware.

There is no cost to running an operating system. All it provides is a very simple set of primitives that the application can use. Very small footprint. You really want OS to disappear. And there are some radical examples of this. The operating system that runs on Blue Gene, which is a machine from IBM that's going to be delivered to Lawrence Livermore, I think, later this year. Very large-scale system.

It runs a very, very thin microkernel. And that's all it runs. It's a single-tasking machine. You can't do fork. You can't run two applications on the node. It's a radical example there. And then there are Cougar and Catamount, which run on the Cray Red Storm series. Again, very, very thin microkernels. Single-tasking. In many cases, they have no virtual memory either. The physical memory is all you get.

So why? Why do we want to create a very thin operating system? If we look at a cluster design, each node of the cluster runs an independent copy of the operating system. So in our case, for instance, we have a thousand nodes. A thousand node system, you've got a thousand copies of the operating system. More importantly, you've got a thousand copies of the scheduler.

And this is a problem. The reason this is a problem is let's take a look at something called a block synchronous code. So in block synchronous scientific applications, they execute some piece of a model. Okay? And they execute one time step of it. At the end of the time step, everybody stops.

They all make sure they've gotten to the end of the time step, then they begin the next time step. This particular operation that enables all of the nodes to stop and synchronize is called a barrier. And typically a barrier, if everything is working well, executes in a few microseconds.

However, if one of the schedulers, one of the thousand different schedulers running on the system glitches, glitches meaning it runs some other process instead of yours, what happens? The barrier doesn't run in microseconds. It runs for the entire scheduler interval, which can be as large as 10 milliseconds.

So now if your time step is very small, your time step itself executes in only like 500 microseconds, and then the barrier takes 10 milliseconds, it doesn't matter how big a machine you built; the rate of processing is going to be the inverse of your scheduler interval, and this happens. The interesting thing is, this is not just a theoretical curiosity. This happens a lot. This takes 20% out of your system.
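
A minimal sketch of the block-synchronous pattern being described (the time-step function is a placeholder, not code from the session):

    #include <mpi.h>

    /* Placeholder for one time step of the model. */
    static void compute_time_step(int step) { (void)step; /* ... numerics ... */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* Every node computes one time step, then all nodes synchronize at a
         * barrier before the next step. Normally the barrier costs a few
         * microseconds; if one node's scheduler glitches, everyone waits up
         * to a full scheduling quantum (~10 ms).                            */
        for (int step = 0; step < 1000; step++) {
            compute_time_step(step);
            MPI_Barrier(MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }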

So what do you do to solve this problem? You synchronize all the schedulers, such that they all do context switching at exactly the same time. This is a solution called gang scheduling. The problem is you need to have a clock that runs very accurately across your entire system, and the cost of that clock can be very high.

You might have to get a GPS clock, put it on each node, and query it to be able to make a scheduling decision. In a cluster design, doing gang scheduling is not so easy. The other trivial solution to this, which is what was picked up by the two extreme examples, is don't do multitasking.

Okay, so you never run into gang scheduling problems. That's an interesting solution, except for the fact that once in a while you do want to do multitasking, when your monitoring or system management daemon wants to collect some resource information and report it back. So in this case we've actually worked on something which I won't talk about.

It turns out microkernel designs, you can take a general purpose microkernel design and be able to whittle it down such that it only runs a small set of processes and you can control exactly when context switching occurs across the entire system. There's a fairly elegant way of solving this gang scheduling problem.

The other question, problem that you have with cluster designs is basically reliability. Take a large scale cluster. 1,000 nodes, 1,100 nodes, 2,000 nodes. You've got a fundamental problem there. The number of components is very large. So let's say each of your nodes, you go and pick the best node design that you can find.

Something known for reliability. Okay? So you're going to pay a premium for this. But the premium might be worth it. So let's take the extreme case of it. Six sigma reliability. Three sigma after that comes at a very, very high price point because the quality control is very expensive.

You'll have to watch every single component that goes into the design. So six sigma reliability. So you buy a node that crashes once a year. That you would call a pretty good node. It crashes once a year and comes back in like a minute. Once a year crash, 1100 node system, we have 1100 crashes a year. That's about three a day.

These are the kinds of statistics that you get on large scale clusters. Two to three times a day you lose something. Now, in a typical desktop environment, take IT infrastructure inside a university or an industry, you might have a few thousand machines. If one of them crashes, it doesn't really matter because all you lose is whatever was running on that machine. Very graceful degradation. In tightly coupled parallel applications, you lose one node, you lose the whole thing.

And this is a big problem. So we've been working on a solution called DejaVu that builds transparent fault tolerance directly into the application: it analyzes the application, instruments it directly, and any failure from the system, the application doesn't see it. It recovers transparently from it. So this is funded by the National Science Foundation and we got a patent on this. And it's being commercialized this year. Hopefully later in the year it will be available through a company called California Digital.

This question is deployment. So now you've done your analysis on your cluster, you've figured out exactly what your design should look like, now you want to build it. The first thing you've got to worry about is your team. That is your most important asset over there. You need a fairly large, fairly diverse group of skills in this team.

The team generally is kept fairly small. But if you're trying to do a large scale deployment, you've got a whole slew of issues that you've got to worry about in the deployment, and I'll talk about them. And you need people with very specific skill sets to be able to address each of those deployment issues.

One or two people really won't be able to do it, unless they're a true know-it-all. It doesn't happen. The kind of knowledge that you need is very specific knowledge; you're not going to find people that can span the entire breadth over there. Your project plan can run to hundreds of elements. In our case, it was something like nine double-sided pages at like eight point font. And this was the project plan. These things are all gated.

One of them slips and you have trouble. And then you've probably got 10 or 20 elements running simultaneously at any given point in time. And you've got to be able to manage this whole thing to make sure it delivers correctly. So you start off with a core team, kept fairly small, and it's responsible for the overall logistical planning of how this deployment is going to go about.

Your next team is physical infrastructure. It consists of two components. Your IT team. Your IT team is very critical because they generally have the experience not only for deploying but also managing large-scale production systems. So having a good IT team is absolutely critical. And the second thing you're going to need is physical plant personnel for site reconstruction. You're going to need renovations into your center, particularly if you're planning to do a very large-scale deployment. You need finance people. There are lots of components you're going to be ordering in here. You need to keep track of everything.

You'll need PR people if you're doing something that's interesting and worthy of everyone else taking note. And vendor liaisons. You'll have to have dedicated people that can liaise with each vendor. You're typically going to have anywhere from 8 to 20 vendors, independent vendors, all of whom have to match your schedules. Delivery times have to be coordinated, so it gets fairly complex.

So in terms of facilities, first thing is a data center that has a raised floor. It's very useful. Typically, you need to run power cabling below the raised floor. You don't want to run a lot of power cables on top. In our system, we had something like 432 power receptacles. Running the cabling for all of those power receptacles would be a nightmare without a raised floor.

And if the raised floor is sufficiently high, several feet, you can also run your network cabling below it. There are some good examples of that kind of design. Earth Simulator's raised floor is six feet. They've got a whole floor below. It's very easy to do network wiring then, but then it causes other problems, which I'll talk about.

Typically, network cabling is easier if you try to run it on trace overhead. The other thing you have to look at is how do you locate your machines? What is your aisle design? Typically, what you do is you have aisles, what are called hot aisles and cold aisles. So in hot aisles, you have machines blowing hot air, which is sucked out. And in cold aisles is where you force in cold air.

Typically, machines cool front to back. And in the first row, your machine cools front to back. The next row is reversed. So it's blowing hot air into the same first aisle. The aisle after that, both the fronts of the machines are facing. You pump in cold air. You do hot aisle, cold aisle design.

There are other designs that you can do also. You can do star kind of topologies where you have cold air coming out and hot air exiting out this way. Even vice versa can be done. But typically, hot aisle, cold aisle is the most common design that you'll see in supercomputers.

I cannot stress that first part strongly enough. You need a loading dock. If you have a facility without a loading dock, forget putting in a large-scale cluster. Putting in a small-scale cluster itself will be a problem. There are nightmares, everything from your rack itself not fitting in an elevator to not being able to wheel things in. This can become a very big problem very quickly. Fire suppression. This is necessary. And if you can find a solution for fire suppression that doesn't destroy your machines, that will be useful.

Because it's generally a devil-and-the-deep-blue-sea decision that you have to make over there. Water sprinklers save the building, not the machine. Physical security. Security inside these machines is generally kept very low for management purposes. You want very good physical security and you also want very good network security coming into this system.

Power is a major issue. Budget for sufficient power. The way you do it is take your machine, do load tests on it. No load is when the machine is on but not computing. Full load is when the machine is on and computing at peak. So find your most compute intensive application. There are some good ones out there. Run it and let it chew up as many CPU cycles as it can.

If you've got I/O devices, run the I/O devices also. That gives you expected load on the machine. Power load from each node. Not only do you have to worry about power load from just the nodes alone, you have to worry about power load from your communications and more importantly you have to worry about your power load from cooling.
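
As a purely hypothetical illustration of that budgeting exercise (the wattage figure is an assumption, not from the talk): 1,100 nodes measured at, say, 250 W each under full load is about 275 kW of compute load; the interconnect switches add more on top, and the cooling plant needed to remove all of that heat draws a substantial additional fraction again, all of which has to fit within the power, and the redundant power, you provision.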

So remember, at the end of the day, whatever power you're going to feed into the machine is going to come out as heat. So you need a cooling system to take the heat out and that takes power again. So all of this has to be budgeted into your power calculations. And power redundancy is very important here. If you power cycle a node, a perfectly healthy node that's had an uptime of like one year, there's a very high chance it won't come back up again.

So don't power cycle a node if you don't have to. And typical cases of power cycling are the unintended ones, brownouts, lightning strike next to the building. If you lose power, you could end up with a lot of your nodes not coming back up. So typically you can do two-way redundancy with dual main feeds. You want them coming from different substations so you're not dependent on upstream, and be able to switch between them.

Three-way redundancy would have a UPS backup behind it. And for a large-scale system, that UPS can be very large. In our case, it's like a megawatt or so. And after the UPS, you might even want to go four-way redundancy, with a diesel generator backing up the UPS. And in this case, remember, budget for enough redundant power to power not just the machine, but also the cooling system. 'Cause if you have a large-scale cluster and you've budgeted only enough redundant power to bring the machine down gracefully, without the cooling, it won't come down gracefully.

The system is so large that by the time you can complete your shutdown, with the cooling system off, the temperature may already be high enough that it just crashes nodes. In a large-scale design, temperatures rise very quickly. We're talking seconds: turn the cooling system off, and within seconds your room will be reaching well over 100, 130, 150 degrees.

Cooling capacity is very important, and you need to analyze this very carefully. Large clusters produce phenomenal amounts of heat; you're talking millions of BTUs. And if they're very tightly packed, that is, they take up a very small amount of square footage, then the heat density is very high. You can use typical air cooling to handle low heat densities, but with high heat densities this becomes a very big problem, and I'll talk about that.

Don't plan for cooling and you'll have a really good heater, and that's all the machine will really be capable of doing. The side effect is thermal degradation, and this is insidious because it doesn't show up as a clear-cut phenomenon; instead you get random instability in the machine. Nodes will just crash. And you're going to go back and blame your node manufacturer, saying you've just given us really poor nodes.

It's actually not the node manufacturer's problem. It's the fact that the cooling inside that facility is not enough to keep the system stable under its thermal load. And eventually what happens is your processor lifetime is shortened in the process. Stability, interestingly enough, with respect to heat is not linear, it's exponential. So if you cross certain temperature thresholds, your machine can be eight times more stable than before.

So it's very important that you try to keep your cold aisle well below the manufacturer's specification. In our case, we keep our cold aisle at around 70 degrees Fahrenheit coming in, and our exit air is only barely above 80. So it's a fairly well-balanced system, and you want to keep these kinds of numbers.

And depending upon your arrangement and your heat density, air cooling might not work for you. See, one of the problems with air cooling is it uses air. Air has very poor specific heat. If you're trying to blow a lot of air, it loses its cooling capacity very quickly in a room.

And you want to be very careful about how you cool it and how quickly air comes out of your intake aisles. If you have very high heat densities, if you're looking at eight kilowatts or 16 kilowatts of heat per rack, then you should be looking at liquid cooling solutions. They're a better way of getting heat out of the system.
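
To get a feel for those heat densities, here's a small sketch converting per-rack electrical load into the cooling load an air or liquid system has to remove. The 1 kW to roughly 3412 BTU/hr conversion is standard; the 8 and 16 kW figures are just the ones mentioned above.

```python
# Convert per-rack electrical load into required cooling (BTU/hr).
# Essentially all power fed to a rack comes back out as heat.
BTU_PER_HR_PER_KW = 3412  # 1 kW dissipated is roughly 3412 BTU/hr to remove

for rack_kw in (8, 16):  # heat densities mentioned above
    btu_per_hr = rack_kw * BTU_PER_HR_PER_KW
    print(f"{rack_kw} kW per rack -> about {btu_per_hr:,} BTU/hr of cooling per rack")
```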

And finally, you want to look at system assembly. The scale is just overwhelming in here. You're going to have thousands of nodes. You're going to have communications interconnects. You're going to have communications cables. If you have more than one communication fabric, you're going to have even more cables. All of this has to be managed.

Pipeline everything. Assembly lines work very well, and there are certain elements of the assembly line that can actually be parallelized. There are a lot of steps on the hardware side, everything from unboxing onwards, and unboxing can actually become a pretty big deal: if you have 1,000 boxes sitting in your warehouse, unboxing itself takes a long time. After you unbox, you test and burn in right there to make sure that you didn't have any shipping problems.

Rack your nodes, and then do your cabling: network cabling, power cabling. You could have thousands of cables running through here. It has to be done carefully so that you don't have potential interference, you don't have cables stretched too taut, or, in some cases, cables that are even too long.

In terms of software, you need to create a node image. All your nodes in the cluster run exactly the same copy of the operating system with pretty much the same image. You need to create an image and be able to blast this image to all the nodes very quickly.

You want an automated way of doing this. The last thing you want to do is put a CD into every node and start installing the operating system. Even at scales of four, it becomes a problem. At anything beyond four, this is a huge time waste, particularly if you have to upgrade your operating system.
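
As a sketch of what automated imaging might look like, here is a minimal Python driver that runs the same imaging command on a list of nodes in parallel over ssh. The host names and the imaging command are placeholders made up for illustration; the real tool and arguments (for example Apple Software Restore on Mac OS X, or a network-boot based installer) depend entirely on your deployment setup.

```python
#!/usr/bin/env python
"""Minimal sketch: push the same imaging step to many nodes in parallel.

NODES and IMAGE_CMD are hypothetical placeholders; substitute whatever
imaging mechanism your cluster actually uses.
"""
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node{i:04d}" for i in range(1, 17)]      # hypothetical host names
IMAGE_CMD = "sudo /usr/local/sbin/reimage-node.sh"  # hypothetical imaging script

def reimage(host):
    # ssh to the node and run the imaging command, capturing the exit status
    result = subprocess.run(["ssh", host, IMAGE_CMD], capture_output=True)
    return host, result.returncode

with ThreadPoolExecutor(max_workers=8) as pool:
    for host, status in pool.map(reimage, NODES):
        label = "ok" if status == 0 else f"failed ({status})"
        print(f"{host}: {label}")
```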

And then power up the system gradually. You might want to do a sectional bring-up: diagnose a section before you go to the next section. One of the things you can also do here is use benchmarks to burn in the system. Because the system starts up and immediately computes at peak load, a benchmark tells you very quickly if something's flaky.

And finally, open up to friendly users first, because believe me, not all the bugs have been worked out. Open up to users who understand that these systems are very large, very complex, and potentially unstable. This is like a NASCAR in some sense. This is a highly tweaked car. It's not meant for everyday driving, really. It has its characteristics that you have to work with. And work out all your kinks before you go into production mode with full-scale users.

Some perspectives on the future. Our system was built for about $5.2 million. With this kind of budget, supercomputing is really within reach of not just academia but also industry. $5.2 million is not really that large a budget; universities run annual budgets of well over 500 million.

So this is not a particularly large budget. This might be the beginning of a paradigm shift. If you look back to the 70s, you had these glass house mainframes. A whole university could afford one if it was lucky. There was a famous statement attributed to Watson that the worldwide demand for computers, as estimated by IBM at the time, was six.

And that was true in the sense that at that time the machines were so expensive that only entire countries or states could afford one. But then the minis came about. They were a lot cheaper, and a lot more people had access to them because universities could buy them; in some cases even departments could afford them. And this started a large body of computing research, results from which made the mini better and better and better and led to the PC.

Essentially, a cycle starts where the machines get more stable, results go back to industry, and more units get built. And the cycle keeps continuing; it gets cheaper and cheaper at each iteration. And this is what you want to start. Today, supercomputing is considered exotic, and programming methodologies are still in their infancy, simply because not enough people have been able to address the problem: there are not that many of these systems around, and each one of them is in very high demand.

So how do you make this a commodity? Today, if you in the enterprise want to deploy a supercomputer, you do it in-house, because you believe there is a lot of technical knowledge that goes into building the system. You would not really trust a data center to do it.

But that's going to happen. When it becomes a commodity, this is something that you would look at and say, "Why should we be spending money and wasting IT resources on putting up supercomputers? Let's have data centers come up and let them handle all the hassles of building, maintaining, upgrading these things.

We'll just use the cycles." How do we get there? In terms of architectures, we are looking at something interesting. There are some emerging processor designs that make it possible. So if you look at processors today, many of them support vector designs. In fact, one of the best examples there is the Velocity Engine, the AltiVec, on the G5 processor. It's a full vector implementation. It also has a scalar component.

So most of you are used to vectors from the image processing side of things. From Photoshop onwards, all of these are optimized pretty heavily to run on the AltiVec, and that's where you get your really big performance difference, because the vector is highly efficient for these operations. If you were to run that on the scalar side, your performance would be a lot worse. And there is a true vector implementation. Now, for scientific applications, you need something called double precision floating point. The AltiVec today doesn't support that.

But a processor design that supports a double precision vector... That's pretty balanced right now. Both of them are bottlenecks, but at least the bottlenecks are balanced, which is pretty important. Because if you're looking at gigabit Ethernet, that's 125 megabytes a second. If you get out on the network at 125 megabytes a second but you get data from memory at 4 gigabytes a second, you've got a problem.
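
Just to put numbers on that imbalance, here's a tiny sketch using the two figures quoted above; nothing here is new data, it's just the ratio.

```python
# Bandwidth imbalance using the figures quoted above.
memory_bw_mb_s = 4000   # roughly 4 GB/s from memory
gige_bw_mb_s = 125      # gigabit Ethernet, about 125 MB/s on the wire

ratio = memory_bw_mb_s / gige_bw_mb_s
print(f"Memory can supply data about {ratio:.0f}x faster than gigabit Ethernet can move it")
```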

But it's here, and this opens a door, because there are some new versions coming out which have low latency as well, which means you can take a cluster architecture and, in software, create a shared memory machine that rivals a hardware shared memory machine. So basically our goal here is an architecture that, using a basic cluster design, does both vector and scalar, and both distributed memory and shared memory. One unified architecture: you should be able to run codes of any flavor on the same system.

The biggest thing being that you keep the price/performance of clusters, which is very low today. The other thing that we need to work on is programming models: how to make these systems much more accessible to a far larger community than the one that currently uses them.

And at Virginia Tech, a little bit of the future. System X is our current system, and it's just the beginning. We have a long-term interest in the field of computational science and engineering, and we are recruiting pretty heavily there. Our current system will be followed by another one that will probably start in '06 and deploy through '06, essentially. And I know I'm going to get asked this question: what happened to it? The Xserve upgrades are going on; those are a bunch of the machines right there.

This is the handiwork of the IT group. Beautiful cabling, absolutely beautiful cabling. And every row looks identical; you can't see cables anywhere, everything very neatly tied up. High speed communication comes in on one side, power and the low speed communication on the other side. How many racks of this are there? 36 racks, plus, I think, what, two more in the third aisle, right? Thank you.