Enterprise • 1:12:13
This session provides an in-depth look at Mac OS X technologies and applications for optimizing all aspects of software performance specific to a computational cluster. Information covers the OS services, job schedulers, optimized libraries, cluster profiling tools, and third-party cluster computing development frameworks like GridIron's XLR8.
Speakers: Warner Yuen, Steve Peters, Josh Durham, Dean Dauger, Steve Forde
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
I'm Warner Yuen and welcome to this afternoon's session on HPC software optimization. So once again, a lot of people are always wondering about Apple and high performance computing, and in reality what they think about is this. But the reality is there are a lot of great tools for building high performance computers with Apple hardware. And today in this session we'll be talking about some of these things, specifically what's new at Apple for high performance computing, some of the things for preparing your Apple hardware for high performance computing, and a great introduction on how to write parallel code.
So we don't have time to show you all of the cool new Apple tools, but I wanted to give a plug for the new CHUD performance optimization tools. Specifically, tomorrow there's a great session called Got Shark that talks about performance profiling. High performance compilers: this is what has really allowed us to play in the high performance computing world. Specifically, new this week is the 64-bit compiling ability with GCC in your Tiger previews.
In today's session, what we'll really get into is the performance benefits of using Mac OS X's Accelerate framework for high performance computing. And we'll have a presenter talking about streamlining the OS services for performance. Another person will be coming up to talk about what options are available for building high performance computers with Apple hardware. And then we'll have our introductions on creating parallel code with both MPI and also a next generation parallel computing developer framework.
So who's talking today? We have some really great speakers. We have Steve Peters from Apple's Vector Numerics Group coming up to talk about Mac OS X. We'll have Josh Durham from Virginia Tech, who will come up and talk about the use of Mac OS X for high performance computing. And Dean Dauger from Dauger Research will come up to talk about writing parallel MPI code. And finally, we have Steve Forde from GridIron Software to talk about a next generation parallel development framework. So with that, I'm just going to go to our first speaker, Steve Peters.
I'm going to tell you about the mathematical facilities in Mac OS X, sort of the base of much of the scientific high performance computing that's available for Mac OS X. The agenda today is to survey the APIs that we ship with Mac OS X, tell you about new Tiger APIs, show a little bit of comparative performance to let you know that we're, I think, leading the field in sort of this price space anyway, and then reinforce the mantra to first check out our frameworks when you're interested in getting math performance out of the machines.
So we start out with chips that are IEEE 754 compliant, like much of the industry; this is what gives us our substrate in math. Then we layer on top of that libm, a C99-compliant library full of the elementary transcendental functions that everyone knows and loves: sine, cosine, square roots, and so forth. We present these in single, double, and long double precisions, in the complex and real domains.
For linear algebra, a place where many vendors add value, and so do we, we take the basic linear algebra subroutines as shipped in the ATLAS open source package, do some additional tuning, and ship that as part of our Accelerate framework. So if you're looking for any of the BLAS, look first to the Accelerate framework, where they've been closely matched to the hardware on each of our platforms. And those come in all the familiar flavors. And layered on top of that is the gold standard of dense numerical linear algebra solvers, LAPACK. And again, in all the familiar flavors.
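To illustrate, here is a minimal sketch (not from the session) of calling the tuned BLAS through the Accelerate framework; the file name, matrix sizes, and compile line in the comments are just assumptions for the example.

    /* A hedged sketch of using the CBLAS interface shipped in Accelerate.
     * Compile with something like:  cc dgemm_demo.c -framework Accelerate */
    #include <Accelerate/Accelerate.h>
    #include <stdio.h>

    int main(void)
    {
        /* C = A * B for small 2x2 matrices, stored row-major. */
        double A[4] = {1, 2, 3, 4};
        double B[4] = {5, 6, 7, 8};
        double C[4] = {0, 0, 0, 0};

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,        /* M, N, K       */
                    1.0, A, 2,      /* alpha, A, lda */
                    B, 2,           /* B, ldb        */
                    0.0, C, 2);     /* beta, C, ldc  */

        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
        return 0;
    }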
When we begin to talk about performance, we're looking to the G5 as our flagship. It's really gotten us into a really interesting place in the HPC space. And in my opinion, for scientific work, it's the dual floating point cores that have taken us there. So for every 970 CPU, you get two floating point cores capable of doing double and single precision IEEE arithmetic.
The CPU can dispatch to both of those floating point instruction units on every cycle. It can start a new operation on every cycle. All the basic arithmetic operations are present, as well as hardware square roots. That's new to PowerPC, as we've seen it anyway, and good for us. The G5 also offers a class of instructions called fused multiply-add.
These are three-operand instructions, basically. It multiplies the first two together, adds the third, and does that as one machine instruction, saving a rounding and finishing the job in a smaller number of cycles than a back-to-back multiply followed by an add. So fused multiply-add fuses those together. Why do we make a fuss about this? There are some people I've heard who say, "What's the big deal about fused multiply-add?" Well, it's fundamental to linear algebra.
It's the dot product, the essential piece of the dot product. And that is fundamental to matrix multiplication. It's a big part of the fast Fourier transform. The butterflies are essentially fused multiply-adds, multiply-adds which we can fuse. And if you're doing function evaluation, say by Horner's rule, you'll arrange polynomials in a way that can take advantage of fused multiply-adds.
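As a small illustration (not part of the talk), a Horner's-rule evaluation in C is nothing more than a chain of multiply-adds, which a compiler targeting the G5 can map onto fmadd instructions:

    /* Evaluate c[0] + c[1]*x + ... + c[degree]*x^degree by Horner's rule. */
    static double horner(const double *c, int degree, double x)
    {
        double r = c[degree];
        for (int i = degree - 1; i >= 0; i--)
            r = r * x + c[i];   /* one multiply-add per coefficient */
        return r;
    }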
So a fused multiply-add counts for two flops, a multiply and an add. So we're credited with four floating point ops per cycle. On a modern dual G5, that gives us eight flops across both processors per cycle, and we get 2.5 gigacycles per second. So we peak out at 20 theoretical double precision floating point gigaflops, 20 gigaflops, on the new dual 2.5 gigahertz G5 Power Mac.
So how do you take advantage of this coming to the platform? Well, if you've already got compiled binaries for Mac OS X, bring them to the G5, our flagship. They'll immediately see some advantage from the ability of the CPU to schedule to both of those floating point cores. Recompile and you get even better performance because now the compiler knows there are two floating point units out there and can rearrange the order of operations in your code to take advantage of both and make some efficient use of the dual floating point cores.
And if you have an opportunity to think about your algorithms, you may be able to cast them in ways that can squeeze out a bit more performance. This kind of attention to detail we've paid in our libraries: libm, vForce, which I'll talk about in a moment, the BLAS, LAPACK, and our digital signal processing library, vDSP.
In the area of single precision floating point, we have a very formidable capability on both the G4 and the G5: the AltiVec SIMD processor. It's a four-way parallel single precision engine. Again, all the basic arithmetic operations and a vector fused multiply-add. We top out here at 40 gigaflops single precision on the new Power Macs. And there are some codes that can get fairly close to using most or all of that. Convolutions are very, very effective on that box.
How do you get to high performance on the AltiVec unit? You've got to work a little bit harder. You're really going to have to think your algorithms through and cast them in terms of parallel operations. We have some advice on the web about how to do that. But first of all, it's probably wise to profile. And here's another plug for the CHUD folks. They have wonderful profiling tools that will focus you on that 10% of the code where you're spending 90% of your time. Look there first.
Autovectorization is an option, and an even better option announced this week: GCC 3.5, available later in the year, will have autovectorization features. That's a good way to get into the AltiVec game. And finally, at the level of detail that gets really good AltiVec performance in single precision, we've already paid that in vForce, the BLAS, vDSP, and vImage.
How do you use these things? We try to make it straightforward, and try to hide at least a bit of the nature of the platform from your code. You call the API; we'll dispatch to the proper code suited to the underlying chip. libm, the math library, links in by default.
There's nothing special you need to do. If you want the long double facilities and the complex APIs, we have libmx, libm extended. That's a flag on the link line for GCC. And for our value-added library, the Accelerate framework, you simply specify -framework Accelerate. That gets you on the air. And of course we ship these performance libraries on every copy of Mac OS X that goes out the door. You can always expect to find them there.
Well, what did we do that's new in Tiger? We've added a library called vForce. It had been called to our attention that the C99 APIs for the familiar elementary functions were data starved on our machines. We were seeing bubbles in the floating point pipes, cycles going unused, and we hate to see those go by. And also, C99 and IEEE demand very careful attention to the rounding modes and the way exceptions are handled, and that adds quite a bit of overhead for APIs that are only processing just one operand at a time.
So the ideas in vForce were to pass many operands through a single call. For example, if you need 768 values of the sine of x, well, there's a call called vvsinf that lets you pass all of them in at once. We amortize the overhead and get back in a big, big hurry.
And in fact that code runs about 12 times faster than a naive loop calling the traditional sine function. There are some caveats here. You have to expect the IEEE default rounding mode, and you won't see any exceptions. We're expecting that you'll give arguments that are within the domain of the function, and so forth.
Using these ideas opened up a number of performance opportunities. In single precision, hit AltiVec. Hit AltiVec hard. That gives us four-way parallelism to begin with. In double precision, we've got two FPUs; let's make sure we schedule those effectively, do some software pipelining, fill up those bubbles with independent parallel streams of computation. And then we take great care in choice of algorithms to avoid branching, which is very tricky on pipelined machines.
Some caveats: we're generally as accurate as the traditional libm elementary functions, but we're not always bitwise identical. We handle nearly all the edge cases according to the C99 specs; plus and minus zero are the occasional exceptions. There's documentation to tell you where. We make no alignment requirements, but if you really want top performance, align to 16-byte boundaries; that lets our SIMD engines collect the data most efficiently.
We're tuned for G5, but we also run very well on G4 and G3. Here's what's in there, the inventory: some simple division-like functions, roots, exponentials, logs, and powers, trigonometrics, arc trigonometrics, hyperbolics, and some integer manipulation. How do you code to these things? Couldn't be simpler. The blue on the top is C, below is Fortran, and in orange the obvious command line compilations.
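For reference, here is a minimal sketch of what such a call can look like in C (the slide itself isn't reproduced in the transcript); the array size of 768 echoes the earlier example, and the alignment attribute is just one way to get the 16-byte alignment mentioned above.

    /* A hedged sketch of a vForce call: many sines computed in one shot.
     * Compile with something like:  cc vforce_demo.c -framework Accelerate */
    #include <Accelerate/Accelerate.h>

    #define N 768

    int main(void)
    {
        /* 16-byte alignment is not required, but it helps the SIMD engines. */
        static float x[N] __attribute__((aligned(16)));
        static float y[N] __attribute__((aligned(16)));
        const int n = N;

        for (int i = 0; i < N; i++)
            x[i] = (float)i / N;

        vvsinf(y, x, &n);   /* y[i] = sin(x[i]) for all N operands at once */
        return 0;
    }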
What else is new in Tiger? We updated to ATLAS 3.6 and did some additional Mac OS X-specific tune-ups. We get some LAPACK performance gain since it relies on those BLAS, and from some compiler advances. Here's a little performance chart showing in blue our matrix multiply performance and, in the cluster of orange, green, and burnt orange, the matrix decompositions: LU, the symmetric decomposition LL-transpose, and the symmetric U-transpose-U. Notice that matrix multiply tops out around 13 gigaflops on a new dual 2.5 gigahertz G5 Power Mac, and the decompositions, let's say 11.
Here's what the Xeon gets to, a 3.0 gigahertz Xeon running MKL 6: matrix multiply topping out well under 10, and the decompositions just about getting to 8. And finally Opteron. These are quite old numbers, from about last summer; they were topping out at about five and a half last summer.
Let's give them 50% since they've upclocked by that much. That would be perfect scaling and that takes them, well, not quite to eight I think on matrix multiply. So G5 cruising along near 13 and the Opteron eight, maybe nine and Xeon probably closer to 12 in the 3.6 incarnation.
All right, finally, we bring long double back to the Mac platform. It was present in Mac OS 9 and earlier. And the complex long double, and the C99 tgmath type-generic math. So you can say sine of a complex number and the compiler figures out what you mean in C.
Isn't that nice? I'm going to pass by this since I am running very close to my time and maybe just jump to the end here where I show on the left our elementary functions, number of cycles to get in and out of our library functions, in the middle column what the competition publishes for their x87,
[Transcript missing]
There are several documents out on the web on our developer site that can get you started with this kind of stuff. We've already had the Accelerate framework talk. We're in this talk. If you haven't seen the CHUD stuff, by all means, please go see it.
[Transcript missing]
Thanks, Steve. So today I'm going to do kind of a brief overview of Virginia Tech's System X, which I'm pretty sure most of you have heard of. It's the 1,100-node cluster at Virginia Tech. And then we'll go into some detail about some of the services in OS X that you can turn off, briefly go over what kind of things we did at Virginia Tech to kind of improve our benchmark scores, and very briefly I'm going to just kind of go into some of the management tools that we use at Virginia Tech.
So Virginia Tech's System X is 1,100 dual-processor Xserves. So that, of course, is 2,200 PowerPC 970s. Each Xserve cluster node has four gigs of memory, which gives us basically an overall 4.4 terabytes of system memory for the whole cluster. One of the things early on when we were deploying this was, well, they've got to be running Linux. Why would they run OS X? Or maybe they're running Darwin. But we really are running OS X. And it's the OS X that was shipped with it. We did a little bit of modification, which I'm going to kind of go into. But it is OS X.
Briefly, the interconnect that we use at Virginia Tech is InfiniBand, and we went with 24 Mellanox 96-port InfiniBand switches. So these switches give basically 20 gigabits per second full duplex per port, and that's basically 1.92 terabits per second of overall bandwidth for the switch. We got about 12 microseconds of latency, and that's across the entire network; about 8.5 microseconds of latency across just one switch.
We use a fat tree topology, which I kind of did a really rudimentary diagram at the bottom of the thing there of what a kind of fat tree topology is. And in our case it's a half bisection. And half bisection means that at any time if you have half the cluster trying to talk to the other half, you're guaranteed basically half the bandwidth, which in our case is basically 4.5. So that's about 5 gigabits per second.
In addition to that we also have a secondary gigabit network, which comprises six Cisco switches with 240 ports each. And we basically use that to do management and some file sharing to basically kind of get the system up and running and kind of do some of the administrative stuff on it.
Power and cooling can never be emphasized enough when you're looking at clusters, especially this size. So at Virginia Tech we're lucky to have this really wonderful computer facility that has basically three megawatts of electrical power, half of which is basically dedicated to System X, the cluster. So we have dual redundancy with our power, we have a UPS system, and we actually have a diesel generator which is pretty much the size of a diesel locomotive. It just sits back on a pad and it's gigantic. So as I said, about half that, 1.5 megawatts, is reserved for System X.
Cooling: we have basically 2 million BTUs of cooling capacity using Liebert's extreme density cooling. I'll have some pictures later and I can kind of point it out, but basically it's a rack-mounted system where it kind of blows cold air from above, and we also have floor cooling as well. So it uses standard refrigerant and overhead chillers. We were looking at different kinds of cooling, and we had a regular data center.
And the typical way to do that is you have air conditioners throughout the room that basically bring in air from the top, chill it, and kind of push it out throughout the floor. And then you basically put tiles in the right places to get that air. So we looked at trying to do that and if we did that we'd have had wind velocities of about 60 miles per hour underneath the floor. So imagine just pulling off one of those tiles, you get shot with that 60 miles per hour of wind.
So that's basically what I'm going to say about System X. One of the things I'm going to overview is some of the services in OS X, and we're going to turn off some of them to optimize it slightly for more of an HPC-type application. So OS X Server by default comes with about 40 processes, just with the regular default install.
And so why do we want to reduce the services? Well, one of course is that we want to free up resources like memory and CPU time. And it increases security somewhat; obviously, if you're not running things like a web server or something like that, you're not going to have to worry about securing that.
And it reduces the amount of time for the system to start up, which kind of lowers your mean time to recovery. So one of the things I want to emphasize, and I always use this analogy when you're turning off services, is that it's kind of like these guys that buy the Honda Civic. And they basically rip out the engine. They put a turbocharger on it, big spoilers on it. And they kind of soup it up.
So they kind of design it for their own purpose. And the problem with that is Honda's not going to do any sort of hardware support for you. And so the thing to keep in mind is if you're starting to turn off these services, this isn't something Apple's going to recommend you to do. This is kind of what we did here at Virginia Tech.
So basically I'm going to kind of step through the different places in OS X where you turn off services. And the first one is /etc/hostconfig, and there in orange are basically some of the services that run from /etc/hostconfig. We have things like cupsd, which is the printing service.
We have automount, which basically handles mounting removable file systems and network file systems. And we have crashreporterd, which sort of sounds important, but basically it's just for creating crash logs for GUI applications. And we have servermgrd, which, if you've ever used Server Admin or Server Monitor or any of those tools with the Xserve, is what those tools use.
So that's one thing I'd like to point out: if you want to use these tools, and they're great tools, you want to keep this service on. So basically this file has a list of services, and just changing a service from equals yes to equals no will disable the service the next time you reboot.
The next thing we do, and this is kind of blasphemy in the OS X world, is we're going to turn off the GUI. Otherwise we'd have 1,100 GUIs running, and there's no real need to do that. No one's ever going to actually see these GUIs. So the place to do that is in /etc/ttys. There's this very complicated line there and it needs to be commented out. Commenting out that one line pretty much is going to prevent the window server from running and the login window from running. Actually, this just does the login window.
The window server is in another place. Basically, in OS X there are all sorts of different places where these services get started. This one is in a directory called /etc/mach_init.d. The way I disable it, and this is just a personal preference, as opposed to just deleting the file or removing it, is that I create another directory and move the script into that directory. That way, if I change my mind later, I don't have to go find it or make sure it's the right thing; I can just move it back.
So the next thing that we turn off is the ATS server, which basically provides font services. Since we disabled the GUI, we're obviously not going to need font services on the system. That's another thing that gets run out of /etc/mach_init.d. So basically just moving that plist into the disabled directory is going to do that.
The next thing that we're going to hit, the next stop on turning off the services, is basically modifying watchdog. Watchdog is this process that basically monitors your system to make sure that processes are running. If they aren't running, it restarts those processes. Another thing that watchdog does that's really nice is that it enables the system to reboot if it crashes. So that's actually pretty nice in the HPC world because the system will come back up and hopefully rejoin the network.
It kind of reduces your downtime because it's almost like a self-healing kind of thing. So in /etc/watchdog.conf, we're going to disable two services that we don't need. The print service monitor, obviously, since we're not going to be printing from the cluster nodes. And master. Master, I didn't know what it was when I first started, but it sounded really important. It's actually just the main server for the mail server. So we turn that off.
One more thing that's in there is hwmond, which is the hardware monitor. And so basically this thing is polling every, I think, five seconds by default. And basically it's just keeping track of your hardware. So it keeps track of all your fans, your temperatures throughout the system, and just kind of records that.
And it can also send notifications, stuff like that. So I thought every five seconds is a little too much. So we kind of bumped that up by adding -s 60. And that's going to make it only run once a minute. So that kind of reduces the CPU overhead of this one service.
The next thing I turn off is mDNSResponder, and I have a feeling that this is not going to be something we're going to be able to turn off as more and more things start to rely on Rendezvous. So we have things like Xgrid, and if you plan on using Xgrid you don't want to turn this thing off, because Xgrid is going to use Rendezvous to find other compute nodes. And if you ever want to use the distributed compile option in Xcode, it also uses Rendezvous.
So if you don't plan on doing anything with Rendezvous, this is definitely something you can turn off. It will reduce the amount of network traffic it sends out and a little bit of CPU overhead. So this one basically has a script in the /System/Library/StartupItems folder, and basically I just comment out the line that starts it.
So in OS X there's lots of different places to kind of look for services and you kind of see there the grayed out services that we kind of went through. And there are some things on there that people disagree with that either need to stay or some things need to go.
Like I leave the time server on there because I think that's important for the cluster that I run. I leave cron turned on because we actually use cron to kind of do things every so often on the cluster. But some people can turn that off and not have any issues with that.
So I'm going to talk about the LINPACK optimizations and the kind of things we did at Virginia Tech. LINPACK is basically the benchmark that's used for the Top 500 list. We were number three in the world in November. And the way this is established is we have to run this benchmark called HPL.
So, this was about a year ago, and this was before a lot of the optimizations went into the Accelerate framework. So we had a person in Japan named Kazushige Goto, and he did basically some assembly-level optimizations on the major subroutines, basically the DGEMM subroutine. I have a website there for more information if you want to kind of look at his optimizations.
One of the things that we had to do, though, is we had to kind of write our own memory manager, because the BLAS routines he was writing did a much better job if they were guaranteed a contiguous amount of physical memory, as opposed to having it get segmented or partitioned.
So with those optimizations, we actually had about a 10% increase over the Apple vecLib at the time. Remember, this was using Jaguar, so we didn't have Accelerate; we were still using vecLib. So with the optimizations and some of the tweaking that we did at Virginia Tech, we actually got 10.28 teraflops, which was the third fastest in the world. And without those Goto optimizations, we probably would have gotten around 8.4 teraflops, which on that list probably would have put us around fourth.
Very quickly, I just want to kind of go over some of the system management stuff. And I can talk for maybe two hours or five hours or 12 hours on this. It's something that I do a lot of work with. The tool that I love for system management is called Ganglia. And I know like BioTeam uses it inside their package. And what's really great about Ganglia is it runs on each system. It kind of just gets system status and kind of broadcasts that out on the network.
So by default it has a couple of displays and I have a few of their displays at the top. Like at the top there's a cluster load percentage. So it's kind of really great. You can see what's going on with your 1100 systems. You know, you kind of get a, take a step back and be able to see what's really going on in the cluster. And what I love about that is you can drill down. So we have that big cluster overview, but you can drill down and look at like a specific node. And what I love about it is that it's XML data. So you can parse that XML data.
So we at Virginia Tech made a, basically a kind of a custom display there that kind of shows us a physical representation of what our cluster's doing. So we can see if a CPU's kind of doing something weird or if we can look at temperatures and loads. Kind of get a physical view of it. And it really helps with just quickly discovering what's going on with our system.
So, the things I talked about, of course, were overviewing our System X, reducing the number of services, what we did on the LINPACK scores, and some of the management features we did. So people, of course, if you went to Dr. Varadarajan's presentation yesterday, you probably saw some of this, but people keep asking us, so what's going on with System X? So, you know, we dropped off the list, and it's because we swapped out our Power Macs and we're upgrading to Xserves.
So I can say that people are very hard at work installing systems and we have about 850 in. And so there are some of the racks that we have. And, you know, one of the things that is really interesting is that we're using basically a third of the space that we did with the Power Macs. So we only have one aisle where we can do all the cabling.
And so it gets kind of crowded. And I don't know how many people are in that picture, but that's a small space for a whole bunch of people. And that's basically us doing the wiring in the background. We have to wire the Ethernet, do the power, and run the InfiniBand. So there's quite a bit of cabling going on. So with that, I'm going to introduce Dean Dauger from Dauger Research.
Thank you, Josh. Yes, so let's see. It's definitely a pleasure to be here today and to be speaking to you. I very much appreciate the kind people at Apple inviting me to come out and talk about plug and play clustering and how you can build your cluster in minutes.
And so what I'd like to go over first is an outline of what I'd like to talk about. And first of all, why parallel computing, why parallel computing was interesting to do, and what we did to go about inventing or essentially reinventing the cluster, inventing the Mac cluster, and an introduction to basic message passing code. And then a description of how you can build your own Mac cluster, and hopefully if the demo gods are kind to us, I can show you what we can do with a Mac cluster. Thank you.
So why parallel computing? Really, parallel computing is good for problems that are too large to solve in one sense or another on one computer. The simple reason of simply taking up too much time, too much CPU time, but also in some cases, or in many cases I know, it requires too much memory.
Some problems can easily outgrow the RAM capacity that's available on a single box. And I know codes that run 15 billion particles and it has to keep all that data all in RAM and so multiply that by however many dozens of bytes per particle and you can see that's quite a bit of memory space.
So the other thing that's happened in the last decade or so is that the programming API has become standardized on what's known as the Message Passing Interface, also known as MPI. It's a specification that was established in 1994, and by the end of the 90s it became the dominant software interface that's available at supercomputing centers such as the San Diego Supercomputer Center.
As well as NERSC and also on many cluster systems. And so this development enabled the possibility of having portable parallel code, code that's portable between the supercomputing centers and the clusters, in both Fortran and C, by using MPI. And that's been a real benefit to scientists and many other users of such systems.
So to give you an idea of some of our experience, this is a current picture of the UCLA Physics AppleSeed cluster, established in 1998. And as you can see, we use a mixture of G5s and G4s connected with a fast switch. And we are running on a mix of various versions of OS X, as well as OS 9. So we're able to mix and match nodes, older and newer hardware.
And then we can combine this cluster with machines that are on people's desks, such as my colleagues', professors' or postdocs' or graduate students', combining them as we need to when they're away from work or on vacation. Or if a colleague needs time just before a conference, they can go ahead and just use the machines, ask permission, and involve them together. And that's really saved a lot of people's work.
And just a little quick note, this is a picture just from last week, the Dawson cluster. It's going to be 256 dual-processor Xserves; currently 128 are online. It was literally just physically assembled last week, and we were able to get this picture, connected with Gigabit Ethernet, running 10.3. So we'll definitely be having some results of that later in the month.
So cluster computing with Mac OS X, essentially we went about reinventing the cluster computer. And it really is a very nice approach to cluster computing, much more reliable than many other systems that I'm familiar with. It's independent of shared storage or any kind of command line login or static data like machine lists or static IP files.
And so that leads a lot to a great deal of reliability because you don't have to make sure that every little switch is just right in order for the cluster to work. And so this all results in the lowest barrier to entry for people who are using clusters and really saves a lot of time and money.
And really the purpose of this whole approach is to be able to enable users to focus on getting useful work done so they don't have to be bogged down with the mechanics of the cluster. They can actually get real research and real work done. And that was our motivation to be able to assemble and design the Mac cluster.
[Transcript missing]
So let me give you an introduction to parallel code using MPI. Basically, it's code that coordinates its work using messages. The model is that there are n tasks or virtual processors that are running simultaneously, and you label them from 0 to n minus 1.
These executables often use this identification data to determine what part of the work they're going to do and how to coordinate work between them. And so they pass messages between all these virtual processors or tasks to organize the data and organize the work. It's really analogous to a number of employees at a company who make phone calls with each other or have meetings to be able to coordinate work and to accomplish a much larger project.
Any group of tasks can communicate, which implies there are order N-squared connections supported by MPI. And that can support simple sends and receives, as well as collective calls, such as broadcasts, where you're sending from one task to all the others, or gathers, where you're collecting data, say for data output, or reduction operations, such as computing the maximum of an array that's spread across the cluster, or the sum, or other parameters like that. And also matrix operations such as transposes and vector operations.
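As a small illustration (not from the talk), a reduction that finds the maximum of an array spread across the cluster can be written with a single collective call; the function name and array layout here are just assumptions for the example.

    /* A hedged sketch: each task contributes its local maximum, and every
     * task gets back the global maximum of the distributed array. */
    #include <mpi.h>

    double global_max(const double *local, int n, MPI_Comm comm)
    {
        double mymax = local[0], allmax;
        for (int i = 1; i < n; i++)
            if (local[i] > mymax) mymax = local[i];

        /* Collective reduction across all tasks in the communicator. */
        MPI_Allreduce(&mymax, &allmax, 1, MPI_DOUBLE, MPI_MAX, comm);
        return allmax;
    }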
And synchronization is not required between the tasks. No precise synchronization is necessary; it's only implied by the fact that messages need to be able to get from one task to the other. So to give you an idea of what it looks like, I'll introduce the simplest example I know of message passing, which we call Parallel Knock.
In this diagram, the time axis runs down, and we have two tasks that are communicating with each other. At first, task zero sends a message to task one. And then they both print that message. So task zero prints the message it just sent, and task one prints the message it just received. And then a reply is sent back from task one to task zero, which is then printed by both tasks. So task zero prints the message it just received, the reply, and task one prints the reply it just sent.
So to give you an idea of what the code looks like,
[Transcript missing]
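Since the code listing is missing from the transcript, here is a minimal sketch of what a Parallel Knock style exchange might look like in C with standard MPI calls; the message strings and buffer size are assumptions, not the session's actual listing.

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        char msg[64];
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            strcpy(msg, "Knock knock.");
            MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            printf("Task 0 sent:     %s\n", msg);
            MPI_Recv(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            printf("Task 0 received: %s\n", msg);
        } else if (rank == 1) {
            MPI_Recv(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            printf("Task 1 received: %s\n", msg);
            strcpy(msg, "Who's there?");
            MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            printf("Task 1 sent:     %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }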
Now the next example I'd like to go over is Pascal's triangle. This is an example that illustrates local propagation. By propagation, what I mean is that every element eventually interacts with every other element in the problem.
But the interactions are all local because any one element is simply the sum of the two neighboring elements in the preceding line. And so eventually they all interact with each other, but every interaction itself is local. And this is similar to a variety of physical problems such as fluid modeling where you're looking at fluid flow through like inside a plasma or inside a blood vessel or things like that.
One tool you can use for that is partial differential equations, where you have neighboring elements interact with each other. As well as elastic deformation, which occurs when you're trying to simulate, say using finite element modeling, how the earth's crust is going to deform when, say, a fault slips.
Or a Gaussian blur where you're talking about one point spreading this information to all its neighboring ones and so forth using localized convolution. Or molecular dynamics where you have molecules interacting with each other in a local manner or certain parts of particle based plasma models. Are all, those kinds of codes are all good examples of local propagation.
So in parallel Pascal's triangle, the way that you recognize where the message passing is, is to first lay the problem out; you can think of the time axis as going down, from one, to one-one, and so forth. And the thing to recognize is to understand where the communication is happening in the problem in order to perform the computation. And so what I've drawn here are arrows indicating all the places where there's a certain amount of information or data being propagated from element to element.
And so the thing to recognize is that when you partition the problem up, let's say into three different sections, you can recognize that a certain amount of information or data is being propagated across the partitions between each section of the problem. And so you can handle all the internal communication as normal for any single processor code, but then the MPI calls correspond to the arrows that cross the red boundaries that are here.
But by choosing this method, this arrangement for the partitioning, the computation becomes proportional to the volume of the problem and the communication becomes proportional to the surface area. So you can think of it sort of physically that you'll probably end up with a good communication to computation ratio with this kind of organization.
So by splitting it up into the three different sections, imagine you have three different tasks running, these are the messages being sent and received. So that for every odd and even line, you're sending messages either to the left or to the right, to the left or to the right for every alternate line.
And so for the computation, all the computation needs to know is simply that there's an array, and to compute the value of an element in one line, you simply sum the two above it from the previous line. But what the message passing does is fill in the gaps as it needs to, to be able to propagate the information between each section.
And so you can see, say, the left edge of the middle task is a duplicate of the right edge of the left task. And so the fact that there's a duplicate, this is also known as guard cells, where you're able to set up these kinds of guard cells to allow the computation to proceed as if it was the only process you're running, but then the MPI simply fills in the guard cells at the moment where it's needed. And so this is actually a fairly prototypical example of a lot of local propagation type problems.
So to show you a code example, and again this is available in Fortran as well. In this case, this if statement is alternating between odd and even lines of the problem. For example, we start at the top part of the if statement. We have an MPI receive that's performed on the right edge of the array from the right processor. And what I mean is that it's an immediate receive; it immediately returns, also known as an asynchronous receive.
So you're allowed to continue to execute while the receive is happening. And then an MPI send is performed on the left part of the array, array zero, to the left processor. And then an MPI wait is performed to be able to balance out and complete the immediate receive that came before.
So in this case, since everybody is sending something to the left, that means that you're receiving something from the right. And so that's what that corresponds to. Likewise, in the lower half of the statement, we're doing an I receive from the left and then a send to the right.
And then a wait to complete the receive. And so we're all sending to the right instead. And so the result of this code is like this. If we divide it up into three different tasks, and the way that this is drawn, is that all the odd values is drawn with an asterisk and all the even values has a space. And so we can see that they're essentially task one has a seed at the top, which then propagates through and propagates out to the boundaries across the partitions into the sections on task zero and task two.
And so by arranging it this way, we can see that we actually have maintained our guard cells. If you look carefully, the left edge of task one is identical to the right edge of task zero. And so those guard cells will be maintained by the MPI. So we see that's successful there. And the other thing that we see is that this forms a shape also known as the Sierpinski gasket in the Pascal's triangle. So we're able to perform this problem using MPI in this way.
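As a hedged sketch of the guard-cell exchange just described (not the session's actual listing), assume each task owns a slice row[0..W+1], with row[0] and row[W+1] as guard cells; neighbors that don't exist are replaced with MPI_PROC_NULL so the edge tasks need no special case.

    #include <mpi.h>

    void exchange_guards(long *row, int W, int line, int rank, int nprocs)
    {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
        MPI_Request req;
        MPI_Status  st;

        if (line % 2) {
            /* Odd line: post an immediate receive into the right guard cell,
             * send our leftmost owned element to the left neighbor, then wait. */
            MPI_Irecv(&row[W + 1], 1, MPI_LONG, right, 0, MPI_COMM_WORLD, &req);
            MPI_Send (&row[1],     1, MPI_LONG, left,  0, MPI_COMM_WORLD);
            MPI_Wait (&req, &st);
        } else {
            /* Even line: receive into the left guard cell, send to the right. */
            MPI_Irecv(&row[0], 1, MPI_LONG, left,  0, MPI_COMM_WORLD, &req);
            MPI_Send (&row[W], 1, MPI_LONG, right, 0, MPI_COMM_WORLD);
            MPI_Wait (&req, &st);
        }
        /* After the exchange, the next line can be computed entirely locally,
         * e.g. new[i] = row[i-1] + row[i], just as on a single processor. */
    }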
So that's just one of many possible message passing patterns that are available that are supported using MPI. And for example, that's the example of the nearest neighbor on the left. And of course, the arrows are reversible, so you can do left, right, left, right. And then the upper left is another common message passing pattern, also known as master/slave. This is something that is relatively simple. And so in this case, it shows a broadcast from one task to all the others. Or you can reverse this and, of course, do a gather.
But also, the all-to-all communications pattern, where every node is communicating to every other one, that's very important for data transposes of matrices. And that's important, say, for performing a very large 1D FFT in parallel. You have to go through data transposes and, consequently, the message passing patterns. You have a lot of all-to-all communications. Or you could have other message passing patterns, like a tree, where one is sending to two others, and they, in turn, send to two others.
Or in a regular pattern for a more regular problem, or any combination of the above, or any multidimensional versions of any of these. So these are all things that are possible with MPI and are important for a variety of interesting problems. So to give an idea of some of those interesting problems, these are the applications that have been run on Mac clusters that I'm familiar with. For example, on the upper left is a picture of the Electric Tokamak device.
A tokamak is a plasma device that attempts to hold a plasma in confinement in a ring shape, or torus shape. And one of the things about many kinds of tokamaks is that the plasma in them is very hot and very hard to handle, so it typically leaks out to the walls around it. And so they wanted to try to confine it better. But in order to be able to see inside it, if you stick a probe in it, it's so hot it could just vaporize.
So it would be very hard to be able to try to really probe in there. So that's why computational simulation is very interesting to do. And so this is an example. This is a very interesting example. A QuickTime movie made from a gyrokinetic simulation of a Tokamak plasma in the cross-section showing the electric potential and seeing how it evolves from a linear state to a saturated state.
and then on the right is the planetarium rendering that was performed by a customer over at Northern Kentucky University. This was submitted to the first ever full dome festival and actually won an award. This was performed on a 50 node Mac cluster rendering out a three dimensional simulation inside a planetarium.
On the lower left is an example that comes from Dr. Huelsenbeck over at UCSD in biology, where he and his colleagues wrote a program called Parallel MrBayes that computes the posterior probabilities of phylogenetic trees. So I'll say that five times fast. And what it studies is it looks at the DNA, the similarities in DNA between various species, and tries to determine the evolutionary path between them. And this was a code that he consulted with me on to be able to do parallelization as well as vectorization.
With vectorization we were able to get a three-times speedup, and of course with parallelization we were able to boost that even more with the number of processors involved. On the lower right are some diagrams from a quantum PIC simulation. In this case it's showing a two-dimensional quantum wave function in a simple harmonic oscillator and showing the circulation of the electron around the wave function.
This was actually work that was based on my doctoral dissertation, which I did entirely on Mac clusters. And what it involves is an approximation of path integrals to be able to sample all the possible classical paths, and it uses plasma code to be able to push those paths forward and determine the evolution of a quantum wave function.
So the Mac cluster recipe. Basically this is all the description that you need to be able to assemble a Mac cluster. For the ingredients, simply take a bunch of Power Macs or Xserve G4s or G5s, upgrade the memory as you need to, get a Fast Ethernet switch, or faster if you have more money, and get a bunch of Ethernet cables.
Then the directions are: connect the cables from the Macs to the switch, download Pooch, which you can get from our website, and install Pooch; it only takes seconds per node to install. And then use the AltiVec Fractal demo to be able to test the cluster. And so what I'd like to do is see if I can give you a demonstration. So if we could switch to demo two. Yes, thanks. And let's see.
Okay, so let me uninstall that. Okay, good. So let me give you sort of a prototypical idea of a numerically intensive code that we have here. This is known as the AltiVec Fractal demo. Right now it's not using the vector processor that's in this G5 here. And it uses a z-to-the-fourth computation, something I thought was a little bit more numerically challenging.
And also it counts up how many floating point operations it does and times itself to really determine how many megaflops it achieves. And it gets about 1,100 megaflops in this case. But if I use the vector processor, I can go ahead and use that. And it goes quite a bit faster. It gets about five, six gigaflops or so, which is pretty nice.
and this also can make use of a dual processor machine. That gives me another factor of two, but what if I want to get beyond a factor of two? Well, that's where parallel computing comes in. That's where Pooch comes in. This is how long it takes to install Pooch. Just double click on the installer. and there we are. And POOCH is an acronym, Parallel Operation and Control Heuristic Application. And let's see, I just need to be able to log in to the cluster.
And to be able to start up a new job, I go ahead and click New Job from the File menu. It opens up a new job window. And this has two panes in the job window. It holds a list of files on the left and executable that will be copied to the machines listed on the right.
and execute as a parallel computing job there. So if I click on select app, I can go ahead and use the file dialogue to be able to navigate through the file manager. But I really don't prefer doing it that way. I prefer using drag and drop. So how many of the parallel computers can you think of you can launch using drag and drop? There really aren't too many.
By default, it selects the node I'm on, which is NobHill Demo 2. And if I click on Select Nodes, this opens up a new network scan window. And this uses both Rendezvous and SLP simultaneously to determine the names and IP addresses of other machines on the local area network that's here.
And so it uses this information, the IP addresses, to be able to contact the Pooches on the other machines, in this case the Xserves that are here, and involve those, and then determine whether or not they're busy or okay. Busy means they're running a parallel job, which shows up in red letters, and it also shows how much RAM they have.
And it also queries other information such as, gee, you know, what's the clock speed of them or what operating system or how much load does it have, how much disk space. You know, when was the last time someone touched the mouse or, and it uses this information to be able to form a rating of the cluster.
And so it helps you choose the nodes that are more suitable for running in the cluster. And it actually can give you a recommendation. You can go ahead and choose the add best function if you want. Or you can go ahead and drag and drop or double click on the nodes that are there.
So if I click on the options of the job window, this opens up the options drawer. And you can, say, place the executable in a particular subdirectory on each one of the machines, maybe perhaps delay the launch until some later time of day, like after a colleague leaves work to go home. And you can also pretend that you're on a very, very large system by launching as many tasks as you like.
By default, it launches as many tasks as there are processors. So you can also really benchmark or stress test your code. We support three different MPIs, as I described earlier. And if you want to get through a firewall, you can use a particular port number or queue the job for later execution.
So to be able to launch the job, I go ahead and click on Launch Job. And this copies the executable to the other machines and then passes control to the parallel computing code, which then divides up into the various different sections and then collects results back here for display. And we get something like 44 gigaflops in this case. Thank you.
Let's see, I just want to check something. OK, so from there, and just to show you that this isn't just for fractals, is that this is an example of a physics code that we have. And let me go ahead and actually, oh, that's fine. I'll just involve the same nodes that are here.
And we can go ahead and-- so what's happening here is that this is actually a plasma physics code. It's running at least a few million particles in the simulation. And it's being performed on the nine processors that are available here. And if I go ahead and run this job, we can see in the electrostatic potential that there's a plasma instability that grows out of that.
And we can see that in the lower right, there is the MacMPI monitor window, which is very useful for diagnosing and debugging parallel codes in MPI. And so in the top part of the window, it shows the messages. White means it's not sending any data. Red means it's receiving data. Green means it's sending. Yellow means it's swapping. And so a typical thing that happens when you're learning how to write a parallel computing code is that a lockup happens, and so the lights freeze in the pattern of the hang.
But also, down below, there's a histogram of the messages being sent and received as a function of message size. So this encourages you to send fewer large messages as opposed to many smaller messages. It also shows you dials of how much time it's been communicating and how many megabytes per second are being sent or received in between these machines. So this is a utility that myself and many colleagues and many venerable institutions have used to be able to diagnose and debug their codes.
And to give you another example of a code, also a physics code, this is an example of a code that performs a Fresnel diffraction problem where you have a point source of light producing spherical waves and projecting a diffraction image on the screen. And so from there, we can actually, this actually has a feature where you're able to automatically launch itself in parallel on a cluster.
And so this is the way that I would hope that applications become so easy to use that you can simply drag and you can just simply use a menu click to be able to have it launch itself onto a cluster and make use of the resources that are there.
And you can see again the MacMPI monitor window showing the messages being received, mostly very large messages. Let me make the problem, I was going so fast, I've got to make the problem size bigger. And then it's also showing the colors of different parts of the problem that are being assigned to the various processors.
And so just one more feature, one other thing I want to show you, something that was just announced this week was what we call Pooch Pro. It has a new user menu where you can actually assign a certain amount of quota for each user. And so it computes how much compute time is being used. And then this is a cluster, the only cluster or server computer I know of that has rollover minutes. So you can rollover your compute time from week to week, let's say.
And also, now this is something you would only see as an administrator: you can actually administrate the users that are there. And A means you have administrative capabilities, Q means it has a quota, and then there's rollover minutes, being able to migrate, and password changing. And you can have different passwords and so on. So I can double-click on a particular one and edit, say, how much CPU time our good friend Warner Yuen has.
Like, let's say, I'll give him just a really little bit of time or something like that, who knows. And maybe change his password, okay. Anyway, so these are the kinds of things that are available in Pooch Pro. So that will be it for the demo for now. I'd like to switch back to slides.
Thank you. So just very quickly, for more information, the reference library that we can refer to: basically you go to the Dauger Research website, where you can find a whole bunch of information. On the Pooch website, you can find the cluster recipe, you can download a trial version, and we have a tutorial on writing parallel codes as well as a zoology of parallel computing, which is a description of the various parallel computing types.
And this will all be linked from the WWDC URLs. As well as the Parallel Knock tutorial with code examples in both Fortran and C, a Parallel Adder tutorial in both languages, Parallel Pascal's Triangle, and as well related publications, and actually another video that's a little bit longer than what we displayed here of some of the work that we've done. So I'd like to introduce Steve Forde, and thank you very much for your attention.
Steve Forde, I'm CEO of GridIron Software. I'm going to go over a real brief overview of what we would call a next generation parallel computing framework. And we're going to do that really from a very commercial perspective. So probably a lot of the same points that you've heard before, I'll go through a little bit, but we'll go from here.
So one of the key things that ISVs that we work with are looking for are obviously speed, but a lot of times the resources that are available to end users for products that they ship are not vast. So you get into a scenario where I need to provide 100% performance or provide linearity for every CPU that I add.
Because you might have a company like a Pixar that has thousands of machines, but you also might have a small post-production facility or something like that, and you're sitting in the basement with just one or two machines kicking around. Is that actually going to provide some value for them? So the challenge for developers is how do you build a parallel application that provides this performance in a very easy to use and seamless fashion? Power: ISVs are really interested in the money quotient. This is what I would like to call our million dollar slide. From this perspective, this was a customer we have in the print space, and we actually did a comparison between five G5 Xserves and a Sun Fire 6800 with 12 CPUs.
The interesting thing is that this was the result, and this is the cost that's generally associated with machines like that. And you can kind of get the idea of why commercial software vendors are very interested in seeing how can they provide this functionality from a commercial perspective to everyone in their user base.
So methods of grid computing, we've heard a lot about different things, but from the grid's perspective, there's three basic kinds. There's the middleware perspective, there's the opportunity for message passing, which Dean talked a lot about, and some development tools that try to make this whole black art of parallelism a little bit easier on you, the developer.
Script distribution: there are obviously some pros and cons. It's very good, as we see with distributed resource managers, and if you're familiar with things like Xgrid and that kind of stuff, to go out and say, okay, I'm going to use existing resources with existing applications and distribute things across them.
But there generally needs to be some sort of skill set for the end user to understand how to do those things. So it's very useful in areas of scientific computing and in research. But when you go into a shrink-wrapped application and you're trying to put that onto a CD, it's a little tough for a lot of the user base to grasp.
Message passing, as Dean talked a lot about, it's used quite extensively in the scientific and research areas. But the interesting thing that we found as we went through our engagement with several ISVs from the commercial perspective, obviously pros and cons from that perspective, but the biggest thing was that there wasn't a lot of confidence with their ability to ship that with a product. So it was the learning curve associated with actually putting that into their products and their users to understand how this thing works.
Development tools is where you're probably going to see a lot more emphasis on this down the road, especially as chip design and so on is going to move in a few different directions. But from our perspective, we wanted to create an application development environment that had a very high level of abstraction.
So MPI is a message passing interface: it provides you the messages, but you still have to write the parallel application around them, and everything else is on you. You not only have to worry about how you partition your algorithm, but then how you message, and then how you build all the things such as discovery and fault tolerance. You've got some good tools, again, like Dean's tool, that can come along and work with MPI on top.
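To give a feel for that division of labor, here is a minimal MPI sketch in C. The even partitioning, the process_chunk placeholder, and the final reduction are illustrative assumptions, not anything from the talk, and launching the processes, finding machines, and surviving failures are still entirely left to you.

#include <mpi.h>
#include <stdio.h>

/* Placeholder for "your algorithm": sum a range of numbers. */
static double process_chunk(int start, int end) {
    double sum = 0.0;
    for (int i = start; i < end; i++)
        sum += (double)i;
    return sum;
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* You decide the partitioning: split N items evenly by rank. */
    const int N = 1000000;
    int chunk = N / size;
    int start = rank * chunk;
    int end = (rank == size - 1) ? N : start + chunk;

    double local = process_chunk(start, end);

    /* You do the messaging: combine partial results on rank 0. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f across %d processes\n", total, size);

    MPI_Finalize();
    return 0;
}

Even in a toy like this, three concerns are already yours, partitioning, messaging, and collecting results, before any of the deployment questions come up.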
But what happens from within the application is something that's very important. GridIron XLR8, just as a real brief overview, is a peer-to-peer based distributed computing architecture. The APIs are built into your source; it's more a matter of wrapping the source. I'm going to give a quick example of that, what we did for some MPEG encoding.
The work is then dynamically distributed across the network, and that can be to a dedicated cluster or to specific desktop resources; it doesn't really matter. The key thing is that you can get into that scenario very quickly and, from within your application, once it's been programmatically added, provide a user with a very engaging experience.
So the development tools approach has a lot of the same pros and cons. It obviously requires code modification, and we as developers don't like to modify code. It's a very non-trivial thing, especially when you get into breaking up algorithms; anybody who works in multi-threading can attest to that.
But from our perspective, we're kind of like a hybrid between OpenMP and MPI. What we wanted, and what I'm gonna show you in the demo with Adobe After Effects, was the ability to take advantage of just another machine, or another CPU within the same box, for a very serial, single-threaded application.
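For the within-the-box half of that hybrid, an OpenMP example gives the flavor. This is a generic sketch, not GridIron or After Effects code, and the loop body is just a stand-in for real work: one directive spreads the iterations across the CPUs in a single machine, and that's exactly where OpenMP stops.

#include <omp.h>
#include <stdio.h>

int main(void) {
    const int N = 1000000;
    double sum = 0.0;

    /* One directive parallelizes the loop across the CPUs in this box,
       but only this box; reaching other machines takes something else. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += (double)i;   /* stand-in for real per-item work */

    printf("sum = %f, up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}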
To grid or not to grid, obviously that's the big question: what kind of development effort do you have to put into parallelizing your code to return some results, and is it worth it? That's always why parallelism has been the black art.
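One back-of-the-envelope way to frame "is it worth it", not cited in the talk but underlying the linearity question, is Amdahl's law: if only a fraction p of the runtime can be parallelized, the speedup on n processors is bounded by

S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}

So if 90% of an application's time is parallelizable, even an unlimited number of machines can never deliver more than a 10x speedup, because the serial 10% dominates. That's why the "20% of the code that does 80% of the work" framing a little later matters so much.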
So from our perspective, we wanted to really get rid of that black art connotation and provide an interesting framework to do this. There's been a lot of talk about processor intensive applications, but ironically, most of the applications that we've worked with have data problems more than anything else.
Data movement, reads and writes, and those kinds of things seem to be a major bottleneck with a lot of the applications we've worked with. So we wanted to come up with various means of handling that. Again, we're saying to you, the developer: you focus on your algorithm, the thing that you know very well, and we'll provide you a parallel framework that you can call into.
So which grid method to use? Obviously, if you're doing embarrassingly parallel or scriptable work, distributed resource managers and scripted batch queue systems are very good, and if your source code is not available, that's probably the only route you have. But then you also have the opportunity, depending on the resources at your disposal, to go to message passing or to another type of framework such as a development environment.
Quickly, on grid enabling an application: whether you call it 90/10 or 80/20, it's basically the same thing. When you're looking at the 80/20 rule, focus on that 20% of the code that does 80% of the work. The abstraction level is again very important here, because from a development perspective there's no such thing as real automatic parallelism. But maybe there are ways to wrap your code and provide hints instead of breaking up your algorithms.
In other words, the algorithms can still run the same way they did before, and you don't have to worry about totally wrecking your application. As for application modification, we've broken our architecture out into three plug-ins: defining tasks, task compilation, and then result reassembly, as sketched below. But again, the goal is to provide the end user with a really engaging experience, where they can basically think it was all done right on their Mac while they use whatever machines are on the network.
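To make that three-plug-in shape concrete, here is a plain C sketch of the pattern. It is not GridIron's actual API; the Task struct, the frame ranges, and the function names are hypothetical stand-ins, and the middle stage, which the talk calls task compilation, is shown here simply as running a task locally.

#include <stdio.h>

/* Hypothetical types and names -- not GridIron's actual API, just the
   define / run / reassemble shape described in the talk. */
typedef struct { int first_frame, last_frame; } Task;

/* Plug-in 1: defining tasks -- split the job into independent pieces. */
static int define_tasks(int total_frames, int frames_per_task,
                        Task tasks[], int max_tasks) {
    int n = 0;
    for (int f = 0; f < total_frames && n < max_tasks; f += frames_per_task) {
        tasks[n].first_frame = f;
        int last = f + frames_per_task - 1;
        tasks[n].last_frame = (last < total_frames) ? last : total_frames - 1;
        n++;
    }
    return n;
}

/* Plug-in 2: running a task -- in the real framework this would execute on
   whichever machine was chosen; here it just runs in-process. */
static long run_task(const Task *t) {
    long result = 0;
    for (int f = t->first_frame; f <= t->last_frame; f++)
        result += f;                /* stand-in for encoding one frame */
    return result;
}

/* Plug-in 3: result reassembly -- merge the partial results back in order. */
int main(void) {
    Task tasks[64];
    int n = define_tasks(300, 25, tasks, 64);

    long combined = 0;
    for (int i = 0; i < n; i++)
        combined += run_task(&tasks[i]);

    printf("%d tasks, combined result %ld\n", n, combined);
    return 0;
}

The point being made in the talk is that the per-frame work in the middle is unchanged; only the slicing and the merging wrap around it.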
MPEG encoding was a challenge we were given by a certain company that does a lot of encoding, and we wanted to see what kind of results we'd get, because this is data-heavy work: this is HD video. We actually went in and did a modification, ran the test on several Xserves, and got some very interesting results. We tried it out originally on 12.
We did actually go up to 40, but we started seeing diminishing returns in the scaling curve. Still, we took an HD encode and brought it down from two and a half hours to six minutes, with a seamless result. The nice thing is I also didn't need a lot of disk space; I didn't need to stage a lot of things. It just dynamically moved the data where it was needed and brought the results back for the end user.
More importantly, from a development perspective, the change was small: out of 1,100 source files, we modified three, only about a hundred lines. We published a white paper on this that goes into the details, and that's available on our website if you want to check it out.
Another thing that we did, and what we're shipping with right now, is with Adobe After Effects. I wanted to show that to you, if I can switch over to demo two. I wanted to show it very quickly because this is what we feel the end user has to experience, and this is the challenge for us as developers: to turn it into an engaging experience for end customers.
The funny thing as well, and I'm just gonna make a reference: I think in the keynote on Monday there was a reference to the challenge of shrinking chip designs further. A lot of folks are now announcing chips that move to multi-threaded cores or multiple cores on a die. So parallelism is gonna be absolutely key if our software environments are actually gonna take advantage of that.
So one of the things we did in this scenario is that we're actually gonna use these four Xserves here. But from an end user perspective, they don't know anything about them. This is a product that ships; you can buy it at Fry's for 900 bucks. Those users don't know anything about DHCP, they don't know anything about DNS, they don't know anything other than plugging it in and hitting go. And that's literally all they have to do.
It goes out, automatically finds all the other machines, and passes each one the data that's relevant for it to work on. But the most important thing is to deliver the results right in the application, in a way the user is very familiar with. So if you look down here, we're starting to bring in, and this is an HD 1080i clip by the way, for those who are interested, the results right into the RAM cache of After Effects. So from the user's perspective, it looks like it always worked this way.
And that's very engaging. The other side, though, is that we get an interesting side effect of using a grid to do all the work: you can render and keep working at the same time. That's never been doable before in a single-threaded application like this; I can keep doing things while it renders. So I'll go back to the slides.
That was just a very quick demo, but the power it gives an end user is very engaging, and we've been able to see that. This has been shipping for a month, and I think the stat was 17,000 users are using it, cobbling together machines in their basements and using it in very large infrastructures as well. The NBA finals were brought to you by this. If we as developers can bring engaging experiences like that to our customers, our products will come to market in a very compelling way.
So, summary: obviously, speed. Speed's great; is it worth the work? That's really up to you. You need to look at environments that are going to help you get to a more optimized and parallel infrastructure without the headaches or the worry of breaking your code. There are new hardware technologies coming down the road, specifically multi-core chips, that are going to make parallelism absolutely key, so we've got to start thinking about it now. And significant, near-linear performance scaling is really the thing that customers want to buy.