Information Technologies • 1:03:06
Come hear from the experts who have deployed small to large scale HPC systems with a variety of technologies including Xgrid, Sun Grid Engine, Platform LSF and other tools to make scientific computing more powerful than before. Through practical configuration tips and live real-world demos you'll see how easy it is to operate high performance computing solutions on Apple software and hardware.
Speakers: Todd Dailey, Warner Yuen, Yusuf Abdulghani, Josh Durham
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Good afternoon. Welcome to our session today, Scientific Computing Clusters on Mac OS X. My name is Todd Dailey. I'm a consulting engineer on the business side of Apple and we've got a few other presenters for you today. We've got Warner Yuen, who's a consulting engineer on the education side, Yusuf Abdulghani, who's a performance architect doing benchmarks and working on the performance tools, and Josh Durham, who's a technical services consultant who's been involved in the Virginia Tech cluster.
We'll have a couple other people coming up for Q&A later. I'll introduce them to you then. But what we're going to do today is we're going to work up the stack of a high performance computing cluster and show you everything that you need, and give you an update on all the different pieces.
Obviously we've done some updates this week. We've introduced a great new server. We've introduced some other great new tools and we've got some great stuff coming in Leopard. So we'll kind of use this as an architecture to work through and show you everything that's going on currently with clusters and how you can use that.
So the first thing we want to do is give you an update on Apple deployments and talk about what we've done lately, what people have been doing with our Xserve G5 products, expanding those, and how they're using them in deployments doing real science for real purposes today.
One of those that we've recently done is the Mayo Clinic and the, oh I'm sorry, let me step back. So why would you use Apple for scientific clusters? There are a lot of tools built in that provide a lot of benefits for clusters today.
Obviously we've got Mac OS X Server, a powerful Unix-based operating system, and with Leopard an actual UNIX operating system, that provides you with a lot of tools that are built in there. We've got all the built-in tools like Python, like Ruby, and coming up in Leopard, tools like Ruby on Rails that are all just built in and ready for you to use.
We've got Xgrid, which we've done some updates to with Xgrid 2 for Leopard. It's a great, easy-to-use clustering tool that allows you to run jobs. Josh will be showing you a demo later of how that works.
We've got all the great tools in Xcode 2.4 and Yusuf will be showing you some of that. And it's a popular and a well understood platform within the science community. And that's a very important point. If you go down to the Apple Science lab downstairs, you can see all of the different applications that people are using our platform for. You can see the hyperwall, the big multi-monitor wall that people are using. Just amazing innovations going on on the platform. So one of those people that's using our cluster and expanding that today is the Mayo Clinic.
The Mayo Clinic is doing molecular modeling. They're developing drug therapies. So they're doing kind of theoretical therapies, trying to figure out if a drug could work, for example, as a cure for cancer. And this is very complicated. They have a 295-node Xserve cluster that's running. This replaced an old Xeon cluster that they had. And with that, they were able to reduce the time to run one of their models through from 27 minutes down to just five minutes. So a lot of benefits for them immediately with the Xserve G5 by being able to deploy that.
Another customer that I'm sure you've heard of is COLSA. And we'll have Eric Wages, who is involved in running that cluster, up to talk to you later in the Q&A. So COLSA provides computational resources basically under contract to whoever would like to have their cluster housed with them. They're number 21 currently on the supercomputing Top500 list. They've got 1,536 of our cluster nodes deployed today.
And some things that they've done, you've probably heard about their cluster before. But some things that they've done since last year: they've upgraded the cluster backbone from Gigabit Ethernet to Myrinet. This provides them with a lot more speed. Warner will be showing you later some benchmarks that show that.
They've got Xsan deployed now. They've got 41 Xsan clients going back to Xserve RAID. And they use this cluster to do missile research. So they do a lot of research about hypersonic missiles. And to do that, they use computational fluid dynamics, a very complicated way of modeling that to basically see theoretically where things are going. So it's a very productive cluster for them.
And we've got some other customers out there that have very large clusters as well. So besides COLSA up there on the top, you can see as we go down, we've got some other big clusters that are already out there. We've got Virginia Tech with 1,100 nodes. We've got UIUC with 512. We've got UCLA with 256. Bowie State with 224.
And we've got hundreds of deployments that are out there that are still going. And we've got a lot of other clusters out there that are smaller than that. And those people in many cases are using our prebuilt configurations that we sell as the Apple Workgroup Cluster. And so we've got a lot of 8, 16, 32, 64 node clusters out there.
A lot of those in education where people are using these for educational or departmental or teaching purposes to be able to take their cluster and do research on a departmental level. In many cases, in addition to larger clusters that they might have. You can get into one of these Workgroup clusters pretty inexpensively.
So we see a lot of departments that buy them just because it's within their discretionary budget to do so. And they can have their own cluster to play with versus having to compete for resources on a campus or on an entire enterprise level. You can get this prebuilt, and you can get this qualified for many of the applications that are out there. We qualify that with MATLAB.
We qualify it with gridMathematica. We qualify it with iNquiry from the BioTeam, which does bioinformatics. And it's also a great platform for just open source code. There are hundreds of open source clustered-code projects out there that just compile and run fine on our clusters.
So talking about the HPC stack itself and starting to work up that, we're going to start at the bottom and start looking at the hardware platform. Obviously you've got to have some servers before you can start building the interconnects and building the applications to run on top of that.
So to start with servers, we've got a great new server to talk about at this conference. You won't be able to get it for a little while but it's still a great platform and we're really excited about the now-rechristened Xserve and the quad Xeon processor capabilities that it has. This thing is just a computing monster. You've got four Xeon cores that are in there. You've got an immense cache on the processor. You've got very fast hard drive options on that.
You've got an immense RAM capability. If you haven't seen the new Xserve yet, many of you may not know, we've got those down in the IT lab downstairs. If you do want to see that, that lab closes today at 5. But if you want to see one up close, I know there's a few around different places, but the IT lab is one place that you can go and see it today. So this is just a fantastic product. We've got the server out on Intel now so that you can see that.
And on top of that of course you've got to run an operating system. With the Xserve Intel announcement we also announced Mac OS X Server 10.4.7 running on Intel, and that's available today, although the server is not going to be available for a little while yet. But that's got a 64-bit memory address space for command line apps, for compiled Unix apps. It's got the just-announced Xcode 2.4.
Yusuf will talk about some of the tools and some of the updates that we've done in that for performance optimization. And that can act as an Xgrid controller. So again, right off the bat with that you can control up to 128 nodes without going to any job managers or any of the things that Josh is going to show you.
And then in Leopard Server the story gets even better. In Leopard Server we've extended the 64-bitness from the command line all the way up the stack. Now that may not seem to be too important for a cluster job, a lot of that stuff operates at the command line. But we expect that some of our partners out there who make tools for small clusters, it's very conceivable that you could have GUI applications that might need 64-bit access to memory, to large amounts of memory.
And we've got a huge memory capacity as you know on the new Xserve. So you can access all of that memory when Leopard comes out from any application that you can build with Xcode. We'll have Xcode 3, we'll have the Xgrid 2 controller. There's a session tomorrow on Xgrid 2 if you want to learn some more about that.
And one important feature that we haven't talked about much at the conference is that Leopard Server will be conformant with the UNIX spec. So we've always said that Mac OS X is UNIX based. But now we can say with Leopard that OS X is UNIX. It conforms with the POSIX specs, it's been certified.
It'll be an actual UNIX platform. And so the code will be much more compatible with any sort of random UNIX code that you want to throw on the machine and compile. And we've also added a DTrace facility within Leopard Server now. There are other sessions on that this week and you can see about that. But that provides you with a very low-level debugging function. And so with that intro, I'm going to hand things over to Warner. And Warner's going to talk about the interconnects. Thanks.
Thanks Todd. I'm Warner Yuen. I'm a consulting engineer with the education division of Apple. And I'm going to be talking about interconnects. Oops, wrong way. And what are high performance interconnects? So these are specialized interconnects that don't use the TCP/IP stack but instead the data flows directly from the network to the memory on the computer and that way the processors don't actually have to wait for the data in order to act upon it.
And also these interconnects have a high bandwidth capability so it's not just having the data flow fast to the processors but also you can flow lots of data to the processors. In general these are external interface cards and the link is either copper or fiber optic and they are connected to purpose built hardware switches designed especially for high performance computing.
A little bit on when to use and when to choose high performance interconnects for your high performance computer. So the idea is that interconnect selection can actually drastically affect the performance of your application. These days, oftentimes Gigabit Ethernet provides pretty good performance. However there are times when there is a parallel computing code that passes lots of messages from one machine to the other. And you don't want it to spend a lot of time communicating back from machine to machine.
So you need a high performance interconnect with low latency. And then there are other times there is code that has very large messages that it's trying to pass from one machine to the next. And you don't want to saturate the fabric just by sending messages back and forth. And of course a combination of the two. Lots of messages and very large messages. So those are times to choose a high performance interconnect.
Interestingly another time that you want to select a high performance interconnect is if you are in a shared computing resource. So the idea is if you build a shared compute resource and you want a variety of users, a high performance interconnect can often attract more users to your computer.
And thus offering better utilization of your resources. And the reason for that of course is you have a variety of users out there with a broad range of requirements. And you can't predict what each person is going to require. So if you can offer a high performance interconnect you can make more customers happy.
So let's take a look at what some of these interconnects are like. So there's the Myrinet 2000 components. Myricom and Myrinet are one of the earliest interconnects out there. And they are PCI-X interfaces. They include offload engines with memory, processor and firmware. And the switches scale to thousands of nodes.
In fact the majority of the high performance interconnects on the Top500 list are currently Myrinet 2000 based. The link cables are nice fiber optic pairs up to 200 meters in length. And the software stack includes a driver, interface firmware, libraries and middleware to run all of this.
About two months ago, Myricom actually introduced a new interface card, the Myri-10G components for high performance computing. So this is a newer PCI Express based 10 Gigabit Ethernet card from Myricom that offers not just 10 Gigabit Ethernet capability, but also the ability to use it in a high performance, low latency computing environment. The switches are standard 10 Gigabit Ethernet switches that work in conjunction with the PCI Express Myri-10G cards. The link cables are either copper or fiber optic and it also includes all of the software drivers required to run it from Myricom.
And a third form of interconnect, probably first made famous by Virginia Tech with their System X cluster, is InfiniBand. So the SilverStorm InfiniBand. InfiniBand is currently offered in two forms. Here you see the PCI-X version and the PCI Express version. The main difference being the PCI Express version offers dual 4x bandwidth, or much higher bandwidth, with the same latency as the 4x version. Switches go up to 288 ports in a single chassis, so you can build a fairly good sized cluster with just a single large switch, and it also scales up to thousands of nodes.
And the connection on this is copper based. And for more information on all of these, the websites provide a lot of really good resources. Myricom, for both their Myrinet 2000 and Myri-10G solutions, you can get information on the Myricom website. And for InfiniBand, from either Small Tree Communications or SilverStorm Technologies, you can check out their websites for Mac OS X based solutions.
[Transcript missing]
So the first one, from Argonne National Labs, MPICH 1.2.7 is the current iteration of this from Argonne. MPICH1 is one of the first MPI middlewares that came out and there have been spin-offs of that. There are Myrinet-enabled MPI stacks, MPICH-GM and MPICH-MX. There's an InfiniBand-enabled version known as MVAPICH.
And one of the newest versions from Argonne is MPICH2, the latest from them. In addition there are other ones, like LAM/MPI. LAM/MPI is from Indiana University among a couple of other labs, I believe. And it provides very good Ethernet performance as well as Myrinet and InfiniBand support out of the box.
Of note, LAM/MPI is probably the most widely used MPI stack on Mac OS X clusters based on its performance. But one of the newest, more interesting ones is Open MPI. It's a joint venture by several of the national labs and universities and international labs as well. It is Xgrid enabled. You'll see a demonstration of that from Josh when he comes up in a bit. It includes not only Ethernet support but also native Myrinet and InfiniBand support.
So with all of those MPI middleware software stacks out there, one of the things we wanted to look at was, well, how do these MPI middlewares perform on Mac OS X. So one of the things that we did was we decided to run the MPI Ping Pong benchmark on Mac OS X. The idea here is to measure the MPI software and fabric performance. The software itself is designed not to run on two cores of the same node but to run across the fabric.
So from one core to another core sitting on another compute node. The message or the ping is sent from the client to the server process and the server process bounces back the message to the client or the Pong. And that performance time, the time that it takes to make that trip is measured from the Pong. So that's what we did for our performance benchmark. In addition you can actually vary the size of the message or the ping that you send back and forth to measure the performance.
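To make that concrete, here's a minimal sketch of how you might launch such a ping-pong run so the two ranks land on different nodes; the node names and the benchmark binary are placeholders, and the exact mpirun flags vary a bit between the MPI stacks discussed here:

    # Two placeholder hosts, one rank each, so the message crosses the fabric
    # rather than shared memory on a single node.
    printf "node01\nnode02\n" > machines
    mpirun -np 2 -machinefile machines ./pingpong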
Beyond just testing with the MPI Ping Pong we also decided to see what that meant for real-world applications. So we tested the Ping Pong Benchmark and we also chose a couple of different real-world software applications that our customers often run. So let's take a look at some of the performance benchmarks. So as I mentioned there are various message sizes that you can pass back and forth. So here on this chart we have one byte up to 16 kilobytes. Latency is measured in microseconds and shorter is better.
One thing that I did do on this particular slide is put up a gold standard benchmark just as a baseline. So we chose one of the high performance interconnects and showed the baseline on the performance of that with the different messages. So as we compare that to MPICH1 over Gigabit Ethernet, you can see that it does scale.
Similarly to the Myrinet, but the latency is about 10 times worse, or an order of magnitude worse, than the high performance interconnect. But as we switch over to some of the newer MPI layers we can look at the performance and what happens there. So as you see here, it's interesting, with MPICH2 there's a drastic improvement over the original MPICH1. The latency is significantly better on MPICH2, the newer MPI software. And then LAM/MPI.
Better still yet, not significantly, but it is still a little bit faster with the message sizes. So this showed us that any of the newer stacks seemed to look better on performance. So how did this affect real-world applications? So one of the applications we chose to run was GROMACS. It is a C-based molecular dynamics code.
And in this case we ran a specific number of processors on Xserve G5s and measured the time for the benchmark performance. So how long did it take to run a particular simulation. In this case you can see we also ran the high performance interconnect as a baseline to just study and look at the scaling.
And then we ran MPICH2, so one of the newer MPI software stacks, over Gigabit Ethernet. And it scales fairly similarly, with the Myrinet just edging it out a little bit. But Gigabit Ethernet's pretty good performance. And of note, you can see that this particular application scales very well up to about 24 nodes and then it starts to taper off, so that you might be looking at a plateau in terms of how many nodes to run per job. Now let's compare that to MPICH1. So as you recall, MPICH1 on the Ping Pong showed very poor results as message sizes got larger.
And this shows very odd results as well. It clearly shows MPICH1 results as not something you want to choose for running this particular application. So it gets very slow. It doesn't scale even after the first eight nodes. So just to be sure we decided to run another application. This time a Fortran code weather forecasting simulation benchmark. And again, number of CPUs and execution time. So running the simulation, the baseline high performance interconnect benchmark.
Then MPICH2. So very good scaling on both of these fabrics. Again the high performance interconnect edging out the GigE. And now if we look at the MPICH1 performance with the same gigabit fabric, you can see here there are no results shown for the 8 and 14 CPU benchmarks. And the reason for that is when we ran it, the scaling would have pretty much flattened out all of the other graphs out there. So we decided not to even show those performance benchmarks on this slide.
So what's that mean? Basically what we learned from this is, as users of high performance computers, you need to choose your MPI software wisely. It's not just about hardware selection and switching fabrics but you need to choose your MPI software wisely. What that really means is don't use MPICH1.
The exceptions are the fabric-specific MPIs, MPICH for Myrinet or MPICH for InfiniBand. If you are going to use MPICH, we are strongly suggesting that you use MPICH2. The performance is much better. LAM/MPI provides excellent performance. It's a little bit better even than MPICH2 and again it's compatible with all the different communications fabrics.
And the latest version, Open MPI, and at the end of the session we'll have Brian Barrett from the Open MPI group come up for Q&A as well. But it's an excellent alternative to any of the other MPI softwares. It runs on Gigabit Ethernet and also the other fabrics as I mentioned, InfiniBand and Myrinet.
And what's unique about this is that it automatically selects the best fabric to run your application at run time. So if you have Myrinet or InfiniBand enabled it will know that and it will select that fabric at run time. And as a bonus it can integrate with Xgrid as a basic job scheduler, making setting up a cluster much simpler.
Of note, I wanted to put this little additional information here about integrating Mac OS X G5 based machines and Intel based machines, so heterogeneous clusters. So again, MPICH cannot be used on a heterogeneous G5 and Intel cluster. Whereas LAM/MPI and Open MPI can be used in a mixed cluster environment. The reason is both LAM/MPI and Open MPI handle the byte order swapping, the big-endian, little-endian issues of the architectures, whereas MPICH does not.
So the idea to run this mixed environment is you'll want to compile the MPI layer for both PowerPC and for Intel. And you can lipo and build a universal from that if you wish and install it on all of the nodes. And then from there, make sure that you compile compatible binaries that are accessible for each of the architectures and then just do the MPI run as usual.
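As a rough sketch of that recipe, assuming you've built the MPI library natively on each architecture into the paths shown (the paths and library name are placeholders):

    # Glue the two single-architecture builds into one universal library with lipo.
    lipo -create ppc/libmpi.dylib i386/libmpi.dylib -output universal/libmpi.dylib

    # Confirm both architectures are present before installing it on all of the nodes.
    lipo -info universal/libmpi.dylib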
So one other thing that I decided to put up, since I was showing a lot of performance differences between the various MPI layers, I also decided to just show a couple of slides on gigabit performance versus one of the high performance interconnects, in this case Myrinet. As you recall, we ran the Ping Pong benchmark and this time I'm just going to show you the gigabit results, again, various message sizes and latency in microseconds.
And the high performance Myrinet interconnect, about an order of magnitude better for these particular message sizes. And then we decided to look at it again, what's it look like on the real-world applications. So again, the weather forecasting model will show the gigabit results and follow that up by the Myrinet results.
In this case, from 8 to 28 CPUs, they both scale fairly linearly. So this shows that with the high performance interconnect, although faster, you might not be able to justify purchasing it. It looks like here you'd possibly be better off purchasing more Xserves, which would make us happy, because it is a large investment to spend on a high performance interconnect.
So what's it look like when maybe you actually do need it? So VASP is another example. This is another molecular dynamics, it's a quantum mechanical molecular dynamics chemistry software package, a large Fortran code. And we look at the gigabit Ethernet. You can see it scales between 6, 8, and 16 CPUs fairly well, but there's almost no scaling when you double the number of CPUs. And this is with gigabit Ethernet.
But here you can see the Myrinet performance scales almost perfectly linearly. So doubling the number of nodes you get double the performance. So in this case it does make a lot of sense to look into adding a high performance computing fabric. So what's that mean? So in summary, the high performance interconnects are usually faster than Gigabit Ethernet and when applications can benefit the performance gains are usually significant. So in the case of VASP it was clearly a significant win to implement a high performance interconnect. Though not all applications can justify the higher performance interconnects.
The first slide with the WRF application showed that you're probably better off just adding more servers. And then there are other strategies that you can look to, to evaluate whether it's worthwhile to add the interconnect or not. But one of the important things is performing your own tests with the various interconnects. That is very valuable in determining the performance of your applications. So that's the summary I have on interconnects and software middleware. And now I'd like to introduce Josh to talk about job scheduling. Thank you Warner.
So we're going to overview job schedulers a little bit and then go into some more specific details about the job schedulers that are available, and we'll also do a little Xgrid demo. So what do job schedulers do? They basically work with the resource manager to maximize resource usage. So if you have a bunch of computers you want to make sure you effectively use all of them and none of them are wasted by just sitting there idle.
They help alleviate the political issues that sometimes come up when you have a cluster. So let's say you have two teams that both want to run on the cluster. Well obviously they probably want to be able to use the cluster as much as they can. So this kind of balances out between the two. Most resource managers come with a basic job scheduler.
So most of the time it's either like round robin or first in first out sort of thing. If you have more complex requirements than just that, there are more complex schedulers.
An example of one of those complex schedulers is called Maui. It is free, which is nice. A lot of people like that, but it is also very complicated. There are lots of different things you can do with it. You can schedule policies, do different priorities, reservations and fair share. So, for example, if you have a bioinformatics cluster that's also letting the physics people get on the computer, you want to make sure the bioinformatics people get a higher priority when they want to run. So, something like this will help you kind of balance out things like that.
Resource management is basically the thing that tracks what is happening on each node, makes sure that system loads are correct. And they are also making sure that jobs don't exceed their allocated time. So if someone says, you know, submits a job for one hour and it's still running after that one hour, it's responsible for killing it and then allocating the resources appropriately after that.
So some available resource managers, and this list is nowhere near all inclusive. There's definitely a lot more than this and I probably missed someone's favorite resource manager. But obviously Xgrid is available in Tiger. It allows basically distributed processing of what we call grid jobs, which are loosely coupled jobs, to make work go faster.
I don't know if anyone saw the podcasting utility that we're doing, but obviously Xgrid is a big part of that because now you can use Xgrid to do the podcast processing and moving it along. So the Tiger client comes with Xgrid. You do need Tiger Server though to do the controller, which basically brings all those clients together and lets you allocate resources on them.
There's Grid Engine as well. They do provide OS X binaries, which is really nice. It is free and open source, so available to anyone. It is distributed by Sun, and Sun does have, I think, a commercial product that's based on it, so if you need actual support you can get that through Sun. And the great thing about Grid Engine is it's actively being developed.
It's constantly being updated and that's really important. Another one is Torque, which is very popular. It's actually based off of PBS, and Torque is basically the free version once it branched off. And another one is Platform LSF, which is a commercial resource manager. So we're going to just focus on a couple of these, Xgrid and Grid Engine.
So there are different kinds of resource management and you don't want to deploy one that is going to be a lot more than what you need. So the most basic form of resource management is what we call social resource management. Basically you usually have a couple guys, maybe a few people, that can handle it among themselves, who's going to run when and what, just by talking. So this guy will have nodes 1 through 8, this guy will have nodes 9 through 16, and they can usually cooperate pretty nicely.
On top of that we have basic resource management with round robin allocation so that the first job will get the first few nodes, the next job will get the next few, so on and so forth. And Xgrid is an example of that. So if you need to step up and start actually looking at what are each system is doing, then we can do resource management with load balancing as well. So Grid Engine out of the box can do this.
Now if you need something a little bit more complex than that, like what we were talking about before with reservations, fair share, dynamic priorities, then you can basically take Maui, the scheduler, and plug that into something like Grid Engine and be able to start doing those sorts of things.
So Xgrid. Xgrid generally is used for serial applications where you can take a task and chunk it out into very loosely integrated things and deploy it across a bunch of systems. So when Xgrid first came out I didn't like it because you had to use the GUI and I was always a command line sort of person, but there's actually a lot of really good command line support in Xgrid that I didn't know about. So for example, all you have to do is set a couple of environment variables to tell the system where you want to run, and then you can launch any job using the Xgrid command line.
And you can also use this command line to retrieve the results and delete the results once the job is done. And so basically, again, what we said before, Xgrid is round robin. In this case the Xgrid clients are on the right there and you can see that they're working on something.
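As a sketch of that command-line flow, with the controller name, password and job ID as placeholders and /usr/bin/cal standing in for a real job:

    # Point the xgrid tool at your controller once, instead of passing -h and -p every time.
    export XGRID_CONTROLLER_HOSTNAME=controller.example.com
    export XGRID_CONTROLLER_PASSWORD=secret

    # Submit a job; the tool prints back a job identifier.
    xgrid -job submit /usr/bin/cal

    # Later, pull the results back and clean up, using that job identifier.
    xgrid -job results -id 42
    xgrid -job delete -id 42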
Now you can actually integrate Xgrid with MPI. The Open MPI team has actually done a lot of really neat work making this as seamless as possible. If anyone's ever done any sort of MPI integration with a resource manager, it's kind of a little bit difficult because you have to make sure that the MPI knows which nodes it can run on, which is dynamic.
You know, you don't know beforehand which nodes you're going to get, but they've done a lot of work integrating with Xgrid to get that to work. Same thing as before, the same two environment variables, and then you can use Open MPI to run. There are a couple of requirements to get that working. Open MPI has to be installed on every single node.
You have to have a shared workspace where the user 'nobody' has read and write permissions. So right now Xgrid is a little bit more complicated than you might think. It doesn't run as the current user who submitted it, it runs as the user 'nobody'. So you have to make sure that user can read and write to that shared workspace. Set the path so that MPI is in your path on each system. And then you submit an Xgrid MPI job using mpirun. And we're going to try to demo this.
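Roughly, what the demo is about to type looks like this; it's a sketch, with the controller name, password, shared path and application name all standing in as placeholders:

    # Tell Open MPI's Xgrid support which controller to talk to (placeholder values).
    export XGRID_CONTROLLER_HOSTNAME=controller.example.com
    export XGRID_CONTROLLER_PASSWORD=secret

    # Launch from the NFS-shared workspace that the 'nobody' user can read and write;
    # Open MPI asks the Xgrid controller for agents and starts the four processes there.
    cd /Volumes/shared/run
    mpirun -np 4 ./my_mpi_app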
Great, so we did a little bit of cooking show magic here. We've got Xgrid Admin already running, deployed. We have a cluster actually. This is a cluster over at the Apple campus. It's 16 nodes but we've got four of them on the Xgrid. So this isn't the cluster here in the front. It's actually over at the Apple campus.
So if you haven't seen Xgrid before we've got the nice little speedometer that tells you how much work it's doing. You can see the agents here. You can see the different nodes for example that we have. So we have four nodes. Yellow isn't bad in this case. It just means it's not doing anything at the moment.
So what we're going to do is we are going to use the command line just to prove that we can actually use the command line to submit jobs, and we're going to do an Open MPI job, GROMACS. So like we said before, we have to set the environment variables. And these variables aren't that short and they're pretty easy to mistype. Let's see if I get this right.
You can put these in your startup script or something like that so you don't have to type this in every time. So there you can see the two environment variables are now set. So we have the application GROMACS, and again this is the cooking show magic. We already have the NFS deployed, the shared space, the setup across all the nodes. And from here it's simply just using the Open MPI mpirun command to run GROMACS. So we'll do the nice little gauge here so you can see that.
So once we run it we should see in the Xgrid admin the jobs actually start queuing up and running there. So then I can see the little gauge is kind of cranking up there. And we can go to the agents and now instead of yellow you can see all the agents are now green. You can see they all think they're working on something. And you can see now here's the job which wasn't there before.
So this demo takes a little bit of time to run. You can actually see now it's getting the nodes and actually starting to run. And this is GROMACS. It does take a while to run so I'm just going to go and kill it with Control C. And you can see job disappeared. Agents think they're available again. And that's it. So that was Xgrid from the command line using OpenMPI.
Again, another resource manager that we like to talk about is Grid Engine. It's actually deployed on a lot of different systems because, for example, BioTeam's iNquiry uses Grid Engine by default. So if you ever installed BioTeam's iNquiry you kind of get this for free. So the pros on it: it supports a lot of MPI libraries. All the ones that Warner went over, I definitely know that they support those and you can use them.
It runs on a wide variety of platforms, and these are the officially supported ones that you can download binaries for, and there's a lot more beyond that that are probably not officially supported. But you can definitely download binaries for OS X and Linux and Solaris and probably a lot of other Unix operating systems that you may run into. And like I said before, active development community. It's up to version 6 now and it's a company sponsored project, but definitely a lot of community development on this.
The con is it's not that easy to set up. You know, if you ever set up an Xgrid controller it's pretty easy. You just go out and say, "Hey, I want to grab these Xgrid clients." And go from there. With this you have to install software on each system.
You have to make sure that certain things are running when you start it up. And, you know, a little bit more work is required to do that. And it's all command line based, of course. It doesn't do direct job submission of a binary. And I'll kind of show you on the next slide what that is.
On the next slide, what's required to get something running. And generally it's been my experience that it requires more maintenance than something like Xgrid. But, you know, that's because it's more complex. But, you know, you have to make sure that the daemon is running on each system, the Grid Engine daemon, and make sure that the queues aren't messed up somehow.
So most of the time it works. But, you know, it does require a little bit more maintenance. So here's an example of how to do a submission of a job with Grid Engine. I'm not going to read out the script there. But that's what it looks like.
So basically you have to create a script that we call a wrapper script that will submit your job. This is just setting up some environment variables for Grid Engine and then basically running it there at the end using that /usr/local/mpich mpirun line. So again, it's what we were talking about before, though, is that this script, you know, when you submit it you're not going to know what machines you're going to run on, so that's why you see variables there like the $TMPDIR/machines file. That way when the job gets run it gets handed by Grid Engine which nodes it can run on.
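Stripped down, that kind of wrapper might look like the sketch below; it assumes a parallel environment that drops a machines file into $TMPDIR the way the common MPICH tight-integration setup does, and the MPICH path, parallel environment name, and binary are all placeholders:

    #!/bin/sh
    #$ -N gromacs_test         # job name shown in the queue
    #$ -pe mpich 8             # ask the (site-defined) 'mpich' parallel environment for 8 slots
    #$ -cwd                    # run from the directory the job was submitted from

    # Grid Engine hands us the node list and slot count at run time.
    MACHINES=$TMPDIR/machines

    # Placeholder MPICH install path; adjust to wherever MPI lives on your nodes.
    /usr/local/mpich/bin/mpirun -np $NSLOTS -machinefile $MACHINES ./my_mpi_app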
And then to run the command, you basically do qsub and then the script name, and then it gets put in the queue, and then the resource manager and the job scheduler handle when to run it. And that's it. This script did come from the BioTeam website. They do a lot of integration with Grid Engine and OS X. So if you want to look at that in a little bit more detail, definitely go there. And next, to talk about frameworks and performance tools, is Yusuf.
Hi, my name is Yusuf Abdulghani. I'm from the Architecture and Performance team at Apple. As Apple makes the transition from PowerPC based systems to Intel based systems, we wanted to make sure that we provide you the right tools to make that transition easy and also provide you with tools which can actually give you better performance so you can use the underlying architecture to the maximum.
So with that in mind, I'm going to talk about a few tools that are available on the Apple platform. There are several others which I'm not going to talk about. These are the Accelerate framework, the Intel C++ and Fortran compilers, the Intel performance libraries and Shark. So let's get started with the Accelerate framework. The Accelerate framework is one-stop shopping for your computational performance.
It comes built in with every copy of Mac OS X and it is highly optimized and in some cases hand-tuned to run on the underlying architecture. It knows what you are running it on and then selects the highly optimized library to actually execute. It has got several libraries. The vDSP library has routines for digital signal processing. vImage contains routines for digital image processing.
There are BLAS Level 1, 2 and 3 and LAPACK routines, the vForce library, which is a new library that was introduced in Tiger, and we also have vMathLib. So how do you use the Accelerate framework? It's pretty easy. In your source code, you just do #include <Accelerate/Accelerate.h> and on the command line you provide the framework to the linker using -framework Accelerate. So what's new with the Accelerate framework in Leopard? It is 64-bit, and it's a four-way fat binary. So you can create either 32-bit or 64-bit applications and your applications can run on either PowerPC or x86 architecture. Having said that, the work on the Accelerate framework is not done yet. We are still optimizing it for the Intel architecture. And as we draw closer to the Leopard release, you will see much better performance improvements in the Accelerate framework.
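For example, a compile line along these lines (a sketch; the source file name is a placeholder) pulls in the framework, and the two -arch flags give you a universal binary:

    # blas_test.c would include <Accelerate/Accelerate.h> and call, say, cblas_dgemm.
    gcc -O3 -arch ppc -arch i386 blas_test.c -framework Accelerate -o blas_test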
The Intel C++ and the Intel Fortran compilers, these are some of the best compilers on the Intel architecture. The Intel C++ compiler has got ANSI/ISO support, or compatibility. The Fortran compiler supports Fortran 77, Fortran 90 and Fortran 95 features. And there are some Fortran 2003 features which are also included in the current version of the Intel Fortran compiler on Mac OS X. The Fortran compiler also has support for common extensions.
Here are some of the key performance features for the Intel compilers. Auto vectorization, this is really important because you don't want to spend time hand coding and making use of the vector engines on the Intel processors. Auto parallelization, we'll talk about these features in a little bit more detail in the next few slides. It has got full support for OpenMP and it is capable of doing whole program analysis through the interprocedural optimization, the -ipo switch.
It has a facility to do profile guided optimization and it also has certain switches to control the precision of the floating point arithmetic in your application. As you turn on and go to higher levels of optimization, a lot of compilers tend to go away from IEEE 754 compliance and give you lower precision. But if your application is dependent on the 754 precision model for floating point, then you can actually control that and limit the optimizations which change the precision of the floating point arithmetic using some of these switches, such as strict, precise, fast, or except.
So as far as compatibility goes, the Intel C++ compiler is source and binary compatible with Apple's GCC and G++ compilers for most cases. There's no support for Objective-C, but it is binary compatible with Objective-C objects. If you want to create universal binaries with the Intel compiler, you can do that using either Xcode or lipo. So you would create the x86 binary using the Intel compiler, and for the PowerPC binary you might want to use the GCC compiler.
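Sketched out, that two-compiler recipe might look like this (file and binary names are placeholders, and the exact optimization flags are up to you):

    # Intel compiler produces the x86 executable; Apple's GCC produces the PowerPC one.
    icc -O2 solver.c -o solver_x86
    gcc -O2 -arch ppc solver.c -o solver_ppc

    # lipo stitches the two single-architecture executables into one universal binary.
    lipo -create solver_x86 solver_ppc -output solver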
Some note on Xcode integration. The Intel C++ compiler fully integrates with Xcode 2.3 or later and it requires Mac OS X 10.4.4. The Fortran compiler integration is a preview release. The Intel's compiler team is working with the Xcode team to make that more streamlined and give full integration for the Fortran compiler.
There's a 30-day evaluation copy that you can download from the Intel website, give it a shot. And there are also some coupons that Intel is giving out for a 50% discount. So if you want to buy the compiler you can go downstairs to the performance lab, talk to some of the Intel engineers and they can hand you that coupon.
So let's talk about some of the compiler switches. What are the optimization switches that the Intel compiler provides? At -O1, this switch actually optimizes your binary for size. So if you have got a lot of applications which you're going to deploy on a server, you might want to use this switch because it optimizes for the size. -O2, this is the default optimization level and it optimizes for speed. So when you turn on this optimization the compiler actually looks at your loops and tries to auto-vectorize the loops and use the SSE, SSE2 or SSE3 instruction sets.
As you turn the optimization level up to -O3, this optimization actually optimizes your code for the data cache and it does a lot of loop transforms. One thing to be concerned about at this level is that you might not get the accurate 754-compliant FP arithmetic. So you might want to use another switch to make sure that your application performs correctly at this particular level.
-ipo, this instantiates the interprocedural optimization. The Intel compiler is capable of doing whole program analysis. One caveat with this particular switch is that your compile time increases a lot. So if you have a big application with lots of files and lots of objects, you will see that the compile time of your application jumps up a lot. But because it looks at the whole program, it can find much more optimization, much more things to do, and then optimizes your code really well.
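As a quick illustration, those switches compose on the command line like this (the source file name is a placeholder):

    icc -O1 kernel.c -o kernel        # optimize for size
    icc -O2 kernel.c -o kernel        # default: optimize for speed, auto-vectorize loops
    icc -O3 -ipo kernel.c -o kernel   # add cache/loop transforms plus whole-program analysis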
The profile guided optimization, the PGO, is also available for you to use. You can use it using the -prof-gen and -prof-use flags. This is very useful for branching code, but one of the problems with this, and this is like a double-edged sword, is that the output is going to be as good as the input.
So if you're using a training workload for profile guided optimization and that training workload is not really representative of what you're going to run, then it will give wrong hints to the compiler, and the resulting binary that you generate with the profile guided optimization might give you regressions in terms of performance. So you have to use this optimization carefully. But if the training workload is good and it is representative, you can get really good results for code that branches a lot.
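The profile-guided flow he's describing is a compile-run-recompile cycle; roughly, with placeholder names:

    icc -prof-gen app.c -o app        # step 1: build an instrumented binary
    ./app training_input.dat          # step 2: run a representative training workload
    icc -prof-use app.c -o app        # step 3: rebuild using the collected profile data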
-parallel is a new feature that they have introduced in the Intel compilers. Most of the products that Apple ships have multiple cores in them. And -parallel actually puts the onus on the compiler to parallelize your code. You don't have to do anything inside the code.
But again, I've played with it and sometimes it works fine and sometimes it doesn't. So you have to really test it out and see how good it is for your particular application. From HPC's point of view, because a lot of the time people use MPI to do parallel stuff, the -parallel switch is not really useful.
So try it out and see whether it works for you or not and then gauge the performance benefit of that. And finally, the Intel C++ and Fortran compilers both support the OpenMP standards. And if you have got OpenMP pragmas in your application, you can instantiate and make use of that using the -openmp switch.
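If your source already carries OpenMP pragmas, enabling them is just a switch and a thread-count variable (the file name is a placeholder):

    icc -openmp omp_loops.c -o omp_loops   # compile the OpenMP pragmas into threaded code
    OMP_NUM_THREADS=4 ./omp_loops          # choose how many threads the parallel regions use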
So let's talk about the Intel performance libraries. Intel ships two types of performance libraries on Mac OS X: the Intel MKL and the Intel IPP. The Intel MKL, or the Math Kernel Library, has got routines for scientific, engineering and financial applications, such as BLAS Levels 1, 2 and 3, LAPACK, sparse solvers, FFTs, vector math and random number generators.
On the other hand, the Intel's IPP library contains image processing, digital signal processing, cryptography, video, audio, speech coding and speech recognition software. Both these libraries are highly tuned and optimized for the underlying Intel chipsets and the chips and they are also multi-core aware. So you can control on how many cores you want these libraries to run and they can give you better performance.
So when we talk about LAPACK and LINPACK, well, LINPACK numbers, it's obligatory for us to show these. So what we have here, I ran LINPACK numbers on a 2.16 GHz Core Duo processor. Now this is one of the Apple 17-inch laptops with 2 GB DDR RAM. On the X-axis we are ranging the matrix size from 1000 by 1000 to 12,000 by 12,000. And on the Y-axis we have got measurements in megaflops. So the first curve shows the results for Intel's MKL using single precision. So it gives you about 4.2 gigaflops on a 2.16 GHz Core Duo processor. And this is only using one core of the processor.
If you run the same benchmark with Apple's Accelerate framework you get something close to 5 gigaflops. A pretty good improvement over the Intel MKL library. If you are interested in looking at double precision, you get about 1.4 gigaflops using the Intel MKL library, and with Apple's Accelerate framework it is pretty much the same. There's not much performance difference.
So depending on your application and what kind of primitives you want to use, you might want to choose the Accelerate framework or the Intel MKL library. Moving forward, once you write your application and you have ported it, we give you a very nice tool called Shark to actually profile and identify performance bottlenecks.
So what is Shark? It's a simple and fast profiling tool. It identifies performance problems in your code. It works with a wide variety of languages, C, C++, Fortran, and any compiler which is actually capable of generating STABS or DWARF debug symbols can be used with Shark.
We give you a GUI as well as a command-line Shark, and the command-line Shark is really good for scripting purposes. Shark is part of the CHUD tools. The CHUD tools also have applications like BigTop and Reggie and Processor Control Panel, which are really useful tools, and the most important part about CHUD is that it's available for free.
So what's new in Shark? We've spent the last year working with Shark and improving it in several aspects. One of the things that we have done is you can do 64-bit profiling not only on PowerPC based systems but also on Intel based systems. We have got UTF-8 support and we have added DWARF and dSYM file support, and one of the features is Symbolication. I'll talk to you about Symbolication in a later slide. And finally, the Windowed Time Facility. Again, I'll explain what that means.
So complete 64-bit support: the Shark binary itself is universal and you can also profile your universal applications. It shows symbols, code and Leopard frameworks above the 4GB line. It has got EM64T support, you can look at the assembly in either Intel or AT&T syntax, and it has also got an integrated ISA reference. So those of us who are not familiar with the Intel assembly language can actually learn what those things mean.
If you have a binary which is compiled with DWARF information then you can use it with Shark. And it works with 32 and 64 bit binaries. You can mix and match STABS and DWARF information to look up the symbols. And Shark also provides dSYM support. Symbolication, this was one of the features requested by our developers.
A lot of times when you ship your binary or you strip away the symbols from it, when it comes to, say, if it goes to the QA or your user base and they find any performance issue, they will actually take a profile and probably send it back to you. But the profile that they send back to you does not have symbol information. It's all addresses.
So Shark allows you to actually reference a symbol-rich binary and then automatically apply symbol information after the fact to the profile that you got, so that now you can see symbol information in the profile that you have received which did not have symbol information. So that is Symbolication in Shark.
[Transcript missing]
Okay, so we're just going to wrap up here and then we'll have some time for Q&A. The window time facility has the best name ever because you know once you're sitting there running and you want to know what just happened you can press the WTF button and then find out, right? So anyway, so for more information about some of the stuff that we've talked about today, if you have questions about the stuff that Yusef was covering you can send some email to the performance tools group.
Warner and myself both handle basically cluster pre-sales. I handle the business side and Warner handles the education side. If you're not sure we'd be happy to direct you to the right people so email either of us if you're international or you know research funded and aren't quite sure where you classify in. We're friendly, we know each other. So we'll be happy to get you to the right person.
If you're new to clusters this was probably pretty intimidating going through some of this technical information but we've got a couple good resources for you. These are kind of long URLs but you can find them on the web. You can find both of these off of the just general apple.com/science site which is kind of a collector site for all of this but there you'll find two PDFs.
One which is evaluating, acquiring, and deploying clusters, which is kind of a soup-to-nuts walkthrough of everything that we talked about today at an even lower level. And then you also can find the high performance computing solutions PDF, which is kind of a cookbook that takes you through the whole process of setting up a cluster. Literally at the screenshot level showing you everything that you would need to do.
and a couple of those resources I talked about. There are a couple of mailing lists, the SciTech list and the HPC list. You can go to lists.apple.com and find those in the master list there. Those are both good, fairly high volume lists. And then as I said the apple.com/science collector site gets you to everything having to do with clusters.
And to end this up, some related sessions. We're towards the end of the conference but there are a couple more things going on. There's developing and porting Unix applications on Mac OS X which is happening right after this. And then tomorrow there's develop and deploy with Xgrid 2 which is actually right here in this room. A couple other things, Apple in the lab, Apple Science Connection are ongoing.
And also if you wanted to see the new Xserve, I'm sure there's a few floating around the conference but one place I know there's one is down in the IT lab. That does close at 5 o'clock today. It's not open tomorrow so if you want to see it you're going to have to go now and go see it after this show.