Enterprise IT • 53:29
Building and deploying a high performance cluster, or any large installation, requires a rigorous, repeatable methodology, edge-of-the-envelope system tuning, and hard-core troubleshooting skills. Learn what you need to survive such an installation from those who have.
Speakers: Dave Lopata, Josh Durham
Unlisted on Apple Developer site
Transcript
My name's David Lopata, and I'm here with Josh Durham. I work at Apple; I'm an area consulting engineer based out of Atlanta, and I focus on high performance computing and server and infrastructure stuff. And then Josh Durham is from Virginia Tech and is the root user of the Terascale cluster there. So he manages the Virginia Tech Terascale cluster.
And today we're going to talk about, the title I should say is a little bit incorrect. There's a couple of typos, so to speak. The word administration really should be installation and deployment, and large installation should probably be gone. So we're really going to focus on high performance computing. A lot of what we talk about will apply to large installations in general, particularly the first half where I focus on the physical layouts, but we really are focusing on the high performance computing clusters.
So in terms of an agenda, we're going to first sort of go over a cluster computing 101, and then talk about the things you might do to architect an appropriate cluster for your specific problem. We'll go over some of the physical deployment considerations, and then I will switch over the balance of the time to Josh, who will talk about building a basic cluster framework and running LINPACK, and why that's important.
We'll also talk about testing the various cluster components to make sure that... the system is up and stable and performing the way you would like it. We'll go into job scheduling actually at that point, and then jump back to tweaking OS X for performance and trying to eke the most performance out of the system.
Again, I'll be covering the first half, specifically the physical considerations. When we go through these things, hopefully none of them are earth-shattering or things you didn't really consider. I'm really hoping this is more like enhanced common sense, where you look at these things and say, "Oh yeah, I may not have thought about that, but I know I should have." And then we'll provide that sort of as a basis for creating a cluster.
So to step back a little bit and go over what cluster computing really means to us in this specific session, we want to talk about, first off, what we're not going to cover, and that's the highly available or scalable clusters that you might use in an IT infrastructure environment.
These will not be covered because the specific things that we're bringing to the table here in terms of techniques don't generally apply. And also, even in a large-scale, high-availability or scalable cluster, the number of nodes we're talking about aren't as large as you might see in an HPC environment.
So again, high availability. We're not going to discuss this, but just so you know what this is. This is basically where you have two systems providing a redundant front end for a service, whether it's an Oracle database or some other service that is business critical to your environment. And the idea here is that we eliminate single points of failure by introducing redundant systems and networks with redundant paths and heartbeats to the environment.
So pretty straightforward system. Tiger, as an OS, has a lot of features built into it to allow for this kind of environment. So there are really exciting things you can do with an HA kind of environment with Tiger. But again, this is not specifically something we'll discuss today.
And then the idea of scalable clusters. Again, things like Oracle RAC and also web services that might have load balancers in the front where, essentially, most of your data is static and can be handled by a bunch of stateless devices or tied to a shared storage environment in the back end.
So some of the features or procedures we'll be talking about today for deploying HPC clusters do apply here. But again, it's sort of a one-off. At the end, we will have Q&A, so while these are sort of off limits from the slides' perspective, if you have specific questions about this, we'll be happy to entertain them. And we will answer every question, provided you accept "I don't know" as a possible answer. And then we'll just move on from there.
So computational clusters. Basically, we break this down into two types of computational clusters. The first is a traditional high performance computing cluster. In this environment, really, the computer is the solution. So folks that are involved in this space are developing their own code. They're selecting an architecture that is interesting or appropriate for that type of code. And then they're modeling and building and designing their software around that. So they pick it based on a hardware architecture or a performance characteristic of a hardware architecture, and then build up from there.
The second kind is the solutions-based architecture. And we have a great story here with the Apple Workgroup Cluster. And here, the software really defines what we're trying to deliver. And it's really about time to results and making sure that the software is robust and flexible on the platform that you're using. So the software drives the solution. And from there, you build out the appropriate cluster infrastructure that provides the best experience for the user.
So if we look again at sort of the generalized needs of the high performance environment, essentially good double precision floating point is critical. Traditionally here we also see the high speed interconnect, so Myrinet or InfiniBand become very interesting in this space. And of course we need strong compilers, because again the researchers here are actually developing and designing their own code. So a strong compiler environment to be able to bring good performance to the algorithms that are being used here is essential. And this is what a typical layout might look like for an HPC kind of environment.
In the solutions-based environment, essentially the cluster is a black box for submitting jobs. We have a good story here again with the Workgroup Cluster for bioinformatics. The idea here is that this is designed to be a product and not necessarily a project. The back-end infrastructure of the cluster becomes less important to the researcher, and it's really about time to results, being able to plug something into the wall, get your research done as quickly as possible, and get the research done and submitted. And however we get that accomplished on the back-end is really not of interest to the specific researcher.
[Transcript missing]
So, those are the types of architectures that you might have. Then the key here is figuring out exactly how you go about designing and building the cluster that's appropriate for you. So the key thing is to figure out exactly what your goal is here. It may be a specific application, for example. You may be building it for a solution like BLAST, or you may be trying to target a specific performance benchmark. In this case, we're going to discuss HPL, the High Performance LINPACK.
Once you've got that, you need to figure out exactly what kind of performance you're attempting to achieve. So, it might be that you're trying to hit 2.0 teraflops, again, using the LINPACK experience. Once that's done, you really need to pay attention to exactly what you're going to do after the fact.
There are a lot of times where the cluster is, and always will be, a cluster environment, but there are also times where you may need to do redeployment over time. So it may be that at the end of a research project's life cycle, the systems need to be redeployed for other uses, whether that's infrastructure services or maybe a lab environment, etc.
So, trying to figure out what you're going to do tomorrow helps you decide what architecture you're going to use to build the cluster today. Once that's done, that really gives you the opportunity to select the specific platform that you would want. So, for example, an XServe G5 is an excellent system if you're going to have a dedicated cluster. If it's a short-term research project that you're going to turn over later on, you might also consider Power Macs, systems that can be deployed into a lab environment later on if you want to be able to repurpose them.
From there, we have to figure out the storage needs of the system. This generally happens by discussing with your researchers exactly the kind of storage needs they're going to have: whether it can be dedicated storage on a single shared system, whether they need high-speed access using something like XSAN to communicate with a bunch of high-speed nodes, or whether they need a more distributed environment with lots of local scratch space, for example.
The next thing is to figure out, well, what are we going to actually use to tie this thing together as an image? So we may be using MPI. There's been a lot of discussions about OpenMPI here at the show, and there's a variety of MPIs that are available for Mac OS X. We may be creating a PVM environment, or we may be using something like XGrid's BEEP protocol to exchange information across the network.
Once we've figured that out, the next key thing is to figure out, well, what the interconnect is. And this ends up being extremely important, because the interconnect can be a very large percentage of the actual cost of the system. So we want to take care to make sure that we select this wisely. When we look at the interconnect itself, there are a few deciding factors.
Basically, it's the old price performance decision that we have to figure out. Gigabit Ethernet is by far the cheapest solution. The XServe G5 ships with two gigabit ports on the motherboard, so it's a very convenient way to create two networks, one dedicated for the communications network, and then a second network for back-end services. So you don't need any additional HBAs for that, and it's just there out of the box.
In terms of performance, when you're looking at needing a low-latency interconnect, both Myrinet and InfiniBand provide very low latencies versus Ethernet. So those are the two candidates on OS X that you would generally be looking at for a low-latency interconnect. In terms of throughput, Myrinet advertises about 900 megabytes per second with dual-ported HBAs, where InfiniBand can scale up to 10 gigabits per second per port on a dual-ported HBA. And I mentioned there, of course, that exceeds the PCI-X specification that we ship on the XServe, but it does create a nice future-proof interface.
In terms of what's available on OS X, both are available and very robust on OS X, and we have a growing number of deployments in each. Currently, Myrinet tends to be the leader in terms of the number of systems that we've currently shipped and deployed, and is probably the most popular high-speed interconnect on the Top 500 list as it exists today.
So again, what we're going into now is the physical deployment considerations, and hopefully this ends up being essentially enhanced common sense. Hopefully there's nothing really earth-shattering here, but we want to cover these kinds of things, because when we go through with customers and walk through the deployment cycle, this is where the biggest pains come from our customers. The physical considerations that tend to either get ignored or overlooked during the purchasing cycle cause the biggest headaches for our customers long-term.
So in terms of racks, the XServe can go into either a two-post or four-post rack. We strongly recommend a four-post cabinet, of course; the XServe RAID can only be configured into a four-post rack. You'll notice that the sizes for the various racks, the depths, actually do not exactly overlap for your XServe RAID and your XServe. So you want to make sure that if you're building out an infrastructure, you're getting appropriately sized racks for your XServe RAID as well as your XServe.
In terms of mount order, if you're using a fairly small cluster, so you've got maybe one or two racks, you want to make sure your UPSs are physically on the bottom of the rack just because of the physical weight considerations, and then you'd put the XServe RAIDs on top of that. That alone will save you a lot of pain and suffering in terms of back injuries and things like that. Then, on the top of the rack, you'll basically start adding your cluster nodes and your head node from the top down to the middle.
All your infrastructure hardware, your gigabit Ethernet, etc., would go essentially in the center of the rack. For larger deployments, we would recommend considering a center rack that has basically your infrastructure services, your gigabit ports, your gigabit switches, and your Myrinet switches, and things like that, and then fan out the cluster nodes on the sides of that, because as you're building this infrastructure, particularly with a high-speed interconnect like Myrinet or InfiniBand, the cost of the cables is directly related to the length of the cable.
So having a central point where these feed into reduces the cost by lowering the physical length of the cables. In terms of tools and parts, you want a lot of batteries for your screwdrivers. You're putting in a lot of systems. Generally, you do need two people installing this, one on the back of the system, one on the front. You'll find you can do an XServe RAID in just a couple of minutes if you've got two people doing it and you get into a good rhythm. It's a pretty straightforward process.
But it's something you've just got to get used to. It becomes more of an art than a science in many respects. One thing that we run into a lot is in terms of which screws to use. I know this sounds really simplistic, but you've got a bunch of bags of screws around. Some of them are metric. Some of them are English measure. Make sure you're using the right screws.
So if you're attaching anything physically to the XServe or the XServe RAID, use the screws that we provide. If you're attaching anything to the actual racks, make sure you're using the rack manufacturer's screws. And that will save you a lot of headaches. And again, it's sort of enhanced common sense, but we have a lot of folks who overlook that and have a lot of headaches associated with it.
In terms of power and environmentals, this is again probably the number one issue we have in deploying a cluster. Folks will buy a large, reasonably sized cluster and then attempt to deploy it in a closet with no heating or cooling consideration. These are the basic numbers. The Knowledge Base articles at the top are kept up to date as we release various revisions of the XServe and the XServe RAID. And I won't go over the numbers directly, but you can see sort of how they progress. And the XServe RAID obviously pulls a lot of power and generates a lot of BTUs.
If you look at that in some typical cluster configurations, these are estimates. If you're planning on deploying a 32 or 64 node cluster, please do not use these as gospel. Make sure you measure it yourself. But this is the kind of power draw and BTUs that we're looking at.
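As a rough back-of-the-envelope check, you can turn per-node wattage into a heat load yourself. A minimal sketch, where the node count and per-node wattage are placeholder numbers rather than measured figures, is:

```sh
#!/bin/sh
# Rough heat-load estimate: watts x 3.412 ~= BTU/hr.
# NODES and NODE_WATTS are placeholders; use the figures from the current
# Knowledge Base articles or, better, your own measurements.
NODES=32
NODE_WATTS=350
TOTAL_WATTS=$((NODES * NODE_WATTS))
BTU_PER_HOUR=$(echo "$TOTAL_WATTS * 3.412" | bc)
echo "Estimated draw: ${TOTAL_WATTS} W, heat load: ${BTU_PER_HOUR} BTU/hr"
```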
The 1,100-node cluster here is, of course, the Virginia Tech Terascale cluster. So 310 kilowatts and just under 2 million BTUs in terms of heat. So again, you want to keep this in mind, because as you add nodes, this gets to be a pretty significant number. And with that, I'm running very, very early, so I'm going to turn this over to Josh from Virginia Tech to cover the software aspects. Great. Thanks, Dave.
So I'm going to go over some of the software side of things on what's involved in setting up a cluster. And we're going to go over some different terms. There's different kind of roles that we have of systems in a cluster. And we're going to start out with the service and head node setup.
So a service node in a cluster is basically providing infrastructure. It provides different services like DHCP, DNS, LDAP, the common storage. Basically all the things that the compute nodes are going to need to be able to run, the service nodes are going to provide. This can be one system, or if you have a larger cluster, maybe larger than 32, you might want to think about putting this onto multiple systems.
So the Service Node Setup, the first thing we have to do obviously is install the OS. I recommend installing basically the full OS with everything added. That way if later you find out you need something, it's already there, so nothing's going to be missing. So install the base OS, install all the developer tools, and basically just put everything on that you get in the box. There's different ways of doing this.
One of the problems in clusters is that often the systems won't have a video display or a keyboard or mouse. So if you are lucky enough to have that, you can do your standard install. You boot from the CD, stick it in, just go through the standard install like normal. If you can't do that, you have a couple of other options.
What you can do is put the XServe in FireWire Target Disk Mode, get a laptop, and basically use your laptop to install onto that hard drive. And that works out pretty well. A third thing you can do is do it remotely with the CD and Server Assistant.
So what Server Assistant will let you do is, basically, act like you're doing a GUI install, but you're actually installing the system across the network. And finally, you can do a network install, which is basically: set up a system, have all the services on it, and then deploy that image across the network. And we'll go into that in a little bit.
So we're going to go over some of the services that I think are very important for installing a cluster and for these infrastructure servers. The first one is DHCP. So DHCP, most of you know, is a way of dynamically allocating IP addresses to systems. And we're going to use this to allocate addresses to the cluster nodes.
Well, we don't want it to be quite dynamic, because the problem we're going to have is that if you don't know which cluster node has which IP address, and you have a problem with that cluster node, it's going to be a lot harder to find and diagnose an issue.
So if you have a system that might be acting unusual or weird, you want to make sure that the third system down has the third IP address that you've assigned. Otherwise, it could be the fifth, and you'll spend some time just trying to figure out which system really is causing the problem. So obviously, this is set up through server admin, which is part of the OS X server tools.
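Server Admin is where those static assignments actually get entered, but it helps to script the plan first. A minimal sketch, assuming you've collected the MAC addresses into a file in physical rack order (the file name and subnet here are made up, and the scheme assumes the cluster fits in one /24):

```sh
#!/bin/sh
# macs.txt: one MAC address per line, in the physical order of the nodes.
# Produces a "node name / MAC / IP" table to enter as static maps in
# Server Admin, so the Nth system always gets the Nth address.
i=1
while read mac; do
    echo "node$i  $mac  10.1.1.$i"
    i=$((i + 1))
done < macs.txt
```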
The second service that we provide is DNS. And DNS is -- DNS and DHCP I kind of put in the same basket, because DHCP will give out the IP addresses, and then DNS lets the infrastructure nodes, the service nodes, and the head nodes basically find each other and work together.
So through our experience, we found with the original System X, when we did the Power Macs, we had node names of n0001 through n1100. And it turned out to be a scripting nightmare, because we'd have to pad out the node name, and regular expressions just wouldn't work. So when we did the second System X with the XServes, we switched to a naming scheme of basically "node" and then just the number.
So don't try to pad it with the zeros there. And we always get into this debate, the computer scientists at Virginia Tech want to start with the node number being zero. And the system administrators wanted to start out with one. And so I finally won out on that one, because I was the guy who got to do it. Again, this is set up using Server Admin. Pretty straightforward. The only thing that you need to make sure of is that the IP addresses and the names here correspond to what you also did in DHCP.
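To see why the unpadded names are friendlier, compare the two schemes in a quick loop; the zone itself is still maintained through Server Admin, this is just an illustration:

```sh
#!/bin/bash
# Unpadded names: trivial to generate and easy to match with simple patterns.
for ((i = 1; i <= 1100; i++)); do echo "node$i"; done

# Zero-padded names: every script has to carry the padding width around.
for ((i = 1; i <= 1100; i++)); do printf "n%04d\n" "$i"; done
```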
So shared storage, this has kind of been an issue of debate on the cluster stuff, with how much it can scale in OS X and which one you should use. So at Virginia Tech, we use NFS, and we actually do have all 1,100 systems. Actually, each NFS server has all 1,100 systems attached to it. So definitely no problem scaling there. And I think we've been doing that since 10.3.6, so we haven't even looked at how to do that with Tiger yet.
But with NFS, it's going to basically provide this common store across the entire cluster. And what that will let us do is that, instead of like before, if you saw the XGrid demonstration, when you submit something to XGrid, XGrid handles taking those jobs, submitting it to all the nodes, and moving the files to those nodes, the nodes work with them, things get moved back, and then downloaded to the agent. Well, we don't need to do that, because we're going to do a shared NFS. So instead of having to transfer these files back and forth, the files are already directly accessible from each node.
So before, I said that on larger systems you want to break out the roles that each service node does; NFS is probably the first one you'll want to break out, because it's probably the most demanding of all the roles. Again, this is set up using the OS X Server tool Workgroup Manager: basically just set up an NFS export, and then the clients can pick that up.
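Once the export is set up in Workgroup Manager, checking it from a node is straightforward; the server name and paths below are placeholders:

```sh
# See what the service node is actually exporting.
showmount -e service1

# Mount the shared area by hand to verify it before baking it into the image.
sudo mkdir -p /cluster
sudo mount -t nfs service1:/export/cluster /cluster
```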
Finally, the last service I'm going to talk about is Open Directory. Open Directory basically gives you a common user view across the entire cluster. And this is important because, when you submit a job, you want all the compute nodes to see the same user IDs and the same access control information.
What I love about Open Directory, and we do use this at Virginia Tech, is it really does take only about 10 or 15 seconds to create a new user. So it's very easy, very quick. You do it in one place, and then users can access all the systems once you've done that. This is done using Workgroup Manager: you do the setup using Server Admin, and then you do the user maintenance in Workgroup Manager.
Another thing that you can do is do the system monitoring, also from a service node. So I think a lot of people have probably seen the XServe system monitoring. One thing I have to note is that you really need to have an XServe or an XServe RAID to do system monitoring. The hardware is just designed to provide a lot more information.
It provides all the thermal information, all the CPU power usage, for example. I can't even remember how many different metrics there are. I think it was over 22. Obviously, you can't get that if you're trying to install a cluster on a Mac mini or something. I don't think it provides all that information.
So what's great about Server Monitor is that you can set it up to email out on certain events. So you can say, if a CPU temperature gets out of a certain range, email someone. So you can have a system administrator or someone get paged, or an SMS type thing happen. And that's really helped us integrate into our call center. We have a 24-hour call center at Virginia Tech separate from the system. And that's really helped us be able to basically monitor the system after hours.
So that was the service nodes. What we're going to talk about next is the head node. Head node's a little bit different because it's basically what the users have access to directly. It's probably the only systems they should be having access to directly. The service nodes are kind of more of the system administration type stuff, and you don't usually have users logging into that. But the head nodes basically provide this bridge, or this portal, that allows people from the external network to get in and use the resources of the internal network.
So it really should be the only system that is on both networks. Most of the times in the common cluster setup, people will be using SSH to log in. They can use SCP to move files back and forth from the cluster. So on the head node, users will often compile their programs and they will run their codes directly from the head node.
So the head node's in charge of getting the jobs out so people can run on the entire cluster. And again, on small clusters, maybe smaller than 32, it might not make sense to break this out into a separate system, just because it's kind of a waste of a system on a small cluster to have head nodes and service nodes, and then all your computational nodes.
With the compute nodes, I'm going to talk a little bit about how to deploy that to your cluster, how to deploy a common image across your cluster. Basically, a compute node is the workhorse of your cluster, so you shouldn't be doing anything but running user codes on this. It shouldn't be doing any other services or tasks, so all the services that you don't need should be turned off. So basically, we start with a basic image.
And we may install the developer tools on that basic image. And we configure it with DHCP on the first interface, because when you want the cluster nodes to come up, you obviously want them to use the infrastructure that we just talked about. And again, I should mention that the easiest way to probably do this is to kind of set up one system exactly the way you want it.
So again, with basic image and DHCP. And then we're going to install an SSH key, which is going to allow us to do passwordless login. And that's going to be really important, because when you want to run something across the entire cluster, you don't want to be prompted for a password 16, 15, or 1100 times. That would take a lot of time.
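A minimal sketch of setting up that passwordless login, assuming the same key gets baked into the golden image (the key type and paths are just the usual defaults):

```sh
# On the head node, generate a key with no passphrase (done once).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Authorize it; this authorized_keys file then goes into the compute-node
# image (or the shared home directory) so every node accepts the key.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```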
And then you want to make sure that the cluster nodes are going to be using all the services we had just set up. And finally, something I like to do, because I don't like to mess with the compute nodes too much after I've set them up, I don't like to log in and change scripts and modify things, is we create a compute node startup script that's kind of going to refer to a centralized directory and a centralized script. And I have an example of one coming up here in a slide or two.
So how do we get that image onto the cluster? Sorry, skipped ahead. Basically, on the image, and this is kind of what we were talking about before, we need to install the server OS and developer tools. Don't enable any services except for maybe ARD. If you're uncomfortable with doing SSH to manage your systems, ARD is a very good way to do it. And I think there's an ARD session tomorrow if you want more information about that. You want to make sure the image has all the updates that are relevant for your situation.
And you want to make sure that the image has any software that needs to be local. So any libraries, such as MPI, or maybe your drivers, like Myrinet GM or InfiniBand, all need to be local. Also, I know in the past, if you installed IBM's XLF, you had to install certain libraries on each system as well. I think it was their runtime library or something like that. So basically, you set up the image exactly the way you want it.
This is the startup script that I was referring to before. We basically want to achieve two things here: we want to mount the NFS server, and then we want to run the centralized script that I talked about. The centralized script is sitting on the NFS server under etc/nodes.start. And the reason we do that is, let's say we want to install or start up InfiniBand across the entire cluster.
Well, instead of going in and logging into each system and modifying the local startup script, you just change the one central script. That's important later because, let's say you wanted to try a new InfiniBand driver out, but you didn't want to toss the old one out. All you've got to do is put the driver in a central location, update this one script, and reboot your entire cluster.
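A minimal sketch of that local stub, with placeholder server and path names; how it gets hooked into boot (a StartupItem on 10.3, a launchd job on 10.4) is a separate step:

```sh
#!/bin/sh
# Local startup stub on every compute node: mount the shared area, then
# hand off to the centrally maintained per-boot script on the NFS server.
mkdir -p /cluster
mount -t nfs service1:/export/cluster /cluster

# Updating this one central file changes what every node does on its
# next reboot (new drivers, extra daemons, and so on).
if [ -x /cluster/etc/nodes.start ]; then
    /cluster/etc/nodes.start
fi
```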
Alright, so now we're going to talk about imaging. There's a few different ways to do it. So we have the golden image, and the golden image is set up exactly how we want it, and it looks exactly how a cluster node is going to be. We want to first make a disk image out of that, and I use ASR.
You can also use the Disk Utility that comes with every Apple system to basically make a disk image. The easiest way to do that is to put the golden client into FireWire Target Disk Mode, and from another system, like a laptop or a desktop, you can then just create a disk image from that.
An application we like to use, and we used a lot with the first cluster especially, is NetRestore, which Mike Bombich wrote. And what's great about NetRestore is it lets you drop a cluster image over the network, and it actually has full automation support so that you can just say, do a NetBoot and walk away, and it should drop the image on the XServe and reboot it.
So then the only real step there is that you have to make sure that the server boots up and does a NetBoot, which you can do from the front panel when you're first powering it on. I'm not sure if you've done that, but the XServe has a way of basically, right as you turn it on, saying, I don't want to boot from the local drive, I want to boot over the network.
It's similar to like holding down the N key on a PowerBook or a PowerMac or something like that. So you can do that completely on the front without having to have a keyboard or a video thing. And that works out really well. And that's how we did a lot of deployments.
Finally, you can do the sneakernet method. What I love about the XServe is it's got the nice pop-out hard drives. So you can just pop out the hard drive, walk over to an imaging system, stick the hard drive in, and just image that drive. Another great thing about the XServe is that the drives are hot-swappable, so you don't have to power down the imaging XServe, stick the drive in, power it up, image, and then cycle it again. You can just pop it in, it automatically mounts, and then you can asr it.
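The command-line version of that workflow looks roughly like this. The device and volume names are examples, and the asr flags shown are the Panther/Tiger-era single-dash form; later releases use the "asr restore --source ..." syntax instead:

```sh
# Capture the golden client (in FireWire Target Disk Mode, showing up
# here as /dev/disk1 -- check with 'diskutil list' first).
sudo hdiutil create -srcdevice /dev/disk1 golden.dmg

# Prepare the image for fast block-level restores.
sudo asr -imagescan golden.dmg

# Blast it onto a target drive (the popped-in sneakernet drive, for example).
sudo asr -source golden.dmg -target /Volumes/node-disk -erase -noprompt
```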
Alright, so after imaging, we have the golden image on all the cluster nodes. We have all of our infrastructure services set up, so it should have the common storage, common authentication set up. Should have the portal node or head node set up and the service nodes. Cluster nodes should have their DNS set up, and everything should be resolving with DNS just fine. We do the common file script, which makes doing driver deployment and upgrades much easier. And then we also have that passwordless key, which I'll go into why you want to do that in just a minute.
So there's a couple ways of making changes or modifying how your compute nodes work. The first one is to edit that central script, and you can kind of drop some things that might make a change or a fix on the next reboot. Like, for example, what I was saying before, where you might want to load a different driver on a reboot.
Another command that we use a lot is pdsh, and that stands for Parallel Distributed Shell. What that lets you do is run a command in parallel across your entire cluster. So in the example above, we're running a command on nodes 1 through 999 with a timeout of 30 seconds, and we're just telling it to reboot. But you could do anything.
You could have it copy files. You can change system settings. Anything that you can do by the command line, you can do with this command. And finally, we should be able to run basic MPI jobs, since now all the systems have MPI on them, and all the services required for that.
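To make the pdsh piece concrete, a couple of invocations, assuming it was built with the ssh module and the usual bracketed hostlist syntax:

```sh
# Reboot nodes 1 through 999, giving each connection up to 30 seconds.
pdsh -R ssh -w "node[1-999]" -t 30 reboot

# pdcp (ships with pdsh, and must also be installed on the nodes)
# copies a file out to every node in parallel.
pdcp -R ssh -w "node[1-16]" /etc/ntp.conf /etc/ntp.conf
```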
So one of the problems, once you have your cluster up, is you really need to make sure that everything's working as expected. At Virginia Tech, we had been rushing so much to get to the benchmarks. We were on such a time frame that for the first three nights we just turned everything on and kept trying to run these benchmarks across all the systems. And it would continuously fail, because one system will basically mess up your whole run. So everything has to be stable and reliable for you to be able to do that.
So the first thing we need to do is we want to test each system out. And what we did is we used LINPACK, or the HPL benchmark, to kind of stress test each system. I kind of chose LINPACK because it pegs the CPU, it can use a whole lot of RAM, and you can have it run for a really long time unattended. So those met kind of some of the requirements that we needed to kind of stress test the systems locally.
So we run a two CPU job, or two CPU test, per XServe, with P being 1 and Q being 2 CPUs. And we pick a matrix size that's about 80% of physical RAM. If you don't know what HPL, or the LINPACK benchmark, is, it's basically a linear algebra problem solver. And we want to create a matrix size that uses most of the memory, because we want to stress as much of the RAM as we can, so we can see if there are any errors with the RAM.
And we want to run multiple tests. The reason we want to do that is because we want to make sure things get nice and warm inside the XServe, because some of the problems don't develop until they've been running at kind of a warmer temperature for a while. So for example, I think on a 4 gig XServe, the most you can have a single run be is about 8 minutes, so adding multiple sets, like in this case 4, you can test it out for much longer.
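The 80% figure translates into a problem size like this; the 4 GB number is just an example, and the exact HPL.dat layout is documented in the HPL distribution:

```sh
#!/bin/sh
# Each matrix element is an 8-byte double, so for ~80% of memory:
#   N ~= sqrt(0.80 * mem_bytes / 8)
MEM_BYTES=$(echo "4 * 1024^3" | bc)            # a 4 GB node, for example
N=$(echo "sqrt(0.80 * $MEM_BYTES / 8)" | bc)   # ~20700 for 4 GB
echo "HPL.dat: set Ns around $N, Ps = 1, Qs = 2 (one XServe, two CPUs),"
echo "and list several problem sizes so the run keeps the node hot for a while."
```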
The way you really find out if a compute node is acting strangely is just to kind of compare it to the rest of the systems. So let's say you're testing across 32 systems, and 31 of them may be finished in 400 seconds, but that last one may be finished in just 600 or 700 seconds. You have a pretty good idea that there's something not quite right with that one.
And there's a few things to try. Obviously, one of the things you might want to do is just try to reboot and see if that kind of clears up the error. Sometimes a service might get stuck in a weird state, or may not come up all the way.
If the rebooting doesn't fix it, then you might want to look at hardware issues. One of the more common ones that we see is memory errors or memory failures. When the XServe boots and detects a bad DIMM or something like that, it will actually disable that DIMM.
So the nice thing about having a test that uses almost all of what we think is the full RAM is that we can really make sure all the RAM is being exercised. So if an XServe came up, and let's say it had 4 gigs and found a bad DIMM, so it really came up with 3.5 gigs of available RAM, it's going to start to swap a lot when you run this test, and the swapping is going to create a huge increase in how long it takes.
So again, you can check for ECC errors by looking at the system logs. It's in /var/log/system.log, and it's also in the hardware monitor daemon's log on the XServes. But it might not just be memory. We've seen one or two bad CPUs where the system will work fine, but the CPU will perform inconsistently. And then finally, it could be logic boards.
Usually a logic board error is a little bit different, because the system is just not reliable. Usually it will crash, and on the XServes, when the system crashes, it can actually reboot itself within five minutes; that's the watchdog on OS X. So if a system crashes during your test, then you might want to look at the logic board or the CPU on that one.
So between tests, you want to replace the DIMMs that have the high ECC errors. And really, when I say high, ECC errors are going to happen. I mean, at Virginia Tech, we have 8,800 DIMMs. We're going to have ECC errors all the time. What you want to look out for is the stuff that's persistent, especially the stuff that's persistent across reboots.
So if I do a bunch of tests, and I see one system that has ECC errors, what I'm going to do is I'm going to reboot all the systems and run the tests again. If that same system is getting ECC errors, then I know I definitely need to replace that one DIMM.
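A quick way to do that sweep across the whole cluster, assuming the passwordless keys and pdsh from earlier (only system.log is checked here; the hardware monitor daemon's log name varies by release):

```sh
# Count ECC complaints on every node; a node with a persistent, growing
# count across reboots is the one whose DIMM gets replaced.
pdsh -R ssh -w "node[1-32]" 'grep -ci ecc /var/log/system.log'
```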
And I can't stress this enough: you really should keep very good records of problems you've encountered. An example I can give you is that you might replace a CPU on a system when you thought the CPU was bad, pop the system back in, and everything works fine.
Later down the road, the system starts to develop errors. So unless you kind of kept a log of what the problem was, you might think, hey, it's a CPU, and replace the CPU again. So keeping a good log is good, especially when you start getting larger systems. It becomes very critical to do that.
So after that, we should have all the nodes working well individually. So now we need to start stressing the fabric out. And when I say fabric, I mean either the Ethernet network or Myrinet or InfiniBand, the communications fabric that you're going to be using. So we kind of step it up a notch.
Instead of running on one XServe, we're going to run one test across two XServes. So what we're going to do is set up a four-CPU job: two CPUs on each of two XServes. Same thing as before, we're going to use 80% of the physical RAM, but you've got to remember that this is the available physical RAM across all your systems. So before, we may have been running on one 4 GB XServe; now we're running on two 4 GB XServes, so that's actually 8 GB of available RAM.
And again, like the single pairs, we want to run multiple tests just to kind of stress things out and make sure that when things start running at a higher heat load, it doesn't matter. I've seen things where the heat will actually start to loosen the connector a little bit, because obviously the connectors are getting much warmer in the back, and the connector wasn't seated properly. So the heat can actually affect how good the contact is of your connectors. So running at a high heat is actually kind of important for making sure things are working.
So slow performance here is often going to be a network issue, but not always, because you could be seeing problems that just didn't develop earlier. With Ethernet, obviously, you can look at the error counters, and you can do the same thing with Myrinet and InfiniBand: just check the error counters and see if they're counting up. If you're getting any errors whatsoever, I think you should be looking at the equipment and maybe replacing or reseating the cables.
So one of the problems that we had with this is that, so now we're kind of trying to figure out network problems. One of the first things we did at Virginia Tech was we kind of did a binary search for bad nodes. So we would take maybe a cluster of like 32 and run two 16-node tests.
And then the first one would fail, and we'd go, okay, so we got it down to 16, and then we'd run two more tests of 8, and okay, now we know which one of the 8 is doing it until you finally got down to it. And that took a lot of time. So we came up with this different way. Basically, you have the pairs like I mentioned.
So you pair like maybe node 1, node 2 up, and node 3 and node 4 up, and so on, and run those tests. So let's say that node 1 and 2 tests failed. Well, now you've narrowed it down to 2. And so what we do on our second iteration is we'll just kind of shift it by 1. We'll match up nodes 2 and 3 together, and maybe nodes 4 and 1 together, and then test that. So if the tests 1 and 2 failed, and the tests 2 and 3 failed, then chances are it's node 2 that's causing those problems.
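Here's that pairing scheme spelled out for a 32-node example; it just prints the two rounds of pairings to feed into your test runs:

```sh
#!/bin/bash
# Round 1: adjacent pairs (node1+node2, node3+node4, ...).
for ((i = 1; i <= 31; i += 2)); do
    echo "round 1: node$i node$((i + 1))"
done

# Round 2: shift the pairing by one (node2+node3, ..., node32+node1).
for ((i = 2; i <= 30; i += 2)); do
    echo "round 2: node$i node$((i + 1))"
done
echo "round 2: node32 node1"

# A node that shows up in a failing pair in both rounds is the likely culprit.
```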
And so, obviously, you can look more closely at Node 2's errors and see what is going on with it. So that's basically some of the testing methodologies. There's a lot more than that, but that really got us to a very good point where we were kind of happy with the state of the system. I'm going to briefly go over some of the job scheduling and resource management stuff that's involved in clusters. I always tell people that you don't want to do this if you don't need to, because it really does increase the complexity of your system.
The only reason you really need to do job scheduling and resource management is if you have multiple users, and it becomes really more important if they all want the system at the same time. So, for example, if you have a shared resource, a cluster of 16 nodes, and you have maybe two faculty professors that, you know, faculty, they want to run on the entire system, right? So you can basically set it up so that if both of them submit a 16-node job, then it can split it, or it can schedule the two and do that.
So we're going to break this down into resource management and job scheduling. Resource management is in charge of running the actual codes on the compute nodes, and making sure that the nodes are operating and doing what they're supposed to be doing. So make sure they're running the jobs, make sure they're up, and everything's staying fine.
So this basically just handles the queue of jobs, making sure the jobs run, and making sure they don't exceed resources. Often when someone submits a job, they have to estimate the resources. So they'll say, I want two wall hours of time, which means they want their code run for two hours. And if their two hours comes up and they're still running, then the resource manager is responsible for killing it and cleaning that up.
A resource management solution I recommend is called GridEngine. It was originally called Sun GridEngine, or SGE. It is free and it is open sourced, and it is distributed by Sun, and is probably one of the more actively developed, freely available resource managers. There are a lot of others that are commercial and free, but I really like GridEngine. I've been pretty happy with it. They support OS X, and I think they've been doing it for a few years now.
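For reference, submitting a job under Grid Engine looks something like this; the parallel environment name "mpi" is site-specific, so treat it as a placeholder:

```sh
# Ask for 4 slots in the "mpi" parallel environment with a 2-hour wall clock
# limit; Grid Engine kills the job if it runs past that.
qsub -pe mpi 4 -l h_rt=2:00:00 run_job.sh

# Watch the queue and the state of each node.
qstat -f
```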
So the job scheduler will work with the resource manager. And its goal is to basically make sure that your resources are being used effectively. And basically, this is where you get into the political issues that I was kind of talking about before. Another issue that you might run into: let's say you build a cluster and some people donate money to build that cluster. Well, maybe the person who donated twice as much as everyone else might want to have higher priority access. A job scheduler will let you set it up so that when he submits a job, his jobs get a higher priority in the queue.
Most resource managers, though, do come with a basic job scheduler, and it's very similar to what the XGrid one is, for example, which is basically first in, first out. If you submit a job first, you're going to be the first one to run, and so forth. But again, if you need fair share queuing, if you need to perhaps have only certain users run at certain times on certain systems, then you need to look at a more complex job scheduler.
And one of those is called Maui. And Maui is kind of the de facto standard in HPC clusters for doing complex job scheduling. But it is complex, because there's a lot of different ways of prioritizing your users. And Maui has all the, what we call, different knobs and dials where you can tune and tweak it so that people can run at different levels.
Maui is free. There is a commercial aspect of it. It's called Moab. So if you need commercial support for Maui, then you can look at Moab as well. And it works pretty well. Job scheduling is a really hairy thing. So for example, at Virginia Tech, we have someone who's probably 80% of their time is devoted to just administering job scheduling. And honestly, it can be quite hairy. And he's responsible for making sure everybody is happy, which is a really difficult thing to do.
I'm also going to talk about tweaking the OS. This is actually something I presented about last year. And one of the things we're going to talk about is how to reduce the number of services that's running on a system, on one of your compute nodes. So what is our goal? Everyone's probably seen, either in TOP or in the activity monitor on OS X, all the things that are running.
And there's a lot of things in there that you're not sure about. Well, a lot of those are kind of the core infrastructure for OS X. But a lot of them aren't required in the cluster environment. So what we're going to do is we're kind of going to wean these out and get it down to kind of a core set that we need to run on the system.
So why would we do this? Because we want to free up more cycles for the user codes. And not only that, but we want to make sure that these things running in the background don't ask the processor for time. That generates what's called a context switch: the processor has to go do something else and then come back, and it just generally increases the amount of time things take to run.
Also, by reducing the number of services, you generally reduce the system complexity and potentially reduce the number of things that can go wrong. So if you turn off a service that you don't need, then that's one less service that might do something that's not right and interfere with your task.
This can make diagnosing certain things more of a problem. For example, one thing that I often do is I'll turn off, and this is hard to say in front of an OS X audience, but I'll turn off the user interface, because we don't need it on the compute nodes. You're not logging into each compute node and running applications on it. Most of the time you're running it over the network through SSH.
But obviously this can make diagnosing a little bit harder, because you can't just sit down, log in, and then go through the logs or look at certain things. So because it makes certain things like diagnosing harder, we only want to do this on the compute nodes. We don't want to do it on the service and head nodes.
So Mike Bombich, when he was working on the UIUC Turing cluster, created a script that would go through a system and disable a recommended set of services. And so he did that, and I actually ran this on System X. I thought it was great. I had done some of this work before, but it was much more complete than the stuff I had been doing.
So what we can do is, on each cluster node, run HPC Tune with --recommended. It's going to turn off a recommended set of services; it's going to turn off things like the Windows server, and things like the Bonjour DNS stuff. Although, if you're doing things like XGrid, you don't want to turn Bonjour off, because it's going to break that. It's also going to break XSAN, because XSAN uses Bonjour to communicate and find the metadata controllers.
So there are certain times where you want to make sure a service is turned back on. And so you can actually use HPC Tune and kind of adjust that recommended set. So one thing it does leave on is the auto-mounter. And in certain situations, you might not want the auto-mounter. So there's an example of how to apply the recommended set, but instead of leaving the auto-mounter on, we're going to turn it off.
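If you want to see what a script like that is doing, or adjust things by hand, the mechanics on a compute node look roughly like this; the service names and plist path are examples, not a definitive list:

```sh
# 10.3-style: many services are gated by flags in /etc/hostconfig,
# e.g. changing a line such as AFPSERVER=-YES- to AFPSERVER=-NO-.

# 10.4-style: launchd jobs can be disabled persistently with launchctl;
# the plist name below is a placeholder for whatever you decide to turn off.
sudo launchctl unload -w /System/Library/LaunchDaemons/com.example.unneeded.plist

# Either way, check what is left running afterwards.
ps aux | less
```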
It is being updated for 10.4 right now. 10.4 did change a few things about how services start; it introduced launchd, which kind of completely changes how some of that happens. It will be available on my .Mac account, which is right there. The 10.4 version will probably be available at the end of next week, and there's a 10.3 version there already. And with that, I'm going to hand it over to Jason.
Thanks a lot. So I'm Jason Anthony Guy. I manage a developer technical support team. So more information, developer.apple.com/wwdc2005. We've got a bunch of fantastic sessions that either have passed and you'll be able to watch on a streaming video in the future or are coming up shortly. One I'd like to definitely highlight is the monitoring your system with ARD and open source tools, which is Thursday at 2:00.
So if you want to see how to monitor those systems you're deploying remotely, this is a great session to attend. You should also join us in the Enterprise IT Lab, and a bunch of these guys and a bunch of other systems administrators and so on will be there to take your questions. Chris Bledsoe is the person you want to email if you have questions or comments.