Enterprise IT • 1:03:39
High Performance Computing continues to gain momentum on the Mac platform, and the technology changes quickly. Get up to speed on the newest tools, technologies and software for HPC on Mac OS X.
Speakers: Skip Levens, Kevin Howard
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it may contain transcription errors.
Welcome to session 609, the HPC technology update. We've got a lot to cover. It's been an exciting year in HPC. Today we're going to touch on some of the HPC trends and showcase some of Apple's larger HPC deployments. We're then going to survey the HPC technologies on the platform. There's a very wide range to choose from. We don't have time to talk about each, but I should point out there are representatives from many of these projects and companies in the room today, and thank you for coming and thank you for your support of the platform.
We are next going to cover the two major high-performance interconnects available on the platform, Myrinet and InfiniBand. Then we're going to take a survey of the tools available on the platform and get a very important update from Absoft, who are going to let us know about everything in their portfolio and what they're doing on Mac OS X. Finally, we're going to give you a glimpse of a very interesting HPC technology, and then we'll wrap up.
So it's pretty clear to almost anyone building high performance computers that clusters are rapidly becoming the predominant way we put together high performance compute environments. This chart, put together at the last Supercomputing conference, tracks the major supercomputer architectures going back to 1998, so we've got the 500 systems from 1998 on. In orange at the bottom are clusters. We can see that as they gradually rise, they have displaced other architectures. Clusters, just about everyone knows, are the predominant way, and on our platform this is personified in the Xserve G5.
The Xserve G5 is the foundation for clusters. There have been a tremendous number of HPC deployments tackling a very broad range of challenges in high-performance computing and the sciences. Let's check in with some of the largest ones. Probably the most famous, and well familiar to all of you, is Virginia Tech's System X. System X this year converted from dual 2 gigahertz Power Mac workstations to 2.3 gigahertz Xserve G5s. This allowed them to reduce their footprint by two thirds and pull a 425 kilowatt load down to 310 kilowatts. So a major savings in power, and well done; the concomitant heat that has to be managed out becomes even more manageable. This year they generated a 12.25 teraflop run and qualified as number seven on the Top 500 list of supercomputers. And they remain the largest academic cluster.
We think they're going to get some competition this coming year, but they are still definitely the largest and most powerful academic cluster. They entered production this year, and what's very exciting is that they're not only an HPC cluster for researchers but also a technology test bed. In fact, they welcome new technologies to the platform. If, say, someone wants to do a scalability test and see if a new technology can come over, and you contact them, they are very willing to entertain such requests. So we're very pleased and very excited at the work Virginia Tech is doing.
Another very exciting cluster this year is the UCLA Dawson cluster, which investigates plasma physics. It is a mix of two 128-node systems, one composed of 2 GHz Xserve G5s and the other of 2.3 GHz Xserve G5s; it runs over GigE and leverages LAM/MPI. This system entered production very quickly, and on just one of those 128-node sides they generated a run just shy of a teraflop, which actually qualified them last year as number 444 on the Top 500 list. Since then, using Dauger Research's Pooch Pro and MacMPI, they have generated an even higher run. I think this shows that 128 nodes is within the reach of a lot of universities, and frankly, that can qualify as one of the top 500 supercomputers.
Of course, we have to mention one of the largest systems, COLSA's MACH5 system in Huntsville, Alabama, which supports hypersonic missile research. These are incredibly dense CFD codes running on a very large collection of systems. The system has seen a lot of development this year and is crunching those very large CFD codes. A fantastic system. At the University of Illinois Urbana-Champaign, they stood up the Turing cluster this year.
This is used by their CSE department and the Center for Simulation of Advanced Rockets. Imagine the very complex CFD analysis that has to be done as solid motor fuel burns away in a rocket; that's exactly what they're simulating there, among other codes. We think this is going to be a very, very high-performing system. It's 640 nodes, leveraging the Myrinet interconnect. The system was budgeted at around 3 million, and we think it's going to place very well on the forthcoming Top 500 list. Outstanding work there at Urbana-Champaign.
Another fantastic cluster that you may have heard of, and that generated some press, is the XSEED cluster at Bowie State University in Maryland. This system is 224 nodes leveraging the Myrinet interconnect, and they generated a 2.1 teraflop benchmark run. So naturally we think they'll make the next list that comes out shortly. It was really fun to watch the progress of this system because it was student and faculty planned, and it has been an incredibly useful teaching and learning aid for HPC and CSE material on campus. So, well done.
Now, for every large system that makes the list and makes a lot of headlines, there are dozens and dozens of smaller systems. This is probably best personified in the Apple Workgroup Cluster. The concept here is to take things about as far as you can possibly get toward true turnkey cluster computing: a very well-integrated set of applications, integrated hardware, all the technology pieces that comprise a cluster very tightly integrated.
It's inexpensive enough, and has enough performance, that these types of machines are now within reach for workgroups, desk-side use, departments, and discretionary budgets. We've even seen people enter it into grants, be awarded funds, purchase the cluster, and away they go. So it really can scale to fit a very wide variety of needs. I should point out it makes a very nice dedicated Xgrid cluster as well.
So it takes a lot of technology to make a cluster. We've represented it here from the hardware at the bottom, up through the OS, the hardware interconnects, and the middleware, all the way up to your application, which is what we really want to get at. There's a very rich array of technology on the platform for you to choose from when you're building your cluster. We don't have time to talk about each piece, but we'll call out unique developments from this year that you should be aware of.
At the hardware layer, we do know that many people are using the Power Mac G5, primarily for visualization, because it can host very fast graphics cards; it remains probably the leading choice for people doing visualization in HPC. And we do know that people are trying to build very large collections of Mac minis for Xgrid research, which would be quite fun to play with. But the Xserve G5 remains the primary building block for clusters. The 64-bit computing, the massive amount of memory available, and the amazing amount of performance, all told, make it an extremely well-balanced basic building block for a cluster, with fantastic I/O throughput, very good environmentals, and excellent processing power. The major development here is that we now support 2 GB DIMMs on the Xserve G5, and if your wallet can stand it and you can afford to max out on 2 GB modules, you can fit up to 16 GB of ECC memory in the Xserve G5.
So at the OS level, Tiger landed a tremendous amount of benefit of immediate interest and use to HPC customers. Of course, there's the 64-bit memory address space: being able to build and run 64-bit applications using Xcode 2 with its GCC 4 underpinnings is of immediate use and benefit to our HPC customers.
For those of you who have made many trips to sysctl to tune kernel parameters, we want to guide you to have a look at launchd, which gives you a way to set global, system-wide kernel parameters. "man launchd" gives you a wealth of information, and I'd like to point out that in the large system administration session, Josh Durham of Virginia Tech is going to take you through exactly how he leverages launchd and Tiger to do exactly that: tune his systems for HPC environments.
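To make that concrete, here's a hypothetical example of the kind of boot-time tuning he's describing. On Tiger you can put sysctl-style settings in /etc/sysctl.conf, applied at boot, instead of running sysctl by hand after every restart. The parameter names below are real Mac OS X sysctls, but the values are purely illustrative, not recommendations from this session:

```
# /etc/sysctl.conf -- illustrative values only, tune for your own cluster
kern.ipc.maxsockbuf=8388608      # ceiling for socket buffer sizes
net.inet.tcp.sendspace=1048576   # default TCP send buffer
net.inet.tcp.recvspace=1048576   # default TCP receive buffer
```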
One of the things you may have heard from Virginia Tech is that for very large systems, if you want to eke out the very last mile of performance from your HPC codes across a very large number of systems, you would like physically contiguous memory to be available to those HPC applications.
I am very pleased to tell you that Apple's DTS department has written code on Tiger that does exactly that for you. So if you are an HPC customer and you know exactly what you're doing, I think this code will be very instrumental in showing you how to get even more performance out of your HPC codes. If you'd like to see it, I recommend you have a look at Friday's session on debugging parallel applications, where it will be introduced.
So moving up through the hardware interconnects, we have a broad range of choices in gigabit Ethernet. A company by the name of Small Tree has taken a very early lead in writing Mac OS X drivers for gigabit Ethernet cards from a number of vendors. They offer one-, two-, four-, and even six-port copper GigE cards, and combined with 802.3ad link aggregation these run very nicely in a server environment. They also offer one- and two-port optical gigabit Ethernet cards. So if you already have an optical infrastructure or want to do very long runs, this is an excellent choice.
Surprisingly, another reason some customers have reported wanting these cards is that in very highly secure environments, you basically can't listen in on an optical line. So for movie studios, for example, this might be of interest, and certain three-letter agencies don't want people knowing what's passing down those wires. Or you might like it simply because you already have fiber optic plumbing.
Also available: Small Tree wrote a driver for Intel's 10 gigabit Ethernet card. So if you need 10 gigabit Ethernet to move very large files, possibly from one workgroup to another, this again has been a very popular choice with our customers. Moving on to the high performance interconnects: Myricom, as you know, have been long-time developers on the platform. This is an example of their connectors and the types of switches they have.
So the news from them is that they expect two new Xserve-based clusters to appear on the June 2005 Top 500 list of supercomputers, so stay tuned for that and for more product news from them. They've had a lot of success with their MX message passing framework. MX-2G has been released for their PCI-X network interface cards; this generation has even lower latency than the previous one. They are getting on the order of 3.6 microsecond signaling latencies over raw hardware and only 3.8 microseconds using MPICH-MX, their message passing implementation. One of the really neat things about these cards is that you install one in your system and it shows up just like a configurable network interface card, so you're able to pass your TCP/IP traffic over it. They also offer their own middleware and are working on the next generation, MPICH2-MX, coming soon. And they report, which is not surprising but very welcome, that Tiger allowed them to get even better performance. MX uses a unique approach to memory management: for messages smaller than 32 kilobytes, they simply copy the data across from node to node.
Messages larger than that get pinned and unpinned on demand, and they say that Tiger's code made that practical on Mac OS X and allowed them to deliver as much as four times the memory to the application. They also announced experimental support for 64-bit binaries on Tiger. Again, we will stay tuned for news from them at the Top 500 announcement at the supercomputing conference next month.
So the other high performance interconnect on the platform is InfiniBand, and we are very pleased to welcome to the platform the combination of SilverStorm Technologies and Small Tree Communications, who have partnered to deliver a tremendous range of InfiniBand options on our platform. So, who are SilverStorm? Well, the first thing to tell you is that you probably already know them as InfiniCon Systems; they were founded in June of 2000, originally under that name.
They have sales and support teams in the US, Europe, and Asia to answer technical questions and field customer requests. They have a very strong emphasis on high performance technical computing and on commercial computing fields, areas like oil and gas. They offer a complete solution set for both HPC and enterprise compute customers, and we'll explain in a little bit exactly what that could mean for your data center. One thing it can mean is that through their solutions you can pull together Fibre Channel, Ethernet, and InfiniBand architectures. And again, we'll show you what that means in a few minutes.
So who are Small Tree? We've already mentioned them once for their gigabit Ethernet cards. Small Tree are a hardware and software development company; the principals of the company come from companies like Cray, SGI, Sun, and HP. They have technology partnerships in place with Intel and, obviously, SilverStorm, and they also have experience as system integrators. They have a wide range of offerings, as we've seen, from GigE up to InfiniBand. Probably the best news here is that they are focused exclusively on the Mac OS X platform, so they've taken a very nice lead in building network performance on Mac OS X.
So you may have heard a lot about InfiniBand and wondered exactly what it means. I can tell you InfiniBand is a very dense spec with a lot of protocols, but the vision for InfiniBand was really to build an enterprise-class infrastructure starting with a clean sheet of paper. What if we didn't have to pull along other networking technologies? What if we could start with a clean slate and bridge together Fibre Channel, Ethernet, and high performance computing traffic? That is all made possible through a very fast, high bandwidth, low latency interconnect, and this is what InfiniCon was originally founded on. In the center, in orange, would be one of their switches, which is able to tie together gigabit Ethernet, bridge off to Fibre Channel storage, and bridge off to high performance applications, be it for commercial or technical computing.
So they offer a complete range of solutions, starting at the very bottom with the host channel adapters; this is an example of their PCI-X host channel adapter. Then there are switches like this one that are able to bridge the I/O infrastructure. It might be a bit hard to see in the picture, but on the left is a modular chassis that lets you plug in gigabit Ethernet, in the center up to 12 ports of InfiniBand, and on the right two ports of Fibre Channel.
They of course have a range of switches, from 1U switches that make possible very flexible and efficient topologies for designing supercomputer layouts, up to great big monsters with as many as 288 ports. And to tie all of that together, by working with each other, Small Tree and SilverStorm have pulled together a fantastic range of software, from what sits directly on top of the HCA all the way up to managing the fabric and enforcing things like quality of service. So, a complete end-to-end solution.
So what's here now inside that InfiniBand stack? Today, as you would expect, MPI is here; MPI codes run on SilverStorm InfiniBand. They have also already landed, and fully support, virtual NIC (VNIC) and IP over InfiniBand (IPoIB), meaning that you can pass TCP/IP: any TCP/IP application can simply run over InfiniBand. They have also landed SRP, the SCSI RDMA Protocol, which allows things like Fibre Channel to simply plug in using one of the bridging switches we saw, and lets InfiniBand clients reach out and see Fibre Channel storage; they're faithfully passing through the Fibre Channel and SCSI protocols. One of the things that capability unlocks is that we can have an application like Oracle 10g RAC sitting up in user land that is actually reaching out and seeing Fibre Channel storage, but it's all happening over InfiniBand, with TCP/IP also going over the same wire. So we think it's going to unlock some very interesting ways for enterprise and HPC centers to leverage InfiniBand.
So we're actually demonstrating exactly what I described in the data center downstairs. In one of the racks on the first floor, you will find four Xserve cluster nodes with Oracle 10g RAC installed. Oracle 10g RAC is leveraging storage on an Xserve RAID, again simply by plugging into one of the InfinIO switches at the bottom. While we could plug in larger Fibre Channel switches, in this case we're simply plugging the two Fibre Channel ports from the Xserve RAID directly into the InfinIO switch. That storage is then simply presented and available to InfiniBand clients. And since we had extra ports on the switch, we added up to six more ports to run HPC codes. So not too bad for one 1U switch. Well done.
What's also exciting, and what SilverStorm bring to the table in working with Apple, is that because of their existing emphasis on commercial codes, they have partnerships in place with a fantastic portfolio of companies, including Fluent, Ansys, and ABAQUS, and they have committed to help drive these applications to Mac OS X. Naturally, we'll be looking for customers that are ready to help them do that.
In the IPC middleware arena, moving slightly up the stack, we have a very broad range, a very large toolkit to choose from. Each of the hardware vendors, of course, has a version of MPI tightly integrated with its hardware. The other major news is Open MPI. This is the next generation MPI from the people who brought you LAM/MPI; it is on the platform, and they have even said they're working on Xgrid support, so thank you for that. And we have MVAPICH: the MVAPICH team at Ohio State have recorded a very, very fast run, in fact it might even be the fastest networking speed recorded on a Mac, in excess of 900 megabytes per second with latencies under six microseconds. So there's a lot of fantastic work going on in the MVAPICH program under Dr. D.K. Panda and his team at Ohio State.
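Latency figures like MVAPICH's "under six microseconds" come from a ping-pong microbenchmark: bounce a tiny message between two ranks many times and report half the average round-trip time. Here is a self-contained sketch of that measurement in Python, using a local socket pair as a stand-in for the MPI transport, so the number it prints reflects local IPC on your machine, not InfiniBand:

```python
# Sketch of the classic "ping-pong" latency test: send a small message
# back and forth many times and report half the average round trip.
# A socketpair stands in for the MPI transport so this is self-contained.
import socket
import threading
import time

def _echo(sock, iters, size):
    # Peer side: bounce every message straight back.
    for _ in range(iters):
        sock.sendall(sock.recv(size))

def ping_pong_latency(iters=1000, size=8):
    msg = b"x" * size
    a, b = socket.socketpair()
    peer = threading.Thread(target=_echo, args=(b, iters, size))
    peer.start()
    t0 = time.perf_counter()
    for _ in range(iters):
        a.sendall(msg)
        a.recv(size)
    elapsed = time.perf_counter() - t0
    peer.join()
    a.close()
    b.close()
    # One-way latency is half the average round trip.
    return elapsed / iters / 2

print(f"{ping_pong_latency() * 1e6:.2f} microseconds one-way")
```

In a real MPI benchmark the send/recv pair would be MPI_Send and MPI_Recv between rank 0 and rank 1, but the timing structure is the same.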
In the area of code libraries, probably the major news is this: if you use BLAS levels 1, 2, or 3, or LAPACK, if you do any type of digital signal processing, if you do fast Fourier transforms, you should really leverage the Accelerate framework (link with "-framework Accelerate"). The Accelerate framework delivered in Tiger is absolutely your friend. A little bit later, Rodney Mock of Absoft will take you through some of the IMSL math libraries that Absoft will deliver.
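For reference, the workhorse BLAS level 3 routine being alluded to, dgemm, computes C := alpha*A*B + beta*C. This pure-Python sketch only spells out that contract; on Mac OS X the Accelerate framework's cblas_dgemm performs the same computation with hand-tuned, vectorized code, which is the whole point of linking against it:

```python
# What BLAS dgemm computes: C := alpha * A * B + beta * C.
# Pure-Python reference only, for clarity; Accelerate's cblas_dgemm
# does this with tuned, vectorized code.
def dgemm(alpha, A, B, beta, C):
    m, k, n = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    return [[beta * C[i][j] +
             alpha * sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(n)]
            for i in range(m)]
```

For example, with alpha = 1 and beta = 0 this reduces to a plain matrix product.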
So in the area of file systems, the state-of-the-art kernel in Tiger has yielded even better performance for NFS and AFP on the platform. Xsan has been a very compelling choice and is being tested and used in a number of HPC installations. A common way to deploy it is to make each head node, sitting on top of a rack of compute nodes, an Xsan client; now they can all see, pass along, and share the shared storage. Lustre, on the heels of announcing that they had it working on Panther, have done the same and are now at the beta stage on Tiger; if large-scale file systems like Lustre are of interest to you, I encourage you to contact them. They are looking forward to working on Tiger. Lastly, I'll mention Blaze. Blaze is a network file system offering from the folks at Small Tree; their goal is to make network file transfers very fast, and I encourage you to check it out. It was just announced this Monday.
In the area of resource management, I absolutely have to encourage you to check out Xgrid. People are using Xgrid in some very interesting ways; it comes from Apple and it's built into Mac OS X Server, so leverage it if you can. Moab, which it is a gross oversimplification to think of as the commercial version of Maui, was released this year; they substantially re-architected what happens under the hood. And very recently the OSCAR HA team at Louisiana State have said that, in conjunction with the Open MPI team, they are now going to tackle OSCAR HA on Mac OS X. So expect news on that.
Okay. In the tool space, obviously Tiger delivered a fantastic version of Xcode with GCC 4.0 and GDB. Some very important news is that this year Etnus released their TotalView parallel debugger on the platform, and in fact they accompanied us on a road show to several major cities where people got a chance to see Etnus' TotalView in action. Absoft have a very nice portfolio of tools, and what I would like to do is invite Rodney Mock, the Absoft HPC Technical Director, to come up and tell us more about what they have coming next. Rodney.
Thanks, Skip. I'm Rod Mock, Absoft's HPC Technical Director. If you haven't heard of Absoft: we've had continuous tool support on the Apple platform since 1984. We've been there through the various transitions, including the transition to PowerPC, and we're going to be there for the transition to Intel, so you'll be seeing tools from Absoft for the Mac on Intel. We also did the first commercial Fortran and Fortran 95 compilers for the Mac platform. We'll have a full Fortran 2003 compiler available on the Mac as well, and we intend to be first there. And we're going to continue developing cutting-edge developer and high performance tools for OS X Tiger.
Some of the Absoft solutions we've done on the platform: first, our High Performance Computing Software Development Kit for Xserve clusters. This is a kit that bundles together the best-of-breed solutions available for the platform and makes them all easy to install. In very short order you'll be able to install this kit and have all the compilers, debuggers, math libraries, and other tools you need to optimize on the platform. We'll talk about each of these in detail in a few minutes. Another thing we've announced at this conference is our FXP MPI parallel debugger for Mac OS X Tiger and Panther. You'll be able to debug basic MPI codes with this debugger; it's a low-cost, entry-level debugger from Absoft. You can also get TotalView, which Skip talked about: if you need to move up to the next level and do hybrid debugging and more advanced codes, that solution is also available from Absoft.
We also have 64-bit and 32-bit Pro Fortran compilers for Tiger. Right on the heels of Tiger, we released our 64-bit Fortran compiler as well as our other tools. This is available right now; you can download it, and if you need to do 64-bit development on Tiger, you can use our tools and immediately take advantage of the 64-bit capabilities.
We're also announcing the 64-bit and 32-bit IMSL Fortran numerical libraries, version 5.0, with serial and MPI support for Xserve. If you're currently using IMSL on the Mac platform and want to take advantage of your cluster, this makes it very easy: it hides a lot of the MPI complexity from the user and lets you take advantage of the resources you have.
So let me talk a little more about the High Performance Computing Software Development Kit. We've done these on multiple platforms with other vendors, including Intel, and it is now available for Xserve clusters. As I mentioned, you get commercial compilers, debuggers, libraries, and MPIs, all precompiled and set up for you on a CD that you just insert. You hit Enter a couple of times, it installs across your entire cluster, and you're able to start compiling, running, and debugging your applications right away, increasing productivity by eliminating the need to configure all these components to work together. The MPIs are already preset to work with all the vendor compilers. We have a unique Absoft modules environment that lets you pick and choose components: if you need a 32-bit development environment, you can load one up with just one command; if you want to switch to 64-bit development, you type another command. It does all the magic of setting up paths and picking the compiler flags with the most optimization for the current settings. All of this is abstracted away from you to make it easy to develop HPC applications on the platform.
Some other best-of-breed components are included: we have MPICH2 and LAM/MPI, and we'll also be supporting Open MPI and other MPIs in the kit as they come out. We have the 64-bit and 32-bit IMSL libraries, as well as the Absoft Pro Fortran compilers we discussed earlier. We also have VAST available. VAST has two technologies. One is VAST AltiVec, which is bundled on the CD; this does auto-vectorization of your codes. If you have a code and you want to take advantage of the AltiVec unit but you're not familiar with how to do it, you can just run it through VAST, and it will automatically vectorize the code for you. It gives you good diagnostics to tell you why certain parts of your code couldn't be vectorized, for example possible pointer aliasing or other reasons, and helps you work your way through to get even higher performance out of your code. The other technology is automatic parallelization and OpenMP support: it goes through your code and automatically parallelizes the parts where it believes you can gain performance, and if you want to use handwritten OpenMP directives, you can also compile your code using VAST with OpenMP support.
The next product we announced is our FXP MPI parallel debugger. It's built on our FX2 serial debugger, which supports all the major compiler vendors on all platforms, and it supports all the major MPI implementations in one easy-to-use package. It's very easy to use and very sharp-looking, based on the Aqua interface. If you want a sneak preview of this debugger, please come to the parallel debugging session on Friday; we'll show it to you, and you can sign up for our beta program to help us beta test it this month. If you need a higher-level debugger, as I mentioned, TotalView is available through Absoft and supported as well.
It has some unique features. One is automatic attach to MPI processes. This is very popular: you don't have to figure out what your PIDs are on the different nodes; it will automatically attach to whatever MPI processes are out there and present them to you in one easy-to-use interface, so you can step through your code and view the registers, variables, and stack on all the different nodes through the same interface. It also has a visual element that lets you see the state of your nodes, with a color-coded scheme so you can see which nodes are dead or possibly deadlocked, and other easy-to-use visual elements that help you quickly identify problems with your code.
Also, if you run in batch, using PBS or one of the other batch queuing systems, and your code is hung, you can use FXP to connect remotely to your system, dump the message queues, and possibly see why your code is deadlocked.
Absoft Pro Fortran was also released on the heels of Tiger. As I said, it has 64-bit and 32-bit support for Fortran 95. It includes free technical support from Absoft; we have very good tech support, and we support all the products, including the HPC SDKs and TotalView, centrally through Absoft. It includes C and C++ compilers. It has an IDE that's cross-platform across Windows, Linux, and other platforms, and our FX2 graphical debugger is included at no charge. It has BLAS libraries, ATLAS, HDF, and other libraries that make it easy to use on the platform. And it is fully link-compatible with GNU. Thank you.
Finally, the IMSL Fortran numerical library. This is a complete set of scientific algorithms and routines: instead of having to hand-code your own, you have a vast array of routines you can use, already validated for the platform, so you know you'll get the correct answer.
They have excellent documentation. And if you want to scale up once you get a cluster, with a few simple modifications to your code you can have MPI support, and it hides much of that complexity from you: it handles bundling up and packaging the arrays and sending them out to the nodes. All you have to know is, "I want to solve this problem, and here's the algorithm I want to use." You leave the rest to IMSL.
So, the roadmap going forward: of course, we're going to be supporting the Mac on Intel; that's going to be a very big push for us. We're going to have full Fortran 2003 support on the platform. We're going to keep improving the speed of our compilers, tools, and utilities, with continued emphasis on our technical support and the introduction of some new HPC tools that you can look forward to seeing this time next year. So thank you.
Thank you, Rod. Okay. So we did a quick run through the technology stack, but what we're really trying to get at is your applications: what exactly the challenge is, what question it is you're trying to answer. There are a tremendous number of technologies and applications on the platform, some commercial, some open source. But what we wanted to do today is introduce you to a technology that looks very promising in the HPC arena, and to show it to you, I'd like to ask Kevin Howard from Massively Parallel to join us and give us a look. Kevin?
Okay, I'm Kevin Howard from Massively Parallel Technologies. We're going to show a demo, but first I have to tell you, since this is an Apple audience, that it's going to be ugly, really ugly, because it doesn't have flashy graphics or anything like that. It's just going to show one impossible thing. Hopefully that one impossible thing will be compelling enough that you, or at least Apple, invite me here again sometime.
Okay, why Massively Parallel Technology? What do we do, and where do we come from? Nobody's ever heard of us. Well, our technology is based on years of research; we've been around for, I think, six years now. Our primary funding came from the Defense Advanced Research Projects Agency, DARPA. What we did is derive a new version of Amdahl's Law. The standard version of Amdahl's Law is kind of pathetic: everybody in here who does high-performance computing knows that if you open up a parallelization book, Chapter 1, page 3 will state Amdahl's Law, and then you never see it again. Why is that? Because Amdahl's Law does not fundamentally talk about the number of channels, the speed of channels, latency issues, how physical topologies are connected up, or any of the other real issues that face practitioners of the art. It really only talks about two items: Amdahl's fraction, which is the percentage of parallel activity, and the number of nodes. That's kind of useless for doing real work. So we took this model, this derivative, worked out how it behaves with complete overlap of communication and computation, and overlaid that with different communication models.
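For reference, the "two items" version he's criticizing is just this: with parallel fraction f and n nodes, the predicted speedup is 1 / ((1 - f) + f/n). A quick sketch makes his complaint concrete: the formula knows nothing about channels or latency, and it caps speedup at n, so it can never predict the superlinear results claimed in this session:

```python
# Textbook Amdahl's Law: predicted speedup from parallel fraction f
# on n nodes. Note it says nothing about channel count, bandwidth,
# latency, or topology -- the speaker's complaint -- and its value
# can never exceed n.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Even a 95%-parallel code on 80 nodes is predicted to run only ~16x faster.
```

By this formula a perfectly parallel code (f = 1) on 80 nodes tops out at exactly 80x, which is why the 80-machines-acting-like-140 demo later in the talk requires going beyond this model.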
We'll show you some of those models -- not all of them, but just a few to give you a flavor. These are, of course, proprietary, but so what? What we are going to show here are a number of applications with what we call true superlinear capability. What do I mean by true superlinear capability? I mean it is not a function of cache effects. It's superlinearity that comes right from Amdahl's Law, and right from the algorithms themselves, if done properly.
Okay, basically our technology is instantiated in a number of libraries, because like everyone else, we want to hide how hard it is to actually do this stuff -- because it is hard. What we want to do is show that you can get much higher-performing systems using some of these techniques. In fact, what we're going to show is something kind of silly. In the world of high-bandwidth, low-latency systems, we have a machine over there where we're just using 100Base-T, and we're going to show at least one application with superlinear performance. We're going to show some pictures in a few minutes of performance graphs on real machines, also using just 100Base-T. We're talking 80 nodes, that sort of thing -- so a reasonably sized cluster, and a number of applications with superlinear performance.
Okay, well, here's the first one. In fact, we're going to show e to the x. Why do we choose e to the x? A friend of mine, John Gustafson, who's been in the high-performance computing community for pretty much forever, works for a competing company, so I won't mention it.
He ran e to the x on his competing company's machine, and what they found is that it doesn't really scale very well. Why? Because unlike pi and other things like that, where Ramanujan was good enough to give us solutions where you can compute the last digits directly, and so you can spread that over multiple machines, e to the x is a true series-expansion problem. You have these huge series expansions, and you have to do the cross-summing all the way through. So it doesn't really scale greatly. It scales OK, but it doesn't scale greatly. What we were able to do -- this is with two channels, and we'll explain why, two 100Base-T channels on fast machines, 80 of them -- was get performance that was near 140 machines' worth. That's the equivalent: we had 80 machines acting like 140 machines. How can that be?
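To see why e to the x resists parallelization, here is a minimal sketch of the direct Taylor-series evaluation in arbitrary precision, using Python's standard `decimal` module. This is an illustration of the sequential dependence only, not the code run in the demo:

```python
from decimal import Decimal, getcontext

def exp_series(x, digits=50):
    """Compute e**x by direct Taylor-series summation.

    Each term is derived from the previous one (term *= x/k), so the
    summation is inherently sequential -- the cross-summing the speaker
    describes. Unlike pi, there is no known formula to jump straight
    to the last digits and hand independent chunks to separate nodes.
    """
    getcontext().prec = digits + 10      # extra guard digits
    x = Decimal(x)
    total = Decimal(1)                   # k = 0 term
    term = Decimal(1)
    k = 1
    while abs(term) > Decimal(10) ** (-(digits + 5)):
        term *= x / k                    # next term depends on the last
        total += term
        k += 1
    return +total                        # round to context precision

print(exp_series(1, digits=30))          # e to ~30 digits
```

At the 65,000-digit precision used in the demo, each term is a huge multi-word number, so even this "simple" loop is substantial work per iteration.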
Here's a real application: triangulated irregular networks. People use them for laser 3-D mapping of cities, and for a number of other applications. Typically, the way people do them is they'll split up the image into sub-images, compute the sub-images separately, and then not really worry about the interface very much, so that it looks like it scales linearly -- although the edges do have to be healed. What this is showing is a true parallelized version of a triangulated irregular network.
That means we took a single image, we broke it up into its component parts, we ran it across multiple machines, and then we healed the solution. We brought the solutions back and merged them together into a single coherent image. Now, what's interesting about this model is that with 26 machines, the system was acting as though it had 30 machines. And beyond 30 machines, because of the shrinkage of the work, you start down the downward path, as these things are wont to do: 31 machines -- I have to read my charts better -- showed performance that was actually less than 15 machines.
That's a consistent number. This is not just a one-shot bad-programming effect. This is a real effect. The only reason we keep getting money -- we're on our fourth grant from DARPA now -- is that we keep showing real numbers to them, and they claim that people don't usually do that. So I don't know. Here's another chart that we find particularly compelling.
The reason why it's compelling is that it's on 100Base-T again. We used an MPI instantiation of LU decomposition, the heart of LINPACK, on fast nodes -- and MPI doesn't scale. Why? Because the problems are too small, the bandwidth and latency issues are too great, and it falls down, goes boom. With our technology, with our communication models, our new flavor of all-to-all communication -- we have a dozen or so different kinds of all-to-all communication models, and pretty much an infinite number of models for doing scatter-gather, but we're only going to talk about the scatter-gather stuff, mainly because those patents are already issued, so I can talk about them. If you look here, the yellow line represents the linear scaling line. We did reasonable scaling with three nodes, but when we went to seven nodes, we were doing strong superlinear scaling. Now, how long can we keep that? That really is a function of a combination of the underlying algorithm itself -- how it wants to naturally scale -- as well as the kind of communication models that we use. So finally, I think this is the final one. Here's BLAST. Everyone knows of BLAST as, well, that's embarrassingly parallel. Why are you even including such a stupid, embarrassingly parallel problem in this discussion of high-performance computing? The reason we're including it is the same reason that the HPCS community is going to start using BLAST as one of the primary benchmarks.
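For context on why LU decomposition stresses the interconnect, here is a textbook Doolittle elimination without pivoting -- a minimal serial sketch, not the MPI code from the benchmark. Every elimination step touches the whole trailing submatrix, which is why the distributed version needs repeated broadcast and all-to-all traffic:

```python
def lu_decompose(A):
    """In-place Doolittle LU decomposition (no pivoting).

    After the call, A holds U on and above the diagonal and the
    multipliers of unit-lower-triangular L below it. Each step k
    updates the entire trailing (n-k)x(n-k) block, so in a
    distributed setting every node needs row k and the multipliers
    at every step -- the communication pattern under discussion.
    """
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                 # L multiplier
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]   # rank-1 trailing update
    return A

A = [[4.0, 3.0], [6.0, 3.0]]
lu_decompose(A)
# L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]
print(A)
```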
This is not the standard high-throughput BLAST. For those who don't know, BLAST is a bioinformatics code. It does protein searches, nucleotide searches, that sort of thing. What this is is a single query, or a small number of queries, going into a dense array -- in this case 255 nodes -- and showing what the effect of that kind of query in parallel really is. Why is this important? This is important because most researchers aren't putting in 10,000 queries at the same time.
They're putting in a relatively small number at the same time. If you're a small organization, like a small research team, you can't afford to do the high-throughput methods. An example of that is some work we did with the University of Colorado, their health sciences department. They were doing some work trying to find certain proteins that occur in the human pancreas in mouse DNA. Their standard model took 26 days, I believe, to process the queries, et cetera. But on our machines -- 255 nodes, on the cluster -- it was 24, 25 hours, I believe, to do the exact same group of queries.
And I believe they're writing a paper on that, so it'll be out sometime; I don't know when. Okay, so some of the things we consider important: not necessarily point-to-point bandwidth, but effective bandwidth. Because if you do the effective bandwidth properly, it can mask most of the latency effects, and it can mask most of the problems with point-to-point speed. The effective latency really is a function of the bisection bandwidth -- or, I'm sorry, the effective speed of the channels. So in our model, we try to produce an impulse function that has what we call a zero max-entropy model, which means that in the least amount of time, we want the maximum number of communications taking place without interference and without collisions. These are expansion times. What we did is compare four models that we have -- and we'll show you all of these models, in fact. What they're doing is trying to increase that bisection bandwidth per unit time. And as you can see, as long as the backplane of the attached switching fabric can maintain the speeds, you can get very, very high effective bandwidths on your machine.
Okay, here's our first little one -- our simple cascading model. Why are we showing this? It doesn't look like much. It looks kind of like just a tree architecture, right? Except that if you look at a standard B-tree, a binary tree architecture, the expansion rate is 1, 2, 4, 8, et cetera. You can go play over there. This is Glenn. He'll be playing the Mac for you today.
This model here -- the growth rate of how many nodes are in communication per unit time -- oh, and by the way, that thing that says "home node" is really our version of a head node, or something analogous to that; it's not really part of the communication. But its growth rate for a single channel is 1, 3, 7, 15, 31, et cetera, which is basically two times faster than a B-tree. What the heck does that mean? What it means on a single channel is that prior to our company's discovery of this pattern, the binary tree was considered, for a single channel, to be the best expansion rate you were going to be able to have. This is showing the effect of just adding a single additional port at the head-node level: just one port doubles the effective bisection bandwidth of the system. So rather than having to double the speeds of all of the point-to-point connections, or double the number of physical connections, we double the effective performance by adding one single port.
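The two expansion sequences quoted above can be sketched as simple recurrences. Note these recurrences are reconstructed from the numbers in the talk (1, 2, 4, 8 versus 1, 3, 7, 15, 31), not from the proprietary model itself:

```python
# Nodes reached per time unit, reconstructed from the sequences quoted
# in the talk. The binary tree doubles each step; the cascade pattern
# follows 2*n + 1, giving roughly twice the binary-tree count at every
# step -- the "basically two times faster" claim.
def binary_tree(steps):
    reached, out = 1, []
    for _ in range(steps):
        out.append(reached)
        reached *= 2              # every informed node informs one more
    return out

def cascade(steps):
    reached, out = 1, []
    for _ in range(steps):
        out.append(reached)
        reached = 2 * reached + 1  # assumed recurrence matching 1,3,7,15,31
    return out

print(binary_tree(5))  # [1, 2, 4, 8, 16]
print(cascade(5))      # [1, 3, 7, 15, 31]
```

The ratio between the two sequences approaches 2, which is consistent with the claim that one extra port at the head node doubles effective bisection bandwidth.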
Now, I have to admit that the kinds of problems that can use this particular series of graphs are those that cross-sum -- agglomeration, that sort of thing. But we have analogous models for the kinds of problems, like image processing, et cetera, that do not cross-sum. So we handle all of those cases. And we, of course, have the equivalent mechanisms for doing all-to-all exchanges, which show some of these same effects. Here's what we call a manifold. All we've done here is increase the number of head nodes.
We added two more head nodes, with one more channel per head node, and we added one more time unit to clear the movement of data between the head nodes. For the single-channel case with a relatively small number of nodes, that has the effect of increasing the bisection bandwidth by a third. So what that shows is that the geometry of the structure of the system is critical for increasing effective bandwidth. And here is probably the last of these little pictures.
This is our simple manifold. We have a hypermanifold, but it gets really hard to see all the little dots all over the place. This is two channels, with four channels at the head nodes. And what this is showing is four time units. In four time units with a single-channel binary tree, for example, you could have 1, 2, 4, 8 -- you can have eight machines talking. Here, in four time units, we can have basically 80 nodes that can clear or fill, depending on however many machines are in the nodes. Again, these are just the scatter-gather mechanisms. When we do the hypermanifold version of the same thing, it turns out that we get a much greater increase in performance: in four time units, for basically the same number of channels -- dual channels -- we move 96 nodes' worth of data by adding only a couple of additional ports. What are we talking about here? What we're talking about is that, for the first time, the number of channels involved in a communication model has a nonlinear effect on the speed of your system.
A nonlinear effect. In fact -- we're going to do the demo, so you might as well start. Can you switch over to demo one so that he can -- oh, you've already done that. I told you it was going to be ugly, guys.
He just ran a quickie little version of e to the x. We're only showing 300 digits because it goes nuts, but this has 65,000 digits of precision. And on a single node, it took 160 seconds. What we're going to do now is run it on seven nodes -- that machine right there. We're running it on seven nodes.
Let's see. You know, it takes a long time when you're standing up here waiting for a machine to write little green numbers on the screen, but there you go. Okay, it's done. Now, it took 17 seconds for seven nodes. So if we do the math -- we have a little calculator here -- you can see that our seven machines act as if we had nine. That to me is pretty cool. I don't know about you.
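The arithmetic behind "seven machines act as if we had nine" works out directly from the two timings quoted in the demo:

```python
# Demo numbers quoted in the talk: 160 s on one node, 17 s on seven.
single_node_time = 160.0
cluster_time = 17.0
nodes = 7

speedup = single_node_time / cluster_time   # ~9.4x on 7 nodes
efficiency = speedup / nodes                # > 100% means superlinear

print(f"speedup ~ {speedup:.1f}x on {nodes} nodes")
print(f"parallel efficiency ~ {efficiency:.0%}")
```

An efficiency above 100% is exactly what the textbook Amdahl's Law from earlier says should be impossible, which is the point of the demo.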
So where are we going with this? How are we going to provide this kind of capability to the community? What other kinds of things can we do with it? Of course, we're doing bioinformatics. We're building right now a very large try-before-you-buy bioinformatics website application. It will be made of, of course, Apple clusters. Molecular modeling: it turns out that our method -- we call it beta-phase, an overlapped I/O and communication and computation model -- using some of these cheap tricks, allows us to do molecular modeling that scales in most cases superlinearly, though there are some cases where it will scale just a little sub-linearly. Oh well, can't have it all. Not everything scales superlinearly, I will say that. It depends on the underlying nature of the algorithm. We do a lot of image processing today, naturally, doing things with DARPA. What they want us to do is find stuff in bizarre places. And I can't tell anybody, including my wife, what I'm actually looking at. What did you see today?
Nothing. So, neural networks. The reason why neural networks are particularly nice for us is because they're an order-N-squared problem. And it turns out that the mathematical order of the problem, depending on certain aspects, is what's generating the superlinear effect. It's the order of the problem, because it acts as a nonlinear decreaser of the amount of work being calculated when you break the problem up. Seismic processing: we've already shown very, very good effects with Kirchhoff time-migration codes and other codes. I know that's an embarrassingly parallel mechanism, but using some of our I/O models, we were able to show a 10x increase on the same cluster over several existing codes.
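The "nonlinear decreaser" point about order-N-squared problems can be made concrete with a little arithmetic. This sketch ignores the healing/merge cost that the talk acknowledges, so it's an upper bound on the effect, not a measurement:

```python
def per_node_work(N, k, order=2):
    """Work per node when an O(N**order) problem is split into k
    pieces of size N/k and each piece is solved independently."""
    return (N / k) ** order

# For an O(N^2) problem, k pieces of size N/k cost k * (N/k)^2 = N^2/k
# in total: splitting the problem nonlinearly *shrinks* the total work.
N = 10_000
for k in (1, 2, 4, 8):
    total = k * per_node_work(N, k)
    print(f"k={k}: per-node work {per_node_work(N, k):.3g}, total {total:.3g}")
```

Per-node work falls as 1/k^2 rather than 1/k, so before any merge cost the apparent speedup on k nodes is k^2 -- which is where algorithmic (non-cache) superlinearity can come from.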
And computational fluid dynamics models -- these are very, very tough codes, mainly because of the very small number of bytes being transmitted between pieces and parts during the computation. You just don't move a lot of data. And of course, if you don't move a lot of data, then scaling becomes an issue, because all your latency builds up. In our model, we found some relatively unique ways to continue to mask the communication time completely, which means that even though we're moving small amounts of data around the system, it's okay. So, as I said, we are just now coming to market. That's why we don't have a lot of flashy stuff. We've been around, again, for a very long time. We've been primarily working on mathematical models of computation and how those models interact with numbers of nodes. We found several effects that we think are novel. The effects really revolve around this notion that if you give me two slow channels, I can make them work much faster than one fast channel. If you give me two fast channels, I can make wonders occur, because the effect occurs regardless of the speed of the channel. So I know I'm not consuming all of my time, but, you know, I'm done talking, so... Thank you, Kevin. So I hope you enjoyed that sneak peek. It's not often you get to see what DARPA actually gets to work on.
So, for more information: Apple HPC resources. What you may or may not know is that if you are a developer of an HPC application on the platform, if you have a technology you want to bring to Mac OS X, we have made incredible gains by having engineers come to Apple -- or occasionally we are able to take this team on the road. When we conduct a performance and optimization workshop, we actually sit down with your code: Apple engineers, HPC experts at Apple. This is your opportunity to sit down with them, really find the logjams in your code, and get a tremendous amount more performance. I can tell you that at least a few of those very large clusters we mentioned earlier did exactly that. That was absolutely what helped them come to the platform. So if you have an application or code and this is of interest to you, talk to Apple. We are finding that the performance you can unlock by attending one of these performance and optimization workshops helps no matter where your code lives. So if you have cross-platform code, you improve your performance on Mac OS X, you get one of those aha moments, and it helps your entire portfolio. Obviously, our Apple system engineers have been tremendously busy this year, helping our customers and working with them as they put these systems together. If you have an interest or something you want to know how to do on the platform, Apple system engineers have been a great source of expertise. I would also welcome you to take advantage of Apple's developer technical support services. [email protected] is a great way to get someone's attention. So in short, if it's not clear: if you have an HPC application you want to bring to Mac OS X, we want to hear about it. If you have an interesting new technology you'd like to bring to the platform, again, we'd like to hear about it, because odds are we may have just the customer who is really looking for that type of technology.
There are some great resources where we can exchange this type of information. The Apple SciTech list is really sort of the granddaddy of HPC at Apple. It's where a tremendous amount of HPC-oriented and, obviously, science-oriented communication takes place. Since HPC received so much activity this year, we actually created an Apple HPC list. These can both be found at lists.apple.com, and they receive a tremendous amount of traffic -- users helping users, and a tremendous number of Apple engineers sign on and actually provide expertise on those lists as well. So that should really be something you monitor continually. The HPC page at developer.apple.com is found at developer.apple.com slash hardware slash HPC. Success stories, how-tos, that sort of thing are the types of material you'll find there. And of course, the Apple and Science website, apple.com/science, is another fantastic resource to keep up to date on what's happening on the platform. We mentioned a couple of sessions earlier. I would recommend that you check out the large installation and HPC systems administration session. We have a small InfiniBand cluster in the Enterprise IT lab for you to get your hands on, and you can talk to some of the SilverStorm and SmallTree guys. I also recommend that you check out the data center downstairs, where you can see that very large system we put together and ask questions of that team. The Xgrid session looks fantastic: using Xgrid to create and deploy distributed computations, on Friday. Again, another excellent session. And if you would like a peek at FXP, which Rodney showed you earlier, and at Etnus TotalView, I highly recommend you check out the Debugging Parallel Applications session. That's also the session where we will give details of how you can get your hands on the memory manager code discussed earlier.
And if all else fails, I strongly encourage you to contact [email protected] -- an incredibly easy email to remember. I look forward to helping you engage the resources at Apple and being of help wherever I can. I want to thank all of the Apple HPC developer partners and customers that are here.