HPC Technology Update - WWDC 2005

Enterprise IT • 1:03:39

High Performance Computing continues to gain momentum on the Mac platform, and the technology changes quickly. Get up to speed on the newest tools, technologies and software for HPC on Mac OS X.

Speakers: Skip Levens, Kevin Howard

Unlisted on Apple Developer site

Check out Bezel, our iPhone mirroring app →

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

We've got a lot to cover. It's been an exciting year in HPC. Today we're going to cover and touch on some of the HPC trends and showcase the major deployments of some of Apple's larger HPC deployments. We're then going to survey the HPC technologies on the platform.

There's a very wide range to choose from. We don't have time to talk about each, but I should point out there are representatives from many of these projects, many of these companies in the room today, and thank you for coming and thank you for your support on the platform.

We are next going to cover the two major high performance interconnects available on the platform, MirrorNet and Infiniband. And next, we're going to take a survey of the tools available on the platform and get a very important update from Absoft, who are going to let us know about everything in their portfolio and what they're doing on Mac OS X. Finally, we're going to give you a glimpse at a very interesting HPC technology, and then we'll just go ahead and wrap up after that.

So it's pretty clear to almost anyone that is building high performance clusters, rather high performance computers, that clusters are rapidly becoming the predominant way that we build and put together high performance compute environments. This chart represented, this chart put together at the last Supercomputing Conference tracks the major supercomputer architectures going back to 1998.

So we've got 500 systems going back from 1998. In orange at the bottom are clusters. We can see that as they gradually rise, they push, they have displaced other architectures on the platform. Clusters, just about everyone knows, are the predominant way, and this on our platform is personified in the XSERV G5.

The XSERP G5 is the foundation for clusters. There have been a tremendous amount of HPC deployments, tackling a very broad range of challenges in high performance computing and the sciences. Let's check in with some of the largest ones. Probably the most famous one and well familiar to all of you is Virginia Tech System 10.

System 10 this year converted from dual 2 GHz Power Mac workstations to 2.3 GHz XServe G5s. This allowed them to reduce their footprint by two-thirds and pulled a 425 kW load down to 310 kW. Major savings in power and the concomitant heat that has to be managed out of that, again, that problem becomes even more manageable.

This year they generated a 12.25 Teraflop run and qualified as number 7 on the top 500 list of supercomputers. And they remain the largest academic cluster. We think they're going to get some competition this coming year, but are still definitely the largest and most powerful academic cluster. They entered production this year and they, what's very exciting, as not only an HPC cluster for researchers, are also a technology test bed.

And in fact they welcome new technologies to the platform. If, say, someone wants to do a scalability test and see if the new technology can come over, if you contact them, they are very willing to entertain such requests. So we're very pleased, very excited at the work that Virginia Tech's doing.

Another very exciting cluster this year, the UCLA Dawson Cluster. The Dawson Cluster investigates plasma physics. It is a mix of two 128-node systems, one composed of 2 GHz XServe G5s and the other of 2.3 GHz XServe G5s. Runs over GigE, leverages LAMMPI. This system entered production very quickly, and they were able to generate just on one of those 128-node sides, just shy of a teraflop run, which actually qualified them last year as number 444 in the top 500 list.

Since then, using Dagger Research Pooch Pro and MacMPI, they generated an even higher run. I think this shows, though, that 128 nodes is within the reach of a lot of your devices. It's one of the top 500 universities, and frankly, that can qualify as one of the top 500 supercomputers.

Of course, we have to mention one of the largest systems, COLSA. The Mach 5 system in Huntsville, Alabama investigates hypersonic missile research. These are incredibly dense CFD codes that are running on this very large collection of systems. This system has seen a lot of development this year, and again, is crunching those very large CFD codes. So, fantastic system.

At University of Illinois, Urbana-Champaign, they stood up the Turing Cluster this year. This is used by their CSE department and the Center for Simulation of Advanced Rockets. So, imagine the very complex CFD analysis that has to be done as you're burning away solid motor fuel in a rocket. And that's exactly what they're simulating there, among other codes.

We think this is going to be a very, very high performing system. It's 640 nodes. It's a very high performing system. It's 640 nodes. It's a very high performing system. It's 640 nodes. It's leveraging the mirror net interconnect. The system was budgeted around 3 million, and we think it's going to place very well on the very soon to forth come top 500 list. So, outstanding work there at Urbana-Champaign.

Another fantastic cluster that you may have heard of that did generate some press is the XSEED cluster at Bowie State University in Maryland. So this system again is 224 nodes leveraging the mirror net interconnect, and they generated a 2.1 teraflop benchmark run. And so naturally we think they'll make this next list that comes out shortly. It was really fun to watch the progress of this system because this system was student and faculty planned. It was an incredibly useful teaching aid, incredibly useful learning aid to help teach HPC and CSE material for the campus. So well done.

Now for every large system that makes the list and makes a lot of headlines, there are dozens and dozens of smaller systems. And this is probably best personified in the Apple Workgroup cluster. The concept here is that we take it about as far as you can possibly get to true turnkey cluster computing. Very well integrated set of applications, the hardware is integrated, all the technology pieces that comprise a cluster are all very tightly integrated.

Inexpensive enough and with enough performance that these types of machines are within the reach now for workgroups, desk side, departments, discretionary budgets, and we've even seen people enter it into grants and then be awarded funds for a grant, purchase the cluster and away they go. So it really can scale to fit a very wide variety of needs. I should point out it makes a very nice dedicated x-grid cluster as well.

So it takes a lot of technology, a lot of technology to make a cluster. All of the technology, and we've represented them from the hardware at the bottom up through the OS, hardware interconnects, the middleware, all the way up to your application, which is what we want to get at.

So there's all this technology, there's a very rich array of technology on the platform for you to choose from when you're building your cluster. I should point out that we don't have time to talk about each of them, but we'll call out unique developments that you should be aware of that have happened this year.

At the hardware layer, now we do know that it's primarily for visualization. Many people are using the Power Mac G5 because it can host very fast graphics cards. So we do know that the Power Mac G5 remains probably the leading choice of people doing visualization in HPC. And we do know that people are trying to build very large collections of Mac minis for X-Grid research.

However, That would be quite fun to play with. But the XServe G5 remains the primary building block of building clusters. The 64-bit computing, the massive amount of memory available, amazing amount of performance power, all told, make it an extremely well-balanced, basic building block of a cluster, with fantastic I/O throughput, very good environmentals, and excellent processing power. Major developments here are that we now support up to 2GB DIMMs on the XServe G5, and if your wallet can stand it, and you can afford to max out 2GB chips, you can fit up to 16GB of ECC memory on the Power Mac G5.

So at the OS level, Tiger landed a tremendous amount of benefit of immediate interest and use for HPC customers. Of course, 64-bit memory address space, being able to run and build 64-bit applications using Xcode 2 and GCC4 with the GCC4 underpinnings is of immediate use and benefit to our HPC customers.

For those of you that have had many trips to SysControl to tune kernel parameters, we want to guide you to have a look at LaunchD, which is a way to set global system-wide kernel parameters. ManLaunchD gives you a wealth of information, and I'd like to point out that at the large system administration session, Josh Durham of Virginia Tech is going to take you through exactly how he leverages LaunchD. ManLaunchD and Tiger, to do exactly that, tune his systems for HPC environments.

One of the things you may have heard at Virginia Tech was that for very large systems, if you want to eke out the very last mile of performance on your HPC codes across a very large number of systems, you would like to have physically contiguous memory to be available to those HPC applications. I am very pleased to tell you that Apple's DTS department has written code on Tiger that does exactly that for you.

So if you are an HPC customer and you know exactly what you're doing, this code I think will be very instrumental in helping show you how to get even more performance out of your HPC codes. And if you would like to see that, I recommend you have a look at Friday's session on debugging parallel applications where that will be introduced. that would be introduced in that session.

So moving up through the hardware interconnects, we have a broad range of choices in the gigabit Ethernet. A company by the name of Smalltree has actually taken a very early lead in writing Mac OS X drivers for gigabit Ethernet cards from a number of vendors. So they in fact offer one, two, four and even six port copper gig E cards, and then that combined with 802.3ad runs very nicely in a server environment. They also offer one and two port optical gigabit Ethernet cards. So if you already have an optical infrastructure and want to do very long runs, this is an excellent choice.

Surprisingly, another reason that some customers have reported wanting these cards is because in very highly secure environments, they know that you can't basically listen in on an optical line. So movie studios, for example, they're not going might be of interest, certain three-letter agencies that don't want people knowing what's passing down those wires might be a good choice.

You might like it just for the simple fact that you already have fiber optic plumbing. Also available, SmallTree wrote a driver for Intel's 10 gigabit Ethernet card, so if you need 10 gigabit Ethernet and you want to move very large files, possibly from one work group to another, this again has been a very popular of choice with our customers.

So moving on to high performance interconnects. Miranet, you know, have been a long time developer on the platform. This is an example of their connectors and the types of switches they have. So news from them is that they expect two new XSERV-based clusters will appear on the June 2005 list of the top 500 supercomputers. So stay tuned for news and more product news from them.

They've had a lot of success with their MX message passing framework. The MX2G has been released for the PCIX network interface cards. This is an even lower latency than the previous generation of cards from them. They are getting on the order of 3.6 microseconds signal latencies over raw hardware, and only 3.8 microseconds using MPitch. The MX2G is a great example of a message-passing framework.

One of the really neat things about these cards is that you install them in your system, and it shows up just like a configurable network interface card. You're able to pass your TCP/IP traffic over that. They do offer, again, their own middleware, and are working on the next generation MPitch2 MX to be coming soon.

So surprisingly, they report that Tiger allowed them, not surprisingly, but very welcome, they report that Tiger allowed them to get even better performance. MX uses a unique approach to memory management. So for messages smaller than 32 kilobytes, they simply copy it across from node to node. Larger messages than that get pinned and unpinned on demand, and they say that Tiger's code made them more efficient.

And that practical on Mac OS X and allowed them to deliver as much as four times the memory to the application. and Kevin They also announced experimental support for 64-bit binaries on Tiger. And again, we will stay tuned for news from them at the Top 500 at the Supercomputing Conference next month.

So the other high performance interconnect on the platform is InfiniBand, and we are very pleased to welcome to the platform the combination of SilverStorm Technologies and SmallTree Communications, who have partnered to deliver a tremendous amount of InfiniBand options on our platform. So, who are SilverStorm? Well, the first thing to tell you about them is that you probably already know them as Infinicon Systems. They were founded in June of 2000, and again, originally as Infinicon Systems. They have sales and support teams to answer technical questions or to field customer requests in the US, Europe and Asia.

They have a very strong emphasis in high performance technical computing, or commercial computing field, so areas like oil and gas. They offer a complete solution set for both HPC and enterprise compute customers. And we'll explain in a little bit exactly what that could mean to your data center. One of the things that could mean is that through their solutions you can pull together fiber channel, Ethernet, and InfiniBand architectures. And again, we'll show you what that means in a few minutes.

So who are Smalltree? So we've already mentioned them once for their gigabit Ethernet cards. Smalltree are a hardware and software development company. The principles of the company come from companies like Cray and SGI and Sun and HP. They have technology partnerships in place with Intel, Silicon and Silverstorm, obviously, and also have experience as system integrators. They have a wide range of offerings that we've seen from GIGI up to InfiniBand. And probably the best news here is that they are focused exclusively on the Mac OS X platform. So they've taken a very nice lead in building network performance on Mac OS X.

So we may have heard a lot about InfiniBand and exactly what it means. I can tell you InfiniBand is a very dense spec with a lot of protocols. But the vision for InfiniBand was really building an enterprise class infrastructure that really started with a clean sheet of paper. What if we didn't have to sort of pull along other networking technologies?

What if we could start with a clean slate and try to bridge together fiber channel? This is all made possible through a very fast, high bandwidth, low latency interconnect. And this was originally what Infinicon was founded on. So in the center in orange would be one of their switches that is able to tie together gigabit Ethernet. So this is a very simple way to bridge off to fiber channel storage and bridge off to high performance applications, be it for commercial computing or technical computing.

So they do offer a complete range of solutions, starting of course at the very bottom with the host channel adapters. This is an example of their PCIX host channel adapter. Switches like this that are able to bridge the I/O infrastructure. It might be a bit hard to see in the picture, but on the left is a modular chassis that allows you to plug in gigabit Ethernet in the center, up to 12 ports of InfiniBand, and on the right, two ports of fiber channel.

They of course have a range of switches ranging from the 1U switches that make possible very flexible and very efficient topologies for designing supercomputer layouts, and also great big monsters that will go up to 288 ports as well. And then to tie all of that together, they have arranged, so in combination of, by working with Smalltree and Silverstorm, they've pulled together a fantastic range of software from what sits directly on top of the HCA, all the way up to how you manage the fabric, and how you enforce things like quality of service. So, a complete end-to-end solution.

So what's here now inside that InfiniBand stack? Today, as you would expect, MPI is here. MPI codes run on SilverStorm InfiniBand. They have also already landed and fully support VirtualNIC, so iNIC, and IP over IB, meaning that you can pass TCP/IP, any TCP/IP application can simply run over InfiniBand. They have also landed SRP, Storage Resource Protocol, that allows things like Fiber Channel to simply plug in using one of the bridging switches we saw, and allows InfiniBand clients to reach out and simply see Fiber Channel storage. They're faithfully passing through the Fiber Channel and SCSI protocols.

So one of the things having that capability unlocks is that we can have applications like Oracle 10G Rack sitting up in user land that is actually running on the Mac OS X. So that's a great opportunity to have. So we're going to be reaching out and seeing storage, Fiber Channel storage, but it's all happening over InfiniBand, and TCP/IP also, same wire, going over InfiniBand. So we think it's going to unlock some very interesting ways for enterprise and HPC centers to leverage InfiniBand.

So we're actually demonstrating exactly what I described down in the data center downstairs. In one of the racks in the data center on the first floor, you will find four XServe cluster nodes that have Oracle 10G Rack installed. Oracle 10G Rack is leveraging storage on an XServe RAID, and again, simply by plugging into one of the InfiniO switches in the bottom.

So while we could plug in fiber channel off to larger fiber channel switches, in this case we're simply plugging directly the two ports of fiber channel from the XServe RAID into the InfiniO switch. That is then simply presented and available to InfiniBand clients. And then of course we had extra ports on the switch, so we added on up to six more ports to run HPC codes. So, not too bad for one one-use switch. Well done.

What's also exciting and what SilverStorm bring to the table in working with Apple is that because of a lot of their existing emphasis on commercial codes, they have partnerships in place with a fantastic portfolio of companies to work with, including Fluent, Ansys, and Abacus, and they have committed to help drive these applications to Mac OS X, and naturally we'll be looking for customers that are ready to do that. We're ready to help them.

Um... In the IPC middleware arena, moving slightly up the stack, we have a very broad range, very large toolkit to choose from. Each of the hardware vendors, of course, have a version of MPI that is tightly integrated to their hardware. But other major news is OpenMPI, this is the next generation MPI from the people who brought you LAMMPI, has announced it is on the platform and have even said that they're working on XGrid support. So thank you for that.

We have mVapage, the mVapage team at Ohio State have conducted a very, very fast run, in fact, might even be the fastest networking speed recorded on a Mac, in excess of 900 megabytes per second and latencies under six microseconds. So a lot of fantastic work. We have a lot of great work going on at the mVapage program, Dr. DK Panda and his team at Ohio State.

In the area of code libraries, probably the major news here is that if you do anything, if you use BLAS levels 1, 2 or 3, LAPAC, if you do any type of digital signal processing, if you do fast Fourier transforms, you should really leverage Dash Accelerate. So the Accelerate framework delivered in Tiger is absolutely your friend. A little bit later, Rodney Mock of Absoft will take you through and talk about some of the IMSL math libraries that Absoft will deliver.

So in the area of file systems, the state of the art kernel in Tiger has landed, has yielded even better performance for NFS and AFP on the platform. XSAN has been a very compelling choice. It is being tested and used in a number of HPC installations. A common way to do this would be to make each head node on top of a rack of other compute nodes the client on XSAN, and now they can all now see and pass along and share shared storage.

Lustre, on the heels of announcing that they had worked on Panther, have done the same, and are now in beta stage on Tiger, and are, if large scale file systems like Lustre are of interest, I encourage you to contact them. They are looking forward to working on Tiger. Lastly, I'll mention Blaze. Blaze is a network file system offering from the guys at Smalltree. It is their goal to make network file transfers very fast, and I encourage you to check that out. That was just announced this Monday.

In the area of resource management, I have to absolutely encourage you to check out Xgrid. People are using Xgrid in some very interesting ways. It already comes from Apple. It's built in a Mac OS X server, so leverage that if you can. Moab, the commercial, it can be grossly oversimplified to think of it as the commercial version of Maui. Moab was released this year. They substantially re-architected what happens under the hood. Very recently, the team at Oscar HA at Louisiana State have said that they, in conjunction with the OpenMPI team, are now going to tackle Oscar HA on Mac OS X. So expect news on that.

Okay. In the tool space, obviously Tiger delivered a fantastic version of Xcode with GCC and GDB 4.0. Some very important news is this year, Etnis have released their Totalview parallel debugger on the platform. And in fact, they accompanied us on a road show where we went to several major cities, and people got a chance to see Etnis Totalview in action.

Absoft have a very nice portfolio of tools, and what I would like to do is go ahead and invite Rodney Mock, who is the Absoft HPC Technical Director, to come up and tell us more about what they have coming up next. Rodney. Rodney Mock: Thank you. Nice, John.

Thanks Skip. I'm Rod Mock, I'm Absoft HPC Technical Director. So if you haven't heard about Absoft, we've had continuous tool support on the Apple platform since 1984. We've been there through the various transitions, including the transition to PowerPC. We're going to be there for the transition on Intel, so you'll be seeing tools from Absoft for the Mac on Intel coming forth.

And we've also done the first commercial Fortran, Fortran 95 compilers. For the Mac platform, we'll have a full Fortran 2003 compiler also available on the Mac platform and intend to be the first there. And we are going to be continuing development of cutting edge developer and high performance tools for OS X Tiger.

Some of the Absoft solutions that we've done on the platform are our High Performance Computing Software Development Kit for XServe clusters. This is a kit that will bundle together the best of breed solutions available for the platform, make it all easy to install in less than five, six seconds.

You'll be able to install this kit, have all the compilers, debuggers, math libraries and other tools that you need to do to optimize on the platform. And we'll talk about each of these in detail here in a few minutes. Another thing that we've announced at this conference is our FXP MPI Parallel Debugger for OS X Tiger and Panther.

You'll be able to debug basic MPI codes with this debugger. It's a low-cost entry-level debugger from Absoft. You can also get Total View, which Skip talked about. If you need to move up to the next level, do hybrid debugging, more advanced type codes, that solution is also available from Absoft.

64-bit and 32-bit Pro Fortran compiled for Tiger. Right on the heels of Tiger, we released our 64-bit Fortran compiler, as well as our other tools. This is available right now. You can download it if you need to do 64-bit development on Tiger. You can use our tools and immediately take advantage of the 64-bit capabilities.

We're also announcing the 64-bit and 32-bit IMSL Fortran numerical libraries, version 5.0, with serial and MPI support for XServe. If you're currently using IMSL on the Mac platform, want to take advantage of your cluster, it makes it very easy to use with MPI. It hides a lot of the MPI complexity away from the user and lets you take advantage of the resources that you have.

[Transcript missing]

Some other best of breed components that are included, we have MPish2, LAMMPI, we'll also be supporting OpenMPI and other various MPIs as they come out in the kit. We have the 64-bit and 32-bit IMSL libraries, as well as Absoft Pro4chan we discussed earlier. We also have available VAST. VAST has two technologies. One is VAST AlteVec, which is bundled on the CD. This lets you do auto-vectorization for your codes.

If you have a code, you want to be able to take advantage of the AlteVec unit, but you're not familiar with how to do it, you can just run it through VAST. It'll automatically auto-vectorize your code for you. It gives you good diagnostics to tell you why certain parts of your code couldn't be vectorized.

For example, if you have possible pointer early seeing or other reasons, and helps you kind of work your way through to get even higher performance out of your code. It also has another tool that's automatic parallelization and OpenMP support. It'll go through your code, automatically parallelizes the parts that it feels that you can gain performance on. If you want to use handwritten OpenMP directives, you can also compile your code using VAST to support OpenMP.

The next product that we announced is our FXP MPI Parallel Debugger. It's built on our FX2 serial debugger that supports all the major compiler vendors on all platforms. It also supports all the major MPI implementations all in one easy to use package. It is very easy to use, it's very sharp looking based on the Aqua interface.

If you want to see a sneak preview of this debugger, please come on Friday to the parallel debugging session. We'll be able to show it to you, and you can enter in our beta program to help us beta this this month. If you need a higher level debugger, like I mentioned before, TotalView is available, and we can make that, that's available through AppSoft and supported as well.

It has some unique features. One is automatically attached to MPI processes. This is very popular, so you don't have to figure out what your PIDs are on different nodes. It will automatically attach to whatever MPI processes are out there, present it to one easy-to-use interface, so you'll be able to step through your code, run on various, you know, view the registers, variables, stack on all the different nodes, all through the same interface. It also has a visual element that lets you see what the state of your nodes are. It has like a color-coded scheme, so you can see which nodes are dead or possibly deadlocked. Other various easy-to-use visual elements that help you quickly identify problems with your code.

[Transcript missing]

Absoft Pro Fortran, this also was released on the heels of Tiger. Like I said, it has 64-bit and 32-bit support for Fortran 95. It includes free technical support from Absoft, which is one of our, you know, we have very good tech support. We support all the products, including on these various HPC SDKs, Total View, all are supported centrally through Absoft.

It includes C, C++ compilers. It has an IDE that's cross-platform on Windows, Linux, and other platforms, including our FX2 graphical debugger is included at no charge. It has Blast libraries, Atlas, HDF, other libraries that make it easy to use on the platform, and it is fully link compatible with GNU.

Finally, the IMSO Fortran Numerical Library. This has a complete set of scientific algorithms and routines. Instead of having to hand code your own, they have a vast array of these routines that you can use. They're already pre-validated for the platform, so you know you can get the correct answer. They have excellent documentation.

If you want to then scale up, if you get a cluster with a few simple modifications to your code, you can then have MPI support, and it hides much of this complexity from you. It handles bundling up, packaging the arrays, setting them out to the nodes. All you have to know is, "I want to solve this problem. Here's the algorithm that I want to use." You leave the rest to IMSL.

So roadmap going forward. Of course, we're going to be supporting the Mac on Intel. That's going to be a very big push for us. We're going to have full Fortran 2003 support on the platform. We're going to be increasingly improving our speed of our compilers, tools and utilities, of course, on the platform. Continued emphasis on our technical support and introduction of some new HPC tools that you can look forward to seeing this time next year. So thank you.

Thank you. Thank you, Rod. Okay. So we did a quick run through of the technology stacks, but what we're really trying to get at is your applications. We're trying to get at what exactly, what challenge it is, what question it is you're trying to answer. There are a tremendous amount of technologies, tremendous amount of applications that hit the platform.

Some of them commercial, some of them open source. But what we want to do here today was introduce you to technology that looks to be very promising in the HPC arena. And to show that to you, I'd like to ask Kevin Howard from Massively Parallel to join us and give us a look. Kevin? Thank you.

Okay, I'm Kevin Howard from Astle Loop Parallel Technologies. We're going to show a demo, but first I have to tell you, since this is an Apple audience, that it's going to be ugly, really ugly, because it doesn't have flashy graphics or anything like that. It's just going to show one impossible thing. So hopefully that one impossible thing will be compelling enough so that you, or at least Apple, invites me here again sometime.

Okay, why massively parallel technology? What do we do? Where do we come from? Nobody's ever heard of us. Well, our technology really is based upon years of research. We've been around for, I think, six years now. Primary funding came from the Defense Advanced Projects Agency, DARPA. What we did is we derived a new version of Amdahl's Law.

What this version does, the standard version of Amdahl's Law is kind of pathetic. Everybody in here who does high performance computing knows that if you open up a parallelization book, in chapter one, page three, it will say Amdahl's Law, and you never see it again. Why is that? That's because Amdahl's Law does not fundamentally talk about number of channels, speed of channels, latency issues. How physical topologies are connected up, or any of the other real issues that face practitioners of the art.

And because it really only talks about two items, it talks about Amdahl's fraction, which is the percentage of parallel activity, and it talks about the number of nodes. That's kind of useless for doing real work. What we do is we've taken this new model. This derivative. Found out how it works in terms of complete overlap, overlapping communication and computation, and overlaid that with different communication models. We'll show you some of those models, not all of them, but just a few of them to give you a flavor.

These are, of course, proprietary, but so what? What we're going to show here are a number of applications with true, what we call true super-linear, capability. What do I mean by true super-linear capability? What I mean is it is not a function of cache effects. Super-linearity that comes right from Amdahl's Law, and it comes right from the algorithms themselves if done properly.

Okay, basically our technology is kind of instantiated in a number of libraries because like everyone else, we want to hide how hard it is to actually do this stuff because it is hard. What we want to do is show that you can get much higher performing systems using some of these techniques.

In fact, what we're going to show is something kind of silly. In the world of high bandwidth, low latency systems, we have a machine over there where we just are using 100BaseT. And we're going to show at least one application with super linear performance. We're going to show some pictures in a few minutes of some performance graphs on real machines also using just 100BaseT. We're talking 80 nodes, that sort of thing, so a reasonable size. A number of applications with super linear performance.

Okay, well here's the first one. Doesn't, in fact we're going to show E to the X. Why do we choose E to the X? A friend of mine, John Gustafson, who's been in the high performance computing community for pretty much forever, and he works for a competing company, so I won't mention it.

He ran E to the X on his competing companies machine, and what they found is that it doesn't really scale very well. Why? Because unlike Pi and other things like that where Ramanujan was good enough to give us solutions that you collect, compute the last digit, and so you can spread that over multiple machines, unfortunately E to the X is a true series expansion problem, so you have these huge series expansions, and you have to do the cross-summing all the way through. And so it doesn't really scale greatly. It scales okay, but it doesn't scale greatly. What we were able to do, this is with two channels, and we'll explain why.

200 base T channels on fast machines, 80 of them, and the effect of that machine was to have performance that was near 140 times, 140 machines is what the equivalent was. So we had 80 machines that were acting like this, and we had to do a lot of work to get it to work. And so we had 100 machines that were acting like 140 machines. How can that be?

[Transcript missing]

Okay, here's our first little, our simple cascading model. Why are we showing this? It doesn't look like much. It looks kind of like just a tree architecture, right? Except that if you look at a standard B-tree, binary tree architecture, the expansion rate is 1, 2, 4, 8, etc. You can go play over there. This is Glenn. He'll be playing the Mac for you today.

This model here, the growth rate of how many nodes are in communication per unit time, oh and by the way that thing that says home node is really our version of a head node or something analogous to that. It's not really part of the communication. But its growth rate for a single channel is 1, 3, 7, 15, 31, etc., etc., which is basically 1 minus 2 times faster than a B-tree. What the heck does that mean? Since what it means on a single channel is that prior to our discovery, our company's discovery of this pattern, the binary tree was considered for a single channel to be the best expansion rate you're going to.

So we're going to be able to have this. This is showing the effects of just adding a single additional

[Transcript missing]

The number of channels that are involved in a communication model have a non-linear effect on the speed of your system. A nonlinear effect. In fact, the nonlinear effect is-- we're going to do the demo, so you might as well start.

Can you switch over to demo one so that he can, oh, you've already done that. The reason I'm just showing you, I told you it was going to be ugly, guys. He just ran a quickie little version of E to the X. We're only showing 300 digits because it goes nuts, but this has 65,000 digits of precision. And it took, on a single node, it took 160 seconds. What we're going to do is now run it on three, run it on seven nodes, that machine right there. We're running it on seven nodes.

Let's see. You know, it takes a long time when you're standing up here and waiting for a machine to write little green numbers on the screen, but there you go. Okay, it's done. Now, it took 17 seconds for seven nodes, so if we do the math, we have a little calculator here. If you do the math... You can see that our seven machines acts as if we had nine. But that to me is pretty cool. I don't know about you, but.

So where are we going with this? How are we going to provide this kind of capability to the community? What kind of other things can we do with this? Of course, we're doing bioinformatics. We're building right now a very large bioinformatics website application to try before you buy. It'll be made of, of course, Apple clusters.

Molecular modeling, it turns out that our method of, we call it beta phase, overlapped I/O and communication and computation model, using some of these cheap tricks, allows us to do molecular modeling that scales, in most cases, super linearly, but there are some cases where it will scale just a little under sub-linearly. Oh well, can't have it all.

Not everything's scalable. It doesn't scale super linearly, I will say that. It depends on the underlying nature of the algorithm. We do a lot of image processing today. Naturally doing things with DARPA, what they want us to do is find stuff in bizarre places. And I can't tell anybody, including my wife, what I'm actually looking at. What did you see today?

Nothing. So neural networks, the reason why neural networks are particularly nice for us is because they're an order n squared problem. And it turns out that the mathematical order of the problem, depending on certain aspects, but the mathematical order of the problem is what's generating the super linear effect. It's the order of the problem because what's happening is that acts as a nonlinear decreaser of the amount of work that's being calculated when you break the problem up.

Seismic processing, we've already heard about that. We've already shown very, very good effects with Kirchhoff time migration codes and other codes. I know that those are -- that that's an embarrassingly parallel mechanism, but using some of our I/O models, we were able to show a 10x increase on the same cluster over existing -- several existing codes.

And computational fluid dynamics models, these are very, very tough codes, mainly because of the very, very small number of bytes that are being transmitted between pieces and parts during the computation. You just don't move a lot of data. And of course, if you don't move a lot of data, then scaling becomes an issue because all your latency builds up. In our model, we found some relatively unique ways of being able to continue to mask the communication time completely, which means that even though we're moving small amounts of data, including around the system, it's okay.

So we are -- As I said, we are just now coming to market. That's why we don't have a lot of flashy stuff. We've been around again for a very long time. We've been primarily working on mathematical models of computation and how those models interact with numbers of nodes. We've found several effects that we think are novel.

The effects really revolve around this notion that if you give me two slow channels, I can make them work much faster than one fast channel. If you give me two fast channels, I can make wonders occur because the effect occurs regardless of the speed of the channel. So I know I'm not consuming all of my time, but you know, I'm done talking.

Thank you, Kevin. So I hope you enjoyed that sneak peek. It's not often you get to see what DARPA actually gets to work on.

[Transcript missing]

There are some great resources where we can exchange this type of information. The Apple SciTech list is really sort of the granddaddy of HPC at Apple. It's where a tremendous amount of HPC-oriented and obviously science-oriented communication takes place.

Since HPC did receive so much activity, this year we actually created an Apple HPC list. So these can both be found at lists.apple.com and receive a tremendous amount of traffic, users helping users, tremendous amount of Apple engineers sign on and actually provide expertise on those lists as well. So that should really be something you monitor continually. The HPC page at developer.apple.com is found at developer.apple.com/hardware/HPC.

Success stories, how-tos, that sort of thing are the types of material you'll find there. And of course the Apple and Science website, apple.com/science, is another fantastic resource. So you can find all of that there. And of course, we also have a lot of resources that you can use to keep up with the data and what's happening on the platform. We mentioned a couple of sessions earlier. I would recommend that you check out the large installation and HPC systems admin session.

We have a small InfiniBand cluster in the Enterprise IT lab for you to use and ask questions of that team. The Xgrid session looks fantastic. Using Xgrid, create and deploy distributed computations. Using Xgrid session on Friday, again, another excellent session. And if you would like a peek at FXP that Rodney showed you earlier and Etnis TotalView, I highly recommend you check out the debugging parallel application session. That's also the session where we will be giving details of how you can get your hands on the memory manager code that was discussed earlier.

and if all else fails, I strongly encourage you to contact [email protected]. Incredibly easy email to remember. I look forward to helping you engage the resources at Apple and being of help wherever I can. I want to thank all of the Apple HPC developer partners, customers that are here.