
WWDC05 • Session 643

Debugging Parallel Applications

Enterprise IT • 1:08:13

Taming complex MPI codes is easier with a good plan and the right tools. In this session we'll discuss best practices and techniques in deploying MPI codes to Mac OS X clusters, and how to get the most from your code with tools from Apple and third-party providers, including Etnus's TotalView Parallel Debugger.

Speakers: Steve Peters, Yusuf Abdulghani, Chris Gottbrath

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

Welcome, and congratulations for making it to the last session in the last hour of the last day. We'll be trying to top off your tank this afternoon and send you home brimming with tips, tools, and techniques for parallel debugging on HPC systems. I'm Steve Peters, a senior software scientist in the Mac OS Platform Performance Engineering group. This afternoon I'm going to give you a bit of a personal, perhaps idiosyncratic, perspective on the kinds of cluster debugging problems that come across my desk. A colleague of mine will tell you about Shark, with some recipes for debugging and some new features.

A long-time partner on the Apple platform, nearly two decades running, I think, getting close: Absoft has some interesting and exciting new technology they're going to tell us about. And a new partner coming to Mac OS X, Etnus, will tell us about their TotalView debugger, something you may know from other places.

So I'm a software guy, interested mainly in the performance of math and science codes. I did some work with the Virginia Tech crew bringing up System X, both in its G5 Power Mac incarnation and in the final G5 Xserve setup. I helped with some of the tuning of the benchmarks on the COLSA MACH 5 system. I meet with HPC developers in one-on-one Performance Kitchens; if you haven't had the opportunity, or feel you have a problem that's important to work out, check with Skip and maybe we can get you into a Performance Kitchen, where the results have been, I think, uniformly good.

And I occasionally do some consulting internally on contracts and projects. So, maybe some reasons you might want to listen to me. I'm going to talk a bit about what I see as the two classes of problems that come across my desk. First, new HPC codes. I'll have a bit to say about that; I think some of the following presentations will have really specific tools to help you work out those issues. I'll try to give a brief rundown of best practices for implementation performance, and pointers to some debugging aids.

But for the most part, what I get to see are existing HPC codes that people are bringing from Linux, Irix, Solaris, or elsewhere to Mac OS X. And there, debugging usually means, "How come this isn't working as fast as I hoped it would?" That's where I get involved. I like to point out where the Mac OS X performance high ground is, and perhaps leave a few signposts for how to get there quickly. I'd like to leave you with some rules of thumb and some cautions about bring-up, particularly of big clusters.

So what's to say about implementing new code? Why debug when you don't have to? Please consider Apple's tested and tuned libraries. In our Accelerate framework we offer the industry-standard BLAS and LAPACK; LAPACK is the gold standard for numerical linear algebra. We offer vMathLib: single-precision, four-at-a-time elementary math functions. vForce: elementary math functions over single- and double-precision arrays of memory operands, much like the MASS library on IBM. vDSP: single- and double-precision digital signal processing, particularly good on FFTs. We now think, at least for certain radices, we handily beat FFTW.

And vImage, an image processing library with many pixel formats, highly optimized for the platform. One of the advantages of using these libraries is that Apple takes care to insulate you from changes in the underlying instruction set architecture. So folks who code to the fast Fourier transforms in the vDSP library don't need to worry about changes to the underlying processor.
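To make that concrete, here is a minimal sketch (not from the session) of calling Accelerate from C; cblas_dgemm and vvsqrtf are real BLAS and vForce entry points, while the toy data is made up for illustration:

```c
// Minimal sketch: using Accelerate's BLAS and vForce from C.
// Build with something like: gcc -std=c99 accel.c -framework Accelerate
#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void) {
    // C = A * B with the standard CBLAS interface (double-precision gemm).
    double A[4] = {1, 2, 3, 4};          // 2x2, row-major
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
    printf("C[0][0] = %g\n", C[0]);      // 1*5 + 2*7 = 19

    // vForce: elementary math over whole arrays of memory operands.
    float in[8] = {1, 4, 9, 16, 25, 36, 49, 64}, out[8];
    int n = 8;
    vvsqrtf(out, in, &n);                // out[i] = sqrtf(in[i])
    return 0;
}
```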

Consider some auto-vectorization tools. All the current processors, and all the ones we contemplate, have SIMD engines, and you can get big performance gains and correct code quickly from automatic tools. Absoft has offered VAST, a wonderful tool in that domain. And I'll be joined by David Koehn, who will be presenting some of the most recent features in GCC 4 that you might want to check out, which take similar, though I don't think as ambitious, approaches as VAST.
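Here is a minimal sketch, not from the talk, of the kind of loop such auto-vectorizers handle well; the flags shown are real GCC 4 options, while the function itself is an illustrative example:

```c
/* A loop shaped for automatic SIMD vectorization, e.g.:
 *   gcc-4.0 -std=c99 -O2 -ftree-vectorize -c saxpy.c
 * The restrict qualifiers promise the compiler the arrays don't alias,
 * which is what lets it emit 4-wide AltiVec operations safely. */
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* maps cleanly onto vector multiply-adds */
}
```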

Follow some cautious bring-up discipline. A lot of stuff can be worked out on a dual-processor, single-node G5, either a Power Mac or a single head node in your server rack. I'll say a little more about expanding that upwards later in the talk.

And consider TotalView for your high-end debugging needs as you move up to larger and larger numbers of systems. Finally, for performance issues in new code, for understanding where to spend your time: please, please use Shark. It's one of the best things I've ever seen in my career in this business. It's a wonderful, wonderful tool.

Ported codes are the bulk of what I see, ranging from dusty decks to last year's version of BLAST, or the BLAS, or the scalable BLAS. This is existing code. It already builds and runs on some cluster, maybe yours, maybe a colleague's, and you want to port it to Mac OS X. Often, stuff just works.

But, and this is where I often get called in, performance has been left on the table. We're just not seeing the kind of gains people have laid down their hard cash for. What can we do? I usually start right from the beginning: let's check the config scripts and the makefiles, and just see that we're not being defaulted out of the performance world.

We want to make sure the config script is aware of the platform: that it knows about Mac OS X or Darwin, and that it knows about the particular processor we're targeting, a G5, not just a generic Apple processor. Most autoconf scripts nowadays understand Darwin; often, just by updating to the latest version of the config script, you'll find Darwin has been included. Much of the stuff you see on Fink, for example, an open source repository of tools ported to Mac OS X, is of that character.

Has any tuning been done for the platform at all? Is the code aware of the SIMD unit? Is the code aware of any additional double-precision floating point resources? Are there machine-dependent implementations in subdirectories? Would it be worthwhile to replace the generic loops with ones tied a little more closely to the capabilities of the Mac OS X platform? Then there are compiler issues, as you might well imagine. Are you using a performance compiler? g77, I'm sorry, doesn't count. But there are other options; Absoft has a wonderful compiler suite.

Are you using the right performance options? CFLAGS may need to be set a little differently, and it's worth checking that, again, you don't default into debug settings, or storage-conserving settings that you really don't need, when you would benefit from optimizing at level three or beyond.

Library issues. Again, I'm going to return to my mantra: try to use Apple's Accelerate framework, for all those good reasons: isolation from processor differences, high tuning by the folks at Apple, and good, strong, industry-standard tested routines. And you want to choose an appropriate MPI, perhaps one that the vendor of your interconnect recommends. There are many out there today.

And on occasion I've seen places where code falls back to a slow reference implementation of, for example, a linear algebra loop, based on some #ifdef in the code: if it doesn't discover you're on Irix, Solaris, or Linux, you end up in the slow path. It's worth scanning through the code, perhaps after a hint from Shark that, "Hey, this loop is just really sucking up a lot of time, and you might do well to improve it." A good place to look first is whether the code already knows how to improve it, and you've just been boxed out by an #ifdef setting.
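A hypothetical sketch of that trap (the platform macros are real; daxpy_tuned is an assumed hand-tuned routine, not a real library call):

```c
extern void daxpy_tuned(int n, double a, const double *x, double *y);

void daxpy(int n, double a, const double *x, double *y) {
#if defined(__sgi) || defined(__sun) || defined(__linux__)
    daxpy_tuned(n, a, x, y);         /* fast path: never reached on Darwin */
#else
    for (int i = 0; i < n; i++)      /* slow generic reference loop */
        y[i] += a * x[i];
#endif
}
/* The fix is often one line: add || defined(__APPLE__) to the test,
 * or better still, call Accelerate's cblas_daxpy here. */
```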

So, I mentioned bring-up plans: how to bring up a very large cluster. This is experience mostly gleaned on the 1,100-node cluster at Virginia Tech. Start small and work the size up.

[Transcript missing]

Verify, first of all, that the communication fabric is solid. There are usually vendor performance tools, communication tools, and diagnostic tools. It's well worth spending some time making sure you're not dropping bits on the floor, and that you're not overrunning buffers and causing large numbers of retransmits.

[Transcript missing]

It seems that we live in a binary world, and the problems in communication and messaging arise binarily, I guess. So expect issues to arise as power-of-two boundaries are crossed during your bring-up. Expect issues to arise as physical boundaries are crossed: as communication occurs off a rack, across a router, across a big switch from one aisle to another. Anybody building machines in multiple buildings? I don't know.

But those are places to expect to spend a little more time and do a little more sniffing around to make sure the bring-up is going as expected. You need to investigate all scaling anomalies early on. These things tend to multiply; as your performance drops, your sadness increases. It's really important to stay on the line you're expecting for the scale-up.

And finally, do some A/B comparisons on symmetric subsets of the cluster.

[Transcript missing]

I was asked, I guess Skip hinted at, a special topic here: HPC-friendly memory allocation. This derives from work done at Virginia Tech. It was undertaken by our very own Quinn in DTS, and we're grateful to have this contribution, and happy to be able to pass it along to you.

So the story is something like this: Mac OS X, as we ship it on new machines and on the DVDs that you buy, has a memory allocation scheme in user land that's tuned for the desktop. This is libSystem's malloc, and I believe it's the same stuff that underlies Fortran's dynamic allocation.

It favors small, short-lived allocations. I think our studies show the most often allocated block is about 40 bytes long; we make it very fast to get those and to recycle them. Bigger allocations are aggressively released: when the application is done with them and calls free, they're released back to the OS to reduce pressure on the real pages in the system.

And this scheme really doesn't pay much attention to any particular access pattern. That's different from HPC. HPC prefers large, long-lived allocations (your big data arrays) and sequential access, and it's essential that the memory cache hierarchy be used effectively. So behind all this is the notion that we want to ensure that a contiguous virtual address range, an array, is backed by contiguous physical storage. This is kind of the best situation for the machine.

We get the full benefit of the L2 cache. It turns out to be the most salient feature for Mr. Goto's BLAS to run effectively on the platform, and the communication layers often like this too, because the pages can be wired in sequence and sent out in one big hurry.

So here's a sample code called HPCMemory. If you need to get hold of it, contact Skip. It really comes in two parts. There's a kernel extension that loads when the machine boots; it grabs a contiguous extent of physical memory whose size you can specify using a plist. And there's a user-land library that offers five simple calls to connect to the kext, get an allocation, free it, and check that it's really contiguous.

So HPCMemOpen gets us a file-descriptor-like object used for accounting; we'll close it at the end, and we'll allocate and free against it in between. Pages obtained this way are contiguous, and you can double-check that using the contiguity-check call. They should enjoy a little better performance: we're seeing single-digit percentage gains in big numerical codes. Okay, so that was the piece I wanted to talk about, and now my esteemed colleague, Yusuf Abdulghani, will come up and tell us about Shark.
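A minimal sketch of how that five-call interface might be used; the prototypes below are assumptions reconstructed from the talk, not the actual HPCMemory sample (contact Skip for the real code):

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed user-land interface to the kext; names and signatures are
 * guesses based on the description above, not the real sample API. */
int   HPCMemOpen(void);                      /* connect; fd-like handle */
void *HPCMemAlloc(int handle, size_t size);  /* contiguous pages */
int   HPCMemCheckContig(int handle, void *p, size_t size);
void  HPCMemFree(int handle, void *p);
void  HPCMemClose(int handle);

int main(void) {
    int fd = HPCMemOpen();
    size_t bytes = (size_t)64 << 20;         /* one big 64 MB data array */
    double *array = HPCMemAlloc(fd, bytes);
    if (array && HPCMemCheckContig(fd, array, bytes))
        printf("backed by contiguous physical pages\n");
    HPCMemFree(fd, array);
    HPCMemClose(fd);                         /* close the handle at the end */
    return 0;
}
```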

Hello. Steve talked about Shark quite a bit in his presentation, and I'm going to tell you more about it. What I would like to do is start off by introducing Shark to those who do not know about it. And for those who already know and love Shark, we'll talk about some features you can find in version 4.2 of Shark.

Then we'll show you how to get started quickly with Shark; it's a very easy-to-use tool. And after that, I will show you how you can use Shark in your HPC or cluster environment, where you want to profile a computer or a node to which you do not have direct access: there is no monitor, it's a headless node sitting somewhere in a cluster. So, what is Shark? Well, Shark is a simple and fast profiling tool available on Mac OS X. It runs on Tiger as well as on Panther.

It works with the languages and compilers of your choice. So if you have an application written in C, C++, Objective-C, Java, or even Fortran, you can profile that application with Shark, and it will give you the source view for that application. Any application compiled with GCC, CodeWarrior, XLC, XLF, or even Absoft Fortran can be profiled using Shark. It recognizes both the Mach-O and CFM binary formats. There is a GUI Shark as well as a command-line Shark for scripting purposes.

As I said, it is available with the CHUD 4.2 preview. You can download it from developer.apple.com, and it is available for free. So how do you profile your application in Shark? There are several workflows, from time profiling to malloc tracing, Java profiling, and the counter spreadsheet. Some of these are new features in Shark 4.2, like custom configurations, the counter spreadsheet, and network profiling.

So let's talk about time profiling. Time profiling is the most common workflow you will use to profile and optimize your application. It shows you exactly where you're spending most of the time. It focuses your attention directly on the function where most of the time is spent, so you can look at that function, start optimizing, and get the best benefit out of it.

So how does time profiling work? When you start time profiling your application, Shark stops the system at regular intervals and records the backtrace, the process, and the thread currently running into the profile. It has very low overhead, and it captures everything: drivers, kernel, and applications.

Another new feature in Shark is malloc tracing. This tool monitors allocations and deallocations of memory in your application. It helps you understand your memory usage pattern, and it helps you visualize, in complex applications, how you're allocating and deallocating memory. And if you have a Java application, that can also be profiled using Shark. There are three configurations you can use, including time profiling, malloc tracing, and function tracing, and you can view your Java applications in Shark's code browser window.

[Transcript missing]

You can design your own metrics from these counters, for example, and write your own equations to come up with metrics and see how they behave in your application. This was one of the most requested features from developers: they wanted a very easy way to create configurations, and an easy way to find the performance monitor counters on all the platforms, on all the processors that we have. So here, if you just type the word "cache" into this search window, it lists all the events related to cache and shows them in the table view.

You don't have to specify which processor you're running on. It automatically detects the processor, finds the events relevant to it, and shows them in the table view for you. If you want more advanced control over the profiling, you can adjust each and every aspect of it by clicking on the advanced drop-down menu at the bottom.

So, let's see how you can get started quickly with Shark. But before that, let's revisit some basics of performance analysis. You're using Shark to analyze your code, so the first step is to establish a baseline. It is very important that you come up with an appropriate workload, one that is representative and reflective of what you're measuring. Then also come up with a meaningful metric.

Once you have established a baseline, you can go to the second step, which is to profile optimized, deployment code with debug symbols enabled. This is very important: if debug symbols are not enabled, we will not be able to relate the profile to the source code in the Shark code browser.

Secondly, profile your optimized code, because if you're not profiling the optimized code, you're profiling your debug code; the two can have very different profiles, and that can be very misleading. So make sure you profile the optimized, deployment code. And the third step, of course, is to Shark your application. When you start Shark, it comes up with a very simple window like this one. There's a single button used to start and stop your profiling, and you're on your way.

By default, Shark has several configurations; time profiling is selected by default. The other configurations include system trace, malloc trace, and so on. Select whichever configuration you want to use, and click the start/stop button. We recommend that, from the target drop-down menu, you select the system profile.

We want you to use the system profile most of the time, so it gives you an idea of how your application behaves with respect to the entire system. However, in some cases you might not want to. For example, if you are tracing your memory allocations, you might want to target just one particular process. The other example is when you want to do a static analysis of a file or a process; that's when you would select a file or an application and do static analysis on it.

Shark shows the data collected in this session window. By default it shows the heavy view. The heavy view takes your profile and points you to where you're spending most of the time; in this case, about 80% of the time is spent in this particular function called Cycle True Brain Scaler. So it focuses and draws your attention to the function where you're spending most of the time. The other view Shark gives you on your code is the tree view.

This gives you an idea of your hot paths: how did I get to this particular hot function? And if you want to look at both views, which function is hot and how I got there, there's a combined heavy and tree view you can select at the bottom.

Once you double-click on a function that is taking most of the time, it opens the code browser. The code browser highlights lines of code in yellow: the brighter the yellow, the hotter the code. There's a special color-coded gutter on the right side, with horizontal yellow bars that point you to the hot code so you can navigate to it easily.

Once you have identified the hot code, you might want to edit it. When you click on the edit button, it opens your source code in Xcode and takes you directly to the line you were looking at. In this way Shark is very well integrated with the Xcode IDE, and it shortens the profile, change, re-profile turnaround cycle a lot.

So that is Shark in a nutshell: how to use it, and how easy it is to use. But what do you do in an environment where systems are connected over the network? Let's look at that. Shark is your camera on the cluster. You can share a computer for network profiling using either the GUI Shark or the command-line Shark.

You can discover and control the shared computers using Bonjour, that is, automatic detection, or you can add specific IP addresses. You can simultaneously profile multiple machines, and you can retrieve profiling sessions either automatically or on demand. So how do we do that? Well, the first thing you would do is set up a computer for network profiling. There are two ways to do it.

One way is through the GUI, and the second is through the command-line Shark. In the network profiling tab, you click the radio button that says "Share this computer for network profiling," and that's it. That computer is now ready to be profiled from another computer.

On the command line, you would just say "shark -n", and that computer becomes ready for network profiling. If you look at the profile set, we are doing a time profile here, but you can select whatever profile you want and set that up. Once the computer is set up for network profiling, the second step is to actually connect to it and do the profiling, either from your laptop or from your desktop machine.

The way you do that is to click on the "Control network profiling of shared computers" radio button. As soon as you click on that, all the machines automatically discovered by Bonjour show up in the table view. You can also add machines by computer name or IP address, and they show up in the table view too. You can select one or multiple machines using the checkboxes, and then from the target drop-down menu you can select whether to target the entire system, a particular process, and so on. Once this is all set, you just click the Start button.

The Start button kicks off the collection of profiles across these multiple machines on the network. All these profiles then come back to your laptop, one at a time, and you can see them. Each profiling window is uniquely identified by the IP address it came from and the type of profile collected. You can save these profiles and analyze them however you want. So this is how you use Shark in a network or cluster environment to gather information about what's going on on a particular system.

So in summary, Shark is easy enough for beginners and powerful enough for experts. It is great for high-level and low-level performance analysis, and it is compatible with all major Mac OS X compilers. It is available for free from developer.apple.com, and it is available today. So go ahead, try it out, and see how it works for you. One thing I want to mention is that Shark is a universal binary, so if you have a developer system, you can give it a shot on that system as well.

Sessions saved on the developer system can easily be taken to your Power Mac or PowerBook, and you can view the sessions and do the analysis there as well. So that is Shark, and now I would like to invite Rodney Mark from Absoft to talk about the FXP technology. Thank you.

Thanks. So, our FXP technology is built on our Fx2 serial debugger. It debugs Fortran, C, and C++. We had a lot of customers requesting a low-end, beginner's MPI debugger that would let them debug their MPI code without resorting to printf-style debugging. So we came up with our basic, entry-level MPI debugger. It supports debugging 64- and 32-bit codes, and serial codes as well. It has the same Aqua-style interface as our Fx2 debugger, with support for Fortran, C, C++, and assembler.

It supports all the major compiler vendors and MPI implementations in one package. With an easy-to-use graphical interface, you can basically get started without even reading the documentation; it's very intuitive. You don't have to spend a lot of time learning an MPI debugging paradigm: if you're already familiar with serial debugging, it translates quite well. As I mentioned, it builds on our Fx2 serial debugger.

One of the features we have is automatic attachment to MPI processes. So if you're using LAM/MPI or MPICH, it will automatically attach to all the processes you've started; you don't have to manually track the PIDs or figure out which hosts you're on or anything like that. It lets you view local and global variables, stack traces, registers, and the message queue across all your MPI processes.

It has some visual elements that let you see the state of processes running on your cluster: green means everything's good, red means stopped. It's a very intuitive state mechanism that lets you see in an instant what's going on with your code. It is a basic MPI debugger; as I said, if you're doing hybrid codes, using OpenMP, or doing some other things, then TotalView is also available.

So this is a screenshot of FXP. It's kind of hard to read, but it has two windows. This is a Fortran code, Hello World. You can see in the left-hand corner there that the variables rank and size are shown across all the nodes in the cluster. You can use named groups to select which variables from which nodes, or which ranks, you would like to see. You can also control which nodes you want to stop and which ones you'd like to step through.

This is showing the named groups. You can give them any name you want: maybe you want ranks one, two, and five called "the batch mechanism," or whatever intuitive name you like. It has a history mechanism so you can go back and easily select different groups you've looked at.

Right now we're in beta, so if you'd like to beta test this for us, just contact me or Woodlots; this is our contact information. We'll scale from a small number of nodes; we're targeting 32 nodes. So if you have a really extremely large cluster, a thousand-plus nodes, then we do recommend TotalView. So, thank you for your time, and I think we'll introduce Etnus now.

Hello. I'm Chris Gottbrath from Etnus. There's a little bit of a deviation from the normal procedure here. I think it's a very cool thing Apple set up, that you don't have to take bags of stuff home with you. But my salespeople sent me here with like 50 of these.

So they're lined up back right near the entrance there. If you're interested in learning a little about TotalView, or maybe you aren't going to use the debugger but you know someone who would, this would be a great thing to take home and just drop on their desk; you can forget about it after that. It's a nice little packet of information: a quick start guide, which is a little mini manual (our real manuals are pretty thick), and some other information detailing TotalView in different ways.

I apologize for the plug for the literature, but it's back there and I don't want to carry it home. Okay. So, as it says there, I'm an engineer at Etnus. I'm one of the developers of TotalView, and we're very proud to be bringing over TotalView, which has a very long heritage as a parallel debugger, and a debugger for complex code in general, on Unix platforms of other varieties.

We were very excited when Mac OS moved over into the Unix world, when they saw the Unix light, I guess, and we were able to take the opportunity to move TotalView as well and bring all of you into, I guess, our world of debugging. So, I'll apologize a little bit in advance: this is an X11 application.

We've taken advantage of the fact that Apple has, I think, very wisely allowed people to bring applications over using the X11 graphics engine. Obviously the Quartz engine is great, but this is a way that allowed us to bring our application very easily over to OS X.

So it'll look a little bit Motif-y. I've had some people raise their hand and say, "Is it always going to be so ugly?" And I was like, "Well, you know, it allowed us to bring the debugger over," which I think is a nice thing. Okay, so what is TotalView? Some people may be familiar with it, but for those who aren't, TotalView is a source code debugger and an application development tool. It is not a kernel debugger.

It's a source code debugger for both serial and parallel applications. I'm mostly focusing here on parallel applications, but some of the complexities TotalView can help you with, which are detailed in the little flyer, are equally relevant to serial debugging. For example, threads: some of the same concurrency issues that come up in parallel debugging, having lots of things running at once. TotalView is very comfortable dealing with that, and it'll be easy to see how you can transfer what I'm talking about into the threads world.

We handle C, C++, Fortran, and Fortran 90: basically all the compiled languages. Actually, one of the things I'm interested in for this audience: I've heard a lot about Objective-C here, and coming from the Unix world rather than the Apple world, that's kind of new for me.

How many people... is it a ding against TotalView, or a reason you wouldn't be able to use it, if it doesn't work with Objective-C? For how many people is that an absolute requirement, that you need to be able to do GUI debugging for this to be interesting? Okay, so a couple of hands, but not an overwhelming majority.

Okay, that's really good information for me to take back. We're looking at Objective-C; my opinion, having just barely looked at it, is that it shouldn't be a problem to support, but we don't currently. Wide compiler and platform support: we've historically been on Linux and the Unixes of the world, the SGIs, the Irixes, the Crays.

So one of the nice things for you, if you are bringing applications over: one of the things mentioned earlier is that if you're doing HPC with Apple and you already have the application, it was likely already working on one of those other platforms. If you're doing the porting process, hopefully it will all just work.

Maybe there are just performance issues; but if it doesn't work, if it crashes due to some idiosyncrasy, TotalView may be useful. You can bring up TotalView on the Apple and also on the previous platform and do a sort of comparison debugging, and you'll have almost exactly the same feature set in both cases. We think that's a really important advantage.

And of course, multi-threaded debugging: one of the features that makes TotalView different is that we handle multi-threaded debugging, even within the context of MPI. Especially in the future, given that Apple is moving toward Intel, and if you look way down the Intel road map you see lots of discussion of dual cores and hyper-threading and things like that, you may want to start taking advantage of threads even within an MPI application. TotalView will be there for you and can handle all that comfortably.
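For reference, a minimal sketch (not from the session) of what hybrid MPI-plus-threads startup looks like, using the standard MPI-2 call MPI_Init_thread:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* FUNNELED: only the main thread makes MPI calls; worker threads
     * (pthreads, OpenMP, ...) do the computation. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        printf("MPI library lacks the requested thread support\n");
    /* ... spawn compute threads here ... */
    MPI_Finalize();
    return 0;
}
```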

Distributed debugging, obviously, is the main thing I'm going to talk about today, along with cluster architecture. And I mentioned we have an X11 GUI; we also have a command-line interface, which allows you to do scripting. If you have a scenario to debug, one of these hairy situations that takes 53 steps and three weeks of run time to set up (hopefully you don't have that situation, but if you do), you don't want to be clicking all the buttons; you want to be able to write a script for it. We also use it internally for testing.

So this is the road map of things I'm going to talk about today. I'll talk about TotalView as a parallel debugger, which I've already introduced a little. I'll talk about TotalView's architecture, which I think is very unique and very neat. Then I'll do a little live demo, and in the demo I'll try to cover automatic process acquisition, parallel debugging features, and message queue debugging.

Finally, I'll talk a little about scalability, and then about our road map for bringing TotalView to the OS X platform. You can see a screenshot of TotalView over on the right, and I'll be doing a live demo in a moment, so I won't dwell on that too much.

Okay, so the architecture of TotalView; we think it's really neat. In a cluster situation, you're going to have some number of compute nodes, and your job is going to be running one or more instances of your MPI task out on the compute nodes; those are represented by the little red boxes there. There may or may not be a job running on the local system, the interface node.

The user starts up TotalView on the front-end node, and that's where we actually do a lot of the code analysis of the application. Then we start up what you can think of as debug agents, little lightweight units of TotalView, out in the cluster, and we communicate with them separately from the MPI communication. Those little debug agents live out in the cluster, and they are what actually handle debugging those tasks. All the information is channeled, not through the MPI (so we're not messing with any of the MPI communication), but through a separate channel back up to the front-end TotalView.

So there are two elements: the main debugger and the debugger servers, and they allow you to debug even multiple instances of a process on each one of the nodes; TotalView starts and handles those separately. What are some of the advantages of that architecture? It's very lightweight, and it's a widely applicable model: almost every cluster will have something where this idea will work.

Because we stay out of the way of the MPI, we're not going to be messing with that. And the debugger processes run as the user, so there's no privilege problem; you don't need to worry about running these things as root or anything like that. So this is a robust, scalable mechanism. Okay, enough talking about it in the abstract; let's show you what it looks like. If we can switch the screen input to the demo.

This is TotalView running on the G5 here; it's just reading the data off of my PowerBook. We're going to start with the simplest possible situation. I'm not going to try to give a debugging tutorial or anything like that; I'm just going to give you a chance to see the debugger and how the user interacts with it. If you've been using MPI before, this will look familiar. I apologize that this font here is small, but the TotalView font will be a little bigger.

I'm actually going to just use the cpi application. It comes included in MPICH as an example; I think it's probably even in LAM. It's a very simple, very standard application. Just to show that it can run, I'm doing an mpirun: the way you would run this on the command line, at least, is mpirun, the number-of-processors argument, and then the application. Just to show that it runs, that's cpi.
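For readers following along, here is a minimal cpi-style program in the spirit of the MPICH example being demoed (a sketch, not the exact sample): each rank integrates a slice of 4/(1+x^2) to approximate pi, and the ranks combine results with the MPI_Reduce that figures in the deadlock demo later.

```c
#include <mpi.h>
#include <stdio.h>

/* Run roughly as described above: mpirun -np 4 cpi
 * or under TotalView (MPICH syntax): mpirun -tv -np 4 cpi */
int main(int argc, char **argv) {
    int myid, numprocs, i, n = 100000;
    double h, sum = 0.0, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);    /* the "myid" seen in the demo */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    h = 1.0 / n;                             /* midpoint-rule integration */
    for (i = myid; i < n; i += numprocs) {   /* this rank's slice of the work */
        double x = h * (i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    /* Collective operation: every rank must participate, or the job hangs. */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("pi is approximately %.16f\n", pi);
    MPI_Finalize();
    return 0;
}
```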

To get TotalView into the mix, you use a slightly different syntax. This is an MPI-specific thing; I happen to be using MPICH. With LAM it may be a little different, and if you're using some other MPI in the future there may be a slightly different invocation syntax, but the idea is that it's a very small tweak on what you would do normally to start the application. In this case, I just added a "-tv" flag. That has to do with the architecture of MPICH.

The MPI has to actually start us; we throw in the "-tv" flag, and then you get the TotalView windows. TotalView starts two windows. This has to do with the fact that TotalView is really designed to handle multiple processes. The left window here, as we get into the demo, you'll see, serves as a navigation point: there's going to be one item in that window per process, and you'll be able to see feedback there.

The right window, on the other hand, is your focus on one process. In the usage model I'll show here, we'll be able to refocus this right window very easily onto different processes. It gives you the idea that you're debugging a process, and that process may have other friends, which may or may not be doing some of the same things.

You can see what's going on with the root window, but you're controlling everything through a very familiar one-process debugging window. This has all the features you'd expect to see: a go button, a halt button. Once I get into things, there will be a stack trace and source code down here.

Let me switch this to processes. In order to get things started: what are some of the challenges a debugger has to handle to really make debugging MPI comfortable? One is that an MPI program is not just a single process; it's a bunch of processes running on a distributed set of processors, a distributed set of computers. One thing that could be potentially very painful would be finding out where all those processes are, attaching debuggers to them, and dealing with the feedback.

What TotalView does is make all of that very transparent. You just start the MPI process. We'll get some notification here that TotalView recognized the thing that was starting as a parallel job. Then we have an interaction with the MPI where we get some information about where that job is running.

The question here is really asking: do you want to stop the job now? The reason it's asking is that you may want to set breakpoints. Remember that the application I'm running is a little demo app, and it will exit almost immediately; if I said no, it would just run to completion before I had a chance to click anything else. So I want to say yes.

That startup question is configurable. You may actually want a little more control; you may want to debug a subset of a larger application. That's the basic behavior, but I can go into the preferences and change things so that when it launches, I get a list of all the processes and can choose which ones to attach to. The nice thing is that's not a final decision.

I can start off debugging just a couple of processes, run the application for a little while, look at the communication and see what the problems may be, and then attach to other processes as I go. That's one behavior, but there are lots of wrinkles and other possibilities. Now I've run the application a little bit. What's happened behind the scenes is that the application launched three other processes. In this case they're local, but they could be anywhere.

This would have been equally easy on a cluster. In this case it says host local, but on a cluster this would just have the list of hosts. You'd have all the processes that are part of this job, and you'd be attached to them wherever they are in the cluster. Each has its rank, which comes from the MPI rank associated with the process, and then the debugger assigns a separate ID to each one.

What I've done here is branch this out. TotalView, as I mentioned earlier, handles threads as well, and in particular it handles threads along with MPI. You could easily imagine writing an application, or even your MPI implementation might use threads to handle I/O, where your application is made up of a number of processes, each with multiple threads. TotalView handles that by having one process item here, with all the threads that are part of it under this tree structure. In this case, I'm just showing an application with four processes, each made up of one thread.

Okay, in this case, let me read the statuses for you. This says the first process is at a breakpoint. These other processes have a status of "T", which comes from debugger terminology; it stands for "traced." They're stopped, but they're not stopped at a particular breakpoint. You'll also see a green "R" here when they're running, and you'll see them switch to breakpoints.

Once we get to user-set breakpoints, they'll have breakpoint numbers and so on. So this, again, is the navigation point where you'll see what's going on with the processes in the cluster. Over here we have the process window, focused on one of the processes.

In this case, the first process gives us a stack trace. All these panes are configurable. You can see the stack trace; we can click back to main; you can see all the local variables. In this case, we're currently stopped at the end of MPI_Init. Remember, the program had to run a little bit to create the parallel job; that's what it's completed doing, or is in the middle of doing.

This interaction should be very familiar to you if you've used other debuggers. What we're trying to do is give a nice, very familiar interaction that handles all the complexities of parallelism, but the basic paradigms should be very familiar.

So I've set two breakpoints. Now, in a serial debugger, setting those two breakpoints has a very unambiguous meaning: the program runs, and when it gets to a breakpoint, it stops. In a parallel application, what is it that you want to have happen? Because remember, the parallel application is four processes, each running separately.

Any one of those could hit that breakpoint. And what do you want to have happen in that case? (Mic cut out... oh, it's back.) When any one of those processes hits that breakpoint, do you want to stop that individual process on its own and leave the rest of the application running, or do you want to stop the entire application? Those are two very rational choices, and in a debugging session you may want both at different points in time.

By default, and this is actually a recent change, TotalView stops just the individual process. That's almost always what you want. The meaning there is that everything will line up: if you set the breakpoint on a line that all the processes go through (obviously, if you set it off in an if statement, it won't work), they will all line up at that breakpoint. So it allows you to synchronize. Often what we've found people want to do is move their application through synchronously, so they have a clear idea of what the application is doing.

They imagine that that's what happens when they run it, and they may or may not be right, but when they're debugging, it at least gives them a clear state. So you'll maybe want to bring the whole application to a particular point, and then you might want to take one process and run it ahead, or something.

So setting the breakpoint, by default, will stop just that one process. But if you want something different, you have that capability. What I did was bring up the properties dialog for that breakpoint, and the first choice it gives me is what I want to have stop when the breakpoint gets hit: I can choose the group, the process, or the thread.

In this case, process is almost always what you want, so let me now let the application run. And just as setting a breakpoint has a degree of freedom that's new in parallel debugging, so do the basic commands: go, halt, and step all have a new degree of freedom.

The question is, when I click go here, what do I want to have happen? I'm focused right now on rank zero, this first process here. Do I want just that one process to run and leave the other ones stopped, or do I want to launch them all? The way you control that is with a scope selector here.

It has a couple of different options: the process you're looking at; the thread we're focused on within the process (in this case the thread and the process obviously have the same meaning, but in a multi-threaded application that wouldn't be true); and then several different groups. For this application they all have about the same meaning. Group control means all the processes.

So if I hit go, they should all start running. What you should look for is the statuses over here all changing to green R's as they run, and then, each in its own time, they should get to the breakpoint and stop. You'll also see the feedback down here in the process list. So, hit go: they all ran. It was too fast to see the little green R's, but they were there for a moment.

And they all ran to the breakpoint. The nice thing is the feedback: you can see these are all sitting at the same breakpoint. You maybe don't know which breakpoint two is, but you know they're all at the same one, and you can see which one it is here with the little arrow.

And if I bring this up, there are a couple of different things this bottom pane can display, and one of them is action points. It gives you the number and the line number, and you can zoom to these if they happen to be over in other code segments, and control them from here.

Okay, so I've run all my processes. What I've done here is synchronize my parallel application. It was no harder than running a normal serial application to a particular breakpoint, but there was a lot more going on under the covers. We see a stack trace that makes sense; we're in main.

We've got variables that make sense. If you're an MPI programmer, one of the things you expect to see here is myid. If I were to dive on that (that's the term we use in our documentation for "get more information about"), we'd see that the ID for the first process is indeed zero.

One of the things you can do that's kind of neat in TotalView, when you have a variable that exists across processes: a lot of MPI applications are what's called SIMD, single instruction multiple data, so a lot of times you have code like this where everybody has an ID.

And you might sometimes be interested in what all the different IDs are. One thing you can do is basically make an array out of these: this constructs an array whose values are the IDs in each of the different processes.

Another way you could do that is to go here and navigate to the next process, and look at its ID separately. We could also go here; this is one of those things where there are several ways to do it in the interface, but they all have the same meaning. We've switched which process we're focused on; now we're focused on rank three, this one down here. And again, if we look at the ID here, it'll have a value of three. I don't even need to dive; I can just read it right there.

So the point I'm trying to get across is that the interface is really not that hard to use, but it's doing a lot behind the scenes, doing the important stuff for you. In this case it's just four processes; you probably could launch four different debuggers here.

But imagine if you had 16 or 32. You want an interface like this that brings everything down into one window. Actually, it doesn't have to bring it all into one window; you might want to do comparison debugging. If you do, let me get the right window here, you can actually bring up two of these.

We don't think you'd probably ever want more than two, but you might want to compare what's going on in two different processes, and TotalView is perfectly happy to let you do that. So what I did there was choose a different rank and ask TotalView to bring up a new window focused on it.

Again, TotalView's underlying engine is attached to all of them all the time, but you may or may not want more than one window. Okay. That's the basic process-control functionality. I talked about sometimes wanting to control an individual process; let me just demonstrate that. I think everyone understands what's going to happen here: if I run this one process, it will run and the others will not.

So that one, I saw it flicker to green for a moment, and it's now sitting at breakpoint number three; that's this one here. Again, the rank is reported there: rank three. So rank three was the one we were focused on, and it ran ahead. It's now at breakpoint number three, a little further down.

Now, another thing you might want to do, just going through things you might reasonably do in a debugging session: I'm now holding this process, so it says held. I went to the process menu here and said hold. And now this process (it's also reflected over here) will not go if I tell the other ones to go. So I can issue a command to the entire group and say go.

This one will not move; the others will all catch up to where it is. So now they're all caught up, and I've resynchronized my group of processes. The debugger is still used in the same way you'd use a debugger normally; it's just that there are lots more degrees of freedom in what you might want to do with a parallel application, and you want a debugger that makes that fairly easy. Okay.

What I've talked about up to now is mostly process acquisition (basically, it's no big deal with TotalView) and the basic debugging functionality everyone expects to see: can I set breakpoints, can I control my processes, can I look at variables. I've tried to cover those bases.

If anybody has questions about details, I probably shouldn't handle them now; come up afterwards and I can walk you through the demo, and you can tell me to push buttons and see if it crashes. What I want to talk about now is something really specific to MPI, a failure mode that doesn't come up in a serial application.

And that is a communication pattern mismatch. Without doing a little mini lecture on MPI: in MPI, one of the things you can do is pass a message from a process, a rank as it's called in MPI parlance, to another rank with a particular ID.

There are some built-in constructs, like send-to-everybody, and you use those a lot of the time. But sometimes, especially as you're tuning, you get to the point where you want really fine-grained control over where you're passing the messages. And anytime the user has that kind of power, they have the ability to screw it up.

Right? So you can get into this failure mode where your program is running, all the processes have done the work they were given, but they've passed messages to the wrong place, or a message didn't arrive, or something like that. And the thing is, MPI is designed to be very forgiving of that.

Because there are all these big latencies and things like that, there won't be any segmentation fault that occurs because of it. The wrong receiving process will accept the message, stick it in its pocket, and just never look at it. And because there's no segfault, no obvious failure, the program just sort of grinds to a halt.

Everybody is waiting for something from somebody. That case is very, very difficult to debug without a little help from the debugger. So I'm going to show how that works, and I'm actually going to simulate it in this application by just holding a process: the one I held earlier, I'm just going to leave held.

So it's never going to run beyond this point; I'm just sticking my thumb on this process and not allowing it to go. And I'm going to tell the rest of the group to run. Actually, let me disable this breakpoint here. They're going to run down here into this MPI_Reduce, which is a collective operation; they're all supposed to exchange information back and forth. And I'm going to use that to show you a deadlock, just like what I described.

In this case, I've simulated it by sticking my thumb on one of the processes so it won't run, and telling all the others to do a collective where everybody has to participate. Since one of the guys isn't playing, isn't participating, it actually deadlocks.

So these are all running. They would continue to run all night if I left it. But what I'd like to do is actually

[Transcript missing]

0, meanwhile, was expecting a message from 2 with a tag of 11, and 2 was expecting a message from 3 with a tag of 11.

So whose fault this is, is actually really easy to work out. You just walk back the arrows, in the case of a receive queue. Even if I didn't know what I had just done to cause this, I would be able to read this and say: I'm pretty sure what I need to look at is why 3 hasn't gotten around to delivering to 2, because then probably 2 would deliver to 0, and 0 would deliver to 1, and I'd be done.

So this gives you a graphical way to get at this new sort of error state that's possible in an MPI application. We think that's really nifty, handy, and cool. And obviously this scales up quite a bit: I'm only doing four here, but you can imagine 32 or 64. You can actually move these guys around and still unwind it, and the nice thing is you can do it graphically, right there in front of you.

Unwind these things, and you'll get either something that looks like a tree, where somebody's responsible and everybody else is waiting, or something that looks like a circle, which tells you you've got a deadlock where everybody's waiting on somebody but no one is going to start. So this is, I think, a really neat and very powerful graphical way of debugging this new class of error that can occur in parallel programming.

And of course, in this case I would resolve it by just unholding that process and allowing everybody to run, which I can do. But in general you'd probably have to go in and edit code.
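To make that failure mode concrete outside the demo, here is a minimal sketch (illustrative only; the ring pattern and tag values are made up) of a communication pattern mismatch: every receive expects tag 11, but one rank sends with tag 12, so rank 0 blocks forever in MPI_Recv and the job quietly never completes, with no segfault. This is exactly the kind of state the message queue graph unwinds.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* run with -np 4 to see the hang */

    int next = (rank + 1) % size, prev = (rank + size - 1) % size;
    int tag  = (rank == 3) ? 12 : 11;      /* the bug: rank 3 mis-tags its send */

    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, next, 11, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, prev, 11, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);       /* tag 12 never matches: waits forever */
    } else {
        MPI_Recv(&token, 1, MPI_INT, prev, 11, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```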

[Transcript missing]

Can I switch back to the slides? Okay, thanks. Okay, so these couple of slides were actually my insurance policy in case the demo didn't come off for some reason, if it couldn't sync with the screen or something. So these slides actually cover the stuff I've just talked about.

They talk about process acquisition, some of the capabilities, the parallel debugging features. The point I was trying to emphasize is that it's really a nice, clean GUI which gives you the power to deal with this extra complexity. And then the message queue graph. This is just a different message queue graph; I thought I had updated that slide. That was supposed to be the Apple one there, and instead you saw some sort of Linux desktop. I apologize. Same functionality, but you already saw it.

Okay, scalability. This is something I can't really demonstrate with a little demo of four processes running on the same machine. TotalView has been around in the high-performance computing market for over 15 years; I think it's actually now up to 17. We originally started on the BBN Butterfly, which was one of the first actual implementations of a distributed parallel computer (kind of a buggy one, as I understand it).

So we've been doing this for a long time. And what that means for you is that this code base has been stress-tested by lots of other people ahead of you. We've had people kicking the tires for a long time. And there are little internal things, like how you handle displaying an array. I didn't even talk about displaying arrays, but TotalView can give you a nice view of an array.

If you have a million elements in your array but you just want to look at the first couple of things, the easiest thing a debugger developer would probably do is go get all that data, put it in a buffer, and then draw the window.

Well, when that involves a thousand processes distributed over a network, and a million elements which might live on various different processes, that's going to be a slow operation. So we don't do that; somebody else has already beaten us up over the head about it, and we have very efficient mechanisms for little things like that. So you're going to find that TotalView is a very efficient system.
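
As a rough sketch of that design choice, assuming a made-up fetch_slice helper rather than anything from TotalView's actual internals: fetch only the window's worth of elements on demand, never the whole array up front.

    #include <stdio.h>
    #include <stddef.h>

    /* Stand-in for a remote read of elements [first, first+count); a
     * real debugger would issue a network request to the debuggee's
     * host here rather than touch the memory directly. */
    static void fetch_slice(const double *remote, size_t first,
                            size_t count, double *out)
    {
        for (size_t i = 0; i < count; i++)
            out[i] = remote[first + i];
    }

    int main(void)
    {
        static double big[1000000];     /* the million-element array */
        big[0] = 3.14;

        /* The naive debugger copies all 1,000,000 elements before it
         * draws the window; the lazy one fetches only what's visible. */
        double visible[10];
        fetch_slice(big, 0, 10, visible);

        printf("big[0] = %g\n", visible[0]);
        return 0;
    }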

TotalView scales out very comfortably in terms of performance and responsiveness, and in terms of memory usage. If you're buying some sort of large cluster, your application is likely to be very huge, and TotalView is going to very efficiently build up the data structures it needs to hold the information about your processes. Status, data representation, all of these things have already been stress-tested by other people.

Now, that's not to say there aren't challenges we're working on in that area; we're continuing to innovate and change and grow the product. But you shouldn't have something that doesn't scale past 32, doesn't scale past 64. In terms of practical scalability, you should have no problem with tens to 100 processes, pretty much trivially; there's not going to be any kind of major lag.

With thousands of processes, you do have to wait a little while while we gather the information over the network. We do have customers who are using TotalView on 2,000 or 3,000 processes, and they give us a hard time about how long it takes to do, you know, a synchronized step or something like that. But there are other people doing that, and we're responding to them.

So you shouldn't have a problem at 50 or 100, and it's an ongoing concern of ours. Just to mention other architectures: we're actually on the Blue Gene machine, which has targets of tens of thousands of processes. I don't want to dwell on that here because it's not really relevant, but that is an architecture we have to support. Okay, so Etnus TotalView releases.

We're actually a fairly small company, and we have a fairly ambitious roadmap for each given year: two major releases and also four minor releases. And why is that? I actually had some of the Apple guys giving me a hard time about it at lunch the other day. The real reason is that we have a bunch of different platforms and we handle a huge number of compilers; people are releasing new compilers and new operating system versions all the time, and we really need to spin a new version for those things often.

In terms of major features coming in, we really have the two main releases. So TotalView 7 is the one we're about ready to release; we're in a beta period right now. And then TotalView 7-point-something, I'm not sure what the name will be, will be out around Supercomputing time, at the end of the third quarter or beginning of the fourth.

Okay, that's the overall roadmap, where we're at in versions and what we're doing. Mac OS X is what you guys care about. We're currently in the beta, and we're really excited to have some people testing TotalView on the Apple platform. Actually, I think if anyone out of this group is really excited, I can probably still squeeze you in.

We only have a couple of weeks left, but certainly feel free to come up and talk to me afterwards if you'd really like to kick the tires of this even before we release. But the release will be within a couple of weeks, probably before the end of this month.

Support will be for both Tiger and Panther. We don't support Xcode directly, as you can tell from the GUI; we're not an Xcode plug-in or anything like that. But I said here we support Xcode because we support the compilers that you can use with Xcode. So you can still build with Xcode, but then you take the application and run TotalView on the application. We support the Apple GCC builds, the Absoft compilers, and the IBM XL compilers, for both Fortran and C.

And we support Xgrid, again in the same sort of way, because we support the underlying MPIs that Xgrid is built on, which, as I understand it, are MPICH and LAM. We've been working with both of those MPI implementations for a long time. Heap memory debugging I actually didn't talk about, but in your flyers, in the little folders, you'll see it's a really neat feature.

We're really excited about it on our other platforms, and it's something we'd love to bring to Apple. If it's something you need, if you need a heap debugger, a memory debugger for the Mac, that sort of feedback is great, because I can take it back to our product management.

But it's something we're currently analyzing, to see whether the architecture we have on the other Unix platforms can be ported over to Darwin. I think it can; I'm pretty confident. And hearing from you that it's something you need will help me accelerate that timescale.

And obviously, Intel Darwin was as much of a surprise to me as it was to everyone else, so I don't have a really strong story there. I know we're talking to Apple, we're talking to Skip, and we're going to watch and see how well this release does to decide whether we'll be able to support Intel Darwin.

I'm pretty confident it'll happen, because I'm enthusiastic and hopeful that lots of people will want to go out and get TotalView for their PowerPC Darwin as well. So finally, do feel free to take home those folders back there. We're currently in the beta, so if you go to Etnus's website, you won't see a whole lot of discussion about Apple. Contact me if you want to get in right now.

We will have a release before the end of the month, which is 7.0. That's going to have support for Apple, and at that point you'll be able to go to the website and get a 15-day free trial of the fully featured version of TotalView. The only limitation is that you can't run it on a 64-node machine; it's limited to eight processors. But that's the only limitation.

All the other features are there. You can kick the tires, send us feedback, let us know, and then hopefully send us a check. Contact sales at etnus.com for that last operation. If you have technical support questions or feedback, or you need to tell us that X11 isn't good enough and we really need to move over to the Apple interface standards, that's great feedback; send it to support at etnus.com or to me individually. I'd love to hear it. So that's it. I want to allow my co-presenters up here on the stage so we can take questions, and I'm done. Thanks a lot for your attention. Thank you.