Developer Tools • 54:44
Fortran continues to play a prominent role in scientific and mathematical computing. See the latest Fortran tools and computing solutions in action on Mac OS X. Learn from compiler vendors and key members of the development community how to use third-party 32-bit and 64-bit Fortran compilers in your development workflow.
Speakers: Dave Kreitzer, Drew McCormack, Wood Lotz
Unlisted on Apple Developer site
Transcript
This transcript has potential transcription errors. We are working on an improved version.
So welcome to session 323, Third Party Solutions for Fortran Development and High Performance Computing on Mac OS X. Like Gaul, this session is divided into three parts. I'll start off talking about the Intel Fortran Compiler and then I'll turn the floor over to David and Drew of MacResearch, and they will talk about the services they provide to the Mac OS X scientific community. They will also go in depth into advancements in the gFortran Compiler and in particular, OpenMP. And finally, Wood from Absoft will come up and speak about their latest versions of their Fortran development tools.
So my name is Dave Kreitzer and for the past ten years, I've been a developer in the Intel Compiler lab. I'd like to spend the next few minutes first giving you an introduction to the Intel Fortran Compiler and then illustrating via a brief demo how the Compiler's advanced optimization techniques can help deliver performance to your applications.
During the ten years that I've been working on the Intel Compiler, the development team has grown significantly. Intel is investing heavily in its compiler resources and the primary goal has always been to deliver outstanding software performance on Intel architecture. With the advent of technologies such as Hyper-Threading and multi-core, providing threading features has become a major focus contributing to that goal.
Our industry-leading performance has been proven for many years on the Windows and Linux platforms across a wide spectrum of applications, including those from the scientific and engineering computing fields. And my key message to you today is that Intel's thrilled to have the opportunity to bring that proven performance to the Mac.
And as I'll show you in the demo, the Intel Compiler is designed to integrate easily into the development environment that you're used to using. Intel and Apple have worked closely together to ensure seamless integration into Xcode, and the Compiler's interoperable with the other tools that you're used to using such as GCC and GDB.
Additionally, the Compiler supports the Fortran standards. So what this all means is that you can take advantage of our outstanding performance without having to spend the time and energy porting your applications to a completely new tool suite. And if you have problems with the Compiler, have questions or just have suggestions about possible future feature enhancements, Intel has an extremely responsive support team of engineers who have the expertise to help you extract the most performance out of Intel architecture for your Fortran apps.
So at this time, I'd like to introduce our latest product, the Intel Fortran Compiler Professional Edition, Version 10.0. This is a single bundled package of three separate tools: the Intel Fortran Compiler, the Intel Math Kernel Library and the Intel Debugger. This is Intel's first 64-bit Fortran Compiler product for the Mac and we're very excited about that.
( applause )
Thank you. The Compiler will help take your Fortran source code and convert it into highly optimized machine code with improved threading behavior and performance. The Intel Math Kernel Library contains a set of math functions that are aggressively multithreaded and meticulously tuned to Intel architecture. MKL contains BLAS functions, LAPACK functions, FFTs, vector math functions and vector statistics functions. And finally, the Intel Debugger is designed to work well with the Intel Compiler and it contains features to aid in the debugging of optimized and multithreaded code.
So I'd like to emphasize again that the primary feature delivered by Intel's Compilers is performance, and there are a number of features and underlying technologies that help deliver that performance. Auto-parallelization and OpenMP support help you to take advantage of multi-core. Vectorization helps you take advantage of instruction set enhancements such as the Streaming SIMD Extensions and their descendants. Advanced code generation optimizations help you to take advantage of low-level performance details of our microprocessors.
Interprocedural optimization, or IPO, helps expose to the compiler optimization opportunities that exist across module boundaries, and profile-guided optimization, or PGO, enables the Compiler to make optimization decisions based on concrete information about the runtime behavior of your program. And since we integrate easily into Xcode, you can take advantage of all these performance features without having to go through an expensive tools migration.
So what's new specifically in our 10.0 version? Well, the big new performance feature is HPO. HPO is a high performance parallel optimizer that combines in one phase vectorization, auto-parallelization and aggressive loop transformations. HPO essentially finds the parallelism that exists in your program and then categorizes it for vectorization on the one hand, in the form of SSE instructions, or auto-parallelization on the other hand, in the form of multithreading for multi-core, or perhaps a combination of both. There are new verification and diagnostic capabilities. The Compiler can instrument your code so that at runtime, it can detect stack integrity problems and buffer overflows. Additionally, its whole-program analysis capabilities help to detect a whole new class of problems at compile time, such as OpenMP usage errors.
There's support for new Fortran 2003 features. Asynchronous I/O is a new multithreaded I/O library that will help to improve your application performance. There are new C interoperability features that help you to build portable mixed-language applications. There's improved standards checking and many more features that really are aimed at increasing developer productivity.
Improvements in the Intel Debugger make it easier than ever to debug parallel programs, programs that use MPI and OpenMP. And I'd like to emphasize again that this is Intel's first 64-bit Fortran Compiler for the Mac. You have the opportunity to direct the Compiler at compile time to produce either a 64-bit binary, a 32-bit binary, or a universal binary that contains both. So at this point, I'd like to go ahead and go to the demo machine.
The application that we're using in this demo is the same square charge application that was used in Monday's Tools State of the Union, if you were here for that, and I'll tell you more about what this application does in a moment. But let me first draw your attention to the fact that this Xcode project contains a number of Fortran modules and a number of Objective-C modules.
So it's easy in Xcode to mix C, C++, Objective-C and Fortran all together in the same project. And here now I'll show you how to select the Intel Fortran Compiler as the default for building Fortran source modules. You just need to open up the inspector for the target, click on the Rules tab, which is already selected here, and then select the Fortran Compiler from the pull-down menu. That's it, all there is to it. So now I'd like to go ahead and build this application. And the thing I'd like you to notice right now is these diagnostic messages which are coming from the vectorizer. I did that again.
There. Loop was not vectorized: existence of vector dependence. I'll talk a little bit more about that in a moment. But let's go ahead and run the application. Now what this application is doing is a numerical integration. That's a common problem in scientific computing and this particular application was written by our resident physicist. It's from the field of electrostatics and it's computing the electric potential due to a uniformly distributed electric charge in a unit square at various points outside the square but in the same plane. If you were at the tools talk on Monday, you might have heard the more technical terms for what this program was computing. I think it was Electra whosiemawhatsit.
But so anyway, this took about 23 seconds. Let's go ahead and quit and take a look at the inner loop of this program that failed to vectorize. So this loop right here. Now you'll notice, this loop contains a call to an external function, a function that's defined in a different module. The vectorizer does not have the context to know that the stuff contained inside that function is independent from one loop iteration to the next.
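To make that pattern concrete, here's a minimal sketch of the kind of loop being described; the function and array names are hypothetical, not the actual square-charge source. Because integrate_cell lives in a separately compiled unit, the vectorizer can't see inside it:

```fortran
! Hypothetical stand-in for the external integration routine, imagined
! as living in its own source file/module.
function integrate_cell(x, y) result(v)
  implicit none
  real(8), intent(in) :: x, y
  real(8) :: v
  v = 1.0d0 / sqrt(x*x + y*y)   ! stand-in for the real quadrature
end function integrate_cell

program potential_sketch
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real(8) :: phi(n), x(n), y(n)
  real(8), external :: integrate_cell
  do i = 1, n
     x(i) = 1.5d0 + i
     y(i) = 0.5d0 + i
  end do
  ! With integrate_cell compiled separately, the vectorizer cannot
  ! prove the call is independent from one iteration to the next, so
  ! this loop stays scalar until IPO/inlining gives it that context.
  do i = 1, n
     phi(i) = integrate_cell(x(i), y(i))
  end do
  print *, sum(phi)
end program potential_sketch
```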
We can give it that context by enabling interprocedural optimizations, and for the purpose of this demonstration, I'd like to isolate the effect of IPO and the effect of vectorization, so the next configuration that I'm going to select enables IPO but disables the vectorizer. So now we're going ahead and rebuilding and running.
Now remember the first run that took 23 seconds, that was with your vanilla O2 optimizations, and this run has, in addition to those vanilla O2 optimizations, interprocedural optimizations but no vectorization, and we'll see how long this takes to run. 18 seconds. So a speedup of about 5 seconds, or in the range of 20 percent.
So even without vectorization, we're seeing some benefit from interprocedural optimizations and that's because the Compiler is able to inline the code from that integration function into the inner loop, eliminating the call overhead. Now let's enable vectorization. Okay, and the thing to note here is that now the Compiler is telling us that that inner loop has been vectorized, and we'll see how that performs.
So we've gone from 18 seconds to now about seven tenths of a second, a 30x-ish or so speedup. There's one more thing I'd like to do with this application and that is multithread it, and what we've done is taken an outer loop, right here, and added an OpenMP directive to it. You actually need a command line option to tell the Compiler to go ahead and process these directives and generate parallel code for the loops that they modify. So that's what this last configuration is doing.
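As a rough illustration of the change being described, here's a hypothetical sketch, not the actual demo source, of what an OpenMP directive on an outer loop over field points looks like; the kernel is a stand-in for the real integration:

```fortran
program square_charge_sketch
  implicit none
  integer, parameter :: n = 512
  integer :: i, j
  real(8) :: phi(n)
  !$omp parallel do private(j)
  do i = 1, n                       ! iterations split across threads
     phi(i) = 0.0d0
     do j = 1, n
        phi(i) = phi(i) + 1.0d0 / sqrt(real(i*j, 8))  ! stand-in kernel
     end do
  end do
  !$omp end parallel do
  print *, sum(phi)
end program square_charge_sketch
```

The directive is ignored unless you pass the compiler's OpenMP option; with gFortran, as discussed later in this session, that flag is -fopenmp.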
Now you can see that in addition to the message that tells us that loop was vectorized, there's a diagnostic message that tells us that the OpenMP loop was parallelized. And we'll go ahead and see how that performs. So a little under two tenths of a second, or a speedup of 4x-ish. Makes me wonder if this is a 4-core machine and not the 8-core machine we were running on yesterday.
Okay, can we cut back to the slides please. Thank you. So what did we do in this demo? We first showed how easily the Compiler integrates into Xcode. Then we built the square charge application under a number of different configurations. We saw the diagnostic messages coming from the vectorizer and the parallelizer, and most importantly, we saw the improved performance that was achievable by applying the more advanced optimization capabilities of the Compiler. We went from an execution time of about 23 seconds to an execution time of about 17 hundredths of a second. So it's an improvement of over 100x.
Now this demo of course was designed to showcase the optimization capabilities of the Compiler. We don't expect everybody to be able to just take the Compiler and achieve 100x performance improvement on their application, but you can imagine that if you have one hot kernel of your application and are able to achieve just a fraction of the speed up on that kernel, you can achieve some very impressive application level performance improvements just by switching to our Compiler.
So hopefully by this point I've piqued your interest about the Compiler and conveyed the idea that we think it's an invaluable tool for helping build high-performing, multithreaded Fortran applications on the Mac. And if you would like more information or would like to download a free evaluation copy, I'll direct you to the Intel software products website.
( applause )
Thank you. So my name's David Gohara and I am with MacResearch.org. It's an online community for scientists who are using Apple hardware and software in their research. MacResearch.org helps scientists by bringing the community together and providing resources for scientists such as news, reviews, and tutorials.
We also have initiatives to help scientists do more with their work on the platform, and we also do another important thing, which is we serve as advocates for scientists directly with Apple, by meeting with Apple, talking with Apple, engaging with them and letting them know what scientists need on the platform to really use their hardware and software in their work.
So what are some of the resources? Early on, one of the resources we came up with was our script repository, where scientists can take any of the tools that they've developed and we'd essentially just host them for them, and they're freely available to any other scientist to use in their own work.
And these could be extensions to applications that scientists might commonly use, such as Octave or R for example. The repository of course is free, and it's searchable by programming language or by scientific discipline. You can upload whatever you have and make it available to others.
The other area that we've really invested our time in heavily is tutorials. These are usually user-driven tutorials, people saying, well, how do I use XYZ technology in Mac OS X, in my work? And so we go out and, for example, Drew McCormack has written an excellent series called Cocoa for Scientists. I know you're all sick and tired of hearing about Cocoa this week, but you know, that's one of the series. We also have AppleScript for Scientists, Xcode tutorials, Xgrid, etcetera and so forth. Again, these are all free, and not only do we develop these tutorials, but other scientists write their own tutorials and we make them available and host them on the site.
Another recent initiative that we've done is OpenMacGrid, which is a community Xgrid cluster for scientists. It's operating right now at about 2.2 THz, and if you're a scientist and you have an application that's amenable to grid computing, you can get access to these resources free of charge. If you need extra computational power on demand, but you don't happen to have the computational resources wherever you're located, you can just submit a proposal and if it's approved then you can go ahead and get access to OpenMacGrid.
But the real reason why we're standing up here of course is Fortran and one of the things that we all know is that there's a large body of scientific code that has been developed in Fortran, is maintained in Fortran and continues to get updated in Fortran and so we want to make it easy for scientists to use Fortran within Mac OS X.
Now of course there are the Intel tools and the Absoft tools, and there are also open source tools, and scientists have different needs. Sometimes they just want something right away, sometimes they want a commercial application with, you know, additional support or whatnot, and so we have focused on gFortran as one alternative for scientists to use.
And to make it easy for scientists to use Fortran on Mac OS X, to use gFortran I should say on Mac OS X, we've done two things. We've made prepackaged installers that are simple double-clickable installers, so you can, within a matter of seconds, get gFortran installed on your system, PowerPC or Intel-based Macs, and you'll be up and running at the command line ready to go.
But at the same time, we also wanted scientists to be able to use a lot of the more advanced tools that come with Apple's developer tools such as Xcode, the graphical debugger and whatnot, and to make that easy for them, we commissioned a contest, essentially, that produced a plug-in that allows people to use gFortran within Xcode. This plug-in works on both PowerPC and Intel-based Macs. It's free, it's also part of this installer package. You double click, install, it's all set up.
It comes with templates, it's ready to go. So you can develop your Fortran projects within Xcode, you can develop mixed C and Fortran and whatnot, and you're ready to go. And of course the plug-in is open source as well. So if you wanna extend the plug-in, improve it, make modifications to it, you're free to do that, and it's all available at MacResearch.org.
So at this point what I'd like to do is turn this over to my colleague, Drew McCormack who will give you an overview of OpenMP and gFortran.
( applause )
Okay thanks Dave. So what is OpenMP? OpenMP is a parallel programming API, particularly for shared memory architectures.
Now it's supported in C and Fortran, and it's actually not a new API at all. It's been around for around 10 years, I guess. But it's been used primarily on supercomputers, large supercomputers, SMP machines such as the SGI Origin series. But it hasn't been used very much on the desktop, and the main reason for that is that you simply haven't had multiprocessor machines on the desktop, or at least not significantly many cores. That of course is all changing, and that makes OpenMP more interesting to people developing on the desktop, and so OpenMP is becoming a bit more relevant to most programmers.
So if you wanna know more about the spec of OpenMP, what OpenMP can do for you, I suggest taking a look at the openmp.org website. But I wanna go through now and sort of introduce you to OpenMP and in particular, how it's used in gFortran. I assume that many of you have heard of MPI. MPI is a popular way of parallelizing programs, and I think a good way to introduce OpenMP is to contrast it with MPI for that reason.
So MPI is the Message Passing Interface, and basically what you have there is separate processes, so effectively separate programs running with their own memory address space, and of course they can communicate with one another with message passing, so sending buffers of data to one another. MPI works very well on distributed memory computers like clusters, or grids even, but it can also be used on shared memory machines. Usually there's a small cost in memory, but you can at least run on, say, an 8-core Mac Pro. That's an advantage of MPI, it runs just about everywhere.
OpenMP in contrast is a threaded model, so it's not separate processes, but separate threads in one process. It's shared memory, so all of those threads of course can read and write the same memory, and it doesn't generally run on a cluster, it's generally for SMP machines. Now there's a footnote there because Intel does actually make something called Cluster OpenMP. Unfortunately it's not available on the Mac. I think it's only on Linux at the moment, so if you'd like Intel to bring it to the Mac, I suggest that you get in contact with Intel.
So let's go into the memory architecture of these things. Imagine we've got a serial program, this is a Fortran serial program, and imagine these blocks here are different arrays in memory: a green array and a blue array and a pink array. So what happens when we move to MPI? Well, then we have of course multiple processes, and typically what you have when you parallelize a code with MPI is that you'll duplicate some of those arrays, particularly the smaller ones, because you don't necessarily wanna distribute those, so you'll have a small memory cost.
You'll require some more memory because, in this case for example, we'd duplicate the green array, we'd duplicate the blue one, and maybe the pink one was particularly big, so we'd split that in half, put half on CPU one, half on CPU two.
Of course, when you split an array in half like that, things get a bit more complicated in your program, because you have to do all the accounting for indexing and things. Where is that data? If you need it on a certain processor, how do you get it? You have to request it from another processor. So that can complicate your program quite a lot.
So this is distributed memory: some arrays are duplicated and some are distributed. If we go to OpenMP now, we're back at the serial case basically. We've got the same memory structure as the serial case, but with two processors, or two threads I should say in this case, sharing the same memory, and so they can read and write exactly the same logical data. That means you're using less memory in general, but it also complicates things, as we'll see a bit later. So memory is shared and each thread can read and write exactly the same address space.
So let's get into how OpenMP actually works. It's basically loop-centric, so your program typically remains largely serial. When you first compile your program with an OpenMP Compiler, it will typically just run serially. So you don't have to do anything as a first step to OpenMP.
But then of course you won't get any performance gain because it's just running serially, so what you need to do is, say, profile your code, maybe with Shark or something like that, and find out where those hot kernels or hot loops are in your code, and then you wanna go in and use OpenMP to parallelize those, only those parts of the code, leaving the rest serial.
So the way that works is, if you've got some hot loop, you'll put in some OpenMP directives for that loop, and what happens is then the Compiler will insert code into your executable so that when it hits that loop, some threads will be spawned, and then each thread will work on a certain subset of the iterations of the loop, and then at the end of the loop all the threads will exit except for, of course, the one master thread.
Now it's very important to realize that this happens implicitly. You never actually say, I wanna create ten threads now. You don't do that explicitly like you would in, say, pthreads. You simply tell the Compiler, this loop can be parallelized in this manner, and the Compiler will do it for you. It will inject the code that's necessary to create those threads and delete them at the end.
Now as this talk is about gFortran: OpenMP has just come to gFortran. Actually, if you get a very recent version of gFortran or GCC, you can use it. I think you need version 4.2 or later. And it's very easy to use. You simply compile your code as normal, but you add the -fopenmp option.
And then at run time, you typically have to tell the program how many threads it should spawn when it enters a loop, and the way you can do that is simply to set an environment variable, OMP_NUM_THREADS. And there are other ways of doing it as well. You can do it dynamically inside your program if you like.
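For instance, here's a minimal sketch of the dynamic approach, using the standard omp_lib module that gFortran ships with. Built with gfortran -fopenmp, it overrides whatever OMP_NUM_THREADS is set to:

```fortran
program set_threads
  use omp_lib
  implicit none
  call omp_set_num_threads(4)   ! request a team of 4 threads
  !$omp parallel
  !$omp master
  ! omp_get_num_threads reports the team size inside the region
  print *, 'running with', omp_get_num_threads(), 'threads'
  !$omp end master
  !$omp end parallel
end program set_threads
```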
So let's get away from all this abstract talk and into an actual, concrete example, something that should be familiar to most Fortran programmers. This is a full Fortran 90 program and all it does is a matrix multiplication. Now of course you wouldn't do it this way in a real production code, you would hopefully use MKL or you would use Accelerate or some high performance library.
But of course everyone knows what a matrix-matrix multiplication is, so we can use that as an example. So in the middle there in gold are the actual loops that perform the multiplication: a double loop with a dot_product in the middle. So how do we parallelize this in OpenMP? Well, it's actually very easy. You add a single directive just in front of the first loop.
An OpenMP directive in Fortran is actually just a comment, a Fortran comment. So the nice thing about that is, if you then go and compile this code with a serial compiler that doesn't understand OpenMP, it will still compile and run serially. So that's a real advantage of OpenMP. You don't have to jump in, you can literally just dip your toe in, so your program can be run serially or with OpenMP.
Now a directive has the !$OMP sentinel that indicates to the Compiler that it's a directive, and in this case we're doing a parallel do, which basically tells the Compiler: the loop that follows, the one with the i index, parallelize that. And the private(j), what that means is that each thread that's produced should have its own private copy of the j variable. That's actually important because otherwise all of these threads would be trying to write to the j variable and you'd get what's called a race condition, which I'll talk a little bit about shortly.
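The slide source isn't reproduced in this transcript, but a reconstruction of that kind of program, a complete Fortran 90 matrix multiply with the single directive added, might look like this:

```fortran
program matmul_omp
  implicit none
  integer, parameter :: n = 500
  integer :: i, j
  real(8) :: a(n,n), b(n,n), c(n,n)
  call random_number(a)
  call random_number(b)
  ! parallel do splits the i iterations across threads;
  ! private(j) gives each thread its own copy of the inner index
  !$omp parallel do private(j)
  do i = 1, n
     do j = 1, n
        c(i,j) = dot_product(a(i,:), b(:,j))
     end do
  end do
  !$omp end parallel do
  print *, c(1,1), c(n,n)
end program matmul_omp
```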
Okay, so what happens when you run this code? Imagine this blue line is a thread, a single thread. Your program initially starts off serial, it will run along serially, and in the middle it hits this loop and then it will spawn multiple threads, and these threads will go through the loop, and at the end of the loop, it will return to a serial application.
So one really nice aspect of OpenMP is that you can parallelize your code one loop at a time. You don't have to do the big bang change to five hundred thousand lines of code, you can literally go and find those hot spots, parallelize them and leave the rest serial. And over time, as you think oh I need another performance boost, you can go back and do another loop or another section of the code. But you can do it incrementally and that's an important part of it.
Now just to return to what actually happens with these threads: as I said, each thread takes a subset of the iterations. Imagine this is the matrix C. Say we had twenty threads and two thousand rows, then thread one will simply do iterations one to a hundred, thread two will do one hundred and one to two hundred, etcetera, etcetera. So in the default case it's very simple.
They simply split up the work, share the work. Now just to show you that although it's very simple, it can also produce good results, here is a test I did of exactly this code on a 3 gigahertz 4-core Intel Mac Pro, and what you can see is that on 4 processors or 4 cores, you can get a speedup of around 3.6 with this code, which is actually quite acceptable, quite reasonable for such a simple change. At the bottom there are the Compiler options that I used, but they're not really that important in this case, we're just interested in the scaling.
So I mentioned race conditions before and I wanna return to that a bit. Here we've got the same loop, and imagine that we forgot to include the private(j). Then we will have introduced a bug, a race condition, and as I said, what will happen is, all of those threads will try to write to this j variable at one time. So one might be writing one into j, another might be writing five into j, and the results that you get out will be completely dependent on a race between the threads. It's very undesirable, and in your best case, if you're really lucky, your program will crash.
That's a good situation. You'll know that there's a problem and you'll be able to fix it. In a worse case, you'll simply get the wrong answer and you won't know that it's the wrong answer. And in a very bad case, but still possible, it might run properly 999 times and on the thousandth time it will crash or produce the wrong answer. So this makes debugging very difficult in OpenMP. You have to be very careful with what you're doing. And that I guess is the flip side to the ease of use. It's easy to learn, easy to introduce, but debugging can become very difficult.
So what else is in there? What else is in OpenMP? I've shown you a very simple parallel loop construction, I'll just go through a few of the other things you can do, just to give you an idea of what's possible. First of all you've got things like scheduling.
You can change the scheduling from static to dynamic. So imagine that you've got a loop where some iterations take a lot longer to run than other iterations. In that case, it would be better to do some sort of dynamic load balancing, and that's possible in OpenMP. You can have each of the threads go away, do some work, come back and ask for more work, and effectively that balances things out.
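Here's a sketch of what that looks like, assuming a made-up triangular workload where later iterations cost more:

```fortran
program dyn_sched
  implicit none
  integer, parameter :: n = 200
  integer :: i, j
  real(8) :: work(n)
  ! schedule(dynamic, 5): threads grab 5 iterations at a time as they
  ! finish, so the expensive late iterations don't leave threads idle
  !$omp parallel do private(j) schedule(dynamic, 5)
  do i = 1, n
     work(i) = 0.0d0
     do j = 1, i*1000            ! cost grows with i
        work(i) = work(i) + 1.0d0 / real(j, 8)
     end do
  end do
  !$omp end parallel do
  print *, work(n)
end program dyn_sched
```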
You've also got runtime routines, actual subroutines you can call to query things from the OpenMP runtime, such as how many threads are running at the moment and what thread am I, that sort of query. You can also do things like locking. But you typically don't need to use these routines. You can get away with the directives for most things, but if you really wanna get into the nitty gritty and low-level threading, you can use these routines.
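A minimal sketch of those query routines, again via the standard omp_lib module:

```fortran
program who_am_i
  use omp_lib
  implicit none
  !$omp parallel
  ! each thread reports its own id and the team size
  print *, 'I am thread', omp_get_thread_num(), &
           'of', omp_get_num_threads()
  !$omp end parallel
end program who_am_i
```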
Plus there are a few other things that I'll just demonstrate now. So we mentioned there was a parallel loop that was a single block, but you can actually make these parallel sections go over a much wider area of the code. So in this case, there are two loops and a parallel section right around those loops.
So what happens is you come in serially, you hit this parallel section and it spawns these threads. They go through the first loop, do some work, and then there's actually a barrier, an implicit synchronization at the end of the first loop. You can get around that by adding a nowait clause. And then they will continue on to do the second loop before returning to serial.
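Here's a sketch of that shape, one parallel region spanning two loops; the nowait is only safe here because, as assumed in this sketch, the second loop doesn't read what the first one wrote:

```fortran
program two_loops
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real(8) :: a(n), b(n)
  !$omp parallel
  !$omp do
  do i = 1, n
     a(i) = real(i, 8)
  end do
  !$omp end do nowait    ! skip the implicit barrier after this loop
  !$omp do
  do i = 1, n
     b(i) = 2.0d0 * real(i, 8)   ! independent of a(), so no barrier needed
  end do
  !$omp end do
  !$omp end parallel
  print *, a(n), b(n)
end program two_loops
```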
So the message there is that you can actually make whole sections of your code parallel, not just a single loop. Now you can do it the other way around as well. If you've got a very big parallel section, you can also have a serial section embedded inside that.
So here we've got a parallel loop and then we've got this master block, and the master block is a serial section of the code. So we wanna do summation just on one thread. So what happens here in this case is the master thread continues, the others just wait. And then we go through the second loop in parallel again and back to serial. So you can have parallel sections in serial sections and you can have serial sections in parallel sections.
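A sketch of that pattern follows. Note the explicit barrier: the master construct itself implies no synchronization, and in this sketch the other threads must wait for the master's sum before using it:

```fortran
program master_demo
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real(8) :: a(n), b(n), total
  !$omp parallel
  !$omp do
  do i = 1, n
     a(i) = real(i, 8)
  end do
  !$omp end do          ! implicit barrier: a() is complete here
  !$omp master
  total = sum(a)        ! summation on the master thread only
  !$omp end master
  !$omp barrier         ! other threads wait for total before reading it
  !$omp do
  do i = 1, n
     b(i) = a(i) / total
  end do
  !$omp end do
  !$omp end parallel
  print *, total, b(n)
end program master_demo
```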
Now you might be thinking oh it's a shame to do that summation serially, that's gonna be a performance problem and you can get around that sort of thing with something like a reduction. So a reduction is like an arithmetic operation that you can do in parallel. So this is exactly the same code, but I've replaced that master block with a reduction.
So what happens is you enter the parallel section, you do the first loop in parallel, then there's a barrier, and then you go through this reduction in parallel as well. What happens there is that each thread has its own sum variable and adds up its share of the array elements, but at the end of the loop, all of those partial sums are added together and put back into the global sum variable.
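Here's a sketch of that version, with the serial master block replaced by a reduction clause:

```fortran
program reduction_demo
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real(8) :: a(n), total
  total = 0.0d0
  !$omp parallel
  !$omp do
  do i = 1, n
     a(i) = real(i, 8)
  end do
  !$omp end do
  ! each thread sums into a private copy of total; the partial
  ! sums are combined into the shared total at the end of the loop
  !$omp do reduction(+:total)
  do i = 1, n
     total = total + a(i)
  end do
  !$omp end do
  !$omp end parallel
  print *, total        ! 500500.0 for n = 1000
end program reduction_demo
```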
So this is a way of doing summation in parallel, and then we do the last loop as before. So the lesson there is you can do basic mathematical operations in parallel as well, reductions for example. Now I've said that everything in OpenMP is pretty much loop-centric. That's not entirely true. You can do other types of threading as well which are not loop based.
For example, here's a parallel sections construct. So in this case what happens is, you split up the work, and then the first thread will do the first section, the second thread will simultaneously do the second section and the third will do the third section, and then they continue on in parallel.
Now of course, your sections have to be basically independent code. This will only work if set up A, set up B and set up C are independent of one another, and you'll also notice that this will only scale to three threads before this particular part of the code will not scale any further. So that's a disadvantage, but this could be useful for certain parts of the code which are not so easy to parallelize as loops.
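A minimal sketch of the sections construct, with three trivial assignments standing in for the independent set up A, B and C:

```fortran
program sections_demo
  implicit none
  real(8) :: a, b, c
  !$omp parallel sections
  !$omp section
  a = 1.0d0       ! set up A
  !$omp section
  b = 2.0d0       ! set up B, runs concurrently with A
  !$omp section
  c = 3.0d0       ! set up C, runs concurrently with A and B
  !$omp end parallel sections
  print *, a + b + c
end program sections_demo
```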
So independent sections of the code can run concurrently. So that's pretty much all I've got to say, just a few points in summary. OpenMP is easy to learn. It only really works on shared memory machines, and so that's certainly something you should consider, but if you're writing desktop apps, now is a good time to learn it. Because you really can take advantage of those 8 core Mac Pros quite easily.
You can convert from serial to OpenMP one loop at a time, iteratively, which is a very useful attribute of OpenMP. It's not the case for example with MPI where typically you need to do a big bang change to your whole code to get it to parallelize. So with OpenMP you can, you can just make one part of the code fast and then three months later you can go back and make another part fast and in the interim time, you can be using that code.
On the flip side of course is that race conditions can make debugging very difficult. To use OpenMP in gFortran, just use the -fopenmp flag, and you'll need a very new version of gFortran, I think it's 4.2. Okay, now I'll just pass back to Dave to finish off and talk about the future of MacResearch.
( applause )
So just really quickly, a couple points. We at MacResearch are redesigning the site from the ground up to make it easier for people to use and to get information. It'll be the same great site, actually it'll be a better site I think, but it'll be easier to find what you want, do what you want on the site, and really contribute to the site as well. And you'll notice that real programmers do code in Fortran. So keep that in mind, and Fortran is always welcome at MacResearch.
But the other thing I wanted to announce as well is we are partnering up with the guys at MacPorts to bring out a new initiative called OpenMacForge. It's gonna come out in two phases, but the first phase of course is pre-compiled binaries with installer packages, and we are hoping to get that rolled out really soon. What it'll allow you to do is, if you're a scientist and there are common applications that you'd like to use, but you don't wanna go through the trouble of having to build them all yourselves, we're gonna build them for you, and you can just double click, install, get them up and running, and get your work going in no time. And of course that service will be provided free as well. So you can look forward to that in the very near future, and with that I will pass it off to Wood from Absoft and he will continue.
( applause )
Thank you very much.
Hello, I'm Wood Lotz from Absoft and as many of you may know, we've been doing tools for the Mac since 1984. We did have a 64-bit Fortran Compiler for the Power-based machines, and today I'm here to talk about a brand new product we have for the Intel-based machines.
The product is called Absoft Pro Fortran version 10 and it's a completely new product. It's not just a port of our previous product that we had on Power. The two biggest differences are that it includes a lot of new compiler technology, which is all oriented towards improved performance, and we have introduced a new version of our Fx Debugger, Fx3, which is included with the product.
We began shipping this about three months ago. We've gotten excellent feedback, and the keys that we worked toward in developing this new product were performance, reliability and ease of use for the Compiler and the basic tools. But also, we consider all Pro Fortran products to be a complete solution, and that means they include not only the standard Compiler and Debugger, but also a development environment designed for Fortran programmers. Again, this is shipping now, and it installs on both Tiger and Leopard. So what I'm gonna talk about here is an overview of the product and everything it includes, and then we're gonna talk about performance.
Okay, here's an overview summary of the Pro Fortran version 10 product. Obviously we wanna generate really fast code, 32 and 64-bit. It's a Compiler that we have been using for quite awhile; it uses a lot of technology from Cray Research, which we licensed in 1993 I think, and Sun and SGI use licenses as well, so our Compilers are source compatible with Cray and Sun and Absoft across the board with no problems. Since that time we've also added quite a few extensions from all the workstations and big iron, and several from Fortran 2003.
Of note with the new version 10 Compiler are the optimizers in the back end, which we'll get to in a little bit. Support for threading is obviously important and we make that available, as well as automatic vectorization, and the auto-vectorization is a completely new feature with this product.
Building mixed Fortran and C, as other people have stated as well, is critical to pretty much everybody in the HPC space, and so that's been something we've spent a lot of time with. It is easy to do, but in the documentation, we also spend a little extra time with examples and instructions on just how to do that in different ways.
As for the Fx3 debugger, which we're introducing new here: like I said, we've been doing Fortran debuggers for probably 15 years, starting out with the Fx debugger. Fx3 is the latest iteration, and this debugger includes all the functionality that you would need for most Fortran debugging, but also supports C as well, so you can do mixed Fortran and C debugging, which is important.
Something important that we did with this product is we use a standard code base across Mac, Windows and Linux, and that's really good for you guys because it means we can keep the product reliable more easily and also add features faster across all those platforms. It has the same look and feel on all the platforms. It's also better for us because it's easier to support. The development environment that we include is complete, and it provides benefits that aren't available elsewhere for most Fortran programmers.
Some of those benefits include the fact that it supports multiple compilers, both Fortran and C, as plug-ins. So you can mix and match different compilers and build your codes as you want to. We also include an application framework, which we call MRWE, which stands for Macintosh Runtime Window Environment, and that automatically puts a Mac-style front end on your application if you choose to do so. The application framework's written in Fortran, and all the source is included if you're really daring and wanna do some programming with the Mac toolbox from Fortran.
We also support a variety of third-party products which can plug into the development environment. We include a programmer's editor for example, but if you wanna use BBEdit or something, that's fine. We also support different math libraries. For example, IMSL will plug in, and if you have some underlying core routines you wanna use, you can just select those options and then automatically everything hooks together properly at build time, just to make it simple for everybody.
And again, this environment looks the same on Windows, Mac and Linux, so for people that program in multiple environments, the learning curve is just one; it's simple and easy to do. As far as math libraries, we include BLAS, LAPACK, ATLAS, FFTs and things like that, which are already pre-built, preconfigured and ready to run. We also include some 2D and 3D graphics as well as HDF libraries. So this is all a part of the package. Excuse me.
Okay first, let me take a look just at a screen, so we can get an idea what we're talking about here. Oh, just put nothing else to buy or learn. Cool. Thanks guys. Very good. Okay, here's a screen and you can see across the top, this is a sample.
We support different Fortran Compilers and the GNU C Compiler on the Intel Macs. On the Power Macs we support of course the Absoft Compilers, the GNU Apple Compilers and the IBM XL Fortran and XL C Compilers in this environment, and depending on which compiler you choose, you can set a whole bunch of options, which makes things simple.
You can build, as you can see there, MRWE applications, position independent code, whatever you wanna do. On the right hand side we have a thing called speed math, which is a really aggressive floating point optimizer, so if your applications can trade a little accuracy out there in the far digits for some speed, then you might wanna give it a try.
Also, the command line options are available at the bottom, so you can do that. Anyway, it's a typical environment but it's designed for Fortran building as well as C, and most environments are designed primarily for C. So that's why we did it. We'll come back to that slide in a second.
Here is a screenshot of the Debugger, and again, it includes the standard things that you need in a graphical debugging environment. This particular sample is an example of viewing array elements, and you can, you know, open up a lot of screens, whatever you wanna do, for looking at that. Okay, that's a quick summary of the environment itself and the features that we include.
But performance is key in the Fortran world of course, so that's something we'll talk about next. And because this is a completely new compiler that we built for the Intel machines, and we didn't port our other compiler for comparisons, we're gonna be comparing against some other compilers available in the space.
Oops, just a second. Okay, what we're using for the basis of our comparison here is the Polyhedron benchmarks, and for those of you who may not be familiar with Polyhedron, it's a company in the UK, they're a Fortran specialty shop, and over the years they've collected a suite of sixteen programs which represents a good benchmark that a lot of people use. Basically, you run the programs and take the geometric mean, and that's in seconds, so less time is better. So as you can see here on 32-bit, we're looking at Tiger here, the Absoft version 10 ran in 16 seconds and change.
Right next to that we have our compiler version 9 from Power running under emulation. We just included that for comparison, in the event that some people are still using Rosetta to run Power-developed apps on their Intel boxes. There's a huge performance gain available if you switch to native compilers. And then we used gFortran and G95 for comparison here as well.
In the 64-bit space, we used Absoft, Intel, that's 9.1, and also gFortran, and again, we're looking good. Caveats apply of course, because this is just a benchmark and your mileage will vary, but we did wanna do this to show that the new version 10 Compiler is a very, very good performer. And 64-bit code is what we're gonna be focusing on in the future and that's where most of the continued development will be.
Okay, I've talked about auto-vectorization as a new addition to the version 10 Compiler, and again, as was discussed earlier, the goal of auto-vectorization is to execute multiple loop iterations at the same time. We also have a vectorization report which you can print out, and it shows loops that were not vectorized.
You can identify them, and then if you want to, you can go back and look at the loops and see if maybe a little massaging of the code could change the loop so it could be vectorized and improve performance even further. Again, how well does this work? Here's a sample where we took another benchmark; on the left is the unoptimized code and then we optimized it with different compilers, and it came out very well. So the new version 10 Compiler is a very, very good performer.
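As a generic illustration of that kind of massaging, not taken from the benchmark suite, here's one common transformation: splitting a loop so the part with no loop-carried dependence can be vectorized:

```fortran
program loop_fission
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real(8) :: a(0:n), b(n), c(n)
  a = 1.0d0
  b = 2.0d0
  ! Original: the recurrence on a() blocks vectorization of the whole
  ! loop, including the independent work on c()
  do i = 1, n
     a(i) = a(i-1) + b(i)
     c(i) = b(i)*b(i) + 3.0d0
  end do
  ! Massaged: after loop fission the c() loop has no loop-carried
  ! dependence and can be vectorized; the recurrence stays scalar
  ! (both versions compute the same values, shown here for comparison)
  do i = 1, n
     c(i) = b(i)*b(i) + 3.0d0
  end do
  do i = 1, n
     a(i) = a(i-1) + b(i)
  end do
  print *, a(n), c(n)
end program loop_fission
```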
So basically that's my snapshot of what we've got for the new Intel-based Macs. To summarize, we have 32 and 64-bit Compilers, you can install on Tiger or Leopard and it just runs, and there's auto-parallelization and auto-vectorization so you can take advantage of the multi-core machines. We have a brand new debugger we introduced for this. Fortran mixed with C is easy to do. The IDE is common across a variety of platforms, and the price starts at just $299.
Another thing we wanted to announce, it actually just happened this week, so I'm happy to announce it here, is that in addition to shipping the first commercial 64-bit Compiler for Intel Macs, we will be shipping IMSL beginning next week, and I believe we're the first company to do that as well.
And in addition, for anybody who has a Power-based Mac and Absoft and IMSL, we will be offering special upgrade pricing. This is something we worked out with VNI to extend to our existing customer base. There'll be details on our site about that shortly. Okay, well that's the product that we have, and of course we make it, but we want you to try it.
So that's the whole goal, and I'd like to encourage you to come to our website, that's absoft.com. We have a free thirty day demo that you can download; try it on your code, try it on your machine, and if you have any questions, email us or call us. We prefer email as the method of handling support questions, but we do maintain real live knowledgeable people that will answer questions and help you solve your problems if you wanna call us during business hours. So that's it, thank you very much.