
WWDC06 • Session 312

Using Intel Software Development Products for Mac OS X

Development Tools • 59:03

The Intel Fortran Compiler and Intel C++ Compiler work with the Intel Math Kernel Library and Intel Integrated Performance Primitives to help produce high-performance scientific and media applications. Discover how these tools integrate with Xcode, allowing you to produce code that takes advantage of the multi-core features of the Intel Core Duo processor. Watch live as a sample application is taken through the steps needed to become fully vectorized, connected with key performance routines, and optimized.

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

Good afternoon. I'm Joe Wolfe. I'm a manager of the Intel Compiler Software, sorry about that, Technical Consulting and Support Team at Intel. And I'm Elizabeth Schneider, and I'm a Technical Consulting Engineer and Manager in the Intel Compiler Support Team. And welcome to our session. This is Using the Intel Software Development Products for Mac OS X.

You know, just wanted to say that, you know, we are absolutely thrilled to be here. It's the first time attending WWDC, and it's just been really exciting to see, you know, just what you're doing with the new Intel-based Macs and just the progress that you've made. This has just been fabulous for us to see. I'd also like to thank our hosts, you know, several people from Apple that we work with. Thank you for inviting us to the conference and allowing us to speak here.

Yeah, as I said, it's just been a wonderful experience. But let's go ahead and just see what it is that we're going to be talking about today. You know, so, you know, first, I want to give you, you know, an understanding of the software development products for the Mac OS X that we have. And I also want to spend a little bit of time just explaining why Intel is in the software business at all.

So we're going to spend a little time there. Number two, we're going to talk about how you can use these tools to build high-performance software. So looking at the unique performance capabilities of the tools and just how you can use them in your software. And then thirdly, we're going to look at multi-threading. You know, with the dual-core capabilities of the Intel Core Duo, Core 2 Duo processors, we want to make sure that you understand some of the threading capabilities that we have in the tools and some of the techniques that you can use to take advantage of that.

Now, the way this is going to work, I'm going to be up here talking and hand-waving and doing some things, and Elizabeth is going to be running demos. She's going to keep me honest, showing you that I'm just not making this stuff up. So, you know, we're going to have a bunch of demos of different of the techniques and the things that we're showing you here. At the end of this, we should have plenty of time for Q&A, and we're looking forward to, you know, interacting with you some more.

We are also in the performance lab downstairs. There's Elizabeth and I, as well as a number of other people from the development team and other of the tools teams, as well as some application engineers that have had a lot of experience helping port and tune applications for the Mac on the Intel-based Macs. So, we're looking forward to talking with you some more there.

Now, to start, I have a question. How many of you knew that Intel is a software company? Wow, okay, that's amazing. A software company in Intel. I wasn't sure if, yeah, I just wasn't sure what the reaction was going to be to that. Now, how many of you knew that we have, you know, Intel compilers, libraries, or software development products? Okay. Oh, very good, very good.

Okay, that's, I was, I had no idea what the response to that question was going to be. That's great to hear that you've heard about us. But, you know, still, I want to give you a little bit of background of just why we're here and why software is important to Intel and why we're in the software tools business. You know, we know our processors very well, and we want to make sure that you get the maximum performance on your application utilizing the Intel processors.

So, we don't want you to leave any performance on the table when you're developing your applications, because performance translates into better user experience and just, you know, a more pleasant experience with the Macs. So we want to make sure that you have the tools and support and services that you need to do just that.

We've been in the software business for quite a number of years developing these tools just for that reason. We've also put together, you know, a whole host of other services, you know, around not just the software products, but around the capabilities that we have in terms of tuning and porting experience.

And, you know, so things like the Intel Software Network, which is a web-based community for developers consisting of various user forums and blogs, just trying to bring software developers together around Intel's products and platforms. The Intel Software College, which is a set of training courses that we can do on a wide range of topics around our software and around performance tuning for different processors.

To early access programs, looking at, you know, early Intel hardware, and to Intel Solution Services. This is providing consulting services to help you port and tune your applications. So it's all predicated on the fact that we know our processors well and that we want to make sure that you are getting the most performance.

So we want to make sure that you are getting the most performance possible out of the Intel processors for your applications on the Macs. And, you know, to kick this off and just to illustrate what I'm talking about here, I wanted to show our first demo. Elizabeth, if you want to.

Okay, what I have here is I'm going to be bringing up two windows. This is POV-Ray, the open source ray tracer, and it's showing an image that's being rendered. The right-hand side is showing that rendering under GCC 4.0.1, using the best options for the GNU compiler. And on the left-hand side -- oh, it already finished. Well, it's going to restart again.

Took about 20 seconds to run with the Intel C++ Compiler. This is version 9.1. And again, it's using the best options for the Intel C++ Compiler. And as you can see, it's about a 20% speedup in performance running on the Intel C++ Compiler versus the version compiled with the GCC Compiler.

So that was about -- so about 20 seconds for the ICC rendering, and I think I saw it around 29 seconds for the GCC rendering. So, you know, about a 25% or 30% performance speedup. And so if we go ahead and flip back to the slides, you know, that's -- so that's why we're here. We want to make sure that, you know, you don't leave any performance on the table, that you're getting the maximum that you can out of these Intel processor-based Macs.

And, you know, this -- the application we were showing is, you know, it's just one application. We're not saying that it's, you know, indicative of everything that you're going to run into. Your mileage is certainly going to vary from different applications and benchmarks. You know, but we just want to make sure that you have some tools available to make sure that you are getting that performance.

So, now moving into the software development tools themselves, there's really three guiding principles for us. Number one, as I mentioned, performance, just making use of the processor, threading capability is a good example of some of the new capabilities of the dual-core processors. Compatibility is another one. Now, we know that in order for you to adopt our tools or use our tools, it's got to be as easy as possible for you to use them in your development environment.

So we've worried a lot about integration into Xcode and complete interoperability with GCC and interoperability with GDB so that you can even mix and match binaries built with GCC and ICC, ICC being the Intel Compiler, and with very little to no changes to your Makefiles, to your existing Xcode projects.

So compatibility is extremely important to us. And then the third guiding principle is support. You know, we have software experts that are available for training and for support and just making sure that the experience with your tools is what you would expect. So that's why we're in the software business: performance, compatibility, and support.

Now here's the products. So first, the Intel compilers. That's our Intel Fortran as well as C++ compilers. The performance libraries, there's two that we're talking about. This is the Intel Integrated Performance Primitives, a set of highly optimized threaded functions dealing primarily with some multimedia applications. I'll get into a little bit more detail later. And then the Math Kernel Library for more scientific applications.

A new, what we're calling a new threading methodology for C++. Now, you notice here there's no name on the box. And, well, as I, you know, we just talked about Intel being a, you know, a hardware and a software company. Well, we have a lot of marketing and legal folks, too, and they haven't come up with a name yet. And they've been worried, trying to decide on one for several months. And, yeah, marketing and legal, you can't live with them, can't live without them, I guess. Anyway, it's all about, again, bringing the performance to the Mac OS X.

So, I'll go into a little bit more detail then on what the compilers are about here. So, first of all, how many people here are Fortran programmers? Okay, there's a, all right, there's, you know, someone waving his hand in the front row. Okay, so maybe about 20 or so. Okay, C++.

Okay, pretty much everybody else. All right. Let's see, how about Objective-C? Okay, a lot there, kind of in between, yeah. All right, okay, just wanted to get a feel for that. So our Fortran Compiler supports all the basic, all the, you know, the common standards now, or the standards now, except we have not implemented all of Fortran 2003.

That's going to be a while. There's some object-oriented features in Fortran 2003 that, you know, we're looking at. We don't know when we'll be able to get some of those implemented. But 77, 90, 95, and then some other common extensions that you see in a lot of scientific computing and various applications. The C++ Compiler is fully ANSI ISO compatible.

[Transcript missing]

Then the next set of optimizations are really oriented around multi-threading. So we'll get into a little bit more detail later in the presentation: auto-parallelization, OpenMP, the OpenMP standard for the multi-core. And then we also care a lot about making sure that for floating-point intensive applications, you have some ways of controlling precision versus optimization. And there are some models that you can specify there. Then there are two other tools that are in the compiler product package.

Our code coverage and test prioritization: these are tools that will use the profile-guided data, so the runtime feedback that we get through profile-guided optimization, to pinpoint functions and basic blocks within those functions that have been hit or touched by, you know, whatever your QA tests are, for example.

So, depending on the workloads that you provide your application in your testing, we can then provide you information about the coverage, how well you tested, as well as offer advice on which applications or which tests did better or worse, and help you with that prioritization. So, it's a little bit, it's an interesting use of the runtime feedback that we get.

And then finally, the Intel Debugger. It's a debugger that's GDB compatible, command-line compatible, and intended primarily for, and very good with, optimized and threaded applications. So, it's very good for debugging those types of applications. Now, a little bit more on the environment.

[Transcript missing]

And then finally, universal binaries. We are only an Intel processor targeting compiler. We don't support PowerPC. So for universal binaries, you can build the PowerPC version with GCC, build the Intel version with ICC, and then using either Xcode or lipo from the command line, you can build your universal binary that way. So we work out of Xcode. We are trying to be completely interoperable with GCC in the Xcode environment. And that's some of our guiding principles there.

And then this slide, so this is just an example of some more performance. We're looking at SPECint, SPECint_rate, and SPECfp, with the ICC 9.1 compiler on the Core Duo processor on Mac OS X. This is something we measured this past spring, compared to the GCC 4.0 compiler. And this is a comparison of what we call base optimization. And really what that means is the best optimizations that we can find on both compilers.

And in SPECint and SPECint_rate, it's about 50% faster. In SPECfp, it's about 47% faster. So, you know, again, this is just a benchmark. Your mileage is gonna vary, but this is, you know, something that we are concerned about, making sure that we maintain a performance lead and that you get the best performance possible for your applications. And this is an example of where we are today versus GCC. So that was the compilers. Excuse me. So the Intel Performance Libraries. So the two libraries... Excuse me. Yep.

So first, the Intel Integrated Performance Primitives. So as I mentioned, the libraries that we produce are intended for some specific application domains. And for IPP, or the Integrated Performance Primitives, these are highly tuned for the function domains of signal processing, digital imaging, cryptography, a whole host of different functional domains. And just like MKL, these libraries are already threaded.

And so at runtime, the libraries will determine what sort of processor you're running on, whether it's a Core Solo or Core Duo, and be able to implement the right thing. And also, even from the Core to the Core 2 microprocessor, it's the same thing. It can detect which one you're running on, so you don't have to worry about porting your application. The library at runtime will detect which processor and do the right thing there. Now, MKL is targeted for the scientific side.

It's a little bit more complex than the other libraries, and it's intended for scientific applications, you know, basic linear algebra libraries, sparse solvers, FFTs. Again, all of these are threaded. The team spends a lot of time on these; these guys just love assembly, and it's pretty hard to beat them. As we're going to see in our example, you know, even the compiler with full vectorization and optimization, it's pretty hard to beat, say, MKL in some examples here. So, you know, we'll see that.

And I said they're already threaded. So this is an example of just the scaling that you can see with MKL going from a single threaded to two threads on a Core Duo processor. And you can see that when you get into, this is for a matrix multiply, a double precision matrix multiply. And when you get into the higher matrix sizes, we get almost perfect scaling going from about one and a half gigaflops up to three.

So again, this is, the libraries provide you with a way to get arguably the best and optimal performance as well as threaded performance out of these applications. So with that, I want to go ahead and we'll do a demo of our integration into Xcode. And before we switch to the MacBook, oops, we already switched. Okay.

Uh... could we have the presentation back, please? Thank you. Okay. So before we go to the demo, we just wanted to show you what code we're working on. This is just a simple matrix multiply. You can see, at the top or in the middle, there's a triple nested loop where we're doing the matrix multiply. The arrays are sized about 800.

I think they're all 800. And what we're going to be doing is showing the compiler, or we're going to be showing some things with the Xcode integration. And then as we go through the presentation, we'll be optimizing this with the Compilers Vectorizer and MKL, and we'll see how we do as we go. Okay.
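The slide's source isn't captured in the transcript, but a minimal sketch of a triple-nested matrix multiply like the one described, with hypothetical names and the roughly 800-square sizes mentioned, looks something like this:

    #define N 800                          /* the arrays are about 800 x 800, per the talk */

    static double a[N][N], b[N][N], c[N][N];

    void matmul(void)
    {
        /* classic triple-nested loop: c = a * b */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }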

Great. Well, look at this. I brought up the Xcode, and I happen to have that same source that Joe just showed in the presentation. And I wanted to take a look at it. It is the same source. We'll have two different code paths here defined by an ifdef, the MKL. So I want to point this out. We'll be using it throughout the demo this afternoon. I want to first give you a little bit of overview of our integration with Xcode. And I'll go ahead and bring up the target window and take a look at that.

As you can see, we have the Intel C++ Compiler here under the collections. And we have several subsets, and I'll go through just a couple of them. First, there's the general, and here you can set your optimization level. I have it set for dash O2, which is optimized for speed. This is our default optimization.

If you were to compile at the command line and you didn't specify any optimization, you would get dash O2, which is optimized for speed. You can... compile with dash O0 with no optimizations, or O1 for size, or you can add O3, which is the maximized speed with high level optimizations. Now, I also wanted to show you the optimizations. We have a checkbox here that I'll be showing a little bit later in the presentation for parallelization. We also have the preprocessor, where we'll be setting our preprocessor definition for MKL as well.

And finally, one more I want to show you is the diagnostics reports that you can get from the compiler. There's an auto-parallelizer report, and I'll be showing that a little bit later. And then there's a report of vectorization diagnostics. Right now, I have that set to VEC report 2, and I'll be showing those configurations as well. So let's go ahead and run our application. We're going to bring up the build results window. Now, go ahead and do a clean just to make sure there's nothing lurking on the system.

I'm building at the default optimization level, so let me go ahead and do that build. And that build's done. Let's go ahead and run the application. So this is a small matrix multiply. And it doesn't take too long. There we go. It finishes in about 8.93 seconds, so about approximately nine seconds.

I see there's some messages that something didn't get vectorized. Yeah, we'll talk about that in a few minutes, Joe. So hold that thought for just a minute. Well, so are you going to be able to do better than 8.93 seconds, do you think? In fact, yes, I will be able to do better. Okay, okay. All right, well, just take that matrix multiplier. We ought to be able to do a little bit better. Okay. All right, thank you.

So let's go back to the slides. There we go. OK. All right, so now we're going to talk about just how we use the compilers in the libraries to get this performance. Now, the first thing is with the core architecture and the Core 2 architecture, the key for performance is, especially for computationally intensive applications, is utilizing the streaming SIMD extensions.

This is the instruction set architecture that consists of a number of instructions that deal with single instruction, multiple data types, as well as the operations therein, and as well as the register set that holds these data types. Now, the data types that we're talking about are anywhere from integers, from chars to short, int, long, as well as even up to 128-bit integer for some logical operations.

And then for floating point, what's encapsulated by the SSE instruction set and register set are the single precision or double precision floating point. So the challenge for the compiler is to convert your loops or different parts of your application into the SSE code, and that's what the process of vectorization is.

Now, there are other methods. And I think on the PowerPC, there's a lot of other methods; on PowerPC with AltiVec, you could use intrinsics or assembly. Both of those are available with the Intel compiler and on the iMacs today. However, the recommended way is using the intrinsics or the vectorizer. We don't want you writing in assembly unless you absolutely have to. It's just not a portable way to go. So let's look at how vectorization works.

You know, it's really, you know, the case where we want the compiler to do the work. And what you need to worry about with vectorization and what the compiler needs to worry about primarily is just, you know, how to break what's called memory, you know, just memory dependencies, you know, across iterations of the loops or, you know, just amongst the data that one sees or that the compiler sees in the loop.

So the compiler has to worry about, you know, iteration dependence, you know, is the data, is data written or read in one iteration that's used in the next? Or if there's, you know, memory disambiguation or pointer aliasing that will prevent, you know, that will cause the compiler to have to err on the conservative side and just assume that there's a dependence here and not be able to vectorize the loop. You know, we also have to look at, you know, whether there's enough work in the loop to... to warrant doing the vectorization. So there's some things that we have to... that the compiler has to worry about.
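As an illustration of the dependence analysis being described (not code from the session), a loop whose iterations are independent can be packed into SSE registers, while one that reads a value written on the previous iteration cannot:

    void independent(float *a, const float *b, const float *c, int n)
    {
        /* each iteration stands alone, so the compiler can process four
           floats per SSE instruction -- assuming a doesn't alias b or c,
           which is exactly the pointer question discussed below */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

    void carried(float *a, int n)
    {
        /* a[i] depends on a[i-1] from the previous iteration, so
           straightforward vectorization of this loop is not legal */
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1] * 0.5f;
    }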

You know, there's other things like, you know, well, fortunately with some of the transcendentals, so if you have calls to various intrinsics in your functions like sine, cosine, etc., you know, the compiler can also deal with those and vectorize those through what we call the... our small, our short vector math library that's part of the MKL that actually ships with the compiler.

So let's look at how you can understand what some of these dependencies are when you do see these un-vectorized messages like what we saw in the example. We have what we call our optimization reports that will give you different levels of detail from yes, it vectorized to no, it didn't, and here's exactly why. And we also have some pragmas or directives that you can give to the compiler to give it hints on what to do.

You know, particularly when you're dealing with C++ and dealing with pointer dereferencing or aliasing. You know, in some cases, it's just a reason to program in Fortran, you know. You just don't have to worry about pointers that way. But in reality, though, we do. And so you can use the IVDEP pragma with C++ or with Fortran, telling the compiler, you know, I know my code, don't worry about it, go ahead and vectorize my loop.
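A minimal sketch of that hint in C++ (ivdep is an Intel compiler pragma, with a corresponding !DIR$ IVDEP directive in Fortran; the names here are made up):

    void scatter_scale(double *dst, const double *src, const int *idx, int n)
    {
        /* the compiler can't prove dst[idx[i]] never overlaps another
           iteration's store, so it assumes a dependence; ivdep says
           "I know my code -- go ahead and vectorize" */
        #pragma ivdep
        for (int i = 0; i < n; i++)
            dst[idx[i]] = src[i] * 2.0;
    }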

And then to let the compiler disambiguate between pointer references, we can use the restrict keyword. And that will tell the compiler that the data to which this pointer points can only be accessed by this pointer. So it's just restricting the use of the pointer and ensuring there's no aliasing.
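And a sketch of restrict in the same spirit (restrict is the C99 keyword; with the Intel C++ compiler of that era you would typically enable it with its restrict-keyword switch, and other compilers spell it __restrict, so treat the exact spelling as an assumption):

    void saxpy(float * restrict y, const float * restrict x, float a, int n)
    {
        /* restrict promises that x and y never overlap, so the compiler
           is free to vectorize without worrying about aliasing */
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }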

Now in the Intel Compiler, as Elizabeth mentioned, the vectorizer is on by default. So -O2 is our default setting, and so the vectorizer will be turned on. You can set different optimization levels. -O1 is what we would call, you know, optimize for size or, you know, smaller binaries. We won't do much inlining, for example.

-O3 is where we'll do some higher-level loop transformations to maybe be able to better vectorize your code or better lay out your code for memory and cache usage. And then, of course, we have the advanced optimization switches of IPO, the inter-procedural optimization I mentioned earlier, the whole-program optimization, as well as profile-guided optimization, which is really intended for, and very useful for, branchy code or code that makes a lot of decisions.

So that's how the compiler does it, and we'll see the example of the vectorization here in a minute. For the libraries, it's pretty simple. You just replace whole sections of code with a call to the library, and you let the library do the work. So here is an example of a matrix multiply. Again, we just replace it with a call to the CBLAS DGEMM, the C version of the MKL entry point.
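A minimal sketch of that replacement (cblas_dgemm is MKL's C entry point for the double-precision matrix multiply; the array names are hypothetical):

    #include <mkl_cblas.h>                 /* MKL's CBLAS interface */

    void matmul_mkl(const double *a, const double *b, double *c, int n)
    {
        /* c = 1.0 * a * b + 0.0 * c, with all three matrices n x n, row-major */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, a, n,
                    b, n,
                    0.0, c, n);
    }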

And again, the libraries are going to arguably give you the best performance and, of course, the best portability across processors, as well as be already threaded for you. So with that, let's go ahead and see how we can do on, we were at, let's see, 8.93 when we last left. That's right, Joe. And we had this message here. Loop was not vectorized.

Existence of vector dependence. Wow. I wonder if we can do better in terms of a diagnostic. Well, it turns out that we can. Let me bring up the target window again in diagnostics. And instead of vec report 2, I'm going to go to vec report 3, which is going to give me some dependency information diagnostics as well for those loops that did not vectorize. Let me go ahead and close this and go back to my build results. And I'll do a clean because I didn't actually touch the code. And now let me build.

Oh, I've got another message now. Vector dependence: assumed anti-dependence between size2, line 27, and z, line 28. I think it's time to take a look at the code. So I'll bring that up here. And if you take a look at size2, it's here. And here's z that it's talking about. And then size2, oh, it happens to be a global variable. And we're passing in z. So the compiler is going to play it conservative.

It's not going to vectorize that loop. But hey, wait a second. I see an integer here that's locally defined. Let's change. Let's use z2 instead of size2 and see what happens. So let me go ahead over here and build. And of course, I want to save. OK, so then let's go ahead and run this now that I've made that change. Oh, look. Our message now is that we've got that loop vectorizing. So great. All right. All right. So let's go ahead and run. Good, good. That looks better.

So, 4.78 seconds. So once we vectorize that main loop in our application, we went from about 9 seconds to a little less than 5 seconds. Now, if you remember the code, we've got this set up so we could also call MKL instead of our hand-coded matrix multiply. So I'd like to try that now. So I'm going to go back to here, the targets, and I'm going to set the preprocessor to have MKL. Okay, and then go back here and do another clean and a build. And then let's run it now.

Oh, I think it finished on us. Scroll bar up. It did. Let me run it one more time so you believe me, it did run. There we go. 1.75 seconds. That's using the MKL library. So if we compare that with our vectorized version, we were at 4.78 seconds.

So using the MKL call for matrix multiply significantly sped up this small application. Wow. Okay. So from 8.93. All right. Excellent. Okay, so 8.93 down to a little less than five with the compiler's vectorization. Cut that in half, and then that was an amazing speed-up by MKL. Good job, library guys. Yeah. Okay, well, thank you very much. Let's see.
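A minimal sketch of the kind of change just demonstrated, with made-up names: the loop bound was a global, so the compiler had to assume a store through the pointer might modify it, and copying it into a local removes that assumed dependence.

    int size2 = 800;                       /* a global, as in the demo */

    void fill(double *z)
    {
        /* before: the compiler plays it conservative, assuming a store
           through z could also change size2, and reports an assumed
           anti-dependence instead of vectorizing */
        for (int i = 0; i < size2; i++)
            z[i] = (double)i;
    }

    void fill_vectorizable(double *z)
    {
        /* after: a local copy of the bound can't alias anything the
           loop writes, so the loop vectorizes */
        int z2 = size2;
        for (int i = 0; i < z2; i++)
            z[i] = (double)i;
    }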

So let's move on to the next topic then. So multi-threading for dual-core. So I'm going to talk about three things here-- auto parallelization, or what the compiler can do automatically for you for parallelism. Then we'll talk about the OpenMP, the OpenMP standard or directives that you can use. And then we'll talk about this unnamed parallel programming model.

So first, auto-parallelization. You know, this is kind of the holy grail of compiler technology, you know, where the compiler will do everything for you, and, you know, including parallelize your loop. We've, you know, over the last several decades, you know, probably back in the '70s, when vectorization in the supercomputing world, vectorization technology, you know, became prevalent and pretty well done, it's always been, you know, the hope that, you know, compilers could do just as well with auto-parallelization or just being able to parallelize the loops as well as a vectorizer can.

And, yeah, I'm not gonna say that our compiler is, has achieved this state, you know, but there are simple loops where the compiler can do this, as we'll see in our example here. And it's something that we're working very hard on. We are putting a lot of effort into our, you know, our parallelization technology, and we wanna know when you do have loops or applications that you think can be parallelized, we wanna know about that to help get the, help move the technology along. So we'll see some, see an example of this in a minute.

But OpenMP, though, is probably our recommended method of parallelizing applications. It's a very well-defined and well-known standard that's been developed over the last several decades. You can see the entire documentation of the standard and exactly how it works at www.openmp.org. But here's some examples of what it looks like.

How many here have used OpenMP? So there's a, okay, a few people. Very good. So it works for both Fortran and C++. And basically what you're doing is providing hints to the compiler to say what sections of code to parallelize. And then it also provides some different primitives, similar to what you would do in any threading program, about locking, defining critical sections, defining what data should be private, or shared amongst tasks. And it's a fairly simple language or set of directives that you can add to your application.

And underlying it is a runtime library that will do the scheduling and take care of all of that for you. So that was an example of the basic parallel operations. It also defines other operations like sum reductions or product reductions, a common loop construct, and I'll show you a little bit more about how that's used in various applications in a minute. As I said earlier, you know, there are constructs for critical sections or for defining what data you want to lock and when.
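The slide's code isn't in the transcript, but a minimal OpenMP sketch of the constructs being described, a parallel loop plus a sum reduction, looks something like this (hypothetical names; built with the compiler's OpenMP switch, otherwise the pragma is ignored):

    double sum_of_squares(const double *x, int n)
    {
        double sum = 0.0;

        /* the pragma is the hint: split the iterations across threads,
           keep i private to each thread, and combine the per-thread sums */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += x[i] * x[i];

        return sum;
    }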

[Transcript missing]

And if you notice here, we have some parallelization going on in the demo. Let me make sure it did. I was looking for one more message. Probably did not like me closing out everything like that. So let me make sure we got that set over here. There we go. We want that switch on as well, and I think I forgot to do that. So let's do one more time through this build.

There we go, that's what I wanted to see. Loop was auto-parallelized. So let's go ahead and run this now. - Now it also says that we vectorized the loop too, huh? - That's right, we vectorized our inner loop and we auto-parallelized the outer loop of our matrix multiply.

  • All right, that's just the way it's supposed to work, okay.
  • So let's go ahead and run that. 5.04 seconds. Now as I recall, it was 4.93 vectorizing. Hmm, I think I want to go out here to my X window and I want to actually time the run of it instead.

Okay, now we see that our real time was 2.55 seconds. So the clock function that we were using within Matrix Multiply is showing the time it took to execute on both threads combined, but our real clock time was only 2.55 seconds. So there's a little bit of difference there. Okay. Now I want to show you the MKL version. So let me get back over here and turn back on MKL.

So that's an important safety tip. When you are timing multi-threaded applications on the Core Duo or the dual-core processors, you need to think about what the system clock is reporting in terms of clock ticks versus the real user time. So that's what we showed here. What the system clock reported was the total of the clock ticks from both threads.

In reality, though, the user time or the real time was half or close to half from the actual performance. Okay, now I've built it to go to the Math Kernel Library routine. Now notice that we didn't get any vectorization or parallelization here because we're calling the Math Kernel Library. But back in our other routine, we did have some initialization of our data, and there we did get auto-parallelization and the vectorization taking place.

So let's go ahead and run. I'm going to go ahead and run out here because we already know we'll have that clock problem if we run it out of Xcode. Whoa, it's already done, but it says 1.89 seconds, and I think we were at 1.75 seconds. Hmm, okay. So what's going on here? Well, it turns out that by default, the Math Kernel Library is running single-threaded, because the Math Kernel Library does not want to conflict if you're using something else, like MPI, for doing your threading. So by default, it's single-threaded. But let me set a variable here.

And then let me run again. Oh, it's done, and it's just one second. So if you recall, our non-vectorized matrix multiply was about nine seconds. When we vectorized, I think it went to about five seconds. Our auto-parallelized version was about 2.6 seconds. And now when we threaded MKL, we get down to a second for our matrix multiply. Excellent. Wow, so nine seconds down to basically one second with MKL and the compiler. Excellent. I think that'll work. That'll work. Thank you, Elizabeth.
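On the timing tip from a moment ago, a small sketch of the difference using standard C++ (not the demo's code): clock() accumulates CPU time across every thread, so a well-scaled two-thread run reports roughly twice the elapsed wall-clock time.

    #include <chrono>
    #include <cstdio>
    #include <ctime>

    void timed_run(void (*work)(void))
    {
        std::clock_t c0 = std::clock();
        auto w0 = std::chrono::steady_clock::now();

        work();                            /* e.g. the threaded matrix multiply */

        std::clock_t c1 = std::clock();
        auto w1 = std::chrono::steady_clock::now();

        /* CPU seconds summed over all threads that did work */
        double cpu_s = double(c1 - c0) / CLOCKS_PER_SEC;
        /* elapsed wall-clock seconds -- what the user actually waits */
        double wall_s = std::chrono::duration<double>(w1 - w0).count();

        std::printf("cpu %.2fs, wall %.2fs\n", cpu_s, wall_s);
    }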

Okay, so one more thing to talk about for threading, and this is our new parallel programming model for C++. So what this is is a template-based runtime library. You know, we're providing the runtime library that manages scheduling and various other things, as we'll see, as well as the header files and et cetera that define all the constructs and the concepts here.

And what it allows you to do for C++ programs is to define your parallelism in terms of, you know, the data, or in terms of the tasks or patterns that you want to operate on that data, as opposed to, you know, worrying about just how to thread it or looking at more of a functional decomposition of your threading.

So, you know, it's similar to, like, the Standard Template Library. You know, if you look at the different iterators in there where you can operate on, you know, generic data types, it works a lot in the same way, as we'll see in a minute.

So it's not necessarily intended to provide, you know, greater performance, perhaps, than, say, OpenMP or the Pthreads, but what it is intended to do is help you with, you know, just greater productivity, ease of use, compared to Pthreads, as we'll see in these examples. And since you are programming in a data parallel manner, as opposed to functional, you know, that is how you get greater scalability, and you can get performance that way as well.

So this is just an example of one of the concepts that we've defined in the threading library. This is the parallel_for, fairly synonymous with the OpenMP parallel do loop or do constructs that we saw. But what this is doing is very similar, as I said, to the standard template library.

You define what data you want to work on, and then you think about, you know, the functions or the operations that you want to work on that data. And just let the parallel_for or the threading runtime library worry about the scheduling and how it's going to allocate the work on your data.
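This then-unnamed library later shipped as Intel Threading Building Blocks, so, as an assumption, a parallel_for over a one-dimensional range looks roughly like this in that interface (shown with the later lambda spelling rather than the 2006 functor style):

    #include <cstddef>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>

    void scale_all(float *a, std::size_t n, float factor)
    {
        /* describe the data as a 1-D range; the runtime library decides
           how to split it into chunks and schedule them on the cores */
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
            [=](const tbb::blocked_range<std::size_t> &r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    a[i] *= factor;
            });
    }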

[Transcript missing]

Now, this is, so that was in a one-dimensional iteration space. Now, if you think back again to the POV-Ray example, a quicker way to do, or a more optimal way to do the rendering is to break it up into larger chunks and just farm out more and more chunks or larger chunks of the image to be rendered by the different threads.

And to do that with POSIX, you're going to have to write a fair amount of code, but with the threading library, again, you let the runtime library do the work for you, because you've already defined this for the data sections or the data that you want the iterator to operate on.

In this case, the only difference between this and the one-dimensional iteration space was the use of this blocked range 2D, and that's where you tell the runtime library that you are iterating over a two-dimensional space. And there are other constructs or other concepts in the library, you know, for different data types and different dimensions, et cetera.
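Under the same assumption (the interface as it later shipped in Intel Threading Building Blocks), the two-dimensional version only swaps in blocked_range2d, and the runtime hands each task a rectangular block of the image:

    #include <cstddef>
    #include <tbb/blocked_range2d.h>
    #include <tbb/parallel_for.h>

    void render(unsigned char *pixels, std::size_t rows, std::size_t cols)
    {
        tbb::parallel_for(tbb::blocked_range2d<std::size_t>(0, rows, 0, cols),
            [=](const tbb::blocked_range2d<std::size_t> &r) {
                /* each task gets a rectangular block of pixels to render */
                for (std::size_t y = r.rows().begin(); y != r.rows().end(); ++y)
                    for (std::size_t x = r.cols().begin(); x != r.cols().end(); ++x)
                        pixels[y * cols + x] = 0;   /* shade pixel (x, y) here */
            });
    }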

Now, you know, again, what we're showing here, it's not necessarily, you know, we're not after, you know, necessarily a performance improvement, but a productivity savings and the ease of use that you can get from doing this. So let's go ahead and take a quick look at this Tachyon demo. You just saw it. You saw the code that we were showing, and I think Elizabeth will show us how it looks.

So what we're first going to do is we're going to run the serial version of this Tachyon demo. So let me go ahead and do the run. So notice how it's painting. It's just doing it line by line, and we'll see how it paints with the new parallel programming model in just a second, and we'll also get a timing here of how much this is taking for the serial version.

Okay, that's completed, and it took about 16 seconds. So there's a little pause in here, and then it'll start painting again. Now, notice how it's painting. It's doing it in blocks, and as it completes, now we get another timing. Oh, 7.25 seconds, so a significant speed-up using the new Intel Parallel Programming Model. Let me do this one more time, just the threaded version.

And notice again how it paints in blocks. It's doing the 2D, as Joe was showing the source for this up on the screen. And notice at the very end, it grabs an extra bit of work ahead of the block below it, and that's because of how the scheduling is done. So it's actually able to do a little bit of speed-up that way that isn't possible in the serial version. And I don't believe it's possible in the P-Threads version either. Right. So.

So again, certainly you can see the speed-up from threading going from a serial to the parallel version. There is an implementation of the tachyon using the POSIX threads, and it's a similar performance to what we see with the threading library. But you can see the productivity or the coding savings that you get using that.

Okay, now, so we've talked about, you know, the compilers and the basic functionality or the performance capabilities of the compilers, of the libraries, and the threading. So, you know, one last thing we want to talk about is, you know, just a little bit on how we are supporting the new Mac Pros, you know, utilizing the 64-bit Core 2 Duo processors.

So, you know, really, the way we're looking at this is, you know, we're maintaining the compatibility and providing interoperability with GCC. That's still our guiding principle here. You know, obviously, we are worrying about performance, and we're going to deliver that, and we'll show another example running on the 64-bit system in a minute.

But the way the compilers are laid out is, you know, we're going to offer one compiler binary for 32-bit, or for the core processors as, you know, on the iMacs today. And then there'll be a different binary for the 64-bit for the Mac Pros, and, you know, that we'll be coming out with.

We'll probably be starting a beta later in the year, and, you know, I'll give you some more details about that in a minute. So, separate binaries for 32-bit and 64-bit. So the 64-bit targeting compiler will have these switches, and this is just what you can see in GCC or with Xcode, you know, the Xcode that's on the Mac Pros that you can see down in the lab today.

There's an -m32 and an -m64 model switch to determine which model you're going to run in. Universal binaries, of course, are the same. You have the Intel compiler using either Xcode or lipo, you know, however you want to run it.

Now, one thing that we're providing in the Intel compilers that's proven, you know, very valuable for us in the past in helping people port applications to 64-bit: we implemented this WP64 option. You know, we kind of call it our code clean option. What that means is helping you, giving you diagnostics for when you're using, you know, maybe mismatched 32- and 64-bit pointers. You know, it's a common error that you will run into.

As you're porting, it will give you diagnostics for those types of situations and others. So it's very, you know, very useful. We found it very useful in the past. And then as far as the ABI for LP64 or longs and pointers being 64, you know, we're implementing the same ABI as what's been implemented in GCC.

And the key things to note here in the ABI, I think tomorrow in the brown bag by Dilip Bandekar, he'll go into more detail.

But the key thing for performance is really the, you know, not in 64-bitness, but in the fact that there's eight more XMM registers or the SIMD, single instruction multiple data registers on the core two versus what's on the core processor. So that allows us to do more things in the calling conventions. We have faster calling conventions. It gives the compiler greater flexibility for register allocation and optimization there. So, you know, that's really where you see, where you can see some significant performance benefits.
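As a sketch of the kind of mistake the code-clean (WP64) diagnostics mentioned above are meant to flag, made-up code rather than anything from the session: under LP64, longs and pointers grow to 64 bits while int stays 32, so squeezing pointer-sized values into an int silently truncates them.

    #include <cstddef>
    #include <cstdio>

    std::size_t count_bytes(const char *begin, const char *end)
    {
        /* suspicious under LP64: ptrdiff_t is 64-bit but int is still
           32-bit, so a large difference silently truncates -- exactly
           the mixed 32/64-bit usage the diagnostics point out */
        int small = (int)(end - begin);

        /* portable: keep pointer-sized arithmetic in a pointer-sized type */
        std::ptrdiff_t full = end - begin;

        std::printf("truncated %d, full %td\n", small, full);
        return (std::size_t)full;
    }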

Okay, so with that, I think we have one last demo. All right, I'm going to move over now to the Mac Pro, and what I've got this set up to do is run the POV bench. It's going to run in just a command-line version. We did try to do some work here at the conference to get it working with graphics, so you could see the same demo that you saw on the 32-bit earlier in the presentation, but we just had some technical problems and weren't able to do that for you.

But let me go ahead and run it over here on the right. And this is running GCC version 4.0.1. This is the Apple computer build of the GCC. And it's just running the POV bench benchmark scene. And it's taking 4.959 seconds. And over here... We're running the Intel C++ Compiler for Mac OS, and it's taking about 4.144. I want to point out, this is our 64-bit compiler. You can see it's version X.X, so this is a pre-beta compiler that we're showing here. And it will go into beta later this fall. Okay, so about a 20% performance difference.

But again, these are early copies of both compilers for the 64-bit. And as I mentioned, it will be going to beta with the Intel Compiler and the Intel Libraries later in the year. And I'll give you a website for how you can get in that. So, wow, very good. Thank you very much, Elizabeth.

So I think we're, yeah, we're at the end here, and I just want to summarize what we've done. So we've given you an introduction to the software development products, the compilers, libraries, our new threading library. And we've talked about, you know, just, you know, why Intel is in the software business and why it's important for us.

And also showed why it's important for you to be using the best tools possible to get the maximum performance out of your applications running on the Intel processor-based Macs. You know, we also talked about, you know, some techniques for you to thread your applications using AutoParallel, OpenMP, and this new threading runtime library. And then, and we talked also about the guiding principles, you know, the, obviously performance, but we want to maintain that compatibility.

With the development environment that you use today, you know, Xcode, interoperability with GCC, et cetera. And then, you know, the other guiding principle is that, you know, we back all this up with, you know, with expert support and training. So we want to make sure that you have what you need to, you know, to be successful. So this is our website.

One more thing, I just want to let you know that, as I think I said earlier, several of us, including several engineers, are in the performance lab downstairs on the first floor. And we'll be here through, you know, all the way until 10 o'clock tonight, or, well, later if you want to buy a beer, but that's a... And then tomorrow morning, there's another talk by some of our performance engineers looking at a case study of some performance tuning they did for the Macs. And that'll be a very interesting talk tomorrow at 10:30. We mentioned Dilip Bandekar is giving a brown bag tomorrow at lunch. And then, you know, we also want you to go to our website and sign up for future betas.

We don't have any dates that we could give you now about our beta program for the 64-bit tools. But you could go here and put your name down, and we'll be able to get in touch with you. And then one final thing I didn't put on the slide: for attendees at WWDC, we do have some coupons for the compilers and libraries, if you're interested, for a 50% discount. So you can come up and see us afterwards.