
WWDC06 • Session 312

Using Intel Software Development Products for Mac OS X

Development Tools • 59:03

The Intel Fortran Compiler and Intel C++ Compiler work with the Intel Math Kernel Library and Intel Integrated Performance Primitives to help produce high-performance scientific and media applications. Discover how these tools integrate with Xcode, allowing you to produce code that takes advantage of the multi-core features of the Intel Core Duo processor. Watch live as a sample application is taken through the steps needed to become fully vectorized, connected with key performance routines, and optimized.

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and may contain transcription errors.

Good afternoon. I'm Joe Wolfe. I'm a manager of the Intel Compiler Software, sorry about that, Technical Consulting and Support Team at Intel. And I'm Elizabeth Schneider, and I'm a Technical Consulting Engineer and Manager in the Intel Compiler Support Team. And welcome to our session. This is Using the Intel Software Development Products for Mac OS X. I just wanted to say that we are absolutely thrilled to be here. It's our first time attending WWDC, and it's just been really exciting to see what you're doing with the new Intel-based Macs and the progress that you've made. This has just been fabulous for us to see. I'd also like to thank our hosts, several people from Apple that we work with. Thank you for inviting us to the conference and allowing us to speak here. As I said, it's just been a wonderful experience.

But let's go ahead and just see what it is that we're going to be talking about today. So first, I want to give you an understanding of the software development products for the Mac OS X that we have. And I also want to spend a little bit of time just explaining why Intel is in the software business at all. So we're going to spend a little time there. Number two, we're going to talk about how you can use these tools to build high-performance software. So looking at the unique performance capabilities of the tools and just how you can use them in your software. And then thirdly, we're going to look at multi-threading. With the dual-core capabilities of the Intel Core Duo, Core 2 Duo processors, we want to make sure that you understand some of the threading capabilities that we have in the tools and some of the techniques that you can use to take advantage of that.

Now the way this is going to work, I'm going to be up here talking and hand-waving and doing some things, and Elizabeth is going to be running demos. She's going to keep me honest, showing you that I'm just not making this stuff up. So we're going to have a bunch of demos of the different techniques and the things that we're showing you here. At the end of this, we should have plenty of time for Q&A, and we're looking forward to interacting with you some more. We are also in the performance lab downstairs. There's Elizabeth and I, as well as a number of other people from the development team and other tools teams, as well as some application engineers that have had a lot of experience helping port and tune applications on the Intel-based Macs. So we're looking forward to talking with you some more there.

Now, to start, I have a question. How many of you knew that Intel is a software company? Wow, okay, that's amazing, a software company in Intel. I wasn't sure if, yeah, I just wasn't sure what the reaction was going to be to that. Now, how many of you knew that we have, you know, Intel compilers, libraries, or software development products? Okay, very good, very good. Okay, that's, I was, I had no idea what the response to that question was going to be.

That's great to hear that you've heard about us. But still, I want to give you a little bit of background on just why we're here, why software is important to Intel, and why we're in the software tools business. We know our processors very well, and we want to make sure that you get the maximum performance on your application utilizing Intel processors. We don't want you to leave any performance on the table when you're developing your applications, because performance translates into a better user experience and just a more pleasant experience with the Macs. So we want to make sure that you have the tools and the support and services that you need to do just that. We've been in the software business for quite a number of years, developing these tools for just that reason. We've also put together a whole host of other services around not just the software products, but around the tuning and porting experience that we have. There's the Intel Software Network, which is a web-based community for developers consisting of various user forums and blogs, just trying to bring software developers together around Intel products and platforms. The Intel Software College, which offers training courses on a wide range of topics around our software and around performance tuning for different processors. Early access programs, for looking at early Intel hardware. And Intel Solution Services, which provides consulting services to help you port and tune your applications.

So it's all predicated on the fact that we know our processors well, and that we want to make sure that you are getting the most performance possible out of the Intel processors for your applications on the Macs. And to kick this off and just to illustrate what I'm talking about here, I want to show our first demo. Elizabeth, if you want to-- Okay, what I have here is I'm going to be bringing up two windows. This is POV-Ray, the open-source ray tracer, and it's showing an image that's being rendered. On the right-hand side is that rendering under GCC 4.0.1, using the best options for the GNU compiler. And on the left-hand side -- oh, it already finished. Well, it's going to restart again. It took about 20 seconds to run with the Intel C++ compiler. This is version 9.1, and again, it's using the best options for the Intel C++ compiler.

And as you can see, it's about a 20% speedup in performance running with the Intel C++ compiler versus the version compiled with the GCC compiler. So that was about 20 seconds for the ICC rendering, and I think I saw it around 29 seconds for the GCC rendering. So about a 25% or 30% performance speedup. And so if we go ahead and flip back to the slides: that's why we're here. We want to make sure that you don't leave any performance on the table, that you're getting the maximum that you can out of these Intel processor-based Macs. And the application we're showing is just one application. We're not saying that it's indicative of everything that you're going to run into. Your mileage is certainly going to vary across different applications and benchmarks. But we just want to make sure that you have some tools available to make sure that you are getting that performance.

So now moving into the software development tools themselves, there are really three guiding principles for us. Number one, as I mentioned, performance: just making use of the processor. Threading capability is a good example of some of the new capabilities of the dual-core processors. Compatibility is another one. We know that in order for you to adopt our tools or use our tools, it's got to be as easy as possible for you to use them in your development environment. So we've worried a lot about integration into Xcode, complete interoperability with GCC, and interoperability with GDB, so that you can even mix and match binaries built with GCC and ICC, ICC being the Intel compiler, with very little to no changes to your makefiles or your existing Xcode projects. So compatibility is extremely important to us. And then the third guiding principle is support. We have software experts that are available for training and for support, just making sure that the experience with your tools is what you would expect. So that's why we're in the software business: performance, compatibility, and support.

Now here's the products. So first, the Intel compilers. That's our Intel Fortran as well as C++ compilers. Then the performance libraries; there are two that we're talking about. The first is the Intel Integrated Performance Primitives, a set of highly optimized, threaded functions dealing primarily with multimedia applications; I'll get into a little bit more detail later. And then the Math Kernel Library for more scientific applications.

And then a new, what we're calling a new threading methodology for C++. Now, you notice here there's no name on the box. Well, we just talked about Intel being a hardware and a software company. Well, we have a lot of marketing and legal folks too, and they haven't come up with a name yet. They've been worried about trying to decide on one for several months. Marketing and legal: can't live with them, can't live without them, I guess. But anyway, it's all about, again, bringing the performance to Mac OS X.

So I'll go into a little bit more detail then on what the compilers are about here. So first of all, how many people here are Fortran programmers? Okay, there's, all right, there's, you know, someone waving his hand in the front row. Okay, so maybe about 20 or so. Okay, C++? Okay, pretty much everybody else. All right. Let's see, how about Objective-C?

Okay, a lot there, kind of in between, yeah. All right, okay, just wanted to get a feel for that. So our Fortran compiler supports all the common standards now, except we have not implemented all of Fortran 2003. That's going to be a while. There are some object-oriented features in Fortran 2003 that we're looking at, and we don't know when we'll be able to get some of those implemented. But 77, 90, 95, and then some other common extensions that you see in a lot of scientific computing and various applications. The C++ compiler is fully ANSI/ISO compatible.

And the way that the compilers really deliver this performance, like what we saw in the POV-Ray example, is through some of our advanced optimizations. Probably the most important one on the Core Duo and Core 2 processors is vectorization: taking advantage of the SSE, or Streaming SIMD Extensions, instruction set architecture of these processors. We also have some other optimizations called whole-program optimization, where we're looking at inlining and optimization across functions and across files, and profile-guided optimization, where we can use runtime feedback to guide the optimization decisions that the compiler makes. So it's kind of a three-step compile, where you compile to instrument your code, run it to collect some data on what was actually executed, and then recompile using that data to guide what decisions or what optimizations the compiler will apply and where to apply them.
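The three-step profile-guided compile described here maps roughly onto the compiler's PGO switches. This is only a sketch: the flag spellings below follow the documented icc PGO options of that era, and the file names and workload are made up for illustration.

```shell
# 1. Build an instrumented binary that records execution counts
icc -prof-gen -prof-dir ./profdata -O2 -o app src.c

# 2. Run it on a representative workload to collect the profile
./app typical_workload.dat

# 3. Recompile, letting the collected data guide optimization decisions
icc -prof-use -prof-dir ./profdata -O2 -o app src.c
```

The quality of the result depends on step 2: the workload should exercise the same paths your users will.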

Then the next set of optimizations are really oriented around multi-threading, so we'll get into a little bit more detail later in the presentation: auto-parallelization, and OpenMP, the OpenMP standard for multi-core. And then we also care a lot about making sure that for floating-point-intensive applications, you have some ways of controlling precision versus optimization, and there are some models that you can specify there. Two other tools in the compiler product are code coverage and test prioritization. These are tools that use the profile-guided data, the runtime feedback that we get through profile-guided optimization, to pinpoint functions, and basic blocks within those functions, that have been hit or touched by, say, whatever your QA tests are. So depending on the workloads that you provide your application in your testing, we can then provide you information about the coverage, how well you tested, as well as offer advice on which tests did better or worse and help you with that prioritization. So it's an interesting use of the runtime feedback that we get. And then finally, the Intel debugger. It's a GDB-compatible, command-line-compatible debugger, and it's particularly good with optimized and threaded applications, so it's very good for debugging those types of applications. Now, a little bit more on the environment.

So we released the 9.1 compiler this past spring, in April. It supports Tiger 10.4.4 and Xcode 2.3 currently. Whenever there is a new Xcode release, we're going to evaluate it as quickly as we can, and we'll make whatever updates are needed as soon as we can. As far as the integration goes, we'll be showing you an example in a moment. The C++ compiler is fully integrated into the IDE, and Elizabeth will be showing us that in a minute. Fortran we're still calling a preview feature. There are some issues around Fortran file types and syntax that we're working on with Apple, so it's not completely integrated there, but it is fully usable from the command line as well. Now, earlier on in the guiding principles, I said compatibility is a big deal. So this is what I mean by that, or what it means to the compiler. In terms of compatibility with GCC and G++, we intend to be command-line compatible and as source compatible as possible. There are some extensions added to GCC sometimes that we may not support, but we're going to evaluate those as we run into them, or as you see them, and review that. But primarily, we want to be binary compatible, so you can mix and match objects, libraries, et cetera, built with either compiler.

Now, we don't support Objective-C. So that's something that we're looking at. We don't have plans right now. But we are concerned about binary compatibility with Objective-C so that you can mix and match binaries built with GCC for Objective-C and with ICC. And if you're running out of Xcode, actually, if we detect an Objective-C file, we can invoke GCC on that file as well.

And then finally, universal binaries. You know, we are only an Intel processor targeting compiler. We don't support PowerPC. So for universal binaries, you can build the PowerPC version with GCC, build the Intel version with ICC, and then using either Xcode or LiPo from the command line, you can build your universal binary that way. So we work out of Xcode. We are trying to be completely interoperable with GCC in the Xcode environment. And that's some of our guiding principles there.
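The two-compiler universal-binary workflow Joe describes can be sketched at the command line. The flags follow Apple GCC's `-arch` convention and `lipo`'s documented usage; the file names are made up.

```shell
# PowerPC half with GCC, Intel half with ICC (ICC targets Intel only)
gcc -arch ppc -O2 -o myapp.ppc main.c
icc -O2 -o myapp.x86 main.c

# Glue the two thin binaries into one universal binary
lipo -create myapp.ppc myapp.x86 -output myapp
```

Xcode can drive the same process for you; `lipo` is the command-line route.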

And then this slide, so this is just an example of some more performance. We're looking at SPECint_rate and SPECfp with the ICC 9.1 compiler on the Core Duo processor on Mac OS X. This is something we measured this past spring, compared to the GCC 4.0 compiler. This is a comparison of what we call base optimization, and really what that means is the best optimizations that we can find on both compilers. On SPECint_rate, it's about 50% faster, and on SPECfp, it's about 47% faster.

So again, this is just a benchmark. Your mileage is going to vary. But this is something that we are concerned about, making sure that we maintain a performance lead and that you get the best performance possible for your applications. And this is an example of where we are today versus GCC. So that was the compilers. Next, the Intel performance libraries; there are two of them.

So first, the Intel Integrated Performance Primitives. As I mentioned, the libraries we produce are intended for some specific application domains. And IPP, the Integrated Performance Primitives, is highly tuned for the function domains of signal processing, digital imaging, cryptography, a whole host of different functional domains. And just like MKL, these libraries are already threaded. So at runtime, the libraries will determine what sort of processor you're running on, whether it's a Core Solo or Core Duo, and dispatch the right code. It's the same thing going from the Core to the Core 2 processor: the library can detect which one you're running on, so you don't have to worry about porting your application.

It's just that the library at runtime will detect which processor and do the right thing there. Now MKL is targeted at scientific applications: basic linear algebra routines, sparse solvers, FFTs. Again, all of these are threaded. The team spends a lot of time on this; these guys just love assembly, and it's pretty hard to beat them. As we're going to see in our example, even with the compiler doing full vectorization and optimization, it's pretty hard to beat, say, MKL in some examples here. So we'll see that.

And I said they're already threaded. So this is an example of just the scaling that you can see with MKL going from a single threaded to two threads on a core duo processor. And you can see that when you get into, this is for a matrix multiply, a double precision matrix multiply. And when you get into the higher matrix sizes, we get almost perfect scaling going from about one and a half gigaflops up to three. So again, the libraries provide you with a way to get arguably the best and optimal performance as well as threaded performance out of these applications. So with that, I want to go ahead and we'll do a demo of our integration into Xcode. And before we switch to the MacBook -- oops, we already switched. Can we have the presentation back, please?

Thank you. OK. So before we go to the demo, we just wanted to show you what code we're working on. This is just a simple matrix multiply. You can see at the top, or in the middle, a triple-nested loop where we're doing the matrix multiply. The arrays are sized about 800; I think they're all 800. What we're going to be doing is showing some things with the Xcode integration, and then as we go through the presentation, we'll be optimizing this with the compiler's vectorizer and with MKL. And we'll see how we do as we go.

Great. Well, look at this. I brought up Xcode, and I happen to have that same source that Joe just showed in the presentation. And I wanted to take a look at it. It is the same source. We have two different code paths here, defined by an ifdef on MKL. So I want to point this out; we'll be using it throughout the demo this afternoon. I want to first give you a little bit of an overview of our integration with Xcode. I'll go ahead and bring up the target window and take a look at that. As you can see, we have the Intel C++ compiler here under the collections, and we have several subsets, and I'll go through just a couple of them. First there's General, and here you can set your optimization level. I have it set for -O2, which is optimize for speed. This is our default optimization: if you were to compile at the command line and you didn't specify any optimization, you would get -O2, which is optimize for speed. You can compile with -O0 for no optimizations, or -O1 for size, or you can use -O3, which is maximize speed with high-level optimizations. Now I also wanted to show you the optimizations panel. We have a checkbox here that I'll be showing a little bit later in the presentation for parallelization. We also have the preprocessor section, where we'll be setting our preprocessor definition for MKL as well. And finally, one more I want to show you is the diagnostic reports that you can get from the compiler. There's an auto-parallelizer report, and I'll be showing that a little bit later. And then there's a report of vectorization diagnostics. Right now I have that set to VecReport2, and I'll be showing those configurations as well.

So let's go ahead and run our application. We're going to bring up the build results window. I'll go ahead and do a clean first, just to make sure there's nothing lurking on the system. I'm building at the default optimization level, so let me go ahead and do that build. And that build's done. Let's go ahead and run the application.
So this is a small matrix multiply, and it doesn't take too long. There we go. It finishes in about 8.93 seconds, so approximately nine seconds.

I see there's some messages that something didn't get vectorized. Yeah, we'll talk about that in a few minutes, Joe. So hold that thought for just a minute. Well, so are you going to be able to do better than 8.93 seconds, do you think? In fact, yes, I will be able to do better. Okay, okay. All right, well, because if we can just take that matrix multiply, we ought to be able to do a little bit better. Okay. All right, thank you.

So let's go back to the slides. There we go. OK. All right, so now we're going to talk about just how we use the compilers and the libraries to get this performance. Now, the first thing is that with the Core architecture and the Core 2 architecture, the key to performance, especially for computationally intensive applications, is utilizing the Streaming SIMD Extensions. This is the instruction set architecture that consists of a number of instructions that deal with single-instruction, multiple-data types, the operations on them, and the register set that holds these data types.

Now the data types that we're talking about are anywhere from integers, from chars to shorts, ints, and longs, even up to a 128-bit integer for some logical operations; and then for floating point, what's encapsulated by the SSE instruction set and register set is single-precision or double-precision floating point. So the challenge for the compiler is to convert your loops, or different parts of your application, into SSE code. And that's what the process of vectorization is. Now, there are other methods. On the PowerPC with AltiVec, you could use intrinsics or assembly, and both of those are available with the Intel compiler on the Intel-based Macs today. However, the recommended way is using the intrinsics or the vectorizer. We don't want you writing assembly unless you absolutely have to; it's just not a portable way to go. So let's look at how vectorization works.

It's really the case where we want the compiler to do the work. And what you need to worry about with vectorization and what the compiler needs to worry about primarily is just how to break what's called memory dependencies across iterations of the loops or just amongst the data that one sees or that the compiler sees in the loop. So the compiler has to worry about iteration dependence. Is data written or read in one iteration that's used in the next?

Or if there's memory disambiguation or pointer aliasing that will cause the compiler to have to err on the conservative side and just assume that there's a dependence here and not be able to vectorize the loop. We also have to look at whether there's enough work in the loop to warrant doing the vectorization. So there's some things that the compiler has to worry about. There's other things like-- well, fortunately, with some of the transcendentals-- so if you have calls to various intrinsics in your functions, like sine, cosine, et cetera, the compiler can also deal with those and vectorize those through what we call our short vector math library. That's part of the MKL that actually ships with the compiler.

So let's look at how you can understand what some of these dependencies are when you do see these un-vectorized messages like what we saw in the example. We have what we call our optimization reports that will give you different levels of detail from yes it vectorized to no it didn't and here's exactly why.

And we also have some pragmas, or directives, that you can give to the compiler to give it hints on what to do, particularly when you're dealing with C++ and dealing with pointer dereferencing or aliasing. In some cases, that's just a reason to program in Fortran; you just don't have to worry about pointers that way. But in reality, though, we do.

And so you can use the ivdep pragma with C++ or with Fortran, telling the compiler: I know my code, don't worry about it, go ahead and vectorize my loop. And then, to let the compiler disambiguate between pointer references, we can use the restrict keyword. That tells the compiler that the data to which this pointer points can only be accessed through this pointer. So it's just restricting the use of the pointer and ensuring there's no aliasing.

Now in the Intel compiler, as Elizabeth mentioned, the vectorizer is on by default: -O2 is our default setting, and so the vectorizer will be turned on. You can set different optimization levels, from -O1, which is what we would call optimize for size, or smaller binaries (we won't do much inlining, for example), to -O3, where we'll do some higher-level loop transformations to maybe be able to better vectorize your code, or better lay out your code for memory and cache usage. And then, of course, we have the advanced optimization switches: IPO, the interprocedural optimization I mentioned earlier; whole-program optimization; as well as profile-guided optimization, which is really intended for, or very useful for, branchy code, or code that has to make a lot of decisions.

So that's how the compiler does it, and we'll see the example of the vectorization here in a minute. For the libraries, it's pretty simple. You just replace whole sections of code with a call to the library, and you let the library do the work. So here is an example of a matrix multiply.

Again, we just replace it with the call to cblas_dgemm, the C version of the MKL entry point. And again, the libraries are arguably going to give you the best performance, and of course the best portability across processors, as well as being already threaded for you.
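For reference, replacing the hand-coded triple loop with the BLAS call looks roughly like this. This is a sketch only: it follows the standard CBLAS `cblas_dgemm` argument order but needs MKL (or any CBLAS implementation) at link time, and the wrapper name is mine.

```c
/* #include <mkl_cblas.h>  -- or <cblas.h> with another CBLAS library */

/* C = A * B for n-by-n row-major matrices: one call replaces the loop nest.
   Arguments: order, transA, transB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc. */
void matmul_blas(int n, const double *A, const double *B, double *C) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                     B, n,
                0.0, C, n);
}
```

The `alpha`/`beta` scalars generalize the call to C = alpha*A*B + beta*C; with 1.0 and 0.0 it is a plain multiply.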

So with that, let's go ahead and see how we can do. We were at, let's see, 8.93 when we last left. That's right, Joe. And we had this message here: loop was not vectorized, existence of vector dependence. Wow, I wonder if we can do better in terms of a diagnostic. Well, it turns out that we can. Let me bring up the target window again, and in diagnostics, instead of VecReport2, I'm going to go to VecReport3, which is going to give me some dependency information diagnostics as well, for those loops that did not vectorize. Let me go ahead and close this and go back to my build results, and I'll do a clean, because I didn't actually touch the code, and now let me build. Oh, I've got another message now: vector dependence, assumed anti-dependence between size2 at line 27 and z at line 28. I think it's time to take a look at the code. So I'll bring that up here, and if you take a look, size2 is here, and here's the z that it's talking about. And size2, oh, it happens to be a global variable, and we're passing in z. So the compiler is going to play it conservative; it's not going to vectorize that loop. But hey, wait a second. I see an integer here that's locally defined.

Let's change it; let's use sz2 instead of size2 and see what happens. So let me go ahead over here and build. And of course I want to save. Okay. So then let's go ahead and run this now that I've made that change. Oh, look at the messages: now we've got that loop vectorizing. So, great. All right, so let's go ahead and run. Good, good, that looks better. So 4.78 seconds. Once we vectorized that main loop in our application, we went from about nine seconds to a little less than five seconds. Now, if you remember the code, we've got this set up so we could also call MKL instead of our hand-coded matrix multiply. So I'd like to try that now. I'm going to go back here, to the targets, and I'm going to set the preprocessor to define MKL. Okay, and then go back here and do another clean and a build. And then let's run it now.

Oh, I think it finished on us. Scroll the bar up. It did. Let me run it one more time so you believe me. It did run. There we go: 1.75 seconds. That's using the MKL library. So if we compare that with our vectorized version, we were at 4.78 seconds. So using the MKL call for matrix multiply significantly sped up this small application. Wow. OK. So from 8.93-- All right. Excellent. Okay, so 8.93 down to a little less than 5 with the compiler's vectorization, cut that in half, and then that was an amazing speedup by MKL. Good job, library guys. Okay, well, thank you very much. Let's see.

So let's move on to the next topic then. So multi-threading for dual core. So I'm going to talk about three things here, auto parallelization or what the compiler can do automatically for you for parallelism. Then we'll talk about the OpenMP, the OpenMP standard or directives that you can use. And then we'll talk about this unnamed parallel programming model.

So first, auto-parallelization. This is kind of the holy grail of compiler technology, where the compiler will do everything for you, including parallelizing your loops. Over the last several decades, probably back to the '70s, when vectorization technology became prevalent and pretty well done in the supercomputing world, it's always been the hope that compilers could do just as well with auto-parallelization, just being able to parallelize loops as well as a vectorizer can. And I'm not going to say that our compiler has achieved this state. But there are simple loops where the compiler can do this, as we'll see in our example here. And it's something that we're working very hard on. We are putting a lot of effort into our parallelization technology, and when you do have loops or applications that you think can be parallelized, we want to know about that, to help move the technology along. So we'll see an example of this in a minute.

But OpenMP, though, is probably our recommended method of parallelizing applications. It's a very well-defined and well-known standard that's been developed over a number of years. You can see the entire documentation of the standard, and exactly how it works, at www.openmp.org. But here's some examples of what it looks like. How many here have used OpenMP? So there's a, okay, a few people, very good. So it works for both Fortran and C++, and basically what you're doing is providing hints to the compiler to say what sections of code to parallelize. It also provides some different primitives, similar to what you would use in any threading program, for locking, defining critical sections, and defining what data should be private or shared amongst tasks. It's a fairly simple set of directives that you can add to your application, and underlying it is a runtime library that will do the scheduling and take care of all of that for you. So that was an example of a parallel loop, the basic operations. The standard also defines other operations like sum reductions or product reductions, a common loop construct in various applications. And as I said earlier, there are constructs for critical sections, or for defining what data you want to lock and when.

And then we also have defined, and this is something where we've extended the standard, what we call work queuing, or task queues. In C and C++, it's common to do a lot of operations on linked lists, or to operate on data in a recursive manner. So we've defined the task queuing construct so that you can do that. Basically, all you're doing is telling the compiler and the OpenMP runtime library that we're operating on this list, or this set of data, in a recursive manner; you also tell it what needs to be shared or what needs to be private, and then it will worry about the underlying details. So that's just a quick example of what OpenMP looks like, what auto-parallelization looks like, and what the compiler can offer you there. We already mentioned that the libraries are fully threaded. So with that, let's go ahead and go back to our matrix multiply demo. And let's see, we were at 4.5 or so with the compiler and 1.7 with MKL. So let's see if we can go even further. All right. Well, the first thing I want to do is try auto-parallelization of our hand-coded matrix multiply. So I need to go back to our targets window and, just for a few moments, take off the MKL preprocessor definition, so we'll go through our hand-coded matrix multiply. I'll close this. Whoa. No, not that much. Sorry about that. Hit the wrong button.

All right, nice of it to come back. Let's bring up the build results window. I also want to set a diagnostic, so let me go ahead and turn some more diagnostics on. If you recall, there are the auto-parallelization diagnostics; I'm going to set this to par-report 2, which is the auto-parallelizer report plus loops unsuccessfully parallelized. Now I'll just close that, go back to the build results, and do a clean and a build.
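Outside of Xcode, the equivalent command line would look something like the following. The option spellings are from Intel compiler documentation of the period and should be checked against your compiler version; this is a configuration sketch, not the exact invocation used in the demo.

```shell
# -parallel enables the auto-parallelizer; -par-report2 reports both
# the loops that were auto-parallelized and those that were not.
icc -O2 -parallel -par-report2 matmul.c -o matmul
```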

And if you notice here, we have some parallelization going on in the demo. Let me make sure it did; I was looking for one more message. It probably did not like me closing out everything like that, so let me make sure we got that set over here. There we go. We want that switch on as well, and I think I forgot to do that. So let's go one more time through this build.

There we go, that's what I wanted to see: loop was auto-parallelized. So let's go ahead and run this now. - Now it also says that we vectorized the loop too, huh? - That's right, we vectorized our inner loop and we auto-parallelized the outer loop of our matrix multiply. - All right, that's just the way it's supposed to work. - So let's go ahead and run that.

5.04 seconds. Now, as I recall, it was 4.93 with vectorization. I think I want to go out here to my X window and actually time the run instead. Okay, now we see that our real time was 2.55 seconds. So the clock function that we were using within matrix multiply is showing the time it took to execute on both threads combined, but our real clock time was only 2.55 seconds.

So there's a little bit of difference there. Okay, now I want to show you the MKL version, so let me get back over here and turn MKL back on. So that's an important safety tip: when you are timing multi-threaded applications on the Core Duo or the dual-core processors, you need to think about what the system clock is reporting in terms of clock ticks versus the real user time. That's what we showed here: what the system clock reported was the total of the clock ticks from both threads, while in reality the user time, or real time, was just half, or close to half, of that. Okay, now I've built it to go to the Math Kernel Library routine. Notice that we didn't get any vectorization or parallelization here, because we're calling the Math Kernel Library. Back in our other routine we did have some initialization of our data, and there we did get the auto-parallelization and the vectorization taking place. So let's go ahead and run. I'm going to run out here, because we already know we'll have that clock problem if we run out of Xcode. Whoa, it's already done, but it says 1.89 seconds, and I think we were at 1.75 seconds before. So what's going on here? Well, it turns out that, by default, the Math Kernel Library runs single-threaded, because it does not want to conflict if you're using something like MPI for your parallelism. So by default, it's single-threaded. But let me set a variable here.

And then let me run again. Oh, it's done, and it's just one second. So if you recall, our non-vectorized matrix multiply was about nine seconds. When we vectorized, I think it went to about five seconds. Our auto-parallelized version was about 2.6 seconds. And now, with threaded MKL, we get down to a second for our matrix multiply. Excellent. Wow, so nine seconds down to basically one second with MKL and the compiler, excellent. I think that'll work. Thank you, Elizabeth.

Okay, so one more thing to talk about for threading, and this is our new parallel programming model for C++. What this is is a template-based runtime library. We're providing the runtime library that manages scheduling and various other things, as we'll see, as well as the header files that define all the constructs and concepts here. What it allows you to do for C++ programs is define your parallelism in terms of the data, and the tasks or patterns that you want to operate on that data, as opposed to worrying about just how to thread it, or looking at more of a functional decomposition of your threading. So it's similar to the Standard Template Library.

If you look at the different iterators in there, where you can operate on generic data types, it works a lot in the same way, as we'll see in a minute. It's not necessarily intended to provide greater performance than, say, OpenMP or Pthreads; what it is intended to do is give you greater productivity and ease of use compared to Pthreads, as we'll see in these examples. And since you are programming in a data-parallel manner, as opposed to a functional one, that is how you get greater scalability, and you can get performance that way as well.

So this is just an example of one of the concepts that we've defined in the threading library. This is the parallel_for, fairly synonymous with the OpenMP parallel do loop or do constructs that we saw. What it's doing is very similar, as I said, to the Standard Template Library: you define what data you want to work on, then you think about the functions or operations that you want to apply to that data, and you let the parallel_for and the threading runtime library worry about the scheduling and how it's going to allocate the work on your data. So this is a simple example. This is looking at some code out of a rendering application called Tachyon. It's an open-source application.

On the top is the POSIX threads code that you would need to thread this with a one-dimensional parallelization, or a one-dimensional iteration space. What that means is, if you look back at, say, that POV-Ray example we showed at the beginning: in order to thread that, you have to think about how you're going to allocate different parts of rendering the image to threads. A one-dimensional space looks at it in terms of scan lines, so we're going to be allocating one line of the image to each of the different threads, and what you see on the top here is the POSIX threads code that you need in order to schedule that work. Whereas on the bottom, with the threading library, you just define the parallel_for, it's already associated with the data that you're operating on, and then all of the scheduling is handled in the runtime library.

Now, that was a one-dimensional iteration space. If you think back again to the POV-Ray example, a more optimal way to do the rendering is to break it up into larger chunks and farm out those chunks of the image to be rendered by the different threads. To do that with POSIX threads, you're going to have to write a fair amount of code. But with the threading library, again, you let the runtime library do the work for you, because you've already defined this for the data sections that you want the iterator to operate on. In this case, the only difference from the one-dimensional iteration space is the use of this blocked_range2d, which is where you tell the runtime library that you are iterating over a two-dimensional space. And there are other constructs, or other concepts, in the library for different data types and different dimensions, et cetera.

Now again, what we're showing here is not necessarily a performance improvement, but a productivity savings and the ease of use that you can get from doing this. So let's go ahead and take a quick look at this Tachyon demo. You just saw the code that we were showing, and I think Elizabeth will show us how it looks. So the first thing we're going to do is run the serial version of this Tachyon demo. So let me go ahead and do the run. Notice how it's painting: it's just doing it line by line. We'll see how it paints with the new parallel programming model in just a second, and we'll also get a timing here of how long the serial version takes.

Okay, that's completed, and it took about 16 seconds. So there's a little pause in here, and then it'll start painting again. Now notice how it's painting: it's doing it in blocks. And as it completes, now we get another timing: oh, 7.25 seconds. So a significant speedup using the new Intel parallel programming model.

Let me do this one more time, just the threaded version. And notice again how it paints in blocks; it's doing the 2D decomposition, as Joe was showing in the source up on the screen. And notice at the very end it grabs an extra bit of work ahead of the block below it. That's because of how the scheduling is done, so it's actually able to get a little bit of speedup that way that isn't possible in the serial version, and I don't believe it's possible in the Pthreads version either. Right. So again, you can certainly see the speedup from threading, going from the serial to the parallel version. There is an implementation of Tachyon using POSIX threads, and its performance is similar to what we see with the threading library. But you can see the productivity, or coding, savings that you get using the library.

Okay, so we've talked about the compilers, the basic functionality and performance capabilities of the compilers and the libraries, and the threading. One last thing we want to talk about is a little bit on how we are supporting the new Mac Pros with the 64-bit Core 2 Duo processors. Really, the way we're looking at this is that we're maintaining compatibility and providing interoperability with GCC; that's still our guiding principle here. Obviously, we are also worrying about performance, and we're going to deliver that, and we'll show another example running on the 64-bit system in a minute.

But the way the compilers are laid out is we're going to offer one compiler binary for 32-bit, for the Core processors as on the iMacs today, and then there'll be a different binary for 64-bit, for the Mac Pros, that will be coming out. We'll probably be starting a beta later in the year, and I'll give you some more details about that in a minute. So, separate binaries for 32 and 64. The 64-bit-targeting compiler will have these switches, and this is just what you can see in GCC, in the Xcode that's on the Mac Pros down in the lab today: there's an -m32 and an -m64 switch to determine which model you're going to compile for. Universal binaries, of course, are the same; you build them with the Intel compiler using either Xcode or lipo, however you want. Now, one thing that we're providing in the Intel compilers that's proven very valuable in the past in helping people port applications to 64-bit is this -Wp64 option; we kind of call it our code-clean option. What it does is give you diagnostics for when you're mixing 32- and 64-bit pointers, a common error that you will run into as you're porting; it will diagnose those types of situations and others, and we've found it very useful in the past. As far as the ABI for LP64, longs and pointers being 64 bits, we're implementing the same ABI as what's been implemented in GCC. On the key things to note here in the ABI, I think tomorrow in the brown bag Dilip Bandekar will go into more detail on the architecture and some of the underlying nuances. But the key thing for performance is really not the 64-bitness; it's the fact that there are eight more XMM, or SIMD (single instruction, multiple data), registers on the Core 2 versus the Core processor. That allows us to have faster calling conventions, and it gives the compiler greater flexibility for register allocation and optimization there. So that's really where you can see some significant performance benefits.

Okay. So with that, I think we have one last demo. All right, I'm going to move over now to the Mac Pro, and what I've got this set up to do is run POVBench, just the command-line version. We did try to do some work here at the conference to get it working with graphics, so you could see the same demo that you saw on the 32-bit system earlier in the presentation, but we had some technical problems and weren't able to do that for you. Let me go ahead and run it over here on the right. This is running GCC version 4.0.1, the Apple Computer build of GCC, and it's just running the POVBench benchmark scene. It's taking 4.959 seconds. And over here, we're running the Intel C++ compiler for Mac OS, and it's taking about 4.144. This is our 64-bit compiler; you can see it's version x.x, so this is a pre-beta compiler that we're showing here, and it will go into beta later this fall.

Okay. So, about a 20% performance difference. But again, these are early copies of both compilers for the 64-bit platform, and as I mentioned, we'll be going to beta with the Intel compiler and the Intel libraries later in the year, and I'll give you a website for how you can get into that. So wow, very good.

Thank you very much, Elizabeth. So I think we're at the end here, and I just want to summarize what we've done. We've given you an introduction to the software development products: the compilers, the libraries, and our new threading library. We've talked about why Intel is in the software business and why it's important for us, and we also showed why it's important for you to be using the best tools possible to get the maximum performance out of your applications running on the Intel processor-based Macs. We also talked about some techniques for threading your applications using auto-parallelization, OpenMP, and the new threading runtime library. And then we talked about the guiding principles: obviously performance, but we also want to maintain compatibility with the development environment that you use today, Xcode, and interoperability with GCC, et cetera. The other guiding principle is that we back all this up with expert support and training; we want to make sure that you have what you need to be successful. So this is our website.

One more thing: as I think I said earlier, several of us, including several engineers, are in the performance lab downstairs on the first floor, and we'll be there all the way till 10 o'clock tonight, or later if you want to buy a beer. And then tomorrow morning there's another talk by some of our performance engineers, looking at a case study of some performance tuning they did for the Macs. That will be a very interesting talk, tomorrow at 10:30.

We mentioned Dilip Bandekar is giving a brown bag tomorrow at lunch. And then we also want you to go to our website and sign up for future betas. We don't have any dates that we can give you now for our beta program for the 64-bit tools, but you can go there and put your name down, and we'll be able to get in touch with you. And one final thing I didn't put on the slide: for attendees at WWDC, we do have some coupons for a 50% discount on the compilers and libraries, if you're interested. So you can come up and see us afterwards.