
WWDC08 • Session 936

Intel's Multi-Core Software Vision

Tools • 52:51

A software revolution is underway, triggered by the shift to multi-core hardware architectures. Software capable of running tasks in parallel has become critical for scalability across multi-core systems. Intel's James Reinders, Chief Software Evangelist and Director with Intel Software Products, will share tips and lessons learned through open-sourcing Intel Threading Building Blocks.

Speaker: James Reinders

Unlisted on Apple Developer site

Downloads from Apple

SD Video (644 MB)

Transcript

This transcript was generated using Whisper; it may contain transcription errors.

Well, good morning. My name is James Reinders, and I work for Intel. I have the pleasure this morning of starting off a couple of talks on how to use parallelism, some things about parallelism. I'm gonna talk this morning first about, um, how to think about parallelism a bit. And I'm gonna definitely end up with some Q&A time at the end, so hopefully you've got great questions. If you have a question that really, really can't wait, you can try asking it in the middle of the talk, and if that doesn't get out of control, we'll go with that.

But otherwise, I'll definitely leave time at the end. I've been working with parallelism for a long time, for more than a couple of decades. I've worked on small-level parallelism, large-level parallelism, and I guess I didn't think that the day would come when all of us would get a chance to work with parallelism. So I'm having fun with it. I have two quad-core machines at home already, and I expect that number to keep going up, and it's fun to have some parallelism even at home. Of course, my wonderful Apple laptop's a dual-core, so that's pretty cool. So hopefully what you get from my talk this morning is a way to judge the different ways to use parallelism, the likelihood of being successful with it for now, and carrying it into the future.

Herb Sutter likes to talk about the concurrency land rush. I think that's something he blogged about, and it amused me because you're just starting to see a lot of announcements. Hey, try this for parallelism. Use this. Use that. So hopefully I can give a little bit of a perspective on the questions you can ask back about, is this the solution for me? What will work? And what's fun is there are a lot of great solutions today already available, but there's many more to come.

So let me start off and sort of level set what we're talking about here. At the beginning of the week, I heard a comment that's pretty common. People often just say, hey, processors are going multi-core because of power. That's a good reason to keep in mind, but there were actually three problems microprocessors were having with ever-increasing clock rates. And to me the reason it's important to know that there were three reasons and not just one is that we aren't going to get a breakthrough one day on one of these and suddenly go back to single-core processors. So the three reasons are: power is one. Every time we doubled the frequency on a processor, the power consumption quadrupled. Well, we would shrink it in half, so it only doubled. So every time we increased the frequency, the power doubled.

But the other thing is we were getting into more and more trouble where, when you increase the frequency, what are you trying to do with it? You're trying to execute more instructions in parallel, instruction-level parallelism. And that was getting harder and harder to find, at least at the clock rates we're talking about. And then memory's not getting faster. So lo and behold, the faster we make a processor, the more we wait for memory. These three walls were putting the damper on increasing clock frequency. So we're doing multi-core processors as a solution. And we see that in several different domains. And one thing that is useful to think about is GPUs, as they get talked about, are actually multi-core processors. So there were some graphs earlier this week that drive me a little nuts. Some of you may have seen them. CPUs flatline, GPUs going faster. Well, that's kind of cool. But the GPUs were growing at Moore's law, doubling about every 18 months, and the CPU was flat. And why was that? Well, it was because they weren't counting the multicore aspects. Unfortunately, when you say that GPUs are speeding up, you're giving them credit for their multicore aspects. When you flatline CPUs, you're not giving them credit. Meh, you know, maybe a nit, but it may overstate how much fun GPUs are compared to multicore. But I'll get back to that, about programming, because there's a lot of fun in architecture ahead.

So when we were doubling the clock frequency every 18 months or so, programs just got faster, sort of. I never quite saw a program run twice as fast on a 2 gigahertz machine as it did on a 1, but close enough. But now with multi-core, obviously we need to see parallelism at some level. Multiple programs running, one program using multiple threads, whatever. So this has been called the free lunch is over. Pretty basic stuff.

And how fast is it happening? Well, you know, Intel didn't invent multi-core processors. Putting two cores on a die or multiple threads has been around a while, but if you look at Intel architecture at x86, dual cores come out in 2005, followed by quad cores in 2006. We'll be shipping six-core processors this year. Eight-core processors will be not far behind. The trend is there, and it's not very difficult to get a four or an eight core machine or even 16 core these days 'cause you just put a few of the processors together.

So parallelism is really, really here. In fact, let me go a little further (well, we demoed an 80-core research chip) and make a prediction. This is not a roadmap announcement. And of course, it's an NDA event, so don't run off and, you know, blog too much. But within two years, you're gonna be able to walk down to your favorite store. And I just listed a bunch.

And you'll be able to buy machines with more than 16 cores. And I'm not talking exotic, super expensive machines. In fact, I'll go further. I actually think that inside three years, it's gonna be closer to 40 cores. Now, that may sound a little audacious, but connect it with what I said about GPUs earlier. GPUs and CPUs with lots of cores, from a programming standpoint, can be fundamentally equivalent, and they're actually more programmable the more equivalent they are. So I think that graphics is going to drive a desire for multicore. I think you're already seeing hints of that with the interest in using GPUs, but I think it's actually long-term going to be a question of multicore CPUs.

So, here's, you know, your basic graph. Free lunch is over. It means we've got to do some concurrency. Now, a couple of things I want to point out: the way I draw this graph is we're going from a gigahertz era to a multi-core era. Multi-core to me is two, four, eight processors. But then I have this mysterious term on here, many-core. While we're all trying to figure out how to take our applications and start using two, four, eight cores, out there looming is this idea of what I call many-core. And many-core to me is more than 16 cores.

And I'm going to go through some of the fundamental issues we face with parallelism, and one of them is scalability. And let me tell you, when you get past 16 cores, you really have to have your act together with parallelism. You can't get away with Band-Aids. And this many-core, I actually called it Tera-scale here; it's another term that's popularly thrown around for this. Tera-scale, more than 16 cores, is going to be a reality, and it's going to be a reality before the whole industry is embracing multi-core.

To me, that's very exciting, but it really drives home the point. We need to worry about parallelism. In fact, within a decade, being a programmer and saying, "I don't do parallelism," really bad idea. Really bad idea. You might as well just go find another profession in 10 years if you don't know something about parallelism. So are we ready for this? So it's Friday morning. It's been a long week. So I thought I'd have a little fun. Let's grab some mail. And hopefully this is not too corny for us all, but wake you up a little bit. So I've changed the names to protect the innocent.

So I completely rewrote my code again for octa-core. It ran great on dual-core, but it ran terrible on octa-core. Actually, I think I saw something like this on the Apple performance mailing list. And I also don't understand Joe's code. So it's easier -- my new code is easy for me to read, but no one else will understand it. This is a very common thing in parallelism is, you know, you get something tuned on a few cores, it doesn't work on a few more. I can't read your code. I call it spaghetti threading. It's a term that seems to make sense to people. And when you see spaghetti threading, you know it. This is when you are tweaking all these things. You're playing with P threads, and you're just getting a little too exotic, a little too smart for yourself. So in any case, code that looks like spaghetti, code that's been crafted really close to the hardware and so forth. Very difficult to debug, hard to scale. The first key point I want to make is you really need to look for ways to abstract your parallelism.

One of the things I think most people have come to agree on is that we need to program in tasks and not threads. In other words, programming Pthreads or Windows threads or whatever your thread of choice, Boost threads: bad idea. You're programming at the wrong level. And I can go on for hours about why that is. But we need to look at some abstraction. So I took a look at Merriam-Webster, and abstraction is apparently a term from the 1500s, but it didn't quite capture what I had in mind when I talk about abstraction for programming. I had things in mind like using Fortran instead of assembly language. That seems like an abstraction to me, you know, circa 1954. This may not sound revolutionary to you, but we're undergoing a revolution right now with parallelism, where for a long time we've gotten used to writing code at a low level for parallelism. And if you ask somebody that's been doing parallelism a long time, how do I do it? When they start talking about MPI or Pthreads or Boost threads, it would be like being a programmer in the '50s or '60s and having someone tell you that assembly language was the way to go. It's just too low level. I mean, in assembly language, we worried about data placement. We worried about which data went in which register. If you're using Pthreads, you're worrying about which activity do I put in which thread and which thread runs on which processor.

Bad idea. So some of the things like threaded libraries, OpenMP, threading building blocks, NSOperation, or now Grand Central Dispatch, these sorts of ideas are saying, "Hey, tell us about your tasks, and something will map those onto processors." As programmers, we need to be doing something higher level.
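That "describe tasks, let something map them" idea can be sketched in plain C++, with std::async standing in for a real task scheduler like TBB or Grand Central Dispatch (the function names here are illustrative, not any library's API):

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Task level: we say WHAT can run in parallel (two halves of a sum);
// the runtime decides how many threads to use and where they run.
long sum_range(const std::vector<long>& v, std::size_t lo, std::size_t hi) {
    return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
}

long parallel_sum(const std::vector<long>& v) {
    std::size_t mid = v.size() / 2;
    // One task is handed off to the runtime...
    auto left = std::async(std::launch::async, sum_range,
                           std::cref(v), std::size_t{0}, mid);
    // ...while the current thread handles the other half.
    long right = sum_range(v, mid, v.size());
    return left.get() + right;
}
```

Nothing here mentions a thread count or a processor; the work is described, not placed.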

So I do have my three favorite things to talk about when people say, what should I use for parallelism? Threaded libraries. No particular threaded library; I just like the concept that if my work can get done in parallel and someone else can write the code, I might as well let them do it. So there are some excellent examples. People can use different math libraries. Intel has a math kernel library. If you're doing animation, you can rely on the Apple Core Animation capabilities and let Apple do the work getting those to run in parallel, and you call them. So it's kind of funny.

The reason I put this one first is it's the easiest to do. It doesn't apply to a lot of your program necessarily, but it's also often overlooked. It's really a lot of fun to take a program and make a few calls to a better-threaded library and have it run a lot faster. Don't overlook it. Another capability is OpenMP. This is available in many, many different compilers. It's a C and Fortran construct. It's been around for about 11 or 12 years now. And it's hints to the compiler. Compilers aren't quite smart enough to run stuff in parallel.

You put a few hints before a loop, and off you go. Again, really easy to use, very practical. It scales. It tends to keep you away from programming bugs, and so forth. Finally, we've got Intel Threading Building Blocks. It's been a very successful way to extend C++ for parallelism. Very aimed at C++ programmers. It addresses key issues of thread-safe data structures, how to program in tasks, how to do scalable memory allocation. Definitely worth a look.
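The OpenMP "hints before a loop" style looks roughly like this (a minimal sketch, not from the talk; with an OpenMP-aware compiler the iterations are split across cores, and without OpenMP the pragma is simply ignored and the result is the same):

```cpp
#include <vector>

// Sum 2*b[i] over a loop. The pragma is the "hint": run the loop
// iterations in parallel and combine the per-thread partial sums.
double scaled_sum(int n) {
    std::vector<double> b(n);
    for (int i = 0; i < n; ++i) b[i] = i;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += 2.0 * b[i];
    return sum;
}
```

The hint changes how the loop runs, not what it computes: scaled_sum(1000) is 2 * (0 + 1 + ... + 999) = 999000 either way.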

Threading building blocks, full disclosure here, yes, this is my O'Reilly book on threading building blocks. It's a really aggressive, fun way to thread C++ code. We've had some fantastic programs come out over the last year. It's been ported to many, many platforms. It's been on Mac since the first days, on G5 machines as well as Intel-based machines. It's been ported to Xbox. It's been ported to SPARC machines. It's widely available on many, many different processors and many operating systems, so it's quickly becoming a very common way to get parallelism in C++. Very worthwhile looking at if you're a C++ programmer looking to add parallelism.

Hmm, more mail. My program crashes mysteriously, but only some of the time. And it always works when I run it inside the debugger. What shall I do? Signed, Intermittent. I'd love to get some customers of our tools up on stage, 'cause Intel does a variety of tools, including some that can find race conditions and deadlock. This is a very common problem, and it's worth talking about a little. You write a parallel program and then it becomes intermittent. It doesn't run the same all the time. And the two key issues are race conditions and deadlock.

A race condition happens when you don't synchronize the way that you should, and deadlock is when you're over-synchronized or one part is waiting for another part. But what's really important here is there are actually some ways to program that are more likely to run into these problems and other ways that are not. If you're calling a threaded library, if you're doing OpenMP, if you're doing threading building blocks, if you're doing NSOperation or Grand Central Dispatch, if you're calling the Core Animation library, you're probably doing things that will, in general, avoid causing these problems.

If you're diving into Pthreads and doing your mutexes yourself, if you're doing a general-purpose attempt at parallel programming, you're much more likely to run into these problems. I'm not saying that you have to use techniques that completely avoid these, but the more you use techniques that can incur these, the more you need to pay attention to how you are going to debug them. So there are getting to be some excellent tools in the marketplace, including some from Intel, that can help find race conditions and deadlock.
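A minimal sketch of what's behind that intermittent behavior: an unprotected shared counter can lose updates (the race), and taking a lock around each update fixes it (names here are illustrative):

```cpp
#include <mutex>
#include <thread>
#include <vector>

long counter = 0;
std::mutex counter_lock;

// Without the lock_guard, `++counter` from several threads is a data
// race: updates get lost, and the failure is timing-dependent -- exactly
// the "crashes mysteriously, works under the debugger" symptom.
void add_n(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(counter_lock);
        ++counter;
    }
}

long run(int nthreads, int per_thread) {
    counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(add_n, per_thread);
    for (auto& th : pool) th.join();
    return counter;
}
```

With the lock, run(4, 100000) is always 400000; delete the lock_guard line and it usually isn't, but only sometimes, which is the whole problem.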

I expect to see a lot more in the future. I don't think there's nearly enough of this currently in the marketplace to help with these. But it's really important to look at this. And again, if you use a higher-level abstraction, you're less likely to hit these problems. So when you see different solutions for parallelism advertised, you should think about: is it abstract, and does it help me avoid these parallel programming bugs? So when people ask my opinion about different parallel languages and things being touted and committees being formed to go work on things, I commonly come back to this and say, it either is going to help us solve this problem or it's not. And I'm not a big fan of new parallelism initiatives that don't help solve this problem. I think that they just don't help us get more parallelism in applications. All right, I think I got one more letter. My program actually runs slower on an octa-core than on a quad-core machine. And someone said scaling was a factor. No, it doesn't have to do with fish.

This is how I look at scaling. You'd like a program, perhaps, to run eight times as fast on an octa-core as it did on one core, but it's not going to. That would be called ideal scaling. But did you expect a machine to run eight times as fast on an 8 MHz processor as it did on a 1 MHz processor? Probably not. So multi-core is not really new in that what you're trying to do is write a program so that it speeds up as you add cores, but you don't need to be hung up on making it ideal. I mean, if something runs four times as fast on octa-core, but it also runs eight times as fast on 16 cores and 16 times as fast on 32 cores, you're in really good shape, a lot better than most people think. The killer, though, is that if you write your program in a way that doesn't scale, this is a really common problem as well. This is a real example of a 3D ray tracing program. We took some work. We had one of our experts do a very nice job and hand thread the code using P-threads.

And if he only had a quad-core machine to run it on, it looked pretty darn good, because what you've got is speedups on quad-core of 3.76. On one example, the hand-coded was 3.47. So a 3.5x speedup on quad-core sounds pretty good, and that was the hand-coded program here. The problem is that somewhere around five or six processors, the speedup tapered off. In fact, if you keep running on more and more cores, it gets slower.
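The speaker doesn't name it, but the textbook tool for reasoning about this tapering is Amdahl's law: if a fraction p of the run time is parallelizable, the speedup on n cores is at best 1 / ((1 - p) + p/n). A 95%-parallel program gives almost exactly the 3.5x-on-quad-core figure above, yet stalls far short of ideal as cores grow:

```cpp
// Amdahl's law: upper bound on speedup with parallel fraction p on n cores.
double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// amdahl(0.95, 4)  is about 3.48  (looks great on quad-core)
// amdahl(0.95, 8)  is about 5.93
// amdahl(0.95, 32) is about 12.55 (nowhere near 32x)
```

The 95% figure is an assumed illustration, not a measurement from the ray tracer; the point is only that a program can look "done" on four cores and still have a scaling problem baked in.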

And it's a global bottlenecking problem. It was a very nice little program the way that it was written to scale by hand with Pthreads, but he used a central computation and divided the work up evenly. And it turns out that that works pretty well until you get to a higher number of cores. A brilliantly written program has to be completely rewritten once you get on an octa-core machine because it just doesn't scale. Now, this particular example I did with threading building blocks, and I get identical operation on this example with OpenMP. So really the key here is abstraction.

Now what's really, really frustrating about this is that the code on the left using Pthreads shows that I had to add a whole lot of code to get it to run in parallel. The code on the right shows I barely had to add any code at all to get my application to run in parallel using threading building blocks. And the reason is, I've circled the code. I know you can't read it, but it's just a little loop. It does the ray tracing. That's the core algorithm. It's a few loops. It does the operations.

All I want to say to my machine is, run that in parallel. Just do it. You know, go. And that's basically what you do in a good abstraction. Go run it. Now, with threading building blocks, it's mostly include statements. I think I had to add 17 lines of code to the entire program to get it to work. On the other hand, the complete transformation to Pthreads was almost 200 lines of code. And it's not hard code, you know: create some Pthreads, create some mutexes, compute some bounds, kick them off, wait for them to finish, shut down the Pthreads, shut down the mutexes, and so on. But it just isn't value-added code. So when you start hearing things suggested, threading building blocks, or you look at Grand Central Dispatch, the blocks and the NSOperation things that were talked about this week, those are aimed at giving you an abstraction layer where you just say, run it in parallel.
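For flavor, here is a compressed sketch of the kind of setup the Pthreads version needs (illustrative, not the talk's actual ray tracer, which also managed mutexes and ran to ~200 lines). Notice how little of it is the algorithm:

```cpp
#include <pthread.h>
#include <cstddef>
#include <vector>

struct Slice { const long* data; std::size_t lo, hi; long sum; };

// The one loop below is the core algorithm; everything else is
// create / compute bounds / kick off / wait / collect -- the
// non-value-added code a task abstraction writes for you.
void* worker(void* arg) {
    Slice* s = static_cast<Slice*>(arg);
    s->sum = 0;
    for (std::size_t i = s->lo; i < s->hi; ++i) s->sum += s->data[i];
    return nullptr;
}

long pthread_sum(const std::vector<long>& v, int nthreads) {
    std::vector<pthread_t> tids(nthreads);
    std::vector<Slice> slices(nthreads);
    std::size_t chunk = v.size() / nthreads;
    for (int t = 0; t < nthreads; ++t) {                        // compute bounds
        std::size_t hi = (t == nthreads - 1) ? v.size() : (t + 1) * chunk;
        slices[t] = Slice{v.data(), t * chunk, hi, 0};
        pthread_create(&tids[t], nullptr, worker, &slices[t]);  // kick off
    }
    long total = 0;
    for (int t = 0; t < nthreads; ++t) {                        // wait, collect
        pthread_join(tids[t], nullptr);
        total += slices[t].sum;
    }
    return total;
}
```

With a task library, the same program is essentially the worker loop plus one "run this range in parallel" call.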

Just go. If you're spending your time writing a lot of code setting this up, I can promise you that's not the way of the future. That's not what parallel programming is going to look like. So you might as well find the abstractions that avoid doing that and look for them really hard, even if you have to invest a bit in learning and so forth. You don't want to be writing code where you've got hundreds of lines of code just setting up to run a few lines of code in parallel.

Now remember, this is the application I had that doesn't scale past octa-core, so that's really frustrating. It added a couple hundred lines of code, and the program didn't scale. And the reason is that threading building blocks uses a very sophisticated algorithm to divvy up the work and load balance it across multiple cores. The same thing you can expect to see from multiple operating system vendors in the future as well. We're really at a point where everybody's trying to solve the same problem. They'll eventually come together and get unified. We don't have specific plans right now for threading building blocks to sit on top of Grand Central Dispatch, but my prediction would be that in the future that's the sort of thing we would do. All tool vendors would move on top of that. There are equivalent things going on in Linux and in Microsoft OSes as well, where they're saying, hey, we need to provide this core functionality where we distribute the work and we're responsible for load balancing. And that can lend itself to the scalability that you want.

So you need to look to not hard-wire that into your program. If you're starting your program up and saying, how many cores are there? I'm going to kick off a bunch of threads. You've already lost. No. If I haven't convinced you just thinking about octa-core and so on, this stretches further into GPUs.
So an ideal programming language that load balances across CPUs and GPUs wouldn't worry about exactly what they can do differently or how powerful they were. It would just say run this in parallel. And something in your runtime would dispatch and divvy things up. I'm quite confident that's the way programming is going to be.

Even forgetting GPUs for a moment, future CPUs: if and when Intel builds a 100-core CPU, it's not going to look like our quad cores. Our quad cores are four identical cores. Powerful, out-of-order execution engines, just lots of cache. The day that we wake up and build a 100-core CPU, most of them are going to be itty-bitty cores without out-of-order and so on, because they're more efficient. When you write a program that can use 100 cores, you don't need each of those 100 cores to be big, fat, power-hungry things doing a lot of out-of-order scheduling to try to speed you up. What you'd rather do is use that silicon area to have a processor that's lean, mean, and runs fast. And if I can give you two or three of those cores in the same die area, you'd rather have that because your program scales.

So there's no way we're going to build a 100-core machine with all out-of-order engines like we do for quad-core. If we build a 100-core machine, or when we do, and that's not going to be tomorrow, you can count on the fact there'll probably be a few big out-of-order engines, but there'll be a bunch of smaller ones, maybe specialty ones. So again, go back to this example. How should I write this example? Well, I should write it to say, run it in parallel. And then in 10 years, when it's running on a machine that has 80 little cores and 20 big cores and maybe 10 specialty cores, I don't have to rewrite my program again.

Sometimes we use the term future proofing, which maybe is a little bit more of a promise than anything, but you definitely want to write your program so that you're not down mucking in the details of exactly how to dispatch things. And if your program starts up and says, how many cores are there, and then you divide the work up evenly across the cores, you're going to fail for multiple reasons. Okay?

One reason is you don't have exclusive use of the machine, and a few of the cores are going to get busy doing something else, and your whole program's going to run only as fast as the weakest link. You'll already see that on a quad-core machine if you divide up the work evenly among four cores. I think there were some really great animated graphics on Monday that showed this, you know, with little ping-pong balls going in troughs and moving around depending on the workload. Again, the key idea there was write your program in terms of ping-pong balls and throw them into these troughs. Don't write your program in terms of how many cores are there, divide the work up evenly. Not a good idea.

So I wanted to show you a little bit. We've done some surveying of developers. At the end of last year, we went to companies that had added parallelism, and I was delighted to see that about half of them said that they were getting their work done in parallelism without using native threads: Pthreads, Windows threads, Boost threads. This is quite a diverse audience that we talked to about how they got parallelism, and a lot of them were using threaded libraries. OpenMP was quite popular. Threading Building Blocks was in there, but they were giving credit to getting their parallelism from something other than explicitly hard-coding it in raw threads.

This is a trend I expect will continue. In fact, I'm quite certain of it. 50% of those we surveyed last year are using raw threads. That number's going to go down. It's going to take a while, but I feel it's very similar to surveying people in the early to mid-'90s about use of assembly language. Right now, it would be very difficult to find people that are doing a lot of assembly language. Even operating systems are generally written in C or some other higher-level language these days. And that used to be the place where you'd find people saying, you'll never do that, I'll always use assembly language. Same thing on threads.
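The "ping-pong balls into troughs" picture can be sketched like this: enqueue many small chunks and let whichever worker is free claim the next one, instead of pre-dividing the work evenly by core count (an atomic cursor stands in for a real scheduler's queue; names are illustrative):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Many small chunks, claimed dynamically. A core that gets busy with
// something else simply claims fewer chunks, so one slow worker no
// longer gates the whole run the way an even static split does.
long dynamic_sum(const std::vector<long>& v, int nworkers, std::size_t chunk) {
    std::atomic<std::size_t> next(0);
    std::atomic<long> total(0);
    auto work = [&] {
        for (;;) {
            std::size_t lo = next.fetch_add(chunk);   // claim next chunk
            if (lo >= v.size()) break;
            std::size_t hi = std::min(lo + chunk, v.size());
            long local = 0;
            for (std::size_t i = lo; i < hi; ++i) local += v[i];
            total += local;
        }
    };
    std::vector<std::thread> pool;
    for (int w = 0; w < nworkers; ++w) pool.emplace_back(work);
    for (auto& t : pool) t.join();
    return total.load();
}
```

The program never asks how many cores exist; the chunks are the ping-pong balls, and the workers are the troughs.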
In fact, to be honest, this surprised me, that we were already at 50% using abstractions. And so I've heard all this doom and gloom. I get to read all these wonderful articles.

Everybody in Intel thinks they should forward every article on parallelism to me, so I get to see these. There are so many doom and gloom articles, you'd think that nobody was getting any work done. Quite the contrary. Lots of people are getting parallelism in. The other thing that was pretty amazing is we went and asked the people who weren't doing parallelism, why not?

And if I had done that a couple years ago, the people would have said, it's too hard, we don't see the need for it. Only 27% of the people we talked to, and we talked to some pretty good developers here, very influential group, only 27% were willing to say that they didn't think it was needed.

Now, I think there's a little bit of shyness here. I think that people are starting to think, oh, my gosh, multicore is coming. Even if I don't know what I'm going to do, I'm not going to tell someone who's surveying me that I don't need parallelism. Okay, I understand. There's some of that going on. But three-quarters of the developers are saying, "Hey, we're gonna do something," with more than 50% of them blaming it on schedule, saying, "I just haven't figured out how to fit it in my release schedule, how to allocate people for it."

One other fuzzy detail: at Intel, we try to track how many applications, influential applications, that's ones that we think sell silicon or cause people to buy machines, use parallelism. By our estimates, the number of applications on the market at the end of last year that used parallelism was about twice what it was at the beginning of the year, which is rather phenomenal, because at the beginning of the year, those were all the applications using parallelism that had been developed since, you know, the beginning of time, basically. It doubled last year. We're just seeing a tremendous rush towards this. And, yeah, if you're curious, this was a round-the-world phenomenon. This happened to be the distribution of the people we talked to. But when we looked at the data, we didn't see a difference in the trend in any particular geography.

I promised I'd try to keep things reasonably high level. The talk right after me, they said not to promise too much coding, but I know they're going to dive down a bit and show some examples and things, 'cause that's very important. But I'm constantly amazed when we work with companies, some companies that, you know, again, have fantastic programmers in them, that they get detached from these three key things. They get enamored with a particular technology. They're going to rush off and implement something with Pthreads or whatever. And we often bring them back and say, look, you've got to look at the fundamentals here. And the three fundamentals are: scaling. Do you have a plan? As we continue to throw more and more cores at your application, do you have an idea how it's going to scale?

A second one is do you have something you're doing to keep the debugging under control? Because intermittent programs are bad. I know one company in particular that shipped an app, hugely popular app. Again, they won't come on stage and say, "Hi, I'm here from such and such company. We wrote a bad app." But they had customers saying that the program ran 90 times out of 100.

They still loved the app so much, but they just begged them, would you please make this thing reliable? And then future-proofing: are you adopting a technique that's just a fad, something that's tightly tied to today's hardware, that doesn't really liberate you as a programmer, that doesn't really give you the opportunity to take care of a very rich future in hardware? I'm not just talking about Intel hardware here. You know, if Intel doesn't build really interesting chips that we all use, someone else will. Nothing is going to stop the chips getting very, very interesting in the future. So we might as well as programmers not go hug the hardware and write in a language or a technique that's specific to today's hardware, or invest a whole lot in that.

So I wanted to shift gears just a little bit and talk about it from a different angle. I hope you'll give thought to the scaling, the debugging, the correctness, you know, when you're working on your applications, when you're advocating techniques, when somebody comes and gives you a talk saying you should use this technique instead of that one. You know, I give lots of talks on threading building blocks. I say, you know, it solves everything. Well, when I do a talk like that, you should be thinking, you know, James, can you explain to me why you think this helps with scaling, or why this is future-proof? Can you describe how it would be used on a future machine with hundreds of cores? And the answer is yes, yes, yes. And I can try to do that for any technique being offered out there, and so I'd encourage you to think about that: scaling, correctness, and future-proofing.

But shifting gears a bit: I wrote an article last year for Dr. Dobb's that was pretty popular. What I did is I tried to sit down and write out eight tips. And this is a pretty easy article to find online; Google for my name and Rules for Parallelism. It's a short article. The idea was, when you're rushing to write code, instead of these really abstract things, scaling and correctness and future-proofing, can I give you something a little bit more specific to sink your teeth into? So the first thing: as programmers, we need to think about where the parallelism is. You need to fundamentally get used to figuring out what is parallel in your program.

Now, one thing that's really interesting to me about that is that one of the things we've learned is that nested parallelism is fantastically important. When you stare at one part of your program and say, how can I make this run in parallel, your eyes will pop out of your head. It's so frustrating.

But if you apply that at a high level of your program, and then lower levels have parallelism too, and you can keep expressing the parallelism wherever it happens, you get a lot better scaling, a lot better performance. There are not very many programming techniques out there right now that will help you with nested parallelism.

I guess that would be one reason I advocate Threading Building Blocks a lot. OpenMP has added a few things, and other programming techniques need to look at this. If you look at all the GPU languages being proposed out there, whether it be CUDA or OpenCL, you'll find that a thread or a task cannot, in general, create more. And that's a mistake, because it means you can't embody nested parallelism. So it's a critique of quite a few languages, what I just said. And again, my prediction is that in the future all techniques either need to correct that or they'll die off. We've really learned that nested parallelism is way more important, I think, than any of us thought it would be in practice: if you want a scaling app, it needs to be able to keep expressing parallelism. After you kick off a task, if it realizes it's got a lot of work to do, it can break itself up into more tasks.

So my tips started off with: think about the parallelism, know where it is. And the reason I talked about nesting is that if your brain starts saying, hey, I've got this to run in parallel, but some parts of what I'm thinking about might be parallel themselves — good. No problem. You should be able to code that.

The other thing is you need to be able to program using abstractions. I already talked about why; it's very important, and you should program at the level of tasks, not threads. To me, a thread is when you're programming saying, hey, I'm going to have one thread for each processor, and I'm going to figure out what each thread does. That's what I mean by threads. Don't do that. Liberate yourself and say, hey, I want to do this, I want to do that, I want to do that. I find it very compelling to talk about tasks and get going. And you saw some great animations at the beginning of the week that sort of illustrate how to think about it, and the block extensions to Objective-C try to capture that. The C++ standard undoubtedly will add lambdas and will add futures. So you'll see this trend in many, many languages: let's start talking in tasks, not threads.

Now, a surprising one, one that I cannot overemphasize how important it turns out to be in practice: don't write a parallel app that can't run sequentially. Once you get into parallelism, it's really cool — you can write applications that can't be run sequentially. Well, you might as well shoot yourself in the head now. It's just not a lot of fun to debug an app that can only be debugged in parallel. Frankly, programmers that aren't perfect, like myself — I occasionally have to fix a loose pointer or a mistake or something in my code that's just, you know, a no-brainer. Really easy to do when the program's running sequentially.

I get a memory fault or whatever, I can figure it out. I can run it 10 times, it keeps failing the same way. When I run it in parallel and it starts doing bizarre things, I start thinking, it's a parallel programming bug. No, most of my bugs are not parallel programming bugs. They're still the ordinary bugs.

And I find it much easier to debug them if I'm running the program and it's not expressing itself in parallel, at least the part I'm trying to debug. So I pay attention, and I think programmers, experienced programmers pay attention to, okay, I've done all this work to make it run in parallel. How can I force it to be sequentially debuggable? Because I want to get past this debugging phase. I don't want to spend a week tracking down a bug that I thought was a race condition, and it turned out to be the use of the wrong pointer, something that I used to be able to debug quickly.

So I had a few other tips, but I want to zoom ahead. You want to understand locks. You want to get good tools. You want to use scalable memory allocators. Read my article and think about those. But I wanted to go ahead to another one that is a subtlety that, as experts in parallel programming, you need to be really well versed in.

So what if someone walks up to you — now, this has happened to me — and says, "Amdahl's Law proves that this is all a bad idea." You can't win with parallelism; parallelism's gonna die. Boy, have people told me this. It's pretty surprising, 'cause I worked on a 9,000-processor machine, and it had pretty good parallelism on it. And it took people a long time to figure out what was going on, or at least to be able to articulate it.

So what you want to do is you want to make sure that as you're thinking about parallelism, you pay attention to the fact that we're constantly giving more workload to our applications. I'll come back and give you some examples on that in a little bit, but let me start by talking about Amdahl's Law.

Because this is, you know, one of those topics that comes up over beers sometimes. So let's assume we have an application. It has five parts; they each take 100 seconds to run; they run one after the other. Let's say that I know how to make a couple of them run in parallel. Well, if we take a look at Amdahl's Law, when I get to dual core, I can get a 25% speedup in my program, and when I go to quad core, I can get about a 40% speedup. In fact, if I keep making those two parts run in parallel, I can eventually get them effectively to run in no time. Unfortunately, my program still takes 300 seconds to run. I only get about a 70% speedup even if I use this, you know, one-million-core machine on a couple parts of my program. Okay, so this is the high-level proof that multi-core is doomed, nobody's gonna succeed, we'll all just go home. Amdahl's Law.

So this is based on a paper written by Amdahl in 1967. What he actually said is that the effort spent on making parts of your program run faster and faster is gonna have diminishing value unless you do something about the sequential parts. And so for years this was used as the reason why parallelism won't work on a large scale. Not everybody took it completely seriously, but no one really articulated well why this doom-and-gloom scenario that I just showed wasn't correct.

21 years later, John Gustafson said, hey, you really ought to measure the speedup not just by fixing the problem size, but by scaling the problem size up. So let me illustrate. What if, when I run in parallel, I'm able to give more data, more work, to those parallel sections? Now, this is not a completely contrived example. Imagine that those sections are the loops walking through my image, and I'm increasing the image size or the number of images I'm processing or something, and that the other, sequential work is mostly not a lot of work. And I increase the amount of work to do. Now I've got a 40% speedup just on dual core, at least ideally, and a 2.2x speedup if I go to quad core. In fact, this just continues to scale indefinitely. It's about — I don't remember — 60% or 70% efficient; you can do the math. But this program scales. Throw as many cores at it as you want, and you can keep throwing more data at your application, and it'll go faster.

So this one escapes a lot of people, and it's not immediately intuitive. I used to have a 40-megahertz laptop. I think my laptop now is 2-point-something gigahertz. That's at least a 50x improvement in clock rate. I'm trying to figure out what runs 50 times faster on my laptop — it doesn't feel that much faster. But what happened to that performance? Where'd it go? Oh, I'm doing Wi-Fi negotiations.

I'm doing encrypt, decrypt. I'm doing smooth fonts, smooth scrolling. My screen has, you know, many times the pixels it did before. That's a lot more processing. You know, thank goodness for HDTV, right? Because all these applications we wrote to run on PAL and NTSC now need to scale their data, so Intel needs to build faster processors to help us play with high definition. And audio's doing the same thing: give you more processing power, we'll process more audio. It's not that I'm going to process my DVD in iDVD faster; it's that I'm going to add more effects, I'm going to do high-definition content, and so on. So as you look at adding parallelism to your program, think a little bit forward: what would you like to do to process more?

That may be the easiest place, or the most beneficial place, to put parallelism into your application. There's a balance. You often look at a program and say, wow, I can speed this up and my program will run faster. But the other place to look, with phenomenal results, is: where can you add things that process more data, look ahead, do things you wouldn't have considered before because they were too expensive, but now can add with the power of parallelism?

That's what we're really going to see. That's why my 40-megahertz laptop's not fast enough for me anymore and I need two gigahertz: I'm too used to the barrage of extra data, extra processing that my machine's doing for me. It's not that any one program got 50x faster. My mail program, for instance, is not running 50x faster, that's for sure. I'm quite confident it was faster before.

So let me show you — this is a foil that our CTO uses; I've seen foils like this from other people. They show, oh, the world's gonna be taken over, you know, more processing power and so on. What's really interesting is the graph here, and this is a very common sort of graph: it shows data size and performance going up. There's a reason for that. You don't really just make the performance go up; the performance goes up as long as you assume the data sizes get bigger. But thankfully, with HDTV and high-definition cameras, even 5-megapixel cameras on cell phones and things, we've got plenty of data streaming into our computers. And this is all about why Amdahl's Law doesn't doom us.

I wanted to highlight a couple of websites, some places that I think are useful for getting more information. One of them's for dealing with today: again, I'll plug Threading Building Blocks. We've got some forums there and so forth; it's a fun place to go and sort of chat. It's open source, and it's just an interesting place to start. C++ programmers, I highly recommend taking a look at this.

Now, if you're more into the "wow, I wonder where the world's going to go" — James didn't talk about really exotic technologies. Yeah, I usually try to emphasize practical things, because if you're going to run out and try to write a program today that runs on lots of machines, you really need to look at these things that may sound a little boring; I kept emphasizing threaded libraries, OpenMP, Threading Building Blocks. But that doesn't mean there isn't a lot of exciting work going on asking how we could really change the world. One of them is something called transactional memory, or software transactional memory. In short: you know, I hate locks, so let's do something different. Databases seem to have figured out how to do transactions.

Wouldn't it be cool if when I have a data structure and I'm going to update it in order to be thread safe, if it was updated in a transactional fashion? This concept's not going to go away. We're going to keep beating our heads against the wall until we figure out how to put this in software and in hardware.

But frankly, nobody knows how to put it in hardware yet, and nobody really knows how to get it completely right in software. There are some nice little implementations. Intel's had a project to put software transactional memory into C and C++, and we have a free version of our compiler on our website that does that. I am not advocating going and downloading it and using it in the next program you're going to ship or put into production — please don't. But we put a website together where we're putting some of our experiments up for people to play with and give feedback, community feedback. We've got some experiments on futures and spawn, we've got this experiment on transactional memory, and we've got some exotic adaptive-library techniques that I think are extraordinarily promising. And I love to peruse things like this. You know, IBM's had their alphaWorks site; Intel's got whatif.intel.com. It's fun that there are places you can go, and whatif has a lot of stuff on parallelism — that's what makes it a little more unique, although it's got some other things not to do with parallelism as well. Very cool place to go if you want to get a flavor of what researchers are looking at, what problems we might solve to make parallelism even easier. But please don't go to this website, download stuff, and put it into production use. This is stuff that's definitely not ready for prime time, but it's fun to know where you might go if that interests you.

My hope was, by walking through these things at a simple level instead of diving down to code and pitching a particular solution, that I'd really emphasize something critical: that you get thinking about scalability, correctness, and this future-proofing, this maintainability of code. And that when you're jumping into parallelism, you're able to look at the barrage of people, including myself, who will come to you and say, use this technique because I love it so much — you're just going to be inundated with that; it's just barely begun — and step back and ask, what really is going to end up mattering to me as I write my application? I'm hoping that's what I was able to start you on.

I certainly enjoy talking about this topic, and there are a lot of nuances as you start thinking about scaling, like the thing I said about nesting, that won't immediately be obvious until you get into it, and then you'll realize, oh my gosh, this is another way to get scaling. And as you think about scaling as something you need to be an expert on, you'll start to realize that when you focus on that, you're more likely to get your parallelism to work well than if you were focused on a particular syntax you were told to memorize or learn because someone was touting it. This is the path to thinking parallel.

And the future of hardware is very, very rich. I love computer architecture; it's one of the reasons I'm at Intel. I love thinking about how you build lots of cores and hook them together. And frankly, from my perspective, processors were getting architecturally boring, and multi-core has brought them to life again. So it's wonderful to see all sorts of ideas being touted: let's throw this in the system, throw that, let's do this with cores. And I promise you that the variety of hardware our programs need to run on is going to increase in the future.
To me, that's very exciting, but to some programming techniques, it should be very frightening. So you have to be careful what you're investing in. I think we can learn about parallelism, think about it, teach it. And with that, I'm going to end a little early and take questions. And if you don't like the answer — you know, you ask a question trying to get an Intel person to really answer it, and you don't like the answer — ask again. Just tell me I didn't answer it, and I'll stick with you.