Tools • 52:51
A software revolution is underway, triggered by the shift to multi-core hardware architectures. Software capable of running tasks in parallel has become critical for scalability across multi-core systems. Intel's James Reinders, Chief Software Evangelist and Director with Intel Software Products, will share tips and lessons learned through open-sourcing Intel Threading Building Blocks.
Speaker: James Reinders
Unlisted on Apple Developer site
Downloads from Apple
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Well, good morning. My name is James Reinders, and I work for Intel. I have the pleasure this morning of starting off a couple of talks on how to use parallelism, some things about parallelism. I'm going to talk this morning first about, um,
[Transcript missing]
for more than a couple of decades. I've worked on small-level parallelism, large-level parallelism, and I guess I didn't think that the day would come when all of us would get a chance to work with parallelism.
I'm having fun with it. I have two quad core machines at home already, and I expect that number to keep going up. It's fun to have some parallelism, even at home. Of course, my wonderful Apple laptop's a dual core, so that's pretty cool. So hopefully what you get from my talk this morning is a way to judge the different ways to use parallelism, the likelihood of being successful with it for now and carrying it into the future.
Herb Sutter likes to talk about the concurrency land rush. I think that's something he blogged about and it amused me because you're just starting to see a lot of announcements. Hey, try this for parallelism. Use this, use that. So hopefully I can give a little bit of a perspective on the questions you can ask back about is this the solution for me, what will work. And what's fun is there are a lot of great solutions today already available, but there's many more to come.
So let me start off and sort of level set what we're talking about here. At the beginning of the week, I heard a comment that's pretty common. People often just say, hey, processors are going multi-core because of power. That's a good reason to keep in mind, but there's actually three problems microprocessors were having with ever-increasing clock rates.
And to me, the reason it's important to know that there were three reasons and not just one is that we aren't going to get a breakthrough one day on one of these and suddenly go back to single-core processors. So the three reasons are power is one. Every time we doubled the frequency on a processor, the power consumption quadrupled. Well, we would shrink it in half, so it only doubled.
So every time we increased the frequency, the power doubled. But the other thing is, we were getting into more and more trouble where when you increase the frequency, what are you trying to do with it? You're trying to execute more instructions in parallel, instruction level parallelism. And that was getting harder and harder to find, at least at the clock rates we're talking about. And then memory's not getting faster. So lo and behold, the faster we make a processor, the more we wait for memory. These three walls were putting the damper on increasing clock frequency.
So we're going to talk about multi-core processors as a solution. And we see that in several different domains. And one thing that is useful to think about is GPUs, as they get talked about, are actually multi-core processors. So, there were some graphs earlier this week that drive me a little nuts. Some of you may have seen them. CPUs flatline, GPUs going faster. Well, that's kind of cool, but the GPUs were growing at Moore's Law, doubling about every 18 months. And the CPU was flat. And why was that? Well, it was because they weren't counting the multi-core aspects.
Unfortunately, when you say that GPUs are speeding up, you're giving them credit for their multi-core aspects. When you flatline CPUs, you're not giving them credit. Eh, you know, maybe a nit, but it may overstate how much fun GPUs are compared to multi-core. But I'll get back to that when we talk about programming, because there's a lot of fun in the architecture ahead.
So when we were doubling the clock frequency every 18 months or so, programs just got faster, sort of. I never quite saw a program run twice as fast on a 2 gigahertz machine as it did on a 1, but close enough. But now with multi-core, obviously we need to see parallelism at some level. Multiple programs running, one program using multiple threads, whatever.
So this has been called the free lunch is over. Pretty basic stuff. And how fast is it happening? Well, you know, Intel didn't invent multi-core processors. Putting two cores on a die or multiple threads has been around a while, but if you look at Intel architecture at x86, dual cores come out in 2005, followed by quad cores in 2006. We'll be shipping six-core processors this year. Eight-core processors will be not far behind.
The trend is there. And it's not very difficult to get a four- or an eight-core machine or even 16-core these days because you just put a few of the processors together. So parallelism is really, really here. In fact, to go a little further -- well, we demoed an 80-core research chip. But going a little further, let me make a prediction.
This is not a roadmap announcement. Of course, it's an NDA event, so don't run off and, you know, blog too much. But within two years, you're going to be able to walk down to your favorite store, and I just listed a bunch. And you'll be able to buy machines with more than 16 cores.
And I'm not talking exotic, super expensive machines. In fact, I'll go further. I actually think that inside three years, it's going to be closer to 40 cores. Now, that may sound a little audacious. So I think that graphics is going to drive a desire for multi-core. I think you're already seeing hints of that with the interest in using GPUs, but I think it's actually long-term going to be a question of multi-core CPUs.
So here's, you know, your basic graph. Free lunch is over. It means we've got to do some concurrency. Now, a couple of things I want to point out, the way I draw this graph is we're going from a gigahertz era to a multi-core era. Multi-core to me is two, four, eight processors.
But then I have this mysterious term on here, many-core. While we're all trying to figure out how to take our applications and start using two, four, eight cores, out there looming is this idea of what I call "many core." And many core, to me, is more than 16 cores.
And I'm going to go through some of the fundamental issues we face with parallelism, and one of them is scalability. And let me tell you, when you get past 16 cores, you really have to have your act together with parallelism. You can't get away with band-aids. And this many-core, or I actually called it tera-scale here, it's another term that's popularly thrown around for this, tera-scale, more than 16 cores, it's going to be a reality, and it's going to be a reality before the whole industry is embracing multi-core. To me, that's very exciting, but it really drives home the point we need to worry about parallelism. In fact, within a decade, being a programmer and saying, I don't do parallelism. Really bad idea.
Really bad idea. You might as well just go find another profession in 10 years if you don't know something about parallelism. So, are we ready for this? It's Friday morning. It's been a long week. So I thought I'd have a little fun. Let's grab some mail. And hopefully this is not too corny for us all, but wake you up a little bit. So these are, I've changed the names to protect the innocent.
So I completely rewrote my code again for octa-core. It ran great on dual-core, but it ran terrible on octa-core. Actually, I think I saw something like this on the Apple performance mailing list. And I also don't understand Joe's code. So it's easier, my new code's easy for me to read, but no one else will understand it. This is a very common thing in parallelism is, you know, you get something tuned on a few cores, it doesn't work on a few more.
I can't read your code, I can't read your code. So I call it spaghetti threading. It's a term that seems to make sense to people. And when you see spaghetti threading, you know it. This is when you are tweaking all these things, you're playing with Pthreads, and you're just getting a little too exotic, a little too smart for yourself. So in any case, code that looks like spaghetti, code that's been crafted really close to the hardware and so forth, very difficult to debug, hard to scale.
So the first key point I want to make is you really need to look for ways to abstract your parallelism. One of the things I think that most people have come to agree on is that we need to program in tasks and not threads.
[Transcript missing]
So I do have my three favorite things to talk about when people say, what should I use for parallelism? Threaded libraries. No particular threaded library. I just like the concept that if my work can get done in parallel and someone else can write the code, I might as well let them do it.
So, you know, there's some excellent examples. Scientific people can use different math libraries. Intel has a Math Kernel Library. If you're doing animation, you know, you can rely on the Apple Core Animation capabilities and let Apple do the work getting those to run in parallel and you call them. So it's kind of funny. The reason I put this one first is it's the easiest to do.
It doesn't apply to a lot of your work. It doesn't apply to all of your program necessarily. But it's also often overlooked. You know, it's really a lot of fun to take a program and make a few calls to a better threaded library and have it run a lot faster. Don't overlook it.
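To make the idea concrete, here's a minimal sketch, assuming Intel's Math Kernel Library is installed and linked in its threaded configuration; the matrix sizes and function name are just placeholders.

```cpp
#include <vector>
#include <mkl.h>   // Intel Math Kernel Library (assumed installed and linked)

// A sketch of the "let the library do the parallelism" idea: one call to a
// threaded BLAS routine. The parallel work happens inside MKL; the caller
// never touches a thread.
void multiply(int n)
{
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);

    // C = A * B. When linked against the threaded MKL, this single call
    // spreads the work across however many cores are available.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, &a[0], n,
                     &b[0], n,
                0.0, &c[0], n);
}
```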
Another capability is OpenMP. This is available in many, many different compilers. It's a C, C++, and Fortran construct. It's been around for about 11 or 12 years now. And it's hints to the compiler. Compilers aren't quite smart enough to run stuff in parallel. You put a few hints before a loop and off you go. Again, really easy to use. Very practical.
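As a rough sketch of what those hints look like (the loop and names here are illustrative, not from any particular program):

```cpp
// Illustrative only: one OpenMP hint in front of an ordinary loop. With an
// OpenMP-aware compiler (for example, built with -fopenmp) the iterations are
// split across the cores; without the flag the pragma is ignored and the
// loop simply runs sequentially.
void scale_samples(float* samples, long count, float gain)
{
    #pragma omp parallel for
    for (long i = 0; i < count; ++i)
        samples[i] *= gain;   // each iteration is independent, so splitting is safe
}
```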
It scales. It tends to keep you away from programming bugs, so forth. Finally, we've got Intel Threading Building Blocks. Been a very successful way to extend C++ for parallelism. Very aimed at C++ programmers. Addresses key issues of thread separation, safe data structures, how to program in tasks, how to do scalable memory allocation. Definitely worth a look.
Threading Building Blocks, full disclosure here, yes, this is my O'Reilly book on Threading Building Blocks. It's a really aggressive, fun way to thread C++ code. We've had some fantastic programs come out over the last year. It's been ported to many, many platforms. It's been on Mac since the first days. It's on G5 machines as well as Intel.
It's been ported to Intel-based machines. It's been ported to Xbox. It's been ported to SPARC machines. It's widely available on many, many different processors and many operating systems. So it's quickly becoming a very common way to get parallelism in C++. Very worthwhile looking at if you're a C++ programmer looking to add parallelism.
Hmm, more mail. My program crashes mysteriously, but only some of the time. And it always works when I run it inside the debugger. What shall I do? Signed, Intermittent. I'd love to get some customers of our tools up on stage, because Intel does a variety of tools, including some that can find race conditions and deadlock. This is a very common problem, and it's worth talking about a little. You write a parallel program and then it becomes intermittent. It doesn't run the same all the time. And the two key issues are race conditions and deadlock.
A race condition happens when you don't synchronize the way that you should, and deadlock is when you're over-synchronized or your one part is waiting for another part. But what really is important here is there are actually some ways to program that are more likely to run into these problems and other ways not.
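As a hedged illustration (made-up fragments, not from any real program), here is what each of those looks like at the Pthreads level:

```cpp
#include <pthread.h>

// Race condition: two threads run this with no synchronization. The
// read-modify-write on the shared counter can interleave, so increments
// get lost and the final total varies from run to run.
long hits = 0;
void* count_hits(void*)
{
    for (int i = 0; i < 1000000; ++i)
        ++hits;                        // unsynchronized shared update
    return 0;
}

// Deadlock: one thread takes lock_a then lock_b, the other takes lock_b
// then lock_a. If they interleave, each waits forever for the lock the
// other already holds.
pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void* worker_one(void*)
{
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);       // may block forever
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return 0;
}

void* worker_two(void*)
{
    pthread_mutex_lock(&lock_b);
    pthread_mutex_lock(&lock_a);       // may block forever
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return 0;
}
```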
If you're calling a threaded library, if you're doing OpenMP, if you're doing Threading Building Blocks, if you're doing NSOperation or Grand Central Dispatch, if you're calling the Core Animation library, you're probably doing things that will avoid, in general, causing these problems. If you're diving into Pthreads and doing your mutexes yourself, making a general-purpose attempt at parallel programming, you're much more likely to run into these problems.
I'm not saying that you have to use techniques that completely avoid these, but the more you use techniques where these can occur, the more you need to pay attention to how you're going to debug them. So there are getting to be some excellent tools in the marketplace, including some from Intel, that can help find race conditions and deadlock. I expect to see a lot more in the future. I don't think there's nearly enough of this currently in the marketplace to help with these, but it's really important to look at this.
And again, if you use a higher-level abstraction, you're less likely to hit these problems. So when you see different solutions for parallelism advertised, you should think about, is it abstract, and does it help me avoid these parallel programming bugs? So when people ask my opinion about different parallel languages and things being touted and committees being formed to go work on things, I commonly come back to this and say, you know, it either is going to help us solve this problem or it's not.
And I'm not a big fan of new parallelism initiatives that don't help solve this problem. I think that they just don't help us get more parallelism in applications. All right, I think I got one more letter. My program actually runs slower on an octa-core than on a quad-core machine. And someone said scaling was a factor. No, it doesn't have to do with fish.
This is how I look at scaling. You'd like a program perhaps to run eight times as fast on an octa-core as it did on one core, but it's not going to. That would be called ideal scaling. But did you expect a machine to run eight times as fast on an eight megahertz processor as it did on a one megahertz processor? Probably not. So multi-core is not really new in that what you're trying to do is write a program so that it speeds up as you add cores, but you don't need to be hung up on making it ideal.
I mean, if something runs four times as fast on octa-core, but it also runs eight times as fast on 16 cores and 16 times as fast on 32 cores, you're in really good shape. A lot better than most people think. The killer, though, is that if you write your program in a way that doesn't scale, this is a really common problem as well. This is a real example of a 3D ray tracing program. We took some work. We had one of our experts do a very nice job and hand-thread the code using Pthreads.
And if he only had a quad-core machine to run it on, it looked pretty darn good, because what you've got is you've got speedups on quad-core of 3.76. On one example, the hand-coded was 3.47. So a 3.5x speedup on quad-core sounds pretty good, and that was the hand-coded program here. The problem is that somewhere around five or six processors, the speedup tapered out. In fact, if you keep running it on more and more cores, it gets slower.
And it's a global bottlenecking problem. It was a very nice little program the way that it was written to scale by hand with Pthreads, but he used a central computation and divided the work up evenly. And it turns out that that works pretty well until you get to a higher number of cores. Brilliantly written program has to be completely rewritten once you get on an octa-core machine because it just doesn't scale. Now, this particular example, I did with Threading Building Blocks. I get identical results on this example with OpenMP. So really the key here is abstraction.
Now what's really, really frustrating about this is that the code on the left, using Pthreads, shows that I had to add a whole lot of code to get it to run in parallel. The code on the right shows I barely had to add any code at all to get my application to run in parallel using Threading Building Blocks. And the reason is I've circled the code. I know you can't read it, but it's just a little loop. It does the ray tracing. That's the core algorithm. It's a few loops. It does the operations.
All I want to say to my machine is, run that in parallel. Just do it. You know, go. And that's basically what you do in a good abstraction. Go run it. Now, with Threading Building Blocks, it's mostly include statements. I think I had to add 17 lines of code to the entire program to get it to work.
On the other hand, the complete transformation to Pthreads was almost 200 lines of code. And it's not hard code, you know, create some Pthreads, create some mutexes, compute some bounds, kick them off, wait for them to finish, shut down the Pthreads, shut down the mutexes, send them all away.
But it just isn't value-added code. So when you start hearing things suggested, Threading Building Blocks, or you look at Grand Central Dispatch, the blocks and the NSOperation things that were talked about this week, those are aimed at giving you an abstraction layer where you just say, run it in parallel.
Just go. If you're spending your time writing a lot of code setting this up, I can promise you that's not the way of the future. That's not what parallel programming is going to look like. So you might as well find the abstractions that avoid doing that and look for them really hard, even if you have to invest a bit in learning and so forth. You don't want to do that. You don't want to be writing code where you've got hundreds of lines of code just setting up to run a few lines of code in parallel.
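For a flavor of what "just say go" looks like in Threading Building Blocks, here's a generic sketch in the style of the era's TBB interface; it is not the actual ray tracer from the slide, and trace_one_row is a hypothetical routine standing in for the existing loop body.

```cpp
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

void trace_one_row(int y);   // hypothetical: the existing sequential loop body

// The existing loop, wrapped in a small functor so TBB can split the range.
struct TraceRows {
    void operator()(const tbb::blocked_range<int>& rows) const {
        for (int y = rows.begin(); y != rows.end(); ++y)
            trace_one_row(y);
    }
};

void trace_image(int height)
{
    // "Run that in parallel. Just go." TBB decides how to split the rows
    // and load-balances them across however many cores are available.
    tbb::parallel_for(tbb::blocked_range<int>(0, height), TraceRows());
}
```

The point isn't the exact syntax; it's that the only thing expressed here is the parallelism itself, not thread creation, bounds computation, or shutdown.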
Now, remember, the Pthreads version is the one that doesn't scale past octa-core, so that's really frustrating: it added a couple hundred lines of code, and the program still didn't scale. And the reason the Threading Building Blocks version does scale is because Threading Building Blocks uses a very sophisticated algorithm to divvy up the work and load balance it across multiple cores. The same thing you can expect to see from multiple operating system vendors in the future as well.
We're really at a point where everybody's trying to solve the same problem. They'll eventually come together and get unified. We don't have specific plans right now for Threading Building Blocks to sit on top of Grand Central Dispatch, but my prediction would be in the future, that's the sort of thing we would do. All tool vendors would move on top of that. There's equivalent things going on in Linux.
There's equivalent things going on in Linux and things going on in Microsoft OS as well where they're saying, "Hey, we need to provide this core functionality where we distribute the work, and we're responsible for load balancing." And that can lead to the scalability that you want. So you need to look to not hard-code that into your program. If you're looking and starting your program up and saying, "How many cores are there? I'm going to kick off a bunch of threads," you've already lost. If I haven't convinced you just thinking about octa-core and so on, this stretches further into GPUs.
So an ideal programming language that load balances across CPUs and GPUs wouldn't worry about exactly what they can do differently or how powerful they were. It would just say, "Run this in parallel." And something in your runtime would dispatch and divvy things up. I'm quite confident that's the way programming is going to be. Thank you.
Even forgetting GPUs for a moment, future CPUs, if and when Intel builds a 100-core CPU, it's not going to look like our quad cores. Our quad cores are four identical cores. Powerful, out-of-order execution engines, just lots of cache. The day that we wake up and build a 100-core CPU, most of them are going to be itty-bitty cores without out-of-order and so on, because they're more efficient.
When you write a program that can use 100 cores, you don't need each of those 100 cores to be big, fat, power-hungry things doing a lot of out-of-order scheduling to try to speed you up. What you'd rather do is use that silicon area to have a processor that was lean, mean, and ran fast. And if I can give you two or three of those cores in the same die area, you'd rather have that because your program scales.
There's no way we're going to build a 100-core machine with all out-of-order engines like we do for quad-core. If we build a 100-core machine, or when we do, and that's not going to be tomorrow, you can count on the fact there'll probably be a few big out-of-order engines, but there'll be a bunch of smaller ones, maybe specialty ones. Again, go back to this example.
How should I write this example? I should write this example to say run it in parallel. And then in 10 years, when it's running on a machine that has 80 little cores and 20 big cores and maybe 10 specialty cores, I don't have to rewrite my program again.
Sometimes we use the term future-proofing, which maybe is a little bit more of a promise than anything, but it's, you definitely want to write your program so that you're not down mucking in the details of exactly how to dispatch things. And if your program starts up and says, how many cores are there, and then you divide the work up evenly across the cores, you're going to fail for multiple reasons.
One reason is you don't have exclusive use of the machine, and a few of the cores are going to get busy doing something else, and your whole program's going to run only as fast as the weakest link. You'll already see that on a quad-core machine. If you divide up the work evenly among four cores, I think there were some really great animated graphics on Monday that showed this, you know, with little ping-pong balls going in troughs and moving around depending on the workload.
Again, the key idea there was write your program in terms of ping-pong balls and throw them into these troughs. Don't write your program in terms of how many cores there are and dividing the work up evenly. Not a good idea. So I wanted to show you a little bit.
[Transcript missing]
And if I had done that a couple years ago, the people would have said, "It's too hard. We don't see the need for it." Only 27% of the people we talked to, and we talked to some pretty good developers here, very influential group, only 27% were willing to say that they didn't think it was needed.
Now, I think there's a little bit of shyness here. I think that people are starting to think, "Oh, my gosh, multi-core is coming. Even if I don't know what I'm going to do, I'm not going to tell someone who's surveying me that I don't need parallelism." Okay, I understand. There's some of that going on.
But for three-quarters of the developers to be saying, "Hey, we're going to do something," more than 50% of them blaming it on schedule, saying, "I just haven't figured out how to fit it in my release schedule, how to allocate people for it." One other fuzzy detail, at Intel we try to track how many applications, influential applications, that's ones that we think sell silicon or cause people to buy machines.
By our estimates, the number of applications on the market at the end of last year that used parallelism was about twice what it was at the beginning of the year, which is rather phenomenal, because at the beginning of the year, those were the applications using parallelism that had been developed since, you know, the beginning of time, basically.
It doubled last year. We're just seeing a tremendous rush towards this. And, yeah, if you're curious, this was around-the-world phenomenon. This happened to be the distribution of the people we talked to, but when we looked at the data, we didn't see a difference in the trend in any particular geography. Now.
I promised I'd try to keep things reasonably high level. And the talk right after me, they said not to promise too much coding, but I know they're going to dive down a bit and show some examples and things, because that's very important. But I'm constantly amazed when we work with companies, some companies that, you know, again, have fantastic programmers in them, that they get detached from these three key things. They get enamored with a particular technology.
They're going to rush off and implement something with Pthreads or whatever. And we often bring them back and say, look, you've got to look at the fundamentals here. And the three fundamentals are scaling. Do you have a plan, as we continue to throw more and more cores at your application, do you have an idea how it's going to scale? A second one is, do you have something you're doing to keep the debugging under control? Because intermittent programs are bad. I know one company in particular that shipped an app, hugely popular app. Again, they won't come on stage and say, "Hi, I'm here from such and such company. We wrote a bad app." But they had customers saying that the program ran 90 times out of 100.
They still love the app so much, but they just begged them, "Would you please make this thing reliable?" And then future-proofing, you know, are you adopting a technique that isn't just a fad, something that's tightly tied to today's hardware, doesn't really liberate you as a programmer, doesn't really give you the opportunity to expect to take care of a very rich future in hardware? I'm not just talking about Intel hardware here. It's, you know, if Intel doesn't build really interesting chips that we all use, someone else will.
Nothing's going to stop the chips getting very, very interesting in the future. So we might as well, as programmers, not go hug the hardware and write in a, you know, in a language or a technique that's just specific for today's hardware, or not invest a whole lot in that.
So I wanted to shift gears just a little bit and talk about it from a different angle because I hope you'll give thought to the scaling, the debugging, the correctness. You know, when you're working on your applications, when you're advocating techniques, when somebody comes and gives you a talk, you should use this technique instead of that one. You know, I give lots of talks on Threading Building Blocks. I say, you know, it solves everything.
Well, when I do a talk like that, you should be thinking, you know, James, can you explain to me why you think this helps with scaling or why this is future-proof? Can you describe how it would be used on a future machine with hundreds of cores? And the answer is yes, yes, yes. And I can try to do that for any technique being offered out there. And so I'd encourage you to think about that with scaling, correctness, and the future-proofing.
But shifting gears a bit, I wrote an article last year for Dr. Dobb's that was pretty popular. What I did is I tried to sit down and write out eight tips. And this is a pretty easy article to find online; Google for my name and Rules for Parallelism.
It's a short article and so, you know, when you're rushing to write code, instead of giving these really abstract scaling and correctness and future-proofing, can you give me something a little bit more specific I can sink my teeth into? So the first thing: as programmers, we need to think about where the parallelism is. You need to fundamentally get used to figuring out what is parallel in your program.
Now one thing that's real interesting to me about that is one of the things I know we've learned is that nested parallelism is fantastically important. When you stare at one part of your program and say, "How can I make this run in parallel?" Your eyes will pop out of your head. It's so frustrating.
But if you apply that at a high level of your program and then lower levels have parallelism and you can keep expressing the parallelism wherever it happens, you get a lot better scaling, a lot better performance. There are not very many programming techniques out there right now that will help you with nested parallelism.
I guess that would be one reason I advocate threading building blocks a lot. OpenMP has added a few things, and there are other programming techniques that need to look at this. If you look at all the GPU languages that are being proposed out there, whether it be CUDA or OpenCL, you'll find that a thread or a task cannot, in general, create more.
And that's a mistake, because it means that you can't embody nested parallelism. So it's a critique of quite a few languages, what I just said. And again, my prediction in the future is that they either, all techniques either need to correct that or they'll die off. So we've really learned that nested parallelism is way more important, I think, than any of us thought it would be in practice. That if you want a scaling app, it needs to be able to express parallelism. So after you kick off a task, if it realizes it's got a lot of work to do, it can break itself up into more tasks.
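Here's a rough sketch of what nested parallelism looks like with Threading Building Blocks; the scene and cell names and routines are made up for illustration.

```cpp
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

void process_cell(int scene, int cell);   // hypothetical work routine
int  cells_in(int scene);                 // hypothetical

// Inner level: parallelism over the cells of one scene.
struct ProcessCells {
    int scene;
    void operator()(const tbb::blocked_range<int>& cells) const {
        for (int c = cells.begin(); c != cells.end(); ++c)
            process_cell(scene, c);
    }
};

// Outer level: parallelism over scenes. Each task expresses more parallelism
// inside itself; the scheduler folds both levels into one pool of tasks
// instead of oversubscribing the machine with extra threads.
struct ProcessScenes {
    void operator()(const tbb::blocked_range<int>& scenes) const {
        for (int s = scenes.begin(); s != scenes.end(); ++s) {
            ProcessCells inner = { s };
            tbb::parallel_for(tbb::blocked_range<int>(0, cells_in(s)), inner);
        }
    }
};

void process_everything(int scene_count)
{
    tbb::parallel_for(tbb::blocked_range<int>(0, scene_count), ProcessScenes());
}
```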
So my tip started off with, you know, think about the parallelism, know where it is. And the reason I talked about nesting is if your brain starts saying, hey, well, I've got this to run in parallel, but some parts of what I'm thinking about might be parallelism themselves, good, no problem. You should be able to code that. The other thing is you need to be able to program using abstractions. I already talked about why.
Very important. You should program at a level of task, not threads. To me, a thread is, you know, when you're programming saying, hey, I'm going to have one thread for each processor, and I'm going to figure out what each thread does. That's what I mean by threads. Don't do that. Liberate yourself and say, hey, I want to do this, I want to do that, I want to do that.
You know, I find it very compelling to talk about tasks and to get going. And you saw some great animations at the beginning of the week that sort of illustrate how you think about it. And these blocks extensions to Objective-C try to capture that. The C++ standard undoubtedly will add lambdas and will add futures. So you'll see this trend in many, many languages and so forth to say let's start talking in tasks, not threads.
Now, a surprising one, or one that I cannot overemphasize how important I think this turns out to be in practice: don't write a parallel app that can't run sequentially. Once you get into parallelism, it's really cool. You can write applications that can't be run sequentially. Well, you might as well shoot yourself in the head now.
It's just not a lot of fun to debug an app that can only be debugged in parallel. Frankly, I'm a programmer. Programmers aren't perfect, myself included; I occasionally have to fix a loose pointer or a mistake or something in my code that's just a no-brainer. Really easy to do when the program's running sequentially.
I get a memory fault or whatever, I can figure it out. I can run it ten times, it keeps failing the same way. When I run it in parallel and it starts doing bizarre things, I start thinking, "It's a parallel programming bug." No, most of my bugs are not parallel programming bugs, they're still the ordinary bugs. And I find it much easier to debug them if I'm running the program and it's not expressing itself in parallel, at least the part I'm trying to debug.
So I pay attention and I think programmers, experienced programmers pay attention to, "Okay, I've done all this work to make it run in parallel, how can I force it to be sequentially debuggable?" Because I want to get past this debugging phase, I don't want to spend a week tracking down a bug that I thought was a race condition and it turned out to be the use of the wrong pointer. Something that I used to be able to debug quickly.
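One way to do that, as a sketch using the task_scheduler_init interface from the Threading Building Blocks of this era (the --serial flag and run_the_application are made up), is to let a debug switch pin the scheduler to a single thread:

```cpp
#include <string>
#include "tbb/task_scheduler_init.h"

void run_the_application();   // hypothetical: all the parallel_for calls live in here

int main(int argc, char* argv[])
{
    // Debug switch: with --serial the TBB scheduler gets exactly one worker
    // thread, so the "parallel" code runs one task at a time and ordinary
    // bugs (bad pointers, logic errors) reproduce the same way every run.
    bool serial_debug = (argc > 1 && std::string(argv[1]) == "--serial");

    tbb::task_scheduler_init init(
        serial_debug ? 1 : tbb::task_scheduler_init::automatic);

    run_the_application();
    return 0;
}
```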
So I had a few other tips, but I want to zoom ahead. You want to understand locks, you want to get cool tools, you want to use scalable memory allocators. You read my article, think about that. But I wanted to go ahead and do another one that is a subtlety that I think, as experts in parallel programming, you need to be really well versed in.
So what if someone walks up to you, now this has happened to me, and says, "Amdahl's law proves that this is all a bad idea."
[Transcript missing]
So what you want to do is you want to make sure that as you're thinking about parallelism, you pay attention to the fact that we're constantly giving more workload to our applications. I'll come back and give you some examples on that in a little bit, but let me start by talking about Amdahl's Law.
This is, you know, this is one of those topics that comes up at least over beers sometimes. So let's assume we have an application. It has five parts. They each take 100 seconds to run. They run one after the other. Let's say that I know how to make a couple of them run in parallel.
Well, if we take a look at Amdahl's Law, when I get to dual-core, I can get a 25% speedup in my program, and when I go to quad-core, I can get a 40% speedup. In fact, if I keep making those two parts run in parallel, I can eventually get them effectively to run in no time.
Unfortunately, my program still takes 300 seconds to run. I only get a 70% speedup even if I use this, you know, one million-core machine on a couple parts of my program. Okay, so this is the high-level proof that multi-core is doomed. Nobody's going to succeed. We'll all just go home. Amdahl's Law.
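Putting numbers on that example: 500 seconds of total work, of which 200 seconds (the two parallel parts) can be spread over p cores, gives Amdahl's fixed-size speedup:

```latex
S(p) = \frac{500}{300 + 200/p}
\qquad
S(2) = \frac{500}{400} = 1.25,\quad
S(4) = \frac{500}{350} \approx 1.43,\quad
\lim_{p \to \infty} S(p) = \frac{500}{300} \approx 1.67
```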
So this is based on a paper written by Amdahl in 1967. What he actually said is that the effort spent on making parts of your program run faster and faster are going to have diminishing value unless you do something about the sequential parts. And so for years, this was used as, this is why parallelism won't work on a large scale. And not everybody took them completely seriously, but no one really articulated well why this doom and gloom scenario that I just showed wasn't correct.
21 years later, John Gustafson said, "Hey, you really ought to measure the speed up, not just by fixing the problem size, but by scaling the problem size up." So let me illustrate. What if when I run in parallel, I'm able to give more data, more work to these parallel sections? Now this is not... This is not a completely contrived example. Imagine that these sections are the loops that are walking through my image and I'm increasing the image size or the number of images I'm processing or something, and that the other sequential work is mostly not a lot of work.
And I increase the amount of work to do. Now I've got a 40% speedup just on dual-core, at least ideally. 2.2x speedup if I go to quad-core. In fact, this just continues to scale indefinitely. It's about, I don't remember, 60% efficient or 70% efficient. You can do the math. But this program scales. Throw as many cores at it as you want and you can keep throwing more data at your application and it'll go faster. This one escapes a lot of people, and it's not intuitive, but it needs to be.
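Worked the same way, Gustafson's scaled speedup for this example (the two parallel parts each process p times the data in the same 100 seconds, so the work completed in the fixed 500 seconds of wall time grows with p) is:

```latex
S_{\text{scaled}}(p) = \frac{300 + 200\,p}{500}
\qquad
S_{\text{scaled}}(2) = \frac{700}{500} = 1.4,\quad
S_{\text{scaled}}(4) = \frac{1100}{500} = 2.2
```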
I used to have a 40-megahertz laptop. I think my laptop now is 2-point-something gigahertz. That's at least a 50x improvement in clock rate. I'm trying to figure out what runs 50 times faster on my laptop. It doesn't feel that much faster. But what happened to that performance? Where did it go? Oh, I'm doing Wi-Fi negotiations. I'm doing encrypt, decrypt. I'm doing smooth fonts. I'm doing smooth scrolling.
My screen has, you know, 4 to 8x the pixels that it did before. That's a lot more processing. You know, thank goodness for HDTV, right? Because all these applications we wrote to run on PAL and NTSC now need to scale their data. So Intel needs to build faster processors to help us play with high definition. And audio's doing the same thing, you know? Give you more processing power, we'll process more audio. It's not that I'm going to process my DVD in iDVD faster. It's I'm going to add more effects. I'm going to do high definition content, and so on.
So as you look to adding parallelism in your program, think a little bit forward. What would you like to do to process more? That may be the easiest place or the most beneficial place to put parallelism into your application. There's a balance. You often look at a program and say, wow, I can speed this up and my program will run faster. But the other place to look for it with phenomenal results is where can you add things that process more data, look ahead, do things that you wouldn't have considered before because it was too expensive, but now you can add with the power of parallelism.
That's what we're really going to see. That's why my 40-megahertz laptop's not fast enough for me anymore and I need a 2 gigahertz: I'm too used to the barrage of extra data, extra processing that my machine's doing for me. It's not that any one program runs 50x faster. My mail program, for instance, is not running 50 times faster, that's for sure. I'm quite confident it was faster before.
So let me show you, this is a foil that our CTO uses. I've seen foils like this from other people. They show, oh, the world's gonna be taken over, you know, more processing power and so on. What's really interesting is the graph here, and this is a very common sort of graph, shows data size and performance going up. There's a reason for that. You really don't just make the performance go up. The performance goes up as long as you assume the data sizes get bigger.
[Transcript missing]
I wanted to highlight a couple of websites, some places that I think are useful for getting more information. One of them's for dealing with today. Again, I'll plug Threading Building Blocks. We've got some forums there and so forth. It's a fun place to go and sort of chat. If you're a C++ programmer, I'm recommending this. It's open source, and it's just an interesting place to start. C++ programmers, I highly recommend taking a look at this.
Now, if you're more into the, wow, I wonder where the world's going to go. James didn't talk about really exotic technologies. Yeah, I usually try to emphasize practical things because, you know, if you're going to run out and try to write a program today that runs on lots of machines, you really need to look at these things that may sound a little boring. I kept emphasizing threaded libraries, open MP, threading building blocks.
But that doesn't mean that there isn't a lot of exciting work going on saying, how could we really change the world? One of them is something called transactional memory or software transactional memory. This, in short, you know, I hate locks, so let's do something different. Databases seem to have figured out how to, you know, do transactions.
Wouldn't it be cool if when I have a data structure and I'm going to update it, in order to be thread safe, if it was updated in a transactional fashion? This concept's not going to go away. We're going to keep beating our heads against the wall until we figure out how to put this in software and in hardware.
But frankly, nobody knows how to put it in hardware yet. Nobody really knows how to get it right in software completely. There are some nice little implementations. Intel's had a project to put software transactional memory into C and C++, and we have a free version of our compiler on our website that does that. I am not advocating going and downloading it and using it in your next program that you're going to ship or put in production. Please don't.
But we put a website together where we're putting some of our experiments up for people to play with and give feedback, community feedback. So we've got some experiments on futures and spawn. We've got this experiment on transactional memory. We've got some exotic adaptive library techniques that I think are extraordinarily promising. And I love to peruse things like this.
We've got a lot of stuff on parallelism. That's what makes it really a little more unique, although it's got some other things not to do with parallelism as well. Very cool place to go if you want to kind of get a flavor of what researchers are looking at, what problems we might solve that make parallelism even easier.
[Transcript missing]