WWDC07 • Session 102

Threading for Performance using OpenMP and Intel Threading Building Blocks on Mac OS X

Mac OS X Essentials • 56:10

Speakers: Phil Kerly, Mike Lewis

Unlisted on Apple Developer site

Transcript

This transcript has potential transcription errors. We are working on an improved version.

My name is Phil Kerly and I'm from the Intel Software Solutions group. Today Mike and I are going to be presenting on threading for performance using OpenMP and the Intel Threading Building Blocks on Mac OS X. This is session 102, so hopefully you are in the right spot.

I wanted to give a little bit of the motivation for this talk. If you look back at the history of Apple products, for the first 20 years we were basically on a single-threaded-execution type of CPU. After that we started hitting dual-core or dual processors; within about eight years we got into quad-core, which was just two years ago, and right now we are hitting eight cores.

Intel is extremely enthusiastic about trying to help software engineers improve their productivity in terms of threading, because this is going to keep increasing. We're not going to slow down in terms of the number of threads of execution we are going to be providing to developers. And I talk about threads of execution instead of processors because it used to be that we had individual CPUs, and then in the old Pentium 4 days we added something called Hyper-Threading, which enabled multiple threads of execution. Then we moved to multiple cores, and so there are many permutations that this really falls into for Intel. But in any case, we need to be able to increase our productivity in terms of threading our applications. So that's really what today is all about.

So I'm going to talk a little bit about OpenMP, the specification, and I am going to talk a little bit about Intel Threading Building Blocks, which we also call TBB. But the reality is, in an hour, hour-and-a-half talk there is no way that we can cover the full breadth and depth of these two technologies.

So what I am really hoping is that by the end of this talk I will have at least covered enough material to influence you to actually go and investigate these technologies; maybe we'll see a little spike in downloads of our Intel compiler or our TBB product. Hopefully that is what will happen over the next week.

So the agenda that I have for today is really to talk about what I consider to be the fundamental principles of threading for performance, just so that we have a common baseline, and to cover the OpenMP specification with a high-level overview. And then Mike Lewis is going to come up and do a little demo for us: taking an application that was already threaded, but threaded from a functional perspective, applying OpenMP to it, and then comparing that implementation to a POSIX-threaded implementation and looking at the performance and the implementation.

Then we'll move on to TBB, and again Mike will give us a little bit of insight into what it took to port that same application to TBB and look at its performance. And then we'll talk a little bit towards the end about where these tools leave off and where the real engineering begins: the hard part, the work we all have to do.

So I really have four fundamental principles, and the first one I have up here is parallel algorithms. Without parallel algorithms, whether they be functionally decomposed or data decomposed, if we can't come up with a solution for how we would conceptually parallelize the algorithm, then we are not going to achieve performance by threading.

We clearly have to minimize the overhead of threading. If you look at the old days, three years ago, if you threaded your application for a dual-processor platform, it was fairly easy to minimize the threading overhead compared to the amount of work that you were actually going to accomplish. But with eight cores that challenge becomes a lot tougher: as we start subdividing or data decomposing our application and our tasks, the overhead becomes a little bit larger compared to that amount of work.

And even once we have a parallel algorithm and we minimize the overhead, we also have to balance that workload across those execution units. As we go through the talk today we will go through a few examples where that is a little bit tricky, and see how both OpenMP and TBB can help with balancing those workloads.

And finally, anything that has to do with performance has to take into consideration the hardware architecture. All of these are actually very intertwined with one another. As you make your algorithm more parallel and you spread it out among more cores, the ratio of the work you accomplish for your task versus the overhead changes: the threading overhead gets relatively larger. And as we start subdividing that work and putting it across more cores, more threads of execution, balancing that workload across those multiple threads of execution becomes more challenging.

And we're increasing the differences in the hardware as we move from a true symmetric multiprocessing type of platform to something that's akin more to a NUMA-type architecture, where not every execution engine that you are running on necessarily has the same power or performance as every other core; there are differences between those execution units.

We clearly saw that with Hyper-Threading. With today's processors, or today's Mac Pro platform with eight threads of execution or eight cores, some of those cores share the same cache, some of them sit on the same bus and some don't, and so there are differences in the performance capabilities of those individual cores as they relate to one another.

So as I look at this list of fundamental principles of threading for performance, it's clear to me that some of the things we do, we tend to do repetitively, because they're more automatic, more mechanical, and less about the algorithm or what we're trying to achieve. For example, to use the hardware architecture we have to detect the number of execution units that we are actually going to run on: how many cores or how many processors or how many threads of execution does the hardware actually support? We need to detect that, and there's a pretty standard process for doing it; anyone doing multithreading for performance is going to have some technique or some natural way of doing that.

If you do any kind of cache blocking, where you detect the size of the cache and modify the parameters by which you are doing your parallel data decomposition, you have to detect the size of the cache from the hardware in order to make that change.

So those kinds of things tend to be more automatic, things that you'll do repeatedly from implementation to implementation. What I hope to show is that with OpenMP and with Intel Threading Building Blocks, some of that mechanical stuff that we are used to doing can actually be taken over by these technologies, allowing you to improve your productivity by focusing on the problem at hand and on the constraints under which your application is running. So let's talk a little bit about OpenMP. First, it's not a standard but actually a specification from a group of industry companies that were interested in promoting threading, and they specifically targeted shared-memory multiprocessing: not necessarily symmetric multiprocessing, but shared-memory multiprocessing.

The other thing they did is they kind of looked at that motivational slide and said, well, we have a lot of code out there that is actually serial code, and we want to get developers moving more and more into a threading environment.

So it really was started to encourage threading serial code incrementally. It's not a big-bang type of approach to threading. You can actually go to the openmp.org website and see the full specification, which also allows for vendor-specific implementations or extensions.

OpenMP is something that has to be supported by a compiler, and in this case Intel's C++ and Fortran compilers do support OpenMP on Mac OS X. So what is OpenMP? It's really a set of programming directives supported by some library routines. There are some environment variables that help you control the runtime environment, great for debugging and performance tuning. And then of course there are vendor extensions, and we'll talk about Intel's extensions as part of this session today.

Like I said before, I'm hoping to just provide an overview and not the complete specification; that would take much longer than an hour-and-a-half talk. But the main points of OpenMP and the programming directives are the four that I have listed here. There is what's called a parallel for, which allows you, using this pragma, to naturally parallelize for loops that have a defined start and ending index.

That's more of a data decomposition approach, but OpenMP also supports what's called parallel sections, or sections, which allow you to do more of a functional type of threading. If you actually wrote serial code in which you called function A and function B, and there were really no dependencies between the two and you could reorder them, then parallel sections might be a quick way to thread them.

And then there are some other natural things that can be done with pragmas, without having to be implemented using function calls, that support threading as well, such as critical sections and barriers, which all threads have to reach before they can move beyond them.

You augment these work-sharing constructs with data-sharing clauses appended to them. Here are a few examples: for the data you have within that for loop, you can define variables as shared or as private. You can have firstprivate, in which all the threads start with the same value at initialization, and then once the threads start executing the variables are private to the threads themselves. And then OpenMP supports such things as reduction.

An example of a reduction would be if you had an array and you just wanted to sum up all the values in it. You could logically construct a parallel for algorithm that operates on that array by subdividing it into smaller pieces, doing subtotals on each of those sections, and then at the end doing the reduction: taking the sub-values and adding them together. So OpenMP can support that type of paradigm as well.
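
As a minimal sketch of that pattern (the function and array names here are illustrative, not from the session):

```cpp
#include <omp.h>

// Sum an array in parallel: each thread accumulates a private
// subtotal, and OpenMP adds the subtotals into 'total' at the end.
double sum_array(const double *a, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}
```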

I talked a little bit about workload balancing, and it was clear that no single type of scheduling algorithm worked for all types of workloads, so OpenMP supports three types of scheduling algorithms: static, dynamic, and guided. As we go through the talk I will show examples of these individual scheduling algorithms.

The pragmas themselves are not sufficient to give a complete or robust set of tools for implementing threading, even on simple for loops. There are often application-specific activities that require more than just a critical section. And so the OpenMP specification provides some thread synchronization routines, in this case for setting locks, unsetting locks, and testing locks.
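
A small sketch of those lock routines in use (the shared counter is just an illustration):

```cpp
#include <omp.h>

omp_lock_t lock;
long sharedCount = 0;

void update(void)
{
    omp_set_lock(&lock);        // block until the lock is acquired
    ++sharedCount;
    omp_unset_lock(&lock);      // release it

    if (omp_test_lock(&lock)) { // non-blocking attempt
        ++sharedCount;
        omp_unset_lock(&lock);
    }
}
// Before use: omp_init_lock(&lock); when done: omp_destroy_lock(&lock);
```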

It also allows you to query and set some of the runtime execution values. For example, you can actually set the number of threads; if you don't, OpenMP will detect the count from the hardware and use that many threads across the parallel section.

But you can change that; not all applications are well suited to environments where the number of threads equaling the number of execution units is sufficient. And then you can also query, you know, what thread am I, so if you have some specific implementation in your algorithm, you can assign particular work to a particular thread of execution.
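
For instance, a sketch along these lines:

```cpp
#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_num_threads(4);   // override what the hardware reports
    #pragma omp parallel
    {
        // "What thread am I?" inside the parallel region.
        std::printf("thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```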

And then there are timing routines. If we're threading for performance, we want to know how long things take, and so some standard routines were added to support querying the system time and doing those types of evaluations.
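
The wall-clock routine looks like this in practice (a sketch; the work in the middle is whatever you are measuring):

```cpp
#include <omp.h>

double timed_work()
{
    double t0 = omp_get_wtime();
    // ... run the parallel region under test ...
    return omp_get_wtime() - t0;   // elapsed wall-clock seconds
}
```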

So the pragmas and the libraries are supported by the environment variables, so that in addition to what you are doing programmatically, you have the option of dynamically changing some of these parameters without having to recompile your code, which is great for performance evaluation. Here, the schedule variable basically allows you to define which of the three scheduling types you want and what the chunk size is. The chunk size is basically how you want to divide up that for loop: how big a chunk each thread should operate on.

And we'll cover more about that as we proceed. You can change the number of threads: we had the routine that allows you to change it in code, but we also have an environment variable that allows you to change it on the fly. So you could actually write a script that dynamically changes the number of threads, gathers your performance data, and finds the optimal number of threads that you need.

We have dynamic, which basically allows the OpenMP runtime to dynamically change the number of threads that you are going to use. Some OpenMP implementations support this and some don't; this variable allows you to enable and disable that type of feature. And then there's nesting. It's conceivable that you have a pragma on a for loop, and then somewhere in the body of that for loop you have another for loop; that would be a condition where you have nested OpenMP directives. And the question really is, if you're running on an eight-way system, do you really want eight threads of execution, or is your implementation such that you want to run 16 threads? So you can enable or disable nested parallel constructs, in which case that second parallel for would be ignored.

And with all of these, so you don't get too concerned that somebody can go in and start fiddling with the behavior of your application: if you actually use the library routines, they override the environment variables.

As for Intel's extensions to OpenMP, we allow you to modify and control the stack size. Some people are concerned about the footprint of their application; some are very knowledgeable about exactly how large the stack is and want to be able to control that. We have thread-local memory allocation routines, so that you can manage your own memory allocation within the routines.

And then Intel also implemented something called the KMP block time, which is really to control the spin waits of how the locks operate within OpenMP; we'll talk a little about that when I show the example. So here's an OpenMP for loop. It's fairly basic: it's a for loop with a defined starting index and a defined end (which happens to be a parameter), and there is another for loop within it. So this could be some serial code, and by simply adding a pragma omp parallel for, we can parallelize this loop very simply.

Now, the private clause is required because if you look at the second for loop, it is actually updating the J variable, and if we didn't control access to that J variable then any thread could modify it. So here we have to make it private.

Now, in OpenMP it's possible that most of your variables are private. So you may want to make the default for how variables are handled shared, or you could make the default private. Currently the standard default is shared. You can also specify the default as none, which means that you have to specify every variable, and sometimes it's actually good to do that, because then the compiler will spit out all of the information about which variables you have not declared as shared or private, and you have to make a determination on exactly how you want each one handled.

So in this case A, B, and N are all shared by default, and the variable I, which is the index for the parallel for loop, is private; it has to be private in that case. But what's interesting, if you look at this for loop, is that the inner loop is also dependent on the index I, which means that this is not going to be a well-balanced for loop. When I is one, J is going to go from zero to one, whereas when I is 100, that inner loop is going to go from zero to 99, or whatever, 100 in this case.
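
The slide's code isn't reproduced in this transcript, so here is a reconstruction from the description above (the loop body is an assumption; the shape of the loops is what matters):

```cpp
#include <omp.h>

void scale(float *A, const float *B, int N)
{
    int i, j;
    // The inner trip count depends on i, so iterations do unequal work.
    #pragma omp parallel for private(j)
    for (i = 0; i < N; i++)
        for (j = 0; j < i; j++)
            A[j] += B[i];   // illustrative body only
}
```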

So what would happen if we ran this using the OpenMP static scheduler, which is the default? We would get something that looks like this. We didn't specify the scheduler and we didn't specify the chunk size, so OpenMP is going to take the total workload, the total number of indices being operated on within I, and statically divide it by the number of threads that it's going to use. In this example I used four threads.

But because the workload is not balanced, what happens as the threads execute is that the first one may take a long time to run, the second one a little bit shorter, and so on, each a little bit shorter. So we have already violated the balanced-workload fundamental principle of implementing threads for performance.

And actually, this is a Shark view of a system trace. If you haven't used Shark I'd encourage you to do so; it's definitely very good for analyzing your threading behavior, especially if you are not familiar with your application. What we see in this view, which was taken on an eight-way Mac Pro system, is that all of the threads are busy, and the color coding indicates which particular core, or particular execution CPU for the OS, a thread is running on.

And so in this case you see some hopping around of those threads: sometimes the green runs on green, and then it moves to some other core that's a lighter green, or it runs on pink. So we see these threads moving around, but they're fully busy. And yet I just showed you that the workload wasn't balanced.

So suppose you use that Intel extension that allows you to modify the KMP block time and you set it to zero. In other words, if you come up against a lock, or you hit a point at which you can't make progress, don't spin for a while hoping that lock is going to be released so you can move on; actually go ahead and go to sleep. If you look at that, then this is the kind of view you get.

You see this stair-stepping effect, and that's because the second for loop wasn't balanced. So what we could do is still use the static OpenMP scheduler, but set the chunk size, instead of being N divided by the number of CPUs, to just, in this case for example, one.

And essentially, what this does is, instead of statically allocating all of the indexes to a particular thread up front, it allows a thread to run and, when it's done, go and pull the next chunk. Now, it's still static in the sense that all of the threads are going to get chunks in order, but at least now, instead of having the whole long workload running on one single thread of execution, we've spread it out over the multiple threads of execution. But it's still not extremely well balanced; you still have a little bit of a gap towards the end.

So what we can do is take that same parallel for with the private J variable and change the scheduling to dynamic, in this case with a chunk size of one. We could have chosen different chunk sizes, but we know that the amount of work in the second for loop changes as we go along, so there is no reason to define it as anything larger.

And in that case, OpenMP will allow the dynamic scheduling to look like this, in which every thread pulls the next work unit that it needs. So if a unit takes longer to execute, or even if the thread gets interrupted by the OS to go off and do some networking task or whatever, at least the threads will balance out, simply by adding that one scheduling clause to the pragma statement.
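
On the reconstructed loop from before, that one clause is all that changes:

```cpp
#include <omp.h>

void scale_dynamic(float *A, const float *B, int N)
{
    int i, j;
    // Each thread grabs the next index as soon as it finishes one.
    #pragma omp parallel for private(j) schedule(dynamic, 1)
    for (i = 0; i < N; i++)
        for (j = 0; j < i; j++)
            A[j] += B[i];   // same illustrative body as before
}
```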

So that was parallel for with the different types of scheduling algorithms. But you can also implement more of a functional decomposition: again, simply by adding the pragma omp parallel sections and then defining the sections, you can make each of those independent tasks run concurrently.

Now, the way OpenMP works is that all of the threads of execution within the parallel sections have an implicit join at the end of the parallel section. That's not always what you want to happen; sometimes these are independent threads and you may want to continue on and do something else and not wait. So OpenMP allows you to add a nowait clause to the end of that pragma directive.
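
A sketch of the pattern, assuming two independent functions as in the earlier example; note that nowait goes on the sections construct inside a parallel region:

```cpp
void functionA();  // assumed independent tasks,
void functionB();  // as in the serial example above
void otherWork();

void run()
{
    #pragma omp parallel
    {
        #pragma omp sections nowait
        {
            #pragma omp section
            functionA();   // the two sections run concurrently
            #pragma omp section
            functionB();
        }
        // With nowait, threads fall through without the implicit join.
        otherWork();
    }
}
```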

So what we're going to do now is show a little demo. But first I am going to have Mike come up and explain a little bit about the demo, what we've done in terms of analyzing the application, and show some performance results of using OpenMP compared to POSIX threads.

( Pause in speaking. )

[Mike Lewis]

Thank you, Phil. We put together a demo, and I think it shows this technology off really well. I'll first start with an introduction to what our demo does and how it functions, so you have a little background before we show you actual code. Here we have some pseudocode. It shows you pretty much what would happen in a typical particle system.

In this loop we have everything attracted to one gravity point; it's an order-N complexity loop. Even though there are a lot of operations that go on when it recalculates the momentum and velocity vector, it's still order N, so you can have a lot of particles without needing a whole lot of computing power. But if you want something a little more complex, say you wanted to simulate particles in space that are all attracted to each other, that's order N squared complexity. You would probably have a loop like this.

In this loop we have a little optimization: because of Newton's third law, we can update two particles' velocities at the same time. Unfortunately, we wanted to make this thread safe; we wanted to implement OpenMP for it, so we had to make a compromise. If we split this up with an OpenMP parallel for right now, particle J could have a race condition when we decrement it, and this isn't good.

There are other algorithms where we could split it up and still keep this optimization, but for simplicity's sake we decided to just refactor it. So it gets about 60 percent of the performance, and we can thread it with OpenMP. I'll show you the demo we worked on right now.
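
The demo source isn't in the transcript, so this is a sketch of the two loops as described, with assumed names (a Particle type with a velocity Vector, and a force(a, b) helper computing the attraction between two particles):

```cpp
// Original: Newton's third law lets one pair update both particles,
// but p[j].velocity is written from many i-iterations, which races
// once the outer loop is split across threads.
for (int i = 0; i < n; i++)
    for (int j = i + 1; j < n; j++) {
        Vector f = force(p[i], p[j]);
        p[i].velocity += f;
        p[j].velocity -= f;    // the racy write
    }

// Thread-safe refactoring (~60% of the serial speed): each particle
// reads all the others but writes only itself.
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        if (i != j)
            p[i].velocity += force(p[i], p[j]);
```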

( Pause in speaking. )

[Mike Lewis]

So here we have a simulation, you see particles floating around in space and when they collide with each other, they absorb each other and they are all attracted to each other. This is similar to the previous loop I showed you in serial code and I'll show you the code right now.

( Pause in speaking. )

[Mike Lewis]

Here we have the original loop, which is similar to the first, more complex loop I showed you with the optimization. You have the outer loop right here, and we have the inner loop right around here. We had to refactor it so we got rid of a potential race condition here, and I'll quickly show you that this is the hot loop by using Shark.

So we open up Shark, we start a time profile, we have our process selected, we have a time limit set to five seconds because if you have a pretty simple application, you can get a really good sample in five seconds. So I start Shark.

( Pause in speaking. )

[Mike Lewis]

And we get a profile right here. Most of the work is done in this main loop right here, as you can see. There are actually other accessor functions for the vector class, but they're all inlined, so you'll see most of the work in this one loop. The name's kind of obfuscated because of the optimizations that are going on, but if you double-click on it you can see some of the assembly. You can see that the square root right here is taking up a lot of processing power, along with the float division and all that.

So I'll show you how OpenMP can make a really huge performance difference, after I show you the performance we are getting right now with the original loop. Right here you can see our updates per second; the benchmark we are using is how many times the velocity vectors of the particles are updated each second. They can be updated between frames, it's all concurrent, and it could be slower than each frame; the particles will still move around, but it won't be as precise if it updates slower. So this is our original code that's not thread safe.

I'll tab to the version with our thread-safe code. As you can see, the updates per second are about 60 percent of the original, but this is okay because when we thread this we're theoretically going to be able to get a four or eight times improvement; it's a trade-off.

So here's a copy of the thread-safe loop, just so I can show the different loops side by side. I'll simply add a pragma statement, as Phil showed, and a reduction: since we are incrementing this variable right here each loop, we have to add our reduction so we don't create a race condition with threads trying to increment it at the same time.

And we'll build it. Here we have our original loop again, getting about 50 updates per second. We have our thread-safe original getting about 30 updates per second, and we have our OpenMP version. We'll just wait a second for the updates to propagate. We hit about 215, 220, and that's a really huge performance increase: we added one line of code, did the refactoring, and got a lot more precision and accuracy in simulations that are real time.
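
That one-line change, sketched with assumed names (the demo's actual identifiers aren't shown in the transcript):

```cpp
long runUpdates(int particleCount)
{
    long updates = 0;
    // Parallelize the outer particle loop; the shared counter is
    // combined safely by the reduction clause instead of racing.
    #pragma omp parallel for reduction(+:updates)
    for (int i = 0; i < particleCount; i++) {
        updateParticleVelocity(i);  // assumed helper: writes only particle i
        updates++;
    }
    return updates;
}
```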

Now, you could do this in native pthreads as well, and we actually did, just to show you how complex it could be; it wasn't that fun, but I'll show you a little of the code so you can see what we went through. We decided to implement a thread pool, and you can see we have pthreads, pthread condition variables, mutexes. We have a job class and we have all this stuff going on. I am not going to go into the nitty gritty because there's too much to go over, but I'll show you the size of the code.

And we also have a worker thread that's implemented right here. There's still a lot of concurrency that you have to worry about; debugging, we had a lot of deadlocks, and we just had to work out all those bugs. And here's where we set it up: we have a structure that has to be passed to the threads as a void-star pointer, so we had to figure out a good way to pass it and set everything up here. You can get a little more performance out of using native pthreads, so I'll show you a little bit of the performance that we get.

So we get about 210, 215 with OpenMP, and then with our pthreads implementation we squeeze out a little bit more performance. It's not that significant, about 10 percent, and whatever threading model you choose depends on how important the code is. If you need that 10 percent, by all means go for the pthreads; but if you're just looking for an easy fix to get most of the performance possible, you can go with OpenMP. Now I am going to bring you back to Phil; he has a little more to go over with TBB, I think. Here he is.

( Applause )

[Phil Kerly]

Thank you, Mike. So one of the things we talked about initially was the four principles, and I think you hopefully saw them applied in Mike's demo. First of all, he needed that parallel algorithm. The reality is that the code he received was already threaded, but functionally: it had threads for managing user input, a display thread, and then a computational thread.

The serial implementation was optimal as a serial implementation, but it clearly was not optimal for threading. So he had to refactor the code so that it would be more suited to threading, and it actually went slower in the serial case. But he was able to focus on the parallelism of the work he was trying to accomplish and not on the mechanics of implementing the threading. So it's kind of amazing that for comparable performance it's one line of code, plus the few changes that had to happen because he refactored the code to support parallelism, versus hundreds of lines of code to do it in POSIX threads.

So there's a huge productivity improvement: a lot fewer changes, less debugging, a lot less time spent figuring out where the synchronization problems were. We had lots of issues with the POSIX version initially. And he could focus on the behavior and the constraints. Initially, when we first looked at that threading algorithm, we actually took the serial implementation and just threw OpenMP at it, and we didn't get the performance we were expecting: one, because it had a bug in it, and two, because we needed to take care of the fact that we had this extra constraint that we were updating two variables in the array versus the single one.

And putting any kind of OpenMP synchronization or critical sections around that really impacted performance. But still, we got to those kinds of decisions much quicker, and then we were able to focus on the actual performance. So clearly with OpenMP, okay, we only got 90 percent, but it took 10 percent of the time, more or less, and got us a long way toward where we needed to be versus the POSIX implementation.

So let's contrast that a little bit with Intel Threading Building Blocks. It's another approach to implementing threading. Again, it tries to hide some of the mechanics from the developer, much like OpenMP does, and allows you to focus on the problem space. And because it's an optimized library, we're not constrained to just using pragmas; there are a lot of complex threading models supported by Intel Threading Building Blocks that OpenMP doesn't support.

OpenMP supports the parallel for and parallel sections implementations, whereas, as we'll see once we get into TBB, there are a lot more threading models that can be applied much, much more easily using TBB. It's C++, so there's no Fortran: OpenMP is supported by the Intel compilers for C, C++, and Fortran, but we do not have the equivalent library in Fortran for TBB. It is based on templates, whereas OpenMP is based on directives. But in both cases they are optimized for Intel platforms.

But TBB is not limited to Intel compilers. You can use it with any compatible compiler, so you can use it with GCC as well, whereas with OpenMP it depends on whether the compiler actually supports the specification. And then there are some developers that need cross-OS platform support; in fact, Apple delivers Safari now on Windows. OpenMP is supported on all of Intel's compilers and so is TBB, so TBB is supported across the OSes: Windows, Linux, and Mac OS X.

You can actually download those compilers and tools from Intel's website; I believe most of them have something like a 30-day trial period. I encourage you to go ahead and download and try them. So what does TBB offer? It offers parallel constructs as well: the parallel for and the parallel reduce, which we actually saw in operation in Mike's demo.

But it also supports a parallel while, so we no longer have to have a fixed, defined termination index for the parallel-while paradigm. In OpenMP you had to define that last index and it couldn't change: you couldn't change the value within the for loop itself. In the parallel while, you can actually make those changes; the bound can be updated and operated on within the parallel while.

There is also support for parallel scan and sort, and then the concept of linear pipelining, so you can actually have multiple threads that are pipelined, and you can insert filter stages that operate on the data. In addition to the parallel constructs, it supports container classes: there are defined classes that give you thread-safe implementations of something like a hash map, a vector, or a queue class as well.
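
For example, a minimal sketch with the queue class:

```cpp
#include "tbb/concurrent_queue.h"

tbb::concurrent_queue<int> results;

// Safe to call from many threads at once, with no external locking.
void record(int value)
{
    results.push(value);
}
```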

TBB supports task-based scheduling, very similar in some respects to OpenMP's parallel sections. In this case you can define the task space, somewhat like the task queue in Apple's implementation, and then you can define your own task scheduler to operate on those tasks.

It supports synchronization, both atomic operations as well as mutual exclusion, and there are several flavors of mutual exclusion supported. There is the unfair kind of mutex: you try it, and if you don't get it you try again and maybe you get it, but it's not necessarily fair; just because you tried first does not mean that you are going to be the first one to get in the second time.

So there are queuing mutexes, which actually queue up the requests to the mutex, making it a fairer, first-come-first-served type of implementation. There are also read-write mutexes, queuing read-write mutexes, as well as spin versions of the mutexes. When I was talking about OpenMP, I mentioned that the synchronization primitives all have spin waits to hopefully avoid context switches; you have spin mutexes here in TBB as well.
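
The flavors share a scoped-lock pattern; here is a sketch with the spin mutex (swapping in tbb::queuing_mutex or a read-write variant has the same shape), with an assumed shared counter:

```cpp
#include "tbb/spin_mutex.h"

tbb::spin_mutex countMutex;
long count = 0;

void increment()
{
    // The lock is acquired here and released when 'guard' goes
    // out of scope, even if an exception is thrown.
    tbb::spin_mutex::scoped_lock guard(countMutex);
    ++count;
}
```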

And then, just like OpenMP, there is memory allocation support for threaded allocation, and there are timing routines. Now, the benefit of both of these technologies, because they are implemented as a library, or as pragmas and a set of library routines, across platforms and across OSes, is that if you implement with them you get the same functionality across the board; it doesn't matter which OS you happen to be running on. So let's do an example of a parallel for.

This is the same loop that we had before, but what we're going to find is that it's a little bit more complex to implement in TBB. Hopefully, by the end of this talk you'll see that the complexity TBB adds to your development pays off in the performance that you can get out of it.

So first of all, you have to initialize the library. You do this by including task_scheduler_init.h, using the namespace, and then defining the task scheduler object; that initializes the library. Because it's a class-based, object-oriented library, you actually get the destructor: you can terminate the task scheduler early if you like, but if you just exit out of main, it will clean up after itself.

And then what we have to do is implement the function in terms of an operator that will operate on a variable range of tasks. Before, we defined the full routine as going from I equals 0 to I less than N; we now have to parameterize that, so that the task scheduler can give that function or routine the indexes for where to start and where to end.

And that can be dynamic, and you'll see why we do that. So here's the code that we actually have to add: we put it in class form, and then we implement the operator with this variable range class that goes along with it.

And then we actually implement the range class, and this is pretty much right out of TBB's format, but it is implementing a custom range class. You could actually use the default that is included in TBB, and that would mirror very much what OpenMP does with a dynamic scheduler.

With the default, the range is divisible down to one, so it just gives you the next iteration that you are going to operate on, and that's the range. But in this case I specifically picked this particular implementation because it allows me to customize how the range gets divided: I can change the beginning, I can change how we're dividing the workload. And so once we define the range class and we define the function, now we can implement the parallel for. Somewhat similar techniques apply to the parallel while, parallel sort, and parallel scan constructs.
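
Putting those steps together, here is a minimal sketch using TBB's default blocked_range rather than the talk's custom range class (the class name and loop body are assumptions):

```cpp
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// The loop body, wrapped in a class with operator() over a sub-range.
class ApplyLoop {
    float *a, *b;
public:
    ApplyLoop(float *a_, float *b_) : a(a_), b(b_) {}
    void operator()(const tbb::blocked_range<int> &r) const {
        for (int i = r.begin(); i != r.end(); ++i)
            a[i] += b[i];                  // illustrative body
    }
};

int main()
{
    tbb::task_scheduler_init init;         // cleans up in its destructor
    const int N = 100000;
    float *a = new float[N](), *b = new float[N]();
    // Range from 0 to N, with a grain size of 1000 iterations per task.
    tbb::parallel_for(tbb::blocked_range<int>(0, N, 1000), ApplyLoop(a, b));
    delete[] a; delete[] b;
    return 0;
}
```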

But what makes this a little different from OpenMP is that I was able to define it. You can see that there's a custom range, and I weighted it in the implementation of the custom range class: instead of the work being divided equally among the threads, I gave just the first two indexes to the first thread, then three indexes to the second thread, four to the third, and five to the fourth. That weighting did the load balancing myself, based on knowing exactly what the calculation was and the performance I expected it to have. So what we are going to do is have Mike come up and show the implementation of the demo using TBB, and we'll look at the performance of that as well.

( Pause in speaking. )

[Mike Lewis]

So we have one last demo for you guys that will be pretty quick. I'll first show you the TBB code that it took to implement what I did with OpenMP and pthreads. So I'll first show you the problem. The problem is right here; well, it's not really a problem.

It's where we put the workload and split it up, similar to what Phil showed you in the slides. We basically call our update-sub-range for loop, and we have it set up to take the blocked range class. And we call a parallel reduce right here; this is similar to parallel for in TBB, but it does the reduction as well, like in OpenMP.
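
A hedged sketch of that shape (the names echo the demo's description but are not its actual code): the body counts updates over its sub-range, the splitting constructor makes a fresh body for a stolen sub-range, and join() folds the partial results together.

```cpp
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

struct UpdateSubRange {
    long updates;
    UpdateSubRange() : updates(0) {}
    UpdateSubRange(UpdateSubRange &other, tbb::split) : updates(0) {}
    void operator()(const tbb::blocked_range<int> &r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            // ... update particle i against all the others ...
            ++updates;                       // partial count
        }
    }
    void join(const UpdateSubRange &rhs) { updates += rhs.updates; }
};

// Usage: UpdateSubRange body;
// tbb::parallel_reduce(tbb::blocked_range<int>(0, count, 64), body);
// body.updates then holds the combined total.
```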

And we get really pleasing results, because we didn't have to do all the work we had with pthreads, but we still had to do a little more than we did with OpenMP. So, I showed all this already; I'll skip to the OpenMP version, where we have the threaded code getting decent results. I'll switch to POSIX threads: we're getting about 241 updates a second, and it'll just take a second to propagate.

We're getting about 248 updates with TBB, or 242; it depends a little bit on what is going on on the screen. But overall we got really good results with TBB. We're not doing all the work: it takes care of the threads, the task scheduling, how many threads we have, and all that stuff, for not that much effort. And I think Phil is going to wrap up.

( Pause in speaking. )

[Phil Kerly]

Thank you, Mike. If we can go back to the slides: so again, what we saw was a little bit more effort in TBB, certainly in understanding the class structure and the implementation. Instead of the one or two lines of code to get 90 percent of the performance, it was probably another 10 or 20 lines of code.

But we actually got slightly better performance than the POSIX implementation. We didn't go back to try to analyze it; clearly, if you wanted to work a little bit harder on the POSIX implementation, you could certainly get equivalent or even better performance than TBB. But the important part is the improvement in productivity that you achieve from using these technologies over doing the implementation yourself, with all the debugging that goes on.

So TBB certainly offers more common threading model options than OpenMP. There's still less debugging and more performance focus: we can implement custom workload implementations and ranges to help balance out the workload. But there is a higher learning curve than OpenMP. And again, you can achieve the same performance or better with POSIX; it's just going to take a lot more effort.

So clearly there's still software engineering required: you have to modify and think about the algorithm. The first example, where we had to change the serial code to support improved parallelism, that part is still required; OpenMP, TBB, or any other technology is not going to remove that from the equation. You have to understand the data sharing and synchronization that goes on; again, OpenMP allows you to express it a little more cleanly and concisely, but it doesn't do it for you.

You have to analyze the workload behavior and balance that workload. Again, if you are buried in the POSIX implementation, you are spending a lot of time debugging and figuring out exactly where your bugs are and not focusing on the performance side; the productivity is lost to that. OpenMP and TBB allow you to focus more on that workload balancing behavior.

And certainly you have to take into account the hardware platform constraints. For example, one thing that we didn't encounter in this particular demo is the fact that the hardware has a certain cache line size. If two threads have data that happens to share the same cache line, even though neither thread is updating, modifying, or reading the other's data, you can cause implicit false sharing on that cache line. That's a hardware platform issue that you have to be aware of; OpenMP, TBB, or any other technology is not going to help you if you laid out your data that way.
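
A sketch of the layout fix, assuming a 64-byte cache line: per-thread counters padded so no two land on the same line.

```cpp
// Without padding, adjacent counters share a cache line and the line
// ping-pongs between cores even though no data is logically shared.
struct PaddedCounter {
    long value;
    char pad[64 - sizeof(long)];  // 64-byte line size is an assumption
};

PaddedCounter perThread[8];       // one counter per thread of execution
```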

So there's still a lot of effort that has to go in, but hopefully with something like OpenMP or TBB the focus is actually on the threading part. So, in summary: we really want you to be aggressive in terms of threading. The number of threads of execution is going to increase; it already has, and it's accelerating. It's going to be harder and harder to implement threading.

OpenMP and TBB are alternatives to improve your productivity and your application performance. Clearly, we showed a demo where we actually exceeded the initial POSIX implementation. Certainly they are all comparable in overall performance, but the level of effort and work required to achieve that was significantly different.

And with greater productivity, what we're really hoping is that there are more opportunities for threading. So if you have for loops that are only taking 10 or 15 percent of your application's time, maybe it's time to start investigating those areas as well for opportunities for threading. If you can add an OpenMP pragma statement to a for loop and reduce that 15 percent down to two, three, four percent, you've already gained a 10 percent improvement in your application for very little effort.