WWDC10 • Session 418

Maximizing OpenCL Performance

Graphics & Media • OS X • 58:51

Dive deeper into the practical application of OpenCL to accelerate computationally intensive tasks. Discover a variety of algorithms that can harness OpenCL to yield incredible performance gains, and learn to take advantage of the OpenCL execution model and memory architecture. Learn how to mix high-performance rendering using the OpenGL graphics API with the parallel computing capabilities of OpenCL.

Speakers: Benedict Gaster, Vinay Awasthi, James Fung

Unlisted on Apple Developer site

Downloads from Apple

HD Video (231.1 MB)

Transcript

This transcript has potential transcription errors. We are working on an improved version.

Hello, everyone. Welcome to the session on maximizing OpenCL performance. In the previous session, we talked about an overview of OpenCL, how to use it, and sharing between CL and GL. In this session, we're going to talk about tuning your kernels to get the best possible performance on AMD and NVIDIA GPUs and Intel CPUs.

First up is a presentation by Benedict Gaster of AMD. Ben is going to talk about doing game physics with OpenCL; in particular, he's going to talk about doing cloth simulation and Bullet Physics using OpenCL and tuning for AMD GPUs. So let me please welcome -- Ben.

Thank you.

[ Applause ]

[Benedict Gaster]

Okay, so as Aaftab said I'm going to talk about OpenCL and accelerating physics. And when I talk about physics here, we're talking about game physics. I'm not trying to solve the world's problems, looking for where the next black hole is or something like that. I'm going to look at game physics. So what does that mean? So that means there are a number of APIs out there, and the one in particular that we've been focusing on is an API called Bullet, which is primarily developed and architected by a guy called Erwin Coumans at Sony.

And it's used in many third-party games. You know, and also films and so forth. It has a Zlib license, so it's completely -- you can do whatever you like with it, it's completely open source. And we're collaborating on accelerating it. Today the implementation is purely CPU and there have been some experiments with SSE and some other things. But it's not really an accelerated API.

So what we're looking to do is how can we give it a real boost. So we looked at cloth, and I'm going to talk about that today, or soft body. So you can think of if you've got your little yellow duck, my kids are always playing with that in the bath, and you're going to squeeze it, how do you represent that, it's kind of a soft body, rather than something like a rock. A cloth is another form.

And the key thing is that it's fully open source. So all the contributions, all the work that we do in OpenCL is going to go back into that, and you'll be able to pick it up and run it on your Mac platforms. So what is Bullet in OpenCL, what does it do? So in the big picture here, starting on the left is something I'm not going to talk about today, but it's rigid bodies, they're very important in physics. They're basically little rocks or whatever. Here we have a pachinko machine, and 5,000 little things bouncing around hitting off each other, hitting off the pegs, bouncing off the back.

And all being accelerated in OpenCL. This all runs in real time. Everything I'm going to show you today is in real time. You know, that's the key thing about games. It may not be exact physics, but it better give a good response to the user because otherwise they're not going to want to play the game and so forth.

Another thing, fluids is becoming more and more popular, you know, maybe you want to do a river or we have some water that splashes down on top of the character in the game. And at the bottom here the stuff that I'm going to talk about today is the cloth simulation, and so everything -- all the flags, you can see all the posts, every single one of those are cloth -- real cloth being simulated -- not real cloth, but you know, game physics cloth being simulated in OpenCL.

And the little guy, if he was dancing around in the middle there, you'd see that he'd have a cape out the back there, and that would all be simulated as a cloth simulation, all done and running in real time. So what is cloth simulation, what does it mean? Really, it's fairly straightforward. Basically you have some masses, so basically think of them as particles and they have some form of mass, and you have some connections. You could have all those springs and things, and they could all be separated.

But that's not how fabric works, you know? You grab it and it doesn't all fall apart and fall to bits. So you need some sort of things connecting them together. So basically you have what's called a mass spring system. Turns out there's lots of these masses, really. If you want to simulate a piece of cloth and make it look real, you want lots of them, lots of particles.

That's really good for parallelism, because lots of things that you can operate on gives you an idea that maybe you can do something -- if there's only two of them, probably not very good to do. So that's the first thing you want to do when you're thinking about parallelizing them, you want to make sure that there's lots of work to be done.

So we need to connect them together with some constraints, and this is what the springs do. And as I show you here, you can see that there are three types of springs that we are considering, at least. Maybe there are others. But the main ones -- so if you see the straight lines around the outside, and not on the inside, these are what we call structural springs, and they kind of provide -- give a body, give a kind of shape to the thing so it all doesn't collapse down. And then you want to be able to, at some point, you're going to want to be able to rip that cloth. You're going to need to be able to rip that fabric. So you're going to want to have shearing.

These are the diagonal ones that go across. And then of course a soft body wouldn't be a soft body if it didn't bend and deform and things. So you have these bending constraints that are all connected together. So obviously we want to start, you know, want to be able to parallelize this, we want to be able to do something. And so what we need -- we have lots of these particles.
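
For reference, here is a minimal sketch of how such a mass-spring system could be laid out in memory. This is not Bullet's actual data structure (its soft-body code uses its own types); the names and fields here are hypothetical and simplified.

    // Hypothetical, simplified layout -- Bullet's real soft-body structures differ.
    typedef struct {
        float position[3];       // current position of the particle (vertex)
        float prevPosition[3];   // previous position, used by Verlet integration
        float invMass;           // 1/mass; 0 for pinned vertices (e.g. the flag pole edge)
    } ClothVertex;

    typedef enum { STRUCTURAL, SHEAR, BEND } LinkType;

    typedef struct {
        int      a, b;           // indices of the two vertices this spring connects
        float    restLength;     // length the spring tries to maintain
        LinkType type;           // structural, shear, or bend constraint
    } ClothLink;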

And what we do, it's very straightforward, we have -- so it's good for parallel programming. And what we need to do is basically, at each step of the simulation, we generate some forces and apply them to the different connections -- the different springs -- so they'll have slightly different characteristics. But basically it's a set of rules.

You calculate some impulses, apply them to the masses, and calculate the new positions. You know, and the key thing here, if you imagine that there's lots of these things connected together, we could do these somewhat in parallel. It's not quite as simple as that, because the fact is if I have one little bit of cloth and there's another bit connected, if I pull it this way, this one's going to go that way as well. So we need to have some sort of locality and connecting things together. So parallelism is not complete. You know, it's not -- everything is not done individually.

So one of the things we wanted to look at is that, you know, cloth could interact with rigid bodies, for example, maybe a throw on a chair or a flag, or you'll drop a piece of cloth onto a bottle, things like that, pour some water onto it. But another thing, very much independent of all those things, is you might want to have some flags flying in the wind. That's an obvious force. It turns out this can be parallelized really easily and really nicely because this is computed per vertex on the cloth.

You know, that's how we're representing it -- each particle is in some sense a vertex. Because all you really want to do is describe a mesh that you're then going to render at some point. That's the end goal, to render something that looks like a flag. Doesn't really need to be a flag. And so here we can compute it per vertex, an excellent parallel computation: the force depends on the density of the air medium, the mass, and the normal of the vertex.
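
As a rough illustration only -- this is not Bullet's actual aerodynamic model -- a per-vertex wind kernel might look something like the following. The buffer names, the windVelocity and mediumDensity parameters, and the force formula itself are all assumptions made for the sketch.

    // Illustrative sketch; the real force model in Bullet is more involved.
    __kernel void applyWind(__global float4       *velocity,
                            __global const float4 *normal,
                            __global const float  *invMass,
                            const float4 windVelocity,
                            const float  mediumDensity,
                            const float  dt,
                            const uint   numVerts)
    {
        size_t i = get_global_id(0);
        if (i >= numVerts) return;

        float4 n   = normal[i];
        float4 rel = windVelocity - velocity[i];
        // force proportional to the density of the medium and the component
        // of the relative wind along the vertex normal
        float4 f = mediumDensity * dot(rel, n) * n;
        // a = F / m; invMass is zero for pinned vertices, so they never move
        velocity[i] += f * invMass[i] * dt;
    }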

You know, so it's a very simple physics calculation. There's no rocket science here. What does it look like if we were going to do this on the CPU? So this would be the Bullet implementation today; other physics implementations would take a very similar approach if you're doing it on the CPU.

Well, you do an iterative, you know, integration -- Verlet is a common one to use -- over the vertex positions, and for each spring you compute a force, update the vertices with a new position, and repeat N times, where N is basically configurable. The idea here is, you know, if you integrate just once, it's not accurate enough.

So you want to run the simulation -- it turns out in physics, game physics at least, 10 iterations -- I couldn't tell you exactly why -- helps it converge to a reasonable point that gives you a nice simulation. And that turns out not just to be true for cloth, it turns out to be true for fluid and also for rigid bodies.

I think ten is maybe just enough convergence there that gives you the simulation without running so many CPU cycles that you never get an answer. So, the critical thing here is that, you know, this kind of integration is great, but you're updating everything in place; you've got the results from one iteration being fed back into the next iteration. This is a very sequential algorithm.

Everything can be updated immediately, everything moves through the solver immediately. But it's not a trivial thing to parallelize if you try to just take this as is. So here you can see what this would look like for each iteration. So this iteration is the N that we talked about, the integrate. You simply walk through the links and do the integration. Really very much straight out of a physics book. There's nothing surprising about it.
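
A minimal CPU-side sketch of that sequential loop, reusing the hypothetical ClothVertex/ClothLink layout above (this is a generic position-based relaxation, not Bullet's exact code):

    #include <math.h>

    // Sequential CPU solver sketch: everything is updated in place, so there
    // are no race conditions -- and no parallelism either.
    void solveClothCPU(ClothVertex *verts, const ClothLink *links,
                       int numLinks, int iterations /* N, typically ~10 */)
    {
        for (int it = 0; it < iterations; ++it) {
            for (int l = 0; l < numLinks; ++l) {
                ClothVertex *a = &verts[links[l].a];
                ClothVertex *b = &verts[links[l].b];
                float dx = b->position[0] - a->position[0];
                float dy = b->position[1] - a->position[1];
                float dz = b->position[2] - a->position[2];
                float len  = sqrtf(dx * dx + dy * dy + dz * dz);
                float wSum = a->invMass + b->invMass;
                if (len < 1e-6f || wSum == 0.0f) continue;
                // how far this spring is from its rest length, split by inverse mass
                float corr = (len - links[l].restLength) / (len * wSum);
                a->position[0] += dx * corr * a->invMass;
                a->position[1] += dy * corr * a->invMass;
                a->position[2] += dz * corr * a->invMass;
                b->position[0] -= dx * corr * b->invMass;
                b->position[1] -= dy * corr * b->invMass;
                b->position[2] -= dz * corr * b->invMass;
                // the result of this link feeds straight into the next one
            }
        }
    }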

So, moving to parallel execution, and this is a critical thing: I think the CPU implementation has no atomicity issues. You know, it doesn't matter that you're updating everything in place. You know, I might want to update a single spring, a single mass may be affected by two links, and I could just do one and then I could do the next.

And I could update the same memory location, and none of that matters, because I'm implementing it sequentially; there are no race conditions. You know, that's the key thing. The trouble is, I actually want to do these updates in parallel when using the GPU. So this introduces races. So the critical thing is I can't just do those ones in parallel, otherwise there would be a race. But of course if it's a node over here that really doesn't affect the one over here, I could probably update those in parallel. So what does that mean? That means we're going to introduce a notion of batching.

And again, this is not something that I've invented or anything, it's a kind of well-known technique that's used in physics simulations and generally in parallel programming as a whole. And the idea is we're going to go through using graph coloring. We're going to color the nodes. And so here we're going to color the links, sorry. And the key thing is that each batch, which is the links of the same color, I can execute in parallel.

There's nothing there. But between ones where they're different colors, that means I have to synchronize between them. So what that means is I'll execute one batch, do as much as I can in parallel in that batch, execute the next batch, do as much as that in parallel, and so I am serializing the algorithm. But I'm only doing it at certain points.

And you can think of these as kernel launches in your CL program, for example, because that's inherently a serialization point, because you're coming back from the parallel device to a single -- you know, a thread on your CPU or whatever. You know, your host program, depending on how you've designed it. And so this technique is known as batching.
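
A sketch of what that batched dispatch can look like on the host, using the OpenCL C++ API mentioned later in the talk. The batch structure, the kernel name, and the roundUp helper are hypothetical.

    // One kernel launch per color/batch; each launch is a serialization point.
    for (size_t b = 0; b < batches.size(); ++b) {
        solveLinksKernel.setArg(0, linkBuffer);
        solveLinksKernel.setArg(1, vertexBuffer);
        solveLinksKernel.setArg(2, (cl_uint)batches[b].start);    // offset into the link array
        solveLinksKernel.setArg(3, (cl_uint)batches[b].numLinks); // links in this batch
        queue.enqueueNDRangeKernel(solveLinksKernel,
                                   cl::NullRange,
                                   cl::NDRange(roundUp(batches[b].numLinks, localSize)),
                                   cl::NDRange(localSize));
        // no explicit finish needed between batches on an in-order queue:
        // successive launches already execute one after the other
    }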

So basically, deriving batches, we need to be able to build up the batches, we go -- for each iteration -- again, we're going to have to do this for each iteration. We're going to step through number -- the size of the batches, and then we find the first and the second, this is a kind of structure that we've got. We pick up the start and the number. So the start is an offset into the batch that we're going to do so we can parallelize it on multiple GPUs if we want to.

This implementation that I'll show you in the demo today is not parallelized across multiple devices, but it's designed such that it could be. Okay, so the code I'm going to show you now -- this is it, this is all the OpenCL code for the solver. There are other kernels and everything, and this is the host side code. I'm using the C++ API rather than the standard C API. So that's why it might look a little different.

But all the names should be fairly straightforward. But the key thing here is I get the kernel itself, get the command queue that I want to run it on. We've abstracted that out so it can be an arbitrary device. I set some kernel arguments and then I enqueue the kernel and execute it.

And if we go to the solver on the next slide, you can see, if you check back on the previous slide, that my arguments correspond exactly to the set that I set there, and of course because, in OpenCL, the global work size must be a multiple of the local work size, we sometimes have to do some rounding. So we check that we can execute inside that with that little check. So the number of links is the actual ones that we're actually parallelizing over.

The solver itself is almost identical to the CPU code except we're loading from the global GPU buffers and everything like that. But it's basically just the integration step. You know, we read the vertices, the masses, and then we compute the new positions. There really isn't that much more to it.
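
A hedged sketch of the kind of link-solver kernel being described; the real Bullet kernel differs, but the important detail is the explicit bounds check, because the global work size was rounded up to a multiple of the local work size.

    // Sketch only: one work item per link in the current batch.
    __kernel void solveLinks(__global const int2  *links,        // (a, b) vertex indices
                             __global const float *restLength,
                             __global float4      *position,     // w component assumed 0
                             __global const float *invMass,
                             const uint startLink,
                             const uint numLinks)
    {
        size_t i = get_global_id(0);
        if (i >= numLinks) return;      // global size was rounded up; skip the padding

        int2   link = links[startLink + i];
        float4 d    = position[link.y] - position[link.x];
        float  len  = length(d);
        float  wSum = invMass[link.x] + invMass[link.y];
        if (len < 1e-6f || wSum == 0.0f) return;

        float corr = (len - restLength[startLink + i]) / (len * wSum);
        // within one batch no two links touch the same vertex,
        // so these writes do not race with each other
        position[link.x] += d * corr * invMass[link.x];
        position[link.y] -= d * corr * invMass[link.y];
    }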

Okay, so what does that mean? That turned out to be pretty good. We can already start to see a pretty good performance speedup on the GPU. However, I talked about the batching. And originally I've shown you here just nine batches. That's what we got from our original coloring. That's great. Except it turns out that the batches are quite small. You know, and what I mean by that is there's not that much work in the batches. So there's not a high density of work per thread.

So we execute all this and notice there's not much work. So we're not really efficiently using the GPU. So the idea is, can we keep increasing the size of the batches so that we can get more done? And basically, not only do we create larger batches, but of course we reduce the number of serialization points in the program as well. Because that's what different batches introduce. So you can see here, it actually turns out that if we're very careful we can get that down to four batches. And we've doubled the amount of work that's going on.

We're not now completely -- you know, again, it's not real physics here. We've made these observations, we've done some analysis, and realized that we can use these techniques to do that. And in particular, the way that we get around this is that we remove determinism. So what we were saying before, we only wanted to update a single link at one time, you know, a node. And of course that causes a serialization bottleneck.

It turns out that if you're careful and duplicate per-vertex data -- so that means recomputing some data at the point of execution, so you may compute the same value twice -- then actually you can update links in parallel. Now of course we've got lots of ALUs here to use, so recomputing work -- what's wrong with that?

If you're memory-bound, then using extra ALUs to recompute the value is not a big issue. So it turns out that this is how we can start to remove determinism. Something that we're looking at moving forward -- and this is not a proven point yet; we've implemented it and it seems to work, but we haven't mathematically verified it's correct yet -- is that we can remove determinism altogether and have a single batch. And the issue seems to be that it's slower to converge. And so the N that I talked about earlier is now not the right value for that. The problem is if we increased N to 20 rather than 10, that would double the amount of work we're doing.

So it's kind of a balance of whether that is going to give us back still better performance, because we're more effectively using the GPU. And that's an interesting thing, and we don't have the answer to that today. But you can see these are the kind of things that you really want to think about when building for the GPU.

Okay, so that's kind of it. I've given a number of points and a number of techniques for parallelizing for the GPU. They really work well on our GPUs. Actually, these work well on any kind of parallel problem you want to solve, and these are really techniques that you could use for OpenCL.

One thing that's really important, I think, with parallel programming and particularly with GPUs, is these are SIMD machines. So what that means is they're very wide vector machines. You know, you are executing a single instruction many times for the same, you know, for the same vector but different elements.

That means if your threads diverge -- so on AMD, say, it's a 64-wide vector machine -- if only one thread is executing out of that 64, you're using 1/64th of the GPU. You can easily do that by saying if global ID equals 1, say for example, then you're restricting it to that, or if group ID, say, equals 1, then you're just restricting it to that work group and that SIMD.
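
For example, a kernel like the following sketch, where the loop is just a stand-in for expensive work, leaves 63 of the 64 SIMD lanes idle while that one work item runs:

    __kernel void divergent(__global float *data)
    {
        if (get_global_id(0) == 1) {
            // only this one work item takes the branch; the other 63 lanes
            // of a 64-wide SIMD sit idle while the loop below executes
            float x = data[1];
            for (int i = 0; i < 1000; ++i)
                x = sin(x) + 1.0f;          // stand-in for expensive work
            data[1] = x;
        }
    }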

But it's very easy to do that. There are some techniques that we used in the cloth simulation itself to reduce this. It turns out that if we're willing to not allow an arbitrarily complex mesh -- so what I mean by that is, I have a piece of cloth that's kind of wrapped up and rolled over and kind of coming through itself -- if we say actually that's not that common in games, you know, we can reduce the mesh to, say, a nice kind of flat cloth we'll have flapping in the sky like the flags I'll show you in a minute, then actually we can make them regular and we know the shape, we can preprocess them statically, and we know that we don't have any divergence when we're generating that.

So techniques like that, you know, you're not solving the whole world's problems, you're not doing everything, you're saying: what is the problem that I want to address in this application? Then you can make trade-offs that allow you to get, you know, reduced divergence, better throughput, better efficiency. I think that's very important in parallel programming. You know, it's very difficult to solve the general problem. If you can reduce and restrict your problem a little bit, you're much more likely to get the performance speed-ups that you hoped to get. Okay, hopefully I can switch to -- okay, that window.

So this is -- you just saw those two flags. One of them is very far out, flying. And so this is running purely on the GPU. And there are just two flags on here at the moment, completely being simulated. Just a very simple GL renderer. And I can simply keep pressing here and just keep adding flags. At this point I think I've got 10 flags; you can't see all of them.

But you can spin around and see some are very close up. And you can see that we haven't seen any performance degradation at all. And I've just added these ten flags and each one's adding it, just loading it to OpenCL. And actually if you were to look at this application you'd see that the CPU is basically doing hardly anything. Turns out there's no bottlenecks on here.

The only main thing is that because I knocked this out very quickly and didn't have a lot of time to spend on the rendering, I'm actually copying the data -- I'm not using the GL sharing that was talked about in the previous talk. And if I did that I could probably just keep adding flags, maybe have 50 or 60 flags.

And again, not only would the GPU be running, these would be running in real time. You actually wouldn't see the CPU usage going up hardly at all. So you can offload that and do this work. And yeah, okay, so that's it. Now I'm going to hand over to Intel for their presentation.

[ Applause ]

[Vinay Awasthi]

Hello, today I am going to talk about how to develop complex kernels for Intel CPUs. So jumping right into it -- the purpose of this talk is that we want to develop kernels which are used for solving complex problems, such as solvers or dynamic programming related problems, and how we utilize multiple cores, SIMD instructions, and caches. So first let's go over a simple dynamic programming problem called longest common subsequence.

And this problem is essentially pretty simple -- you have two strands of DNA and you want to find out how similar they are. And the way you solve it is you create a table and you align these strands, top and left side, and then you fill in zeros so that you can process from the corners, going from left to right.

And the code that is used to analyze this is on your right-hand side, where if the data matches, you pick up the element that is diagonal to the element that is being processed. And if it does not match, then you take the maximum of the cells that are on top of it and on the left of it. So it's something like this. So you're processing, let's say, the first row. And you went through, and A and A matched for the last two columns. So you picked up 0 and added 1 to it.

And if it doesn't match, then you take the maximum of the top or left, whichever is larger, and fill it into that particular cell. And you kind of just go in and fill in this particular table all the way to the bottom. And at the bottom right corner, the number that you end up with is your edit distance.
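
For reference, the sequential table fill being described is the classic LCS recurrence; a minimal sketch in C (dp, a, b, lenA, lenB are assumed to be set up by the caller, with row 0 and column 0 zero-filled):

    // Classic longest-common-subsequence fill.
    for (int i = 1; i <= lenA; ++i) {
        for (int j = 1; j <= lenB; ++j) {
            if (a[i - 1] == b[j - 1])
                dp[i][j] = dp[i - 1][j - 1] + 1;           // take the diagonal and add one
            else
                dp[i][j] = dp[i - 1][j] > dp[i][j - 1]     // otherwise max of top and left
                         ? dp[i - 1][j]
                         : dp[i][j - 1];
        }
    }
    // dp[lenA][lenB] is the LCS length; for insert/delete-only edits the
    // edit distance follows as lenA + lenB - 2 * dp[lenA][lenB].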

That is essentially telling you how many edits -- that is, insertions and deletions -- you need in order to transform one strand into another. And the higher the distance, that shows that these are not very similar DNAs. And usually scientists or the NIH tend to process hundreds of thousands of these long DNA strands.

And essentially, you want to process them as fast as possible, and sometimes you just want to know the edit distance. At other times you want to know what the longest common subsequence is. So in this case, what you need to do is traverse backwards. And the way you do it is essentially using the same comparison, and then wherever the diagonals are, that's where you have your common elements.

So in this case, it's JCTA. So why is this interesting? Dynamic programming is used to solve hard problems, NP-complete problems. A lot of times it's used in SAT solvers or areas where you want to solve combinatorial optimization related problems. And it's very different from divide and conquer.

Because in divide and conquer you get almost half the size of the problem when you divide it. In the case of dynamic programming, your subproblem is not much smaller than the original problem that you started out with. And in this case, as I said earlier, we're going to cover vector data type optimizations, cache optimizations, and multicore optimizations.

So jumping right into vector data type optimizations. So say this is your table that you want to work with. So as I showed you earlier, in the case of longest common subsequence, you kind of set up your requirements -- you need the diagonal element, the top element, and the left element in order to start processing. So we just fill in the zeros first. So now we are ready to process at least the top left corner.

And usually, if you look at any programming text such as CLRS or any one of those standard algorithm texts, this is how they describe it. You can go to the top left corner and process this row by row, one row at a time. Kind of like this. But if you look a little bit more deeply, you would notice that after the first element is filled in, you can go diagonally. You can process the next two in this way.

Then you go next three in this form. And four, and then you can have any number of elements being processed in parallel all the way up to the end. So it is up to you how you pick up your set. And then after that you kind of process all the elements in parallel. But there is more to it.

You can also shift your rows to the right. And once you do that, those two elements will line up. And if you organize your table in such a way that they line up horizontally, then in that case you have cache locality working for you. You do not have to have this if statement; the max part would essentially become a masked load.

So you set up your mask based on the matches and then you do a straight store. And that will write whatever element needs to be written. You will be doing two stores, one with the mask that matches, and another one with the mask that does not match. And you can fill in the max of the elements that you have.
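
A hedged sketch of that branchless inner step for four anti-diagonal cells in OpenCL C; a4/b4 are the four character codes being compared, diag4/top4/left4 the corresponding neighbor cells, and row/cell a hypothetical output location. select() does the masked choose that replaces the if.

    int4 matches  = (a4 == b4);                 // -1 (all bits set) in lanes where the characters match
    int4 fromDiag = diag4 + (int4)(1);          // diagonal + 1 for matching cells
    int4 fromMax  = max(top4, left4);           // max(top, left) for the rest
    int4 result   = select(fromMax, fromDiag, matches);   // per-lane masked choose, no branch
    vstore4(result, 0, &row[cell]);             // one straight store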

Okay, so how do we go about running this more efficiently? As you notice, essentially your threads were coming out to be first one, then two, then three, till you hit the full length. And then the same way you end up with the problem at the end, wherein it's three, two, one. So you do not want to have if conditions in your code. So what you need to do is pad it more.

So on both sides you pad it, now you can process four elements starting from the beginning all the way to the end. And this will run through all the way. And you will not have just four rows. You can have 8 rows, 16 rows, as many as you like. Essentially you are using a lot of vector data types to process as many elements as you can process in parallel.

Okay, so this also allows you to organize your problem in such a way that you can take advantage of cache locality and masked loads and stores. And you can also do streaming stores. Because once you have your data organized, you don't need to have regular stores; you can do write combining of stores using streaming stores. Anyway, so how would you go about multicore or multithreaded execution? So as you saw earlier, for any set to be processed you need the top row and the left column available so that you can process it. And the number of threads that execute changes.

So first you can see, you can go with 7. Then you kind of traverse down diagonally as I showed earlier. So next time you can just run one thread, then two, then three, and the same way you kind of exit. Now let's look at it in context. This is our group of elements that we want to process. Each one is running SIMD instructions; they are processing multiple elements in each set. And these groups, we can process them in this fashion in parallel in the first shot.

After that, we can go to the next one. Then you have to wait till that next one is over, then you process the next 2, the next 3, the next 2, and then the last one. And that's how you will exit. And there are other schemes to do the same thing. And that allows you to take advantage of the cache hierarchy. So you know how much of a working set can fit in your caches. And you divide your work in such a way that the problem that you're trying to work on fits in the cache. And these approaches are called cache-oblivious algorithms.

There's a lot of research going on in academic areas where they're trying to find ways to fine-tune algorithms in such a way that they can be cache oblivious. In this case, you kind of divide your problem and do execution in this order. It's important to go from one set to the other to the right so that you are again taking advantage of the caches. If you go columnwise, then you lose prefetching and cache-related advantages that are there, if you go 0, 1, 2, 3, in that fashion.

Okay, so what is the performance gain that you get if you go and do all this work? So when I started out, I was getting, let's say, a performance of 1, and then as I added all this, OpenCL execution went up to almost 7 times. And then I tried to do it using threading and hand-tuned SSE.

So that also changed quite a bit based on the threading model that you picked, and the best numbers that I got were using GCD and SSE. And that was about 7 to 10% better than OpenCL. So as you can see, OpenCL code comes pretty close to the code that you will get using GCD and SSE.

Okay, so what are the lessons learned? One is you have to issue large work items, around 10,000 to 200,000 instructions. Utilize map and unmap of buffers. And the advantage of using that is they just lock the memory, there's no copying happening. You just map and then you get your data, and then you unmap.

Use CL_MEM_ALLOC_HOST_PTR so that the framework is allocating memory for you. And it is aligning memory for you. You don't have to worry about alignments and things like that. But if you do have a need for it, then you use CL_MEM_USE_HOST_PTR and you align it the way you want it.
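
A minimal host-side sketch of that pattern with the standard C API (fillInputData is a hypothetical helper):

    cl_int err;
    // Let the OpenCL runtime allocate -- and align -- the host-visible memory.
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, &err);

    // Map: no copy, the runtime just hands back a pointer into the buffer.
    float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                             0, bytes, 0, NULL, NULL, &err);
    fillInputData(ptr, bytes);   // hypothetical helper that writes the input
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);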

And try to avoid, if you can, enqueuing buffers or waiting for barriers or having atomics, because they would tend to slow down your performance, because each kernel is executing hundreds of thousands of times. Now, another area to consider is that there is no added advantage to using image-related functions on the CPU, because there is no specialized hardware optimizing image processing in CPUs. And you also do not want to do a lot of edge condition processing in your kernels.

You want to process it either earlier, or you want to pad your dataset in such a way that the whole thing can be processed. Use a 1D range, because then you can process everything in one shot, going linearly. Okay, so here are a few kernels that we are going to see how we should go about fixing.

So one is -- a lot of patterns you will see in sample code and such, they tend to compute the work group size using this kind of pattern, wherein they find the global_id, know what the data size is, and divide it, and this kind of pattern is doing per-item work repeatedly.

And you want to write your kernel in such a way that you can eliminate as many instructions as possible, so it is as small as possible, because that particular code is going to get executed a lot of times. So in this case, what you could do is either set up a constant or pass that as a parameter to your kernel.
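
A small before/after sketch of that point (the kernels and the scaling work are made up for illustration):

    // Anti-pattern: every work item re-derives the same per-item chunk size.
    __kernel void scale_bad(__global float *data, const uint totalSize)
    {
        size_t gid     = get_global_id(0);
        uint   perItem = totalSize / (uint)get_global_size(0);  // recomputed everywhere
        for (uint k = 0; k < perItem; ++k)
            data[gid * perItem + k] *= 2.0f;
    }

    // Better: compute it once on the host and pass it in (or bake it in with -D).
    __kernel void scale_good(__global float *data, const uint perItem)
    {
        size_t gid = get_global_id(0);
        for (uint k = 0; k < perItem; ++k)
            data[gid * perItem + k] *= 2.0f;
    }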

Now here, you will see a lot of morphing algorithms and a lot of other image-processing algorithms. They tend to work like: okay, here's my pixel and here's the line that I want to morph towards, and I need to find my distance so that I know the weight -- let's say the [Inaudible] algorithm tends to do it, and there are a lot of other image-related processing algorithms that do it. And the general code is something like this: you just kind of do A squared plus B squared and do a square root. So try to use built-in functions, because they're already optimized for you.

So in this case, you will use [Inaudible]. In this particular code, we are using int to get the global thread ID. And this particular code will have a problem when you're running it on a 64-bit machine, because there will be downcasting happening from 64-bit to 32-bit. You want to use size_t.
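
In other words (these lines are fragments meant to sit inside a kernel, not a full example):

    int   gid  = get_global_id(0);       // risks a 64-bit to 32-bit downcast
    float dist = sqrt(a * a + b * b);    // hand-rolled distance

    size_t gid2  = get_global_id(0);     // get_global_id returns size_t
    float  dist2 = hypot(a, b);          // built-in; length((float2)(a, b)) also works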

Here what is happening is that the exponent being passed is stopping the compiler from doing the loop unrolling for you. So what you want to do is either pass it as a build-time configuration -- when you're building, you can specify it in the build options -- or specify it as a constant.

In this particular code, you will see that float2s are being used. And what you would like to do is use the full vector length because Intel CPUs do best when the full lengths are being utilized. In this case you want to do twice the work using float4, and again, don't use uint, use size_t. And this is the way you should do more work, use full-length [Inaudible] vectors or vector data types to get the best performance. Okay, so let's go with the demo.
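
A hedged sketch combining those two points: a build-time constant so the compiler can unroll, and float4 instead of float2. The EXPONENT macro and the per-element work are invented for the example.

    // Host side: bake the loop bound in at build time.
    clBuildProgram(program, 1, &device, "-DEXPONENT=8", NULL, NULL);

    // Kernel side: full vector width, compile-time trip count.
    __kernel void vecWork(__global float4 *data)
    {
        size_t gid = get_global_id(0);
        float4 v = data[gid];
        for (int i = 0; i < EXPONENT; ++i)   // constant bound, so it can be unrolled
            v = v * v + (float4)(1.0f);      // stand-in for the real per-element work
        data[gid] = v;
    }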

See, in here, my interest was towards using dynamic programming. It's not so much about the morphing here. My goal is to see how dynamic programming could be used to morph one image to the other dynamically, without any user intervention. And the way we do it is something like this. So each picture is transformed into grayscale to find out its edges. And after the edges are detected, we do black run analysis.

So you find out how many black runs are there and you try to do match, split, or merge for each one of them, for all the pictures. So think of it all the rows are being analyzed, and then we are separating the separate black runs for each picture.

And then we are analyzing those black runs for the next picture across the rows to find out where the best match is. When we detect the best match -- that is the point that we want to transform towards, image to image -- then we go and execute.

So there's a lot of work that is happening. Usually you would not do this much work for morphing. You would just go and have users set up the points, like [Inaudible]'s and other algorithms do. There is some work involved on the animator's side to allow you to do it. But here we are just using dynamic programming. And then what we also do, we are creating about 60 frames in between.

And so, image to image -- the juice box to the football, there are about 60 images that are created. And the same way when we are going from the soccer ball to the bowling pin, again, 60 images. And all of that is just done, so let's see how they look. So here is the transformation. And then if we look at the images, you can see how the transition is happening.

All the way. Okay.

[ Applause ]

So here are the conclusions. It's very easy to use and write. I wrote so many kernels in the past few weeks, and it was just straightforward. You develop your code on the C side and pretty much transfer it. And if you're using intrinsics, there are functions on the OpenCL side which directly map.

And performance is almost on par with GCD and SSE performance. If you're trying to use POSIX threads, then you will see that there is a problem, and GCD tends to outperform POSIX. So those are the lessons that we saw. It scales very well across cores, it effectively utilizes SSE, and you can write portable code that will work across devices, so you don't have to worry about whether the same code that you have will have a problem with the next generation of CPUs. That is automatically handled behind the scenes for you. And it works across device generations. All right, with that I will ask James to come over from NVIDIA.

[James Fung]

Thanks.

[ Applause ]

[James Fung]

So my name's -- sorry -- my name's James Fung, and I'll be talking about optimizing for GPUs using OpenCL. So today we're going to talk about how to get the maximum performance out of the GPU in your OpenCL program. This is the outline of the talk. We're going to start by talking about work group size heuristics: how big should you make your work groups, how many should you launch? Then we're going to move on to some instruction optimizations; in particular we're going to talk about divergence.

So you're on a parallel machine. Branching's fine. But there's some things you want to think about when you do branch. Then we're going to move on to memory optimizations. On a parallel processor keeping the data moving through that processor and keeping it fed is really important. And we're going to talk about two key memory optimizations that let you get the maximum throughput. In particular, coalescing and how to handle local memory.

And finally, we're going to talk about using the GPU texture hardware. Of course the GPU is a graphics processor in addition to a compute processor now, but that graphics functionality is still available to you in OpenCL and there are some good features you can use. And we're going to show a demo of optical flow using some of the texturing hardware.

Now just one note. In my talk, in OpenCL, as you'll be familiar with, they're called work items and work groups. In CUDA, both in the architecture and sort of in the programming language, we call them threads and blocks. I might be using the two interchangeably. But you should know they're interchangeable.

So let's take a look at what's actually on the GPU. So this is a picture of a GTX 285. On the GPU, on this -- you have up to 240 thread processors available to you. These are grouped together into 30 streaming multiprocessors, with up to 4 gigabytes of RAM. So on the entire card you have up to 1 teraflop of peak theoretical single precision. Now that is actually IEEE 754 precision, but it's not fully conformant with the spec in some degree -- in some ways.

In addition to single precision, you also have double precision. And the card provides 87 gigaflops of double precision to you. Now, each of these SMs -- these are the logical groupings of thread processors -- each SM holds 8 thread processors and 1 double-precision unit, and in addition has 16 kilobytes of local memory and 16K worth of registers. So with so many processors available to you, it's really important to be able to keep them all busy.

Now I think the note here is that your work items, or your threads, are executed concurrently in warps of 32 threads. So a warp is 32 threads. They basically execute in lock step. So however big your work group is, 32 at a time will execute in lock step in the actual physical hardware. Now obviously thread instructions are executed sequentially.

So executing other warps is the only way to hide latency and keep the hardware busy. What we mean by that is if one warp needs to do a fetch from memory and is waiting on memory, then if you have other warps available to run on that SM, it will schedule in another warp that maybe can do some arithmetic computation inside the ALUs. And you'd like to provide the scheduler with as many warps for it to choose from as possible in order to keep the machine busy.

So we think of something here called occupancy. And that's what we define as the number of resident warps divided by the maximum possible number of resident warps. Now obviously you'd like an occupancy of 1, which means that you've put as many warps on that SM as you possibly can. In practice, of course, that's limited by resource usage, in particular registers and local memory.

Let's look at an example then. So we have a hardware SM or shader unit. And say the specs say we can support up to eight work groups. Each SM has 32 kilobytes total local memory and can support up to 1024 work items maximum. So that's the hardware. Now the second point there, let's say we write our kernel and we define a work group size of 128 work items. But we find inside our kernel that for each work group we decide we'd like to use 24 kilobytes of local memory.

Well, you've only got 32 kilobytes of local memory on that SM. That means it can only fit one of these work groups on that SM, because it's limited by the local memory. It would need at least 48 to fit two of those. So what happens now is only one work group is going to be available on the SM. And that's going to give you 128 work items on the SM, out of a maximum possible 1024 work items.
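
Working through the numbers in that example:

    local memory per SM:          32 KB (in this example)
    local memory per work group:  24 KB  ->  floor(32 / 24) = 1 resident work group
    work items per work group:    128    ->  128 resident work items
    hardware maximum:             1024   ->  occupancy = 128 / 1024 = 12.5%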

So you have low occupancy. Check your GPU documentation for specific details on your hardware. So what are some heuristics as to how big you should make these work groups and how many work groups you should launch? Well firstly, the number of work groups should be at least greater than your number of multiprocessors, so that every multiprocessor has at least one work group to operate on. Secondly, the number of work groups per multiprocessor should be greater than, say, two, so you have multiple work groups on each multiprocessor.

So multiple work groups can run concurrently on a multiprocessor, and then if you have more than one, work groups that aren't waiting at a barrier, say, can be swapped in and keep the hardware busy. And again, that happens all transparently to you. But as long as you load the machine up it will give you good efficiency. And again, this is subject to resource availability such as registers or local memory, which is the example we just saw.

Your total number of work groups you'd like to launch say would be -- you want to launch in the order of hundreds. And that's going to let you scale from small devices across to larger devices. These work groups are in pipeline fashion, kind of batched out to the SMs.

If you can even launch thousands of work groups in a kernel, that's going to let you scale across multiple generations. The number of work items in a work group generally should be a multiple of the warp size. So again, a warp is executed physically in lock step, and is 32 wide.

If you have some warp that's only using, like, 15 or 2 or 3 threads, then the other threads in the warp aren't used. So you basically want to keep all the threads in the warp active. For control flow, the main thing you want to be thinking about is divergence.

This happens when you branch. So branching is fine. But you get divergence when work items in a single warp take different paths. And what happens is the different execution paths then need to get serialized. And what that means is the GPU, if you branch within a warp, has to, say, evaluate the if part of the branch.

And then go back and evaluate the else part of the branch before it can move on. And so our first example down below is an example of where that's going to happen, when you're branching within a warp. So say you have some control flow and you're examining your particular thread ID. And in this case, thread 0, 1, and 2 would take the if part, and maybe the other ones would take the else part.

In that case, your branch granularity is less than your warp size. So the GPU has to evaluate the if part, then the else part, before it moves on. But you can still have -- sorry -- you can still have branching but avoid divergence, if your branching granularity is a whole multiple of your warp size. So in the case below, we have the branch, but it's occurring basically in whole warps.

So if warp 0, 1, or 2, say, takes one part of the if branch and then the other warps in that work group take the other end of the branch, the hardware will know that it doesn't need to evaluate the other end of the branch for that work group -- sorry, for that warp -- because they've all taken one leg instead.
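
As a small sketch of the difference (these are fragments, and the two assignments are placeholders):

    // Divergent: work items within one 32-wide warp take different paths,
    // so the hardware runs the if side and then the else side.
    if (get_global_id(0) % 2 == 0) out[gid] = a; else out[gid] = b;

    // Branching, but no divergence: the condition is constant across each warp,
    // because the granularity is a whole multiple of the warp size of 32.
    if ((get_global_id(0) / 32) % 2 == 0) out[gid] = a; else out[gid] = b;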

Let's look at some examples. So here we have a parallel reduction. In this case all we're trying to do is find the sum of a buffer. And we'd like to use a parallel algorithm in order to use all the processors. So we could launch a kernel, and let's look at, say, one work group. And here we have the values, say, stored in local memory. And we have the threads of the work group. And each thread is going to add its neighbor. But we don't need to have all the threads active all the time.

But we iterate. So basically each thread adds its neighbor, then goes to the next iteration, each thread adds a partial sum, and we proceed in that fashion till finally the last thread adds two numbers and comes up with the sum of the buffer. As you can see, in this case we're going to have a lot of divergence. We have thread 0 doing one thing, thread 1 in this case is not doing anything, thread 2 is doing something else. So we have this interleaved type of workflow, and that's going to give us divergence.

So let's look at the same problem but solve it in a way that doesn't need divergence. So again we have our array in local memory. Now we know each thread needs to add one neighbor or one partial sum. But there's no reason that it has to be the immediate neighbor, we're simply looking for a sum. So here we've simply restructured the problem.

But we've now created a contiguous set of threads that are all doing the same work. Now this is a smaller problem shown here than you would have on a GPU -- the warp size on a GPU is 32 -- but imagine you had a warp size of 8. In this case, in the first step, we now have an entire warp doing the same set of work.
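
A sketch of such a reduction kernel with that contiguous ("sequential addressing") pattern, assuming the local size is a power of two and the input is padded to the global size:

    __kernel void reduceSum(__global const float *in,
                            __global float       *partialSums,
                            __local  float       *scratch)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        scratch[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);

        // at every step the active work items form one contiguous block,
        // so whole warps stay converged
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];   // add a partial sum, not the immediate neighbor
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partialSums[get_group_id(0)] = scratch[0];
    }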

And so we can avoid divergence. When you're accessing the main memory on the GPU, there are a couple of tricks available to you to get the maximum performance out of it. So the main memory, or the global memory, that's kind of the memory that's advertised when you go to the store and buy the card, you know, a 512 megabyte or 1 gigabyte card. So we're referring to the global memory there. And there are some rules that, if you follow them, let you get the peak data throughput. And these are rules that vary a bit by generation.

So with -- up to Compute 1.1, we have different compute levels that describe the capabilities of the hardware. But for a GeForce 8600, 8800 GT, 9400M, 9600 GT, basically a half-warp, 16 threads in this case, can access a contiguous, aligned, and in-order set of data.

Instead of having 16 threads fall back to 16 different transactions, it can simply do one larger transaction, as long as you follow some rules. So the rules are basically: the starting address for a region must be a multiple of the region size. So there's an alignment requirement there.

The kth thread in a half-warp needs to access the kth element of the region. So there is an in-order access requirement there. And there's also a contiguous access requirement. There's one exception: not all threads need to be participating. But let's look at some real examples here. So in these two examples we achieve coalescing.

So in the upper case, we have a half-warp of threads that are accessing a contiguous region of memory; it's properly aligned, and each one's accessing it in order. So instead of the 16 threads having to do 16 different memory accesses, the GPU coalesces that into one access of larger width.

And so you get that back in one memory transaction. And then in the lower case we see our one exception here: not all threads need to participate, but as long as the other rules are all followed, again it can do this in one single transaction. Let's look at some situations where we break coalescing.

So again we have our properly aligned contiguous region, but now we've permuted the access here in threads 1 and 2. That's going to break our coalescing on the older generation, and it's going to fall back to 16 different memory transactions. It's going to impact your performance quite a bit.

Similarly, in the lower case, we're contiguous, we're in order, but we're misaligned. And so again it's going to fall back to 16 transactions instead of a single coalesced transaction. On the newer GPUs the coalescing rules are much improved. This applies to the GeForce GTX 285 and the GeForce GT 330M.
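
To make the contrast concrete, here is a sketch of an access pattern that coalesces next to one that does not (the increment is arbitrary):

    // Coalesced: the kth work item of a half-warp touches the kth consecutive, aligned float.
    __kernel void copyCoalesced(__global const float *in, __global float *out)
    {
        size_t gid = get_global_id(0);
        out[gid] = in[gid] + 1.0f;
    }

    // Breaks coalescing (on the older rules): strided accesses can't be
    // combined into a single transaction per half-warp.
    __kernel void copyStrided(__global const float *in, __global float *out,
                              const uint stride)
    {
        size_t gid = get_global_id(0);
        out[gid * stride] = in[gid * stride] + 1.0f;
    }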

Basically what the hardware is going to be doing is it combines the addressing within a half-warp into one or more aligned segments. And it figures out the best way to transfer that number of segments. So all threads with addresses within a segment are serviced by a single memory transaction, regardless now of ordering or alignment within the segment. So let's look at some examples. This is what happens on the GTX 285, for instance, when you go to access your global memory.

In the left-most case, we have a contiguous segment again; it's completely out of order but that's fine, it's contiguous and it's aligned. And so the GPU knows that it can service those requests with a single 64-byte segment. So it goes ahead, and you basically get your coalesced access, despite being completely out of order. In the second case there's a misalignment problem, so the GPU can't do a 64-byte aligned memory transaction. But it looks at what you're doing, and it says, well, we can do this with a larger 128-byte segment. And so it'll go ahead.

It will do a larger transfer, and so there's still some data transferred that maybe you're not going to be using. But again, it won't suddenly fall back to 16 memory transactions. So you don't fall off a cliff. And similarly, the last case has another misaligned transaction. But in this case, just because of the way it lines up, the GPU says, well, I can do a 32-byte segment and a 64-byte segment. And again you haven't fallen back to 16 different transactions. What happens when you actually do coalescing?

So here's a simple experiment. This is on the 8800 GTX. So it's a previous generation GPU with the more stringent rules. But basically in our kernel all we do is read a float, increment this value, write it back. We're processing 3 million floats, 12 megabytes of data. We average our time over 10,000 runs.

So 12K blocks, 256 threads per block, reading the floats. If we achieve coalesced access, we can do this type of operation in 350 microseconds. In the second case there, coalesced but with some threads not participating -- that one exception -- the timing is still basically the same as a fully coalesced access. If we break coalescing, we can see the penalty for that. We've gone from 350 microseconds up to almost 3500 microseconds.

And that happens if we permuted or had some misaligned thread access. So it's almost a 10 times speed difference if you're able to achieve coalescing. In the second case, we can see the same kind of effect. And in this case, what if we were reading, say, float3 data? That's a 3-element data structure.

It doesn't actually follow the coalescing rules. But we can do some tricks where we can actually coalesce that through local memory. So you can load it in -- imagine that it was simply a set of single-element floats -- put it into local memory, treat it as 3-element floats, and then write it back out to global.
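
A sketch of that trick, assuming each work group of size N handles N float3 elements and the caller allocates 3 * N * sizeof(float) of local memory for the tile argument:

    __kernel void incFloat3(__global const float *in,
                            __global float       *out,
                            __local  float       *tile)
    {
        const size_t lid  = get_local_id(0);
        const size_t lsz  = get_local_size(0);
        const size_t base = get_group_id(0) * lsz * 3;

        // coalesced loads: consecutive work items read consecutive floats
        for (int i = 0; i < 3; ++i)
            tile[lid + i * lsz] = in[base + lid + i * lsz];
        barrier(CLK_LOCAL_MEM_FENCE);

        // now each work item owns one 3-element chunk in local memory
        tile[lid * 3 + 0] += 1.0f;
        tile[lid * 3 + 1] += 1.0f;
        tile[lid * 3 + 2] += 1.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        // coalesced stores back out to global memory
        for (int i = 0; i < 3; ++i)
            out[base + lid + i * lsz] = tile[lid + i * lsz];
    }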

The point here though is, if you can use this local memory to achieve coalescing, again you get really good results. So the uncoalesced case gives you 3300 microseconds. But again, if you achieve coalescing, 360 microseconds. So coalescing is probably one of the major performance improvements if you're not achieving it. We've talked about local memory, mentioned it several times. What is local memory?

Well, local memory on our GPU is basically a high-speed, low-latency on-chip memory. So this is opposed to global memory, which is on the card but not actually on the GPU chip itself. Local memory is actually on the GPU, and in fact you can see there, it's actually inside the SM.

And the way you can think about it is as a user-managed cache. If you're processing an image, say, a natural way to use it would be to load a tile of that image into shared memory. And then all the threads in your SM can read and write that tile data as much as they want.

When you get your final result in, you can push that back out to main memory. So in that case you can get a lot of quick accesses without paying the penalty of going to global memory. When you're using local memory, it's not coalescing you're worried about, you can actually access it very fast. But there are still some rules that will give you better performance. In particular, something called a bank conflict.

So on the GPU local memory -- local memory is divided into banks. And that's essential to achieve high bandwidth in a parallel application. And so each bank can service one address per cycle. And in general, memory can service as many simultaneous accesses as it has banks. If you have multiple simultaneous accesses to a single bank, the same bank, then you get a bank conflict. And in that case, those accesses need to be serialized.

So let's look at some examples here. So here we have the 16 banks of our local memory, each 32 bits wide. So imagine you had an array of integers in local memory. Then the first 16 integers in your array would be serviced by banks 0-15, and then it loops around again.

Now if your threads go ahead and access it, and in the leftmost case it's kind of nice and orderly -- it's all in order and each thread is accessing only one bank -- there are no bank conflicts. Each bank can service all the threads. Similarly, a permutation is fine. The key thing here is, as long as those banks don't have multiple requests coming to them, there are no bank conflicts.

So, examples where you do get a bank conflict. So on the left we have a two-way bank conflict. We have multiple threads, in this case 0 and 8, both accessing bank 0. We've got a 2-way bank conflict; the hardware has to service thread 0, then service thread 8. So it's serialized. Then you can have higher-order bank conflicts.
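
A common way this shows up, and the usual fix, sketched as fragments inside a kernel (16 banks of 32-bit words assumed, as on this generation):

    __local float tile[16][16];
    float v = tile[get_local_id(0)][0];        // column access: every row starts at bank 0,
                                               // so 16 work items hit the same bank -> 16-way conflict

    __local float padded[16][16 + 1];          // pad each row by one element
    float w = padded[get_local_id(0)][0];      // successive rows now start in different banks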

On the right side we have an 8-way bank conflict, where 8 threads are all accessing the same bank, and so it's got to serialize all those 8 transactions. Finally, the texture functionality on the GPU is still available to you through OpenCL. And we're going to show an example here. The example is optical flow. Basically, optical flow is calculating the motion of points in an image pair.

So here we have an image. It's actually a standard test dataset. It's a picture of a guy closing his trunk. As you can see, he's kind of pushing down the trunk. And what optical flow is going to do is calculate the motion between image one and image two. And then down below we have the results of the optical flow.

And those blue areas, if you kind of zoom in, are basically small arrows. They tell you the direction and magnitude of the motion that we see in the image pair. This kind of computation is embarrassingly parallel and compute-intensive, so it's a great candidate for the GPU. And it's also used in applications such as image stabilization, feature tracking, and computer vision. It's also used in things like video encoding.

Optical flow makes use of the texture cache and hardware bilinear interpolation. Let's talk about those. Let's look at how we actually implement optical flow. This is a very high-level description; there are many, many different implementations of optical flow. Here we examine pyramidal Lucas-Kanade optical flow, which is a pretty well known one. Basically, we create an image pyramid first. So we downsample the image, so we have the same image downsampled to different sizes. We do some edge filtering on it to get the vertical or horizontal edges, or derivatives, at each level of the pyramid.

Then we basically put it into some kernels that solve for the optical flow. Solving for the optical flow is basically as easy as, you know, computing this G-inverse-b matrix calculation. And G is 2 by 2. So we can simply precompute G at every point in our image so we have those available to us when we need to do the flow calculation.

And finally, we can go ahead and solve for the flow. And the optical flow processing sequentially goes from the smallest level of the pyramid, which kind of gives you coarse estimates, and you refine it down through the larger levels of the pyramid, feeding the flow results into each level until you get to the final base level.

So solving for flow is an iterative algorithm. You estimate the motion, calculate the vector V, but it's iterative. So you've got a guess for V, then you update your current position, and then you iterate again to get a new guess for V. And so what you end up doing is you have a small window of computation that's kind of slowly sliding across your image. The nice thing is, on the GPU you have a texture cache.

So if you keep these images in cache, even though you're sliding variably along the image -- and some areas of the image will have large motions and some areas will have only a small amount of motion -- basically you slide around the image and the GPU will manage the cache for you. So you get cache-coherent -- I'm sorry -- cached data access, provided by the textures.

Another nice thing about the texture pipeline on the GPU is hardware interpolation. So in our algorithms, subpixel accuracy and subpixel sampling is crucial. Between iterations as you're sliding around, your center position generally is not going to land exactly on a pixel. So rather what you're going to need to do is interpolate. So if you take a closer look at the computations here, where I and J in the upper equation are both the images.

So we're looking at the I at X Y, but we're looking at the second image, J, with some offset, which is our current guess for our flow. And so when we resample that, we need to get interpolated samples, and we use that information in order to compute our B vector from the previous slide.

The nice thing now, though, is that because we're using the texture hardware, we can apply linear interpolation to it. So we can look up on floating-point coordinates, slightly offset, and we retrieve a bilinearly interpolated sample. And that bilinear interpolation happens in hardware, in the GPU hardware. So that's really useful hardware for you to use.
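
A sketch of that lookup in OpenCL C, with a linear-filtering sampler; the kernel shape and buffer names are invented, and J stands for the second image:

    const sampler_t bilinear = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |
                               CLK_FILTER_LINEAR;

    __kernel void sampleWarped(__read_only image2d_t J,
                               __global const float2 *flowGuess,
                               __global float        *out,
                               const int width)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        float2 g = flowGuess[y * width + x];
        // offset by the current flow guess; the +0.5f samples at the pixel center,
        // and the hardware returns the bilinearly interpolated value
        float2 pos = (float2)(x + 0.5f + g.x, y + 0.5f + g.y);
        out[y * width + x] = read_imagef(J, bilinear, pos).x;
    }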

So we've looked at the pyramidal Lucas-Kanade flow for this demo, and the additional thing here is the visualization that you're going to see is actually done by the GPU by sharing data between OpenGL and OpenCL. So maybe we can switch over. Okay, great. So in the upper left there we see our optical flow.

And if Eric kind of moves around here, you can see it's calculating the motion vectors. So the green is the motion being calculated for each frame. And we're calculating dense optical flow in this case. So it's 640 by 480 input. With older methods you would usually only try to compute the flow at particular points. But with the processing power in the GPU we can compute optical flow at every point. So we're actually calculating at every point.

We're only displaying kind of a subset, a grid, in order to make the arrows visible. This is running on a GTX 285, and these arrows are actually being drawn via buffer sharing. So I can simply compute the two end points of the arrows that I'd like to draw and send that right into OpenGL. And it can go ahead -- we do a GL draw, you know, GL_LINES basically, drawing lines -- and that all happens on the GPU. Okay, thanks a lot.

Actually, just going to --

[ Applause ]

Switch it back to this -- okay. And so if we look at our performance across different GPUs, different GPUs have different numbers of SMs, and it certainly pays to have more SMs. So comparing the 8800 GT versus the GTX 285, running the same code across two different GPUs, with more SMs you can get better performance. Okay, thanks a lot, and I'm going to hand it over to Aaftab.

[ Applause ]

So I hope these two sessions gave you a deeper insight into how OpenCL works, how to use GPUs, and some guidelines on writing high-performance OpenCL programs for CPUs and GPUs. Thank you.