
WWDC08 • Session 727

Advanced OpenCL

Media • 1:03:26

Dive deeper into the practical applications of OpenCL to accelerate compute intensive tasks. Discover a variety of algorithms that can harness OpenCL to yield incredible performance gains. Understand the details of writing an OpenCL compute kernel, and get deeper insight into the specifics of the OpenCL execution model and memory architecture.

Speakers: Derek Gerstmann, Ian Ollmann

Unlisted on Apple Developer site

Downloads from Apple

SD Video (765.7 MB)

Transcript

This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.

Hello and welcome. This is the advanced OpenCL course where we're going to be taking you into the nuts and bolts of how to use OpenCL within your applications. My name is Derek Gerstmann, and let's get started. So I think with the introduction of multi-core processors and especially programmable GPUs, there's sort of been this looming question that I think a lot of us have been asking ourselves.

What if we could use all the compute devices in our Macs to make our applications run faster and do more stuff? Wouldn't that be great? And ultimately, I think OpenCL is one giant step forward in providing a solution and an answer to this overwhelming question. So in this session, what we're going to be getting into is how to develop kernels and applications with OpenCL and how to combine OpenCL with OpenGL for advanced visual processing.

So OpenCL is provided as a framework that fits within the graphics and media layer for OS X. It's a first class citizen and sits alongside all the other frameworks that you're probably already familiar with. In addition, OpenCL has been designed to work efficiently with OpenGL so you can share data for advanced visual processing.

[Transcript missing]

Some basic OpenCL concepts that I'm going to be using throughout my slides here are going to get into address spaces as well as threads and thread blocks. In terms of address spaces, OpenCL describes a memory hierarchy that maps onto current generation compute devices where threads have varying scopes in terms of how they can access and modify memory.

At the highest level, we've got global memory, which is read/write for all threads in a kernel. We also have local memory, which is read/write for all threads running in a thread block. And then finally, we have private memory, which is read/write only for a single thread in its local scope.

And the hierarchy here that we're describing in terms of threads is indicated by this diagram on the right, where a thread is a single work unit executing on a device, which is included within a group of threads running on a compute device, which can be composed within a logical grouping called a device group. So a device group consists of multiple compute devices, which have multiple thread groups, which contain multiple threads. And ultimately, this is how OpenCL exposes these device resources.
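
A minimal OpenCL C sketch (not code from the session; spellings follow OpenCL 1.0 as it was later standardized) may help make the three address spaces concrete:

    // Hypothetical kernel touching all three address spaces described above.
    __kernel void address_space_example(__global float *data,    // read/write for all threads in the kernel
                                        __local  float *scratch) // read/write for the threads in one thread block
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        float tmp;                            // private: visible only to this single thread

        scratch[lid] = data[gid];             // stage a value from global into local memory
        barrier(CLK_LOCAL_MEM_FENCE);         // make the staged values visible to the whole block

        tmp = scratch[lid] * 2.0f;            // compute in private scope
        data[gid] = tmp;                      // write the result back to global memory
    }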

So some basic OpenCL usage. First thing that we need to do is connect to one or more devices, create a device group, create a context which provides us with a command stream for submitting to a specific device, allocate some device memory for storing the results of a computation, load and compile a compute kernel, set our kernel arguments, and then execute the kernel and use the results. And then we can just repeat as much as necessary to get the results of our computation. And then finally, we just need to shut down and release system resources.

So to start, let's just go through a very basic example for biasing and scaling some coordinates. You can think of this as generating UV coordinates for texture maps or just a uniform two-dimensional mapping. So here we have a very simple kernel, which is going to be executing across a two-dimensional domain.

Here we've got three parameters, bias, scale, and result. Bias and scale are going to be two constant values that we're going to use to modify our coordinates. Result is an array of two-dimensional float values. You can notice here that we have a global specifier for this parameter, which is indicating the memory address scope for this value.

Global means that it's going to be allocated from the application side, so that all threads are going to have read/write access. And then finally, you notice the kernel qualifier here, which is dictating that this method will be exported and can be called from the application side.

And as far as what this kernel is actually doing, we're using the global thread ID in two dimensions and scaling that based on our global domain size to normalize the coordinates, which are then biased and scaled, and then stored within the result array.
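
The transcript does not reproduce the kernel source itself, but a kernel matching this description would look roughly like the following OpenCL C sketch (the name and exact signature are assumptions):

    // Normalize the 2D global thread ID into [0,1], then bias and scale it.
    __kernel void bias_and_scale_coords_2d(float2 bias,
                                           float2 scale,
                                           __global float2 *result)   // global: writable by all threads
    {
        size_t x = get_global_id(0);
        size_t y = get_global_id(1);
        size_t w = get_global_size(0);         // global domain size in each dimension
        size_t h = get_global_size(1);

        float2 coord = (float2)((float)x / (float)w, (float)y / (float)h);
        result[y * w + x] = coord * scale + bias;
    }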

So now we've got our compute kernel. Let's use it. So first we need to connect to our compute device, which in this case, we're going to request a single GPU, create a context for our command stream, and then allocate some device memory. Here we're going to create a 128 by 128 two-dimensional array, which we're going to ask for as read/write usage in terms of being able to both read and modify the array values.

And then finally, we're going to create our program from a source string that's just loaded as a standard C string, currently in host memory, build and compile that executable, and then ask for the entry point for our kernel method that we declared previously, which was that bias and scale coords 2D method.

Next, we just need to set our kernel arguments. These are application-dependent values that we would normally pass in a normal function call; we just need to pass these on to OpenCL. So for our bias and scale values, we're going to use minus 1 and minus 1 for our bias, and scale is going to be 2.0 and 2.0.

So this will modify the origin, set it back, and create values ranging from negative 1 to positive 1 across a two-dimensional domain. So we just pack our arg values by passing in the memory location for each one of these variables, and then specify the total size in bytes for each one of these.

So now that we've passed in our arguments, we just need to specify our global domain size, the number of threads that we wish to issue across each dimension. This is handled by the global dimensions and local dimensional arrays that I specify here. So we're just going to keep things simple and issue 128 by 128. And notice the number two there, which is being used to indicate that we are executing a two-dimensional kernel.

And then once this finishes, we can read the data back into host memory and use the results for whatever we wish within our application. And if we examine this two-dimensional array as a bitmap or an image, we can see here that we've generated a gradient with values ranging from minus one, minus one to positive one, positive one.
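
For reference, here is a compact host-side sketch of those same steps. It uses the OpenCL 1.0 C API as it was later standardized, which differs in names from the pre-release interface demonstrated in this session, and error checking is omitted:

    #include <OpenCL/opencl.h>

    void run_bias_scale(const char *source)                     // kernel source as a C string
    {
        cl_device_id device;
        cl_context context;
        cl_command_queue queue;
        size_t global[2] = { 128, 128 };                        // 128 x 128 two-dimensional domain

        clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);   // connect to a single GPU
        context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        queue   = clCreateCommandQueue(context, device, 0, NULL);     // command stream for the device

        // Device memory for the 128x128 array of float2 values, read/write.
        cl_mem result = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                       sizeof(cl_float2) * 128 * 128, NULL, NULL);

        // Build the program from the source string and get the kernel entry point.
        cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "bias_and_scale_coords_2d", NULL);

        // Kernel arguments: bias = (-1, -1), scale = (2, 2).
        cl_float2 bias  = {{ -1.0f, -1.0f }};
        cl_float2 scale = {{  2.0f,  2.0f }};
        clSetKernelArg(kernel, 0, sizeof(bias),   &bias);
        clSetKernelArg(kernel, 1, sizeof(scale),  &scale);
        clSetKernelArg(kernel, 2, sizeof(result), &result);

        // Execute across the 2D domain and read the results back into host memory.
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
        cl_float2 host[128 * 128];
        clEnqueueReadBuffer(queue, result, CL_TRUE, 0, sizeof(host), host, 0, NULL, NULL);
    }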

So let's take this further and actually see if we can use this to generate some Perlin noise. So here we've got a compute kernel, and this is going to be implementing Perlin's improved noise function. And we're going to start off by declaring some arrays of constant data that we're going to reference as basically a lookup table within our compute kernel.

So the first array here is p, which is just Perlin's recommended permutation table, which contains pseudo-random numbers of a size of 256. Next we're going to create an array of gradient directions, which indicate the direction from the center of a cube to its edges. And we're going to extend it to 16 so we can avoid a division by 12.

Next we have our smoothing kernel, which is just a cubic spline smoothing function to combine values that we calculate across lattice cells. Next we have a helper routine for computing the value at each gradient location by passing in a lattice coordinate and its fractional offset, using that to index into our permutation array and calculate a contribution value.

To actually evaluate the noise function, we're going to have another helper routine, which basically computes our fractional offset and our lattice coordinate, indexes into all of the surrounding neighboring lattice cells, and computes a contribution value for each one of those based on the gradient directions. And then combines those using the smoothing function that we specified previously to form our final contribution value at that location in space.
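
As a point of reference, the helpers described here follow Perlin's published improved-noise construction; a 2D sketch of the fade curve and the gradient-contribution helper (the session's own source is not in the transcript, and its exact form may differ) looks like this:

    // Fade curve used to blend contributions across a lattice cell (the quintic
    // from Perlin's improved-noise paper).
    float fade(float t)
    {
        return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
    }

    // Contribution of one lattice corner: the hash picks one of 16 gradient
    // directions (padded from 12 to avoid the division), dotted with the offset.
    float grad(int hash, float x, float y)
    {
        int   h = hash & 15;
        float u = (h < 8) ? x : y;
        float v = (h < 4) ? y : x;
        return ((h & 1) ? -u : u) + ((h & 2) ? -v : v);
    }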

So those were helper routines. The actual entry point is going to be our compute kernel here, which we call Perlin Noise 2D. We're passing in the same parameters that I illustrated in my last example, where we've got a bias and a scale factor. And in this example, we're going to store the results in an image to make it easier to visualize.

So you'll notice that everything's basically the same, except we're going to now use our coordinate as an input into our improved Perlin Noise function that we declared previously. And then we're going to use the resulting value to write and update our array. Or sorry, image in this case. So I'm going to go ahead and demo that.

So here we have a Perlin noise function that we're evaluating across two-dimensional domain. This is 1024 by 1024, and on a single GPU, performance is upwards of 450 frames a second. And ultimately, this is because we aren't doing any memory reads whatsoever. Everything is just a lookup table, and we're actually just doing a pure data generation.

So, scaling and biasing. We can zoom in and out and modify our coordinates. And one of the examples that I'm going to be showing in just a second is going to be using this as a basis for doing displacement. So, if we sum up multiple octaves of Perlin noise, we get some higher level detail, and we can use this as an interesting means for textures. And this is what I'm going to be using for my displacement demo in just a minute. So, we can switch back to the slides.

So now that we have some basic understanding of how to execute kernels and what's needed to connect to devices, let's look at using OpenCL with OpenGL. And this really provides us the best of both. We can use OpenGL for rendering for all of our existing geometric data, and then OpenCL for doing simulations and more advanced rendering techniques. One of the great things about OpenCL is that it provides direct access to memory that's allocated and managed by OpenGL. This means that we can reference data on the device without copying it back across the host.

Our general usage here is all we have to do is create and allocate the object or memory in OpenGL and then just reference it in OpenCL. Then we can use attachments as a means for allowing OpenCL to modify the data and then detach so that OpenGL can then use it.

So in this example, we're going to create a vertex buffer object using a standard gen buffer and bind buffer, which you should all be familiar with in terms of OpenGL programming. The main thing that I want to point out here is that we're issuing the GL static draw usage flag, which is very important.

This is indicating that as far as OpenGL is concerned, this data is not going to be updated from OpenGL. From CL, we're actually going to be modifying the values ourselves, but from OpenGL, there's not going to be any more data submission. This ensures that there's not going to be an additional copy, and we're just referencing the data.

Then in OpenCL, we just need to create an array passing in the memalloc reference usage flag, which simply states that the memory has already been allocated on the device. There's no need to allocate any more memory; simply use the memory address for this particular buffer.

And we can then attach the existing VBO using a CL attach GL buffer method, execute our kernel, and then detach to give control back over to OpenGL. And then here we just bind our buffer and then issue a GL draw elements call, and the data's already been updated, and we get the results.
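
A sketch of this create-in-GL, reference-in-CL pattern is shown below. It uses the GL-sharing entry points that were later standardized for OpenCL 1.0 rather than the pre-release attach/detach calls named in the session, it assumes the CL context was created against the GL share group, and error handling is omitted:

    #include <OpenGL/gl.h>
    #include <OpenCL/opencl.h>

    // Update a GL vertex buffer from an OpenCL kernel without copying through the host.
    void update_and_draw(cl_context context, cl_command_queue queue,
                         cl_kernel kernel, size_t num_verts)
    {
        GLuint vbo;
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        // GL_STATIC_DRAW: OpenGL will not resubmit this data; CL rewrites it in place.
        glBufferData(GL_ARRAY_BUFFER, num_verts * 4 * sizeof(GLfloat), NULL, GL_STATIC_DRAW);

        // Reference the existing VBO from OpenCL instead of allocating new device memory.
        cl_mem cl_vbo = clCreateFromGLBuffer(context, CL_MEM_READ_WRITE, vbo, NULL);

        // "Attach" for CL, run the kernel that writes the vertices, then hand control back to GL.
        clEnqueueAcquireGLObjects(queue, 1, &cl_vbo, 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_vbo), &cl_vbo);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &num_verts, NULL, 0, NULL, NULL);
        clEnqueueReleaseGLObjects(queue, 1, &cl_vbo, 0, NULL, NULL);
        clFinish(queue);

        // Back in OpenGL: bind and draw; the vertex data was updated on the device.
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glDrawArrays(GL_POINTS, 0, (GLsizei)num_verts);
        clReleaseMemObject(cl_vbo);
    }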

The other usage pattern that we can use with sharing data between OpenCL and OpenGL is to allocate and manage the data in OpenCL. This might be useful if you're generating data in OpenCL, or if you're dealing with unusual formats, or if you simply want to use OpenCL on one device and OpenGL on another device and actually do need to copy the data between devices. So our general usage of sharing here is we just create and allocate in OpenCL, read and write via the OpenCL kernel, and then copy the results back to OpenGL.

So very similar, except this time we're going to be creating and allocating the memory in OpenCL. We're going to use the mem read write usage flag, which I was using previously. Execute our kernel, and then read the data back into host memory that's already been allocated in our application. That's the data array. And then submit this to OpenGL. So here we are copying from the device back into our application, submitting to OpenGL, which is going to be slower, but it gives us more flexibility in terms of what we can do.
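
And a sketch of this second pattern, where OpenCL owns the allocation and results travel back through host memory before being resubmitted to OpenGL (same headers and caveats as the previous sketch; names are assumptions):

    // Allocate and compute in OpenCL, then copy back to the host and hand to OpenGL.
    void compute_then_upload(cl_context context, cl_command_queue queue, cl_kernel kernel,
                             GLuint vbo, float *data, size_t count)
    {
        size_t bytes = count * 4 * sizeof(float);
        cl_mem positions = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, NULL);

        clSetKernelArg(kernel, 0, sizeof(positions), &positions);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &count, NULL, 0, NULL, NULL);

        // Blocking read into application memory, then resubmit to GL: slower but more flexible.
        clEnqueueReadBuffer(queue, positions, CL_TRUE, 0, bytes, data, 0, NULL, NULL);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, bytes, data, GL_DYNAMIC_DRAW);

        clReleaseMemObject(positions);
    }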

So, as an example of sharing data between OpenGL and OpenCL, I'm going to show you this geometric displacement demo that I was talking about. Here we're going to create a GL vertex buffer object, fill it with some initial positions, attach, and then use within OpenCL. We're going to evaluate several octaves of Perlin noise, move the vertex positions in the direction of the normal, which is the normal outward direction, and then detach and then render with OpenGL. So let's switch over to the demo.

So, like I was saying, we're initially submitting just a sphere, which is a whole bunch of vertices, and then using that as our inputs to our OpenCL kernel, we evaluate that Perlin noise function at that location in space, and then modify the vertex locations by pushing out in the normal direction.

Then we also take three more samples using finite differences to compute a new normal at that location in space, and then all the rendering is being done by OpenGL. So here we've got several different shaders to give us Phong illumination, and then we're also doing perspective-corrected shadow maps so that we get a highly detailed shadow, and we're also jittering the sample at the edges of the shadow map to give us a nice soft blend. We can also adjust some parameters of the noise evaluation, so we can increase and decrease the frequency, and modify the number of octaves that we're calculating.

Increase the amplitude. And then just as another demonstration of what we can do on the rendering side of things, we can also use this for reflection and refraction. And we're also doing some chromatic dispersion. So we're using the Fresnel calculations for doing the reflection and the refraction coefficients. And then we're also simulating the chromatic dispersion by assuming that each channel of color is a different wavelength of light, which is going to refract differently. That's how you get the highlights to have drastically different colors as the light gets refracted.

Okay, back to slides. Okay, and one other demo that I want to show you here is a smooth particle hydrodynamics fluid simulation. This was actually shown in John's State of the Union for the graphics and media session that you might have seen earlier this week. So SPH is a simulation method for free surface flow. It uses the Navier-Stokes equations for fluid motion by calculating a weighted sum of physical values. So you can think of this as a 3D convolution of forces.

And the algorithm that I've implemented here is for every single frame, we're going to compute particle positions based on the forces. And then we're going to dynamically fill a uniform grid with particles in grid cell coordinates. We're going to use that as an acceleration structure to identify nearest neighbors so that we don't have to compare ourselves against every single particle. So we only look at our neighboring grid cells based on our current particle location.

From this, we're going to create a nearest neighbor map from the uniform grid that I just mentioned. And then for every particle, we're going to evaluate density, pressure, curvature, and viscosity contributions based on the particle attributes. From this, we can define forces, which we can update and integrate per time step. And then finally, create a distance field based on the particle locations, which I'm gonna store as just a 3D texture, and then ultimately use that for volume ray casting for the rendering. So let me switch back over to the demo machine.
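
To summarize the per-frame pipeline just described, here is a hypothetical host-side outline; the kernel names and launch sizes are invented for illustration and are not the demo's actual code:

    // One simulation step of the SPH pipeline described above; error checking omitted.
    void sph_step(cl_command_queue q,
                  cl_kernel grid_fill, cl_kernel neighbor_map, cl_kernel forces,
                  cl_kernel integrate, cl_kernel distance_field,
                  size_t num_particles, size_t field_voxels)
    {
        // 1. Bin particles into the uniform grid (the acceleration structure).
        clEnqueueNDRangeKernel(q, grid_fill, 1, NULL, &num_particles, NULL, 0, NULL, NULL);
        // 2. Build the nearest-neighbor map from the grid.
        clEnqueueNDRangeKernel(q, neighbor_map, 1, NULL, &num_particles, NULL, 0, NULL, NULL);
        // 3. Evaluate density, pressure, curvature, and viscosity contributions into forces.
        clEnqueueNDRangeKernel(q, forces, 1, NULL, &num_particles, NULL, 0, NULL, NULL);
        // 4. Integrate the forces over the time step to update particle positions.
        clEnqueueNDRangeKernel(q, integrate, 1, NULL, &num_particles, NULL, 0, NULL, NULL);
        // 5. Rebuild the 3D distance field used by the volume ray-casting shader.
        clEnqueueNDRangeKernel(q, distance_field, 1, NULL, &field_voxels, NULL, 0, NULL, NULL);
        clFinish(q);
    }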

So here we have the Stanford bunny sitting within this pedestal. And as I mentioned, this is a 3D texture which we're using as an input to a raycast volume shader so that we can do the rendering and then also calculating reflection and refraction, similar to the previous geometric displacement demo.

So he melts in a fluid-like fashion. And we've got upwards of 34,000 particles here. And for this particular demo, we're actually doing the rendering on one GPU and the simulation on another, and then moving the data between devices to do the rendering and updates for every frame. And you can also interact with this, so you can splash things around.

And with about 35,000 particles and doing all of the force calculations and the texture generation, we're getting on the order of 40 to 50 frames a second, depending on how tightly packed each particle is in terms of its location in space. We can actually get this up to about 100,000 particles and still be interactive.

So, as a segue into the next presenters' sections within this course, I just wanted to give you some brief optimization strategies for developing OpenCL kernels. Overall, you really want to maximize computation. You need to keep the device as busy as possible. And the best way to do that is to minimize the amount of memory that you're using and use that memory effectively.

So make sure that it's as close to the current thread that's running and utilize your caches efficiently. And I tried to illustrate this with these two best and worst case scenarios. So the best case is to have uniform memory access where every thread is accessing memory from a unique location, doing a whole bunch of work, using a linear execution, so not too much branching, not too many if statements, and then writing to unique destinations within your output buffer.

This will give you ideally the best performance. The worst case is to read a whole bunch of data from random locations, do a whole bunch of branching and iteration in arbitrary ways within your kernel, and then scatter and write to random output locations.

By doing this, you can basically blow out all of the device's caches, force data to get transferred way too frequently, and decrease performance drastically. You can sort of judge this and see that for most applications you're going to fit somewhere in between, but you can use this as a guideline for best practices.

Also on the OpenGL side of things, you should really focus on referencing data that's been managed in OpenGL and just use those as your memory outputs. So don't allocate memory through OpenCL. Simply reference and use OpenGL managed memory. Then you don't incur the copy across the bus. And then also something else that I didn't have time to talk about is you can actually separate out your rendering from your computation using asynchronous execution. So OpenCL allows you to execute a kernel asynchronously. And you can just let it sit there and run and then get an event indicating that the kernel is finished and use that as a means for updating your rendering thread.

So in summary, OpenCL provides a very straightforward programming model for performing data parallel computation that's incredibly flexible and incredibly powerful, and also integrates seamlessly with OpenGL to enable you as a developer to perform some really advanced visual processing and use cutting edge rendering techniques. And with that, I'd like to hand it over to Ian Buck from NVIDIA.

Okay, hi. So I'm Ian Buck. I actually did my doctorate research in GPU computing back in the day, and I'm excited that now all this is becoming a reality. It's been a pleasure working with Apple. I drive all of the computing software at NVIDIA, and we've been working tightly with them, obviously, on OpenCL.

So the purpose of my talk is basically to give you an idea, a conceptual model of some of the differences between the GPU and a CPU. And I want to emphasize that a GPU is not just a CPU with more cores. It really is a fundamentally different kind of processor.

And it had to be in order to deal with the level of parallelism that we want to exploit. We have hundreds of cores in our processors. Each core is capable of maintaining thousands of threads at the same time. And as a result, it's much less dependent on caches. We can spend more of our die area on ALUs and on computation than traditional caching architectures.

So the outline for this talk: I'm going to highlight three key things that emphasize how GPUs are different. First is the importance of filling up GPUs. There was a question earlier about how many threads do you really want to kick off, and I'll talk about that. Second, using the local memory; that's a big advantage that we have on the GPU, and I'll dig into that in terms of performance and give some examples. And third, optimizing for that memory performance. The previous speaker talked about scatter and gather and how making a linear kind of problem can really present the best kind of memory access pattern to the memory subsystems.

So first, filling up the GPU. There was a question earlier, you know, if I have a 10 million pixel image, do I really want to kick off 10 million threads? And the answer is yes. I mean, GPUs just love threading. You know, bring it. So that's how we keep the GPU busy. Today's GPUs have 16 processors, each with 8 cores. Henry talked about that this morning.

8800GT has a total of 128 cores. Threads are really cheap in our architecture. We've built our architecture to absorb that many threads. We actually have dedicated hardware support for context switching between those threads. And we'll actually switch on every instruction to another thread in order to keep the GPU busy, whether it be the ALU functional units or the memory subsystem. So more threads, the better.

The thread groups as part of the OpenCL model are distributed across the different processors. So we're actually going to be processing the threads and the thread groups concurrently. The first question is, how many threads do I really need in order to get it working? Obviously you want to kick off more than one thread, but typically you want to kick off at least as many thread groups as you have processors.

On the 8800GT, that may be 16 of these processors. But you can certainly go more, and the GPU can queue up millions of threads, and that's not a problem. It also means as GPUs scale and get bigger, and we add more processors to NVIDIA GPUs, your applications will scale seamlessly because we can execute more things in parallel.

So I want to talk about filling up each processor. We have that thread group, which is a collection of threads. How big should my thread groups be? On current architectures, we can support up to 768 threads per processor. The more threads per group, the better. Obviously, we'll use those threads.

The hardware will switch between one thread to another thread to keep the ALUs, the cores, busy on the GPU. The recommended minimum is around 64 threads, but certainly 256 threads is a better choice that hides any latencies that may be in the pipelines within the architecture. But certainly, you can grow further. This is often algorithm dependent. So you want to experiment, just try different sizes. See what makes sense for you.

Often the size of a thread group is determined by the kinds of memory blocking you're doing, things you may have been used to doing when blocking for caches. That thread group typically defines your blocking size. OpenCL also provides a function called get kernel thread group info, which can give you an idea, based on how your code compiled, the number of registers, and the amount of resources, of what's a good optimal size for that thread group, to help you decide. But certainly experiment.
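
In the OpenCL API as it was later standardized, that query is clGetKernelWorkGroupInfo; a small host-side sketch of using it to pick a work-group size along the lines suggested above (the names and the 256 default simply follow the guidance in this section):

    #include <OpenCL/opencl.h>

    // Pick a thread-group (work-group) size and launch a 1D kernel; error checking omitted.
    void launch_1d(cl_command_queue queue, cl_kernel kernel, cl_device_id device, size_t n)
    {
        size_t max_group = 0;
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(max_group), &max_group, NULL);

        size_t local  = (max_group >= 256) ? 256 : max_group;  // 64 is a sensible minimum, 256 a good default
        size_t global = ((n + local - 1) / local) * local;     // global size must be a multiple of local size
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    }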

So using the GPU's local memory. The local memory is a dedicated on-die memory that sits right next to those ALUs. It's as fast as registers to access. You can think of it as an indexable register file. So it's orders of magnitude faster than global memory. We computed that on the 8800GT, you're pushing over 760 gigabytes per second of aggregate bandwidth to that local memory.

So if you can move your data from the global, which is around 70 to 80 gigabytes a second, copy it into the local memory, and spin on it there, you can actually get some huge performance improvements. So as an optimization tip, first get your algorithm working, then see if you can optimize for that local memory.

A bit of detail about that memory. It's banked. It contains 16 banks. Every word will service a different bank based on the MSBs of the address you're accessing in the local memory. So you want to think about that if you're doing some further perf tuning. Once you've taken your code into the local memory, think about how your access patterns are happening across the different threads, because that can further give you more performance. On today's GPUs, the local memory is 16 kilobytes per processor, per thread group, and the threads within that thread group will be able to share that local memory.
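
As a concrete picture of "copy it into the local memory and spin on it there," here is a small OpenCL C sketch (not from the talk) that stages one value per thread into __local storage before reusing it; the host sizes the __local argument with clSetKernelArg(kernel, 2, group_size * sizeof(float), NULL):

    // Stage a row of data through local memory, then read neighbors from the fast on-die copy.
    __kernel void smooth_row(__global const float *input,
                             __global float *output,
                             __local  float *tile)            // sized by the host per thread group
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        tile[lid] = input[gid];                  // one coalesced global load per thread
        barrier(CLK_LOCAL_MEM_FENCE);            // the whole group sees the tile before reuse

        float left  = (lid > 0) ? tile[lid - 1] : tile[lid];
        float right = (lid + 1 < get_local_size(0)) ? tile[lid + 1] : tile[lid];
        output[gid] = 0.25f * left + 0.5f * tile[lid] + 0.25f * right;
    }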

Further optimizing for the global memory. The previous speaker talked a little bit about that. Today it's about 80 gigabytes a second of peak performance. You want to think about how the GPU is executing the threads. On a CPU, you'll switch to one thread, and that thread will dominate that processor.

And you often want to optimize for locality across different accesses. So the first access will read the first word, the next one will read the next word, the next one will read the next word. And you know in your head that the cache is trying to coalesce those together and leverage the caching.

On a GPU, we're running your threads in parallel, all at the same time. On current architectures, we gang them together in groups of 32. So instead of thinking about how an individual thread is going to access memory in its access pattern, you want to think about how memory is being accessed across threads.

If the first thread's reading the first word, the second thread's reading the next word, the third thread's reading the next word, that's going to look like one wide memory access to the memory subsystem. And the GPUs are going to optimize for that and get you that 80 gigabytes a second.

So you need to turn your thinking around a little bit. The good news is often, as I'll show later in an example, this is how naturally you want to program -- your algorithms naturally want to access data this way when you have 10 million threads flying through the system. Often the threads are working on neighboring parts of the problem, neighboring parts of the data. They'll naturally read neighboring values and get that good performance.
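
In code, "thread i reads word i" is simply indexing by the global ID; a sketch:

    // Coalesced access: consecutive threads touch consecutive words, so the hardware
    // can merge each group's loads and stores into one wide memory transaction.
    // (Reading in[i * LARGE_STRIDE] instead would scatter the accesses and waste bandwidth.)
    __kernel void scale_coalesced(__global const float *in, __global float *out, float k)
    {
        size_t i = get_global_id(0);
        out[i] = in[i] * k;
    }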

So as a result, brute force algorithms often result in the best performance. The complex stuff you may have been used to doing in your traditional C and CPU code, where you're trying to minimize the number of memory accesses: typically the best performance is just the brute force one. Create a thread for every data element in your problem, solve the problem for that thread, and the memory subsystem likes that kind of memory access.

So here's a simple example. Typically in fluid simulations, we're updating our forces within the fluid. It's a little different kind of fluid than we showed before. But let's say I have a simulation grid, and I want to update my fluid motion. I want to keep the pressures normalized in the system. Water is an incompressible solution.

So usually what we do is we perform our updates, and then we have to normalize the pressure. Normalizing the pressure is simply doing a local pressure calculation for every grid cell in my simulation, and seeing what the differences are and trying to normalize the velocities of all the fluid so that we keep the pressure constant across the fluid. This is simply a neighborhood kind of calculation.

In my grid cell, I look at my neighboring pressure and my pressure and try to normalize. Simple add operation. But I have to do this across all the cells in the simulation. There's the diagram. So I'll have one thread be working on the green cell, looking at my neighbors, influencing my pressure. I'll obviously have a thread for every cell, looking at my neighbors, updating my pressures.

So there's two kinds of ways we do this on the GPU. The first way, and the way I'd recommend you start, is just do a simple brute force algorithm. Create a thread for every grid cell. We get lots of threads, bring it, the GPU loves it, and then we'll get great streaming performance. Because if we look across the different threads, everyone's going to read the guy above them.

Well, that's one big wide memory access, as seen by the GPU. Then we'll read everyone to the right. That's, again, one big wide memory read, as seen by the GPU, and below. So typically, that brute force algorithm works great. And this doesn't just apply to the fluid simulation. We've seen this in image processing. We see this in more complex stuff like ray tracing. Often, you start with a brute force algorithm. It's the simplest thing to code up, and often gives you great performance right from the starting gate.
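
A hedged sketch of that brute-force version, with one thread per grid cell and invented names for the grid fields:

    // One thread per cell: read the four neighbors and relax toward their average.
    __kernel void normalize_pressure(__global const float *pressure,
                                     __global float *delta,
                                     int width, int height)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        int i = y * width + x;

        // Across the thread group these neighbor reads become wide, streaming loads.
        float up    = (y > 0)          ? pressure[i - width] : pressure[i];
        float down  = (y < height - 1) ? pressure[i + width] : pressure[i];
        float left  = (x > 0)          ? pressure[i - 1]     : pressure[i];
        float right = (x < width - 1)  ? pressure[i + 1]     : pressure[i];

        delta[i] = 0.25f * (up + down + left + right) - pressure[i];
    }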

Another big possibility for improvement is, like I mentioned before, to use the local memory. So what I can do is take the fluid, the grid simulation, all those pressures, copy some region of it, the region that my threads care about, the threads in my thread group might care about, into the local memory.

Now the threads can work together to not overfetch their data. Every thread can spin on the local memory, perhaps do multiple iterations of the pressure solve to try to normalize all the pressures in the system, but using the local memory, which is 700 gigabytes a second of memory bandwidth.

So that was a really quick introduction of some of the top three things that you want to do to keep the GPU busy. It is a massively data parallel processor. Don't be afraid to kick off millions of threads. We can handle it. We love it. It keeps the GPU busy. Threads are really cheap. The GPUs have dedicated hardware to keep all those threads busy and active. And we need that in order to keep all the pipelines full.

The other big point is use the local memory. Huge speedups have been seen by taking stuff out of the memory, putting it in that local-- think of it as a software managed cache. You get to decide what goes in the cache, not some hardware prediction scheme or caching scheme. It's under your software control. You simply declare variables that are local. They get allocated to that memory, and you can control reads and writes into it.

The other big point is you want to think about memory access. The threads are all working together. They're going to see similar memory-- they're going to perform their memory accesses together. Don't waste time optimizing for one thread. Don't start with the most complex CPU data structure you know about to minimize memory access. Just start with the brute force one. You'll be surprised. It's simple, and you'll get great performance right off the bat. So that was a brief overview. I'm going to introduce now the next Ian, and he's going to talk about optimizing compute kernels. Thanks.

Thank you. So it was something of a question to me what to say after five people in a row have talked about OpenCL. So I think the best thing I can do right now is to sort of bring it all together and give an example of sort of bringing it down to earth in the context of how you'd be programming in your own application.

So the first question you have to ask yourself is, like, when do I use OpenCL? There are a lot of other technologies out there, so I'll try to give you some context. And then after that, I'm going to work through a version of the NBody simulation that you've been seeing to sort of show how our developer tools work in conjunction with this and some survival techniques for working with this.

So when to use OpenCL? OpenCL, as you've learned, is great for large, data-intensive problems. It's really good for expensive computations, anything that might touch data more than once, like an N log N implementation and up. And it's great for highly parallelizable problems. If your problem is really tiny, then perhaps if you'd just written a little piece of code to do that, it might have finished before OpenCL even got started. If you're just limited by the rate at which you can copy data around, by the bus bandwidth, then you don't need a lot of software to help you out on this. You're basically bottlenecked by the bus, and the software isn't going to change that.

Think twice before putting in inherently sequential problems that simply can't be parallelized. OpenCL is a data parallel programming environment. If you have no parallelism, it's not going to help you much. And similarly, you could have a highly parallel algorithm, but it might be constantly synchronizing back and forth and tripping over itself. So those things might not work so well either. And finally, the OpenCL kernels right now don't have object-oriented code in them, but you can of course call OpenCL from object-oriented code.

So here's examples of some other high-performance technologies you might use. There's Accelerate Framework, which is Apple's proprietary framework consisting of APIs for common computational algorithms like convolutions or FFTs or linear algebra, big number computation, that kind of thing. And they're all hand-tuned by engineers who sit there and stare at one piece of code all day and eventually check it in. And it's really great if it does the thing that you need to do. And I used to work in that group up until recently.

And that was always sort of my frustration, is like I would have to sort of predict ahead of time what everybody needed to do. And sometimes I get it right, sometimes I get it wrong. And quite often, there's something in your application which is its heart and soul. It's a specialized computation that you need to do that makes your application work.

And if we didn't know what it is, that's probably not going to be there. So Accelerate Framework is something you could plug in. It'll help in a few places, but often you need to go ahead and write your own code. And that's really where OpenCL comes in, because we let you write your own kernel and have it executed in a highly parallel environment, hopefully with good performance.

There's also OpenGL. If you have a purely graphical task, it's an excellent thing. Previously, in the GPGPU era, people might have used OpenGL shaders to do general computation, but you've got to couch your algorithm in terms of pixels and vertices and whatever else, and so it kind of gets in the way. So that's great for graphics, but OpenCL is really designed to open up the wide space of algorithms that you might want to write and give you the language and ability to write those.

So there's also the question of, should I be using the GPU? The GPU has great bandwidth to its own internal memory, and even better bandwidth to the local memory as described. But you do have to get the data over there right now with current architectures. And it's sitting on the other side of a PCIe bus. The throughput over the PCIe bus is about three to four gigabytes a second. And if you're thinking in CPU terms, well, that's maybe around one byte for every CPU cycle (at a clock rate of roughly 3 GHz, 3 to 4 billion bytes per second over 3 billion cycles per second is on the order of a byte per cycle).

So what could I have done on the CPU during that time? Well, I could do four single precision multiplies in the vector unit. I could do four single precision adds. I could copy four floats somewhere out of the register. I could have done four loads as a vector load. I could have done four stores as a vector store. And in fact, I could have done all of those things in that one cycle concurrently. So you could get quite a lot done in that CPU cycle.

Usually when you're writing code, you aren't operating on one byte. I might want to multiply two floats together, and that's eight bytes of data. And during that time, I could have done this much computation. And of course, we don't have just one CPU. We might have eight CPUs. So I could have done that much work.

And of course, I might have to get the data back. So that might mean copying eight bytes back. So I could have gotten that much work done on the CPU. So with the GPUs, you really want to move your data over there and keep it there as long as you can.

You might have to copy it back in order to do something with it, but you really want to just get it there and take advantage of the bandwidth on the device and use that as best you can. And if that process is too expensive for your overall computation, then maybe you're better off just staying on the CPU.

Don't bounce data back and forth. And so I'm going to move on to the simulation. This is the same simulation, or a similar one, as the one you've seen before. We have a system of n particles. They all have a gravity interaction between them all, and we calculate them all.

A little bit of a disclosure here. We're using a substandard algorithm. We're really actually calculating the direct interaction between every particle and every other particle, so there's an n squared cost here. There are algorithms published out there which are significantly cheaper. So in our context, we're doing this because it's great for demos, but it's not going to be good for your users. So if you really want to sell a product, make sure you're using the right algorithm.

If you're a scientist, obviously you want to make sure you get in the right journal. You don't want to stand up here in front of a thousand people and then have a second year grad student say, I used a different algorithm, and I can get better performance with my scientific calculator. So -- It's always disheartening when you hear that after months of work.

Make sure you have the right algorithm first. This is not just OpenCL. This is any bit of optimization. So you don't want to pour your heart and soul into something and realize that you're just completely on the wrong track. So I'd like to switch over to the demo.

[Transcript missing]

So we can run that. And I haven't really done any tweaking on the graphics on this, so it's just going to show the particles moving around. And we're getting somewhere around 2 gigaflops. But as you notice, this is probably not something the end user is going to want, because this is barely moving. And it's flipping the frames every second or two. So well, OK, what do I need to do? Well, let's haul out Shark. And I can quickly take a sample on it and see what my application is doing.

Am I spending too much time drawing? Am I spending too much time doing something else? No, indeed. My update particles, which is this function I just showed you that iterates over the particles, is actually consuming 98.6% of the time. And so obviously, I need to focus my efforts there.

Well, I've heard about OpenCL. And those guys from Apple, they told me that, oh, I could use that to run on multiple CPUs. Obviously, I'm not using the full potential of my device; I've only got one CPU busy with this. So let's try that. I'll put this away and switch over to the OpenCL project.

So what have I done here? Well, I have the same create particle set that initializes my positions. And at the end, instead of just putting it in a regular array, I've created a CL array using that data. And this is a device that just tells CL, okay, here's my data. And CL needs this because it needs to be able to copy your data over onto the compute device, whatever that may be.

I've also taken my original update kernel, update particles function, and rewritten it as a kernel. Here I've got the kernel written as a series of C strings. You can also put it off in a separate file if you like. I have it this way because I wanted to have a C preprocessor thing embedded in there for my own use. And you can see that these look pretty similar with one interesting thing.

You'll notice the inner loop actually is quite a bit smaller in CL, and that's because CL has native vector types and native vector arithmetic. So whereas before, I had to write out in scalar code the x component, the y component, and the z component for calculating the distance from one object to another, here I can just do it in one line of vector code.

And down here, again, I've got the velocity component calculated component-wise, whereas I can do it in one line of vector code. So in some ways, this may actually clean up your code and give you a little bit more of an idea of what's going on. So I've added a little bit of vector acceleration for free, pretty much. So we can run this.
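
The demo's kernel source is not reproduced in the transcript, but an inner loop written with OpenCL's vector types looks roughly like this (the name, the softening term, and the integration step are assumptions):

    // N-body update with float4 vector arithmetic; one thread per particle.
    __kernel void update_particles(__global const float4 *pos,
                                   __global float4 *vel,
                                   float dt, float softening, int n)
    {
        int i = get_global_id(0);
        float4 p = pos[i];
        float4 a = (float4)(0.0f);

        for (int j = 0; j < n; j++) {
            float4 d = pos[j] - p;               // one line instead of separate x/y/z scalar code
            d.w = 0.0f;                          // only xyz matter; an uninitialized w is exactly
                                                 // the kind of bug described below
            float r2   = d.x * d.x + d.y * d.y + d.z * d.z + softening;  // softening keeps j == i finite
            float invr = 1.0f / sqrt(r2);
            a += d * (invr * invr * invr);       // accumulate the gravity contribution
        }
        vel[i] += a * dt;                        // component-wise update, also one line
    }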

Stop the other task. Run the new one. Here's my thing. Oops. Oh, dear. There went all my data. Well, what did I do wrong? So just to reassure you, this is a bug, but I found this bug two weeks ago. So I'll just walk through the process I went through to find out what was causing this. So I've added a little printout. I've added a little printf in here to read the output from the array, just to get some hint about what happened.

So I can run the debug version. It should spew coordinates from my stuff. And eventually, if this was working correctly, we would actually see NaNs appear. It looks like it's freezing. So anyway, what we saw was a bunch of NaNs appear in the data. You'll have to bear with me.

So like, well, where were these coming from? So what I can do is when OpenCL compiles your kernel, it's compiling it at runtime. So the usual set of symbols that would normally be expected to accompany your binary in this case aren't there. So there's no symbol to say in GDB, okay, I want you to break on my filter here, or kernel here, because it doesn't know what that is, because there's no symbol in your binary because the function never was in your binary. It's just sitting here as a C string. So the question is how do I stop there? So what I've done is I've taken my kernel and I've instrumented it with some new lines of code. Here I'm just looking at all of my data.

I have one for the position and one for a distance and one for a velocity. And I'm checking to see if any NaNs appeared in the data. And if so, I call this function called breakpoint. Is breakpoint some magical thing in OpenCL? Absolutely not. It's just a little function I wrote right here, and I'm going to put a breakpoint there.
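
One way to realize that technique is a do-nothing host function with a real symbol, plus a NaN scan over each array read back from the device; this is a reconstruction of the idea, not the demo's exact instrumentation:

    #include <math.h>
    #include <stdio.h>

    // A function that does nothing, but gives the debugger a symbol to break on;
    // the OpenCL kernel itself is compiled at run time and has no symbols in the binary.
    void breakpoint(int which) { (void)which; }

    // Scan an array read back from the device; if a NaN shows up, report which array
    // it came from and land in breakpoint() so gdb stops with context.
    static void check_for_nans(const float *data, size_t count, int which, const char *name)
    {
        for (size_t i = 0; i < count; i++) {
            if (isnan(data[i])) {
                printf("NaN in %s at element %zu\n", name, i);
                breakpoint(which);               // `break breakpoint` in gdb stops here
                return;
            }
        }
    }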

So I can run this in the debugger. And once it comes up, I noticed I landed in my breakpoint. One of the things I did when I created this -- the thing is I passed a different number to each breakpoint, so I actually know which one it was. So this time I landed in breakpoint 3, so I can go look and say, oh, breakpoint 3, okay, that's this one. So it looks like one of my input velocities had a NaN in it.

Well, how did that happen? Well, I have a hint here. I also put in a breakpoint for the output velocity, so it seems like if my kernel ran and produced a NaN, and then I read it in later, I would have caught this earlier in a different breakpoint. So apparently my input velocities are busted.

In the very first frame, well, how do I manage that? Well, we can go up to my create particles thing, and I realize, well, okay, these are actually a vector full of XYZW. I forgot to initialize the W, so I have just random data in there. So let me fix this. And that's where the NaN came from, or so I would assert. And I'll fix that up, and we'll rerun it again. And there we go. Now the simulation is running.

So -- and you can see we're using all the CPUs up here, and I'm running much faster than 2 gigaflops, and now somewhat closer to 10. So that's great. But we think, well, gee, wait a minute, I am running on 8 CPUs here, but only going maybe 5 times faster. Well, what happened there? Well, we can take a look in Shark once again.

And so here it is -- because we have no symbol information for the kernel, it's just showing up as unknown library. But this is, actually, my update particles thing. And then we have a call to square root. Wait a minute. Was that in my last one? My original code? No, there's no call to square root. Well, my kernel does do a square root.

Maybe that came from there. We can click down and see, oh, there's a square root here. So how did that happen? Well, we can go and look at what the GCC was doing in my original version. Yeah, it's got the square root instruction. For some reason, OpenCL decided to call the square root function. Well, that's not really entirely optimal. You'll have to believe me when I told you I've already filed this bug.

In this particular case, since we're just writing a demo, we have to kind of ask ourselves, well, did I really need a square root in the first place? And it turns out that in addition to providing a full math library, OpenCL also has a full half precision math library which it does all its work in single precision, but it doesn't guarantee results for anything better than what you would need for half precision, which is a 16 bit floating point type.

And since this is just a demo, we can go in here and we can change this to use that function. And the advantage of throwing away some precision is that it can run much faster. So once again, if you're going to go present in a scientific conference, I don't recommend doing this, but since we're just at WWDC.

Just go for it. So we'll rerun it. And I have a little problem with things getting stuck. Let's see if I can get it to run. OK, well, there it goes. So we're getting about 22 gigaflops, which is a lot better than 10. Performing much better. So we've gone through multi-threading through the CPU and fixing one little thing with the divide. We've gone up a factor of 11 in performance. Well, we're not entirely done here. Let me run this again. And then I can sample it in Shark. Let's jiggle this so it works.
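
In OpenCL C terms, the half-precision change described above amounts to swapping the full-precision operation for one of the half_ math functions; continuing the hypothetical inner loop sketched earlier:

    // Trade precision for speed: half_rsqrt promises only half-precision accuracy,
    // plenty for a visual demo but not for publishable numerics.
    float r2   = d.x * d.x + d.y * d.y + d.z * d.z + softening;
    float invr = half_rsqrt(r2);                 // was: 1.0f / sqrt(r2)
    a += d * (invr * invr * invr);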

If we go actually look at what the kernel was doing, we notice that there's actually quite a bit of scalar code in there. And that's because I haven't really fully vectorized my stuff. I have vector code here and vector code here, but this is all more or less scalar stuff. And we can see that if you like looking at assembly, we can in here, and we see all of these scalar instructions right in the middle of the loop.

So what can we do about that? So can I switch back to slides for a moment? So the current data layout, as I explained earlier, just has packed x, y, z, w in it. And for this calculation, as we've already learned, w doesn't really do anything. And in fact, it causes trouble if there happens to be a NaN in that value. So this is sort of bad in a number of respects. It wastes space and touches more data than I need to. And it forces us to write scalar code in order to do much of the work.

And that's because, as part of my calculation, I need the z component to talk to the x component in vector code. They're in the wrong place in the vector. So you have to move the z over by the x and then do the operation. And that sort of, in fact, has the effect of turning what should be vector code into something which is executing more like scalar code.

Except it's worse than that, because I have the extra operations in there required to move the z over to where the x is. So instead, what I'm going to do is reorganize my data. So I'm walking through it in a little more straight line fashion for what the CPU would like. I've thrown out the w component, and I put the x, y, and z in their own array.

So now when I load a vector of four things, it has four x's in it. So I can do that for four different particles. And that uses 25% less space. All the operations are vertical now, and I have no more scalar code. So I can show that code.

And so there it is running. So that's a pretty good speedup, from 22 to 84 gigaflops, for vector code. And we can actually go look at what code I had to write to get that, and it's pretty similar code. Here it is. It's the same thing over and over again.

The big difference is that X, Y, and Z are now float 4s. And so this is the vector code, and you'll notice there isn't a vec underbar or an mm underbar anywhere in this. It's all vectorized in OpenCL. And so it's a lot cleaner and hopefully more portable this way. So you can do it like that.
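
A hedged sketch of what that reorganized, fully vectorized kernel could look like, with x, y, and z in separate arrays so one float4 load carries the same component for four particles (all names are assumptions):

    // Structure-of-arrays layout: each thread advances four particles at once.
    __kernel void update_particles_soa(__global const float *x,
                                       __global const float *y,
                                       __global const float *z,
                                       __global float4 *vx,       // velocities, four particles per element
                                       __global float4 *vy,
                                       __global float4 *vz,
                                       float dt, float softening, int n)
    {
        int i = get_global_id(0);
        float4 px = vload4(i, x);                // four consecutive x's in a single vector load
        float4 py = vload4(i, y);
        float4 pz = vload4(i, z);
        float4 ax = (float4)(0.0f), ay = (float4)(0.0f), az = (float4)(0.0f);

        for (int j = 0; j < n; j++) {
            // Broadcast particle j against the four particles held by this thread.
            float4 dx = x[j] - px;
            float4 dy = y[j] - py;
            float4 dz = z[j] - pz;
            float4 r2   = dx * dx + dy * dy + dz * dz + softening;
            float4 invr = half_rsqrt(r2);
            float4 s    = invr * invr * invr;
            ax += dx * s;  ay += dy * s;  az += dz * s;
        }
        vx[i] += ax * dt;  vy[i] += ay * dt;  vz[i] += az * dt;
    }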

So the next question is, well, great, I've done that. And now I know that some of my users will be using GPUs. Not all of them, of course, but some of them will. So I want to make sure this works OK on the GPU. So I can change one character here to GPU. And we'll rebuild it. And we'll run it. There we go. Starting to move up. And you'll notice that we aren't really getting the hundreds of gigaflops that we were hoping for.

Optimization has its sort of ups and downs, and we have to accept what we get. And the reason is that I really haven't done anything yet to optimize for the GPU. And it turns out that all this stuff that Ian Buck was talking about is very important. I actually tried quite a few things; I wrote several kernels here and several down there. So I'll cut to the slides and show you what actually worked for me.

So the first thing I tried was to look at how many threads I was using. In my CPU thing, I defined 512 threads just because that was the number I picked. And it turned out that wasn't enough. And if I actually define thousands and thousands, or in this case, 16,000 threads, it ran twice as fast. So that was kind of educational.

The other thing is that because I'm running over the particle list over and over and over again in every thread, it turns out that for every particle I do, I end up touching all the other ones, which means 16,000 loads for every particle I calculate. And so this is a case where copying data to local memory is very important. And I got a six-fold speed increase on top of the two-fold I already got by doing that.

Now, it's not true that in every algorithm, local is going to help you. There are going to be some things where you really actually end up only touching the data once. So copying from global to local and reading it there isn't really going to help you. It's not really going to be much better than directly accessing the global.

But you can define your problem set to use the image type, which is cached. And you can take advantage of the speed improvement there, which I did, but I don't have enough time to show you. I also did a few other things. I unrolled the inner loop four times. This is the same sort of trick you might use on a CPU.

It cuts down the loop overhead and gives the compiler better scheduling opportunities. And then I did a little more optimization where I said, well, I have a loop here to copy from global to local, but this is doing some strange things, because some of the threads are looping through this twice and some of them are looping through it three times.

And that means that there's some divergence in what the processors are doing. So I carefully controlled the number of thread groups I had according to the local memory size that I had allocated so that all the threads iterate through the loop the same number of times. And that gave me another 40 percent over the top of that.
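
Putting those pieces together, the classic shape of the local-memory version of such a kernel looks something like the sketch below; it assumes the particle count is a multiple of the group size, and it is a reconstruction, not the demo's source:

    // Each thread group cooperatively stages a tile of positions into local memory,
    // and every thread loops over the tile the same number of times.
    __kernel void update_particles_tiled(__global const float4 *pos,
                                         __global float4 *vel,
                                         __local  float4 *tile,     // sized to the group by the host
                                         float dt, float softening, int n)
    {
        int gid   = get_global_id(0);
        int lid   = get_local_id(0);
        int lsize = get_local_size(0);
        float4 p  = pos[gid];
        float4 a  = (float4)(0.0f);

        for (int base = 0; base < n; base += lsize) {
            tile[lid] = pos[base + lid];          // cooperative, coalesced load of one tile
            barrier(CLK_LOCAL_MEM_FENCE);
            for (int j = 0; j < lsize; j++) {     // every thread spins on the fast on-die copy
                float4 d = tile[j] - p;
                d.w = 0.0f;
                float r2   = d.x * d.x + d.y * d.y + d.z * d.z + softening;
                float invr = half_rsqrt(r2);
                a += d * (invr * invr * invr);
            }
            barrier(CLK_LOCAL_MEM_FENCE);         // don't overwrite the tile while others still read it
        }
        vel[gid] += a * dt;
    }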

So the overall improvement from my rather disappointing result at first is 22x. So I'll cut back to the slides, and we can see that. So there we have the demo running on the GPU. For comparison, I've got the original ticking along. It is moving, I promise. So there you go. So can we cut back to slides, please? So I'd like to invite Jeff Stollup to take any questions and to talk about downloading materials for working on these things.

Thanks very much, Ian. So, again, Alan Schaffer, aschaffer at apple.com, is a good point of contact for questions you may have in general about technologies, and he can route those things to us. The OpenCL developer resources, as I said, are on the attendee website for this session and the previous session.

Definitely a must-download for you guys who want to use OpenCL. It has the specification, has release notes, has an example project, and also has an update for OpenCL on your seed disk. And finally, we have a lab in Mac Lab B this afternoon at 2:00 p.m., and then the OpenGL lab should be in the graphics and media lab at the same time, 2:00 p.m., if you have OpenGL-related questions.