
WWDC09 • Session 307

Introduction to OpenCL

Mac • 56:24

OpenCL (Open Computing Language) lets a Mac OS X application tap into the vast computing power of the modern graphics processor (GPU). Discover the OpenCL architecture, its approachable C-based language, and powerful API for managing parallel computations. Attend this session to see how OpenCL can dramatically improve the speed and responsiveness of your application.

Speakers: David Black-Schaffer, Mon Ping Wang

Unlisted on Apple Developer site

Downloads from Apple

SD Video (129.6 MB)

Transcript

This transcript has potential transcription errors. We are working on an improved version.

Good morning, everyone, welcome to the Introduction to OpenCL session. So today's session is going to be divided in two parts. The first part is going to focus on the compute model and the framework for using OpenCL, and then I'm going to hand it over to Mon Ping who's going to talk about the OpenCL kernel language, which is the language used to actually write your OpenCL kernels. So let's go ahead and get started. OpenCL compute model and framework.

My name is David Black-Schaffer, I'm one of the engineers in the OpenCL team, and so welcome to OpenCL. This is a really cool technology. We here at Apple are really excited about this, and we're mostly excited to see what you guys out there are going to do with the level of compute power that OpenCL gives you.

So to give you a feeling for this, with OpenCL you're going to be able to leverage the CPUs and GPUs in your system in order to write parallel code which is going to give you dramatic speedups on computationally intensive algorithms. So if your algorithm is not computationally intensive, you may have a hard time taking advantage of OpenCL.

But if you've got a lot of compute to do, OpenCL is going to give you a huge win. OpenCL also has another major benefit, which is it lets you write portable accelerated code. And you may not think about this, but what it means is that you don't have to write specialized code for each GPU and each variant of CPU. So no more SSE4, SSE3 type code, just use OpenCL and you can run across all the devices in an optimized accelerated manner.

So what are you going to learn in the presentation today? Well we're going to start out with what is OpenCL and what is it good for? Going to take a look at an algorithm that matches well to OpenCL and point out the parts of the algorithm that allow OpenCL to really work well with it.

Now we're going to go on and talk about how OpenCL exposes the hardware, what is the computation model in OpenCL, what is the memory model in OpenCL. And these are important to know so you can understand how to take advantage of OpenCL in the best way for your algorithm.

And then we're going to go through the nuts and bolts of how you actually use the framework. We're going to look at sample code for setting up OpenCL, doing resource allocation, executing your kernel, and then look at some issues with synchronization in particular when you have multiple devices in your system.

So let's get started here. What is OpenCL? Well fundamentally it's a framework that gives you access to all the compute resources in your application. So if you've got CPUs and GPUs you can use them all together with OpenCL. It's a low level framework for high performance. And this implies an explicit tradeoff, meaning that you may have to do more work, but because you have to do more work, you can really get the full performance out of the system.

So for example, in OpenCL you're responsible for managing your memory. But what that means is you can do exactly the memory movement and copying that's required for your algorithm so you get the best possible performance. OpenCL targets parallel computationally intensive algorithms. Again, if you're not doing a lot of work, OpenCL is not going to give you a huge benefit.

Now you'll note here that I put data in parentheses here. OpenCL is not limited to data parallel computation, but today's GPUs are really optimized for data parallel computation. So if you want to get the best bang out of your GPU, you need to be looking at a data parallel algorithm.

So with that said, there's a lot of things you can do with OpenCL. You can accelerate everything from physics simulations to image processing, medical imaging, financial modeling, lots and lots of stuff can get a big speed up from OpenCL. And what we're really excited about is what you're going to bring to the table here.

What algorithms are there out there that we haven't looked at or that haven't been seen before that when we give you OpenCL you're going to say hey, I can really accelerate this new thing and get a real boost. So I mentioned that OpenCL is particularly good on today's hardware for data parallel computing. Let's talk about what data parallel is.

So when I say data parallel computing, I mean two primary things: you're doing similar computations on independent data. What that means is that each of the elements you're processing follows the same control flow, doing the same sort of computations, and they're all independent; that is, the results for element 37 don't depend on the results for element 1, so you can do them in any order, in parallel. So let's take a look at an example of this, something that will map well to OpenCL where you can get a real speed up.

And the example I'm going to give you is averaging an image, this is a box filter. We're just going to take a pixel in an image and we're going to average the pixels around it, so we're going to do sort of a crude blur. So this is the same computation for each output. Each pixel is going to average the elements around it. And all the results are independent.

What I do on the first pixel doesn't impact what I do on the 15th pixel. So we take a look at this, we have an example image here, and I'm going to process a region of this image. We're going to look at the pixels in this region, add them up, and average them.

And indeed I'm going to go through and do this for all the pixels in the image, and at the end I'm going to get a result which is an average or blurred output of this image. So this is a great example of something that maps very well to today's GPUs. It's the same sort of computation for all the pixels doing this averaging, and they're all independent.

Indeed no matter what pixel I look at here, it doesn't matter if I processed other pixels before it or after it. And it's this characteristic that allows today's GPUs to really run lots of this stuff in parallel because you don't have to worry about synchronizing between them. So if we take a look at the actual computation kernel that's used for this example, here's the code for doing the averaging. And if you look at this, there are three parts to the kernel.

The first part is the kernel determines what pixel am I working on. So by using this image here, I've told OpenCL I'm going to parallelize this across the whole image. All of my pixels are independent, you can process them all in parallel. So the first thing the kernel needs to do is say okay, which one of those pixels am I processing? Then we simply go through and do a set of for loops, this is a really simple implementation here, to average out the pixels around it, and then I write out the result. So this is the kernel that OpenCL is going to execute, and it's going to be executed once for every pixel in that image. Again, this is how I'm telling OpenCL what sort of parallelism I have in my problem.
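
For reference, here is a minimal sketch of the kind of averaging kernel being described; the kernel name, argument list, and 3 x 3 neighborhood are illustrative, not the session's exact code.

    __kernel void average_images(__read_only image2d_t input,
                                 __write_only image2d_t output)
    {
        const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP_TO_EDGE |
                              CLK_FILTER_NEAREST;
        int x = get_global_id(0);   /* which pixel am I working on? */
        int y = get_global_id(1);

        float4 sum = (float4)(0.0f);
        for (int j = -1; j <= 1; j++)          /* simple 3 x 3 box filter */
            for (int i = -1; i <= 1; i++)
                sum += read_imagef(input, smp, (int2)(x + i, y + j));

        write_imagef(output, (int2)(x, y), sum / 9.0f);
    }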

I'm running it over the whole image, so I'm going to get one kernel execution for each pixel. So this gives you an idea of how you express parallelism in OpenCL. So for this example here, it's data parallel across two dimensions, thinking about an X and Y on a pixel basis. I'm going to execute that compute kernel for every pixel in there, so there are going to be a lot of executions. So for example, if I'm processing a 1,024 x 1,024 image, my global problem dimensions are 1,024 x 1,024. That's the total work in the problem.

And I'm having one kernel execution per pixel, which means I'm going to have a million total kernel executions. So I've told OpenCL here that you've got a million things you can do in parallel. OpenCL is then going to go out on the hardware device and try and run as many of them as it can at a time to get you really good throughput.

So now that I've given you an example of the sort of algorithm that maps well to OpenCL, and talked a little bit about how OpenCL sees the parallelism, let's talk more specifically about the OpenCL computation model. So in OpenCL, as I've alluded to, kernels are executed across a global domain. Each of the kernel executions is called a work item. So in the example I gave you before, the global dimensions were 1,024 x 1,024, and that defines the range of computation.

This is how you tell OpenCL here's the degree of parallelism in my program. There's one work item for each computation. So in the example 1,024 x 1,024 there are a million work items that OpenCL can execute in parallel. Now in addition to this global domain, you can split up your work into local workgroups, and workgroups are defined by their local dimensions.

So for example, if I had the global domain of 1,024 by 1,024 I might decide my local domain is 128 x 128, or 16 x 16. And these define the size of workgroups. Workgroups are important because these are the work items that are executed together on one processing element. So if you have a GPU with 100 cores in it, these are the work items that are going to run on one core.

And why that's important is it allows them to share local memory and synchronization. So within a workgroup because these items are being executed together, you can share local memory and those items can synchronize. So there are a few caveats to this computation model and these are important to keep in mind when you're looking at moving your algorithm to OpenCL.

So the first caveat is that global work items must be independent; there's no global synchronization. If you have an algorithm where work item 3 million needs to synchronize with work item 74, you're going to have to look at how to restructure your algorithm to fit into OpenCL.

But what you can do is you can do synchronization within a workgroup. So if you need synchronization, you need to figure out how to fit that into the smaller workgroup, not across the entire global domain. So the three things you need to take away from here for the OpenCL compute model: you have global dimensions, and these define the range of your computation. This is how you tell OpenCL here are the things that you can execute in parallel.

You have local dimensions, and these define the size of your workgroup; a workgroup is the set of work items that get executed together. And there's no global synchronization. If you need to do synchronization in your algorithm, you need to do it within the workgroup. So let's take a look at an example here. Here's the same thing I had before, my global dimensions are 1k x 1k, and I'm going to run with local dimensions of 128 x 128.

So what that means is that this chunk of work items are going to be executed together on one processor on the OpenCL device, and this chunk will be executed on another. And if you think about it, this will divide up your entire global domain into chunks that get executed together.

So what this means is if you have synchronization, say I need to synchronize these two global work items, they're in the same workgroup so this is okay. I can go ahead and I can use barriers within my kernel and I can use memory fences within my kernel to synchronize the execution of those two work items.

However, if I have two work items that are in different workgroups, so here are two global work items that are far apart and they're not in the same workgroup, I cannot synchronize these. The functions for synchronization within a kernel only work within a workgroup. So now let's talk about problem dimensions. As I gave you an example before of 1k x 1k for an image, there are a lot of other problem dimensions you might run into. Say you have an array with a million elements in it, maybe it's financial data over time.

Maybe I wanna just use global dimensions of a million x 1 x 1, or maybe I'm processing a high-definition video frame, 1920 x 1200, 2.3 million pixels. Well if I want to process one pixel at a time, do per pixel processing, if I set my global dimensions to 1920 x 1200, OpenCL will process all the pixels in parallel with one work item per pixel.

Similarly, if I have a volume, say 256 x 256 x 256 voxels, 16.7 million voxels, and I want to process each voxel, I can set my global dimensions to 256 x 256 x 256, and I'm telling OpenCL you have 16.7 million work items that you can run in parallel in the most efficient manner you can.

So this brings up the question of how you choose the right dimensions for your algorithm. Well there are really two things you want to keep in mind here. You want to pick the ones that map best to your application, so you want to pick something that makes sense for the parallelism in your application. If you're processing images, it's very likely you'll want a two-dimensional set that's the size of the image.

But maybe you have an algorithm that processes pixels in 8 x 8 chunks, so maybe you don't want to have something which is mapping 1 to 1 with pixels. You also want to choose something that performs well. The way GPUs today get high throughput is they have lots of work items which allow some work items to process while others are waiting for memory. So you want to make sure you have thousands and thousands of work items running at any given time, so you want to make sure you have enough work items.

But by the same token, you want to make sure you do enough work in each work item that you amortize the cost of that work item. So you don't want to do a single add in a work item, but at the same time you want to make sure you have lots of them.

So this is a balancing act for performance. So now we've talked a bit about the OpenCL compute model, we've talked about how OpenCL expresses parallelism and the limitation of synchronization only within workgroups. Let's take a look at the OpenCL memory model. So in OpenCL you have the host device, which is the machine you're running on, and you have the compute device, which is often a GPU or it may actually be the CPUs in that machine. The compute device executes your work items, and they're executed in workgroups.

So in terms of memory, what you have here is you have host memory, this is the memory on the CPU, this is the 2 gigabytes or 8 gigabytes or whatever your machine actually has in it. And you can get this memory and move it over to the global memory on the device. So if you have a GPU with 512 megabytes of memory, that's your global memory. You've got 512 megabytes on the GPU.

Now I want to really emphasize here, this is not synchronized. This is not a cache, data does not move automatically. You're responsible for taking your memory from the host and moving it over to the device when you want to use it. So from this global memory on the device, you can access it from the work items that you're actually executing.

Now in addition, workgroups on the device will have local memory, and local memory is shared within a workgroup. Now this memory is much, much smaller. On today's GPUs, we're talking on the order of 16 kilobytes of memory. However, this memory is phenomenally faster on devices that support it.

So if your GPU has local memory available, you can get 10 or 100 times the bandwidth using local memory that you can over global memory. So one of the biggest optimizations you can make on those devices is to really use local memory. In addition, for work items we have private memory, and private memory is per work item.

So if a particular work item needs to store some data that it's working on, it can use private memory to do that, but it's per work item; it isn't shared across the workgroup. So the important takeaway message here is that memory management is explicit. It's up to you to move your data through this memory hierarchy.

You're going to have to take your data from the host, move it into the global memory, move it into local memory, do your work on it, and then move it back. And the tradeoff for this is you can control exactly how data is moved, this allows you to get the best possible performance.

You're not going to have the OS or some memory manager guessing what you want to do; you do exactly what you need and just what you need. So now that we've talked a little bit about how OpenCL works, how it thinks about the world in terms of parallelizing computation and what memory model it exposes, let's take a look at actually using OpenCL. So the first thing you're going to want to do when you use OpenCL is set it up and start allocating resources. So let's take a look at this. So in OpenCL there are a bunch of objects you need to be aware of.

The basic ones for doing setup are devices, these are GPUs or CPUs; context, and a context is a collection of devices which tells OpenCL that they can share memory and queues. Queues are used to submit work to a device. So in the example I have here, we have a GPU and a CPU in a context, and we have a queue for submitting work to each of those devices. In addition for memory, we have buffers and images. Buffers are simply blocks of memory, you can use them however you want, you can access them however you want.

Images are two-dimensional or three-dimensional formatted structures. For executing things we have programs, and a program is a collection of kernels. And we have kernels. Kernels are basically containers that keep track of arguments for executing a kernel. In addition, if you need to do synchronization or profiling, you're going to use events.

So here are the basic objects that you're going to need to use in OpenCL, and let's take a look at the actual details of how you get them and manipulate them. So for setting up OpenCL there are basically three steps. The first step is you need to get the devices you want to work on. The second step is you create a context which allows those devices to share memory objects. And then the third one is simply to create command queues so you can submit work to all those devices. So the code for doing this is fairly straightforward.

When you want to get your devices, you call clGetDeviceIDs and you specify the device type. So here I'm saying give me a GPU, and OpenCL will return a GPU, if there's a GPU in your system that is. Then you call clGetDeviceIDs again, you want a CPU, and you'll get a CPU device. Now once I've got my devices, if I want to use them together I need to create a context that tells OpenCL that I want to share memory between these devices.

So to do that I call clCreateContext and I specify the number of devices, and I specify which devices I want to be in that context, and OpenCL will go ahead and create a context. This tells OpenCL that when I use this context, I want you to make sure that memory objects in the context can be shared between the devices.

So if you have a CPU and a GPU, you want to make sure they're in the same context so you can share data between them. And finally, I want to submit work, so I need to create command queues to submit work to the devices. So I call clCreateCommandQueue, I specify the context and the device I want to submit to, and I get a queue which allows me to submit work to that device.

Similarly I do the same thing for the CPU. So after calling the setup code, I've got two command queues that I can use to submit work to the CPU and the GPU, and I've got a context which allows me to create memory objects which are going to be shared between those two devices.
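
As a rough sketch, the setup just described might look like this in host code (error handling omitted; on Mac OS X the header is OpenCL/opencl.h and passing NULL for the platform is accepted):

    #include <OpenCL/opencl.h>

    cl_device_id gpu, cpu;
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &gpu, NULL);
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &cpu, NULL);

    /* one context, so the two devices can share memory objects */
    cl_device_id devices[2] = { gpu, cpu };
    cl_context context = clCreateContext(NULL, 2, devices, NULL, NULL, NULL);

    /* one command queue per device for submitting work */
    cl_command_queue gpu_queue = clCreateCommandQueue(context, gpu, 0, NULL);
    cl_command_queue cpu_queue = clCreateCommandQueue(context, cpu, 0, NULL);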

So a few notes about setup. Devices. All of the CPU cores in your system show up as one device, so in most machines you're going to fire up OpenCL and you're going to have a GPU and a CPU device. Now this makes a lot of sense. Remember OpenCL is going to execute your kernel in a data parallel manner, that is it's going to run your kernel across all the processors that are available.

So if you are on an 8-core CPU, it's going to try to run your kernel across all 8 cores at the same time. And indeed this is just what a GPU does; your GPU may have 16 or 100 cores, and whatever kernel you run is going to run across all of those cores at once.

So as I've mentioned, contexts are for sharing memory. You need to make sure if you want to share memory between two devices that you have both of those devices in your context. And queues are for submitting work, and you need a queue for each device to which you want to submit work.

So if you have a CPU and a GPU, you're going to have two queues, and it's up to your application to figure out how to submit work to one or the other and break up your work. So how do you choose a device? I mean, a system may have several devices. If you start up a MacBook Pro today, you're going to have the discrete GPU, the integrated GPU, and the CPU.

And your application needs to choose the best device for running on. So the way you do this is you're going to have to look through the devices and choose the one that is best for your algorithm. So the device that's best for you is not going to be the same for everyone.

And you need to know your algorithm and you need to know the devices and figure out the best one to run on. But OpenCL provides a rich way of querying devices to determine what capabilities they have and what raw performance they may have. So you can ask for the number of compute units. On a MacBook Pro the CPU might have two compute units, it's got two cores.

The GPU may have multiple cores. You can look at the maximum clock frequency for the device, or you can look at how much memory they have. In addition, if you have an algorithm which requires special features, say double precision arithmetic, you probably want to check and make sure the device you're using supports it. So you need to pick the best device for your algorithm. And this is going to involve enumerating the devices that are available, looking through them for the characteristics you care about, and choosing the device you want to run on.
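
For example, a sketch of that kind of query using clGetDeviceInfo (the gpu variable follows the earlier setup sketch; checking the extensions string is one way to look for double precision support):

    cl_uint compute_units;
    clGetDeviceInfo(gpu, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);

    cl_uint clock_mhz;
    clGetDeviceInfo(gpu, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                    sizeof(clock_mhz), &clock_mhz, NULL);

    cl_ulong global_mem_bytes;
    clGetDeviceInfo(gpu, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem_bytes), &global_mem_bytes, NULL);

    char extensions[2048];   /* search this string for e.g. "cl_khr_fp64" */
    clGetDeviceInfo(gpu, CL_DEVICE_EXTENSIONS,
                    sizeof(extensions), extensions, NULL);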

So now that we've talked about setting up OpenCL, how you choose devices, how you set up a context, and how you create command queues, let's talk about allocating memory resources. As I mentioned before, there are two types of memory resources: buffers and images. So buffers are simply chunks of memory; you're telling OpenCL I want a chunk of memory that I can do whatever I want with. Within the kernel you can access it through pointers, as structs, as arrays, basically however you want, it's just a chunk of memory.

And within a kernel you can read and write to the buffer, so you can read data from a buffer and write data back within the same kernel. Images are different, they're opaque 2D or 3D formatted data structures. You can't access them as pointers, the only way you can access them is through read image and write image. And they have one particular limitation that you need to be aware of.

With any given kernel, you can read an image or write it, but not both. So if you have an image you can't use it as both the source and destination for a kernel. So let's talk a little bit about what I mean by a formatted opaque data structure for images. So images have formats and samplers.

When you go to create an image, you're going to specify a format, and the format consists of the channel order, do you want to have an RGBA image, or maybe you just want an alpha channel, or maybe you just care about RGB but not the alpha channel. And the type of data, do you want an 8 bit image or a floating point image? And you can find out which image formats are supported by your device by calling clGetSupportedImageFormats. In addition, when you go to read an image, you specify a sampler for reading the image, and the sampler allows you to control how the image is accessed when you read it. So for example, you can specify you want linear or nearest neighbor filtering.

What this means is if you read between pixels with linear filtering, OpenCL is going to average the pixels for you. And you can specify what happens when you read off the side of an image. So for addressing you can clamp at zero, so if you read at a negative coordinate you get zero; you can clamp at the edge value; or you can tell it to repeat the image. And you can specify whether images are accessed using normalized coordinates or not, that's where the X and Y ranges are scaled from zero to one. So the reason for providing this is that on GPUs, you get hardware to do this.
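
In kernel code, declaring and using a sampler with those options looks roughly like this (a sketch; the image, coordinates, and chosen flags are illustrative):

    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP_TO_EDGE |
                          CLK_FILTER_LINEAR;

    /* linear filtering averages neighboring pixels in hardware on the GPU */
    float4 pixel = read_imagef(input, smp, (float2)(x + 0.5f, y + 0.5f));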

So if your algorithm can benefit from using linear sampling, on the GPU you can use linear sampling and get hardware to do that for you. In addition, OpenCL is going to store your images in a format that's efficient both in terms of size and speed for that device. And that's why the formats are opaque, so that we can optimize it for the device. So let's take a look at an example of how you actually go about allocating memory objects.

So first we're going to allocate an image, and to allocate an image we obviously need to specify the format first. So here I'm going to specify the format, I want a floating point pixel type, and I want four channels per pixel. So CL_RGBA and CL_FLOAT. I then call clCreateImage2D.

And when I call clCreateImage2D, note that I'm passing in the context. I'm telling OpenCL I'm creating this image within a context, which lets OpenCL know that it can share that image across all the devices in the context. I then specify the format and the width and height of the image, and OpenCL will create that image for me.

Similarly, if I want to create a buffer, I call clCreateBuffer, I specify the context for the buffer, but instead of specifying a format, since the buffer is just a chunk of memory, I'm just going to specify the size of the data I want. So here I calculated the number of bytes that are in my image and I'm telling OpenCL to create a buffer of that size. So once you've created these memory objects, you want to put data in them, or maybe you want to get data back when you're done with them, and you do that through explicit commands.
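
A sketch of those two allocation calls (OpenCL 1.0 host API; width, height, and the context follow the earlier fragments):

    /* image: CL_RGBA channel order, one float per channel */
    cl_image_format format = { CL_RGBA, CL_FLOAT };
    cl_mem input_image = clCreateImage2D(context, CL_MEM_READ_ONLY, &format,
                                         width, height, 0, NULL, NULL);

    /* buffer: just a chunk of memory of the same total size */
    size_t buffer_size = width * height * 4 * sizeof(float);
    cl_mem input_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                         buffer_size, NULL, NULL);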

So you need to call clEnqueueReadImage or clEnqueueReadBuffer. And these commands are going to specify which memory object you want to access, where you want to access it, and provide a pointer which either supplies the data you're going to write in, or gives you where the data is going to come back.

So it's very important to mention here that these commands can operate synchronously or asynchronously. So you'll notice there's this blocking option in them. When you call clEnqueueRead, you are enqueuing a command. That command is going into the queue for the GPU and at some point later it's going to finish. If you call an enqueue read, the call returns immediately, and you try to access that data, there's no guarantee the read has finished; it may still be sitting in the queue and you're going to have garbage data.

So if you run these commands in a blocking manner, your application will block that call until the read is completed at which point your data is safe to use. So if we take look at how we read back data, we obviously need to allocate some space on the host for storing the results.

If we're reading from an image, we need to specify where in the image we want to read. So in this case we're setting the origin to (0, 0, 0), that is, I want to start in the upper left-hand corner of the image. And we're setting the area that we want to read, the region, to the full image. We then call clEnqueueReadImage, we specify the origin and the region, and we get our data back.

Similarly, if I want to read from a buffer, I'm going to call clEnqueueReadBuffer, and here, since a buffer has no notion of X and Y sizes, I just specify the number of bytes I want to read back, and OpenCL will read back the data. So again, it's important to note that you're enqueuing a command here. If you want your application to block until that command is done, you need to set blocking to be true.
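
A sketch of those blocking reads (CL_TRUE is the blocking flag; the queue and output memory objects are assumed from the earlier fragments):

    float *results = (float *)malloc(buffer_size);   /* host-side destination */

    /* read the whole image: origin (0,0,0), region (width, height, 1) */
    size_t origin[3] = { 0, 0, 0 };
    size_t region[3] = { width, height, 1 };
    clEnqueueReadImage(gpu_queue, output_image, CL_TRUE, origin, region,
                       0, 0, results, 0, NULL, NULL);

    /* or read a whole buffer: just an offset and a byte count */
    clEnqueueReadBuffer(gpu_queue, output_buffer, CL_TRUE, 0, buffer_size,
                        results, 0, NULL, NULL);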

So now we've talked about how do you set up OpenCL, how do you allocate your memory resources, let's talk about actually compiling and executing kernels. So for compiling and executing kernels there are three steps. The first one is you need to create a program, and a program is just taking an input of source code. So you're going to provide it with some string of here's the actual code for my kernel, or you can provide a precompiled binary.

This is analogous to a dynamic library. If you think about a program, it's going to have a bunch of kernels in it, and it may have support functions for those kernels. Once you've created a program, you're going to compile it. You specify which devices you want to compile for: do I want to compile for the CPU and GPU, or maybe I'm just going to run this kernel on the GPU, so I should just compile it for the GPU. You're going to do the standard things you do when you compile.

You're going to pass in compiler flags and you're going to check for problems. So if you're going to have a syntax error in your kernel, you're going to get an error back when you go to compile the program. And finally you create a kernel. And as I said before, a kernel is an object which stores arguments for execution.

So this is how you're going to set the arguments so that you can go execute it. So let's take a look at this. Say I have a file here called kernels.cl. This file has one kernel in it called average_images at the moment, and I want to load this. So I go ahead and load it into my C program and then I call clCreateProgramWithSource, just passing a pointer to that data, and it'll create a program from that string that I loaded from a file.

Now how do I compile and create the kernel? Well, clBuildProgram. I take that program and call build on it, and then to create the kernel, clCreateKernel, and I specify the name of the kernel within the program that I want. Now most people are going to discover the first time they compile their kernel that there's a syntax error in the kernel.

So what you want to do is check the error output from clBuildProgram, and if you've got an error, you can call clGetProgramBuildInfo to return the build log. And this will give you a detailed log which will show you exactly where the compiler encountered a syntax error or an error in the kernel code. So how do you execute kernels? You need to set the kernel arguments and then enqueue the kernel to be executed. So for setting the kernel arguments, call clSetKernelArg.
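
Before moving on to arguments, here is a rough sketch of that create/build/check sequence (assuming the kernel source was already loaded into the string `source`; the devices array follows the earlier setup sketch):

    cl_program program = clCreateProgramWithSource(context, 1,
                                                   (const char **)&source,
                                                   NULL, NULL);

    if (clBuildProgram(program, 2, devices, NULL, NULL, NULL) != CL_SUCCESS) {
        char log[4096];
        clGetProgramBuildInfo(program, gpu, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        printf("build failed:\n%s\n", log);   /* shows the exact syntax error */
    }

    cl_kernel kernel = clCreateKernel(program, "average_images", NULL);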

Specify the kernel you want to set and do this for each argument. To execute the kernel you're going to call clEnqueueNDRangeKernel. And it's here that you specify the global and local dimensions. So remember I mentioned that the global and local dimensions are how you tell OpenCL the parallelism in your kernel, and here's where you actually set that.

So in this case, I'm telling OpenCL to execute this kernel with global dimensions of image width by image height, and I'm passing in null for the local dimensions. I'm telling OpenCL I don't care about the local dimensions, pick something reasonable for me. And OpenCL will pick a local dimension and run your kernel for you. So it's very important to emphasize here that your kernel is executed asynchronously.
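
A condensed sketch of those two steps for the image kernel (the argument indices match the kernel sketched earlier; passing NULL for the local size lets OpenCL choose):

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_image);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &output_image);

    size_t global[2] = { width, height };           /* one work item per pixel */
    clEnqueueNDRangeKernel(gpu_queue, kernel, 2,    /* 2D problem */
                           NULL, global, NULL,      /* NULL local size: OpenCL picks */
                           0, NULL, NULL);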

When you call clEnqueueNDRangeKernel, nothing may happen. You've just enqueued your kernel. If someone else is using the GPU, or if the GPU is busy executing another one of your kernels, it's just going to sit there in the queue until the GPU is ready, so you need to be very much aware of this. If you want to get data back from your kernel after you've enqueued it, you can use a blocking read.

This means you'll enqueue the kernel, you enqueue the read, and your program will block until the read is done. Alternatively you can use events to track the execution status of kernels if you need to do this at a more fine grained level. So let's talk a little about synchronizing between commands.

So in OpenCL each command queue is in order, that means that everything is going to execute in the order you stuck it in there. So if I enqueue a kernel and then I enqueue a read, I'm guaranteed that the kernel is going to finish before the read executes. And this will behave just the way you expect as long as you don't submit to the same command queue from multiple threads.

However, and this is really important, you must explicitly synchronize between queues. So if you have a CPU and a GPU, you've got two command queues for submitting there and there are no implicit guarantees about the ordering between them. You are responsible for using events to make sure that things run in the order you expect.

So if you've noticed in the commands that I've shown you before, all of the clEnqueue commands have three parameters at the end. And the first two parameters specify an event waitlist. So here you can say there are three events in this waitlist, pass in a list of those events, and OpenCL will make sure that that command waits until those three commands are done before it executes. Similarly, you can get an event back from the command, and this is what you use to put in waitlists for other commands.

So let's take a look at an example of where you need to pay attention to this. Say I have two kernels, kernel one and kernel two, I want to run kernel one on the GPU and kernel two on the CPU. So I'm going to have two command queues here, so I don't have any implicit ordering, I need to take care of ordering myself.

Now in this case, the output from kernel one is used as the input to kernel two. So this implies a dependency. I need to make sure that kernel one finishes before kernel two starts or I'm going to get the wrong results. The output from kernel one won't be done and so I'll have incomplete data before I run kernel two. So if we take a look at what's going on here, here I've got a CPU and a GPU and I'm going to run one of the kernels on the CPU and the other on the GPU.

So I can queue kernel one on the GPU, it's sitting there in the command queue. I can queue kernel two on the CPU, it's sitting there in the command queue. Now some time later the GPU is available, you don't know when that is, other applications may be using the GPU, you may be using the GPU. But your kernel one is going to start running. And some time later, kernel two is going to start running.

But again, you don't know when this is, it depends on how busy these devices are. So we have a problem here. Kernel two started running before kernel one was done, and this means that the data that we need from kernel one wasn't done in time. So if we take a look at what we do with events, here we're going to tell kernel two to wait on an event from kernel one. So kernel one executes, and then kernel two is not going to execute until we get that event back.

And when that event comes back, then kernel two can go ahead and execute and we get the correct behavior here. So if you have this scenario where you're submitting things with data dependencies to multiple command queues, you're responsible for taking care of those dependencies using events. So you can also use these events on the host. So if you want to block your program on a bunch of events, you can call clWaitForEvents and pass in an event waitlist. Your program will block until those events are done.
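
A sketch of that dependency expressed with an event (kernel and queue names are illustrative; the second enqueue lists the first kernel's event in its waitlist):

    cl_event kernel1_done;
    clEnqueueNDRangeKernel(gpu_queue, kernel1, 2, NULL, global, NULL,
                           0, NULL, &kernel1_done);

    /* kernel2 will not start until kernel1's event has completed */
    clEnqueueNDRangeKernel(cpu_queue, kernel2, 2, NULL, global, NULL,
                           1, &kernel1_done, NULL);

    /* the host can also block on the same event */
    clWaitForEvents(1, &kernel1_done);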

You can enqueue a marker, which is basically a way of monitoring a particular position in a command queue, and you can get an event back to use for other things. And you can enqueue a wait for events, which basically tells a command queue that everything after this point must wait for these particular events. You can also get information about events.

You can query an event to find out what type of command it was and what its status is. Has this command been queued, submitted, is it running, or is it finished? And if you submit a command to OpenCL that has an error, you can get that error code back through the event.

Events are also used for profiling. You can get information on profiling from events which allows you to time how long your kernels are taking and how long other operations are taking. All right, so now I've talked about setting up OpenCL, let's look at an actual program that uses OpenCL. So here we go.

Here's a simple program and this is basically going to run through the code that I was showing you earlier. So the first thing we're going to do is we're going to set up our CPU and GPU. And as you can see from here, we're going to call clGetDeviceIDs for the GPU and clGetDeviceIDs for the CPU. We're then going to create a context which is going to use these devices, and then we're going to create a command queue for each of the devices. And I apologize that some of this appears to be commented out here.

Always dangerous to change things as you're actually doing the demo. So the next thing I'm going to do after I've created my command queues, is I'm going to read in the kernels from a file. So I've got a file here which has simply got my kernels in it.

And you'll notice this file has two kernels in it. The first kernel is average buffers and the second kernel is average images. So the average images kernel is the kernel I showed you earlier in the demo, it simply gets the X and Y ID for that particular kernel instance, and then does a for loop reading in the image data and writing the output. The buffer kernel version does the same thing but uses buffers.

But because buffers don't have any idea of X and Y coordinates, it's going to have to calculate where in the buffer that pixel should be, and it simply treats this buffer as an array, and then it writes out the output, treating the output buffer as an array as well. So once we've done that we go on and we create our buffers - excuse me, after we've loaded our kernel file here, we call clCreateProgramWithSource, we specify the source we want to load, and then we build the program for all of those devices.

So we're going to call clBuildProgram and we specify the program we want to build. We're then going to set up for running the buffer kernel. So we've got two kernels we want to run here, images and buffers, so the first thing we're going to do is allocate the memory objects for the buffers. And again this is just what I was talking about before.

We call clCreateBuffer, specify the context we're using, specify the size of the buffer we want to create, and we'll create the buffer. Then we want to write our initial data into the buffer, so we're going to call clEnqueueWriteBuffer, specify the buffer we're writing into, specify how much data we want to write into it, and then give it a pointer to the data we want to write into it.

Now because we're calling clEnqueueWriteBuffer with CL_TRUE here, this command is going to block the execution of our program until we've got the data written. We create the output buffer, and we call clCreateKernel to create the kernel, and then we set the arguments for the kernel. So here, for the first one, I'm going to set the input buffer and the output buffer, and I'm going to specify the image width and height for the kernel. And if you look at the kernel code here, that makes sense.

The first argument to the kernel is the input, the second one is the output, and then we have the width and height. So once we've done that, we're going to set up for using images, and for images we first need to specify the format, then call clCreateImage2D. When we call clCreateImage2D we're going to specify that format.

Then we're going to write data into the image. Here we specify the region and the origin, and we call clEnqueueWriteImage, and we're specifying a pointer to the data we want to write in. We create the output image, create the kernel, and then we set the kernel arguments. Now you'll note there are only two kernel arguments here. Well if we look at our actual kernel for doing images, it has only two inputs, input and output. But they're images. Now images are opaque data types, so we don't need to specify the width and height, that's implicit when I call read_imagef.

So once I've set that up, I'm done with that. Now you'll note here's a simple straight C implementation of this filter. The C implementation is basically the same thing, it iterates over the whole image and then it iterates over a region around each pixel. And for each pixel it has to iterate over all four of the components in my RGBA image. So what OpenCL is giving you here is it's automatically parallelizing these outer two loops.

So I'm specifying these as the global domain, the image width and the image height, and so OpenCL is going to parallelize across those. And it's also giving us this inner loop for free. By using the vector type, our image is a float4, I get this vector operation for free.

Now when it comes time to execute the kernel, I'm going to specify which queue I want to run on, this demo can run on a CPU or GPU. If I'm running the image kernel I'm going to call clEnqueueNDRangeKernel with my global dimensions of width and height, and then I'm going to read back the image with a blocking read.

If I'm executing the buffer kernel, I'm going to have the same global dimensions, image width and image height, I'm going to enqueue my buffer kernel, and then I'm going to read the results back from a buffer. So let's go ahead and see if this will run with those slight modifications.

So here we are running, this is the kernel you would expect, it's running OpenCL on the CPU using the buffers and it's just averaged the image. We can run on the GPU, and we can run it with images or run it with buffers. And the point of this is just to show that here's a simple algorithm that I can trivially run in OpenCL.

Now there's one more thing that I want to show you here, which is with regard to errors in OpenCL. So if we go to a particular point in the code, here's the first time we're writing to an image. We're specifying the region as width x height x 1. Now it's a 2D image, so a common programming mistake is to specify zero there.

So if I specify zero, I should expect to get an error back. So if I go ahead and run this, not surprisingly, I get an error back: clEnqueueWriteImage failed, minus 30. So minus 30 is not the most useful thing to know, and if you go and look in cl.h, the error minus 30 is invalid value. Well, great.

So which invalid value? If you look at clEnqueueWriteImage, I'm specifying a queue, an input image, true, an origin, a region, 0, 0, input data, and then some event stuff. So it's not entirely obvious to me which thing I did wrong here. We've provided a very useful feature for you, and that is an environment variable that you can set, CL_LOG_ERRORS. You can specify that it goes to standard out or standard error.

If I enable this and I rerun my program, the error I get now is much more useful, it provides specific information on what happened there: CL_INVALID_VALUE, clEnqueueWriteImage failed, region[2] must be 1 for 2D images. Easy to find. So we really recommend that when you're debugging, you enable this to get much more detailed information about your API errors. So now a few hints on performance and debugging. And so this is sort of the high value part here, this is where you're going to benefit from our experience. So, performance.

The big killer for you in performance is overhead, and there are a bunch of places that overhead can come from, the big one is compiling programs. Remember, when you go out there to build your program, you're executing a full compiler to do that compilation. So you want to compile your program once early on in your program and use it lots of times. Moving data to and from the GPU is very expensive. If you're not doing a lot of computation you're going to spend most of your time moving data.

So you want to go in there and you want to make sure you move your data to the GPU, do a lot of work on it, and then move it off. Starting kernels is also expensive, it takes a fair amount of time to enqueue a kernel, get it to the GPU, get the GPU going. So you want to make sure you're doing a lot of work for each kernel.

In particular what this means is that instead of having three small kernels, you're going to get a big benefit from merging those into one larger kernel. And events on the GPU are expensive, so try to only use events when you need them. So when you've got these dependencies that you need to enforce, use events; otherwise take advantage of the in-order properties of the command queues.

So for performance on kernels with regards to memory, you want to have a large global work size; you want more than a thousand work items, and tens of thousands or hundreds of thousands are fine, but you want lots of them. This is how you keep the GPU busy. If you're doing a lot of math, you can explicitly trade off performance and precision using the half and native variants of various math functions. If you're running on a GPU today, divergent code can be a real problem.

And what this means is that if you have a bunch of work items in a workgroup and they take different control flow paths, the GPU performance is going to suffer. So you really want to try and make sure that they're all executing the same control flow path together to get best performance. You also want to handle data reuse through local memory. Caching on the GPUs today is not very advanced.

You can get a big benefit from using local memory if the GPU supports it. And when you're accessing memory from your work items, you really want to try and access it sequentially across all of your work items. So what that means is if work item 1 is accessing memory location 37, it would be really nice if work item 2 were accessing memory location 38.

What this allows the GPU hardware to do is coalesce these memory reads to get much higher bandwidth. So for debugging, the biggest hint I can give you is start on the CPU. You can change to run on the CPU just by specifying you want a CPU device. You can use printf within your kernels, you can use Shark, and you can look at the code that you're running in GDB. These are big wins that you don't have on the GPU today.

Also, be very careful about reading and writing out of bounds from your memory objects on the GPUs. We don't have the level of memory protection that you're used to, so this can cause bad results. In fact, if you have a kernel that's crashing, I'd really recommend that you put in explicit address checks to make sure you're really reading and writing where you think you are. And play nicely with other apps. GPUs today are not preemptive.

If you run a kernel that's going to take several seconds, that means some other application that wants to use the GPU is going to have to wait for you. When it comes time to debug, since you can't stop a kernel in the middle and look at what's going on, use extra output buffers to track your state as you go through. So if you have a five-step algorithm in your kernel, write out the results after the first step so you can verify this.

And as I showed in the demo, you can set the context callback function, or more easily use CL_LOG_ERRORS, to get much more detailed information about API error messages. So with that, I'm going to hand it over to Mon Ping who's going to talk about the OpenCL kernel language.

Thank you, David. Now that you are familiar with the OpenCL program model let's quickly go over the language used to write kernels.

As you saw in the examples before, it's derived from C99, so you can really just jump in and start writing your kernels. However, there are a few restrictions. There are no C99 standard header files, so you can't use the C99 standard I/O or standard library, for example; there are no function pointers; you can't do recursion; and there are no variable length arrays or bit fields.

We provide you with a whole set of parallel extensions: you get work items and workgroups, we provide you with vector types, and we give you various synchronization primitives. We give you address spaces so you can efficiently use the memory hierarchy on your machine. And we give you functions to optimize image access, as well as a wealth of other built-in functions.

Now a kernel is a data parallel function executed for each work item. To identify that function as a kernel, you just have to put the kernel keyword in front of the function. Kernels can call other functions or other kernels. In this example, we retrieve the global ID, find the corresponding element in the input array, square the value, and put it in the corresponding place in the output array.
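
A minimal sketch of the kind of kernel being described (names are illustrative):

    __kernel void square(__global const float *input,
                         __global float *output)
    {
        int i = get_global_id(0);        /* which work item am I? */
        output[i] = input[i] * input[i]; /* square my element, independently */
    }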

And these can be executed in parallel. Now in your kernel you may want to know what your current environment is, so we give you a variety of functions to get this information. You may want to know, for example, what work dimension you have, so you use get_work_dim. You may want to know how many work items are used for your global problem space; you can use get_global_size for that information.

You may want to know how many workgroups are used, so you can use get_num_groups, and you may also want to know your current workgroup ID, so you use get_group_id. You may want to know, for example, how many work items are in your workgroup, so you use get_local_size for that information, and you can get your local ID with get_local_id, as well as the global ID, which you've seen before. Now, the data types are the C99 standard data types; you get char, uchar, bool, ulong, for example. We also provide you a half type which is used for 16-bit floats for storage.

We give you image types for 2D and 3D images as well as a sampler type, and we give you the vector data types for you AltiVec and [inaudible] programmers. Now the great thing about vector types here is that they're portable, so you don't have to write any more intrinsics for AltiVec or SSE; you can just write in a portable language which you can use on your different devices.

We support vector lengths of 2, 4, 8, and 16. So for example, if you want an 8 element vector of integers, you use an int8 type. They're endian safe, they're aligned at the vector length, and we give you a whole set of vector operations and built-in functions. So let's investigate some of the syntax for vector operations.

If I want to spread a value across all vector elements, I can use this cast syntax. I can set each component of a vector using this syntax. You can easily manipulate vectors, so you can take, for example, the high part of one vector and store it into the low part of another vector using this .hi/.lo syntax.

And you can compose a vector out of other vectors, so say I want to create an 8-element vector: I can compose it out of a 4-element vector, grab the first and second elements of another vector using this .s01 syntax, and grab other elements of a 4-element vector to complete it. Vector operations work as you expect, so if you do a vector addition, it operates on all the vector components.
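
A sketch of that vector syntax in OpenCL C (values are illustrative):

    float4 a = (float4)(2.0f);                    /* spread one value across all components */
    float4 b = (float4)(1.0f, 2.0f, 3.0f, 4.0f);  /* set each component */

    b.lo = a.hi;                                  /* high half of a into low half of b */

    float8 c = (float8)(a, b.s01, a.s23);         /* compose an 8-wide vector from pieces */
    float4 d = a + b;                             /* addition operates on all components */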

Similarly, this works for built-in functions, so you can call built-in functions on vectors and they operate on each component. Now, OpenCL provides a class of workgroup functions, and they're special. All work items in a workgroup must encounter a workgroup function, and they must encounter the function with the same arguments.

Some examples of these functions are things like barriers, memory fences, and async workgroup copy, which is just a copy from global to local memory or vice versa. Here's an example of an illegal use. In this program we grab the global ID, and if it's less than some value, we encounter the memory fence; this is illegal because not all work items encounter the memory fence. Let's take another example where we're doing a barrier. In this program we're just computing a moving average over three values.
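
A rough sketch of that averaging kernel (boundary checks omitted; the barrier forces every work item in the workgroup to finish reading before any result is written back):

    __kernel void smooth(__global float *data)
    {
        int i = get_global_id(0);

        /* private temporary: read the old values */
        float temp = (data[i - 1] + data[i] + data[i + 1]) / 3.0f;

        barrier(CLK_GLOBAL_MEM_FENCE);   /* everyone in the workgroup waits here */

        data[i] = temp;                  /* only safe within one workgroup */
    }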

Now when it does the average it stores into a private variable called temp, and then it has to hit the barrier, and then it outputs the result. And the reason why we need the barrier in this case is that we're using the same array for input and output. So let's assume that this program is broken into two workgroups of three items each, and the current state is as shown.

So item 2 will do the addition while the rest of the items are waiting on get_global_id. Notice that work items are not required to be working in lock step. So we do the addition, we store into the private temporary variable, then we move down to hit the barrier.

At this point it has hit the barrier, but the rest of its workgroup hasn't, so it can't continue; it has to stop. And then items 0 and 1 can start. Item 1 starts to do the addition, it does the same thing, stores the value, then moves down. Now remember that barriers only synchronize within a workgroup, not across workgroups.

So at this point items 0 to 2 can continue, hit the store, and store their values. Now this is going to be a problem when item 3 starts to execute, because it's getting a value from the old array and a newly produced value, giving you an incorrect result. So this is just to emphasize the fact that barriers work within a workgroup and they work fine, but if you try to synchronize across workgroups, that's not possible. I don't want to give you the impression that all synchronization primitives are workgroup functions.

You have atomic functions to access global and local memory, for the general set of add, sub, exchange, et cetera. They work on 32-bit and 64-bit integers and you can write to global or local memory. Remember these are extensions; you have to check your device to see if it supports them using clGetDeviceInfo. Also note that you can't use these to do a spin lock, because a workgroup needs to execute completely before anything else can be swapped in.

You've seen this picture before; this shows the memory hierarchy. And we map this memory hierarchy to address spaces in the language. So for private memory you use the private qualifier in front of the type. Similarly, for local memory you use local, and for constant and global you use those qualifiers. Now address spaces can be a little bit tricky, so let's quickly go over some rules for them. Kernel pointer arguments must use global, local, or constant.

So let's take, for example, this distance kernel. Here we're passing two global buffers, and a pointer into a local buffer which will be shared within the workgroup. An illegal use is to pass private memory to a kernel. This makes sense, because how can you pass something that's owned by a work item from outside of it? The default address space for arguments and local variables is private.
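
A sketch of how those qualifiers appear on a kernel (the signature is illustrative, not the slide's exact code):

    __kernel void distances(__global const float4 *points,   /* global buffer */
                            __global float *results,         /* global buffer */
                            __local float4 *scratch,         /* shared per workgroup */
                            __constant float4 *reference)    /* read-only constant data */
    {
        float4 temp;   /* no qualifier: private to this work item */
        /* ... */
    }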

So you see in our previous example, when we don't put a qualifier on it, like temp here, this is a private work item variable. image2d_t and image3d_t are always in the global address space. Now program global variables, those are variables outside of a kernel or function, must be in the constant address space. So you [inaudible], make it a constant, and give it to all your work items. But you cannot create something like a global variable called time to give access to all your items.

Casting between different address spaces is undefined. So remember, when you're casting between address spaces, make sure that you keep the same address space. So for example, here I'm taking a float4 pointer into global memory and casting it to a float* pointing into global memory. However, if I forget to put the qualifier, remember that the default address space is private, so this is doing a cast from global memory to private memory; since address spaces are disjoint, this will lead to undefined behavior. Now you've seen this kernel before in David's demo, we're just doing the average using global buffers. David also mentioned local memory may be much faster on certain devices.

For this program it might be much more efficient to grab a tile, copy it to local memory, manipulate it, and then grab another tile and do the same thing. To do this copy, you should use async_work_group_copy, which can copy from global to local as well as local back to global. Let's see how our example changes when we do this. So we specify the async workgroup copy, where we specify the destination address in local memory, the source in global memory, and then what size you want to copy.
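
A sketch of that tiled pattern (TILE and the indexing are illustrative; the wait on the returned event is explained next):

    #define TILE 64

    __kernel void average_tiled(__global const float4 *input,
                                __global float4 *output,
                                __local float4 *tile)
    {
        /* copy one tile from global into fast local memory */
        event_t e = async_work_group_copy(tile, input + get_group_id(0) * TILE,
                                          TILE, 0);
        wait_group_events(1, &e);

        /* ... work on tile[] in local memory ... */

        /* copy the finished tile back out to global memory */
        e = async_work_group_copy(output + get_group_id(0) * TILE, tile, TILE, 0);
        wait_group_events(1, &e);
    }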

It returns an event because this function is asynchronous, so you have to wait for the event to finish using the wait group events function in the kernel, and then continue. Now for images, we give you image functions which allow you to read and write the image. Note that we support getting the value back in the format that's most useful for you.

So for example, if you want your data back in float format, you can use read_imagef. Similarly, you can do integer or unsigned integer. You can get information about your image such as height, width, and depth. You can also get things like the channel data type as well as the channel order. Now, conversions.

Scalar and pointer conversions work just like in C99; however, there are no implicit conversions for vectors. Similarly, you cannot cast between vector types, primarily because vectors traditionally have different cast semantics on different architectures. Casts have other potential problems. Let's say I want to round a float to the nearest integer. I might write something like this.

Unfortunately this is wrong for values very close to zero, and it's wrong for negative values. And you also want to make sure that this gets converted into an instruction which is available on the hardware. To fix these problems, OpenCL provides you an explicit conversion operator that lets you convert to a destination type and specify the saturation and the rounding mode. These work for both scalar and vector types, and there's no ambiguity with them. Let's say for example I'm converting a vector of four float values to a vector of four unsigned characters in this case.

So I take the first value, -5; it's outside the range, so it gets clamped to zero. Similarly, I have 254.5; since I asked for it to round to the nearest even, it would be rounded down to 254. 254.6 would round to 255, and values out of range are clamped to 255. Notice that on the left, in C99, this would be undefined because of the range.
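
In code, the conversion just described is roughly (a sketch; the fourth value is an arbitrary out-of-range example):

    float4 f = (float4)(-5.0f, 254.5f, 254.6f, 1.2e9f);

    /* saturate to the uchar range, round to nearest even */
    uchar4 u = convert_uchar4_sat_rte(f);   /* yields (0, 254, 255, 255) */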

We also provide you a reinterpret-data function called as_typen; this allows you to reinterpret the bits of one type as another, and the types must be the same size. You can also do a select with this, but note that you should probably use the OpenCL select built-in function for that.

So we take two values and do a vector comparison; what happens is that if the comparison for a component is true, it sets that component to all ones, otherwise it sets it to all zeros. We can then take the floating point value we're interested in, convert it to its bits, do a bitwise AND, and then convert it back into a float using as_float.

So OpenCL provides you a wealth of built-in functions, and these are just a small set of what's provided; it's a very rich set. You have access to math functions, geometric functions, integer functions, et cetera. And they're well defined; for example, for the floating point functions we support IEEE 754 compatible rounding modes. We define the maximum errors of these math functions, and support both scalar and vector types. And they're efficient; they're tuned for the hardware they're executed on, so it's good to use them. Now let's take a closer look at math functions.

They're well defined; we handle the ambiguous C99 library cases. As David mentioned before, they come in three flavors: full precision, where log, for example, gives you 3 ulp; half precision, which gives you 11 bits of accuracy but is faster; and native precision, which may map to an instruction on the hardware, so it's fastest, but since it depends on the hardware, the actual precision is implementation defined. And some other features are supported: we provide preprocessing directives as defined by C99, and optional extensions such as doubles, rounding modes, atomic functions, writing to 3D image objects, et cetera.

Now just to summarize, OpenCL provides you a high performance framework. It's ideal for your data parallel, computationally intensive algorithms, and we give you access to all the compute resources on the machine so you can make your algorithm run fast. It's portable across all of these devices, and we provide you a well defined computation and memory model; this way you can easily break down your problems into global and local dimensions, and we give you address spaces so you can use your memory hierarchy as efficiently as possible. And we give you a good, efficient parallel programming language. It's basically the C99 that you're all familiar with, and we give you task and data parallelism as well as a very rich set of built-in functions. So for more information please contact our evangelist Allan Schaffer.