Graphics, Media, and Games • OS X • 53:22
OpenCL lets your application tap into the vast computing power of modern GPUs and multicore CPUs to accelerate computationally intensive tasks. Get all the details about advances in OpenCL for Mac OS X Lion and learn techniques that yield incredible performance increases. Explore tightened integration between OpenCL and Grand Central Dispatch, and see how auto-vectorization makes it easy for your app to harness parallel computation.
Speakers: Sion Berkowits, James Shearer, Abe Stephens
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.
Good morning. My name's Abe Stephens, and I'm a member of the OpenCL team at Apple. In this session, I'm going to tell you about what's new in the OpenCL framework in Lion. I'm going to start the session with a short overview of what OpenCL is, and then talk about a new Lion feature, which is the integration of OpenCL with Grand Central Dispatch. After that, I'll talk about some new options in Lion for compiling OpenCL source files, both using Xcode and using a new command line tool.
So let me dive right in and tell you about OpenCL. OpenCL is a framework that you can use to take advantage of both CPU and GPU parallel processors in your system to accelerate computationally intensive aspects of your application. OpenCL was designed to allow you to use both CPUs and GPUs, and it's designed for parts of your application that are computationally intensive. The OpenCL system consists of a runtime API, a compiler, and a programming language that you use to implement these pieces of your application.
OpenCL's programming language is portable between devices, so you can write a single program and run it on a GPU, and then also run that part of your program on the CPU or a different type of GPU. Now, OpenCL is able to be this portable because the programs that you write for it use a data parallel programming model, and we can map that data parallel programming model to a variety of different kinds of processors. So data parallel is the name of the game in OpenCL, and I'd like to give a very simple example of a data parallel problem, and then show you how you would implement this problem in OpenCL.
So let's take a very simple example. We're just going to add the variables a and b together to get c. And to make this a parallel problem, we'll say that a, b, and c are arrays of numbers or vectors of numbers. And we'd like to compute the element-wise sum of a and b and get c. So if we wanted to do this in a serial program, say with just one thread or one processor, we could use a for loop, and our for loop would just iterate over the elements of the array.
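The serial loop described here, and one way of restating it as a data parallel problem, can be sketched in plain C. This is only a sketch, not the code from the session slides: the names add_arrays_serial, add_arrays_work_item and run_data_parallel are hypothetical, and the "launch" is emulated with an ordinary loop on one thread, where an OpenCL runtime would instead schedule the independent pieces of work across the compute device.

```c
#include <assert.h>
#include <stddef.h>

/* Serial version: one thread walks every element in order. */
void add_arrays_serial(const int *a, const int *b, int *c, size_t n)
{
    size_t i;
    assert(a && b && c);
    for (i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

/* Data parallel version: the loop body becomes a function of the index.
 * Each call touches only element i, so the calls are independent. */
void add_arrays_work_item(size_t i, const int *a, const int *b, int *c)
{
    c[i] = a[i] + b[i];
}

/* Hypothetical stand-in for a runtime: runs one piece of work per element.
 * A real runtime is free to run these in any order, or concurrently. */
void run_data_parallel(const int *a, const int *b, int *c, size_t global_size)
{
    size_t i;
    assert(a && b && c);
    for (i = 0; i < global_size; ++i) {
        add_arrays_work_item(i, a, b, c);
    }
}
```

Because no call to add_arrays_work_item reads anything another call writes, both versions produce the same array c, which is what makes the loop a good candidate for a data parallel implementation.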
So if we take a closer look at this for loop, and this is the for loop that we're going to try to make a data parallel version of,
[Transcript missing]
Now, it's also possible for some problems to have instances where the individual work items aren't completely independent. Maybe they need to communicate.
Maybe instead of computing the sum of two different arrays, we're computing the sum across all the elements in an array. And in order to schedule work items in such a way that they can communicate and share data, we have to put those work items in what's called a work group. So the second type of dimension that a problem has in OpenCL is the size of a work group, which is the local size.
And of course, the problems that we solve with OpenCL can be one-dimensional, for example, that addition thing that we just looked at, or two-dimensional. Maybe I'm processing an image, and the image has a width and a height. And I can also work with three-dimensional data, and so I can define a global and a local size that divides my work in one, two, or three dimensions.
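The relationship between the global size and the local size can be sketched with a little index arithmetic in plain C. This is an illustration under stated assumptions, not an OpenCL API: the range_1d structure and the helper names are hypothetical, and the sketch assumes the local size divides the global size evenly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a one-dimensional launch. */
typedef struct {
    size_t global_size; /* total number of work items */
    size_t local_size;  /* work items per work group  */
} range_1d;

/* Number of work groups; assumes local_size divides global_size evenly. */
size_t work_group_count(range_1d r)
{
    assert(r.local_size > 0 && r.global_size % r.local_size == 0);
    return r.global_size / r.local_size;
}

/* Which work group a given global index falls in. */
size_t group_of(range_1d r, size_t global_id)
{
    assert(r.local_size > 0 && global_id < r.global_size);
    return global_id / r.local_size;
}

/* Position of a work item inside its work group. */
size_t local_id_of(range_1d r, size_t global_id)
{
    assert(r.local_size > 0 && global_id < r.global_size);
    return global_id % r.local_size;
}

/* For a two-dimensional problem such as an image, a work item is
 * identified by a pair (x, y); this flattens the pair to a row-major
 * offset into the pixel array. */
size_t flatten_2d(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return y * width + x;
}
```

For example, with a global size of 1024 and a local size of 64 there are 16 work groups, and the work item with global index 130 is item 2 of group 2.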
Now I'd like to talk a little bit about the memory model that OpenCL uses. And this is essentially where the information that your kernel is going to process on comes from. Now, the data starts off in the host application, and the application thread will allocate data in host memory. This is the same memory that your application is generally running in.
And then when you send a command to an OpenCL device, or when you send a kernel to the OpenCL device, the runtime will move the memory objects that you've created into global memory on that device, or the global address space. And using the CPU device, the address spaces are actually all the same. They're all physically in main memory on the machine. With GPU devices, or at least with discrete GPU devices, the global memory might actually be a separate piece of memory on the discrete GPU.
Anyway. When your kernel is executing, it can load data into another address space, another piece of memory, which we call local memory. And local memory is that communication memory that is exclusive to the work group. So all of the work items executing in the same work group can access this piece of local memory.
Now, each work item also has an exclusive piece of memory that's specific to just the work item, and this is where your program can store intermediate values or other types of variables, and that's called private memory. And so, as a programmer, when you write an OpenCL kernel, you can write a very simple kernel, like the one that I showed earlier, or you can write a kernel that will move data between these address spaces. And when your program is mapped onto a GPU device, or if it's, you know, run on the CPU, the address space and the type of memory that's used for different variables will change depending on what the underlying hardware supports.
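The three address spaces can be illustrated with the sum-across-an-array problem mentioned earlier. The following plain C sketch only emulates the idea on one thread; it is not an OpenCL kernel. GROUP_SIZE, sum_by_work_groups and the variable names are hypothetical, and the group size is assumed to be a power of two so the tree reduction works out.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4 /* hypothetical local size; must be a power of two */

/* Sums n ints, one work group at a time. n must be a multiple of
 * GROUP_SIZE. partial[g] receives the sum computed by work group g. */
long sum_by_work_groups(const int *global_in, size_t n, long *partial)
{
    long total = 0;
    size_t g, lid, stride;
    assert(global_in && partial && n % GROUP_SIZE == 0);

    for (g = 0; g < n / GROUP_SIZE; ++g) {
        long local_mem[GROUP_SIZE]; /* "local": shared by one work group */

        /* Each work item copies its element from "global" memory into
         * "local" memory, passing through a "private" variable. */
        for (lid = 0; lid < GROUP_SIZE; ++lid) {
            long private_val = global_in[g * GROUP_SIZE + lid];
            local_mem[lid] = private_val;
        }

        /* Tree reduction inside the group: halve the active work items
         * on each step. On a real device, the work items would need to
         * synchronize between steps. */
        for (stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
            for (lid = 0; lid < stride; ++lid) {
                local_mem[lid] += local_mem[lid + stride];
            }
        }

        partial[g] = local_mem[0];
        total += partial[g];
    }
    return total;
}
```

The work items in a group communicate only through local_mem, which is why they have to be scheduled together as a work group; the per-group results are then combined in a final step.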
Okay, so after we've allocated memory and we've seen sort of how to write a data parallel program, let's look at the steps that are involved in getting started with an OpenCL program. Well, the first step is to select a compute device from your system, and there's an API that allows you to query the system and discover which devices support OpenCL. After you've selected a device, in this case, I'm selecting the CPU, you create a CL command queue, which is the object that you use to send commands to that compute device. After that, you also have to create memory objects.
In the example that we looked at, I had three buffer objects, A, B, and C, so we can create those. And then the next step is to create a program object and then a kernel instance from that object. And actually, in the next part of the talk, we'll talk about some new ways of handling this step that are new for Lion.
Anyway, once I've created my program object, I can send an execute kernel command to my device, and the runtime takes care of moving the data objects and the program to the device that'll be executed. And after that, my device will go off and compute the work and perform the work, and eventually I can move the results back to the application and continue with my computation.
Okay, that's a very, I guess, brief summary of how OpenCL is used to solve data parallel problems and how you might or what you might need to do in order to integrate it with your application. Now I'd like to switch gears and talk about a new Lion feature, which is integration between Grand Central Dispatch and the OpenCL framework.
So as we just discussed, OpenCL uses a queuing model to send commands to the device, to the compute device. And there are other queuing systems in Lion. A very common queuing system is Grand Central Dispatch. And in Grand Central Dispatch, you create a dispatch queue or you get a global dispatch queue, and you send blocks of work to that queue.
And so these blocks are just essentially arbitrary pieces of code that are queued in a certain order and executed by the CPU. And there are different ways and different types of queues that you can create using Grand Central Dispatch. Now, OpenCL also has this command queue model, but one difference is that instead of sending blocks of work to an OpenCL command queue, you have to send specific commands.
So there's a specific command or specific function API to enqueue a read buffer operation or to enqueue a kernel execution and specify the local and global sizes that the kernel should be executed with. So in Lion, we've added the ability to combine these two types of queuing systems.
It's possible to create a dispatch queue, a Grand Central Dispatch dispatch queue, and then send blocks to it. And those blocks can contain either functions that execute kernels, that execute OpenCL kernels, or other types of OpenCL commands. And so in this case, the dispatch queue will support the other parts of the Grand Central Dispatch API, but it can send OpenCL commands to a CPU or a GPU compute device.
Okay, so let's take a look at what this would actually look like in your code. Now, I have a block here, and I'm going to invoke the add_arrays kernel. And this is a version of the same kernel that we looked at earlier. And I do this by calling dispatch async, passing a queue, and then a block that contains a call to the add_arrays_kernel function. That function takes a special structure that we'll talk about in a second, and then the arguments of that kernel, A, B, and C.
Okay, so if you remember, this is the kernel that we looked at just a few minutes ago. It's the add_arrays kernel. It takes three arguments that are arrays of integers. And what I'm going to do is I'm going to take this kernel and feed it through the OpenCL C compiler, and the output of that compiler will be some bitcode that's loaded at runtime and also a declaration of a kernel block. Now, the kernel block has the same name as the kernel with the word underscore kernel appended to it. It also takes this nd_range structure, which simply defines the global and local size of the kernel that should be launched on a device.
And so what's happened here is I've changed the process of creating a program, compiling a kernel, and then launching the kernel such that all I have to do is define my add_arrays kernel in a .CL source file. And then at runtime in my application program, I simply call the add_arrays_kernel function, pass the nd_range structure, and then the arguments to the kernel. And so this process is a little bit more streamlined.
Okay, let me tell you where these queues come from. So the dispatch queue that you are sending this work to is an ordinary dispatch queue, except that it's associated with a compute device. And you associate the queue with a compute device by calling the GCL create dispatch queue function. And you can associate or find a device for your dispatch queue in several different ways.
One way is to specify a device type, in this case the GPU. It's also possible to specify other attributes like the queue priority. And if your application has some way of allowing the user to specify a compute device, or you'd like to use a specific OpenCL compute device, you can always specify the CL device type use device ID flag, and the dispatch queue will be created with that specific device.
Now, the data that you pass to your kernel block functions has to be allocated in a special way. And we can support either data that's been allocated as a buffer, so in this case, using GCL malloc, or data that's allocated as an image. And of course, we can also support passing immediate data, so just a floating point or an integer value to a kernel.
So, in this case, I'm allocating memory that is safe to pass to a kernel block using GCL malloc. I can also use GCL malloc to wrap an existing allocation, in this case, using the flag CL_MEM_USE_HOST_PTR. And to create an image, I use the function GCL create image and pass a format that describes the channel order and channel type of the image to be created.
There are other commands that I can send to dispatch queues. There's a memcopy command that operates very much like memcopy. The order of the operands indicates the type of command that's executed. So for example, in this case, my destination is A, which is that buffer that I allocated using GCL malloc. And my source is the host data. And so I'm copying data from the host to the CL resource. I can also use map, and I can map and manipulate data that was used by OpenCL on the host.
Let me talk a little bit more about how MAP is used in conjunction with the memory consistency model that you have to follow when you're using this Dispatch Queue API. So in this case, what I've done is I've allocated my three buffers, A, B, and C. And as you can see, they've been allocated in host memory by the runtime. Now, when I enqueue and dispatch the add arrays kernel, and it's actually executed by the device, the runtime will move those buffer objects into global memory.
And now at this point, my application might not be able to safely manipulate A, B, and C, because it's possible that the runtime is -- or that the device is using the data asynchronously. And since I've dispatched the block of work already, I don't really know if it's been completed or if it's pending or what its current state is. Maybe it's running concurrently.
And so in order for the application thread to be certain that it can safely access the data in A, B, or C, we have to either map the data or make a copy of it. And those commands are both enqueued into the command queue. And so if I wanted to map the data, I would do something like this. I would call dispatch sync because I want the results of my command to have completed before the dispatch function returns. And call GCL map pointer.
Process the results. And then when I'm done, I have to call GCL unmap, again from a dispatched block. In this case, either dispatch sync or dispatch async would work, since I don't care when this finishes. GCL unmap tells the runtime that it's able to move that data to a different device if it needs to in the future for some other command.
Now, in this API, we can synchronize between the work that's running on an OpenCL device and the application thread, and we do that using Grand Central Dispatch synchronization primitives. So I'd like to show two examples. The first one is a case where we enqueue work to multiple GPUs and then wait for that work to complete in the application.
The second example will be having work executed on one GPU and then having another kernel on a second GPU wait for the first kernel to finish. So let's look at the first example. In this case, I have a dispatch group that I've created, and I enqueue work to the first queue, then enqueue work to the second queue using dispatch group async.
After that, I perform more work on the application thread. And then at some point in the future, I call dispatch group wait, and the application thread waits for the work on both queues to complete.
[Transcript missing]
There are many other functions you can send to Dispatch Queues. There are a number of other functions for synchronization, for callbacks, and for semaphores. There's a memory object finalizer that's similar to the existing CL 1.1 API. And there are timing functions, and then also the ability to interoperate between Dispatch Queues and the command queues that are part of OpenCL 1.1.
Now, the documentation for this and the APIs are all in the header file, in the gcl.h header file in the OpenCL framework. And you can access this functionality by just including the framework header, opencl.h. Now I'd like to switch gears and talk a little bit about offline compilation, which is another new feature in Lion.
Now, in Lion, you can still use an online compilation routine, which is what you would have used in Snow Leopard. In this case, your kernel source is in a string in your program, and you pass this string to cl create program with source, and then you call cl build program, and you end up with a device binary that is executed on the compute device. And what this means is that your program source code has to be in a human-readable format, in that kernel source string.
Now, in Lion, we've added the capability to compile offline and then load a binary, and that binary is independent of the specific device that's in your system. It's actually specific to a class of devices. And so in that case, you would put your kernel source in a file, in this case, kernel file.cl, a CL source file.
Then, while you're building your project, you run that source file through the OpenCL C compiler. That produces a bitcode file. And then from there, at runtime, you can call cl create program with binary to load that bitcode file, then build your program and end up with a device binary.
Okay, so to create your bitcode file, you use the OpenCL C compiler, and there are three different triple architecture arguments. You can create a bitcode file for 32-bit GPUs, for 32-bit CPUs, or for 64-bit CPUs. And now this bitcode file is specific to the class of device, not the individual device, so that 32-bit GPU bitcode file will work with any of the GPUs that support OpenCL.
Now, you'll notice in this example, the bitcode file is named, in this case, kernel-gpu32.bc. The important part of that name is the .bc at the end. When you pass this file name to cl create program with binary, the runtime makes sure that you have passed a bitcode file by checking to see that the name ends in .bc.
Okay, so it's possible to create this bitcode file and automate this process in Xcode 4. To do this, you would add the OpenCL .cl file to your target, and then when you build your project, Xcode will invoke the CL compiler and produce that bitcode file, and then at runtime, you can load the bitcode file.
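The extension check described here is simple enough that an application can do the same thing before handing a file to the runtime. The helper below is a hypothetical sketch in plain C, not the runtime's actual implementation; the name has_bitcode_extension is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if path ends in ".bc", 0 otherwise. */
int has_bitcode_extension(const char *path)
{
    static const char ext[] = ".bc";
    size_t ext_len = sizeof ext - 1;
    size_t path_len;

    assert(path != NULL);
    path_len = strlen(path);
    if (path_len < ext_len) {
        return 0;
    }
    /* Compare only the tail of the path against the extension. */
    return strcmp(path + (path_len - ext_len), ext) == 0;
}
```

With this helper, a name such as kernel-gpu32.bc passes the check and a source file name ending in .cl does not.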
You can either use CL create program with binary, like we just discussed, or you can use the Dispatch Queue API and call the kernel block that was created by the compiler for your function. And now I'd like to show you an example of how that works in Xcode.
Okay, so I'm launching Xcode here. And what I have here is a very simple example. It's not quite as simple as the kernel we looked at just a second ago, but it's very straightforward. And so what I'm going to do is -- I'll look at the code in just a second, but what I'm going to do first is show you how to add a .CL file to your Xcode project.
So the first thing I'm doing is going up to "Add files to my project," and now I'm going to select the .CL file that I created before the session. And you can see that it contains a kernel called "countstuff." I'll go down here and make sure that I've added the .CL file to the target. In this case, the target's called "example0," and click "Add." And there's my .CL file right in Xcode.
Okay, so what else do I have to do in order to use this integration? Well, I have to include the header file that the compiler generated. In this case, the header file is the name of the .CL file with a .h appended to it. And then after I've initialized data and I'm ready to execute the kernel, I can just call my kernel.
In this case, it's the countstuff kernel. So the name of the kernel in the .CL file was countstuff, and I've appended underscore kernel to it and created an ndrange structure. Okay, two more things I have to check. If I go to my build phases settings, so this is under the project settings.
I want to make sure that I've added the .CL source to my compile sources list, and that I'm linking the project with the OpenCL framework. So now if I build, I'm able to successfully build the project. In this case, the program just counts some numbers and outputs the statistics of the count. What's more interesting is that if I look at the build settings in Xcode and click on all, and then type in OpenCL, we can see that there are a number of OpenCL build settings that I can set within the Xcode GUI.
There are settings for an auto-vectorizer that we'll talk about later in the session, as well as some other optimizations and flags that can be set. You'll also notice the OpenCL architectures option, and this is setting the different types of bitcode files that I should produce. In this case, I'm producing bitcode files for each of the three types of architectures.
Okay, now we've seen how to use OpenCL with Xcode integration, and how to use OpenCL with Grand Central Dispatch to make launching kernels and executing programs easier. Now I'd like to invite my colleague, Jim Shearer, to the stage to talk a little bit more about OpenCL and how it can be used with OpenGL.
Thanks, Abe. So a lot of you are probably using OpenGL, and you might be wondering how you can add OpenCL to your application. So I'm going to be talking about that topic, sharing with OpenGL and a few others for about the next 20 minutes. So what do we mean by sharing? Well, if you have an OpenGL program, it probably has some data. In this case, some geometry data, like a vertex buffer object, which you want to render in OpenGL. But then maybe you want to, say, do some compute on that same geometry, or maybe you wanted to generate that geometry using OpenCL.
And that's the same data. So that's one type of sharing, geometry data. But if you're rendering in OpenGL, you're going to render some kind of picture, right? Maybe through some frame buffer object to a texture. And then maybe you want to post-process that same piece of data in OpenCL. So again, you're working on the same data, sharing. The other type of sharing you might not be as familiar with is sharing of synchronization primitives. And I'll come back to this a little bit later.
So let's step back for a second and take a look at how CL and GL view your system. So imagine you have this Mac that has a GPU and a CPU in it. OpenCL sees these as devices, devices for doing computation, whereas OpenGL sees them as renderers, renderers for making a picture.
You get access to the devices in CL by asking for device types with CL get device IDs. So you can ask for the CPU, or you can ask for the GPU, or if you're computationally greedy or whatever, you can ask for all of them and we'll give them to you.
In OpenGL, you can restrict the devices you ask for similarly, except you use pixel formats, and you set up your pixel format with certain attributes. So here we have a pixel format that has some attributes about the drawable destination, like how many bits are we going to use for color and for alpha. But notice there's also an attribute for what type of device we want. We want accelerated devices in this case.
So we'll give you just those devices. So then you're going to create a context in CL and GL with these devices. Now, even though the CL and GL context look the same on the slide, that's a little bit misleading. They're not the same type of object. But from your perspective as a programmer, they're the gateway to your devices.
So that's why I've drawn them the same. So how do you actually do work on these devices? Well, in CL, like Abe said, you create command queues, right? You can have a GPU command queue or a CPU command queue. Just depends on what device is better for the kind of work you want to do.
In GL, it's a little bit different. The device that's doing the work is represented by a virtual screen, okay? A virtual screen is a combination of physical hardware in your box plus the attached display. In this case, I only have drawn one virtual screen, but that's what's doing the work in GL.
So here's the situation we want. On one side, we have GL on the GPU, which is going to do some rendering. And on the other side, we have CL, which is going to do some compute. And they're going to do it on the shared geometry between them. But they want to take different views of the geometry, okay? GL is going to see this as a VBO, whereas CL just wants to see it as a buffer of data, as a cl_mem object.
Well, they can do that because these are managed by what's called a CGL share group. It's an object we haven't talked about until now. And actually, the CGL share group does more than just manage the data. It also manages these shared devices. So the real picture is this. The CL context and the GL context are both associated with a particular CGL share group.
So, how do you use share groups? Well, your first step is to get a share group. And the good news is, if you have an OpenGL context in your program, which you probably do, you already have a share group. You just need to ask your GL context for that share group. So step two is you create your CL context in a special way using the share group.
Step three is to create any of these objects that you want to share between APIs, create them in OpenGL first. And you do that in the exact normal way you would create any OpenGL objects. You don't do anything different. But then on the CL side, you create your CL objects from the GL objects using special API that we provide in the system.
So let's take a look at how the code looks. OK, so here we have maybe a custom NSOpenGLView in your application. So you can interrogate that OpenGL view and say, oh, give me the context, please. And then the next step is you then interrogate that context and ask for the share group using CGLGetShareGroup.
And then you package that share group that you have into an array of CL context properties. So you make this array. And then you pass this to CL create context. And when you do that, we're going to give you back a CL context that has the devices from the share group in it.
But there's one tip. If you want to have the CL CPU device in that context as well, you have to do one extra step. So the setup is the same. Setup hasn't changed from before. But now the extra step is you have to get the device ID for the CPU device by calling CL get device IDs for the CPU. And then in addition to passing this properties array, you also feed us this device ID.
So another thing you might want to do is you might want to use the CL device that corresponds to the current virtual screen. Now again, to remind you, that's the currently in use OpenGL renderer. So why would you want to do that? Well, you would want to do that for performance reasons.
So let's say that you have a problem that's memory bound, and you want to do compute and render on some shared data. So again, the time to shuffle this data across the PCIe bus is what's bottlenecking you. So you'd want to make sure you're doing your compute and your render on the same device so we don't have to move the data.
So you can do that. You just ask CL. You call CL get GL context info apple and ask for the current virtual screen. We'll give you back the CL device that corresponds to the current virtual screen, and you can use the command queue associated with that device to do your work.
So how do you create these OpenCL memory objects that we talked about? Now remember we said you're going to create them in OpenGL first, and you don't have to do anything special, normal OpenGL creation. And then you call these special entry points. So here you see CL create from GL buffer, and in this case we're passing in a VBO that we've created in GL. But there are also entry points for creating CL image memory objects from textures, 2D and 3D, and from render buffers that you might have attached to a frame buffer.
So that brings up the question, okay, let's say I create an image from a texture. What is my format? What do I get? It just depends on the internal format of the texture. So there are some examples here. For example, if your internal format is GL_RGBA8, when you create a CL image from that, you're going to get CL_RGBA with CL_UNORM_INT8. It just depends on the internal format. And there's a mapping for each one.
Okay, so now we know how to create the objects. How do we actually use them in CL? What do we have to do special? Any work that we have to do? Well, not really. You just use them as you normally would any CL objects. If you're already a CL programmer, you're going to be very comfortable here.
Okay, but there's one thing you should remember, and that's this notion of flush, acquire, compute, release. So if you guys have been here in previous OpenGL sessions at WWDC, you've probably heard of flush and bind. We talked about that a lot. This is the same idea. So what does this mean? Okay, let's say we're going to do some work in GL on this shared object.
We want to call glFlush. Next step is that we acquire these objects in OpenCL, okay? clEnqueueAcquireGLObjects. Special function we provide, and you pass it an array of the objects you're going to work on. And that's your way of telling us, "Hey, OpenCL, I'm about to do something to these, so watch out." And then you just do whatever it is that you want to do in OpenCL.
Wail away. Okay, when you're done, though, make sure you call clEnqueueReleaseGLObjects. And again, that's just your way of telling us that you're done. As long as you do that, you'll be fine. Now note, this glFlush is only required if you're enqueuing CL and GL work on different threads. If they're on the same thread, hey, it's even easier. You can remove this flush, maybe get some better performance, right? And then this is what the picture looks like. It's just acquire, compute, release. Even easier.
So that's on the CL side. What about using the objects on the GL side? Well, really, you have to do nothing. That's the good news. When you call clEnqueueReleaseGLObjects, we do all the flushing that's necessary on our side. So you simply bind and use them in OpenGL as you would. Nothing special.
So that was sharing data. And remember I said there's a second half that's maybe new to some of you, which is sharing synchronization. So what am I talking about? Well, both CL and GL have these primitives. In CL, they're called CL events. In GL, they're called GL sync objects that allow you to wait on some work to complete, okay, but not all of the work.
Basically, the point of these is that you want to avoid completely flushing your pipe. If there's a piece of work at the beginning of the pipe that you need to wait on for some other work, you don't want to wait until everything's done. You just want to wait on that one piece. And because these two things do the exact same thing in their respective APIs, when we're using the APIs together, it makes sense to use them together. There's this natural correspondence, so that's what you can do.
So what's the picture look like? Okay, so on one side again we have GL, which is doing some rendering on a shared object, and then on the other side we have CL. And what you want to do is you want to call render on the GL side and then acquire compute release on CL. And remember we said we need this flush here. So what are we going to get? We're going to get something like this. The flush happens, and then the acquire compute release happens, and our post-processing works out fine in CL.
But note that if you're doing this over and over again, if you're doing render flush, render flush, maybe for multiple frames, you're going to be doing a little bit of GL work in between each flush. And if you're following our best practices for OpenGL, you know that really that's not great. You should queue up a lot of render work, right? Now these other renders that have just dropped in, notice these aren't touching that shared object. They're other GL work unrelated to the shared object.
So what you want is you still want this one render to happen before your CL compute happens, but that's not what is going to happen. What's going to happen is everything up to the flush is going to, you know, complete before the acquire, the compute, and the release. So that's bad. You're going to get the correct result, right? But the problem here is performance. The CL side is now waiting on GL to finish. And so you're not taking full advantage of your CL queue.
So what you can do is you can insert a GL sync object into your command stream on the GL side. So then what happens is you use CL API to create a CL event from that sync object, and then you wait on the acquire for that event to complete.
So what happens is that sync gets processed on the GL side, and this event is triggered, and your acquire and compute release can proceed right from there before waiting on all that other work. So that's exactly what we want. So what does this look like in code? Okay, again, here's our situation. We have OpenGL that wants to render the geometry and make a picture, and then we want to post-process the picture.
So the code is pretty straightforward. We just have some function which draws all of our stuff to this shared FBO. And then this is what we need to do. We need to create a GL sync object using glFenceSync. Okay, and that inserts this fence into our command stream. We do all this unrelated work, and then we eventually call glFlush. So the trick on the CL side is to use the special function clCreateEventFromGLsyncKHR, kind of a mouthful.
And you give it the sync object, it gives you back an event. And then you can pass that event in the wait list of any of the CL functions that take a wait list. In this case, it's clEnqueueAcquireGLObjects. And that makes that CL call block until that event is done.
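To make the performance argument concrete, here is a toy timeline in plain C. It is not OpenGL or OpenCL code. It assumes an in-order GL queue, and the function names and millisecond durations are made up for illustration. It only compares when the CL acquire could start if it waits for the whole flushed batch versus an event tied to a fence placed right after the one render it depends on.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (not real OpenGL/OpenCL): the GL queue runs commands in order,
 * and each command takes a made-up number of milliseconds. */

/* Time at which GL has finished the first `count` commands. */
int gl_finish_time(const int *durations_ms, size_t count)
{
    int t = 0;
    for (size_t i = 0; i < count; ++i)
        t += durations_ms[i];
    return t;
}

/* Without a fence, the acquire waits for everything up to the flush. */
int cl_start_after_flush(const int *durations_ms, size_t total)
{
    return gl_finish_time(durations_ms, total);
}

/* With a fence placed right after command `fence_after` (0-based), the
 * acquire waits only for commands up to and including that one. */
int cl_start_after_fence(const int *durations_ms, size_t total,
                         size_t fence_after)
{
    assert(fence_after < total);
    return gl_finish_time(durations_ms, fence_after + 1);
}
```

With one 5 ms render to the shared object followed by three unrelated 7 ms renders, this model has the CL work starting at 5 ms with the fence instead of 26 ms with the full flush.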
So I'm happy to say that you heard Abe talk about in the first half of the session how we've added integration with Grand Central Dispatch. Well, the CL/GL stuff, it plays very nicely with that. It's fine. But one thing you have to remember to do is before you create your dispatch queue, you have to remember to set the share group, first thing you need to do.
So you get your share group in the same way as before, okay, and then you pass it to the special function, gcl_gl_set_sharegroup. As long as you do that, everything else will work fine. And then we provide special entry points for creating CL objects from GL objects.
And in addition, if you use OpenCL in this way, you don't have to worry about the acquire and the release. We're going to take care of that for you. So I have a demo for you. So those of you who are old hats at WWDC have probably seen a Blue Pony demo before. And you know that Blue Pony is this lonely, solitary guy.
He was always walking around by himself, you know. So we thought this year we would do a couple things for him. So the first thing we did is we gave him some nice terrain to play around in. So what we're doing is we're generating this terrain in OpenCL into a VBO directly on the card, and then we're rendering that VBO in OpenGL. So then we also thought, well, that's great, but Blue Pony is still extremely lonely and just walking around by himself. We decided to give him some friends.
So now Blue Pony is no longer lonely. Okay, so one other thing we decided to do is add some fun into their lives. So we're going to add this nice depth of field effect. So what we're doing in GL is we're rendering to a texture through an FBO.
And so we're going to do some post-processing on that texture in OpenCL. So I'm going to command the ponies to line up here. So our ponies are going to obey us, as they should. Okay, so now the ponies are all going to line up, and we're going to turn on our depth of field effect.
So you can see that this pony in the foreground, the gray pony, he's in focus where these guys in the back are blurred. And we can move the depth planes. So now I've moved it so the purple pony and the yellow pony are in focus. And now there's that beige guy in the center. And then our buddy Blue Pony, he's in focus now. And then this guy in the back. Forward, back, forward, back. So we can really mess with the ponies.
Anyway, we've shown how it's really easy just to sort of share data between these two APIs, and we can improve ponies' lives by doing so. So that's a good thing. So I want to talk about another type of sharing that you might not be familiar with, which is sharing IOSurfaces with OpenCL.
So IOSurface, that's a technology we introduced in Snow Leopard, right? And it's an abstraction for shared image data. Okay, so great, but what does that mean? Well, I talked about two APIs in the system, OpenCL and OpenGL, which use image-like data. But we have other APIs in the system, like Core Video, for example, which uses image-type data.
So wouldn't it be nice if there was one way of packaging up images so that we can pass them around between APIs, keeping in mind that these APIs might use different architectures like the GPU, which have their own memory spaces? And that's exactly what IOSurface is. It's a way of packaging your image data so it can transcend APIs, architectures, address spaces, and even processes.
So what's an example use case? Okay, you're writing a video processing application. Okay, you're gonna process some video frames. And you're writing your effects using certain processing modules. You might write some now, you might write some later, maybe using GL, CL, GPU, you don't know. You don't wanna have to worry about it. So instead of passing your image data around in some special way, you can use an IOSurface.
You just pass an IOSurface to one of these processing modules. It does what it needs to do, and then the change to that IOSurface is reflected in your application's point of view. And you don't have to set up any special context, anything like that. It just works. Just wrap it up in an IOSurface and you can do that.
So another use case is multiple processes. Again, you have some application which is going to process video frames. And maybe out on your box you have some nice render server that does all the effects for you. And again, it's using a variety of different things. Maybe it's running in 64-bit mode, and your application's in 32-bit mode. You can open a connection to this process using a Mach port and just chuck your IOSurface right over that Mach port to that other process. It does what it needs to do, and then again, that change is reflected in your process's point of view.
Very nice. So the simplicity is one reason to use IOSurface. It's just easy to get your data around between different APIs, but it's also efficient. So in this case, if the application and the render server are using the same device, as you bounce back and forth between these two processes doing work to this image, it doesn't need to be copied, so of course we're not going to copy it. IOSurface just knows.
So how do you integrate this with CL? Well, you create CL images that are backed by IOSurfaces, and you use a special API to do that. So here we have an IOSurface that somebody's given us or we've created ourselves. We can interrogate that to get the width and the height, and then we just call a special function, clCreateImageFromIOSurface2DAPPLE. All right, and we pass it the width, the height, the surface, and this image format parameter.
Okay, so what about that image format? Well, IOSurfaces don't have a native image format. They're just a bucket of bits. So you can interpret them however is most useful for your application, however you need to. Okay, so whatever image format you want to pass, basically we're going to believe you. We're ready to believe you as long as your width and height, times the pixel width, match up with the surface. So as long as the size matches up. You can interpret this IOSurface in multiple ways even in the same CL program if you want to.
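As a rough illustration of "as long as the size matches up", here is a small plain C check. The helper name and the exact rule are assumptions for illustration, since the talk doesn't spell out what the runtime actually validates. In a real program the row stride and row count would come from calls such as IOSurfaceGetBytesPerRow and IOSurfaceGetHeight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL or IOSurface): would an image of
 * width x height pixels, each bytes_per_pixel wide, fit inside a surface
 * with the given row stride and number of rows? */
bool format_fits_surface(size_t width, size_t height, size_t bytes_per_pixel,
                         size_t surface_bytes_per_row, size_t surface_rows)
{
    assert(bytes_per_pixel > 0);
    return width * bytes_per_pixel <= surface_bytes_per_row
        && height <= surface_rows;
}
```

Under this rule, a surface with 2560-byte rows and 480 rows could be viewed as 640x480 with 4-byte pixels or as 1280x480 with 2-byte pixels, which is the sense in which one surface can be interpreted in more than one way.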
So what do you actually have to do to use these? Well, let's say you want to modify this IOSurface in your application on the CPU before you do some work in CL. What do you need to do? Well, you need to lock it first and then do something.
In this case, we're going to throw some dog pixels into the image and then unlock it when you're done. And then you're free to go ahead and use that in OpenCL, perhaps modify it in OpenCL. So that brings up the question, OK, wait, I'm going to modify in CL.
Do I need to do a lock and unlock? No, you do not need to do that because OpenCL is asynchronous. We're going to do this lock at the time and the place where it's most appropriate. OK, so you don't need to worry about doing that. We'll take care of that.
So what do you need to do if you're going to modify this in CL? What is your responsibility? Not much. Again, you just kind of treat it as any image memory object in OpenCL. Do whatever you need to do, kernels, whatnot. Okay, then you call clFlush. Just make sure you call clFlush. And then back on the application side, before you access that data, even if you're only reading, make sure you call IOSurfaceLock, and unlock when you're done.
And you can see from the second parameter here to IOSurfaceLock, we're doing a read-only lock. You still need to do that. That's IOSurface's chance to say, okay, what else in the system has touched this shared memory? If I need to move it around or copy or make sure it's clean, that's my chance to do that. So just make sure you do your lock and you're all right.
One other feature that's new in Lion that we've added is support for YUV IOSurfaces in OpenCL. So if you guys are video processing people out there, and you're using YUV format images, you're gonna be pretty happy with this. So you can create an OpenCL image from a YUV IOSurface just using the mechanism I just described. And we have this new image format for you. There's two different channel orders depending on how your data's packed.
And then we support three data types. And then within your CL kernel, you use these as you would any other image object using the OpenCL built-ins to access your data. You have read image, write image, you're going to get back or write the YUV data for one pixel.
So that's IOSurfaces in OpenCL. So that brings us to our last topic of the session, which is another new feature in Lion, which is the auto-vectorizer. Okay, so to sort of set the stage for this, let's talk about parallelism in CPUs versus GPUs. Okay? OpenCL sees all of these devices as having a number of compute cores, each with a number of processing elements in it.
So what I've drawn on the slide is we have a CPU with four cores, and that CPU has one processing element in each core. Okay, but this is a big, beefy processing element. It's got a nice vector unit on it, out-of-order execution, superscalar, right? But on the other side, I've drawn a GPU. It has eight cores, and it has 16 scalar processors in each core. So these are, like, you know, in a sense, wimpier, not as tough as this nice CPU core, but there's a lot more of them.
So what does that mean for us as a programmer? All right, let's say you write this kernel, and this is the same kernel that Abe showed earlier, this, you know, add arrays kernel. Now, notice we're passing into this kernel floating point arrays, floats, single floats. So when we do this addition, it's a scalar addition operating on one data element at a time.
Okay, well, great. What does that mean? All right, you send it to the CPU and the GPU. Now, notice the GPU lights up completely. All of these little scalar processors are going to be really busy. But the CPU, all the cores are busy, but that processing element, we're only using one quarter of the vector width. If you were to look at Activity Monitor, it would tell you, "Oh, yeah, I'm completely busy. All the cores are running." It's not really true, because you're only using a quarter of your vector width.
So what can you do? Well, you can write a specialized kernel, all right? So what's different about the specialized kernel? Well, now I'm passing in float fours. I've vectorized my kernel. And that means that addition in the kernel is going to be a vector addition. Okay, so I send that to CPU only, specialized for that device. Oh, now, look, my vector unit's all lit up, and I've extracted all the firepower from my machine.
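The two kernels being contrasted can be sketched in plain C. This is not OpenCL C: `float4` here is a hand-rolled stand-in struct, the global ID is passed as an ordinary parameter, and the function names are made up for illustration. The point is only that the hand-vectorized version does four additions per work item, so it needs a quarter as many work items.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for OpenCL's float4 so this compiles as plain C. */
typedef struct { float x, y, z, w; } float4;

/* Scalar kernel body: one work item adds one element. */
void add_arrays_scalar(const float *a, const float *b, float *c, size_t gid)
{
    c[gid] = a[gid] + b[gid];
}

/* Hand-vectorized kernel body: one work item adds four elements. */
void add_arrays_float4(const float4 *a, const float4 *b, float4 *c, size_t gid)
{
    c[gid].x = a[gid].x + b[gid].x;
    c[gid].y = a[gid].y + b[gid].y;
    c[gid].z = a[gid].z + b[gid].z;
    c[gid].w = a[gid].w + b[gid].w;
}

/* Work items needed to cover n elements with each version. */
size_t work_items_scalar(size_t n) { return n; }

size_t work_items_float4(size_t n)
{
    assert(n % 4 == 0); /* this sketch ignores leftover elements */
    return n / 4;
}
```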
Okay, so now I know how to get all the performance out of both devices, but that leads us to this terrible conclusion, which is that we have to write multiple kernels. This is no good. You would tell me, I don't want to write multiple kernels. I should write one CL kernel and get good performance. That's what OpenCL is supposed to do for me. Give me good performance on all devices. And you would rightfully say, it's easier to write a scalar kernel.
Any of you guys who have done vector programming know that there are weird edge cases to watch out for. It gets just a lot harder than writing scalar code. And this should just work. You should just take care of this for us. So we agree with you, and to tell you about how we did that, I'd like to invite to the stage Sion Berkowits from Intel to talk about the auto-vectorizer.
So I'll start with, as Jim said, a little bit about what it really takes to write kernels for the CPU these days. So you start with a kernel, which is a straightforward implementation of your algorithm or your problem domain. And then when you want to run it on the CPU, you see that it's underperforming. So you need to add some optimizations to it. And you really start that by adding a loop over the kernel code to execute several work items together.
And this gives you some performance, but again, this still underutilizes the vector or SIMD units. So you're going to optimize again and add execution of multiple parallel work items to utilize the different lanes in the vector. So you get code, which is somewhat far from your original algorithm, is harder to implement and maintain, and is less likely to be scalable among devices.
So what you'd really like is for all of this to be done automatically for you. And this is where the Intel auto-vectorizer comes to help. The auto-vectorizer is a compiler optimization, which is run by default whenever you compile OpenCL kernels for the CPU. What it does is it packs together several work items into the vector instructions, and it also generates a loop over the entire work group, which allows using more complex optimizations, such as loop-invariant code motion and strength reduction, and also gives you the added benefit that the maximum work group size running on the CPU grows from one to 128. All of this allows you to achieve a speedup of up to 4x without any additional developer effort.
So I'll show you a short, simple example of how the vectorizer works. You have a kernel code, which represents a single work item, and you have many such work items going to run in parallel in your application. So let's visualize this a little better. So basically, the kernel code is a bunch of instructions, some of them scalar, some of them maybe vector instructions. So the first thing that the vectorizer does is it scalarizes all of the vector instructions.
Now, once this is done, the vectorizer continues by packing several of these work items into the vector data types, each lane representing a different work item. As an added benefit of this, the number of invocations that actually happen is reduced. And once this is done, the vectorizer adds a loop over the kernel to execute the entire work group.
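The packing and looping steps can be pictured with a plain C sketch. This is only a model of the idea, not what the Intel compiler actually emits. The kernel, the function names, and the assumed width of four lanes are all made up for illustration, and the inner loop stands in for a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* assumed vector width: four 32-bit floats */

/* What the developer writes: the body for a single work item. */
void scale_kernel(const float *in, float *out, float gain, size_t gid)
{
    out[gid] = in[gid] * gain;
}

/* Roughly the shape after packing and looping: one call covers a whole
 * work group, and each outer iteration handles LANES work items. */
void scale_workgroup_packed(const float *in, float *out, float gain,
                            size_t first_gid, size_t group_size)
{
    assert(group_size % LANES == 0); /* this sketch ignores remainders */
    for (size_t i = 0; i < group_size; i += LANES) {
        /* Stands in for one vector instruction: each lane is a
         * different work item. */
        for (size_t lane = 0; lane < LANES; ++lane) {
            size_t gid = first_gid + i + lane;
            out[gid] = in[gid] * gain;
        }
    }
}

/* Kernel-body invocations per work group before and after packing. */
size_t invocations_scalar(size_t group_size) { return group_size; }
size_t invocations_packed(size_t group_size) { return group_size / LANES; }
```

For a work group of 128 items, this model goes from 128 scalar invocations to 32 packed iterations, which is where the reduction in invocations comes from.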
Now, a little bit about the do's and don'ts of writing optimal code for the CPU with this vectorizer. So what you should do is not much. Keep it simple. Write your code as best fits the algorithm of your problem's domain. Don't try to add optimizations that are device or target specific into your kernel. Also, avoid as much as possible using control flow which is work item ID dependent, as this is kind of a vectorizer killer. And let the auto-vectorizer do the optimization work for you, basically.
So I'll show a demo of the auto-vectorizer. So this is an example of a popular scenario where you use OpenCL filters, filters written in OpenCL, to process like a movie. Now, since the processor is relatively strong, we're going to do many of them in parallel. So I'm going to start them. And as you can see on the top right, the frames per second goes down, the more filters I process at the same time.
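To see why work-item-ID-dependent control flow is awkward once work items share a vector, here is a plain C sketch. It reflects the general way SIMD code handles divergent branches and is not a description of the Intel vectorizer's internals. The kernel and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* assumed vector width */

/* Work-item-ID-dependent branch: even and odd work items take
 * different paths. */
float divergent_item(float v, size_t gid)
{
    if (gid % 2 == 0)
        return v * 2.0f;
    return v + 1.0f;
}

/* Packed form of the same branch: lanes that share a vector cannot go
 * separate ways, so both sides are computed for every lane and a
 * per-lane mask picks the result. */
void divergent_packed(const float *in, float *out, size_t first_gid)
{
    assert(in != NULL && out != NULL);
    for (size_t lane = 0; lane < LANES; ++lane) {
        size_t gid = first_gid + lane;
        float even_side = in[gid] * 2.0f; /* computed for all lanes */
        float odd_side = in[gid] + 1.0f;  /* computed for all lanes */
        int mask = (gid % 2 == 0);
        out[gid] = mask ? even_side : odd_side;
    }
}
```

Both branches cost time for every lane here, so the packed version does more arithmetic than the scalar one for the same results.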
And right now the movie has slowed down to practically a crawling pace. So all I have to do is I flip on the vectorizer switch, and here I get a 4x speedup more or less, and the movie is again running at a very good speed. So I'll turn back to James.
So, let's wrap up the session. What did we talk about today? Okay, so Abe talked to you about how using OpenCL on Lion is really easier than ever. This new integration with Xcode that we offer and the ability to send kernels to Grand Central Dispatch queues is going to get you up and running with CL faster than ever before if you're new to CL. And it's just an easier way to use OpenCL on Lion.
Sion just told you about the auto-vectorizer, which is great news that you can now just write scalar code and get the best possible performance on your CPU. And he showed you video processing, which is typically an application which runs in the domain of GPUs, but we get really good performance on the CPU as well. So that's great. Abe also talked to you about the offline compiler, so if you don't want to ship your human readable source with your program now, you don't have to. So for those of you who are asking for that, you know, you're welcome.
I talked to you about easy and efficient sharing that you can do between OpenCL and OpenGL, and also about how to share data using IOSurface across APIs. So we really encourage you to check that out if you haven't before. So for more information about anything you saw, you should contact Allan Schaffer. He's our graphics and game technologies evangelist, and his email's up there on the slide. Or you can go to the Apple Developer Forums.
So if you found this session interesting, there's another GL session which you might like a lot. Okay, that's going to be Thursday, 10:15 in Mission, which is the room right behind us. And that's called "Advances in OpenGL for Mac OS X Lion." So I encourage you to check that out. And with that, I'd like to thank you for coming today.