Graphics, Media, and Games • OS X • 54:11
OpenCL dramatically accelerates computationally intensive tasks by letting your application tap into the vast computing power of modern GPUs and multicore CPUs. Understand the OpenCL architecture and get all the details about OpenCL features in OS X Mountain Lion. See how auto-vectorization makes it easy for more apps to adopt OpenCL and gain specific insight into tuning your code for optimum performance.
Speakers: Eric Bainville, Sion Berkowits, Anna Tikhonova, Russell Williams
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.
Good afternoon and welcome. My name is Anna Tikhonova. I'm an engineer on the OpenCL team. Let's talk about adopting OpenCL in your application. So some of you in the audience might be new to OpenCL. So let's begin with what is OpenCL and why should you use it? So today's Macs have a multicore CPU with a vector unit and one or more GPUs.
If you want to take advantage of all this different hardware in your system, it can be quite a challenge. Because the programming approaches for the multicore CPUs and for the GPUs are quite different. You will actually have to write different versions of your code for the CPU and the GPU.
So specifically, to take advantage of a CPU, you would need to split up the work across multiple CPU cores using Grand Central Dispatch. And to utilize the vector unit, you would need to write SSE or AVX code, depending on the instruction set. And then this multi-threaded, vectorized, optimized CPU code will not run on the GPU.
You would need to write a separate, scalar version of the code for the GPU. And I don't know about you, but this sounds like a lot of work to me. But OpenCL is here to help you. OpenCL is a technology which enables you to write parallel code which will run efficiently on both the CPU and the GPU.
The language for writing OpenCL programs is C-based, which is good because most of you are familiar with C, and if you're new to OpenCL, you will find it very easy to get started. The common code you will write for the CPU and the GPU will be scalar in nature, which is good because you have to write just scalar code.
But you may be thinking, how could this possibly give me the best performance on the CPU? Well, OpenCL does the work under the hood for you with the Intel Auto-Vectorizer. Your code will be compiled to fully utilize the vector unit, and you will get a performance boost without doing any extra work. OpenCL will also take care of scheduling the work across multiple CPU cores for you.
So this is of course a very brief discussion of what is OpenCL and what makes it so powerful. If you're new to OpenCL, if you need more introductory level information, at the end of this talk we will refer you to some additional resources. But for now, let's just take a look at what's on the agenda today. We have some new and very exciting things for you in Mountain Lion.
First, I will talk about what's new in OpenCL 1.2, and then Sion Berkowits from Intel will talk about the improvements to the Intel Auto-Vectorizer in Mountain Lion. And then my colleague Eric will talk about taking C code and turning it into optimized OpenCL code. His talk will be followed by a presentation by Russell Williams and David McGavran from Adobe. They will talk about how they use OpenCL in the recently released Adobe Creative Suite 6. So, let's get started.
The first 1.2 feature I will talk about has to do with the way you compile OpenCL programs. But before I tell you exactly what it is, let's just quickly review how you usually compile OpenCL programs. You have two options right now. The first one is online compilation. So imagine you wrote some OpenCL program and you need to compile it for both the CPU and the GPU. To do this, at runtime, you pass this OpenCL program to the OpenCL compiler as a plain text character string. And then the compiler will build it for you into device-specific executables.
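The slide code isn't captured in the transcript, but the online path is just a couple of API calls. A minimal sketch (the kernel source, names, and omission of error handling are choices made here for illustration):

```c
#include <OpenCL/opencl.h>

// Build OpenCL C source at runtime for every device in the context.
static cl_kernel build_add_kernel(cl_context context,
                                  cl_uint num_devices,
                                  const cl_device_id *devices) {
    const char *source =
        "__kernel void add(__global float *a, __global const float *b) {\n"
        "    size_t i = get_global_id(0);\n"
        "    a[i] += b[i];\n"
        "}\n";

    cl_int err;
    // The program is handed to the compiler as a plain text string...
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    // ...and built into device-specific executables at runtime.
    err = clBuildProgram(program, num_devices, devices, NULL, NULL, NULL);
    return clCreateKernel(program, "add", &err);
}
```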
Your second option is offline compilation. This option is specifically for those of you who prefer to not ship source. In this scenario, you first use the offline OpenCL compiler to compile the OpenCL code into bitcode. And then at runtime, instead of loading plain text source code, just pass the bitcode to the compiler and the compiler will build device-specific executables for you. And let me show you exactly how to do this. You can find the standalone OpenCL compiler in the OpenCL framework. And this is the path. And this is the command line we'd like you to use. Please use the arch flag to specify which architectures you're building for.
And, you know, we've shown you this before, but what's new in Mountain Lion is that we now guarantee that if you build your bitcode files in Mountain Lion, you'll be able to run your programs on any OS starting with Mountain Lion. And in Mountain Lion, we're also bringing the traditional compilation and linking model to OpenCL. And one of the cool things this allows you to do is to create OpenCL libraries. And you can use these libraries in other OpenCL programs.
So how do you do that? How do you create an OpenCL library? Well, just as you would expect. Imagine you have some OpenCL source code, for example several collections of image processing filters, and you'd like to create a library of these filters. So you just compile your code into intermediate object files and then link them together. And during the linking stage, you have two options: the first one is to generate device-specific executables, as before, or you can create an OpenCL library.
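In API terms this maps onto the OpenCL 1.2 calls clCompileProgram and clLinkProgram. A rough sketch of both creating a library and, as described next, linking it into another program (the program objects are assumed to have been created from source already, and error handling is omitted):

```c
#include <OpenCL/opencl.h>

// Compile two filter sources into objects and link them into a reusable library.
static cl_program build_filter_library(cl_context context,
                                       cl_uint num_devices,
                                       const cl_device_id *devices,
                                       cl_program filters_a,
                                       cl_program filters_b) {
    clCompileProgram(filters_a, num_devices, devices, NULL, 0, NULL, NULL, NULL, NULL);
    clCompileProgram(filters_b, num_devices, devices, NULL, 0, NULL, NULL, NULL, NULL);

    cl_program objects[] = { filters_a, filters_b };
    cl_int err;
    // "-create-library" makes the linker emit a library instead of executables.
    return clLinkProgram(context, num_devices, devices, "-create-library",
                         2, objects, NULL, NULL, &err);
}

// Later, link the library into another compiled program to get
// device-specific executables.
static cl_program link_with_library(cl_context context,
                                    cl_uint num_devices,
                                    const cl_device_id *devices,
                                    cl_program my_program,   // already compiled
                                    cl_program filter_lib) {
    cl_program inputs[] = { my_program, filter_lib };
    cl_int err;
    return clLinkProgram(context, num_devices, devices, NULL,
                         2, inputs, NULL, NULL, &err);
}
```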
So how do you use this library in other OpenCL programs? Well, once again, just as you would expect: just include a header file for this OpenCL library in your new OpenCL program, compile the code for this program, and then, during the linking stage, link in your library. And then tell the compiler to generate device-specific executables for you. And that's it, you're done. So that's it for this feature. What else is new in Mountain Lion? You're now also able to call printf within your kernels.
When the kernel finishes executing, the output of all the printf calls will be flushed to standard out. But there's one thing you should be aware of when using printf in OpenCL. If you have many work items executing in parallel and all of them are calling printf, the order of the printf output for the different work items is not guaranteed.
So, for example, you can't expect that the work item with global ID zero will print to standard out first. Let's look at an example. Imagine you have a whole bunch of work items executing in parallel and all of them are computing an index into some global data buffer. And you're trying to use printf to print out these indices to standard out.
And this is the kind of output you will see. So if you look at this, this is not very helpful. What is going on here? Which work item is outputting which index? You don't know. So we recommend that you always print out the global work item ID along with the data that you're printing out. And this way you can really figure out what's going on.
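For example, a kernel along these lines (the index computation and names here are placeholders, not the code from the slides):

```c
__kernel void compute_indices(__global const int *offsets, __global int *out) {
    int gid = (int)get_global_id(0);
    int index = offsets[gid] * 2;   // some per-work-item index computation

    // Print the global work item ID along with the data, so the interleaved
    // output from many parallel work items can still be attributed.
    printf("work item %d: index = %d\n", gid, index);

    out[gid] = index;
}
```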
Another cool thing about printf in OpenCL is that it supports vector types. So here are a few examples of how you can use printf to output vector data. Please note the vector specifier v, which is followed by the number 4. This number stands for the size of the vector you're printing out.
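The calls on the slide aren't in the transcript, but per the OpenCL 1.2 spec they look roughly like this (the v4 vector specifier is combined with a length modifier such as hl for 32-bit elements; the exact form shown on the slide may have differed slightly):

```c
__kernel void print_vectors(void) {
    float4 f = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    int4   i = (int4)(10, 20, 30, 40);

    printf("f = %v4hlf\n", f);   // the four float components, comma-separated
    printf("i = %v4hld\n", i);   // the four int components, comma-separated
}
```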
And here is the output for these different printf calls. Please note that the data for the different vector components is separated by commas. So this is it for printf. What else is new in Mountain Lion? In Lion, you were only able to overload built-in OpenCL functions. And now we enable you to overload functions that you write. Just use the overloadable attribute. So imagine you're working on a function that graphs data. You can do something like this to have one version that handles integer input and another version that handles floating-point data. Or you can use the overloadable attribute to give both versions the same name.
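In code, the two approaches look roughly like this (the function name graph and its signatures are made up for illustration):

```c
// Without overloading: two differently named entry points.
void graph_int(__global const int *data, int n);
void graph_float(__global const float *data, int n);

// With the overloadable attribute: both versions share the name "graph",
// and the compiler picks the right one from the argument types.
void __attribute__((overloadable)) graph(__global const int *data, int n);
void __attribute__((overloadable)) graph(__global const float *data, int n);
```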
And this will make your code more clear and more readable. Also, when you're using OpenCL, it's always better if you give us more information about what you're doing so that we can possibly give you better performance under the hood. So for example, there's some commonly used operations such as filling memory. You're probably using kernels to do this right now.
In Mountain Lion, we also have some new convenient APIs for you: one to fill a buffer with a pattern and another one to fill an image with a color. If you use these APIs, we know what you're doing, and we can possibly give you really good performance.
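A rough sketch of those two calls (the queue, memory objects, and sizes are assumed to exist already; error checking is omitted):

```c
#include <OpenCL/opencl.h>

// Clear a float buffer and an RGBA float image without writing a kernel.
static void clear_objects(cl_command_queue queue, cl_mem buffer, cl_mem image,
                          size_t buffer_size_bytes, size_t width, size_t height) {
    // Fill the buffer with a repeated pattern, here a single float zero.
    float zero = 0.0f;
    clEnqueueFillBuffer(queue, buffer, &zero, sizeof(zero),
                        0, buffer_size_bytes, 0, NULL, NULL);

    // Fill the whole image with opaque red.
    float red[4]     = { 1.0f, 0.0f, 0.0f, 1.0f };
    size_t origin[3] = { 0, 0, 0 };
    size_t region[3] = { width, height, 1 };
    clEnqueueFillImage(queue, image, red, origin, region, 0, NULL, NULL);
}
```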
Also, give us more information about how you're using your memory objects. So for example, if you know that you're only going to be writing to a memory object from the host, tell us with this flag, and we will know that the data for this memory object does not need to be cached: because you're not going to be reading it (we won't let you), the cache can be used more efficiently for other memory objects that you will be reading or writing.
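For example, a staging buffer that the host only writes and kernels only read might be created like this (a sketch; this flag combination is one reasonable choice, not necessarily the one on the slide):

```c
#include <OpenCL/opencl.h>

// A buffer that kernels will only read and the host will only write.
static cl_mem create_upload_buffer(cl_context context, size_t size_bytes) {
    cl_int err;
    return clCreateBuffer(context,
                          CL_MEM_READ_ONLY |        /* kernels only read it  */
                          CL_MEM_HOST_WRITE_ONLY,   /* the host only writes it */
                          size_bytes, NULL, &err);
}
```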
Also, if you are modifying a memory object on the GPU and you're also enqueuing a map command to modify it on the CPU, and you know that you're using this map command to overwrite the data in the mapped region, tell us with this flag.
And we will know that if the most up-to-date data is on the GPU, we don't need to copy it to the CPU as we would normally do because you're just going to overwrite it. It's just going to be a waste of resources. And finally, we've deprecated some APIs on you.
We replaced them with new ones. So check the new headers, use the new APIs, and if you forget, the compiler will remind you. So this is all for what's new in OpenCL 1.2. And now I will invite Sion Berkowits to talk about the improvements to the Intel Auto-Vectorizer in Mountain Lion.
So hi, everybody. I'm Sion. I work at Intel on developing the auto-vectorizer, which I'm going to tell you about now. So I'm going to start with a short recap of what the auto-vectorizer is. Then I'll tell you about what's new in Mountain Lion. I'll explain a little bit about how the auto-vectorizer actually works. And I'll give a few tips about how to write code efficiently in the presence of the vectorizer. Then I'll show a short demo.
So if you've already developed some OpenCL code for the CPU, you must have seen that getting performance out of it is not so trivial. It requires adding some optimizations which are target-specific, most notably dealing with vectors efficiently. The code also loses its performance portability, meaning that different targets or different architectures may require somewhat different optimizations to be really optimal.
So that's why we introduced the auto-vectorizer. It's basically a CPU compiler optimization for OpenCL, which was introduced in OpenCL for Lion. And it basically adds for you the optimization to use the vector registers or SIMD registers in the architecture. The way this works is the auto-vectorizer basically packs together several work items into the vector registers. Each vector lane represents a different work item of your work.
Some of the features of the auto-vectorizer as it existed in Lion: again, it runs by default when you compile for the CPU. It does not require any user intervention to actually run. It works well in the presence of both scalar and vector operations that the user added to the kernel. However, it only worked in the absence of control flow.
So, introducing the new auto-vectorizer in Mountain Lion. The major addition that we did was the introduction of support for control flow. Now when I say control flow, I mean more or less all the known styles, if-then-else blocks, loops, for loops, while loops, every nested form of those. Also, the vectorizer automatically optimizes the code to the underlying CPU architecture, and by that providing significant speedup compared to non-vectorized code. However, the speedup itself may vary depending on the use of control flow in the code. Now, to explain why that is, I'll go now into some more details of how the auto-vectorizer actually works dealing with control flow.
I start with a simple example of an if-then-else block. So the problem with standard if-then-else code is that different work items of your OpenCL application may choose a different path in the code. And since the vectorizer tries to pack several work items into a single instruction, it has a problem there. How can it pack them when they are going in different directions? And the way this should work is that the auto-vectorizer should basically execute both the then and the else side of the statement.
And to do this, the vectorizer serializes the code and executes both sides. However, unneeded calculations along the way are discarded, and in the case of instructions with side effects, such as any kind of memory access, those instructions are masked so that the work items which did not take that path avoid them.
However, there is one case of uniform control flow in which case all the work items go in the exact same direction and this is detectable during compilation and then the code itself is not modified as it works just as is. Now, in the case of loops, we have a slightly different problem. Different work items may actually iterate over a loop for a different amount of iterations. In the example here, you can see that you have a loop where every work item basically iterates the amount equal to its ID.
So how can you pack several work items into instructions when you have this thing? So, basically what the auto-vectorizer does in the case of loops, it iterates over the loop for an amount which is enough for all the work items it packed, and it continues iterating until all the work items finish their work, and it will only continue executing past the loop after all the work items have finished doing their loop iterations.
So now I'll give a couple of tips for programming efficiently in the presence of the auto-vectorizer. So one thing is about memory access in control flow. So as I said earlier, memory accesses in control flow need to be masked because some work items may have to avoid them. So adding this masking adds overhead and basically penalizes performance.
So the recommendation is avoid as much as possible having the explicit memory accesses inside the control flow. Try to move them out. As you can see in the small example here, there is a memory write inside both the then and the else side of the if, and that can be moved outside the control flow statement and replaced with a temporary write to register.
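In kernel code, the transformed version looks something like this (the variable names stand in for the slide's example):

```c
__kernel void select_value(__global const float *in, __global float *out,
                           float a, float b) {
    size_t gid = get_global_id(0);
    float x = in[gid];

    // Instead of writing out[gid] = a; in the then branch and out[gid] = b;
    // in the else branch (two masked stores), keep the branch on a register
    // and do a single unconditional store afterwards.
    float result;
    if (x > 0.0f) {
        result = a;
    } else {
        result = b;
    }
    out[gid] = result;
}
```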
The second tip is about row-wise data access. The way the auto-vectorizer works is it packs together several work items which have consecutive indices, global IDs, in global dimension zero. That means that when the kernel accesses array elements, for example, it is preferred to access consecutive array elements in consecutive work items.
As you can see in the small example here, access to array A is consecutive, while in the case of array B, the access is strided. That means that when the code is vectorized, the vectorizer can basically emit a single vector load in the case of array A. However, in the case of array B, it has to do several scalar loads and gather the results into a vector register.
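Something along these lines (the names and the stride are placeholders for the slide's example):

```c
__kernel void access_patterns(__global const float *A, __global const float *B,
                              __global float *out, int stride) {
    size_t gid = get_global_id(0);   // consecutive work items get gid, gid+1, ...

    float a = A[gid];            // consecutive elements: one vector load
    float b = B[gid * stride];   // strided elements: scalar loads plus a gather

    out[gid] = a + b;
}
```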
So now I'll show a short demo of the new vectorizer. So I start with more or less the same example that we showed last year for the vectorizer. This is nine high-definition movies running in parallel, and we apply some image filters on them. Now, as you can see on the frames-per-second bar on the top, the frame rate doesn't decrease too much in the presence of these filters.
However, this time around, we added a new filter, which is a wave simulation, a kind of physics simulation. Now, this contains a lot of control flow, and the old vectorizer does not work on it. And as you can see, performance goes down tremendously. If I try to use this, you can see that the wave is slowly moving on the screen. If I do some more, I can slow it even more. Now, with the new vectorizer, you just get a speedup by also vectorizing the control flow in the new filter. And with this, I can even turn on the rain feature, which basically adds a whole lot of raindrops all together. And as you can see, even with that, performance is hardly impacted at all.
So to summarize, the auto-vectorizer is an optimization that exists in OpenCL running behind the scenes. You don't need to do anything to turn it on. And it basically optimizes your code. And now in Mountain Lion, it optimizes it even if it includes a complex control flow inside, giving you performance. So thank you. And I'd like to invite Eric to talk about OpenCL optimizations.
Good afternoon. My name is Eric and today I will tell you about OpenCL code tuning. Let me show you some numbers. So this is the running speed of a Gaussian blur algorithm on a 16-megapixel image. I will tell you details about this algorithm later. This first one is what we get when we run it.
So, just a straightforward implementation of it on the CPU. And then you take the same code and port it to OpenCL and you get this. And after a few steps of tuning, you can reach that. Yeah. And this is what I'm going to tell you about now.
Okay, let's go. So to do this we'll follow this very simple program. Yeah, it reminds me when -- okay, anyway. So first we'll choose a suitable algorithm. Then code it to OpenCL, benchmark it and decide if it's fast enough or not. And if it's not fast enough, just identify why and find a way to fix it and start all over to line 10. So first we need to know what fast enough means and how to choose a suitable algorithm. So to know what is fast enough, the simplest way is just to benchmark code on the GPU.
And we will use very simple kernels. So this is what I call the copy kernel. It takes an image stored in a buffer, reads it from the input buffer (the in pointer here), and writes it to the output buffer. Okay. So, this function is run for all pixels of the image. So, we will schedule 16 million of them.
And the first thing we do inside the function is to get which pixel we should process. This is X and Y. And then we just read one value, one float value, from the input buffer and write it to the output buffer. Benchmarking this, we will get the best speed we can expect from the GPU. All right? And then there's a MAD kernel. MAD is for multiply plus add.
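The exact slide code isn't in the transcript, but the two benchmark kernels have roughly this shape (float data, a row-major buffer of width w, and the particular arithmetic are assumptions):

```c
// Copy kernel: one work item per pixel, one read and one write.
__kernel void copy(__global const float *in, __global float *out, int w) {
    int x = (int)get_global_id(0);
    int y = (int)get_global_id(1);
    out[y * w + x] = in[y * w + x];
}

// MAD kernel: same memory traffic plus a few multiply-adds per pixel;
// the benchmark varies how many of these operations are done.
__kernel void mad_kernel(__global const float *in, __global float *out, int w) {
    int x = (int)get_global_id(0);
    int y = (int)get_global_id(1);
    float v = in[y * w + x];
    v = mad(v, 1.0001f, 0.5f);
    v = mad(v, 1.0001f, 0.5f);
    v = mad(v, 1.0001f, 0.5f);
    out[y * w + x] = v;
}
```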
So, we read the value like we did before and do a few floating point operations on it, like three here, before storing it back to the output buffer. And we'll change the number of floating point operations we do. Okay, let's benchmark now. So this is on a Mac Pro AMD GPU. And this is what we get with the copy kernel: about 13.5 gigapixels per second.
OK, now let's add the floating point operations. Okay, this is with three floating-point operations. As you can see, we get almost the same speed, which means that actually the MAD kernel is memory-bound: we are still waiting for memory and the floating-point operations are free. Okay, let's add more. Six. Same. Twelve.
18. Okay. And then we finally start to see something with 24 floating point operations. So it means that what we want for a suitable algorithm is, first, for it to be embarrassingly parallel. It means you process the same thing for all pixels of the image. Then, as we have seen, usually memory will be the first bottleneck, so you want to minimize memory accesses. And then you want to increase the compute-to-memory ratio.
And then from this benchmark we can get an estimate of the running speed of our code. Okay, so now we are ready to choose an algorithm. And I will present you some Gaussian blur algorithms. This is actually the definition of the Gaussian blur. So the cute green curve is a Gaussian with sigma equal to five.
And so to compute the convolution we need in this case a 31 by 31 filter. So at the end, if you use this definition to compute it, we'll do almost 1,000 image accesses and 2,000 floating point operations per pixel. This is huge. Fortunately, the Gaussian filter is separable.
It means that it's a product of two one-dimensional filters. And in this case, we can compute the same thing by processing two one-dimensional convolutions, one horizontal on each row and one vertical on each column, and we get the same thing. And the cost in memory and flops is much lower.
And then there is a third algorithm which is called Recursive Gaussian Filter. This one does not compute the exact Gaussian blur actually, but a good approximation of it. So it works sequentially on each row and we have this kind of green spider moving from left to right. And the first pass will take the input, the red line, and produce a temporary output, the blue thing. And then it will come back from the right to the left, read the input again and the blue thing again and produce the output, the yellow curve.
Okay, and this... Okay, to apply this to a two-dimensional image, you will then need four of those passes, four of those spiders: one spider for each row from left to right and then right to left, and the same thing vertically. And so the final cost for this one is 10 memory accesses and 64 flops per pixel.
Okay, this thing finishes? Okay, done. So I've shown you three algorithms for Gaussian blur. I put them back all together in this table with the copy kernel first. So the first column is the number of memory accesses per pixel. Then you have the number of flops and the compute-to-memory ratio.
And the last column is very interesting. It's the estimated running speed obtained from the copy kernel. So you see the last one does five times more memory accesses than the copy kernel. So it's one-fifth of the speed. Actually, this is the one we'll be implementing, obviously. It's supposed to be faster. Okay, let's do it. That's line 10.
This is the code. I try to simplify it, so the gray lines... We will run one of these functions for each row of the image. This is the horizontal pass. And the first thing we do in this kernel is to figure out which row we must update. This is Y.
And then we'll do a loop on X from left to right and from right to left. At each step of the loop, we load one input value and store one output value. And there's a backward loop below it. And this is the vertical kernel. It does exactly the same thing but with X and Y exchanged. So we run one of these for each column of the image. Right, now we have the code. Let's benchmark it.
That's 27 million pixels per second. Didn't I say 2,800? Well, so clearly something is slow here. So let's see in detail the respective speeds of the two passes, horizontal and vertical. The vertical pass is quite fast, at almost 2 gigapixels per second. But we have an issue with the horizontal pass. It's incredibly slow. So why? The usual suspect in this case is the memory access pattern.
So imagine inside the GPU you have a schedule of 16 million functions to call, and they will be called by groups of 300 at the same time. And you will have these 300 work items requesting memory access with a different address. This is what I call the memory access pattern.
And so the hardware is optimized for certain accesses and it does it very fast. And in the other cases, accesses will be conflicting and then they will be serialized and running time will be very slow. So how to identify these patterns? We can show some simple rules. So this is a fast case.
This is when each consecutive work items will access consecutive addresses in the buffers. This is the case the hardware goes very fast. We want this. And for image processing, this is when consecutive work items access consecutive pixels in the same row. And this is a slow case. It's almost the same, but we access the buffer with a stride. Here it's 1024.
So in this case, all hardware requests will end up in the same memory channel or bank and we'll have conflicts and the accesses will be serialized. It will be extremely slow. So in image processing, this is when consecutive work items access consecutive pixels in the same column. Right. So let's see how we can... Oh yeah. And then in the other cases, you just need to benchmark the code to see in which case you are.
So let's see if I can identify the patterns in the kernels we have just seen. This is the copy kernel. So as you can see, consecutive work items will have consecutive values of x. This is global ID 0. And then in this case, access to the input buffer will be fast because we access consecutive values, consecutive addresses in the buffer. And the access to the output buffer will be fast. So now let's look at the vertical pass. In this case again, consecutive work items will have consecutive values of X, and we will do fast accesses to the input and output buffers.
Now, let's have a look at the horizontal one. So in this case, consecutive work items will have consecutive values of y. And as you can see, the stride for the access is w in this case. So when w is large, we are in the slow case. This is what we have seen actually, and this is why the horizontal kernel is slow.
Okay, so we have identified the bottleneck. Now, how do we solve it? Actually, there is a very simple solution in this case, just to drop it. We can use the vertical one and do transpose after it, and then transpose again after another vertical pass, like this. So you do vertical pass, transpose, do it again, and transpose. So if we assume we can do a fast transpose, this is supposed to be faster than the horizontal plus vertical, okay? So now we need a transpose kernel.
Yeah, so we can update our estimate of the running time. So we will do two more passes. Transpose will run with two memory accesses. We do that two times, so that's 14 memory accesses. And the estimate drops to 2,000 million pixels per second. So how do we transpose? Actually, transposing is just like copying. So we read at x, y and store at y, x instead of x, y. So it's almost the same kernel as the copy kernel.
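A sketch of that naive transpose (w and h are the image width and height, and the output is stored h pixels wide; the actual slide code isn't in the transcript):

```c
// Read at (x, y), write at (y, x). Reads from consecutive work items are
// consecutive in memory, but writes are strided by h floats.
__kernel void transpose_naive(__global const float *in, __global float *out,
                              int w, int h) {
    int x = (int)get_global_id(0);
    int y = (int)get_global_id(1);
    out[x * h + y] = in[y * w + x];
}
```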
All right. And in this case, the access to the input buffer will be fast, but the access to the output buffer will be slow because it will be done with a stride of H. So let's benchmark it. Okay, and we see exactly what we expect: once H becomes larger, it gets slower.
So how do we solve this? Another usual solution to this issue is to move the problem to a faster memory level. Inside the GPU, we have processing cores, that's the yellow boxes here, and each core has arithmetic units, registers, local memory, and everything is connected to a global memory. And then it's connected to the host, the Mac.
Okay, the speed of this connection is 10 gigabytes per second. It's pretty slow actually. And then each layer of memory is 10 times faster than the previous one, more or less. So that would be 100 gigabytes per second for the global memory, 1,000 for the local one, and 10,000 for registers.
So the moral of this is, when your data is moved to the upper levels, you want to reuse it as much as you can. And also you want to avoid host-to-device transfers. So what we will do is move our problem to the local memory. Just like this. So we'll have a workgroup, a block of work items, loading a small block of the image, storing it in local memory, and then, when all the work items in the group are finished doing that, moving it back again to the output buffer.
Okay, let's write the code for this. It becomes a little longer, but it's not very complicated. BX and BY will be the workgroup coordinates and they will be mapped to blocks in the image. And IX and IY are the pixel coordinates inside the block. And then each work item will load one value from the input buffer to a temporary local memory buffer. Then we'll wait for all work items to do that. That's the barrier call. And then we'll store the temporary buffer back to the global memory output. And in this case, both accesses to global memory are fast,
because consecutive work items will have consecutive values of ix. Okay, let's benchmark it. This might be faster. Oh, sabotage. It's not fast at all. So yeah, why? We must now consider globally what happens inside the GPU. So we have all these workgroups, and each workgroup will be mapped to a core inside the GPU. Let's say if you have 10 cores, the first 10 workgroups will be executed by the 10 cores together.
So they will all do the input accesses together. And it will be fine, because it will be on the same rows. But then, when they want to store the result in the output buffer, they want to access the same columns. And they will be conflicting in this case. So how do we solve this? We can just change how the workgroups are mapped to the image blocks. Instead of mapping them by rows, we'll just map them diagonally like this.
In this case, the accesses to the input buffer will be fast and so will the accesses to the output buffer. Actually, the change in the code to do this is pretty small. It's just adding one line. You just change how the workgroup coordinates are mapped to the image blocks. Okay, now let's benchmark it. Ah, now it's faster. Good.
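For reference, this is roughly the shape the final transpose kernel might take. The actual code isn't in the transcript, so the tile size, the assumption of a square image whose dimensions are a multiple of the tile size, and this particular diagonal mapping are all illustrative choices:

```c
#define TILE 16   // the work-group size must be TILE x TILE

__kernel void transpose_tiled(__global const float *in, __global float *out,
                              int w, int h) {
    __local float tile[TILE][TILE];

    // Remap the work-group grid diagonally so that groups running at the
    // same time don't all write into the same output columns.
    int gx = get_group_id(0), gy = get_group_id(1);
    int by = gx;
    int bx = (gx + gy) % (int)get_num_groups(0);   // one possible diagonal mapping

    int ix = get_local_id(0), iy = get_local_id(1);

    // Coalesced read of one block into local memory.
    tile[iy][ix] = in[(by * TILE + iy) * w + (bx * TILE + ix)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Coalesced write of the transposed block (out is h pixels wide).
    out[(bx * TILE + iy) * h + (by * TILE + ix)] = tile[ix][iy];
}
```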
Right, so let's take this transpose kernel together with the vertical recursive Gaussian kernel and put everything together. And this is what we get. All right. We can reach something like 650 million pixels per second. You tell me, didn't he say 2,000? Yes. So, something is still slow here.
The issue we have now is that we are not running enough work items to saturate the GPU. We are running one work item for each column of the image. And that will be something like thousands of work items. But actually, it's not enough to hide all the latencies inside the GPU. So, we need to find another algorithm using more work items. But this will be a subject for another talk. So, let me summarize.
So this is actually a simple procedure; we have followed it and in a few steps we have obtained a significant speedup. I just want to add a few things to this. After you write the code, just check the output you get against reference implementations, because otherwise it's quite easy to get much faster but with the wrong output. So yeah, you need a reference implementation somewhere. And second thing, don't give up at the first try. Usually with very simple changes, you can get much faster code. Well, thank you. I will invite Russell on stage.
Hi, my name is Russell Williams. I'm one of the architects on Photoshop. And I'm going to show you how we used OpenCL to speed up one of the cool features in the new Photoshop CS6. I'll just start off with a demo here. And what this feature is, is it's a blur gallery. It's a collection of creative blur effects like tilt shift, which is a popular effect now.
What I'm going to show you right now is called a field blur, where you put pins on the image and adjust how blurry you want that general area of the image to be. And the blur filter interpolates between those things. And first I'm going to run it on the CPU.
And this is a hand-optimized SSE hand-threaded CPU implementation. And you can see we used all 24 threads, all 12 cores on that. And it took 5.2 seconds down there in the corner. And if I just go in here and turn on OpenCL and rerun that, didn't even get a progress bar, 1.8 seconds.
So that was a pretty good speedup, and that's against the very fastest 12-core Mac there is on the CPU side. So that's really good. What was it? It was about 200 lines, just a very small OpenCL kernel, not doing a Gaussian blur but simulating the lens optics of a blur. It gives a much more aesthetically pleasing and realistic result. And we broke the image into 2K by 2K, 4-megapixel blocks in order to fit an arbitrarily big image on the GPU, and I'll talk about that again in a second.
So why did we use OpenCL for this? We've done a lot of blur filters and other things before and never used OpenCL in the past and now we're starting to use it. Well, first of all, we wanted to use the GPU and OpenCL is the only cross-platform GPGPU solution available. There are other GPGPU languages like NVIDIA's CUDA, but they aren't cross-platform and for us that's a deal killer.
Why didn't we use OpenGL? Well, we've used OpenGL in the past and we will use OpenGL for more Photoshop features in the future. But OpenGL is fundamentally a 3D rendering package, a geometry rendering package. And in order to do something in OpenGL, you have to recast your problem into those terms.
And it's got a steep learning curve. You've got to get your head around all of that 3D rendering stuff before you can write your algorithm. And if all you've got is a matrix of numbers and you want to run an algorithm on them and you get a matrix of numbers out, OpenCL is just far simpler, both to write and to learn.
Also, in the last year, I said portability or cross-platform is very important to us. And OpenCL really hit a new level of maturity and ubiquity in the last year. It's now shipped as part of the standard user mode install drivers on the Windows side for NVIDIA, AMD, and Intel integrated graphics. Now, I originally titled this OpenCL challenges, but really they're challenges for anything, doing anything on the GPU. And we were just using OpenCL on the GPU this time. We did our traditional hand-coded version on the CPU.
So the challenges on the GPU are, first of all, you need an algorithm that takes advantage of one of the two things that the GPU can do better than the CPU. One is a bunch of compute, and the other is memory bandwidth. That is, however many gigabytes per second the GPU has to its memory is much greater than the bandwidth the CPU has to its memory. And our blur kernel, like the Gaussian example you just saw, will in fact scale more with GPU bandwidth.
If you have a GPU with greater memory bandwidth, our blur gets faster. If you have a GPU that has the same memory bandwidth but more compute, it doesn't get any faster because we're still using up those free flops. He told you about the first 20 flops are free. And of course, the algorithm needs to be embarrassingly parallel. You need to be able to compute multiple output results independently without having one depend on the last.
It's really helpful to have, as he mentioned, have a debug reference implementation. We were developing this lens blur algorithm at the same time we were developing the OpenCL kernel, and that made things take a lot longer. It's much easier if you know what answer you're expecting so that you can more easily debug OpenCL. For debugging your algorithm, Xcode's debugger is much easier to deal with.
There's always a desire, at least in the marketing department, to say, "Oh, we did it on the GPU and now it's ten times faster." Well, it's not going to be a single number, right, because the GPU and the CPU are independent variables. That same GPU will look a lot faster compared to a slower CPU than it will compared to the CPU if you put it in a six-core Mac Pro. And I say unpredictable. What I mean there is you have to be empirical. Just like this whole last talk about optimizing: just looking at it up front and saying, well, this only needs to do so many flops and so many memory accesses, so it ought to be 10 times faster, doesn't work. You have to actually implement it and optimize it and go through that little basic algorithm in order to find out how much faster it's going to be.
We are constantly prototyping algorithms to see which ones will be suitable. We have to deal with resource limits. If you're dealing with just a screen image and you know the number of pixels you're computing or how many fit on the screen, that's probably not a problem. But Photoshop tries to deal with arbitrarily large images, multi-gigabyte images. And so the memory on the GPU is not virtualized for you. And so we break up the image into four megapixel chunks, and stitch them back together.
Similarly, there's no preemptive multitasking out there. And typically, if you're just running through a bunch of pixels, it's not a big problem. But if you have some iterative convergence algorithm, you can't just have something out there running for 30 seconds. It'll lock up the screen interaction, and the operating system will kill your process because it doesn't like that. So you have to break up your problem into manageable-sized chunks.
And then finally, if you're just coding for the Mac, you're pretty golden. That's a much easier issue. If you're determined to be cross-platform, then you have this problem of not only do you have a couple of operating system versions on the Mac, but you've got the Windows side with NVIDIA drivers, AMD drivers, and Intel integrated graphics drivers, and they're all updated at different times and independently. And so it really blows up your test matrix.
But even with all those challenges, we would definitely do it again. And as I said, we're actually in the process of prototyping more algorithms to see what we'll put in the next version.
And the other thing is that just in some very preliminary work, we've seen really great results from the AutoVectorizer for running OpenCL on the CPU in Mountain Lion. And so we'll be looking at doing more of that. And with that, I'll turn it over to my coworker, Dave McGavran, and they've done some really amazing stuff, and a lot of it, in Premiere Pro. Thank you.
Hi there. My name is David McGavran. I'm the engineering manager for Adobe Premiere Pro. And we've been doing GPU optimizations for quite a while now. But about a year ago, we wanted to really see what we were going to do with CS6. And we had sort of two goals with CS6.
We didn't just want, you know, a certain amount of effects to go faster or some things to go faster. We liked that idea, but we really wanted to be able to change the way that we do video workflows. We wanted editors to actually have a different way that they can now use the application to do more creative things with their work. And so when we looked at how we were going to do that, we had two goals.
One, we wanted to be able to do it on both platforms. And two, we wanted to be able to do it on the Mac laptop. The 15-inch laptop is extremely popular with our customers, and we really wanted to be able to support that. And when you look at how you can go about doing that, the only answer you can arrive at is OpenCL.
So we looked at jumping into OpenCL for Premiere CS6, and we went really all the way in. You'll see we ported our entire video pipeline to OpenCL, not just a little bit, all of it. And working with Apple and our other partners, we now have a pretty compelling result: our users have a different workflow when they're editing video. So I'm going to show you a quick demo.
So here you can see Adobe Premiere Pro CS6, and we're going to show you a few interesting ways that our workflows have changed. So down here in the bottom right, this is our timeline window. What you're seeing there is four individual HD frames, and we put a scale and a position on them so they're picture-in-picture.
And one of the new features we have is the adjustment layer feature, and you might be familiar with this from Photoshop or After Effects. Having this in a video editor that does real-time video editing is pretty compelling, and it's a new feature in CS6. So what that allows you to do is instead of adding effects just to the individual video clips, you can now add an effect to the adjustment layer, and that will change everything that's under the adjustment layer as a total component.
So the first thing I'm going to do here is I'm going to go over to the effects, and I'm going to add a three-way color corrector to the adjustment layer.
So here we go. So what we're going to do is we're going to grade that composite of those four HD streams. So here we go. We're going to make it very colorful. And it took a little while because it's actually a pretty complex optimization. And that's a really good feature for our users. They get to now grade an entire composite of video edits.
But it's not really what we were aiming for. We really wanted to change workflows. We wanted people to be able to stay creative in real time. So I'm going to go in here and I'm going to turn on OpenCL. Again, this is on the 2011 MacBook Pro 15-inch.
So here we are now, and so now I'm going to actually use another feature which we call Don't Stop Playback. So now we're actually playing that back in real time, which is 4 HD streams with some scales and some positions and a three-way color corrector. And when I say that we can now do things that we were never able to do before with OpenCL, I'm now actually grading that whole thing in real time without dropping any frames. That's a very different workflow for our customers and very powerful, and that's a lot of computational power going on there. But that's not it. We can now go and add more effects.
So while I'm editing, I can go and grab more effects and add them to the video stream. So we'll just add a brightness and contrast here. And again, editing is not stopping. We're not dropping frames on a 15-inch laptop with a lot of HD streams. And I can just continue to edit this way without stopping, being very creative and staying in the workflow and not being distracted. Now, I actually didn't even tell you all the truth. In fact, each of those video streams also has effects on them. So I can go to one of those other video streams and you'll see another three-way color corrector.
And I can do that and you'll see that one of the screens -- I don't remember which one -- will be changing as I update that one. There you go, on the bottom left there. And each of those video streams has that. So now we have a very large collection of video algorithms working in parallel and not stopping on a 15-inch laptop. And this is how we can use OpenCL to really change the way video editors edit video. So that was that.
So how did we do this? So we've been -- like I said, we've been working with GPU stuff for a while, and we've also been working with multiprocessing for a while. And so something that helped us is that we broke up our pipeline into multiple stages. And the first thing we'll do is we'll start on those HD frames and we'll start reading them from disk. And we'll be doing that in parallel.
And we'll be doing as much as we can at once ahead of time of the play head where you're seeing it play. And then we'll start doing some CPU processing. Now, what's very important is for us, we didn't want to throw every single thing up onto the GPU because then you'd have a whole bunch of CPUs not doing anything, which would be the inverse problem that we had a couple years back.
So we do some CPU processing. And in that, maybe we'll do an H.264 decode or we'll do some other codec work. And there are a few effects that we didn't actually get to in OpenCL, so we'll do those. And we'll do that as a subtree render. So we'll package everything under the non-GPU code together.
And at that point, we're ready to go. And we'll start to upload it. And all of this is going in parallel together. And at this point, we get it up onto the GPU and the GPU becomes a coprocessor. And so while the disk I/O is going on and the CPU processing going on, the GPU starts cranking away and doing all its work.
Then we either go out to display using an OpenGL interop or we bring it back down to the system memory to do an export. So here, you get an idea of how much we actually did. Again, we started this a year ago, and there were a lot of people who were pretty skeptical when we came in and said we were going to port to OpenCL. That's a lot of GPU work that we do. And so what you can see in the intrinsics there, in that first top-left column, is just the things that a user may not always see. We're doing deinterlacing, compositing, blending modes, nested sequences.
All that stuff is what the user doesn't see. And then you have a series of transitions and effects. And those are the things that the user actually applies. All of these are optimized for the GPU using OpenCL. And that's pretty impressive in the time that we had to get this done. So not only that, on top of all that, we get to do bigger frame sizes. We get to do deeper color. And we have this all in a floating point space. So it's not even just faster. It's also better quality.
So what did we learn from doing this? Well, it was a lot of work to do this in a year, and there were a lot of things we had to learn. So one of the first things: you saw when Anna was talking, she talked about how you have to load up these kernels at run time. There are a lot of kernels. And some of those kernels are broken into multiple kernels. So we had a lot of stuff getting loaded up and compiled at run time.
We needed to load those asynchronously or we didn't get a very good user experience. So by loading them asynchronously, we can get to the situation where a kernel was needed before it was ready. So we then actually preempt those asynchronous loadings so that we pick the kernels that are most important at the right time. So that changed the user experience, and now you don't even know that compiling is really going on when you're launching.
We also had a lot of work to get OpenCL/OpenGL interop performing well enough. There are a number of ways to do it, and we tried them all, and we did a lot of benchmarking. And what we came to is that clCreateFromGLBuffer and clEnqueueAcquireGLObjects were the way to go for both the upload and download. These were the things that gave us the best performance.
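The general pattern with those two calls looks roughly like this (not Adobe's code; the CL context is assumed to have been created against the GL share group, and the kernel, queue, and GL buffer name are placeholders):

```c
#include <OpenCL/opencl.h>
#include <OpenGL/gl.h>

// Run `kernel` on a GL buffer object `pbo` shared with OpenCL.
static void run_on_shared_buffer(cl_context context, cl_command_queue queue,
                                 cl_kernel kernel, GLuint pbo,
                                 const size_t global_size[2]) {
    cl_int err;
    // Wrap the existing OpenGL buffer object as a CL memory object.
    cl_mem shared = clCreateFromGLBuffer(context, CL_MEM_WRITE_ONLY, pbo, &err);

    // Make sure GL has finished with the buffer, take ownership in CL,
    // run the kernel, then hand the buffer back to GL.
    glFlush();
    clEnqueueAcquireGLObjects(queue, 1, &shared, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(shared), &shared);
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, NULL, 0, NULL, NULL);
    clEnqueueReleaseGLObjects(queue, 1, &shared, 0, NULL, NULL);
    clReleaseMemObject(shared);
}
```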
Then comes the question of OpenCL images versus buffers. They're both sort of independent ways to deal with memory on the GPU. And the OpenCL images actually fits our model better. It's a 2D, you know, optimized version to get to video frames, which fits what we're doing. But for the most part, we actually settled in on buffers for a couple reasons.
Buffers allowed us to do better caching. We can actually manage that memory better, and we had a lot of frames in parallel, as you saw. It also allows us to read and write to the same buffer at once, which is very important for a lot of our kernels.
In the year that we had, you heard Russell say that things have really matured. When we started on this big effort, our code wasn't always working right, and there were still growth areas in OpenCL, so we actually had trouble with pinned memory, which is a shame because that actually brings a lot of performance.
At the time, we had to avoid it because we couldn't get it to work, which is probably our fault. So we're going to go back and look at that again, but that's something to keep in mind that using pinned memory will give you better performance, but it was tricky for us to get it to work. We also had trouble with structures passing them up to the kernels. We couldn't always get the alignment right, so we found a good way to get around that is to just flatten the structures and move on and not spend too much time there.
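For example (an illustrative kernel, not Premiere's code): rather than passing a struct whose host-side layout has to match OpenCL C's alignment and padding rules exactly, pass the fields as separate arguments.

```c
// Risky: the host struct's padding/alignment must match OpenCL C's exactly.
//   typedef struct { float gain; int width; float2 offset; } Params;
//   __kernel void filter(__global float *img, Params p);

// Flattened: each field is its own argument, so there is no layout to mismatch.
__kernel void filter(__global float *img,
                     float gain, int width,
                     float offset_x, float offset_y) {
    size_t gid = get_global_id(0);
    img[gid] = img[gid] * gain + offset_x;   // placeholder computation
}
```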
Also very importantly, and you've probably heard this a lot since you've been at WWDC, filing radars. The OpenCL team has been extremely responsive when we've found performance issues or concerns either helping us fix our code or fixing bugs in a few cases. So file radars, and that's a really great way to get things taken care of.
So when we look at all that we've achieved with OpenCL, there are still some things we can brainstorm about that are possible to do with OpenCL in a video editing application. We can continue to increase the set of effects. There are a few effects that we have that we don't have OpenCL implementations for.
It is a little bit of a problem there because not all of those effects are in use all the time, and there's maybe more work than is justified by the number of people who use those effects, but there's always some opportunity there. We can start looking at making sure that our third-party developers who write plug-ins for Premiere have the opportunity to write GPU effects, which they can do now, but it's not really optimized to fit within our pipeline. We can look at doing codecs, GPU encoding and decoding of certain codecs, and that would be very interesting in some cases. But again, we want to be careful not to let the CPUs sit there idle, because then we're wasting resources on the machine. We can look at whether there's an opportunity to take advantage of multiple GPUs at once when those configurations are available.
And we have some scopes that we'd like to accelerate; we can see some nice optimizations there. Also, there are effects that we could probably do in OpenCL that would never make sense in software just because they're so slow. Customers only want to wait so long. And OpenCL actually brings some opportunities where we can do some things we've never done before. So that's Adobe Premiere Pro CS6 and how we use OpenCL.
If you have any questions, you can come find us. I'll be hanging out at the lab tomorrow morning. So I'm going to give it back to Anna. If you have any questions about what you've seen today or any questions about OpenCL in general, contact Allan Schaffer. He's our graphics and game technologies evangelist. And also, if you're new to OpenCL, here's a link to the past WWDC presentations. And you can also go to the Apple Developer Forums. Thank you very much for coming.