
WWDC08 • Session 724

Introducing OpenCL

Media • 1:07:22

OpenCL is a groundbreaking new technology that unleashes the blazing power of modern GPUs and multi-core CPUs for use by your application. See how OpenCL gives you full access to the computational capabilities of the Mac platform. Learn the details of the OpenCL architecture and discover its C-like language syntax and powerful built-in functions.

Speakers: Aaftab Munshi, Henry Moreton, Kevin Quennesson

Unlisted on Apple Developer site

Downloads from Apple

SD Video (762.8 MB)

Transcript

This transcript was generated using Whisper; it may contain transcription errors.

Hello, everyone. My name is Aaftab Munshi, and I'm going to talk about OpenCL and using OpenCL to do data parallel computing on the GPU and CPU. So OpenCL is a new framework that lets applications use the GPU as a general purpose computational processor. And not only that, typically today you do certain tasks on the CPU and you use the GPU for doing 3D graphics, rasterization or image processing operations. With OpenCL you can actually use the GPU not just to do general purpose programs, but you can actually combine the GPU and the CPU and take advantage of both of them. So in this session, I'm going to give a high-level architectural overview of what OpenCL is, dig a little bit deeper into how to use OpenCL and how one writes data parallel programs with OpenCL. And in the advanced OpenCL session, we really go into the nuts and bolts of how to use OpenCL, how you would optimize OpenCL parallel programs for the GPU and the CPU. So OpenCL actually implements both a data parallel and a task parallel computing model. And so just bear with me for a few more slides, and I will actually define what I mean by data parallel and task parallel. But in this session, we're going to focus primarily on the data parallel computing model.

So the OpenCL specification has been developed in collaboration with AMD, Imagination Technologies, Intel, and Nvidia. And so this is something we worked on very closely with these vendors, to define something that's really robust, robust in its feature set and robust in its ease of use. It has to be really easy for people to use for this to be successful.

And in addition, Khronos is an industry-wide consortium which includes the members that I talked about in the previous slide, and that develops standards such as OpenGL and OpenGL ES. And Apple has released the specification to Khronos, and a new compute working group has been started with the goal of taking this and ratifying it to be an open standard that will run across multiple devices and multiple platforms.

So why OpenCL? Well, there is a shift in how computer architecture is being designed. If you look at the CPU side, the model of basically increasing the megahertz of a single-core CPU and using that to gain any performance automatically for applications no longer holds true for a wide variety of reasons, a major one being power. So there's a shift towards multiple cores in your system. And that means applications really need to be able to take advantage of that to gain performance. And in addition to that, we have the GPU. It has evolved from being a fixed function graphics pipeline to being a programmable graphics pipeline which does vertex shaders and fragment shaders. It truly now is a general purpose computational coprocessor. So you can use the GPU to do a lot of things besides just 3D graphics. Not only that, it is really powerful in its computational horsepower. For example, if you look at the NVIDIA GeForce 8800GT, its peak gigaflops is 504. Compare that to, for example, a Mac Pro with an eight-core CPU running at three gigahertz. We're talking 96 gigaflops. So there's an appreciable ratio of performance that you can actually gain if you could use the GPU also. However, writing parallel programs is a lot harder. I mean, if you really want to take advantage of both the CPU and the GPU, then the model that you would use to write parallel programs for the CPU looks completely different from how you would use the GPU. So we need to have something that abstracts that, a framework that hides that difference and lets applications use these resources easily. So one more question -- well, why don't we just use a graphics API? Well, graphics APIs are centered on doing graphics-based operations. When I want to actually move my data parallel compute algorithm, I really want to think in terms of: I've got a bunch of functions that I want to execute, I want to supply data and generate data which can get reused. So it's a different model.
It's more like how you would write code on the CPU today.

In addition to that, there are features that are available in today's devices such as the GPUs, which are targeted specifically for compute. And I'll talk a little bit about that. And graphics APIs don't necessarily expose those. And we need to be able to take advantage of that to really get performance. So we need something that's different and provides a model that application developers are already familiar with today, which is writing functions. So... What are the goals or the requirements that OpenCL had to answer or solve? Well, the first thing is, you know, today we use CPUs for doing certain things, and we use GPUs for doing certain things. We want to actually leverage all the computational resources in the system, so we want to view CPUs and GPUs as peers.

Now, I talked about, you know, the 8800GT has 504 gigaflops, and the Mac Pro eight-core CPU has 96 gigaflops. Well, why not take advantage of both of them? You get 600 gigaflops. And there are certain tasks a CPU is really good for and there are certain tasks a GPU is good for; you may actually want to design your algorithms to take advantage of both of them. And we want to be able to do that. And so OpenCL allows you to do that.

The parallel programming model has to be really easy to use. And one of the reasons why a GPU is really a good example of a successful parallel computing model is that if you just think of it in terms of 3D graphics, I have a vertex shader, a fragment shader. I really write one shader, and then it gets executed with a bunch of vertices or a bunch of pixels. And I don't really have to worry about how parallelism happens. It just gets taken care of for me. And we want to do something that is similar, but apply it to functions.

So it's great -- I have the GPU with really amazing computational power. But I am trying to move my algorithm that used to run on the CPU to the GPU. I want to see consistent numerical behavior. And so we need to be able to specify the accuracy of floating-point computations. So when I do an addition, subtraction, or multiplication on the GPU, I expect to see the same result as I would on the CPU. And we actually need to take it a little further.

We have a bunch of math functions that are defined in math.h today. We want to be able to support that on any compute device, but also define what the minimum error bound is. Because if I am using the GPU for computations and the CPU for computations, I need a library that clearly defines what the minimum error bound is so I can design my algorithm accordingly. So that's really important. Numerical accuracy is really important. Then we can use OpenCL as the vehicle to drive future hardware requirements for compute as future revisions of OpenCL come into play. So where can I use OpenCL?

Any data parallel algorithm. Okay, next slide, I'll talk about what I mean by data parallel. Any data parallel algorithm that's a performance-critical path. So some of the domains where you would encounter data parallel algorithms would be physics simulation. Let's say you're doing rigid-body collision or fluid simulation. Image processing is a natural fit; I don't need to explain that. Signal processing, FFTs, data parallel. Video-audio encoding. Math libraries -- you know, BLAS, LAPACK could totally run on the GPU and the CPU. Non-traditional graphics rendering techniques like ray tracing, financial modeling, medical imaging. The list goes on and on. This is just a small subsample of where we can use OpenCL. All right, so what is data parallel computing?

Well, in general, parallel computing can be classified into two subcategories, task parallelism and data parallelism. So what do I mean by task parallelism? Well, let's say I have a library that operates on numbers. And I have an algorithm that produces the sum of the numbers. I have an algorithm that computes the min and max. And another algorithm produces the median.

Well, these are independent tasks or functions, and they can actually operate in parallel. And in the example I just gave, there is no dependencies between them, and they can actually be executed in parallel. But as a programmer, I have to really think about how do I break these things into multiple tasks which can execute in parallel. There may be cases where there are dependencies.

For example, let's say I want to generate a histogram, and then I want to take the output of the histogram to do some tone mapping. Well, I cannot execute the tone mapping in parallel to the histogram. There's a dependency that the histogram task has to complete first before I do tone mapping. So I need to basically, when I write something that's going to be task parallel in nature, I need to carefully think, break up these tasks, and define the dependencies, and then create the parallel model. That's what I mean by task parallelism.

In OpenCL, you can actually do that. And each task can actually be data parallel in nature. So in the example I just gave you about summing the integers, we're going to talk about how that would be described as a data parallel example. So they can coexist together. But what is data parallel computing? Well, if I have a function that is actually operating on multiple data elements, and the operation on each of those data elements can be executed in parallel, that's what I mean by data parallel computing. And that's basically been the most successful model of parallel computing. Traditionally, most supercomputers use that model. So let me describe an example.

Let's look at the scalar code here. I want to multiply a number in A, in array A, which has n entries, with another number in B, corresponding index in B, and the product I want to store in result. Well, if I were to write scalar code, it would just be a for loop. If I were to write a data parallel version, well, if you look at the code, each multiplication for any entry in A and B can actually execute in parallel. So if I could create n independent units of execution, where n is the size of the array here, then I can execute all of them in parallel. And so that's what I mean by data parallel computing. Well, let's take another example, a little more complex. Well, let's take the sum of all integers.
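The two versions described above can be sketched in plain C. The OpenCL qualifiers are stubbed out with macros so the kernel compiles as ordinary C, and the n independent threads are simulated by a host loop; the names dp_mul and run_dp_mul, and the stubs themselves, are illustrative assumptions, not the session's actual slide code:

```c
#include <stddef.h>

/* Stub out the OpenCL C qualifiers so the kernel compiles as plain C.
 * (Defining double-underscore names is just for this illustration.) */
#define __kernel
#define __global

/* Simulated thread ID: in real OpenCL each of the n threads gets its own. */
static size_t sim_id;
static size_t get_global_id(int dim) { (void)dim; return sim_id; }

/* Scalar version: one for loop on the CPU. */
void scalar_mul(int n, const float *a, const float *b, float *result) {
    for (int i = 0; i < n; i++)
        result[i] = a[i] * b[i];
}

/* Data parallel version: each thread multiplies exactly one element. */
__kernel void dp_mul(__global const float *a, __global const float *b,
                     __global float *result) {
    size_t id = get_global_id(0);
    result[id] = a[id] * b[id];
}

/* Host-side simulation of spawning n independent threads of execution. */
void run_dp_mul(int n, const float *a, const float *b, float *result) {
    for (sim_id = 0; sim_id < (size_t)n; sim_id++)
        dp_mul(a, b, result);
}
```

Because no element depends on any other, the n simulated iterations could run in any order, or all at once, and produce the same result.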

So if I wanted to sum all integers, then it's basically a for loop, which sums and produces a result. Let's look at how you would describe that as data parallel. And this is just one example, one way you would do it. There are many ways of doing this. Well, let's say I have the array of integers, and I have m threads, in this case, eight threads. Then each thread generates a partial sum. So in this example, thread zero takes the sum of minus two and one and produces minus one. Thread one takes the sum of 10 and five and produces 15 and so on.

So at the end of the first instance of running this code, I have eight partial sums. Then I can actually loop, and now take four of those eight threads and produce four partial sums. And then take two of those threads and produce two partial sums. So finally, I'll end up with something that has two partial sums, and then I can use the first thread, thread zero, to generate the final result. So this is one example of how you would do data parallel. So you have the first main loop and then a log n loop.
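That two-phase reduction -- m partial sums, then a log n combining loop -- can be sketched sequentially in plain C, with the m threads simulated one after another. The function name tree_sum and the fixed-size partial buffer (assuming m is a power of two, at most 64) are assumptions for illustration:

```c
/* Tree reduction: m simulated "threads" each sum a slice of the input,
 * then half of the threads combine pairs each round until thread 0
 * holds the final result. */
int tree_sum(const int *a, int n, int m) {
    int partial[64];                  /* one partial sum per thread      */
    int per_thread = n / m;           /* elements handled by each thread */

    /* First pass: each thread sums its own slice of the input array. */
    for (int t = 0; t < m; t++) {
        int s = 0;
        for (int i = 0; i < per_thread; i++)
            s += a[t * per_thread + i];
        partial[t] = s;
    }

    /* log2(m) combining rounds: thread t folds in partial[t + live]. */
    for (int live = m / 2; live >= 1; live /= 2)
        for (int t = 0; t < live; t++)
            partial[t] += partial[t + live];

    return partial[0];                /* thread 0 holds the final sum */
}
```

In real OpenCL the threads in a group would synchronize between rounds, since each round reads partial sums written by other threads.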

So what is the data parallel compute model in OpenCL? So define your computation domain, i.e. how many units of execution can operate in parallel on the data that your function is using. We call them threads, but these are basically just elements of execution; they're fine-grained in nature. They're not like pthreads. So that computation domain in OpenCL is the global thread count. So in the example we talked about where we just multiplied one array with another, and the array was n elements in size, I would spawn n threads of computation because they can all be executed in parallel. Now, if there is no communication between these threads, which was the case when I multiplied one number with another, I don't care -- I shouldn't have to care -- how the compute device goes and parallelizes that. And OpenCL refers to this as implicit mapping. And so for the mapping, you just specify how many independent threads of execution you want, and OpenCL will go figure out how to map and execute that on the compute device. In the example where I gave the sum of all integers, I needed to know how many threads were working, because they were sharing information as they were generating partial sums. That is an example of what I call explicit mapping. So in there, you really need to know how many threads are working together. And we call that a thread group. So a thread group has a special property that the threads in the thread group that are operating on the data can communicate with each other, which they needed to in the case of the sum of all integers, because you needed to share the partial sums together. So they can synchronize, they can coordinate, and communicate with each other. And multiple thread groups can execute in parallel. And they can execute in parallel on the same compute unit in a compute device or across multiple compute units in a compute device. So this is a high-level architecture overview. Well, how do I use OpenCL?
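For the explicit-mapping case, a global thread ID decomposes into a thread group ID plus a position within the group. A minimal sketch of that arithmetic for the one-dimensional case (the struct and function names are illustrative):

```c
#include <stddef.h>

/* A global thread ID splits into: which thread group, and where in it. */
typedef struct { size_t group; size_t local; } thread_pos;

thread_pos decompose(size_t global_id, size_t group_size) {
    thread_pos p;
    p.group = global_id / group_size;  /* which thread group         */
    p.local = global_id % group_size;  /* position inside that group */
    return p;
}
```

Only threads sharing the same group ID can communicate and synchronize with each other.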

So the first thing, what are the objects OpenCL deals with? And they're really categorized into three things: the compute device, which is going to execute your function; the memory, that's the data that you're going to use or produce; and your function, which we call executable objects. So let's go a little bit deeper into that. So a compute device is basically anything that can execute data parallel programs. And a compute device may actually have multiple units of execution that can execute multiple programs. So here is one example of a compute device. Here's another example of a compute device. So that's great.

Typically, I would use the GPU to do something. I would use the CPU to do something. But what if I have multiple GPUs, or if I want to use a GPU and a CPU? I don't want to have to go create a special connection with the GPU and a special connection with the CPU. I just want to view them as one compute device.

So in OpenCL, we call that a device group. It's basically a logical device. So you can do that. And guess what? Then all the resources and your functions get created for that device. So you just do that once, and then you can go execute the functions wherever you want.

Next thing, memory objects. Well, there are two kinds of memory objects. The first is array. It's basically very similar to something that you do when you do a malloc. You get a pointer. So in array, you get a pointer. It's a linear collection of elements. You can reference each element using a pointer.

The thing to note is, you know, on a CPU, the reads and writes will be cached. But on a GPU, typically, it won't be cached. And so there are performance implications of that, which you need to remember or recognize when you write your compute kernels or functions. And then images. There are two types, 2D and 3D. And the difference between an image and an array can be classified as the following.

In the image, the data is stored in a nonlinear format. So you do not get a pointer and cannot directly access the elements. So there are built-in functions in the OpenCL language that allow you to specify what you want to read, where you want to read, and where you want to write. In addition to that, you also get what we call a sampler -- controls that determine how you want to read from an image. So given a 2D coordinate for a 2D image, an XY, or a 3D coordinate for a 3D image, XYZ, you can also specify how to convert that to the actual position in the image using what we call the addressing mode, which is very similar to what OpenGL has, and the filter mode, if you want to do a bilinear filter or not.
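One common addressing mode clamps out-of-range coordinates to the nearest edge of the image. A minimal plain-C sketch of that behavior for a single-channel image (the function names and the flat row-major layout are assumptions; real image storage is nonlinear and opaque):

```c
/* Clamp-to-edge addressing along one axis: out-of-range coordinates
 * snap to the nearest valid index instead of reading out of bounds. */
int clamp_to_edge(int coord, int size) {
    if (coord < 0) return 0;
    if (coord >= size) return size - 1;
    return coord;
}

/* Read a pixel from a w x h single-channel image, with clamping. */
float read_clamped(const float *img, int w, int h, int x, int y) {
    return img[clamp_to_edge(y, h) * w + clamp_to_edge(x, w)];
}
```

With a sampler, this policy (and the filter mode) is selected declaratively rather than written by hand in every kernel.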

And the last object -- well, the function, the data parallel function that we have, is called a compute kernel. So here's an example. So ignore some of the tokens like __kernel and __global for now, and I will talk about that. So it just looks like C code. And a program would be analogous to a dynamic library. It has a bunch of functions. Some of them are exported. So the functions that are declared with the kernel qualifier are your exported functions that you can actually call.

So how does one write compute kernels? Well, we wanted to make sure it's something people are very familiar with. And so OpenCL C is basically derived from C99 and has additions to it. And some of the additions are vector data types. And vector data types can be two-component, four-component, eight-component, up to 16 components, for integer and float types. And you can do all the basic operations you would do. Like if you say A plus B, it works for a vector type as it would for a scalar type. Then there are image types. So you can actually use image data types and reference them and call the built-in functions to read and write. There are address and function qualifiers. The only function qualifier that exists today is the kernel qualifier, which is basically something that tells us that it's a compute kernel.
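In OpenCL C, `a + b` on a built-in vector type like float4 works component-wise; plain C has no such built-in, so a sketch needs a struct and a helper function to mimic it (the names float4 and float4_add here are illustrative stand-ins, not the OpenCL built-in):

```c
/* Mimic OpenCL C's four-component float vector in plain C. */
typedef struct { float x, y, z, w; } float4;

/* In OpenCL C this is just "a + b"; here it must be a function. */
float4 float4_add(float4 a, float4 b) {
    float4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}
```

The point of the built-in types is exactly that the compiler knows the component-wise semantics and can map them onto the device's vector hardware.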

And address qualifiers -- I will talk about that in a few slides. And there's a rich suite of built-in functions. So all the math functions that you would see today in C99 are supported. There are functions for doing the synchronization that you would need for threads in a thread group. There are atomic functions. There are conversion functions to convert from one data type to another, like an int to a float and so on. So let's take an example. Remember the sum of all integers? So this was the first loop.

So what would that code look like? So there's the __kernel qualifier on dpsum, and there was the input array, which is called a, and tempSum is basically the array that will hold the partial sums, and finally I'll store the result in result. So ignore the global and local qualifiers for now. So each thread is identified by a unique ID, and it has a global ID. So if you had spawned n global threads, each gets a unique ID, which can be one-dimensional, two-dimensional, or three-dimensional. And that has an advantage. And we'll talk about that in the next example I give. And each thread in a thread group, which is the explicit working group that works together, has a unique ID as well. So we generate the partial sums here, the first partial sum. So we'll have eight of them. And like I said in the example I gave, then you loop until you finally get two partial sums. And then thread zero computes the final sum. So here is what the code looks like.

So the first loop was getting a partial sum across n threads. Then we summed the partial sums, and then we generated the final sum. Let's take another example. Let's say I want to do an image filter where I want to do a reflection across the Y axis. In this case, I have an image.

So each element in this particular filter, since I'm doing a reflect, can operate independently. So each element or pixel in the image can actually be an independent unit of execution. So the threads I want to spawn here would be the width times height of the image. I think it would be better to describe this not as a one-dimensional problem but as a two-dimensional problem, because if I have a two-dimensional identifier, I can actually use that identifier as my coordinate to get into the image. And you can do that. And so you call read image to read, then compute the reflected coordinate, and then write. Well, with images, what happens is, you know, the underlying data format can be different, but when you read or write, the output of the read is always four floats, and the input to the write is always four floats, and automatic conversion happens. Well, what if I already know what the format is and I don't want to use images? So let's say it's RGBA 8888, 8 bits per channel. Can I use something else? Yeah, you can. In this case, you can declare it as an array.
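The reflection filter itself is tiny: each output pixel reads the mirrored column of its row, so every pixel is an independent unit of execution. A plain-C sketch, with the width-times-height threads simulated by two loops over a single-channel image (function name and layout are illustrative):

```c
/* Reflection across the Y axis: output pixel (x, y) reads input pixel
 * (w - 1 - x, y). Each iteration is independent, so in OpenCL every
 * pixel would be its own thread with a 2D global ID used as (x, y). */
void reflect_y(const unsigned char *src, unsigned char *dst, int w, int h) {
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[y * w + x] = src[y * w + (w - 1 - x)];
}
```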

So you have many choices available. Now, these are not optimized examples, so don't go and write this and expect full performance. So let's talk about -- remember, in some of the slides, we were talking about __global and __local. So what are these things? Well, in OpenCL, we have memory address spaces. And these describe the different memory subsystems or hierarchy that may exist in the compute device. On some compute devices, they may be collapsed. And I'll talk about that. So first thing is, remember, each thread is an independent unit of execution and has what is called its own memory, or we call it private memory. Okay? Then the threads that are in a thread group that want to be able to share data -- well, they need some buffers to be able to share data with. And that's called local memory. And local memory -- so in the case of GPUs, the local memory would actually be a user-managed cache that your compute kernel is going to manage, whereas on the CPU it's just basically cached memory that maps to your DRAM. And then there's global memory, which is where your input and output data finally get read from or stored. And there's a property of global memory. Let's take an example.

Let's say I want to do a convolution filter. The filter weights can be stored in global memory, but they're basically just being read. I'm not modifying them. And it has a unique property that each thread, as it's applying the filter, wants to read the same filter weights. So wouldn't it be nice if there was a way to cache that and be able to access it really, really fast? And that's what the constant qualifier is indicating. So that tells the OpenCL compiler that there's a unique property behind this constant.
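The access pattern being described -- every thread reading the same small, read-only weight table -- looks like this in a plain-C sketch of a 1D three-tap convolution (names, tap count, and the edge clamping are assumptions for illustration; in OpenCL C the weights pointer would carry the constant qualifier):

```c
/* 1D 3-tap convolution. The weights array is read-only and shared by
 * every "thread" (here, every loop iteration), which is exactly the
 * pattern the constant address space is designed to make fast. */
void convolve3(const float *in, float *out, int n, const float weights[3]) {
    for (int i = 0; i < n; i++) {
        float s = 0.0f;
        for (int k = -1; k <= 1; k++) {
            int j = i + k;
            if (j < 0) j = 0;          /* clamp at the left edge  */
            if (j >= n) j = n - 1;     /* clamp at the right edge */
            s += weights[k + 1] * in[j];
        }
        out[i] = s;
    }
}
```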

And these memory address spaces can only be specified on pointers. So anything you declare as a pointer, you need to tell us what the memory address space is. So we looked at the high-level architecture overview. We looked at what the objects are. We looked at how to write compute kernels. So I wrote a compute kernel.

I think it's going to perform really well. Well, how do I get the compute device to execute it? What code do I need to run on the host to execute it? There are three things you need to do before you can execute. First is you have to initialize. Well, that means I've got to decide what compute device I want to use. I want to create a context on it, and I'll talk about what I mean by context. I need to allocate the resources, which is my buffers.

I need to load my program, build my executable, get the kernel functions, and then I'm ready to execute. So let's take this example, dpMul, where we were multiplying one array with another. There are three arguments to this function. Okay? So I want to execute it. Let's say I decide I want to use the GPU. So, you know, I create the device group using a GPU. So I get a device that indicates the GPU. And now I need to create a context on it. And a context is basically a conduit or a channel which lets me send commands to the compute device. Multiple contexts can be operating in parallel.

So I create my context. Now I need to create my resources. So those are the three resources I create. Now remember, for the two input arrays, A and B, I already have data, but in this case, the data's on the host side, and I need to transport it over to the GPU because I'm going to execute this kernel on the GPU. So when I create my array, I say, here is the input data, go copy it. Okay? And I can also give additional properties saying, well, A and B in this case are read-only, whereas memory object 2, which is the result, is read and write.

Now I want to load my program. Well, so I have my source. I load it. I build the executable. And I then create the kernel. Create kernel is basically very similar to-- if you do a load library, then you call get proc address to get your address of the function. That's effectively what it's doing.

Now, before I call my function -- so when I make a function call, I call the function and I specify the arguments. I specify the arguments in OpenCL using something called CL set kernel args. So in this case, the input arguments are the three memory objects I created, and their sizes are the size of the memory object type. So all memory objects that are created on the host side are opaque handles. And they get mapped to the right pointers by the OpenCL framework when the function actually gets executed on the device. So on the host side, you never have to deal with pointers.
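The idea that host code passes opaque handles, which the framework resolves to device pointers only at launch, can be sketched with a toy handle table (everything here -- the table, mem_handle, create_mem_object, resolve -- is an illustrative assumption, not OpenCL API):

```c
/* Toy model of opaque memory-object handles: the host holds an integer
 * handle, never a device pointer; the "framework" owns the mapping and
 * resolves it when a kernel is actually launched. */
#define MAX_OBJECTS 16
static void *object_table[MAX_OBJECTS];   /* handle -> real storage */
static int   object_count;

typedef int mem_handle;                   /* opaque on the host side */

mem_handle create_mem_object(void *storage) {
    object_table[object_count] = storage;
    return object_count++;                /* host only sees this int */
}

/* "Launch time": resolve a handle to the pointer the kernel will use. */
void *resolve(mem_handle h) { return object_table[h]; }
```

This indirection is what lets the same kernel and arguments execute on any device in a device group: the framework picks the right pointer per device.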

All right, so I want to execute my kernel now. Everything is set up. So I need to specify the context I want to use. I need to specify the kernel I want to execute, and what compute device I want to execute it on. Well, why do I care about that? Well, if you just created a device group which has only one device in it, then you don't. You can just say null, because there's only one place you can execute it on. But remember we said you can actually create a logical device that can contain multiple GPUs, or GPUs and CPUs, so we need to be able to specify where we want to execute this. And that's what this is telling. How many independent units of execution? Now, whether I do implicit mapping or explicit mapping -- i.e., whether I care if the threads communicate or not -- if I don't care, then I don't need to specify the thread group size.

So just null. And how many dimensions? This is -- what is the dimension that identifies the thread ID? In this case, it's an array. It's a single-dimensional index, so it's one. And there are other parameters, but that's how you execute the kernel. This basically queues a command to go execute the kernel. Now, for the execution model in OpenCL, there are two models allowed. One is you can queue things, commands. Commands are always queued in order. You may choose to then say I want to also execute them in order. So for example, if I execute kernel A followed by execute kernel B, it is guaranteed that B will only start once A has finished. So if there's a dependency, in terms of the output of A becoming the input of B, then you can set the property with the context saying I want to execute in order. Or you can say I want to execute completely out of order. It's asynchronous mode. In which case the only guarantee is that commands get queued in order, but they can be executed totally out of order. And what happens then is each operation, like an execute kernel or a read or write, gets identified by an event handle, an object that is returned back to you. And you can use these then to determine the synchronization you want to apply.
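The out-of-order-plus-events model can be sketched as a tiny scheduler: commands may run in any order, except that a command waiting on an event cannot start until the command signaling that event completes. All names here (command, run_queue, the event arrays) are illustrative, not OpenCL API:

```c
/* Each queued command optionally waits on one event and signals one
 * event when it completes; a pass-based scheduler runs whatever is
 * ready, modeling out-of-order execution with event dependencies. */
typedef struct {
    const char *name;
    int wait_event;     /* -1 = no dependency    */
    int signal_event;   /* event set on completion */
    int done;
} command;

/* Run all ready commands repeatedly until everything has executed,
 * recording the order in run_order; returns the number of passes. */
int run_queue(command *q, int n, int *event_done,
              int run_order[], int *ran) {
    int passes = 0;
    *ran = 0;
    while (*ran < n && passes < n + 1) {
        for (int i = 0; i < n; i++) {
            if (q[i].done) continue;
            if (q[i].wait_event >= 0 && !event_done[q[i].wait_event])
                continue;               /* dependency not satisfied yet */
            q[i].done = 1;
            event_done[q[i].signal_event] = 1;
            run_order[(*ran)++] = i;
        }
        passes++;
    }
    return passes;
}
```

Even if the dependent command sits earlier in the queue, it only runs after the command it waits on has signaled its event.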

So remember the task parallelism example I gave where you wanted to do an image histogram followed by HDR tone mapping? Well, each of them actually is a data parallel kernel or function, but they're independent tasks. So I would call execute kernel on the histogram, get an event that identifies this execution, and then say, okay, execute the HDR tone mapping algorithm, but only after this event has elapsed. And those are the missing arguments; the dot, dot, dot basically lets you specify what the events are. And that's specified in the specification that's available on the attendee website.

All right, great. I executed my kernel. I got data. I want to visualize that data now. So how do I do that? I'm going to use OpenGL. So how does OpenCL work with OpenGL? Well, we want to make sure that any GL objects, such as a texture or a buffer object, vertex or pixel, can actually be used as a CL array or a CL image, which you can. Well, not only that, we want to make sure that they both point to the same bits. We don't want to be doing any copying, so sharing has to be really, really efficient. CL can generate some data which can actually be directly used as the input to GL without requiring any copy, and that's how it's implemented. So how do you share these resources? Well, the first thing you need to do is make sure that the GL context is created using a CGL share group.

You want to make sure that the compute device group that you are creating with CL points to that same share group, which is basically saying, I'm going to point to the same devices. Then when I allocate my arrays or images, I use this flag called memalloc reference, which basically says that I'm not going to allocate any memory behind this, but I'm going to use whatever memory has been allocated by GL. And then sharing is very easy. I attach -- so I can attach to a GL image or a GL buffer object, and then use that as an argument in my CL set kernel args. And I can execute multiple kernels. When I'm done with it, I detach. And that's basically telling GL that, okay, CL is done with it, now you can use it.

So in summary, we have a new compute framework that works across GPUs and CPUs. It uses something we're familiar with, C with extensions. We define numerical precision requirements so that there's a consistent minimum error bound guaranteed across compute devices. And this will be the open standard for compute on desktop platforms. Well, let me introduce Henry Moreton from NVIDIA. He's a distinguished engineer in the architecture group at NVIDIA. And he's going to talk about GPUs and OpenCL and how they are such a close fit. Thank you. Henry Moreton: Thanks, Aaftab.

Good morning. So I'm Henry Moreton, obviously. And I have a few things I'd like to talk with you about. First of all, I would like to give you some idea why you should be interested in OpenCL. And then we'll drop into kind of a brief discussion of CPUs versus GPUs and some of their sort of respective characteristics. And then I'll spend the majority of my time talking about the GPU as an OpenCL processor. The image at the bottom of the screen is kind of a schematic view of a series of processing clusters that are kind of typical in the GPUs that we build today.

So why is this interesting? Even at the outset, I mean, this is the genesis of it. This is the beginning of OpenCL as a computing platform. But it happens that OpenCL is supported on all shipping NVIDIA GPUs. That means that any GPU that ships today is capable of supporting OpenCL. OK, so that's nice. But in fact, this has been the case for a while now. There are 70 million OpenCL capable GPUs already out there in the marketplace in people's hands. And that number is going up at about 2 million a week. So there's a non-trivial-- possibly a non-trivial user base.

Now, as far as the Mac platform or the Apple platform, these are the current products that have OpenCL capable GPUs in them, ranging from very high-end GPUs down to kind of entry-level integrated systems. Okay, so what about GPUs and CPUs? The 8800 GT, I think, is actually what's shipping in the Mac Pro. It has 112 cores and supports about half a teraflop and has some pretty significant memory bandwidth. If you compare that with kind of a contemporary desktop CPU, it has two cores. It has a little bit less than a tenth the floating point horsepower. And in round numbers, it has about a tenth of the memory bandwidth. So for very data parallel, data intensive, bandwidth intensive, floating point intensive workloads, the CPU is not as capable a platform. Both of these support IEEE floating point. And I'll stress that Affie actually was fairly clear about this. When I say IEEE floating point, I'm referring to rounding modes. The GPUs don't yet support floating point exceptions, those sorts of things. So it's not the full IEEE standard. It's important to understand that they're both programmable using OpenCL, which is actually great because one of the things that the CPU enables is the use of OpenCL across all Macs that are shipping today.

Okay, so what about the GPU as an OpenCL processor? Today, the GPU is composed of a large number of processing blocks that we call texture processing clusters, or TPCs. The original name stems from its graphics heritage. Now, each TPC has a series of texture filtering blocks and a texture L1 cache. You can see it on the diagram. And then they're further broken down into multiple units that I've called thread group processors; internally at NVIDIA we call them SMs, which stands for streaming multiprocessor. And a thread processing cluster can have varying numbers of these thread group processors. On the right, you see a TPC composed of two of them. On the left, in sort of smaller scale, there are three. And depending on the market segment that you're going after, one will be more appropriate than the other.

Now each of these SMs, or thread group processors, is made up of a collection of scalar processors, and there are eight per thread group processor. You can see the local memory that Affie was talking about. It's explicitly embodied in the architecture. It's a program managed cache. We also implement all of the address qualifiers, as I've called them here; Affie referred to them as address spaces. But basically the hardware implements the private address space, the local and global address spaces, as well as optimizing constant accesses.

Now, when I say it implements the private address space, there are instructions in the instruction set that actually reference memory, and then the hardware takes care of mapping that to memory in an efficient fashion in terms of interleaving the memory references from all of the threads that are executing in parallel so that the memory accesses are efficient. The same is true of local and global memory, and I'll talk a little bit more about that in a moment.

So if you take and put together a whole group of TPCs, you can build up a chip. And what we see here at the higher level is a set of four TPCs, which are then set on top of a memory subsystem that's quite wide, which is how we achieve the remarkable memory bandwidth the system supports.

And you can also see on the left side in gray this little block that says atomic. So the language supports atomic operations on memory, and we implement those in what we call fixed function hardware, which sits very, very close to the memory, so that the read-modify-write of the atomic operation can be extremely efficient. You'll also note that the host CPU is connected to the GPU via PCIe, but everything else in this diagram is on die, with the exception of perhaps multiple gigabytes of memory.

So given that, we can build a variety of different platforms by incorporating different numbers of TPCs and different numbers of memory partitions. So you can scale the bandwidth of a product. You can scale the compute capabilities of the product, depending on your point in the marketplace. So going back to the Mac configurations that we were talking about earlier, we have products ranging from 130 gigaflops up to 504. And you can see the varying numbers of SPs in these differing products. These are all fairly powerful machines, but there's still a wide range of performance from the bottom to the top.

Okay, so I talked a little bit about the kinds of things that we've implemented in the GPU in terms of providing direct support for OpenCL. What other sort of attributes are there? What's interesting? Well, one of the really critical and extremely valuable attributes of OpenCL and our architecture, frankly, is that there's transparent scaling in terms of performance across the product lines. You can take the same piece of code and run it efficiently on a single-TPC low-end system; that's the 8-SP MGPU (we call them motherboard GPUs), a relatively low-end machine. You can take that same piece of code and run it on even conceivably a dual GPU system that has 256 processors and supports about a teraflop of performance.

And the code will just run faster. I mean, it isn't perfectly magical. I mean, there are some limitations to scaling. But one of the wonderful things about this is when we ship a new product that has higher performance, has more cores, your application, assuming that at least some of you are application developers, just runs faster without any effort on your part. Right.

OK. Another attribute-- so actually, flipping back to this for just a second to clarify or kind of illustrate why this comes about: Affie had mentioned the thread group. So the thread group is the unit of execution that's guaranteed to execute at any given instant. And those thread groups run on these thread group processors.

And you might have a machine with a single thread group processor, or you might have a machine with, say, 32 or 64 thread group processors. Because there's no dependency among thread groups, we can run them serially or we can run them all in parallel, and that's where this scaling comes from. Okay. Another attribute on the GPU side is that we support thousands of threads.

And I mean it very literally, thousands of threads. Now, what does that buy us? Well, you can use a very, very sort of deep workload to hide the latency of compute operations. For example, if you perform some costly -- say you perform a sine function or a multiply add that has maybe 10 or 20 clocks of latency.

Well, while that is sort of grinding through the machine, I can run 10 other threads to hide the latency of that operation so that we never have trouble with dependent operations on the GPU, or rarely do. The same holds for memory accesses. Now, as Affie mentioned, in current shipping GPUs, there are few memory caches. There is a memory cache that sits behind the constant accesses, and I'll talk a little bit more about that in a moment.

But accesses to global memory go all the way out to the DRAMs. And it is by having hundreds and literally thousands of threads that we can hide the latency of memory access. So OK, thousands of threads, that's great. One of the things that we've done in the design of our processors is to make sure that the threads are extremely lightweight. You can imagine that if you have 1,000 threads or tens of thousands of threads, the threads have to be very lightweight. The threads are all managed in hardware. They're created, scheduled, and even the synchronization operations among threads are all managed in hardware. So there's no visible overhead to the application for these very large numbers of threads.

Okay. Now, in terms of the way we design hardware, we look at the kind of typical operations that are being performed by applications and optimize the hardware to make those operations as fast as possible. So one of the things that you often see in a program, a kernel, as Affie was describing, is that all of the threads that are executing along are often, in fact the vast majority of the time, at the same program counter. They're executing the same instruction. So we optimize that.

If a wavefront of SPs all reads from the same memory location, all reads the same instruction, it does it in a single clock. If, on the other hand, some of the threads within that wavefront take the if path and some of the threads take the else path, it takes two clocks, or it takes two instruction memory fetches. So we make the common case go extremely fast, and we make the other cases go as fast as possible, sort of providing for the right trade-off in terms of the instruction cache design, because ports into an instruction cache are expensive, and the caches tend to grow roughly linearly with the number of ports. We also optimize the accesses to the local memory as well as global memory, in the sense that typically threads will access a series of consecutive locations in memory.

So what the hardware does is it coalesces all of those requests into sort of one cache line fetch. And if, because your algorithm has slightly less structure, they access multiple cache lines, we take exactly as many clocks as there are cache lines referenced. And the hardware takes care of managing all of that. So there's no need for the programmer to deal with creating masks to perform a vector load but mask off the elements that they don't want to actually be loaded. So none of that overhead in terms of lane management falls to the programmer. So not only do you not have to program it, but it's also dramatically more efficient.

Now, as far as the constant accesses, Affie had mentioned that it's very beneficial to identify regions of memory that are read-only and also regions of memory that are dominated by a large number of threads accessing the same memory location. It happens that in graphics hardware, the heritage of the constant cache comes from the transform and lighting engines of like eight years ago, because in those engines, there was this chunk of memory which held matrices for transforming vertices, and every thread that was executing was going to access the same coefficient every clock. It happens that that kind of characteristic memory access pattern shows up not only in image processing operations, but in actually a very, very wide variety of compute loads. So what we do is optimize for that.

However, if there's some renegade thread within a wavefront, a group of threads that are executing all along at the same time, if it goes off and reads from some other location, the hardware manages that gracefully. It simply takes another clock, typically, to read this other memory location. But again, it's all managed in hardware on behalf of the application. So there's no additional compiler or programming necessary to get the performance there.

And then finally, obviously, we have huge amounts of bandwidth. And we can stream data from the DRAM onto the chip, into local memories, into registers, and operate on it at a sustained rate, such that you can keep all of the compute units fully occupied during the execution of a kernel.

So I guess to recap, You should understand that our GPUs in OpenCL were designed together. I've spent a great deal of time at Apple, literally a couple of hours a week. Fortunately, they're not too far from campus, our campus, that is. I think it should be apparent, if you haven't figured this out already, that the GPUs are a tremendous platform for running OpenCL. They're extremely efficient. If the workload is appropriate, you get great performance.

And you get, you know, in terms of absolute performance, as you move up the product line, you get sort of rewarded for moving up the product line with even better performance. Also, in terms of efficiency, and this is, depending on what environments you're working in, a really critical issue.

They're extremely efficient from the perspective of what we call perf per watt. One of our design criteria when building GPUs is to optimize the performance delivered against the power consumed, because if you have a laptop, you don't want it to get too hot. You need to be able to cool the devices that are inside it. And it turns out that power consumption, battery life, thermal dissipation, all of these things kind of determine what you can get away with in a given platform. And then finally, to revisit the open standard aspect of OpenCL, it will be supportable across all of our products. So the pictures across the bottom of the slide are exemplary of the various product lines from NVIDIA, and we plan to continue to ship OpenCL-capable machines. And that's it. That's all I had today. Our next speaker is Kevin. Thank you. Thanks.

So hi and thanks. So you've learned about OpenCL. It's a very, very cool and very exciting technology. And so you might want to do your OpenCL development in your usual development environment, such as Xcode or whatever you would prefer. But there's this great tool on the OS called Quartz Composer that can help you. So Quartz Composer basically can be seen as a visual programming environment. So it's pretty much a graph that organizes a processing flow, where each of the nodes of this graph is a patch that takes some inputs, does some processing, and returns some outputs. And so Quartz Composer is used for doing graphics animation or image processing or compositing pipelines, and also data visualization.

But the great thing about Quartz Composer is that most of the time these things are done with no code at all. And so for those of you OpenCL developers who want to learn about OpenCL, learn about the language and discover it, or go further, or explore what OpenCL can do for you, Quartz Composer is a great place to look at, because it will handle everything you don't want to handle. For instance, OpenGL attachments for drawing, or vertex buffer objects, textures, the texture caching pool, OpenGL contexts, and also the OpenCL program setup and device setup and so on. So you just go inside Quartz Composer, and you have an application where you can directly, simply go and type your kernel code and start playing and interfacing with other patches.

So another great thing about Quartz Composer is that it integrates most of the OS technologies, so, for instance, QuickTime, Core Image, Core Media, Core Audio, and so on. So you can use and leverage these OS technologies in your OpenCL development, for instance, as inputs to your kernels or as environments to visualize or to better explore and present the result of a simulation.

So Quartz Composer adds OpenCL support. And so what that means is we have an OpenCL kernel patch that's as simple as just one patch, so one of these processing nodes. And in this patch, you can write an OpenCL kernel. And Quartz Composer will automatically parse this kernel, find its arguments, and understand what the input ports and the output ports are, and automatically populate the patch, so that you can have some abstract processing unit that you can insert in some processing flow, so in some Quartz Composer composition.

And so that's very easy. You open Quartz Composer. You create an OpenCL kernel patch. And right there, without having to set up anything else, you can start doing some OpenCL and start leveraging Quartz Composer to visualize the result, to eventually go pretty far. So we'll show that in a demo in a second.

So one other great advantage of Quartz Composer is that everything's at runtime. So you don't need to build and run whenever you make a change in the kernel or change something. You can simply go in the settings of the patch and inspect the kernel code. And as you change, as you type in the kernel, Quartz Composer will automatically call CL to recompile the program in the background, so that your simulation, or whatever processing in OpenCL you're doing, will pick up the change as the simulation or the processing goes. So I'm going to demonstrate that right now on the demo machine.

So here I'm going to open Quartz Composer. So this is the NBody demo that you've been seeing previously in the show, so you have 16,000 particles, 16,384 exactly. And here is how the composition looks. So you have the NBody kernel, which is here, and simply the result, which is the position, the output of that kernel, is passed into a renderer that will simply render the result of the OpenCL simulation.

And that's as simple as it can be, as simple as it is. And Quartz Composer also provides you with patches to interact with the visualization, so you can very easily do some rotation like that. And so let me show you now the settings of that OpenCL kernel. So here I have all the code of the kernel, which is a copy and paste of the OpenCL kernel from all the other demos. So there's no translation to do.

Just paste that stuff in, and it works. And here I have a line that I commented out that simply takes the time step and multiplies it by a constant, which is 5. So I can just uncomment that line. And immediately, the simulation at the current time picks that up and goes five times faster. And if I comment that out again, we're back to the previous time step. So it's very easy as you go to change the kernel and to see the results. It's a very nice way, without having to build and run, to iterate really fast, and to go to where you want to go, and probably further, much faster.

So let me start the simulation again, and show you another power of Quartz Composer, which is that you have access to all the Quartz Composer pipeline and other possibilities. For instance, you can create very easily this little UI that allows you to change some parameters. But you can also leverage the Quartz Composer pipeline to create some interesting effects. And so for instance here, I'm going to create a trail of all of these particles over time. So here you can see in three dimensions, over 200 NBody simulations stacked over time and composited together. So Quartz Composer is going to handle everything, the vertex buffer object caching, and the display. So you don't have to worry about anything.

So how it works, to do that motion blur: it's a simple Queue patch that will stack up 200 vertex buffer objects, so 200 results of that simulation, and then an iterator, which is a patch that will render what's contained within that queue for all its elements, so all the elements contained in that queue, so 200 times. So here, for instance, you have 200 vertex buffer objects of 16,000 particles drawn in front of you by Quartz Composer in real time. So let me go back to the slides, please.

Can I go back to slides, please? Thanks. So, more details on the OpenCL kernel patch. By default, the OpenCL kernel patch is going to handle everything automatically. It's going to find out the output and input ports from the kernel prototype. It's going to figure out the thread dimensions and the output dimensions, so the dimensions of the output arguments. But you have the option to override everything. So you have the option in the settings to override the local thread dimensions, all these things.

By default, it's very simple, but you can apply all the customization that you want. So it's simple, yet very powerful. So we really want Quartz Composer to be a very, very good OpenCL development tool. And so we added to the Quartz Composer editor itself new functionality to make that support really optimal. In particular, we added this ability to define a global OpenCL kernel--OpenCL program, sorry. So what that means is that if you have a kernel patch and you write an OpenCL kernel within that kernel patch, this code and this kernel will be accessed by the patch itself. But using this global code view, you can write an OpenCL program, so maybe multiple kernels, and these kernels will be accessible by every patch in the composition, so they will be shared. So if you change a kernel in the program, it will be reflected automatically in all the patches. So that's a very nice way to have a centralized location for OpenCL development. And it allows you also to bring any sort of OpenCL code within Quartz Composer and start using it and leveraging it in the composition.

So another thing we added to Quartz Composer is a new patch specifically targeted at OpenCL simulations. So a simulation pretty much means that I have a kernel that iterates over time over the same objects, for instance, position and velocity. So in the NBody case, that's what we do. We have position and velocity. Each time step, we use the NBody kernels to update that position and velocity.

But Quartz Composer only has access to a slice of that time step, so to a patch at a given time. And so what we need is that this patch can have, at the next time step, in its inputs the result of this same patch at the previous time step, so at the previous execution of the composition. So we added a very powerful feedback patch that allows you to do that sort of feedback. And so the feedback patch at T equals zero is simply a pass-through. So it will pass the arguments coming on the left, so on the input ports, to the output ports, and pass them to the kernel. But at T different from zero, so if it's bigger than zero, it will feed back the previous outputs on the inputs of the patch. So that allows you to do this sort of evolution over time. And it's very powerful.

It can also be used in iterators if you want to iterate the same simulation or the same processing at a given time step. And so that allows you to do a lot of things. And so I'm going to show you an example of the sort of thing that we can do with that feedback patch.

So here I have a cloth. So that's as simple as it is. We use the feedback patch to do some cloth simulation, leveraging OpenCL in Quartz Composer. And so you can play around with the cloth. You can move it. So it's kind of fun. It's kind of elastic, but it's really nice. So you can do all that stuff. So the feedback is there used to propagate the result over time, but also, at a given time step, to iterate the simulation over all the springs. So here you have a set of points, with springs between them.

And so to enforce these springs, we use the feedback patch to propagate over all the cloth. So you can have fun. To import that image, it's as simple as an image source patch that will bring in the image. And you can simply put that-- so here's the composition-- as an image port on the mesh. And it's automatically textured by Quartz Composer. So in that composition, just like in the NBody one, there's no code beyond the OpenCL code. And here I can play, because Quartz Composer is integrated within the OS. I can, for instance, apply a Core Image effect on that texture, and it's composited and everything very easily. And you see the alpha blending on the side. All right.

Okay, and so to sum up, Quartz Composer is great to help you visualize the result of an OpenCL simulation, an OpenCL kernel, anything. It's great to help you interact with that visualization. And as I showed you in the composition, it's also great to understand the different parts, the different processing parts, of what's going on. So, for instance, you have the optimization step, you have the flow, you have-- here-- the positions being updated, and so on. So you can understand what comes next because of that graphical representation. So the things are organized. It's also great to help you communicate your processing to other people. You can also write notes in the graph that allow you to document what's going on.

So Quartz Composer is great to leverage OS technology with no code, as I demonstrated to you. And so Quartz Composer integrates technologies, but it's also integrated in the OS. It's integrated in Cocoa using QCView. So you can take that composition, put it in Xcode, and have an application out of it, again, with no code. It's also integrated with Core Animation using QCCompositionLayer. And it can also be used as a pure renderer in your applications, without even any OpenGL: a pure processing pipeline in applications to take some data and to return some data, leveraging OpenCL, while using Quartz Composer to create this graphical workflow.

And so the message is really that Quartz Composer can be this bridge that takes OpenCL and brings it to new limits, and also that integrates it in new areas, for instance applications, or some very new possibilities there. So Quartz Composer is in the seed, so you can try it. You can try the NBody example. And we are having a session this afternoon that will show you some very nice things about what we added in Quartz Composer, in particular interactivity.

So in a very short amount of time, we're going to create some interactive OpenGL for the browser. And I'm going to show you an example of the kind of synergy Quartz Composer can bring while bringing OpenCL into other areas, in particular with this ability to integrate. So let's go to a last demo on the demo machine, please.

So here I simply took this demo we're going to bring this afternoon and added the cloth simulation to simulate-- if it's coming up-- some paper, or some paper-cloth thing. So we just took this cloth simulation, wrapped it as a patch, and then we don't have to worry about what that patch contains, and put that within that simulation. And here's an example of what it can do. So it's a little elastic, so it's not real paper, but you can have some fun, and so you can bring these things here, and take the other people, and they grow automatically. So all these things are physically simulated using OpenCL. This takes OpenCL, which is not only for scientists; it's for everybody.

It's for everybody who wants to bring new possibilities to their applications, to whatever they're working on: speed in sorting, speed in visualization, or new possibilities in things that you thought were fixed. So for instance, an image doesn't have to be on a quad; it can be on this thing. And so here you can simulate some real page curl. So here we have some page curl going on. So you don't have to fake the page curl by trying to find the right function.

You can just take that image and say, "Okay, let's do a page curl to reveal what's beneath." And then, if I release, it's going to drop down. And so, to do that page curl, literally what we just do is take the two corners and put them up, and that's all you need. And so that's also what the simulation brings to you: the fact that you don't have to worry about the functions and all those things, because you have the simulation, because now it's abstracted with that patch. You can leverage it to do lots of very powerful things, but very easily, because all of it is hidden. And that's it. So can we go back to the slides, please?

So that was this little extra picture. So we have a new session this afternoon. Please come if you want to learn more about the technologies and about the new features in Quartz Composer to do these sorts of things. And now I'm going to bring back Jeff on stage for Q&A.