
WWDC08 • Session 724

Introducing OpenCL

Media • 1:07:22

OpenCL is a groundbreaking new technology that unleashes the blazing power of modern GPUs and multi-core CPUs for use by your application. See how OpenCL gives you full access to the computational capabilities of the Mac platform. Learn the details of the OpenCL architecture and discover its C-like language syntax and powerful built-in functions.

Speakers: Aaftab Munshi, Henry Moreton, Kevin Quennesson

Unlisted on Apple Developer site

Downloads from Apple

SD Video (762.8 MB)

Transcript

This transcript was generated using Whisper; it may contain transcription errors.

Hello, everyone. My name is Aaftab Munshi, and I'm going to talk about OpenCL and using OpenCL to do data parallel computing on the GPU and CPU. So OpenCL is a new framework that lets applications use the GPU as a general purpose computational processor. And not only that, typically today you do certain tasks on the CPU and you use the GPU for doing 3D graphics, rasterization or image processing operations. With OpenCL you can actually use the GPU not just to do general purpose programs, but you can actually combine the GPU and the CPU and take advantage of both of them. So in this session, I'm going to give a high-level architectural overview of what OpenCL is, dig a little bit deeper into how to use OpenCL and how one writes data parallel programs with OpenCL. And in the advanced OpenCL session, we really go into the nuts and bolts of how to use OpenCL, how you would optimize OpenCL parallel programs for the GPU and the CPU. So OpenCL actually implements both a data parallel and a task parallel computing model. And so just bear with me for a few more slides, and I will actually define what I mean by data parallel and task parallel. But in this session, we're going to focus primarily on the data parallel computing model.

So the OpenCL specification has been developed in collaboration with AMD, Imagination Technologies, Intel, and Nvidia. And so this is something we worked on very closely with these vendors, to define something that's really robust, robust in its feature set and robust in its ease of use. It has to be really easy for people to use for this to be successful.

And in addition, Khronos is an industry-wide consortium which includes the members that I talked about in the previous slide, and that develops standards such as OpenGL and OpenGL ES. And Apple has released the specification to Khronos, and a new compute working group has been started with the goal of taking this and ratifying it to be an open standard that will run across multiple devices and multiple platforms.

So why OpenCL? Well, there is a shift in how computer architecture is being designed. If you look at the CPU side, the model of basically increasing the megahertz of a single-core CPU and using that to gain any performance automatically for applications no longer holds true for a wide variety of reasons, a major one being power. So there's a shift towards multiple cores in your system. And that means applications really need to be able to take advantage of that to gain performance. And in addition to that, we have the GPU. It has evolved from being a fixed function graphics pipeline to being a programmable graphics pipeline which does vertex shaders and fragment shaders. It truly now is a general purpose computational coprocessor. So you can use the GPU to do a lot of things besides just 3D graphics. Not only that, it is really powerful in its computational horsepower. For example, if you look at the NVIDIA GeForce 8800GT, its peak gigaflops is 504. Compare that to, for example, a Mac Pro with an eight-core CPU running at three gigahertz. We're talking 96 gigaflops. So there's an appreciable ratio of performance that you can actually gain if you could use the GPU also. However, writing parallel programs is a lot harder. I mean, if you really want to take advantage of both the CPU and the GPU, then the model that you would use to write parallel programs for the CPU looks completely different from how you would use the GPU. So we need to have something that abstracts that, a framework that hides that difference and lets applications use these resources easily. So one more question -- well, why don't we just use a graphics API? Well, graphics APIs are centered on doing graphics-based operations. When I want to actually move my data parallel compute algorithm, I really want to think in terms of: I've got a bunch of functions that I want to execute, I want to supply data and generate data which can get reused. So it's a different model.
It's more like how you would write code on the CPU today.

In addition to that, there are features that are available in today's devices such as the GPUs, which are targeted specifically for compute. And I'll talk a little bit about that. And graphics APIs don't necessarily expose those. And we need to be able to take advantage of that to really get performance. So we need something that's different and provides a model that application developers are already familiar with today, which is writing functions. So... What are the goals or the requirements that OpenCL had to answer or solve? Well, the first thing is, you know, today we use CPUs for doing certain things, and we use GPUs for doing certain things. We want to actually leverage all the computational resources in the system, so we want to view CPUs and GPUs as peers.

Now, I talked about, you know, the 8800GT has 504 gigaflops, and the Mac Pro eight-core CPU has 96 gigaflops. Well, why not take advantage of both of them? You get 600 gigaflops. And there are certain tasks a CPU is really good for and there are certain tasks a GPU is good for; you may actually want to design your algorithms to take advantage of both of them. And we want to be able to do that. And so OpenCL allows you to do that.

The parallel programming model has to be really easy to use. And one of the reasons why a GPU is really a good example of a successful parallel computing model is that if you just think of it in terms of 3D graphics, I have a vertex shader, a fragment shader. I really write one shader, and then it gets executed with a bunch of vertices or a bunch of pixels. And I don't really have to worry about how parallelism happens. It just gets taken care of for me. And we want to do something that is similar, but apply it to functions.

So it's great -- I have the GPU with really amazing computational power. But I am trying to move my algorithm that used to run on the CPU to the GPU. I want to see consistent numerical behavior. And so we need to be able to specify the accuracy of floating-point computations. So when I do an addition, subtraction, or multiplication on the GPU, I expect to see the same result as I would on the CPU. And we actually need to take it a little further.

We have a bunch of math functions that are defined in math.h today. We want to be able to support that on any compute device, but also define what the minimum error bound is. Because if I am using the GPU for computations and the CPU for computations, I need a library that clearly defines what the minimum error bound is so I can design my algorithm accordingly. So that's really important. Numerical accuracy is really important. Then we can use OpenCL as the vehicle to drive future hardware requirements for compute as future revisions of OpenCL come into play. So where can I use OpenCL?

Any data parallel algorithm. Okay, next slide, I'll talk about what I mean by data parallel. Any data parallel algorithm that's a performance-critical path. So some of the domains where you would encounter data parallel algorithms would be physics simulation. Let's say you're doing rigid-body collision or fluid simulation. Image processing is a natural fit; I don't need to explain that. Signal processing, FFTs, data parallel. Video-audio encoding. Math libraries -- you know, BLAS, LAPACK could totally run on the GPU and the CPU. Non-traditional graphics rendering techniques like ray tracing, financial modeling, medical imaging. The list goes on and on. This is just a small subsample of where we can use OpenCL. All right, so what is data parallel computing?

Well, in general, parallel computing can be classified into two subcategories, task parallelism and data parallelism. So what do I mean by task parallelism? Well, let's say I have a library that operates on numbers. And I have an algorithm that produces the sum of the numbers. I have an algorithm that computes the min and max. And another algorithm produces the median.

Well, these are independent tasks or functions, and they can actually operate in parallel. And in the example I just gave, there is no dependencies between them, and they can actually be executed in parallel. But as a programmer, I have to really think about how do I break these things into multiple tasks which can execute in parallel. There may be cases where there are dependencies.

For example, let's say I want to generate a histogram, and then I want to take the output of the histogram to do some tone mapping. Well, I cannot execute the tone mapping in parallel to the histogram. There's a dependency that the histogram task has to complete first before I do tone mapping. So I need to basically, when I write something that's going to be task parallel in nature, I need to carefully think, break up these tasks, and define the dependencies, and then create the parallel model. That's what I mean by task parallelism.

In OpenCL, you can actually do that. And each task can actually be data parallel in nature. So in the example I just gave you about summing the integers, we're going to talk about how that would be described as a data parallel example. So they can coexist together. But what is data parallel computing? Well, if I have a function that is actually operating on multiple data elements, and the operation on each of those data elements can be executed in parallel, that's what I mean by data parallel computing. And that's basically been the most successful model of parallel computing. Traditionally, most supercomputers use that model. So let me describe an example.

Let's look at the scalar code here. I want to multiply a number in A, in array A, which has n entries, with another number in B, corresponding index in B, and the product I want to store in result. Well, if I were to write scalar code, it would just be a for loop. If I were to write a data parallel version, well, if you look at the code, each multiplication for any entry in A and B can actually execute in parallel. So if I could create n independent units of execution, where n is the size of the array here, then I can execute all of them in parallel. And so that's what I mean by data parallel computing. Well, let's take another example, a little more complex. Well, let's take the sum of all integers.
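The two versions described above can be sketched in plain C. The OpenCL qualifiers are stubbed out with macros so the kernel compiles as ordinary C, and the n independent threads are simulated by a host loop; the names dp_mul and run_dp_mul, and the stubs themselves, are illustrative assumptions, not the session's actual slide code:

```c
#include <stddef.h>

/* Stub out the OpenCL C qualifiers so the kernel compiles as plain C.
 * (Defining double-underscore names is just for this illustration.) */
#define __kernel
#define __global

/* Simulated thread ID: in real OpenCL each of the n threads gets its own. */
static size_t sim_id;
static size_t get_global_id(int dim) { (void)dim; return sim_id; }

/* Scalar version: one for loop on the CPU. */
void scalar_mul(int n, const float *a, const float *b, float *result) {
    for (int i = 0; i < n; i++)
        result[i] = a[i] * b[i];
}

/* Data parallel version: each thread multiplies exactly one element. */
__kernel void dp_mul(__global const float *a, __global const float *b,
                     __global float *result) {
    size_t id = get_global_id(0);
    result[id] = a[id] * b[id];
}

/* Host-side simulation of spawning n independent threads of execution. */
void run_dp_mul(int n, const float *a, const float *b, float *result) {
    for (sim_id = 0; sim_id < (size_t)n; sim_id++)
        dp_mul(a, b, result);
}
```

Because no element depends on any other, the n simulated iterations could run in any order, or all at once, and produce the same result.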

So if I wanted to sum all integers, then it's basically a for loop, which sums and produces a result. Let's look at how you would describe that as data parallel. And this is just one example, one way you would do it. There are many ways of doing this. Well, let's say I have the array of integers, and I have m threads, in this case, eight threads. Then each thread generates a partial sum. So in this example, thread zero takes the sum of minus two and one and produces minus one. Thread one takes the sum of 10 and five and produces 15 and so on.

So at the end of the first instance of running this code, I have eight partial sums. Then I can actually loop, and now take four of those eight threads and produce four partial sums. And then take two of those threads and produce two partial sums. So finally, I'll end up with something that has two partial sums, and then I can use the first thread, thread zero, to generate the final result. So this is one example of how you would do data parallel. So you have the first main loop and then a log n loop.
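That two-phase reduction -- m partial sums, then a log n combining loop -- can be sketched sequentially in plain C, with the m threads simulated one after another. The function name tree_sum and the fixed-size partial buffer (assuming m is a power of two, at most 64) are assumptions for illustration:

```c
/* Tree reduction: m simulated "threads" each sum a slice of the input,
 * then half of the threads combine pairs each round until thread 0
 * holds the final result. */
int tree_sum(const int *a, int n, int m) {
    int partial[64];                  /* one partial sum per thread      */
    int per_thread = n / m;           /* elements handled by each thread */

    /* First pass: each thread sums its own slice of the input array. */
    for (int t = 0; t < m; t++) {
        int s = 0;
        for (int i = 0; i < per_thread; i++)
            s += a[t * per_thread + i];
        partial[t] = s;
    }

    /* log2(m) combining rounds: thread t folds in partial[t + live]. */
    for (int live = m / 2; live >= 1; live /= 2)
        for (int t = 0; t < live; t++)
            partial[t] += partial[t + live];

    return partial[0];                /* thread 0 holds the final sum */
}
```

In real OpenCL the threads in a group would synchronize between rounds, since each round reads partial sums written by other threads.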

So what is the data parallel compute model in OpenCL? So define your computation domain, i.e. how many units of execution can operate in parallel on the data that your function is using. We call them threads, but these are basically just elements of execution; they're fine-grained in nature. They're not like pthreads. So that computation domain in OpenCL is the global thread count. So in the example we talked about where we just multiplied one array with another, and the array was n elements in size, I would spawn n threads of computation because they can all be executed in parallel. Now, if there is no communication between these threads, which was the case when I multiplied one number with another, I don't care -- I shouldn't have to care -- how the compute device goes and parallelizes that. And OpenCL refers to this as implicit mapping. And so for the mapping, you just specify how many independent threads of execution you want, and OpenCL will go figure out how to map and execute that on the compute device. In the example where I gave the sum of all integers, I needed to know how many threads were working, because they were sharing information as they were generating partial sums. That is an example of what I call explicit mapping. So in there, you really need to know how many threads are working together. And we call that a thread group. So a thread group has a special property that the threads in the thread group that are operating on the data can communicate with each other, which they needed to in the case of the sum of all integers, because you needed to share the partial sums together. So they can synchronize, they can coordinate, and communicate with each other. And multiple thread groups can execute in parallel. And they can execute in parallel on the same compute unit in a compute device or across multiple compute units in a compute device. So this is a high-level architecture overview. Well, how do I use OpenCL?
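For the explicit-mapping case, a global thread ID decomposes into a thread group ID plus a position within the group. A minimal sketch of that arithmetic for the one-dimensional case (the struct and function names are illustrative):

```c
#include <stddef.h>

/* A global thread ID splits into: which thread group, and where in it. */
typedef struct { size_t group; size_t local; } thread_pos;

thread_pos decompose(size_t global_id, size_t group_size) {
    thread_pos p;
    p.group = global_id / group_size;  /* which thread group         */
    p.local = global_id % group_size;  /* position inside that group */
    return p;
}
```

Only threads sharing the same group ID can communicate and synchronize with each other.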

So the first thing, what are the objects OpenCL deals with? And they're really categorized into three things: the compute device, which is going to execute your function; the memory, that's the data that you're going to use or produce; and your function, which we call executable objects. So let's go a little bit deeper into that. So a compute device is basically anything that can execute data parallel programs. And a compute device may actually have multiple units of execution that can execute multiple programs. So here is one example of a compute device. Here's another example of a compute device. So that's great.

Typically, I would use the GPU to do something. I would use the CPU to do something. But what if I have multiple GPUs, or if I want to use a GPU and a CPU? I don't want to have to go create a special connection with the GPU and a special connection with the CPU. I just want to view them as one compute device.

So in OpenCL, we call that a device group. It's basically a logical device. So you can do that. And guess what? Then all the resources and your functions get created for that device. So you just do that once, and then you can go execute the functions wherever you want.

Next thing, memory objects. Well, there are two kinds of memory objects. The first is array. It's basically very similar to something that you do when you do a malloc. You get a pointer. So in array, you get a pointer. It's a linear collection of elements. You can reference each element using a pointer.

The thing to note is, you know, on a CPU, the reads and writes will be cached. But on a GPU, typically, it won't be cached. And so there are performance implications of that, which you need to remember or recognize when you write your compute kernels or functions. And then images. There are two types, 2D and 3D. And the difference between an image and an array can be classified as the following.

In the image, the data is stored in a nonlinear format. So you do not get a pointer and cannot directly access the elements. So there are built-in functions in the OpenCL language that allow you to specify what you want to read, where you want to read, and where you want to write. In addition to that, you also get what we call a sampler -- controls that determine how you want to read from an image. So given a 2D coordinate for a 2D image, an XY, or a 3D coordinate for a 3D image, XYZ, you can also specify how to convert that to the actual position in the image using what we call the addressing mode, which is very similar to what OpenGL has, and the filter mode, if you want to do a bilinear filter or not.
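One common addressing mode clamps out-of-range coordinates to the nearest edge of the image. A minimal plain-C sketch of that behavior for a single-channel image (the function names and the flat row-major layout are assumptions; real image storage is nonlinear and opaque):

```c
/* Clamp-to-edge addressing along one axis: out-of-range coordinates
 * snap to the nearest valid index instead of reading out of bounds. */
int clamp_to_edge(int coord, int size) {
    if (coord < 0) return 0;
    if (coord >= size) return size - 1;
    return coord;
}

/* Read a pixel from a w x h single-channel image, with clamping. */
float read_clamped(const float *img, int w, int h, int x, int y) {
    return img[clamp_to_edge(y, h) * w + clamp_to_edge(x, w)];
}
```

With a sampler, this policy (and the filter mode) is selected declaratively rather than written by hand in every kernel.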

And the last object -- well, the function, the data parallel function that we have, is called a compute kernel. So here's an example. So ignore some of the tokens like __kernel and __global for now, and I will talk about that. So it just looks like C code. And a program would be analogous to a dynamic library. It has a bunch of functions. Some of them are exported. So the functions that are declared with the kernel qualifier are your exported functions that you can actually call.

So how does one write compute kernels? Well, we wanted to make sure it's something people are very familiar with. And so OpenCL C is basically derived from C99 and has additions to it. And some of the additions are vector data types. And vector data types can be two-component, four-component, eight-component, up to 16 components, for integer and float types. And you can do all the basic operations you would do. Like if you say A plus B, it works for a vector type as it would for a scalar type. Then there are image types. So you can actually use image data types and reference them and call the built-in functions to read and write. There are address and function qualifiers. The only function qualifier that exists today is the kernel qualifier, which is basically something that tells us that it's a compute kernel.
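In OpenCL C, `a + b` on a built-in vector type like float4 works component-wise; plain C has no such built-in, so a sketch needs a struct and a helper function to mimic it (the names float4 and float4_add here are illustrative stand-ins, not the OpenCL built-in):

```c
/* Mimic OpenCL C's four-component float vector in plain C. */
typedef struct { float x, y, z, w; } float4;

/* In OpenCL C this is just "a + b"; here it must be a function. */
float4 float4_add(float4 a, float4 b) {
    float4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}
```

The point of the built-in types is exactly that the compiler knows the component-wise semantics and can map them onto the device's vector hardware.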

And address qualifiers -- I will talk about that in a few slides. And there's a rich suite of built-in functions. So all the math functions that you would see today in C99 are supported. There are functions for doing the synchronization that you would need for threads in a thread group. There are atomic functions. There are conversion functions to convert from one data type to another, like an int to a float and so on. So let's take an example. Remember the sum of all integers? So this was the first loop.

So what would that code look like? So there's the __kernel qualifier on dpsum, and there was the input array, which is called a, and tempSum is basically the array that will hold the partial sums, and finally I'll store the result in result. So ignore the global and local qualifiers for now. So each thread is identified by a unique ID, and it has a global ID. So if you had spawned n global threads, each gets a unique ID, which can be one-dimensional, two-dimensional, or three-dimensional. And that has an advantage. And we'll talk about that in the next example I give. And each thread in a thread group, which is the explicit working group that works together, has a unique ID as well. So we generate the partial sums here, the first partial sum. So we'll have eight of them. And like I said in the example I gave, then you loop until you finally get two partial sums. And then thread zero computes the final sum. So here is what the code looks like.

So the first loop was getting a partial sum across n threads. Then we summed the partial sums, and then we generated the final sum. Let's take another example. Let's say I want to do an image filter where I want to do a reflection across the Y axis. In this case, I have an image.

So each element in this particular filter, since I'm doing a reflect, can operate independently. So each element or pixel in the image can actually be an independent unit of execution. So the threads I want to spawn here would be the width times height of the image. I think it would be better to describe this not as a one-dimensional problem but as a two-dimensional problem, because if I have a two-dimensional identifier, I can actually use that identifier as my coordinate to get into the image. And you can do that. And so you call read image to read, then compute the reflected coordinate, and then write. Well, with images, what happens is, you know, the underlying data format can be different, but when you read or write, the output of the read is always four floats, and the input to the write is always four floats, and automatic conversion happens. Well, what if I already know what the format is and I don't want to use images? So let's say it's RGBA 8888, 8 bits per channel. Can I use something else? Yeah, you can. In this case, you can declare it as an array.
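The reflection filter itself is tiny: each output pixel reads the mirrored column of its row, so every pixel is an independent unit of execution. A plain-C sketch, with the width-times-height threads simulated by two loops over a single-channel image (function name and layout are illustrative):

```c
/* Reflection across the Y axis: output pixel (x, y) reads input pixel
 * (w - 1 - x, y). Each iteration is independent, so in OpenCL every
 * pixel would be its own thread with a 2D global ID used as (x, y). */
void reflect_y(const unsigned char *src, unsigned char *dst, int w, int h) {
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[y * w + x] = src[y * w + (w - 1 - x)];
}
```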

So you have many choices available. Now, these are not optimized examples, so don't go and write this and expect full performance. So let's talk about -- remember, in some of the slides, we were talking about __global and __local. So what are these things? Well, in OpenCL, we have memory address spaces. And these describe the different memory subsystems or hierarchy that may exist in the compute device. On some compute devices, they may be collapsed. And I'll talk about that. So first thing is, remember, each thread is an independent unit of execution and has what is called its own memory, or we call it private memory. Okay? Then the threads that are in a thread group that want to be able to share data -- well, they need some buffers to be able to share data with. And that's called local memory. And local memory -- so in the case of GPUs, the local memory would actually be a user-managed cache that your compute kernel is going to manage, whereas on the CPU it's just basically cached memory that maps to your DRAM. And then there's global memory, which is where your input and output data finally get read from or stored. And there's a property of global memory. Let's take an example.

Let's say I want to do a convolution filter. The filter weights can be stored in global memory, but they're basically just being read. I'm not modifying them. And it has a unique property that each thread, as it's applying the filter, wants to read the same filter weights. So wouldn't it be nice if there was a way to cache that and be able to access it really, really fast? And that's what the constant qualifier is indicating. So that tells the OpenCL compiler that there's a unique property behind this constant.
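The access pattern being described -- every thread reading the same small, read-only weight table -- looks like this in a plain-C sketch of a 1D three-tap convolution (names, tap count, and the edge clamping are assumptions for illustration; in OpenCL C the weights pointer would carry the constant qualifier):

```c
/* 1D 3-tap convolution. The weights array is read-only and shared by
 * every "thread" (here, every loop iteration), which is exactly the
 * pattern the constant address space is designed to make fast. */
void convolve3(const float *in, float *out, int n, const float weights[3]) {
    for (int i = 0; i < n; i++) {
        float s = 0.0f;
        for (int k = -1; k <= 1; k++) {
            int j = i + k;
            if (j < 0) j = 0;          /* clamp at the left edge  */
            if (j >= n) j = n - 1;     /* clamp at the right edge */
            s += weights[k + 1] * in[j];
        }
        out[i] = s;
    }
}
```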

And these memory address spaces can only be specified on pointers. So anything you declare as a pointer, you need to tell us what the memory address space is. So we looked at the high-level architecture overview. We looked at what the objects are. We looked at how to write compute kernels. So I wrote a compute kernel.

I think it's going to perform really well. Well, how do I get the compute device to execute it? What code do I need to run on the host to execute it? There are three things you need to do before you can execute. First is you have to initialize. Well, that means I've got to decide what compute device I want to use. I want to create a context on it, and I'll talk about what I mean by context. I need to allocate the resources, which is my buffers.

I need to load my program, build my executable, get the kernel functions, and then I'm ready to execute. So let's take this example, dpMul, where we were multiplying one array with another. There are three arguments to this function. Okay? So I want to execute it. Let's say I decide I want to use the GPU. So, you know, I create the device group using a GPU. So I get a device that indicates the GPU. And now I need to create a context on it. And a context is basically a conduit or a channel which lets me send commands to the compute device. Multiple contexts can be operating in parallel.

So I create my context. Now I need to create my resources. So those are the three resources I create. Now remember, for the two input arrays, A and B, I already have data, but in this case, the data's on the host side, and I need to transport it over to the GPU because I'm going to execute this kernel on the GPU. So when I create my array, I say, here is the input data, go copy it. Okay? And I can also give additional properties saying, well, A and B in this case are read-only, whereas memory object 2, which is the result, is read and write.

Now I want to load my program. Well, so I have my source. I load it. I build the executable. And I then create the kernel. Create kernel is basically very similar to-- if you do a load library, then you call get proc address to get your address of the function. That's effectively what it's doing.

Now, before I call my function -- so when I make a function call, I call the function and I specify the arguments. I specify the arguments in OpenCL using something called CL set kernel args. So in this case, the input arguments are the three memory objects I created, and their sizes are the size of the memory object type. So all memory objects that are created on the host side are opaque handles. And they get mapped to the right pointers by the OpenCL framework when the function actually gets executed on the device. So on the host side, you never have to deal with pointers.
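The idea that host code passes opaque handles, which the framework resolves to device pointers only at launch, can be sketched with a toy handle table (everything here -- the table, mem_handle, create_mem_object, resolve -- is an illustrative assumption, not OpenCL API):

```c
/* Toy model of opaque memory-object handles: the host holds an integer
 * handle, never a device pointer; the "framework" owns the mapping and
 * resolves it when a kernel is actually launched. */
#define MAX_OBJECTS 16
static void *object_table[MAX_OBJECTS];   /* handle -> real storage */
static int   object_count;

typedef int mem_handle;                   /* opaque on the host side */

mem_handle create_mem_object(void *storage) {
    object_table[object_count] = storage;
    return object_count++;                /* host only sees this int */
}

/* "Launch time": resolve a handle to the pointer the kernel will use. */
void *resolve(mem_handle h) { return object_table[h]; }
```

This indirection is what lets the same kernel and arguments execute on any device in a device group: the framework picks the right pointer per device.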

All right, so I want to execute my kernel now. Everything is set up. So I need to specify the context I want to use. I need to specify the kernel I want to execute, and what compute device I want to execute it on. Well, why do I care about that? Well, if you just created a device group which has only one device in it, then you don't. You can just say null, because there's only one place you can execute it on. But remember we said you can actually create a logical device that can contain multiple GPUs, or GPUs and CPUs, so we need to be able to specify where we want to execute this. And that's what this is telling. How many independent units of execution? Now, whether I do implicit mapping or explicit mapping -- i.e., whether I care if the threads communicate or not -- if I don't care, then I don't need to specify the thread group size.

So just null. And how many dimensions? This is -- what is the dimension that identifies the thread ID? In this case, it's an array. It's a single-dimensional index, so it's one. And there are other parameters, but that's how you execute the kernel. This basically queues a command to go execute the kernel. Now, for the execution model in OpenCL, there are two models allowed. One is you can queue things, commands. Commands are always queued in order. You may choose to then say I want to also execute them in order. So for example, if I execute kernel A followed by execute kernel B, it is guaranteed that B will only start once A has finished. So if there's a dependency, in terms of the output of A becoming the input of B, then you can set the property with the context saying I want to execute in order. Or you can say I want to execute completely out of order. It's asynchronous mode. In which case the only guarantee is that commands get queued in order, but they can be executed totally out of order. And what happens then is each operation, like an execute kernel or a read or write, gets identified by an event handle, an object that is returned back to you. And you can use these then to determine the synchronization you want to apply.
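The out-of-order-plus-events model can be sketched as a tiny scheduler: commands may run in any order, except that a command waiting on an event cannot start until the command signaling that event completes. All names here (command, run_queue, the event arrays) are illustrative, not OpenCL API:

```c
/* Each queued command optionally waits on one event and signals one
 * event when it completes; a pass-based scheduler runs whatever is
 * ready, modeling out-of-order execution with event dependencies. */
typedef struct {
    const char *name;
    int wait_event;     /* -1 = no dependency    */
    int signal_event;   /* event set on completion */
    int done;
} command;

/* Run all ready commands repeatedly until everything has executed,
 * recording the order in run_order; returns the number of passes. */
int run_queue(command *q, int n, int *event_done,
              int run_order[], int *ran) {
    int passes = 0;
    *ran = 0;
    while (*ran < n && passes < n + 1) {
        for (int i = 0; i < n; i++) {
            if (q[i].done) continue;
            if (q[i].wait_event >= 0 && !event_done[q[i].wait_event])
                continue;               /* dependency not satisfied yet */
            q[i].done = 1;
            event_done[q[i].signal_event] = 1;
            run_order[(*ran)++] = i;
        }
        passes++;
    }
    return passes;
}
```

Even if the dependent command sits earlier in the queue, it only runs after the command it waits on has signaled its event.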

So remember the task parallelism example I gave where you wanted to do an image histogram followed by HDR tone mapping? Well, each of them actually is a data parallel kernel or function, but they're independent tasks. So I would call execute kernel on the histogram, get an event that identifies this execution, and then say, okay, execute the HDR tone mapping algorithm, but only after this event has elapsed. And those are the missing arguments; the dot, dot, dot basically lets you specify what the events are. And that's specified in the specification that's available on the attendee website.

All right, great. I executed my kernel. I got data. I want to visualize that data now. So how do I do that? I'm going to use OpenGL. So how does OpenCL work with OpenGL? Well, we want to make sure that any GL objects, such as a texture or a buffer object, vertex or pixel, can actually be used as a CL array or a CL image, which you can. Well, not only that, we want to make sure that they both point to the same bits. We don't want to be doing any copying, so sharing has to be really, really efficient. CL can generate some data which can actually be directly used as the input to GL without requiring any copy, and that's how it's implemented. So how do you share these resources? Well, the first thing you need to do is make sure that the GL context is created using a CGL share group.

You want to make sure that the compute device group that you are creating with CL points to that same share group, which is basically saying, I'm going to point to the same devices. Then when I allocate my arrays or images, I use this flag called memalloc reference, which basically says that I'm not going to allocate any memory behind this, but I'm going to use whatever memory has been allocated by GL. And then sharing is very easy. I attach -- so I can attach to a GL image or a GL buffer object, and then use that as an argument in my CL set kernel args. And I can execute multiple kernels. When I'm done with it, I detach. And that's basically telling GL that, okay, CL is done with it, now you can use it.

So in summary, we have a new compute framework that works across GPUs and CPUs. It uses something we're familiar with, C with extensions. We define numerical precision requirements so that there's a consistent minimum error bound guaranteed across compute devices. And this will be the open standard for compute on desktop platforms. Well, let me introduce Henry Moreton from NVIDIA. He's a distinguished engineer in the architecture group at NVIDIA. And he's going to talk about GPUs and OpenCL and how they are such a close fit. Thank you. Henry Moreton: Thanks, Aaftab.

Good morning. So I'm Henry Moreton, obviously. And I have a few things I'd like to talk with you about. First of all, I would like to give you some idea why you should be interested in OpenCL. And then we'll drop into kind of a brief discussion of CPUs versus GPUs and some of their sort of respective characteristics. And then I'll spend the majority of my time talking about the GPU as an OpenCL processor. The image at the bottom of the screen is kind of a schematic view of a series of processing clusters that are kind of typical in the GPUs that we build today.

So why is this interesting? Even at the outset, I mean, this is the genesis of it. This is the beginning of OpenCL as a computing platform. But it happens that OpenCL is supported on all shipping NVIDIA GPUs. That means that any GPU that ships today is capable of supporting OpenCL. OK, so that's nice. But in fact, this has been the case for a while now. There are 70 million OpenCL capable GPUs already out there in the marketplace in people's hands. And that number is going up at about 2 million a week. So there's a non-trivial-- possibly a non-trivial user base.

Now, as far as the Mac platform or the Apple platform, these are the current products that have OpenCL capable GPUs in them, ranging from very high-end GPUs down to kind of entry-level integrated systems. Okay, so what about GPUs and CPUs? The 8800 GT, I think, is actually what's shipping in the Mac Pro. It has 112 cores and supports about half a teraflop and has some pretty significant memory bandwidth. If you compare that with kind of a contemporary desktop CPU, it has two cores. It has a little bit less than a tenth the floating point horsepower. And in round numbers, it has about a tenth of the memory bandwidth. So for very data parallel, data intensive, bandwidth intensive, floating point intensive workloads, the CPU is not as capable a platform. Both of these support IEEE floating point. And I'll stress that Affie actually was fairly clear about this. When I say IEEE floating point, I'm referring to rounding modes. The GPUs don't yet support floating point exceptions, those sorts of things. So it's not the full IEEE standard. It's important to understand that they're both programmable using OpenCL, which is actually great because one of the things that the CPU enables is the use of OpenCL across all Macs that are shipping today.

Okay, so what about the GPU as an OpenCL processor? Today, the GPU is composed of a large number of processing blocks that we call texture processing clusters, or TPCs. The original name stems from its graphics heritage. Now, each TPC has a series of texture filtering blocks and a texture L1 cache. You can see it on the diagram. And then they're further broken down into multiple units that I've called thread group processors; internally at NVIDIA we call them SMs, which stands for streaming multiprocessor. And a thread processing cluster can have varying numbers of these thread group processors. On the right, you see a TPC composed of two of them. On the left, in sort of smaller scale, there are three. And depending on the market segment that you're going after, one will be more appropriate than the other.

Now each of these SMs, or thread group processors, is made up of a collection of scalar processors, and there are eight per thread group processor. You can see the local memory that Affie was talking about. It's explicitly embodied in the architecture. It's a program managed cache. We also implement all of the address qualifiers, as I've called them here; Affie referred to them as address spaces. But basically the hardware implements the private address space, the local and global address spaces, as well as optimizing constant accesses.

Now, when I say it implements the private address space, there are instructions in the instruction set that actually reference memory, and then the hardware takes care of mapping that to memory in an efficient fashion in terms of interleaving the memory references from all of the threads that are executing in parallel so that the memory accesses are efficient. The same is true of local and global memory, and I'll talk a little bit more about that in a moment.

So if you take and put together a whole group of TPCs, you can build up a chip. And what we see here at the higher level is a set of four TPCs, which are then set on top of a memory subsystem that's quite wide, which is how we achieve the remarkable memory bandwidth the system supports.

And you can also see on the left side in gray this little block that says atomic. So the language supports atomic operations on memory, and we implement those in what we call fixed function hardware, which sits very, very close to the memory, so that the read-modify-write of the atomic operation can be extremely efficient. You'll also note that the host CPU is connected to the GPU via PCIe, but everything else in this diagram is on die, with the exception of perhaps multiple gigabytes of memory.

So given that, we can build a variety of different platforms by incorporating different numbers of TPCs and different numbers of memory partitions. So you can scale the bandwidth of a product. You can scale the compute capabilities of the product, depending on your point in the marketplace. So going back to the Mac configurations that we were talking about earlier, we have products ranging from 130 gigaflops up to 504. And you can see the varying numbers of SPs in these differing products. These are all fairly powerful machines, but there's still a wide range of performance from the bottom to the top.

Okay, so I talked a little bit about the kinds of things that we've implemented in the GPU in terms of providing direct support for OpenCL. What other sort of attributes are there? What's interesting? Well, one of the really critical and extremely valuable attributes of OpenCL and our architecture, frankly, is that there's transparent scaling in terms of performance across the product lines. You can take the same piece of code and run it efficiently on a single-TPC low-end system; that's the 8-SP MGPU (we call them motherboard GPUs), a relatively low-end machine. You can take that same piece of code and run it on even conceivably a dual GPU system that has 256 processors and supports about a teraflop of performance.

And the code will just run faster. I mean, it isn't perfectly magical. I mean, there are some limitations to scaling. But one of the wonderful things about this is when we ship a new product that has higher performance, has more cores, your application, assuming that at least some of you are application developers, just runs faster without any effort on your part. Right.

OK. Another attribute-- so actually, flipping back to this for just a second to clarify or kind of illustrate why this comes about: Affie had mentioned the thread group. So the thread group is the unit of execution that's guaranteed to execute at any given instant. And those thread groups run on these thread group processors.

And you might have a machine with a single thread group processor, or you might have a machine with, say, 32 or 64 thread group processors. Because there's no dependency among thread groups, we can run them serially or we can run them all in parallel, and that's where this scaling comes from. Okay. Another attribute on the GPU side is that we support thousands of threads.

And I mean it very literally, thousands of threads. Now, what does that buy us? Well, you can use a very, very sort of deep workload to hide the latency of compute operations. For example, if you perform some costly -- say you perform a sine function or a multiply add that has maybe 10 or 20 clocks of latency.

Well, while that is sort of grinding through the machine, I can run 10 other threads to hide the latency of that operation so that we never have trouble with dependent operations on the GPU, or rarely do. The same holds for memory accesses. Now, as Affie mentioned, in current shipping GPUs, there are few memory caches. There is a memory cache that sits behind the constant accesses, and I'll talk a little bit more about that in a moment.

But accesses to global memory go all the way out to the DRAMs. And it is by having hundreds and literally thousands of threads that we can hide the latency of memory access. So OK, thousands of threads, that's great. One of the things that we've done in the design of our processors is to make sure that the threads are extremely lightweight. You can imagine that if you have 1,000 threads or tens of thousands of threads, the threads have to be very lightweight. The threads are all managed in hardware. They're created, scheduled, and even the synchronization operations among threads are all managed in hardware. So there's no visible overhead to the application for these very large numbers of threads.

Okay. Now, in terms of the way we design hardware, we look at the kind of typical operations that are being performed by applications and optimize the hardware to make those operations as fast as possible. So one of the things that you often see in a program, a kernel, as Affie was describing, is that all of the threads that are executing along are often, in fact the vast majority of the time, at the same program counter. They're executing the same instruction. So we optimize that.

If a wavefront of SPs all reads from the same memory location, all reads the same instruction, it does it in a single clock. If, on the other hand, some of the threads within that wavefront take the if path and some of the threads take the else path, it takes two clocks, or it takes two instruction memory fetches. So we make the common case go extremely fast, and we make the other cases go as fast as possible, sort of providing for the right trade-off in terms of the instruction cache design, because ports into an instruction cache are expensive, and the caches tend to grow roughly linearly with the number of ports. We also optimize the accesses to the local memory as well as global memory, in the sense that typically threads will access a series of consecutive locations in memory.

So what the hardware does is it coalesces all of those requests into sort of one cache line fetch. And if, because your algorithm has slightly less structure, they access multiple cache lines, we take exactly as many clocks as there are cache lines referenced. And the hardware takes care of managing all of that. So there's no need for the programmer to deal with creating masks to perform a vector load but mask off the elements that they don't want to actually be loaded. So none of that overhead in terms of lane management falls to the programmer. So not only do you not have to program it, but it's also dramatically more efficient.

Now, as far as the constant accesses, Affie had mentioned that it's very beneficial to identify regions of memory that are read-only and also regions of memory that are dominated by a large number of threads accessing the same memory location. It happens that in graphics hardware, the heritage of the constant cache comes from the transform and lighting engines of like eight years ago, because in those engines, there was this chunk of memory which held matrices for transforming vertices, and every thread that was executing was going to access the same coefficient every clock. It happens that that kind of characteristic memory access pattern shows up not only in image processing operations, but in actually a very, very wide variety of compute loads. So what we do is optimize for that.

However, if there's some renegade thread within a wavefront, a group of threads that are executing all along at the same time, if it goes off and reads from some other location, the hardware manages that gracefully. It simply takes another clock, typically, to read this other memory location. But again, it's all managed in hardware on behalf of the application. So there's no additional compiler or programming necessary to get the performance there.

And then finally, obviously, we have huge amounts of bandwidth. And we can stream data from the DRAM onto the chip, into local memories, into registers, and operate on it at a sustained rate, such that you can keep all of the compute units fully occupied during the execution of a kernel.

So I guess to recap, You should understand that our GPUs in OpenCL were designed together. I've spent a great deal of time at Apple, literally a couple of hours a week. Fortunately, they're not too far from campus, our campus, that is. I think it should be apparent, if you haven't figured this out already, that the GPUs are a tremendous platform for running OpenCL. They're extremely efficient. If the workload is appropriate, you get great performance.

And you get, you know, in terms of absolute performance, as you move up the product line, you get sort of rewarded for moving up the product line with even better performance. Also, in terms of efficiency, and this is, depending on what environments you're working in, a really critical issue.

They're extremely efficient from the perspective of what we call perf per watt. One of our design criteria when building GPUs is to optimize the performance delivered against the power consumed, because if you have a laptop, you don't want it to get too hot. You need to be able to cool the devices that are inside it. And it turns out that power consumption, battery life, thermal dissipation, all of these things kind of determine what you can get away with in a given platform. And then finally, to revisit the open standard aspect of OpenCL, it will be supportable across all of our products. So the pictures across the bottom of the slide are exemplary of the various product lines from NVIDIA, and we plan to continue to ship OpenCL-capable machines. And that's it. That's all I had today. Our next speaker is Kevin. Thank you. Thanks.

So hi and thanks. So you've learned about OpenCL. It's a very, very cool and very exciting technology. And so you might want to do your OpenCL development in your usual development environment, such as Xcode or whatever you would prefer. But there's this great tool on the OS called Quartz Composer that can help you. So Quartz Composer basically can be seen as a visual programming environment. So it's pretty much a graph that organizes a processing flow, where each of the nodes of this graph is a patch that takes some inputs, does some processing, and returns some outputs. And so Quartz Composer is used for doing graphics animation or image processing or compositing pipelines, and also data visualization.

But the great thing about Quartz Composer is that most of the time these things are done with no code at all. And so for those of you OpenCL developers who want to learn about OpenCL, learn about the language and discover it, or go further, or explore what OpenCL can do for you, Quartz Composer is a great place to look at, because it will handle everything you don't want to handle. For instance, OpenGL attachments for drawing, or vertex buffer objects, textures, the texture caching pool, OpenGL contexts, and also the OpenCL program setup and device setup and so on. So you just go inside Quartz Composer, and you have an application where you can directly, simply go and type your kernel code and start playing and interfacing with other patches.

So another great thing about Quartz Composer is that it integrates most of the OS technologies, so, for instance, QuickTime, Core Image, Core Media, Core Audio, and so on. So you can use and leverage these OS technologies in your OpenCL development, for instance, as inputs to your kernels or as environments to visualize or to better explore and present the result of a simulation.

So Quartz Composer adds OpenCL support. And so what that means is we have an OpenCL kernel patch that's as simple as just one patch, so one of these processing nodes. And in this patch, you can write an OpenCL kernel. And Quartz Composer will automatically parse this kernel, find its arguments, and understand what the input ports and the output ports are, and automatically populate the patch, so that you can have some abstract processing unit that you can insert in some processing flow, so in some Quartz Composer composition.

And so that's very easy. You open Quartz Composer. You create an OpenCL kernel patch. And right there, without having to set up anything else, you can start doing some OpenCL and start leveraging Quartz Composer to visualize the result, to eventually go pretty far. So we'll show that in a demo in a second.

So one other great advantage of Quartz Composer is that everything's at runtime. So you don't need to build and run whenever you make a change in the kernel or change something. You can simply go in the settings of the patch and inspect the kernel code. And as you change, as you type in the kernel, Quartz Composer will automatically call CL to recompile the program in the background, so that your simulation, or whatever processing in OpenCL you're doing, will pick up the change as the simulation or the processing goes. So I'm going to demonstrate that right now on the demo machine.

So here I'm going to open Quartz Composer. So this is the NBody demo that you've been seeing previously in the show, so you have 16,000 particles, 16,384 exactly. And here is how the composition looks. So you have the NBody kernel, which is here, and simply the result, which is the position, the output of that kernel, is passed into a renderer that will simply render the result of the OpenCL simulation.

And that's as simple as it can be, as simple as it is. And Quartz Composer also provides you with patches to interact with the visualization, so you can very easily do some rotation like that. And so let me show you now the settings of that OpenCL kernel. So here I have all the code of the kernel, which is a copy and paste of the OpenCL kernel from all the other demos. So there's no translation to do.

Just paste that stuff in, and it works. And here I have a line that I commented out that simply takes the time step and multiplies it by a constant, which is 5. So I can just uncomment that line. And immediately, the simulation at the current time picks that up and goes five times faster. And if I comment that out again, we're back to the previous time step. So it's very easy as you go to change the kernel and to see the results. It's a very nice way, without having to build and run, to iterate really fast, and to go to where you want to go, and probably further, much faster.

So let me start the simulation again, and show you another power of Quartz Composer, which is that you have access to all the Quartz Composer pipeline and other possibilities. For instance, you can create very easily this little UI that allows you to change some parameters. But you can also leverage the Quartz Composer pipeline to create some interesting effects. And so for instance here, I'm going to create a trail of all of these particles over time. So here you can see in three dimensions, over 200 NBody simulations stacked over time and composited together. So Quartz Composer is going to handle everything, the vertex buffer object caching, and the display. So you don't have to worry about anything.

So how it works, to do that motion blur: it's a simple Queue patch that will stack up 200 vertex buffer objects, so 200 results of that simulation, and then an iterator, which is a patch that will render what's contained within that queue for all its elements, so all the elements contained in that queue, so 200 times. So here, for instance, you have 200 vertex buffer objects of 16,000 particles drawn in front of you by Quartz Composer in real time. So let me go back to the slides, please.

Can I go back to slides, please? Thanks. So, more details on the OpenCL kernel patch. By default, the OpenCL kernel patch is going to handle everything automatically. It's going to find out the output and input ports from the kernel prototype. It's going to figure out the thread dimensions and the output dimensions, so the dimensions of the output arguments. But you have the option to override everything. So you have the option in the settings to override the local thread dimensions, all these things.

By default, it's very simple, but you can apply all the customization that you want. So it's simple, yet very powerful. So we really want Quartz Composer to be a very, very good OpenCL development tool. And so we added to the Quartz Composer editor itself new functionality to make that support really optimal. In particular, we added this ability to define a global OpenCL kernel--OpenCL program, sorry. So what that means is that if you have a kernel patch and you write an OpenCL kernel within that kernel patch, this code and this kernel will be accessed by the patch itself. But using this global code view, you can write an OpenCL program, so maybe multiple kernels, and these kernels will be accessible by every patch in the composition, so they will be shared. So if you change a kernel in the program, it will be reflected automatically in all the patches. So that's a very nice way to have a centralized location for OpenCL development. And it allows you also to bring any sort of OpenCL code within Quartz Composer and start using it and leveraging it in the composition.

So another thing we added to Quartz Composer is a new patch specifically targeted at OpenCL simulations. So a simulation pretty much means that I have a kernel that iterates over time over the same objects, for instance, position and velocity. So in the NBody case, that's what we do. We have position and velocity. Each time step, we use the NBody kernels to update that position and velocity.

But Quartz Composer only has access to a slice of that time step, so to a patch at a given time. And so what we need is that this patch can have, at the next time step, in its inputs the result of this same patch at the previous time step, so at the previous execution of the composition. So we added a very powerful feedback patch that allows you to do that sort of feedback. And so the feedback patch at T equals zero is simply a pass-through. So it will pass the arguments coming on the left, so on the input ports, to the output ports, and pass them to the kernel. But at T different from zero, so if it's bigger than zero, it will feed back the previous outputs on the inputs of the patch. So that allows you to do this sort of evolution over time. And it's very powerful.

It can also be used in iterators if you want to iterate the same simulation or the same processing at a given time step. And so that allows you to do a lot of things. And so I'm going to show you an example of the sort of thing that we can do with that feedback patch.

So here I have a cloth. So that's as simple as it is. We use the feedback patch to do some cloth simulation, leveraging OpenCL in Quartz Composer. And so you can play around with the cloth. You can move it. So it's kind of fun. It's kind of elastic, but it's really nice. So you can do all that stuff. So the feedback is there used to propagate the result over time, but also, at a given time step, to iterate the simulation over all the springs. So here you have a set of points, with springs between them.

And so to enforce these springs, we use the feedback patch to propagate over all the cloth. So you can have fun. To import that image, it's as simple as an image source patch that will bring in the image. And you can simply put that-- so here's the composition-- as an image port on the mesh. And it's automatically textured by Quartz Composer. So in that composition, just like in the NBody one, there's no code beyond the OpenCL code. And here I can play, because Quartz Composer is integrated within the OS. I can, for instance, apply a Core Image effect on that texture, and it's composited and everything very easily. And you see the alpha blending on the side. All right.

Okay, and so to sum up, Quartz Composer is great to help you visualize the result of an OpenCL simulation, an OpenCL kernel, anything. It's great to help you interact with that visualization. And as I showed you in the composition, it's also great to understand the different parts, the different processing parts, of what's going on. So, for instance, you have the optimization step, you have the flow, you have-- here-- the positions being updated, and so on. So you can understand what comes next because of that graphical representation. So the things are organized. It's also great to help you communicate your processing to other people. You can also write notes in the graph that allow you to document what's going on.

So Quartz Composer is great to leverage OS technology with no code, as I demonstrated to you. And so Quartz Composer integrates technologies, but it's also integrated in the OS. It's integrated in Cocoa using QCView. So you can take that composition, put it in Xcode, and have an application out of it, again, with no code. It's also integrated with Core Animation using QCCompositionLayer. And it can also be used as a pure renderer in your applications, without even any OpenGL: a pure processing pipeline in applications to take some data and to return some data, leveraging OpenCL, while using Quartz Composer to create this graphical workflow.

And so the message is really that Quartz Composer can be this bridge that takes OpenCL and brings it to new limits, and also that integrates it in new areas, for instance applications, or some very new possibilities there. So Quartz Composer is in the seed, so you can try it. You can try the NBody example. And we are having a session this afternoon that will show you some very nice things about what we added in Quartz Composer, in particular interactivity.

So in a very short amount of time, we're going to create some interactive OpenGL for the browser. And I'm going to show you an example of the kind of synergy Quartz Composer can bring while bringing OpenCL into other areas, in particular with this ability to integrate. So let's go to a last demo on the demo machine, please.

So here I simply took this demo we're going to bring this afternoon and added the cloth simulation to simulate-- if it's coming up-- some paper, or some paper-cloth thing. So we just took this cloth simulation, wrapped it as a patch, and then we don't have to worry about what that patch contains, and put that within that simulation. And here's an example of what it can do. So it's a little elastic, so it's not real paper, but you can have some fun, and so you can bring these things here, and take the other people, and they grow automatically. So all these things are physically simulated using OpenCL. This takes OpenCL, which is not only for scientists; it's for everybody.

It's for everybody who wants to bring new possibilities to their applications, to whatever they're working on: speed in sorting, speed in visualization, or new possibilities in things that you thought were fixed. So for instance, an image doesn't have to be on a quad; it can be on this thing. And so here you can simulate some real page curl. So here we have some page curl going on. So you don't have to fake the page curl by trying to find the right function.

You can just take that image and say, "Okay, let's do a page curl to reveal what's beneath." And then, if I release, it's going to drop down. And so, to do that page curl, literally what we just do is take the two corners and put them up, and that's all you need. And so that's also what the simulation brings to you: the fact that you don't have to worry about the functions and all those things, because you have the simulation, because now it's abstracted with that patch. You can leverage it to do lots of very powerful things, but very easily, because all of it is hidden. And that's it. So can we go back to the slides, please?

So that was this little extra picture. So we have a new session this afternoon. Please come if you want to learn more about the technologies and about the new features in Quartz Composer to do these sorts of things. And now I'm going to bring back Jeff on stage for Q&A.