Application • 53:35
Learn how to leverage the threading architecture in Mac OS X with single and multiprocessor hardware to significantly boost the performance and responsiveness of your application. This session will discuss the threading models available in Mac OS X and will show a step-by-step approach to threading. We use the Carbon multiprocessing APIs to thread a tight loop to gain a performance boost, and discuss how to use semaphores to notify the main thread once computation is done.
Speaker: Xavier Legros
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Hi, everybody. Welcome to session 435. My name is Xavier Legros, and I work in developer relations. Most of my job is actually to go around the world and tell you about the latest technologies and encourage you to adopt modern Mac OS X technologies. One thing that comes up quite often is that people are afraid of threading their applications.
So we've been doing a lot of what we call workshops, where we take groups of 20, 25, 30 people and teach them one topic. We have Cocoa workshops. We have HI Toolbox and Carbon workshops. When we do this presentation, we get very, very good feedback, and very interestingly, I think it really helps people actually adopt threading in their applications. So that's going to be the topic today.
So what are we going to talk about? Today we're going to talk quickly about why you as a developer should care about threading. We'll go through some threading terminology, some buzzwords, to make sure that we all speak the same language here-- French, obviously. And we'll go through a couple of examples of threading architectures; I'll go through three of the main architectures that I think are used around threading. And I think the best part is going to be that I'm going to try to take you step by step and teach you how to thread your application using the MP APIs.
And then we'll have some dos and don'ts and something that I think is a cool demo. OK. So why should you use threads? Well, because if you use threads and you thread your application, we're going to get you 50% of one of these brand new G5s and a brand new display.
Usually everybody's like-- that's a nice box. And you're probably thinking, really? No, not really, of course. But why should you thread your application? Well, the first thing that comes to mind, of course, is scalability. We are now shipping a lot of these boxes with two CPUs inside, and that matters specifically for you as a developer with your nice application.
If you use only one CPU, it's like having your users use half of a $3,000 machine. I think a lot of applications could actually use threading. And we're going to go through some of the main concepts here and explain why it's sometimes not such a big deal to thread part of an application.
Here what I'm showing you is a couple of results with some graphic transformations. In this case, we have a Gaussian blur and a motion blur. And when you thread part of your application for highly CPU-intensive tasks, you can expect between 1.3 and 2.3 times faster.
And here you're probably wondering, why 2.3 times faster, right? If I have two CPUs, how can I get more than twice as fast? In this case, if I'm not mistaken, we're getting superlinear results. The important thing to understand is that in our boxes, we have a little bit more than just two CPUs.
We have a very, very strong architecture that enables us to take advantage of both of these G5s in the box. Specifically, why do we get things such as 2.3 times faster? Well, for the simple reason that each CPU has its own bus to the memory controller. That makes a big, big difference. In certain cases, your algorithm may be memory-bound. In that case, when you thread your application, just imagine that you're giving it twice as much bandwidth to main memory.
So why use threads? Obviously, customers' expectations, as I talked about, and scalability. Once again, the three G5s shipping right now all have dual CPUs. This is a big deal. And you've seen the industry-- Intel has been making announcements about dual core. This is something the industry is really moving towards, so please keep that in mind in your future development. OK, threads. Qu'est-ce que c'est? So we're going to throw out a couple of buzzwords, and trust me, by the end of this presentation, you'll probably be speaking French.
Okay, so what's a thread? Think of a thread as an independent execution code path. And this is very, very important. If you're new to threading, and you have no clue about how you should thread your application, think about the function you're going to want to thread: can it be considered something that can be executed independently of the rest of the code? And specifically, think of it as something that's going to have its own stack and register set.
So what's a process? A process is actually a collection of threads with the resources necessary to run. Think of it like this: when you launch your application, boom, that's a process. And inside that process, you could have different threads going on. The cool thing here is that a process has its own address space. That means the threads inside that address space can access global variables if you need them to, and memory that you pass from the main thread to your threads is memory from the same address space, obviously.
Okay, now before we explain why and how you should use threads, let's think a little bit about when you should not use threads. Obviously, in the case where it's going to add complexity to your application, you don't want to get into that business. There is no need to spend six months threading your application if every time you want to add a feature, it's going to be another six months of figuring out and managing the threads.
Obviously, things that are going to require locks are a bad idea. Here, think of something like a database with a lot of people trying to write to the same record. Depending on the granularity of the lock, obviously, you're going to have to be very, very careful.
[Transcript missing]
Some other options: you could use cooperative threading, which is probably what most of you have been using on Mac OS 9. I put timers here, and I'm talking about Carbon event timers. And this is something people very often don't understand. They're like, well, I'm doing a bunch of processing on the hard drive, and I want to show the user the progress and give them a chance to maybe cancel the operation or something. You could use a Carbon event timer for that.
OK, use a Carbon event timer that's going to fire every second or twice per second, it's up to you. And then the toolbox will call you inside your Carbon event handler, your timer event handler, and that will enable you to update whatever you want on the screen. So that's why I put timers on that list.
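To make that concrete, here is a minimal sketch of the timer pattern being described, using the real InstallEventLoopTimer call from the Carbon Event Manager; MyUpdateProgressBar is a hypothetical stand-in for whatever UI update you need:

    #include <Carbon/Carbon.h>

    // Hypothetical UI update; a real app might set a progress control's value here.
    static void MyUpdateProgressBar(void *inUserData)
    {
        // ... update a progress bar, check a cancel flag, etc. ...
    }

    // The toolbox calls this from the main event loop, so touching the UI is safe.
    static void MyProgressTimerProc(EventLoopTimerRef inTimer, void *inUserData)
    {
        MyUpdateProgressBar(inUserData);
    }

    static EventLoopTimerRef gProgressTimer;

    static OSStatus InstallProgressTimer(void *inUserData)
    {
        return InstallEventLoopTimer(GetMainEventLoop(),
                                     kEventDurationSecond / 2,  // first fire
                                     kEventDurationSecond / 2,  // then twice per second
                                     NewEventLoopTimerUPP(MyProgressTimerProc),
                                     inUserData,
                                     &gProgressTimer);
    }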
Okay, now, hopefully you get the idea, and if you're here in this room, obviously you want to learn and you're interested in threading your application. Now, I'm going to go through what I think are the three main threading architectures that we see out there: parallel tasks with parallel buffers, parallel tasks with a shared buffer, and sequential tasks with multiple buffers.
You can read the slides better than me; I think it's better if I show you a picture. So in this first case, one example for this parallel-tasks-with-parallel-buffers architecture would be, let's say, a flight simulator. You get data in one buffer, that data is going to be computed, and for the result, maybe you're going to compute atmospheric settings, okay? So this is a totally independent path.
You have one input buffer and one output buffer; they don't depend on the rest of the data. Then on thread number two, what you could do is compute, let's say, the ground, and you get the data, let's say, from the Internet or from some geostationary satellites.
So you get the data, and I don't know if you can see it, but that's one of the threads; that's your IO buffer. Then you do the processing to compute maybe some fractal terrain, some 3D world, and at the end you get a GWorld, an offscreen, or a CGContext, or an OpenGL surface, whatever you want, et cetera.
You could have thread number three, for instance, do the computation of velocity, or collisions, whatever you want. And the main idea in this architecture is: think of it as having N different input buffers and N output buffers.
And each of these threads just does its own processing. All right. This second architecture is actually the one that I think works best, and I think a lot of applications could use it. Here, in this case, we have one buffer of data. It could be an image, and you want to apply a transform. Or it could be pretty much anything.
It could be a huge array of floating-point values, and you want to compute the cosine, the sine, the tangent, and generate this huge output buffer at the end. This is what I'm going to use in my demo. And not to ruin it, what happens is that I have this application that computes a fractal. So why do I have an input? The input I do have is this buffer, which is actually a pointer to my offscreen.
Okay, I have a GWorld, an offscreen, and it points to the beginning of my image. And to compute the Mandelbrot space, which is the fractal I'm going to be computing, I just need to compute whether each pixel is in or out of the Mandelbrot set. Well, what I can do very quickly, and in an easy way that hopefully I will show you, is take that initial buffer and pass it to N threads.
I'm going to divide my picture into N different parts, and N different threads are going to compute the data for me. And then at the end, because of some magic, what I can do is pass different pointers inside my image. Think of it as just slicing the initial image and passing one slice to each of the threads.
I don't need to recombine all the results, because I just have the initial pointer to the offscreen. And that's something you could apply to pretty much anything. Think of it like this: let's say you have this hard drive and you need to compress all the files one by one.
Well, you could spawn 10 threads, and each of the threads will take one of the files from the directory. Or think of it, for instance, if you're doing HPC or some scientific computation, and you have this huge array of data that you need to crunch through, applying some FFTs or a 2D transformation or rotation of the data set. Well, you could use the same architecture here to go through your data set: you slice your input data, pass it to N threads, and then recombine the data at the end.
The last one sounds more difficult, but actually works pretty well. Here, in this case, we're going to have sequential tasks with multiple buffers. The type of application that could use this is one that needs to execute N different tasks on an initial data set. One example that I like to give people is a word processor. Let's say you have this word processor, and you have to run different tasks on the data set. So you open a file. It's a huge file.
Maybe it's a megabyte, two megabytes, whatever; it could be 100K. And what you have to do is run spell checking, then grammatical analysis, and then maybe after that you're going to want to translate the result into French, German, you name it. Here, in this case, what will happen is that the input buffer, the initial buffer, will be the first paragraph of the document.
We're going to pass that to thread number one. Thread number one is going to do the spell checking. When the spell checking is done, you take the output buffer and you pass it to thread number two, who's going to be doing the grammatical analysis, for instance. Okay? So what do we have? At this point in time, we have thread number two doing grammatical analysis on paragraph number one. And then thread number one will be grabbing, actually, let's say, paragraph number two of a document and do the spell checking on it, et cetera, et cetera. So think of it as, like, cascading the result of a previous thread.
And here, in this case, the N threads are dependent on the results of the N minus one preceding threads, but after N operations, all the threads will be fully doing work. And the output buffer will obviously be the French text or the corrected English text, whatever you want.
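The session doesn't show code for this pipeline, but here is one plausible sketch of the cascade using the MP queue primitives covered later (MPWaitOnQueue, MPNotifyQueue); the queues and SpellCheckParagraph are hypothetical names, not the demo's code:

    #include <CoreServices/CoreServices.h>

    static MPQueueID gSpellQueue;    // inbox for the spell-checking stage
    static MPQueueID gGrammarQueue;  // inbox for the grammar stage

    // Hypothetical stage-1 worker; fixes spelling in place.
    static void SpellCheckParagraph(char *text)
    {
        // ... spell check ...
    }

    // Stage 1: pull a paragraph off our queue, process it, cascade it downstream.
    // Stage 2 looks identical, except it waits on gGrammarQueue.
    static OSStatus SpellCheckTask(void *param)
    {
        void *p1, *p2, *p3;
        while (MPWaitOnQueue(gSpellQueue, &p1, &p2, &p3, kDurationForever) == noErr) {
            SpellCheckParagraph((char *)p1);
            MPNotifyQueue(gGrammarQueue, p1, NULL, NULL);  // hand it to the next stage
        }
        return noErr;
    }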
Okay, so now let's talk about the different implementations and the APIs that you can use as a developer. What's important to understand is that on Mac OS X, the three different implementations I'm going to be talking about are all implemented on top of Pthreads, which is really good news. If you've been coming from Mac OS 9: Mac OS X is a truly preemptive multitasking system, and it's great to have that implementation underneath.
And obviously, if you're coming from a Unix background, you're probably very pleased with that. On top of that, what we have are different types of implementations. So Java has the Java threads, implemented on top of Pthreads, with their own APIs, obviously. Carbon has what we call the MP APIs, and I'll talk about those in a little more detail in a second.
And Cocoa with NSThread is the same thing, a set of APIs implemented on top of Pthreads. Usually the first question I get after my presentation is, "So we have all these threads. What should I do? Why should I use MP instead of Pthreads?" Well, there is no single answer to that question, because the idea here is that we give you as many choices as we can, and it's up to you as a developer to find what fits best for you.
I'm going to be using the MP APIs, because when I started, I had no clue, and I needed to actually thread my application. And I wondered what I should use. Pthreads were a little bit too low-level for me, and the documentation was kind of hard coming from the Mac world; I'm not really a Unix developer yet.
And the MP APIs really offered a nice abstraction, so that's what I decided on. But if you're more comfortable with Pthreads and you've been developing with Pthreads, please do so. Use Pthreads, okay? No big deal. And then Cocoa or Java, depending on what type of application you're developing; it's up to you.
OK, for Carbon development-- and I should rephrase that, because we've had some folks doing Cocoa use the MP APIs in one of our workshops. Since it's a C API, Cocoa applications can use it as well; you can pretty much do everything you want from Cocoa. But if you have a Carbon application, the MP APIs are available in Multiprocessing.h, and I hope it is clear for you guys in the back.
And we offer services and objects such as MP semaphores, MP queues, and MP tasks, and I'm going to be talking about those in more detail. Once again, it's important to understand that below that, what we did, in fact, is offer you an abstraction level on top of Pthreads, OK? So you're going to get the behavior and the quality of Pthreads.
Okay, so threading implementation. This is where things get interesting. You have two approaches. And I remember talking to some folks out there who think, "Oh, you want me to thread my application, but this is going to be a nightmare. You don't realize, we have the menu management, I need to keep track of what's going on. This is going to take me a year to thread my application."
So the first approach, of course, is the difficult one, I think: you have an application that is not threaded right now, and you're going to re-thread everything. And here, the main idea, of course, is to give your users as much responsiveness as you can.
But I think there's a better way to start threading your application, depending on what type of application you have. And that is to just thread the CPU-intensive operations. The main advantage of that approach is that you don't have to re-architect your whole application, OK? So let's say you have an application where you do some computing; you need to compute, I don't know, a 3D-generated model, or you have to do some compression.
Well, what you could do, when the user executes that task, is start by threading just that part of the processing. And I'm going to show you techniques to enable you to do that without having to re-architect the rest of the app, because we're going to work on that part of the code, and we won't touch the rest of the application.
So if you wanted to thread your whole application, you have a couple of concepts to implement. Thread management, obviously: what thread is going on; if a user wants to redo the operation and the thread is not done, you'll have to actually kill the thread and restart it, that kind of thing.
And then you have to do some synchronization, of course. Maybe you're going to spawn five threads, and they're each going to do part of something. You're going to have to notify the main event loop that a thread is done, or that there is an error, or that it crashed, you name it.
And of course, you'd have to make sure that you implement thread-safe services in your application as well. That will force you to think: well, that global data is being accessed not only for reads but for writes as well, so I'm going to have to put a lock on it so my threads can access it. Now, let's go with what I think is the simplest approach, which is to thread just one part of the application: thread an operation in your application that takes a relatively long time, that is really CPU-intensive.
So the way to go about that is to identify a tight loop that uses a lot of CPU, or just takes a very long time. The main idea here is that if you fit in that category, I think it's very, very straightforward to just divide that loop.
But you have to ensure that the loop can be divided. In my example, for instance, I don't have any data dependencies from one pixel to the next. If you wanted to do something a little more elaborate, where the value of a pixel depends on one that is maybe 10 rows below, or five pixels before, it may be a little more difficult to achieve, because you'd have to make sure that that pixel value has already been computed.
So typically, what are you looking for? Here, in this case, in my code, I had a ComputeMandelbrot function that took a bunch of parameters, and inside it I had a loop that was doing the work, going line by line. And you know your code; obviously, you don't have to go searching for loops. But the main idea is to identify something of this shape: a big loop up to a large number that does some processing.
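For illustration, the kind of loop he's describing might look like this; the function and parameter names are hypothetical, not the actual demo code:

    // The shape to look for: one tight loop over a large count, where each
    // iteration is independent of the others (here, one scan line at a time).
    static void ComputeOneLine(UInt32 *pixels, int line, int width,
                               double rMin, double iMin, double step)
    {
        for (int x = 0; x < width; x++) {
            double cr = rMin + x * step, ci = iMin + line * step;
            double zr = 0, zi = 0;
            int n;
            for (n = 0; n < 256 && zr * zr + zi * zi < 4.0; n++) {  // escape-time test
                double t = zr * zr - zi * zi + cr;
                zi = 2 * zr * zi + ci;
                zr = t;
            }
            pixels[line * width + x] = (n == 256) ? 0x00000000    // black: in the set
                                                  : 0x00FFFFFF;   // white: escaped
        }
    }

    static void ComputeMandelbrot(UInt32 *pixels, int width, int height,
                                  double rMin, double iMin, double step)
    {
        for (int line = 0; line < height; line++)        // the tight, CPU-bound loop
            ComputeOneLine(pixels, line, width, rMin, iMin, step);  // no cross-line dependency
    }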
Here, in this case, remember what I said at the beginning: you need to make sure that the code can be executed independently. I had to ensure that the function that does all the work of computing my fractal could actually be executed as a separate entity.
Which, no big deal, because it's pretty straightforward. So let's try to see a nice graphic here on what's going on. So what's going on when you're not threaded? Remember, I have one thread, one process. One thread in the process that has been launched, the process being the application.
In that case, I get a command from my Carbon event handler that says, hey, do some benchmarking, or compute the fractal. Then I get into the main thread, into CalculMandelbrot. Then I call the function that does the real work, the one with the loop that goes through each line. Then we come back to CalculMandelbrot; the buffer has been filled, it's been computed. And then I go back to the main event loop, and I display the buffer results.
So now, how are we going to refactor that part of the code, that routine, in order for it to be threaded? OK. Same as before, somebody is going to ask for the benchmark, to compute the Mandelbrot space. What I'm going to do then is spawn two threads.
I'm going to divide my buffer into two different parts. For the first one, I'm going to spawn thread number one, and that's going to run a routine that is pretty much an exact copy of the one I had before, but with an adjustment of parameters for the beginning and the end of the computation.
And here what I do is pass the offset, a pointer to the beginning of the picture. And then another parameter is the end, where I want to stop. In this case, it's the size of the picture divided by two, for the number of loops.
Thread number two is going to be spawned as well. And here, as you can see, the value in red has changed: I pass the second half of the picture. And you're probably wondering, why just two? This is actually something that came up during one of our workshops, where some folks were wondering, well, the problem here is that you assume there are only two CPUs, but what happens if one day you have more CPUs? And that's true. You should not make that kind of assumption. Your code should be able to divide and slice at run time. And it's not very difficult: just count the number of CPUs, and you can divide your picture like that, which is actually what I did as well.
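As a rough sketch of that idea, assuming the pixels and height variables from the discussion above and a hypothetical SubmitSlice helper:

    // Slice by the actual CPU count instead of hard-coding 2.
    ItemCount i, cpus = MPProcessorsScheduled();
    int linesPerSlice = height / (int)cpus;
    for (i = 0; i < cpus; i++) {
        int start = (int)i * linesPerSlice;
        int end   = (i == cpus - 1) ? height                   // last slice takes the remainder
                                    : start + linesPerSlice;
        // Each thread gets its own line range into the same shared offscreen.
        SubmitSlice(pixels, start, end);                        // hypothetical helper
    }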
The cool thing here is that, remember, I'm going to spawn these threads, but I don't want to get into the business of doing thread management, okay? I don't want to re-architect my whole application. So I still want the application to be blocked. When I get inside that routine, CalculMandelbrot, I want to spawn my two threads, but I want to wait there.
I don't want to go back to the main event loop, because I don't want the user to be able to click again and recompute; then I'd have to do all that management, find out if the threads are done computing, kill them, restart them with the new parameters. So here, my idea was, very simply, to take advantage of the fact that the dual 2.5 has two processors inside.
What I wanted was to make that computation as fast as possible. I didn't want to re-architect everything. So here, what I'm going to do, in ComputeMandelbrot-- CalculMandelbrot-- is wait. I'm going to sit tight and wait. And we're going to see how we do that.
And then, obviously, remember what I said: when the threads are done, we need to find a way to signal or notify the main thread that we're done. Because remember, we're inside the routine, CalculMandelbrot. I spawn two threads, and these two threads are going to be off doing some work.
But it takes like 10 milliseconds to spawn a thread, and then we go on to the next line of code. And there is no way for me to get back to my routine, because that's an independent execution code path. Remember that. So we need to signal; we need to get back to the main thread and say, hey, I'm done.
OK. So how are we going to achieve that? Step number one, and hopefully it's big enough for you guys in the back: you're going to have to initialize the MP libraries. In this example I'm going to be using the MP APIs from Multiprocessing.h; I think it's in the CoreServices framework.
So first thing I'm going to do is count the number of processors. Then I'm going to create a queue. And I'm going to explain in more detail what that is about. And then after that, I have this loop that goes from zero to actually the number of processors.
And I create a task. Think of a task as a thread. Well, kind of. Let me go in more detail about that. The MP APIs have a cool abstraction level. I mean, I really liked it personally, because I think it made my life very easy for implementing that feature.
Think of it this way. We're going to have a queue where we're going to submit jobs, and the MP library is going to be the one dispatching that load to the different tasks. So this is where I'm going to schedule my jobs to be executed.
Even if you have a dual-processor machine, for instance, you can create four tasks, or eight tasks if you want, or six for that matter. It's up to you. And I'll show you in the demo some interesting things about the overhead of creating more tasks than processors.
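George's actual wrapper isn't reproduced in the session, but here is a minimal sketch of what an init like that plausibly does, built only on documented Multiprocessing.h calls; MyJobsInit and WorkerTask are assumed names:

    #include <CoreServices/CoreServices.h>

    static MPQueueID gJobQueue;  // jobs get submitted here; tasks pull them off

    // Each task loops forever, pulling (function, argument) pairs off the queue.
    static OSStatus WorkerTask(void *param)
    {
        void *proc, *arg1, *arg2;
        while (MPWaitOnQueue(gJobQueue, &proc, &arg1, &arg2, kDurationForever) == noErr)
            ((void (*)(void *, void *))proc)(arg1, arg2);   // run the submitted job
        return noErr;
    }

    // Create the queue and one task per processor, once at startup.
    static OSStatus MyJobsInit(void)
    {
        OSStatus err = MPCreateQueue(&gJobQueue);
        ItemCount i, cpus = MPProcessorsScheduled();
        for (i = 0; err == noErr && i < cpus; i++) {
            MPTaskID task;
            err = MPCreateTask(WorkerTask, NULL,
                               0,            // default stack size
                               kInvalidID,   // no termination notification queue
                               NULL, NULL,   // termination parameters
                               0,            // options
                               &task);
        }
        return err;
    }

With that in place, submitting a job is just MPNotifyQueue(gJobQueue, myProc, myArg, NULL).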
The reason I think this is kind of cool is that we have George Warner, who works in DTS, our schizophrenic-optimizer guy. He wrote some sample code, because I went to see him and said, OK, I'd like to thread that; what do you think I should do, what should I read? And he wrote yet another abstraction layer on top of the MP APIs, which is pretty cool. So to do the job that I showed you here, you can just make one call, MPJobsInit. And I'm going to be posting that sample code, that code specifically, for you guys.
So you can use that. It's a very easy set of APIs to submit jobs and initialize the stuff; I think it's three or four routines. It's very cool. He has the MPJobsInit, as I showed you, and an MPJobsSubmit. And that, in the sample code, is what submits a job to the queue and spawns the work onto a thread.
So we'll see that in a second. Then, step number two, we're going to have to move our tight loop inside a new routine. Remember what I said about CalculMandelbrot, which does the work: now we're going to have to create something that can be executed independently. And here what happens is that I'm going to create a new routine.
I could have overridden the other one. And in this case, because I want to be able to reuse my sample code, I'm going to pass two void pointers. Because then I can do dynamic typecasting, and I can use that code later on in another project if I wanted.
That's going to be the routine, the function that is going to be called, actually, by the threads. This is my execution code path that's going to be executed independently of the rest of my application. So that routine is going to be the one doing the crunching. So what should you do in there? You should prepare the data.
What I mean by that is that I'm going to recast the void pointer to some internal data, so I can get back the beginning of the loop, the end of the loop, and a pointer into the picture. Because it's a Mandelbrot space, I also need the complex values, the real part and the imaginary part, to compute the deltas and find out if a number is in or out of the Mandelbrot set, et cetera. Then I do my crunching. So again, there's a loop executing here, and I compute whether each pixel is in or out of the set.
And then once I'm done, I signal. I need to find a way to do this, because once we get out of that routine, we're lost, we're in the void. So I need a way to say, hey, you know what? I did my job. I'm done. I computed my half of the picture; it's finished.
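A minimal sketch of that threaded routine, reusing the hypothetical ComputeOneLine from the earlier sketch and matching the two-void-pointer convention he describes; FractalData and the names are assumptions, not the posted sample:

    // Hypothetical per-slice parameter block; one of these per thread.
    typedef struct {
        UInt32 *pixels;                 // shared offscreen; threads write disjoint lines
        int     width, startLine, endLine;
        double  rMin, iMin, step;
    } FractalData;

    extern MPSemaphoreID gDoneSemaphore;    // created by the main thread (shown further down)

    // The independent execution code path: prepare the data, crunch, signal.
    static void CalculMandelbrotThreadProc(void *inData, void *unused)
    {
        FractalData *d = (FractalData *)inData;     // recast the void pointer
        for (int line = d->startLine; line < d->endLine; line++)
            ComputeOneLine(d->pixels, line, d->width, d->rMin, d->iMin, d->step);
        MPSignalSemaphore(gDoneSemaphore);          // tell the main thread this slice is done
    }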
Okay, so step number three. To simplify, once again, I don't want to go back to the main event loop. I'm going to create a new routine, a completion routine. And what that's going to do is sit tight and just wait in there. It's going to be a routine that's called from my main thread.
Okay. I don't spawn that one; I'm going to be waiting in there. That's going to be my routine that waits to be signaled. And once again, that enables me to keep my existing architecture and not have to re-architect the whole application.
OK, so here you have the way I'm going to schedule the work. First, I'm going to create a semaphore. The semaphore is going to be the object that I share between my threads, and I'll go into more detail about the semaphore in a couple of slides.
The API I'm going to use, MPJobsSubmit, is the one in the sample code that I'm going to give you guys; it enables me to just submit a job. And what it takes is a proc pointer to my routine, which is CalculMandelbrotThreadProc.
And the two parameters you see after that, remember, are actually pointers to the data. Don't make the same mistake I did. First, remember, each of the threaded routines has its own register set and stack. That means you're going to pass a pointer to memory.
Because once you're in there, you want to make sure that that memory is unique per thread. The error I made the first time was this: I created a pointer, I set my data inside, saying start the loop at zero and end at half of the picture. Then I spawned my thread.
Then I used the same pointer and just modified the parameters inside, saying, start from half of the picture. But the fact of the matter is, when I was doing that, I was modifying memory that was being used by code executing in another thread, because I had passed that pointer to my first thread. So don't make that mistake. Here, in this case, I create two pointers, P and P2, that have been typecast to void star.
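Concretely, using the hypothetical FractalData block from the sketch above, the fix is one freshly allocated parameter block per thread; the MPJobsSubmit signature here is assumed from his description, not taken from the posted sample:

    // Wrong: one block reused for both threads; thread 1 sees it change underneath.
    // Right: a separate block per thread.
    FractalData *p  = (FractalData *)NewPtr(sizeof(FractalData));
    FractalData *p2 = (FractalData *)NewPtr(sizeof(FractalData));

    p->pixels  = pixels;  p->startLine  = 0;           p->endLine  = height / 2;
    p2->pixels = pixels;  p2->startLine = height / 2;  p2->endLine = height;
    // ... plus width, rMin, iMin, step on each block ...

    MPJobsSubmit(CalculMandelbrotThreadProc, p,  NULL);   // first half of the picture
    MPJobsSubmit(CalculMandelbrotThreadProc, p2, NULL);   // second half; returns immediately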
And then, after that, you have to understand that MPJobsSubmit submits the job but doesn't wait until the job is finished, OK? That's the main idea of threading that part of the routine. So it comes back. And then I spawn the second thread; again, it comes back, it doesn't wait for it to be finished.
And then we wait for completion. This is the routine that's going to block; it's going to stay there and wait for the threads to be finished. OK? So the semaphore is this opaque object that's going to enable us to be notified when a thread is finished.
I had a first version of this slide that used a picture of a semaphore; a semaphore in French is like the signal lights with a state. Bad idea. Think of it instead as a little box with state changes, like a state table. And this is what we're going to use; we're going to use that object to find out when the threads are done.
So here, in this case, we're going to call MPCreateSemaphore, which is in Multiprocessing.h. The first two parameters are the maximum value and the initial value. Here the maximum is going to be two, corresponding to two threads, and the initial value is going to be zero; I want to start at zero. And it hands me back my semaphore ID. That variable is global.
Why do I want it as a global? Because I want it to be accessible from the different threads: the main thread and the spawned threads. Remember, all these threads are in the same address space, so I can do that.
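In code, that global and its creation look something like this; MPCreateSemaphore is the real call, the names are hypothetical:

    MPSemaphoreID gDoneSemaphore;  // global, so the main thread and spawned threads can all reach it

    static OSStatus CreateDoneSemaphore(void)
    {
        // Maximum value 2 (one signal per spawned thread), initial value 0.
        return MPCreateSemaphore(2, 0, &gDoneSemaphore);
    }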
So now, remember, we have the threaded proc. How do we notify? How do we signal that we're done? How do we change the state in the semaphore? Very easy: MPSignalSemaphore, which is also in Multiprocessing.h, and you just pass it your global semaphore.
So, waiting on the semaphore. That's what I call the waiting game. You have two ways to wait on a semaphore. MPWaitOnSemaphore is the API you're going to use if you want to sit tight and wait until you're notified. If you pass kDurationForever, that call is going to block until somebody changes the state of the semaphore. So here is what happens: remember, we spawned thread number one, we spawned thread number two, and we call this API, MPWaitOnSemaphore. We wait there, because we passed kDurationForever.
When the signal happens in the thread, the semaphore state changes; it goes to one. Then that API returns, because the state has changed, and puts it back to zero. And I'm going to show you that on the next slide, which is better.
kDurationImmediate changes the state of the semaphore as soon as you call the API; it doesn't block. So, for instance, if the state was at two and you called MPWaitOnSemaphore with kDurationImmediate, it would subtract from the state, so it would go back to one, and then to zero if you called it twice.
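Put together, the completion wait he's describing is just one blocking wait per spawned thread; the function name is hypothetical:

    // Block until every slice has signaled; each successful wait
    // consumes one MPSignalSemaphore from a finishing thread.
    static void WaitForCompletion(ItemCount threadCount)
    {
        ItemCount i;
        for (i = 0; i < threadCount; i++)
            MPWaitOnSemaphore(gDoneSemaphore, kDurationForever);
    }

    // A non-blocking poll, if you ever need one:
    //   if (MPWaitOnSemaphore(gDoneSemaphore, kDurationImmediate) == noErr)
    //       ;  // one thread had already signaled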
So now, let's look at our graphic again, and let's see what's going on in the threaded case. We're in the main thread. We've been passed the buffer in CalculMandelbrot. We create the semaphore: two states, initial state zero. What's going to happen after that is that I'm going to spawn thread number 1. And here, the MPWaitOnSemaphore should come after-- my mistake-- but I'm going to spawn thread number 1 and thread number 2, as we did before.
So now, what's going on at that point? We have thread number 1 doing some computation, thread number 2 doing some computation on the other half of the picture, and the main thread blocked on MPWaitOnSemaphore. Which is good; that's what we want. We don't want to re-architect, so that's good. Now, what happens? Boom, MPSignalSemaphore. We're done. We computed half of the picture. We're finished. We say, hey, you know what? We're done computing. Do whatever you want now. My part is done. The thread is finished.
What happens is that that's going to increment the state of the semaphore. So from 0, we go to 1. MPSignalSemaphore is done, our whole thread is finished, that routine is done; we're not there anymore.
But then, the state changes, so in our main thread, MPWaitOnSemaphore returns and doesn't block anymore. And in doing so, we put the state of the semaphore back to zero. Now, in this case I said thread number one finishes first, but it doesn't really matter; semaphores are reentrant, so if both finish at the same time, the MP library knows what to do. Don't worry about that. And it doesn't matter if in fact thread number two finishes before number one; we don't really care. So now let's say thread number two is done; we're done with our loop, and we call MPSignalSemaphore in that routine. That increments the count on the semaphore; we're back to one. Oh, the semaphore state has changed. Then MPWaitOnSemaphore, the second one, where we were waiting, returns, and the state goes back to zero. That means CalculMandelbrot is done, so we don't block anymore; we leave that routine, go back to the main event loop, and display the result. And remember, everything below or at the same level as CalculMandelbrot has not been changed: our main event loop, the rest of our application, we didn't have to do anything. Does that make sense? Good.
OK, now let me show you a demo of that. If we could switch to demo number one, please. OK, great. First things first, I wanted to mention that Richard Kurtz, who is one of our longtime developers on our platform, sent me this code. And thank you, Richard, for that.
We were working on some things, and he sent me that code. And then I decided, hmm, wouldn't it be good to use that as an example for threading? And so what I did is thread the application. So thank you, Richard. There we go. So here what we have is just a basic Mandelbrot space.
If you're doing any work on performance, you should know that before doing anything, you should put in some kind of benchmarking. Here, in this case, I have a benchmark. Let me move that a little bit.
Everybody can see? Here I put a benchmark, and this is a pretty easy space. You have to understand that the difficult part to compute is actually the part in black, okay? So what happens is that I have a flight, so I compute that picture something like 10 or 20 times, I don't remember exactly, but... whoa, what happened here? Did we crash or did we disappear? Oh, that's a good demo. No, it's here, I'm sorry. I don't know. Maybe I clicked too fast. Let me remove the Dock.
OK, so what happens here is that I ran a benchmark-- Richard actually wrote that code-- and it just tells us how long it takes to compute. And here, you can see that it took 0.95 seconds to compute that space. Once again, it's important to understand that the black part is the difficult part to compute.
So obviously here, there's nothing very difficult to compute. What I wanted to show you, too, is one of the tools that ships with our system, which is called ThreadViewer. You guys know about ThreadViewer? Raise your hand if you know. OK, good, pretty good. Seems like all of you know about threading already.
And here you can see that what I did is initialize the MP libraries at the beginning, based on the number of processors. But for kicks, I actually have a routine that enables you to create more threads if you want, and I'll show you that. So what happens here is that we can see the work that is going on.
I'm going to go to a more difficult part, and I have a cheat sheet to make things faster. So here we're going to try to fill the screen with more black, so we really use the CPU power. Computing the white part is pretty straightforward and easy; it would just take a second.
Okay, I think it's good enough. Let's not worry too much about it. Okay, good. So we're here; let me turn off AltiVec. And here I'm going to benchmark it, and I want to show you that we're using only one thread, at the bottom, okay? You can see here the green is time spent in user space, so we're computing here, which is kind of sad, because this is a typical example where you'd want to use threads for that type of computing. Obviously it makes a lot of sense. So here you can see that doing my benchmarking takes quite some time, and that's on a dual G5.
And I have something like two gigs of RAM. Now, obviously you guys don't sell software that computes Mandelbrot spaces, or maybe nothing that close, but I think you can probably relate this to some part of your code that you could use it in.
So here we're done. You can see I ran with one thread, and it took 6.31 seconds. And I have a min and a max computed. What we're going to do now is simply turn on the threading, and I'm going to do the benchmark again.
I'll show you the code after that. But what you can see here is that now both CPUs are being utilized. The white space you see between the threads is because, the fact of the matter is, I'm doing a flight: the picture takes a certain number of seconds, but I do that processing something like 20 or 50 times.
So that's why we go back to the main event loop, because the threads are gone, so you see one thread at one point. And here you see four seconds. So we went from, what is it, 8-point-something to four. So we get almost twice the speed. I did some testing before, which was actually rather interesting.
For the floating-point results, I got an average of something like 1.7 times faster, depending on how difficult the space was. The cool thing, too, is that if on top of that you add AltiVec, you get some dramatic performance, because then you use both AltiVec units, 128-bit computation per cycle, and you get to something very, very cool. So in this case, depending on what you're doing with AltiVec on the threaded version, I get to something like 1.8, 1.9 times faster, depending on the space.
So I can show you that. Now, if I do the benchmarking, we're at eight or nine seconds. And now, if I do AltiVec plus the threading, we get a huge speed improvement here. So we got something like six times faster between the threading and the AltiVec.
Okay, so that's a cool demo. What I want to show you is that it's always hard-- and I got this question-- well, how many slices? How do you slice your picture? You have two CPUs, but what happens if, let's say, I have four slices and I want four threads, and I have only two CPUs?
And I was thinking, well, that's true, so what's the overhead? And this is where you see that Mac OS X is truly great at multitasking, because I'm going to put four threads and four jobs, and I'm going to remove AltiVec because it goes way too fast-- well, actually it doesn't matter. But if I benchmark here, you can see that I have four threads going on, okay?
You're going to see that there is not much overhead. And in certain cases, depending on the memory usage, you actually get some pretty good results. So it's very interesting, because what I'm getting at is that you could ship code that is threaded, OK? Let's say you decide on two threads, and you divide your picture, or whatever processing you're doing, into two threads. If you then run it on your PowerBook, you'll see that there is not much of an overhead. In certain cases, it's probably going to be the same speed, depending on what other processes are running on the system, of course. So this is very important to understand: a truly great multitasking system.
OK, I showed you that. Let me show you the code quickly. This is the library I told you about, and these are the two files I'm going to be posting for you; I'll tell you about that during the Q&A. I don't want to take too much time right now. Here we have a couple of wrappers, and here you have MPJobsInit.
So what we do, just as I showed you: first you get the number of processors that are scheduled. OK, it's going to be one or two. And then, after that, we create the queue. This is actually a global variable, and this is where you're going to submit your code for execution. And then I just have this basic loop that creates a task per processor.
Okay, now let me go back to the compute code. This is the fractal, so we get in here. Is that large enough? Everybody can see in the back? We're good? Is that clear? Hello? Yeah? Everybody can see? Okay, good. Thank you. So here I just have a global that I check-- it's for demo purposes-- to see whether we're in the threaded case or not.
And here I have something that computes the number of jobs; that's another global that I use, set through the menus. As I told you, I create a pointer for each of the data structures I want to pass to each of the threads. Then I have a loop that submits the jobs. And this is very cool, because, seriously, with the MP wrapper APIs that I had, it took me a couple of hours to implement that.
We had a Cocoa guy here in one of our workshops who has software that does compression for sending to a phone-- pretty cool, good stuff. He has a wavelet compression algorithm. In less than four hours, we changed his code to use this type of threading.
And it took us three hours, but that's because I was coding; somebody else would probably do it way faster. The cool thing is that you can see I'm using MPJobsSubmit, which is an API that is part of the code I'm going to give you guys, and I just pass it my routine and some pointers.
And here, you see the job data, which is the stuff I have. Let me show you. I don't know why certain comments seem to be hiding things. Oh, I'm sorry. Okay, let me open the project again. Great. Awesome. I'm going to go back here, and I want to show you the submit code. I want to show you here.
It doesn't see it. It seems like... Okay, so we're here. So this is, remember, the routine that's going to be called by each thread. Remember: an execution code path that's going to be run independently. So this is what's going to be called; I just passed the address of that routine. Very cool.
First thing I do is repurpose the data. Obviously, I could pass the fractal data, which is a structure I created, but because I wanted to use the MPJobs.c code in other programs, I decided to use void star as a way of not depending on a data type.
This is actually cool; once you see the code, it's pretty cool. So I repurpose all the data. Okay, we're good. I recompute the delta for the start, remember? I have to adjust the routine so the algorithm runs from the start to the end of its part of the computation.
And here, don't worry about that, I should have removed it, but the main thing is that we're here, and I do the loop. Why is it fast? Well, think of it as two different code paths: one is going to start and do the first part of the picture, and the second one the second part, and I change the beginning and end offsets accordingly. We do the computation in floating point here, and with the Velocity Engine here. And then, when I'm done, I signal the semaphore, remember, as I showed you earlier. Okay? If you could switch back to the slides, please.
OK, some recommendations. The MP init stuff I showed you, where you initialize the MP library, count the number of processors, create the queue, and create the tasks: don't do that every time somebody requests an action that's going to call your threaded code. Do it in your main, when you start the program.
And then, when your program quits, just clean up after yourself. Don't recreate; reuse. In this demo, I recreate things because I wanted an example where I could use eight threads, for instance, and show you the results with only two CPUs. But in a typical case, you would not want that overhead.
So, do be data-driven. What I mean by that is: do your setup and your memory management, create your arrays, and then your threads should be the routines that do the real work. You don't want to be sitting in a thread waiting for a Carbon event somewhere, or waiting to be notified by another thread. You want to use the threads for doing the data crunching.
You don't want to sit in there, because then what's the point of having a thread if you're waiting for something to happen in another thread? There could be cases for that, but the main idea is that you want to use the bandwidth of the G5s; you want the threads doing the data crunching.
So before you spawn the threads, set up all your arrays, set up the memory, and everything you need to make the data crunching effective. And then when you come back, do the cleanup-- or, before the thread dies, do the cleanup: free the memory that you allocated from the thread, that kind of thing.
Okay, so let's go back. What happens in the threaded case? You're going to love this slide. Remember, we create the semaphore, initial state 0. That's good. We spawn the two threads, thread number 1 and thread number 2, and then what? Well, we wait on the semaphore, but what happens here? We're blocked. OK, as I told you before, that's fine. Our initial goal was really the data crunching; we wanted to do that operation as fast as we could, and we didn't want to re-architect the whole application.
So let's say now, step number 2: you shipped your application with step number 1. That's the first step, and that's cool. Your customers are very happy, because some operations are up to two times faster. Everybody's happy. But now, let's make the whole experience better. Let's actually thread the whole application. What should we do in this typical case? Well, this is going to be the easy part. What we're going to do is create the semaphore-- we're in CalculMandelbrot, that's good-- and we're going to spawn the thread that does the waiting.
Remember, before, inside that routine, I had the code that was doing the wait, with the two calls to wait for completion. What I'm going to do now is put that code in a routine and spawn that routine as one thread.
So what that thread does is sit there, just waiting. Then we're going to spawn the other two threads, thread number two and thread number three. And when we've done that, what happens? Well, we go back to the main event loop.
So now, careful, because that means the user could actually go back and say, hey, benchmark again, while the threads are still running. So remember, we're back to the first part of the presentation, where I said, hey, guys, you have to be careful: if you want to thread the whole application, you're going to have to do some thread management. It's possible, okay? I just want you guys to understand that there are different steps and different places where you can plug threading into your application. So let's say we did that work. Very cool. And what happens when a thread finishes its work? Well, it's going to signal.
Remember, the signal is going to bump the semaphore to one in this case. Then what happens is that the MPWaitOnSemaphore in our waiting routine-- because, once again, the semaphore is global-- is going to return; that code doesn't block. Now we're blocked on wait number two. MPSignalSemaphore is then called inside thread number two.
The state changes to one, and then MPWaitOnSemaphore is going to turn it back to zero. At that point in time, we have a thread, thread number one, that tells us, "Hey, my other two threads are done doing the work." Now what do we do? We need to notify the main event loop. Remember, we have to tell the event loop, "Hey, I'm done." So how are we going to do that? Well, very cool: there is a very nice Carbon event call, which is PostEventToQueue.
And what we're going to do is create a Carbon event and pass it to this API. That's going to send it to the event manager, and the event manager is going to dispatch it to our main thread. This is what you do when you want to update UI, for instance.
PostEventToQueue, very cool. You pass it a Carbon event; you just need to install a Carbon event handler. The Carbon event handler can be installed on a window, on a control, a widget, an HIView, on the application; it's up to you. A lot of flexibility. And then, inside my application, the main event loop gets notified, and I can display my picture when we're done.
So, some dos and don'ts when you start threading the whole application. Be careful with the UI, okay? It's okay to draw with Quartz. You may have some issues depending on what you're doing, but it's okay to draw with Quartz from different threads. And we actually have some sample code from DTS for that,
on our developer.apple.com website, that enables you to check that out; George is going to be here and can give you the complete URL. OpenGL is okay as well. And once again, if you want to notify the main event queue for drawing a button or updating a scroller, please use PostEventToQueue.
PostEventToQueue is very cool because you can call it from wherever you want. You create your own Carbon event with your own type; it's up to you. And then you have your Carbon event handler on your window or your application, and it's just going to be called. This is the way to do user interface updates from different threads.
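Here is a minimal sketch of that notification, using the real CreateEvent and PostEventToQueue calls; the event class and kind are hypothetical values you would define for your own app:

    #include <Carbon/Carbon.h>

    enum { kEventClassMyApp = 'MyAp', kEventMyComputeDone = 1 };  // hypothetical custom event

    // Call this from the waiting thread once all the slices have signaled.
    static OSStatus NotifyMainThread(void)
    {
        EventRef event;
        OSStatus err = CreateEvent(NULL, kEventClassMyApp, kEventMyComputeDone,
                                   GetCurrentEventTime(), kEventAttributeNone, &event);
        if (err == noErr) {
            err = PostEventToQueue(GetMainEventQueue(), event, kEventPriorityStandard);
            ReleaseEvent(event);  // the queue holds its own reference
        }
        return err;
    }

A handler installed for that class and kind, for example with InstallApplicationEventHandler, then does the drawing from the main thread.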
All right, a quick summary, and that's about it. Once again, thread your application when it's appropriate. Obviously, some of the examples I gave you here may not apply to you. Don't go into a frenzy and start threading everything, even if it's easy. The main idea is that I would encourage you to go back and think for a couple of minutes: what part of my application is taking a long time right now? What part can I do better for the user? And once again, you can have two motivations for threading your application.
It could be responsiveness, because you're doing a lot of things and sometimes the user can't do anything, the menus don't pull down, you're blocked, and you want to improve that user experience. The other one, as in the case I showed you, is that you're doing CPU-intensive tasks. Maybe you're computing on this huge array, doing matrix manipulation, or your job is to compute: you get an MRI, and you have to find out if that MRI shows cancer or something. You get the idea. And it can take a long time.
That's the typical case where I want you to think in terms of: can this be divided? Could I use different threads? Can my code path be executed independently? So think about that. I think a lot of developers just don't think about it, because we're all chasing features.
But I think it's very, very important with what our users are buying now in these new G5s, that actually we all think about responsiveness and high performance by threading. And once again, I'll be posting the sample code probably tonight. Actually, I think I can do it now at 5. I don't think I have a session after that. And I'll go into more details about that in a second.
All right, if you want more information, we have some stuff on Carbon threads and the Multiprocessing Services. And ADC home is developer.apple.com; I'll let you read the slides. If you're interested in Cocoa, obviously Cocoa has some threading as well. If you're interested in the POSIX Pthreads, you can just do a man.
The man page is actually pretty good. I know if you're coming from Mac OS 9, you're probably thinking, "Man pages? I don't want to go to the Terminal." But personally, I encourage you to check it out; it's a very, very good start when you're looking for information. There's the Darwin CVS repository if you really want it, but I think opengroup.org has some pretty good and updated information on Pthreads. Okay. We have some technical notes: TN2028 on threading architectures, and we have a tech note with the MP routines as well.