WWDC03 • Session 111

Writing Threaded Applications on Mac OS X

Core OS • 22:50

Learn how to leverage the threading architecture of Mac OS X with single and multiprocessor hardware to significantly boost the performance and responsiveness of your application. This session discusses the threading models available in Mac OS X and when you might choose one over the other. We go beyond discussing the MP API and demonstrate optimizing MP performance techniques as well as when to employ threads and when not to.

Speakers: Mark Tozer-Vilchez, George Warner

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

This is session 111, Writing Threaded Applications for Mac OS X. As you know, writing a threaded application is one way that you can maximize resources and performance on any system, whether it's a single or dual processor. This session will go over a lot of the tips and tricks, the do's and don'ts of writing a threaded application, why you would want to do it, and the times when you wouldn't want to use a thread. So hopefully you've been able to take a look at the resources online at the developer site. This session will supplement a lot of that information, and I'd like to bring up George Warner, who will deliver the presentation this afternoon. Thank you.

Before I get off this slide, I just want to touch on my new job title, which some people have noticed and asked me about. For any of you that don't know, I'm the engineer formerly known as the Mixed Mode Magic Fragment Scientist. For obvious reasons, those technologies have kind of faded out somewhat, so people have suggested that I update my job title.

Excuse me. So I'm the Schizophrenic Optimization Scientist now. There was a little debate over whether it should be plural or not, but... So the agenda: we'll be covering some terminology, since there are some naming conflicts in some of the threading models. The why, the why not. We'll be covering different architectures, talk about the do's and the don'ts. We'll be doing some demos and then we'll break for some Q&A.

So the original Thread Manager was a cooperative thread manager, everything was scheduled cooperatively, and we referred to them as threads. When we came out with the MP implementation in 8.6, we didn't want to call them threads, so we called them tasks. Now we're running on a Unix system, where everybody knows what a task is, and there's a bit of a clash in the naming space. So for the purposes of today's presentation, when I'm talking about a thread, we're talking about an MP task.

It's an independent path of execution. Everybody is pretty much up on what a thread is. When I talk about a process, it's a collection of threads plus their resources, basically an application or a task that's running. In the Unix world you'd call it a task, but I have a tendency to call it a process. And I want to point out that multiprocessing is a general case of multitasking.

Back in 8.6 when we introduced all this, I'd suggest to a lot of people, well, why don't you use an MP thread? And they'd go, well, I don't have an MP box. And it took me a while to educate the market and make them understand that multithreading really doesn't have anything to do with multiprocessing per se. It's just that if you do have another processor, the multithreading API can help you take advantage of it.

So why use threads? There's a customer expectation that your application is going to be very responsive, that when I click a button something's going to happen. And there's nothing worse than when you get the little spinning beach ball. Scalability. Like I said earlier, if you actually have a dual processor box, customers really want to see both processors getting utilized. So if you write your code threaded, there's a better chance that those second processor resources will be getting utilized.

So, way back when I first started working on the MP stuff, these benchmark numbers came out, and they keep getting regurgitated and reused every year. Anybody that's come to a previous session has seen these numbers. The really interesting number there is the top one. On a dual processor box, we got 2.3 times, and everybody goes, wait a minute, how can you get better than two times the performance? You've only got twice the horsepower.

And that's exactly what I thought the first time I ran the numbers, so I ran the numbers again. And when I went and analyzed the numbers and tried to figure out what was going on, what we finally determined was we don't just have twice the processors, we also have twice the cache. Our job's data just happened to not quite fit well in a single processor's one megabyte cache, but it fit extremely well in two megabytes of cache. And so we actually got better than twice the performance, and that's called superlinear. So, why do I use threads?

This is what I call the shelf space theory. The more things you have scheduled and running out there, the more CPU time you'll get. So that's one way. If there are two apps running with the same priority, then they'll both get approximately 50% of the time. If you've got three threads running and there's only one other thread running at the same priority, then you're getting three quarters of the time. So it's kind of a cheesy way to borrow more time. Not that I would ever do that.

So, synchronous requests. Talking about the spinning beach ball: if you call an API in your event loop and it stalls for however long it takes, all of a sudden you get the spinning beach ball, and your customer goes, well, this isn't a very responsive app. I don't like this one.

So you can put blocking calls in your own thread and let the main thread continue running, dispatching events, updating the screen, and let the user do what he wants to do. And when the synchronous call finishes, it can signal back to the main thread that, hey, I've finished doing this. You can update the progress bar. You can continue running on.

So polling is bad. Another one of my favorite examples: anyone that's ever taken a vacation with a nine-year-old, this is the equivalent of, are we there yet? Are we there yet? Are we there yet? And by bad, I mean annoying. So blocking is much better. You get away from all the asynchronous calls and playing with callbacks and dealing with did it happen when I thought it would, and chaining events, et cetera. You can just put everything in its own thread. A good example of this was in, I think it was 8.1 or so, when we actually redid the Finder copy.

Once upon a time you'd do a Finder copy and you'd go get coffee, because that's how long it took to do the copy. Then a very clever engineer, I won't mention his name, it wasn't me. I did say clever engineer, didn't I? He wrote the asynchronous copy, and basically he chained all these I/O completions with this state machine. It's about four pages of really complicated code that gave you a headache to even be in the same room with.

And it was really sweet and everybody liked the fact that you have multiple copies going on and all this kind of fun stuff. And that was great up until 8.6 came out. And we took those four pages of state machine and replaced it with like six lines of code.

We spawn it off in another thread. We do source file open, destination file open, read from source, write to the destination, loop until end of file, close the source, close the destination, done. Pretty simple. The only other thing going on with all that was updating the progress bar over in the Finder, and that was all happening behind its back. Very simple code, easy to maintain.
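The session doesn't show the actual code, but the loop George describes maps to a few lines of C. Here's a minimal sketch using pthreads (the real Finder work used the MP API; all names here are illustrative):

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical job description; names are illustrative. */
typedef struct {
    const char *srcPath;
    const char *dstPath;
} CopyJob;

/* The whole "six lines": open source, open destination,
   read/write until end of file, close both. Blocking calls are
   fine here because this runs in its own thread. */
static void *CopyThread(void *arg)
{
    CopyJob *job = (CopyJob *)arg;
    FILE *src = fopen(job->srcPath, "rb");
    FILE *dst = fopen(job->dstPath, "wb");

    if (src != NULL && dst != NULL) {
        char   buf[64 * 1024];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, src)) > 0)
            fwrite(buf, 1, n, dst);
    }
    if (src != NULL) fclose(src);
    if (dst != NULL) fclose(dst);
    /* Here you would signal the main thread that the copy finished,
       for example by posting an event (see the Carbon sketch later). */
    return NULL;
}
```

You'd spawn it with pthread_create(&tid, NULL, CopyThread, &job), and the main thread keeps dispatching events while the copy runs.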

We let the interns do that. So, when to avoid threads. You want to avoid threads when they add complexity. If you have global data in your application that you're going to share between multiple threads, then you're going to have to have locks on it. And that can introduce deadlock conditions, so you've got to program to prevent that, et cetera. So global data requires locks. If it's just shared, it may not require locks; if you've got one thread that's only reading the data, it may not require locking the data.
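As a toy illustration of "global data requires locks" (not from the session; names are hypothetical), here's a shared counter guarded by a pthread mutex:

```c
#include <pthread.h>

/* Global data shared between threads: every writer must hold the lock.
   If the data were only ever read, the lock might not be needed. */
static pthread_mutex_t gStatsLock = PTHREAD_MUTEX_INITIALIZER;
static long gBytesCopied = 0;

void AddBytesCopied(long n)
{
    pthread_mutex_lock(&gStatsLock);
    gBytesCopied += n;              /* keep the critical section tiny */
    pthread_mutex_unlock(&gStatsLock);
}
```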

Another reason to avoid threads is if you have non-thread-safe APIs. There are alternatives; we'll get to that. Added overhead: it takes about 200 microseconds on a dual gigahertz machine to create a thread, from the Mach thread up. And the preemption time is about 40 microseconds on a dual gig.

Now, interestingly enough, I had the numbers for the new hardware, but I couldn't put them in my slide for what should be obvious reasons: I couldn't rehearse to an audience that hadn't been briefed on the new numbers. So on the new box, it's around 170 microseconds on the creation, and it's still exactly 40 microseconds on the new box, even though the preemption has to save registers of twice the width.

So I was pretty happy to break even on the preemption time, but we'll continue working to hopefully bring that down some more. The memory footprint: each thread that you create gets a 512K virtual stack. Obviously, that's not using your physical memory, but it will eat up your virtual memory address space. And so you don't want to create a whole bunch of threads and run out of memory.

Kernel resources. Every thread you create is a Mach thread. It has kernel resources that we allocate to it: the hardware context storage, which is all those registers (and I said 32 here, but now that probably changes to 64), about 2K per thread. And that memory is physical and locked down, so we want to avoid using too much of it.

Hundreds of threads. When I talked about the preemption and the 40 microseconds, that's cumulative. So if you've got thousands of threads running, you're probably going to spend as much time preempting and switching between all those threads as running actually useful code. So you don't want to go crazy with spawning threads.

So, other options: cooperative threads. If you've got non-thread-safe APIs, you can use the cooperative Thread Manager and schedule those cooperatively. You can use timers, Carbon timers, Cocoa timers, et cetera, to have tasks run at predetermined intervals.

Threading architectures. Parallel tasks with parallel I/O buffers: an example of this would be if you've got multiple independent tasks that don't really have that much to do with each other.

I think one of the very first multiprocessing projects I ever worked on, and I'll date myself, 20 years ago, was a driving simulator: we had an AI running all the cars, we had a physics engine doing the collision detection and the Newtonian physics, and then the graphics rendering thread, et cetera.

And when I tell people this, they said, "Wow, you guys were doing that 20 years ago? That's pretty impressive." And then I kind of qualify that. We were doing about five minutes worth of video in a month. So things have progressed considerably. No real time back then, or not much.

So it was real slow time. Parallel tasks with shared I/O buffers: this would be an example of where you've got the same thing to do on a lot of different pieces of data, and you can split it up into little pieces. Like if you've got a graphics image, you can take little postage stamps and feed each little postage stamp off to a different processor or a different task, and they all do their crunching, do everything they need to do, put the results in the output buffer, and it all gets put back together on the other side. This is the best model for dealing with exactly that case, and you typically want as many tasks as there are processors.
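Here's a minimal sketch of that shared-buffer model in C with pthreads (assumed names and a stand-in workload; the session shows no code). Each worker gets its own strip of the image, so the threads never touch the same bytes:

```c
#include <pthread.h>

enum { kNumWorkers = 2 };   /* "as many tasks as there are processors" */

typedef struct {
    unsigned char *pixels;   /* the shared image buffer */
    int            rowBytes;
    int            firstRow;
    int            lastRow;  /* this worker's strip: [firstRow, lastRow) */
} Strip;

/* Stand-in for the real per-pixel crunching. */
static void *CrunchStrip(void *arg)
{
    Strip *s = (Strip *)arg;
    for (int row = s->firstRow; row < s->lastRow; row++)
        for (int col = 0; col < s->rowBytes; col++)
            s->pixels[row * s->rowBytes + col] ^= 0xFF;
    return NULL;
}

/* Each worker touches a disjoint strip, so the buffer needs no locks. */
void ProcessImage(unsigned char *pixels, int rowBytes, int height)
{
    pthread_t tids[kNumWorkers];
    Strip     strips[kNumWorkers];

    for (int i = 0; i < kNumWorkers; i++) {
        strips[i].pixels   = pixels;
        strips[i].rowBytes = rowBytes;
        strips[i].firstRow = height * i / kNumWorkers;
        strips[i].lastRow  = height * (i + 1) / kNumWorkers;
        pthread_create(&tids[i], NULL, CrunchStrip, &strips[i]);
    }
    for (int i = 0; i < kNumWorkers; i++)
        pthread_join(tids[i], NULL);    /* put it all back together */
}
```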

Sequential tasks with multiple I/O buffers. I call this the pizza oven; some people call it the assembly line. One of the other engineers in DTS wrote an application where the first input buffer was nothing more than some FSSpecs that you got from a drag and drop. The first task took the list of FSSpecs and, for each one, found out if it was a file or a folder.

If it was a file, it would send it to the output of the first task. If it was a folder, it would iterate over everything in it, feed each item back in, and send each file to the output. The second task would take this iterated, or flattened, list of file specs and start reading them in and streaming them out. The next task would take the data that's streaming through and compress it, and then the next would send it over a network port.

And then et cetera, et cetera. On the other side of the network, it would uncompress it, and then it would unflatten it, and then it would create the directories. So it would basically do a backup across the network, and with the compression and everything else, we actually got better throughput than the Finder did doing a flat, uncompressed copy. But this is an example of a sequential task, where you're basically doing one thing after another, after another, to the same data.
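A sketch of the glue that makes this assembly line work, assuming a pthreads condition-variable queue (illustrative, not the DTS engineer's actual code). Each stage blocks, never polls, when its input is empty or its output is full:

```c
#include <pthread.h>

enum { kQueueSize = 8 };

/* A tiny bounded queue connecting two pipeline stages, e.g. the file
   reader pushing buffers and the compressor popping them. */
typedef struct {
    void           *items[kQueueSize];
    int             head, count;
    pthread_mutex_t lock;
    pthread_cond_t  notEmpty, notFull;
} Queue;

void QueueInit(Queue *q)
{
    q->head = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->notEmpty, NULL);
    pthread_cond_init(&q->notFull, NULL);
}

void QueuePush(Queue *q, void *item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == kQueueSize)               /* block, never poll */
        pthread_cond_wait(&q->notFull, &q->lock);
    q->items[(q->head + q->count) % kQueueSize] = item;
    q->count++;
    pthread_cond_signal(&q->notEmpty);
    pthread_mutex_unlock(&q->lock);
}

void *QueuePop(Queue *q)
{
    void *item;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->notEmpty, &q->lock);
    item = q->items[q->head];
    q->head = (q->head + 1) % kQueueSize;
    q->count--;
    pthread_cond_signal(&q->notFull);
    pthread_mutex_unlock(&q->lock);
    return item;
}
```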

Threading architectures: many applications have both parallel and serial, or sequential, execution paths. So basically, everyone has to know their own app best and know which model works best with what they're doing. In a word processor, you could have a grammar or spell check running independently of font and text rendering, et cetera. In a driving simulator, like I mentioned earlier, you could have AIs driving the other cars at the same time you've got a physics engine running, at the same time you've got a rendering engine putting bits on the screen.

So, implementations. I'm not going to go into an API rundown or anything like that. Depending on which architecture you prefer programming in, there are thread implementations in all of them. Java has Java threads, Carbon has NS threads, and Cocoa has NS threads. Oh, this is all mixed up. They just wanted to see if I was paying attention.

Okay. And so you use the one that's appropriate for your environment. I have to say, having looked at the implementations of these, the top layers are extremely thin; they're sitting right on top of pthreads. So I wouldn't worry too much about the overhead in that model, and I wouldn't go down to pthreads unless you really, really, really, really want to. So the common concepts between the different threading models are basically thread management (creating threads, destroying threads, setting the thread priority or weight, et cetera) and the synchronization primitives: you've got mutexes and semaphores and queues and the event groups, et cetera, ways of communicating between different threads.

And you've got thread-safe services. Currently malloc and the memory allocators, the file I/O, the synchronous file I/O, are all thread safe, et cetera. So, this is probably the most important part of my talk: the do's and the don'ts. This is kind of a collection of all the things I ran into from developers coming to me with issues, sitting down, going over code, and figuring out why we were having problems. I mentioned earlier that the 200 microsecond creation time can add up if you're dynamically creating and destroying a lot of threads. A good way to avoid that is to preallocate and use pools. Same thing with the memory, et cetera.
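A minimal sketch of the preallocated-pool idea with pthreads (hypothetical names; WaitForNextJob and RunJob stand in for a real job queue such as the one sketched above):

```c
#include <pthread.h>

enum { kPoolSize = 4 };

/* Hypothetical stand-ins for a real job queue:
   WaitForNextJob blocks until work arrives. */
extern void *WaitForNextJob(void);
extern void  RunJob(void *job);

static void *PoolWorker(void *arg)
{
    (void)arg;
    for (;;)
        RunJob(WaitForNextJob());  /* the thread is reused, never re-created */
    return NULL;
}

/* Pay the roughly 200 microsecond creation cost once, at startup. */
void StartWorkerPool(void)
{
    static pthread_t workers[kPoolSize];
    for (int i = 0; i < kPoolSize; i++)
        pthread_create(&workers[i], NULL, PoolWorker, NULL);
}
```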

Try to be as data-driven as possible. You want to keep the CPU fed: feed as much data to it as you can, and have it ready to go before you need it. Typically this isn't a problem. Most people write their code in what I call this format: they have a prologue where they open files, allocate memory, et cetera. They crunch, crunch, crunch, and when they're done, they close their files, dispose of their memory, release, et cetera.

Some of the other implementations on other operating systems for some reason love doing the kill and cancel kind of things, where they have one thread stopping another thread. Our system doesn't play well with that kind of asynchronous behavior. We didn't design it that way, for good reasons, and if you want to take advantage of the way we do things, you really want to avoid using those kinds of APIs. Whenever possible, use the synchronization methods instead: use a mutex or a semaphore, et cetera, to control the thread.

So when you use a semaphore or a mutex on a data structure, you have to achieve a balance between having too many and too few. If you have one big huge data tree and you've only got one lock on it, then the chances of someone else needing that lock, if you've got multiple threads running, are a lot higher than if you put your locks down in the branches of the tree. But you don't want to lock everything, because you'll probably spend more time locking and unlocking things than actually accessing your data. So there's a trick to finding the balance.
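As a toy example of picking the lock granularity (not from the session, and using a hash table rather than the tree George describes, to keep the sketch short): compare one lock over the whole table with a lock per bucket. The per-bucket version trades more lock traffic for less contention:

```c
#include <pthread.h>
#include <stddef.h>

enum { kNumBuckets = 64 };

typedef struct Entry Entry;
struct Entry { int key, value; Entry *next; };

typedef struct {
    pthread_mutex_t lock;   /* protects just this bucket's chain */
    Entry          *head;
} Bucket;

static Bucket gTable[kNumBuckets];

void TableInit(void)
{
    for (int i = 0; i < kNumBuckets; i++) {
        pthread_mutex_init(&gTable[i].lock, NULL);
        gTable[i].head = NULL;
    }
}

/* Threads inserting into different buckets never contend; with one
   global lock they always would. */
void TableInsert(Entry *e)
{
    Bucket *b = &gTable[(unsigned)e->key % kNumBuckets];
    pthread_mutex_lock(&b->lock);
    e->next = b->head;
    b->head = e;
    pthread_mutex_unlock(&b->lock);
}
```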

This is back again to "are we there yet?": avoid the busy waiting. One of the common mistakes I see is occasionally we'll have a developer that wants to wait on two things at once. And what they'll typically do is check one with a timeout, and after the timeout they'll check the other one with a timeout, and then they'll loop back and check the first one again.

And now they're basically still polling. Are we there yet? Are we there yet? What you can do instead is have two threads, each waiting on a different event or a different mutex, and when either one of those threads gets its signal, it signals a third thread: I've got my signal.

And so when that third thread gets the signal, that's the OR case, and it can continue. If you're waiting on two things and you want both events to happen before you continue, that's the AND case: you can just block on one and wait for it to finish, and when it finishes, go and block on the other one and wait for it to finish. If the second one has already happened, you'll just continue running. So there's never a reason to poll.
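Here's one way that waiting could look with pthreads condition variables (a sketch under assumed names, not the session's code). EventWait blocks instead of polling, and falls straight through if the event already fired, which is exactly why the AND case needs no timeouts:

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            fired;
} Event;

void EventInit(Event *e)
{
    pthread_mutex_init(&e->lock, NULL);
    pthread_cond_init(&e->cond, NULL);
    e->fired = false;
}

void EventSignal(Event *e)
{
    pthread_mutex_lock(&e->lock);
    e->fired = true;
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->lock);
}

/* Blocks until the event has fired; if it already fired, falls
   straight through. No polling anywhere. */
void EventWait(Event *e)
{
    pthread_mutex_lock(&e->lock);
    while (!e->fired)
        pthread_cond_wait(&e->cond, &e->lock);
    pthread_mutex_unlock(&e->lock);
}

/* The AND case: wait on one, then the other. The total wait is the
   maximum of the two, not the sum, because the waits overlap. */
void WaitForBoth(Event *a, Event *b)
{
    EventWait(a);
    EventWait(b);
}
```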

GUI. This is another one of those areas I get a lot of questions on. It seems like with everybody that ever jumps into threading, the very first thing they want to do is try to do some GUI things. The good news is, things are getting much better. In the Mac OS 9 days, it was just a definite don't. Nowadays, we have Quartz: you can draw to the screen. We have OpenGL: you can render to the screen.

And if you're using Carbon, I'll pick on Carbon here: if, for example, you're doing that file copy and you want to update a progress bar, a really easy way to do that is to use PostEventToQueue. That way you can tell your main event thread to update the progress bar.
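PostEventToQueue is a real Carbon call, and a minimal use of it from a worker thread might look like this (a sketch; the custom event class and kind here are hypothetical, and your main thread would install a handler for them):

```c
#include <Carbon/Carbon.h>

/* Hypothetical custom event class and kind; pick your own codes. */
enum {
    kEventClassMyApp    = 'MyAp',
    kEventMyAppProgress = 1
};

/* Called from the copy thread after each chunk: queue a "progress
   changed" event for the main thread to handle in its event loop. */
static void PostProgressEvent(void)
{
    EventRef event = NULL;
    if (CreateEvent(NULL, kEventClassMyApp, kEventMyAppProgress,
                    GetCurrentEventTime(), kEventAttributeNone,
                    &event) == noErr) {
        PostEventToQueue(GetMainEventQueue(), event,
                         kEventPriorityStandard);
        ReleaseEvent(event);
    }
}
```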

One nice thing about the Carbon event model in particular is that I can send it 14 update events and it doesn't put 14 events in the queue. It's smart enough to know I've already got an update event in the queue, and it just leaves that one in there. So you don't have to worry about overflowing the queue. So, I've got a demo. This will be the non-threaded version. And you see the menus are dead. I can't drag the menu down. Oops, I could drag the menu down. Finished.

I've got a thread viewer running down at the bottom so you can see what's going on. We've got three threads. I'll move my arrow down here. Everybody can see that? Okay. So if I click here, you'll see it taking up all the CPU time down in the main thread here.

[Transcript missing]

You can drag the window around. Menus work. And that's the behavior that your users are going to expect.

So as you can tell, there hasn't been a whole lot of new information. The best thing we can say is, I think you know what we've been working on. We haven't broken anything that we know about. Everything there works the way it has, pretty well, and hopefully we'll continue working on that. I'll keep working on the preemption time to get that down, but other than that, that's about the extent of it. So hopefully we'll cover anything else you want to know in the Q&A. So thank you.

So just to give you a wrap-up of some of the sessions: Kernel Extension Programming Techniques was on Monday, so on the DVD you'll have the opportunity to take a look at that session. For contacts, there's George Warner in DTS, whom you may already know through your contacts and email, and myself as the desktop hardware evangelist. Let's go ahead and start Q&A and invite a few folks up.