
WWDC02 • Session 112

Writing Threaded Apps on Mac OS X

Darwin • 53:12

Learn how to leverage Mac OS X's threading architecture with single and multiprocessor hardware to significantly boost the performance and responsiveness of your application. This session covers the different threading models in Carbon, Cocoa, and BSD and discusses guidelines for determining which threading APIs to use for maximum benefit.

Speakers: Matt Watson, Robert Bowdidge

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

This is session 112, Writing Threaded Apps for Mac OS X. My name is Mark Tozer-Vilchez. I'm the hardware evangelist in Developer Relations. Good morning. As we all know, Mac OS X is a fully threaded operating system. For that reason, you should be writing apps to take advantage of that. We ship dual processor systems as well.

So writing a fully optimized application will take advantage of that second processor. I just want to show you a couple examples of, you know, an application taking advantage of that second processor and an application not. So the first example is a simple QuickTime movie. We'll go ahead and play all the movies here.

Can I get the demo screen? Demo two. Thank you. So I'll just run that application. Let's run Thread Viewer and take a look at what's going on. The video itself is not important. But what I want to show you as an example is, clicking back to the QuickTime movie and holding down the menu, you'll see another thread actually get started there for the menu, but the movie is still playing. Going to another application. This is just showing a molecule rotating. Again, we'll go ahead and attach Thread Viewer. And you can also see the processor is getting pinged there.

So you have a single thread. What happens when you hold down the menu button? Drawing stops. That second thread actually does not get created. What gets created is the thread for the menu system, which is the OS taking advantage of that. So for your application, what you want to do is have the user interface give the user feedback that things can still be occurring. Being able to drop down that menu, perform some actions, and still have the window being drawn is a huge advantage, particularly when you have dual processors. And for applications like this, a scientific application, that comes in really handy. Go back to slides, please.

So I want to bring up Matt Watson, Core Technologies engineer, who will present today's information on threading, how to thread your applications, and all the new features that we've added to Jaguar for threading. Thank you. Thanks, Mark. I'm Matt Watson. I work in the Core Technologies Group at Apple.

I've been at Apple about ten years now, mostly working on the BSD, Darwin layer and the threads implementation down there. Today I'll talk to you a little bit about writing multithreaded apps in OS X, some of the APIs we provide, the different API layers, things you need to know, what makes our platform a little different from other platforms.

So why do you want to use threads in your application? The first one is something that Mark mentioned, the customer expectation. If they spend a bunch of money on a dual processor system, they're going to want to get the benefit of that second processor. As he showed you, there's applications that can take advantage of the second processor fairly easily, but you can also write an application that's just a port from an existing classic app that will just block the API, excuse me, block the UI unexpectedly. So you really want to design your application to take advantage of all the features of the system.

Scalability. As you know, Mac OS X is a fully preemptive, multi-threaded operating system. So when you use an API that lets you take advantage of multiple threads, if you happen to be running that application on a dual processor system, you can take advantage of that second processor, sometimes without even knowing it. Preemption. Because the system slices time between threads on a kind of round-robin basis, your application, if it's multithreaded, will get a larger share of those time slices than a single-threaded application would.

There's some APIs in the system that are synchronous that you really can't avoid. So, for instance, hostname lookups classically have been these blocking APIs where you just have to wait for the response to come back from the network. And if you use a second thread, that means your application can go and continue and do other things while you're waiting for that response to come back.
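
As a rough sketch of that pattern (mine, not the session's), a hypothetical helper that pushes a blocking gethostbyname lookup onto a second pthread could look like this:

```c
#include <netdb.h>
#include <pthread.h>
#include <stdio.h>

/* Runs on a worker thread so the main thread stays responsive
   while the blocking lookup is in flight. */
static void *lookup_thread(void *arg)
{
    const char *name = arg;
    struct hostent *host = gethostbyname(name);  /* may block for seconds */
    printf("lookup of %s %s\n", name, host ? "succeeded" : "failed");
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, lookup_thread, "www.apple.com");
    /* ... the main thread keeps servicing the UI here ... */
    pthread_join(tid, NULL);
    return 0;
}
```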

Polling. As Avi mentioned in the keynote, polling is bad. You really don't want to have your application sitting there iterating on waiting for an event to occur. So one way to avoid that might be to create a second thread that just waits for the event and then notifies the main thread when that event occurs. Threads are really, really good for data-driven tasks.

And in a dual processor system, you can really get a huge performance boost if the way your application is designed is such that the data can be partitioned across multiple threads and each processor can just crank on the data. And I'll show you a little example of that. This is a slide we've shown before.

For a multiprocessor system, the scaling you can get, the numbers at the end of the slide, are the multipliers for the benefit of the second processor. Now, one thing you might notice in this: these are basically Photoshop filters that have been optimized for a dual processor system.

The second bar there shows you a 2.3x improvement. Well, that sounds kind of strange. How can you get more than 2x when you have two processors? Well, the trick is you get a second cache when you get that second processor. So if your application was using data that didn't quite fit in the primary cache of one processor, and you split that task up into two tasks, and both of those tasks now fit into the caches of both processors, you get a little bit of extra benefit where you're not calling into main memory all the time to get your data.

There are some cases where you might want to avoid threads. While it sounds attractive, you still have to be careful when you're using threads. There's some added complexity. For instance, if you're porting an application that was single threaded and you want to add multiple threads to it, you have to be careful of the global data in the task. You're going to have to add a locking mechanism if you want multiple threads to access that global data, and that might introduce some complexity that you'll just have to manage.

Threading adds a little bit of overhead to an application that you have to be aware of. There's kernel resources associated with that. There's context switching associated with that. So when you're adding threads, you have to decide whether the benefits of the extra thread are worth it for your task.

In a lot of cases, there may be other options that are better suited for what you're doing, like timers. If you just want something that periodically puts up a little display or increments a counter on the screen, say a clock display, a timer might be perfect rather than a dedicated thread sitting there just firing every second. Given that a GUI application will have an event loop that's already handling events fairly constantly, you can plug into that pretty easily.
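
One hedged illustration (mine): a CoreFoundation run-loop timer can drive a once-a-second clock update on the existing event loop, with no dedicated thread:

```c
#include <CoreFoundation/CoreFoundation.h>
#include <stdio.h>

/* Called once a second by the run loop that's already handling events. */
static void tick(CFRunLoopTimerRef timer, void *info)
{
    printf("tick\n");  /* update the clock display here */
}

int main(void)
{
    CFRunLoopTimerRef timer = CFRunLoopTimerCreate(
        NULL, CFAbsoluteTimeGetCurrent() + 1.0, 1.0, 0, 0, tick, NULL);
    CFRunLoopAddTimer(CFRunLoopGetCurrent(), timer, kCFRunLoopDefaultMode);
    CFRunLoopRun();  /* a GUI app's event loop would already be running */
    return 0;
}
```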

If you find that you're creating hundreds of threads in your application, you might want to rethink the design of that. You really want to make sure that your thread model follows the data, meaning if you have data sources coming from disk or network or the UI, that's really the only place that the system can take advantage of multiple threads. If you're creating 100 threads to read parts of a file off the disk, that doesn't make sense, because the file is just going to come from the same part of the system anyway, so all those threads are going to be serialized in effect.

A little bit about thread overhead that I mentioned. The way the system switches between threads is called a context switch. So every thread has a set of register state and resources that's associated with it. And whenever the system switches between threads, it has to save the context that was being used in the previous thread and restore it when that thread comes back to life. In the PowerPC architecture, the floating point and vector state is pretty expensive. We have a lot of registers. The vector registers are large. So if you're heavily using that in your threads, you may notice that the context switch time is going to cost you.

There's also memory footprint associated with each thread. Besides the actual thread structures that are being used in the system, every thread by default has 512K of stack. And that may sound pretty large, but the memory itself is virtual, meaning the system doesn't charge your application that memory until it's touching the pages.

But if you look at the address space, you have 4GB of address space, so half a meg of stack size is going to get eaten up pretty quick. There's APIs I'll talk about in a little bit that help you reduce that stack overhead if you know that your thread isn't going to be using a lot of stack.

Thread creation time is also a factor. There are APIs that let you create threads and let threads exit, but you may want to be aware of the actual cost of that. And if you want, you can create what's called a thread pool, where you just keep a set of threads kind of on the side and just signal them to let them know that they have work to do. Because if you just create and destroy threads on demand, you're creating all of those resources over and over every time you do a thread create.
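
A minimal thread-pool sketch along those lines (my illustration; a simple counter stands in for a real work queue):

```c
#include <pthread.h>

#define NWORKERS 4

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;
static int pending = 0;              /* trivial "queue": a count of work items */

static void *worker(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (pending == 0)         /* sleep until signaled that work exists */
            pthread_cond_wait(&qcond, &qlock);
        pending--;
        pthread_mutex_unlock(&qlock);
        /* ... do one unit of work here ... */
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NWORKERS];
    int i;
    for (i = 0; i < NWORKERS; i++)   /* pay the creation cost once */
        pthread_create(&tids[i], NULL, worker, NULL);

    pthread_mutex_lock(&qlock);      /* hand the pool some work */
    pending += 10;
    pthread_cond_broadcast(&qcond);
    pthread_mutex_unlock(&qlock);

    pthread_exit(NULL);              /* main exits; the workers keep running */
}
```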

All the APIs in Mac OS X that deal with threads have some common concepts. They let you create threads and destroy threads. There are synchronization primitives: depending on the API set you're using, there's mutex locks, and there's condition variables that can signal between threads that events have occurred. And every API set has its set of thread-safe services. And as we work on improving Mac OS X, we try to make that set of APIs richer and richer, so that you shouldn't have to worry about which APIs are thread-safe and which aren't.

A little bit about the internal implementation. As I said, threads are the scheduling primitive. The kernel doesn't deal with tasks per se as a scheduling primitive. It deals with threads. So, as I mentioned before, if your task has multiple threads in it, the kernel is going to look at your task more often through the scheduling loop.

Since the system is fully preemptive, that means a thread can get interrupted at any time. Now, there's some high-level APIs that represent cooperative threads, but underneath, those cooperative threads actually get interrupted in between instructions. So there's no guarantee of when that thread will get interrupted. If you're using the cooperative API, we're trying to encourage people to get off of that and use the higher-level preemptive APIs.

We use a priority-based scheduling model. The default scheduling for a thread gives every thread an equal priority, kind of in the middle of the priority range. If you want to enhance that, you can increase the priority of a thread. If you know that you have a thread that's not very important, you just want it kind of going off in the background doing some work, you can decrease the priority of that thread and it won't impact higher-priority threads.
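
For example (a sketch using the portable pthread scheduling calls, not anything session-specific), lowering a background thread's priority could look like this:

```c
#include <pthread.h>
#include <sched.h>

/* Drop a helper thread to the minimum priority of its current policy
   so it defers to more important threads. Values are illustrative. */
void lower_priority(pthread_t tid)
{
    struct sched_param param;
    int policy;
    pthread_getschedparam(tid, &policy, &param);
    param.sched_priority = sched_get_priority_min(policy);
    pthread_setschedparam(tid, policy, &param);
}
```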

Another implementation detail is we use a one-to-one threading model, meaning we map one kernel thread to one user-level thread. And that's somewhat different from other implementations that you might have seen where there's multiplexing, where multiple user threads map to a kernel thread. The main reason for this is simplicity of the design, and it scales a little bit better on multiprocessor systems. With the multiplexing implementation, there'd have to be some signaling to the kernel when the user-level threads switch context. So we don't have that in our implementation.

Mach is the API that you may have heard about in the kernel. That's where the threading implementation lives at its very lowest level. You can inspect Mach threads and the Mach thread attributes in a task, but we generally recommend that you use the higher level threading APIs instead.

The reason behind that is if you use a low-level Mach API, you kind of subvert the higher-level API's usage. So if you change the priority of a Mach thread directly and the pthread layer wants to manage that priority too, it might get confused. Those priorities might get out of sync.

Mach is fully symmetric multiprocessing. It was designed from the ground up to handle systems with symmetric multiprocessors. So all of the code that we use for locking and synchronization is designed for that use. We have a single kernel for uniprocessor and multiprocessor. Basically, that simplifies our customer support.

We have a single binary that we ship. We have one install. So basically, you can take a disk that you installed Mac OS X on with your dual processor system, bring it over to a single processor system, and it'll work just fine. You don't have to worry about whether that was specifically designed to be UP or MP.

One of the implementation details of Mach is we have lazy floating point and vector context save. I mentioned that one of the costs of threading was the context switch, where you have this floating point and vector state that has to be saved and restored. Well, Mach has a little optimization: if your thread hasn't used any floating point or vector state, we know that, and we save and restore that state lazily, only if the thread has used those features.

The Mach scheduler is what our scheduling system is based off of. It came from the OSF, the Open Software Foundation. We've heavily modified it since we originally got it, and it's been tuned for the higher layers of the system. It uses a global run queue, meaning that we don't have a run queue per processor. We have a run queue for the whole system, and basically as threads block and other threads become runnable, the system will schedule it based on the available processor.

There's a notion of an idle processor signal, so if a processor happens to be not in use, there's a signal that occurs that lets the system know that, okay, I can now schedule a thread on this processor. One of the things that's a little bit different in Jaguar is we've tuned the thread affinity.

Thread affinity is basically the notion that you want a thread to continue running on the same processor that it's been running on. The reason behind that is cache affinity: the data a thread has been working on is sitting in the cache of the processor it's been running on.

So you don't want to have to migrate that data back and forth between the caches on the two processors. So in Jaguar, we try to be a lot more aggressive about keeping threads running on the same processor. We don't have any high-level API for binding threads to a specific processor, but you'll notice if you're using the CPU Monitor app and you have a single-threaded process running, that thread is not going to bounce back and forth as much between the multiple processors.

The Mach scheduler is responsible for the preemption. So just to reiterate, your threads can get interrupted at any time, and you have to be able to react to that. If you're using atomic operations, you're protected from that. But if you're depending on global data, you really have to make sure that you're using the synchronization primitives that the thread API provides. I'll talk a little bit about Pthreads now. This is the lowest-level API that we recommend people use in the system. It's what the Darwin thread model is based off of.

Like I said, it's used by all the thread APIs. So the Carbon, the Cocoa, and the Java thread APIs all call into the POSIX implementation. The benefit of that is whenever we make improvements to the POSIX thread implementation, all the other thread models can take advantage of that. So if we make mutexes faster, if we make context switching a little faster, all these high-level thread APIs can take advantage of that.

I call it a POSIX-like implementation because we don't have a conformance test suite that we pass when we implement this API, but when we refer to it, we want to use the POSIX naming and conventions because it's fairly easy to go down to your local bookstore and pick up a book on Pthreads and just start using it on Darwin.

Like I said, we have a one-to-one Mach-to-pthread implementation. Simplifies the user-level APIs. Simplifies debugging. All the applications that you see would have to be much more complex if we had a multiplexing thread implementation. Pthreads is, like I said, a fairly common API. It's easy to go find a reference for it and pick it up fairly quickly.

There are some misuses, which I'll mention in a little bit, that are fairly common for people who are new to the API. One thing to note about what we currently have, if you're looking at the CD that you got at the conference: we don't have any system-wide types yet.

So POSIX specifies an implementation that provides process-wide or system-wide mutexes and condition variables. So process-wide means that they're visible within the same task. I can have a mutex that I lock and unlock in the same task. System-wide means multiple tasks can put a mutex or a condition variable in shared memory and use that. This is pretty common on databases or server applications, but as of today, we don't have that yet. Okay.
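
For reference, the POSIX shape of the API being described looks like the sketch below (mine); per the session, the setpshared call would fail on Darwin at this point, since the system-wide types weren't supported yet:

```c
#include <pthread.h>

/* Initialize a mutex intended to live in shared memory so multiple
   processes can lock it. PTHREAD_PROCESS_SHARED is the system-wide case;
   PTHREAD_PROCESS_PRIVATE (the default) is the process-wide case. */
int init_shared_mutex(pthread_mutex_t *m /* must be in shared memory */)
{
    pthread_mutexattr_t attr;
    int err;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    err = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}
```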

One thing to be aware of in Pthreads is synchronization is not cheap. You really want to make sure that your threads are data-driven. You want to take full advantage of the processors in your system. You want the threads just to be cranking on the data and then signal when they're done.

You don't want threads to be sending status updates all the time. There's other mechanisms to do that. You can peek at the thread and check its status if you just want a user-level UI notification. But if you're ping-ponging between threads back and forth, that's going to cost you a little bit.

By default, the specification for Pthreads says that threads are what's called joinable, meaning the system will worry about that thread and keep track of it for you. And if you really want that thread just to go off and do its job and you don't want to hear about it anymore, you have to use what's called the detach API, meaning you either create the thread detached or you detach the thread after it's been created.

This basically lets the system know that you don't really care when that thread exits anymore, you just want it to go off and do its job. What you'll notice if you don't detach is that these threads will just hang around in the system consuming these resources until someone joins them or cancels them.
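
Both variants look roughly like this (my sketch):

```c
#include <pthread.h>

static void *fire_and_forget(void *arg)
{
    /* ... do the work; the system reclaims resources when we return ... */
    return NULL;
}

int start_background_work(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    int err;
    pthread_attr_init(&attr);
    /* Create the thread already detached so nobody has to join it later. */
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    err = pthread_create(&tid, &attr, fire_and_forget, NULL);
    pthread_attr_destroy(&attr);
    /* Alternatively: create it joinable, then call pthread_detach(tid). */
    return err;
}
```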

There's a stack issue, like I said, with threads, where by default you get a fairly large stack, and this doesn't cost you too much as long as your application isn't using very many threads. But once you get up to a lot of threads, the stack size actually becomes a limitation, because your virtual address space starts to get smaller. The APIs in Pthreads that let you shrink your stack size allow you to do that at creation time. So you specify: I only want to use 16K of stack for my thread.

But this is probably enough for most usage. So if you're creating threads a lot, you might want to take a look at this. Checking predicates is very important in Pthreads. A predicate is the global notion of a condition that has occurred or that you're waiting for. So you usually have a global variable, and if you are waiting for a condition to be signaled using the pthread condition mechanisms, what you do is lock a mutex and check the global variable to see if the condition has occurred. And that condition is only manipulated within that mutex. If the condition hasn't occurred yet, you then wait on the condition, which simultaneously unlocks the mutex.

Now, if you don't have a predicate, that causes some scheduling problems, and your application will not respond like you want because you basically kind of cheated the system by saying, oh, I know what I'm doing. I'll just wait on this condition. Well, there's race conditions implicit in this API, so you have to have a predicate. And you always have to make sure that it's checked.

Another reason is because the condition wait implementation can spuriously wake up. The specification actually says that a condition wait can wake up with a success result even though the condition hasn't been signaled. And the reason behind that is on a multiprocessor system, it's very difficult to implement this atomic unlock the mutex and wait on the condition, so they give you a little bit of leeway. And once your condition has been signaled, you also get a successful wake up, so you also have to check that predicate every time your condition wait returns.
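
Putting those rules together, the canonical predicate loop looks like this (a minimal sketch):

```c
#include <pthread.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static int             ready = 0;  /* the predicate, only touched under the mutex */

void wait_for_ready(void)
{
    pthread_mutex_lock(&lock);
    while (!ready)                        /* a loop, because wakeups can be spurious */
        pthread_cond_wait(&cond, &lock);  /* atomically unlocks, waits, relocks */
    pthread_mutex_unlock(&lock);
}

void signal_ready(void)
{
    pthread_mutex_lock(&lock);
    ready = 1;                            /* set the predicate under the mutex */
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&cond);
}
```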

There's an API for canceling threads, which basically means I would like that thread to stop running. And you do this from an outside thread. The two models are deferred cancellation and asynchronous cancellation. Deferred cancellation is good because it's a lot more well-defined. You tell the thread more as a request that you'd like it to cancel. The thread, if it gets to a point where it's safe to cancel, will then exit itself, clean up its resources, and go away.

Async cancellation, asynchronous cancellation is not considered very good because you don't really know where the thread is when it's running. If it's not at a safe point and you tell it to go away, it could be holding a critical system resource like a lock and you basically just yank that out from underneath it. So we discourage the use of that, but it's part of the API specification.

A little bit more detail on pthread_cancel. In Jaguar, we've added more cancellation points besides pthread_testcancel. The system-defined cancellation points are described in the POSIX specification. They're kind of the common Unix system calls like read and select, all these things that could normally block for an indefinite amount of time. You want to make sure that you can wake up a thread that's blocked in those APIs.

One of the common uses is pthread_cond_wait as one of these places where you might want to cancel a thread while it's in the middle of the API. One of the interesting things about pthread_cond_wait, like I said, is that it unlocks the mutex that you were using to protect your predicate. So when you cancel a thread there, it reacquires that mutex.

So you have to use what's called a cleanup handler when you cancel a thread that's blocked in pthread_cond_wait. That cleanup handler will unlock the mutex and let the thread exit cleanly, because you don't want a thread exiting while holding a mutex; that would mean no other thread in the process could ever acquire it again.
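
A sketch of that cleanup-handler pattern (mine, not the session's code):

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int work_available = 0;

static void unlock_on_cancel(void *arg)
{
    /* Runs if the thread is canceled inside pthread_cond_wait, which
       reacquires the mutex before the cancellation proceeds. */
    pthread_mutex_unlock(arg);
}

void *cancellable_worker(void *unused)
{
    pthread_mutex_lock(&lock);
    pthread_cleanup_push(unlock_on_cancel, &lock);
    while (!work_available)
        pthread_cond_wait(&cond, &lock);  /* a cancellation point */
    pthread_cleanup_pop(0);               /* 0: don't run the handler now */
    pthread_mutex_unlock(&lock);
    return NULL;
}
```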

Another added API is pthread_atfork. Now, this is kind of esoteric in that it's usually only used for library implementations or plug-ins that really need to worry about when the process has forked a new child process. The problem with this is it's very hard to do correctly, because usually all the subsystems that are using pthread_atfork need to be intimately aware of each other so that they don't introduce any deadlock conditions.

Because if I'm depending on an API that needs to acquire or release a mutex around the fork, you need to make sure that all the atfork routines have been registered in the proper order. So we kind of restrict its use and mainly provide it just to supply the API.
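
The registration itself is simple; it's the ordering across subsystems that's hard. A sketch of the common library pattern:

```c
#include <pthread.h>

static pthread_mutex_t liblock = PTHREAD_MUTEX_INITIALIZER;

static void before_fork(void)  { pthread_mutex_lock(&liblock); }
static void parent_after(void) { pthread_mutex_unlock(&liblock); }
static void child_after(void)  { pthread_mutex_unlock(&liblock); }

/* A library calls this once, e.g. from its initializer, so its lock is
   held across fork and left in a sane state in both parent and child. */
void lib_install_fork_handlers(void)
{
    pthread_atfork(before_fork, parent_after, child_after);
}
```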

Another set of new additions are the attributes on conditions and mutexes. This was a big request from a lot of our customers who were saying that they were finding it difficult to port existing applications because we didn't provide this API. Like I said, right now the system-wide attributes aren't supported, but if you look in the pthread.h header file, you can see the attributes that we support.

One of the interesting ones is the get and set stack attribute. That is an addition in what's called the Single UNIX Specification, version 3, which is kind of an aggregate that contains the POSIX thread specification. This allows you to get and set the stack size and stack location.

So if you're creating and destroying a lot of threads, you may want to actually pick a place that you want the stack to start at for that thread so the system doesn't have to allocate stack for it. And you can also specify the size of the stack. So if you know your thread is consuming very little stack, you can optimize that a little bit more.
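
A hedged example of shrinking the stack at creation time (the 64K figure is just illustrative):

```c
#include <pthread.h>

static void *small_stack_worker(void *arg)
{
    /* Keep locals small and call depth shallow: only 64K of stack here. */
    return NULL;
}

int spawn_small_stack_thread(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    int err;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);  /* instead of the 512K default */
    err = pthread_create(&tid, &attr, small_stack_worker, NULL);
    pthread_attr_destroy(&attr);
    return err;
}
```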

Another thing that's been added is per-thread signaling, a very commonly requested feature. The three APIs that are implemented under this are pthread_kill, pthread_sigmask, and sigwait. pthread_kill is the mechanism that you use to tell a thread that you want it to be interrupted. The thread, if it handles that signal, will be interrupted, do what it normally does in a signal handler, and then continue execution. pthread_sigmask is the API that specifies which signals a thread is willing to handle. So a common use is, if an application is multithreaded and it can get external signals from outside the process, you dedicate a thread with pthread_sigmask to handle those external signals.

sigwait is a mechanism that will allow you to basically block a thread until one of these signals occurs. So you can just dedicate that thread to handling a signal, and it doesn't have to do any work besides sitting in its sigwait call. This is very helpful in porting a lot of Unix applications; Apache 2.0, for example, makes heavy use of per-thread signaling.
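
The usual shape of that dedicated-signal-thread pattern (my sketch; the signal choices are arbitrary):

```c
#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static void *signal_thread(void *arg)
{
    sigset_t set;
    int sig;
    sigemptyset(&set);
    sigaddset(&set, SIGHUP);
    sigaddset(&set, SIGTERM);
    for (;;) {
        sigwait(&set, &sig);   /* blocks until one of these signals arrives */
        printf("got signal %d\n", sig);
    }
    return NULL;
}

int main(void)
{
    sigset_t set;
    pthread_t tid;
    sigemptyset(&set);
    sigaddset(&set, SIGHUP);
    sigaddset(&set, SIGTERM);
    /* Block the signals before creating threads, so every thread inherits
       the mask and only the dedicated thread ever collects them. */
    pthread_sigmask(SIG_BLOCK, &set, NULL);
    pthread_create(&tid, NULL, signal_thread, NULL);
    /* ... rest of the application ... */
    pthread_join(tid, NULL);
    return 0;
}
```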

Another commonly requested feature was read-write locks. The pthread read-write lock helps when you have global state that's commonly read but not as frequently written; you can optimize that case using this API. Our current implementation prefers writers, meaning that if you have multiple readers waiting for a lock that a writer holds and another writer comes in, that writer will take precedence.

The idea is that this lets the system make progress a little bit faster, because writers usually need to go make a change and continue, whereas with readers, you can have many reading at a time, while you only ever want one writer writing at a time. This is now mandatory in the Single UNIX Specification.
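
The read-often, write-rarely case looks like this (a minimal sketch):

```c
#include <pthread.h>

static pthread_rwlock_t tablelock = PTHREAD_RWLOCK_INITIALIZER;
static int shared_table[256];     /* read constantly, written rarely */

int lookup(int key)
{
    int value;
    pthread_rwlock_rdlock(&tablelock);  /* many readers can hold this at once */
    value = shared_table[key & 0xff];
    pthread_rwlock_unlock(&tablelock);
    return value;
}

void update(int key, int value)
{
    pthread_rwlock_wrlock(&tablelock);  /* a writer gets the lock exclusively */
    shared_table[key & 0xff] = value;
    pthread_rwlock_unlock(&tablelock);
}
```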

It's part of the UNIX specification's threads extension, so we decided to implement it as well. There's a couple of new mutex types. The error-check mutex is very handy. It'll allow you to determine whether a thread would deadlock when it attempts to lock a mutex, or it will let you know when a thread that unlocks a mutex wasn't actually the mutex holder.

This functionality has been in Puma using the debug libSystem, if you know how to do that. There are two versions of the System framework, one of which has some debug help in it. If you use the DYLD_IMAGE_SUFFIX environment variable, you can specify that you want your program to use the debug flavor while you're under development.

The recursive mutex is there mostly for completeness; I've never found a case where you couldn't design a locking implementation that avoids recursive mutexes. It's usually for people who are porting something that depends on recursive mutexes and haven't quite spent the time to reimplement that yet.
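
Requesting either type goes through the mutex attribute; a sketch of the error-check case (the recursive type is requested the same way):

```c
#include <pthread.h>

/* An error-checking mutex reports self-deadlock (relocking a mutex you
   already hold) and unlocks by a thread that isn't the holder. */
int init_errorcheck_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int err;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    err = pthread_mutex_init(m, &attr);  /* use PTHREAD_MUTEX_RECURSIVE
                                            here for the recursive type */
    pthread_mutexattr_destroy(&attr);
    return err;
}
```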

For references in Pthreads, the man pages have been updated in Jaguar, so you can now use those as a pretty good reference. The Darwin CVS repository is where the implementation actually lives, so you can follow the progress. Like I said, we'll be making some changes since the CD that you got before Jaguar is finished, and you can track that in the LibC and the XNU projects.

XNU is the kernel that has the high-level or low-level, depending on your perspective, implementation of some of the primitives that are used by the Libc project. The Open Group has a nice website with the Single UNIX Specification that we kind of model, and if you want to go there and bookmark the page for the Pthreads API, it's a fairly good reference. And the newsgroup comp.programming.threads is where a lot of Pthreads experts like to hang out.

I'll talk a little bit now about the Carbon MP API. It's one of the high-level APIs that lives on top of the Darwin POSIX threads. One of the notions that is a little bit different in Carbon is the MP tasks are what I was referring to as threads previously.

So in Classic, you had MP tasks that ran in a process's address space, and those are basically what I've been referring to as threads. A difference between Classic and Mac OS X, obviously, is you now have separate address spaces, so MP tasks won't be able to communicate like they could in Classic between multiple address spaces, so you have to be aware of that.

Like all the thread APIs, there's several mechanisms that exist to coordinate MP tasks. There's semaphores, which are simple signaling mechanisms. Message queues, which allow you to create fairly complex worker-client implementations. There are event groups, so you can wait for lots of different signals to happen using a single entry point. There's also critical regions, where if you have a set of code that you only want one thread to be in at a time, you can specify that as a critical region. It also provides recursive entry, so it knows that the same thread can enter that critical region.

The MP task API also provides some handy atomic increment/decrement operations. So if you have a very common operation like a reference count that you want to increment or decrement, rather than taking a full-blown lock, changing the variable, and unlocking, you can just use one of these atomic operations. Examples of the APIs in Mac OS X that use MP tasks are the Synchronous File Manager and Open Transport. Both take advantage of this API. The Carbon API has a list of thread-safe services, and that's documented in Tech Note 2006, if you want to take a look at that.
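
A rough sketch of how these pieces fit together, going from my recollection of the Multiprocessing Services headers rather than the session itself, so treat the exact names and signatures as assumptions and check Multiprocessing.h:

```c
#include <CoreServices/CoreServices.h>

static MPSemaphoreID gWorkDone;   /* simple completion signal */
static SInt32 gCounter = 0;       /* shared count, updated without a lock */

static OSStatus WorkerTask(void *parameter)
{
    IncrementAtomic(&gCounter);    /* atomic increment, no full-blown lock */
    /* ... crank on the data ... */
    MPSignalSemaphore(gWorkDone);  /* tell the creator we're finished */
    return noErr;
}

OSStatus RunWorker(void *data)
{
    MPTaskID task;
    OSStatus err = MPCreateSemaphore(1, 0, &gWorkDone);
    if (err != noErr) return err;
    err = MPCreateTask(WorkerTask, data, 0 /* default stack */,
                       kInvalidID /* no notify queue */, NULL, NULL, 0, &task);
    if (err != noErr) return err;
    return MPWaitOnSemaphore(gWorkDone, kDurationForever);
}
```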

Next I'll talk about Cocoa threads, another high-level thread API, kind of a peer to the Carbon threading. NSThread is the Cocoa thread class. It's very simple to use; there are very few APIs you have to learn to use it. Most threading models have the notion of a start function, so in Objective-C you have a start method that your class will implement. NSThread also keeps track of the Objective-C exception model: when you raise an Objective-C exception, that raise occurs in the specific NSThread rather than kind of throughout the whole process.

There's an exit notification, which is kind of like the joining mechanism I talked about. When a thread exits, you may just want to know that it's gone away, so you can register for that notification through the normal Cocoa notification mechanisms. Each NS thread also has a per-thread dictionary.

So this is kind of an extension of the notion of per-thread data, where a thread can have a set of data that's just for that specific thread. And in Cocoa, we provide a nice little wrapper that lets you put an object-oriented NSDictionary on top of that data for key value type access.

There's an AppKit extension to NSThread which is basically a hint to the system that says, I'm going to be creating a background thread that may be doing some drawing. If you look in the NSApplication class, it's called detachDrawingThread. And, like I said, it's basically a hint to the system that lets it know that the main thread may need to respond to requests from this background thread to do drawing operations.

Each NSThread has a separate run loop. So you may see some interesting behavior that you have to worry about where if you post a notification, it doesn't go to the full application run loop, the main run loop that the event handling is occurring in. It only goes to your thread's run loop.

So there's other mechanisms that you can use if you want a notification to be broadcast more broadly outside of your thread. You also need to start the run loop explicitly: when you create a new thread and you want that thread to have a run loop handling things, you have to start that run loop yourself.

So if you want to handle things like these notifications, that run loop hasn't been running yet. You have to kind of think about how this thread was created and decide once you've set up everything to go ahead and start its run loop and let it do its thing. There's also an auto-release pool associated with every thread.

The memory management that's part of Cocoa allows you to auto-release objects, meaning you can kind of defer the deallocation of an object to a later time. And with NSThread, every thread has a pool associated with it. So when you auto-release an object, it goes into the thread's pool. When the thread exits, that auto-release pool also gets destroyed.

Some of the future things we're working on in all of the thread APIs are the notion of priority inheritance. This is kind of a problem where once you've created threads with different priorities, you may find that a higher priority thread is blocked waiting for a lower priority thread to release a lock.

But because of the scheduling behavior, that lower priority thread won't get to run while the higher priority thread is blocked. So an implementation could be designed that allows that lower priority thread to temporarily inherit the priority of the higher priority thread. This would let the lower priority thread release whatever resources it was holding so the higher priority thread can continue.

System-wide shared resources: like I said, we're working on that. It's very commonly requested by server and database clients, so we really want to make sure that we provide all the APIs that they need. Performance. There's always room for improvement in performance. We're really trying to make the threading APIs as efficient as possible.

There's been some talk about things like implementing thread pools so that you can have a nice high-level way to create thread pools in the POSIX-type API. Other things, just low-level making the context switch faster, making threading overall improve. And like I said, when we make performance improvements at the pthread level, all the other thread layers take advantage of that.

So if you want to follow this, you should watch the Darwin Repository because that's where the threading and the kernel implementations live for the Pthreads and the Mach implementation. So now I'd like to bring up Robert Bowdidge, who works in the Developer Tools team, and he's going to show a demo of a multithreaded application and how to inspect what's going on at the time.

Okay, thank you, Matt. As Matt mentioned, my name is Robert Bowdidge. I am a member of the Developer Tools Group, where I'm responsible for the performance tools on Mac OS X. Now, what Matt has gone through is a little bit of the how on threads. He's also told us a little of the why on threads, about why we actually might need them. What I'd like to do is take that one step further and try to give you some examples of good thread use, to give you some examples about how Apple has actually used threading in its own applications.

The way that I'm going to do this is I'm going to use a performance tool called Thread Viewer that's on the developer CD. How many of you have actually tried this? That's good. How many of you understand what it gives out? Good. Sort of good. What ThreadViewer does is it provides a timeline visualization of what's going on in an application. It tends to be kind of coarse-grained, but it's very nice for understanding sort of overall trends of what your app is doing. The first app I'm going to look at is the Finder.

Now, the Finder's actually a really interesting example for threading because, first of all, it's our primary access into the file system. You know, it has a direct manipulation interface, and so we expect that when we drag a folder around, we want to imagine as if we're actually dragging a folder from one place to another.

And so the Finder must be as responsive as possible for us to be happy. On the other hand, the Finder also has the problem that it's dealing with two of the slowest devices on the system. It's dealing with the disk, which when it goes out to get accesses is going to take a while to come back with data.

And it's dealing with the network which is even worse because nothing may come back. And so the finder has to be extremely good and has to try as hard as it can to try to make sure that it's as responsive as possible while dealing with these resources that tend to be extremely slow.

So what we have at the top window here is Thread Viewer showing what's going on in the Finder. As you can see, you have a timeline view going from the distant past into the present. And if I actually do some activities in the Finder, we see some actions going on.

And what you see here is little colored blocks on each of the horizontal lines representing each of the threads. The blocks represent what was going on in that thread during a point in time in the past. Thread Viewer checks every 20 milliseconds to see what's going on in the thread. The green blocks, such as this one... let me turn on the demo mode to make this a little easier. Display in large demo mode.

There we go. The large blocks, the green blocks, indicate that the thread had actually been executing at the point when Thread Viewer checked. And when we actually click on one, we get a backtrace illustrating exactly what code was executing, in hopes of giving us some idea of what the application was doing. The yellow blocks indicate that when Thread Viewer stopped the program, nothing was happening; the thread was waiting. However, the thread had been running since the last time it checked, so this indicates some execution had been occurring.

The dark green blocks indicate that we were running in the kernel: the thread was actually running, but it was running down in the kernel. And that usually implies that we were doing things like accessing the disk, for example, where it could be long-running or where it's actually a task that's blocking and the application has to wait for the kernel to come back with data. Then the pastel colors tend to represent waiting. The pastel green, for example, represents sitting in the wait loop.

You can tell from the stack backtrace here that the CFRunLoopRun call happens to be in the backtrace, and the thread is actually waiting on the Mach message call. It's just sent off a message to the window server saying, let me know when something actually happens, and it's waiting for the response. And then other colors, such as red, indicate waiting on a semaphore.

So what we see here is that the Finder is actually using multiple threads. As with most applications with GUIs, you have one main thread, which tends to have most of the drawing code on it. And so this is the thread where you want to do all the activity for the direct manipulation, for moving things around, for resizing windows and the like. And that thread tends to have a fair amount of activity on it. The other threads are a little more interesting, though.

For example, here we found that every time we opened a new folder that we'd never seen before, a new thread is created. And the reason for that is that when we open a folder, or when we go to a folder that we've seen before, the Finder actually goes off and queries that directory to find out what the files are that are inside of it.

Excuse me? And so what happens is the Finder actually spawns off this thread. The thread then goes off and it asks the disk, what are the files in that directory? And because that operation may block, the Finder wants to make sure it actually is on a separate thread, so that if we're going off to an iDisk, let's say, where it might take a long time for that to actually come back, this is not going to block the main thread and cause the user interface to hang on us.

This also happens when we go into a directory after we've done it for the first time, because the contents of that directory may have changed since the last time we looked. And so the thread can actually go off and gather the data. There's actually a second thread, which we can see was doing activity about the same time that each of those threads was formed.

The Finder team calls this the Sync Thread. And that actually is the thread that maintains the database of information on folders that we've already opened. And so the Sync Thread is actually responsible for spawning the threads to gather information and also update the database, which the main thread can then actually query to find out what the contents of a given directory are.

This is a second reason to use threading, the idea of separating out the logic of dealing with the database of current file contents from the idea of dealing with the UI or from the idea of actually querying the data on the disk. And so having the Sync Thread actually logically separates things. So that's the second reason we actually wanted to use threading.

Let's imagine we do a duplicate. So go up to the Finder menu and do Duplicate. We see a relatively long-running operation. And we see that there are two threads spawned, and in fact, both have lots of dark green blocks, which indicate we're doing a lot of blocking, waiting for the file system to do work.

Now, when we're doing a copy... Actually, let's let this continue. When we're doing a copy, this is an activity that probably we want to do in the background. We don't want the user interface to hang while we wait for the copy to finish, and therefore the Finder actually spawns this off on separate threads as a background activity, puts up a dialog box to let us know what's going on, and we can continue to do work while the copy goes on in the background.

Now, one of the things you may be asking is, well, why were two threads spawned? We only were doing one copy. So one of the other reasons that you may want to use threading is the idea of being able to do operations in parallel. So if we're doing a copy from, let's say, a single disk, copying from one place to another, every time that we do a read, we have to do a write. And because those are going to be sequential, we have to do read, then write, then read, then write.

However, if we end up copying from one volume to another, we have an opportunity for parallelism. Because as we're writing each block, we can start reading the next block off of the disk. And we can do those in parallel because they're dealing with different devices. And so the Finder actually uses two threads, so that as, let's say, block n is being written to disk, block n plus one can be read into memory at the same time. And this can double throughput. And the same two-thread approach is used for all the copies, to make sure that it handles that case, and because there's no particular inefficiency in using it elsewhere.
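
As a sketch of that overlapped copy (mine; the Finder's actual implementation isn't shown in the session), one thread reads ahead into a hand-off buffer while the other writes:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 65536

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static char   buf[BLOCK];   /* single-slot hand-off buffer */
static size_t buflen = 0;   /* 0 means "empty" */
static int    done = 0;

static void *reader(void *arg)    /* reads ahead without holding the lock */
{
    FILE *in = arg;
    char local[BLOCK];
    size_t n;
    while ((n = fread(local, 1, BLOCK, in)) > 0) {
        pthread_mutex_lock(&lock);
        while (buflen != 0)       /* wait for the writer to drain the slot */
            pthread_cond_wait(&cond, &lock);
        memcpy(buf, local, n);
        buflen = n;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *writer(void *arg)    /* writes block n while block n+1 is read */
{
    FILE *out = arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (buflen == 0 && !done)
            pthread_cond_wait(&cond, &lock);
        if (buflen == 0) { pthread_mutex_unlock(&lock); break; }
        fwrite(buf, 1, buflen, out);
        buflen = 0;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    FILE *in = fopen(argv[1], "rb"), *out = fopen(argv[2], "wb");
    pthread_t r, w;
    if (!in || !out) return 1;
    pthread_create(&r, NULL, reader, in);
    pthread_create(&w, NULL, writer, out);
    pthread_join(r, NULL);
    pthread_join(w, NULL);
    fclose(out);
    return 0;
}
```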

Now, one of the other issues that the Finder team ran into is the issue of legacy code. How many of you actually have code that you didn't write or that you're responsible for and you don't quite understand? Oh, good. So one of the issues that some of the teams that use Mac OS X have internally is that some of the frameworks that we use, some of the bits of library code, have been inherited from Mac OS 9, which were inherited from Mac OS 8, which were inherited from and so on.

And, in fact, there were some parts of the system that actually ended up being...were written without thinking about multi-threading, that multiple people might actually be trying to use the APIs at the same time. And one of the problems that the Finder team actually had was the issue of Icon Services, which is the part of the code that actually goes off and gets resources out of files, gets icons and the like. And, in fact, they ran into some interesting problems on 10.0 and 10.1, where they had to do the accesses to Icon Services on the main thread, because they couldn't guarantee that there weren't two things trying to do it at the same time.

And as a result, many of those operations ended up being done on the main thread and blocking. This is actually one of the things I believe they're fixing for 10.2; they've actually got a better way to deal with the icons. But this introduces the point that even when you have an application that may be very well suited to multithreading, you may run into blocks of code that you can't actually multithread, because they weren't designed for this and you can't do multiple access to them.

And so this is something to keep in mind as you look at code and as you look at multithreading issues: whether there are going to be issues dealing with things that shouldn't be threaded. Okay, so that's the Finder example. The second example I'm going to do is the QuickTime Player. And to be self-referential, what I'm going to use as a demo is last year's movie. So, let's see what's going on in QuickTime Player. Let's again run Thread Viewer. We'll attach to the QuickTime Player.

And we see that multiple threads are running. Okay, so what's going on here? And if we actually stop the program... actually, I'm not going to stop it... well, we'll try it. We'll see what happens. We can get some ideas about what's going on. Once again, the main thread, thread zero, is where all the drawing goes on and where all the user interface handling happens. So if we're doing resizes, if we're clicking around on buttons, that's where the activity goes on. The rest of the threads are actually created by the QuickTime libraries themselves.

If we actually let this run, eventually we're going to see some disk accesses on one of the threads. There's going to be occasional green blocks as QuickTime goes off and gets data. So as with Finder, the QuickTime code actually wants to make sure that when it goes out to get files off of disk, that it doesn't block, that it doesn't cause the user interface to hang waiting for those blocks to come in. And so the QuickTime thread for getting the data can actually grab large blocks of data, can copy them in, and doesn't have to worry about what happens if that blocks. There goes one.

Yes, read. There's a second thread that tends to be doing a lot of activity, probably the top one, that actually represents the decompression algorithm. So that's actually being run on a separate thread, which follows the idea of trying to separate different tasks on different threads just to make it easier to reason about the program.

And then finally, when the data is decompressed and it's being sent to the screen or being sent to the audio device, rather than taking large chunks of data as we did with the disk, the audio device probably wants very small bits of data and wants to send them to the audio device as quickly as possible to minimize latency. And so the writing to the audio device... Let's see if we can actually find that.

Actually, here we are. Code in the I/O thread. This is probably actually the thread that's doing the audio data. That actually ends up being on a separate thread so that it can keep up its real-time behavior. A final thing to notice about this version of Thread Viewer that is on the May 2002 developer CD is that not only do we have a thread identifier, but we also have this little set of letters in parentheses. That actually represents the scheduling algorithm for each of the threads.

With the Finder, we saw TS, which stands for basically a time-sharing thread, which runs whenever the kernel gives it a chance to. If it runs too much, then the kernel bumps down its priority and gives somebody else a chance, and so on. The other threads are what are called round robin.

They're sort of a real-time thread. In this case, what ends up happening is that when you create the thread, you actually say, "I need 20 milliseconds every second to do my work." The kernel tries to schedule that time. For the real-time nature of this, for dealing with the media, having those have some sort of real-time guarantee is actually a good thing. Now, when I actually put this in, the folks on the kernel team who deal with scheduling were a little aghast, because they were afraid that we would all go off and immediately make our threads high priority to try to get as much CPU as possible.

This is a very bad idea because the system's been designed so that the things that actually need the time should be running as high-priority threads. And if you start running your stuff at high priority, if you decide that maybe you want your user interface running with round robin, you can actually run into some cases where you might have priority inversions where you end up deadlocking other parts of the system. Or, worse yet, you might be a bad citizen. So let's imagine your application's running and you've somehow set it up so that it's using extra priority.

Then somebody comes along and says, "Oh, I want to burn a CD in iTunes." Well, the CD device actually has some relatively strict scheduling that has to be done to make sure that data actually gets out to the device. And there are actually times on DVDs where if you don't get that data just right, you can actually mess up disks and corrupt them in ways so that you basically have a piece of plastic. So as a result, you want to be very careful about how you might actually use scheduling so that you don't cause people to generate bad DVDs.

So there are different ways that you can actually schedule things that are known to be real time. Use them extremely sparingly. So that's my example for today. Hopefully that gives you some ideas about how you might actually use threading in your own applications and how Apple actually uses it in theirs. Thank you.

Thank you, Robert. So let's give you a roadmap. Go back to the podium. So, obviously, since this is Thursday, some of these sessions have already occurred. But for those of you who will be looking back either through the website or through the DVDs, these are some of the sessions to look at for performance optimization.

Who to contact? Matt Watson has been kind enough to provide his own email, so he's available, and I'm also available. From time to time, I do hold kitchens for performance optimization, or specifically MP or threading kitchens. So if those are things that might interest you, if you need more assistance, please contact me. I'd be happy to work with you to set up a schedule for that.

Some more areas for documentation: the Carbon threading links and Cocoa threads. All of these URLs are available at the main URL site provided for the conference, along with some more tech notes. So again, the takeaway from today's session is: threads can work for you. They can also work against you. So use them where it makes sense. Use them where the user experience will be enhanced.

Threads don't necessarily mean that you have to have a dual processor to take advantage of them. They will assist you in that case when you do have a dual processor system. But as Robert showed you in his examples, applications can take advantage of threading even on a single processor machine, for instance by overlapping reads and writes. So with that, I'd like to start the Q&A.