Core OS • 44:42
View this session to learn how to design IOKit KEXTs for optimal performance. We discuss how to thread your driver to best fit with the Mac OS X kernel and make efficient use of memory. Learn how data is efficiently moved from user space to the kernel, and then to a device using I/O Kit.
Speakers: Craig Keithley, Godfrey van der Linden
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.
Good afternoon, and welcome to the I/O Kit session. I'm Craig Keithley. I'm the I/O Technology Evangelist in Apple's Worldwide Developer Relations Group. When we started about five minutes ago or ten minutes ago, I could have probably identified everybody in this room. It's hard to go up against Xcode.
One of the things that's toughest is properly architecting and writing I/O Kit kernel extensions to get optimum performance. You can improve on your techniques by doing multi-threading. It's a common, I almost want to say, misconception that you can't do threading in kernel extensions. You can, and to go into that, we'll bring up Godfrey.
My name's Godfrey van der Linden. I'm an I/O Kit architect, and like the vast majority of people who I expected to be here, I would probably prefer to be in Xcode right now myself. Another thing: the handouts suggest that I shall be talking about memory. Even though memory is interesting, most of this presentation will be about threading.
The new piece of hardware has got some interesting memory issues, and I will be available in the porting lab after this and on Wednesday if anybody wants to talk to me about how to set up memory maps on the new hardware. But there is no formal session on memory in this presentation.
Okay, so as an introduction, Mac OS X has something like 100 threads running on the system at any one time, even when it's idle. I mean, I was running top on my system today, and it had 140 threads going. That's what makes kernel programming so much fun. When you've got 140 threads operating inside that environment at one point in time, it can lead to some interesting mind-bending problems. That's just why I enjoy kernel programming a lot.
So in this session, I'm going to be discussing threading generally, how high-priority threads work inside the system, and also I'll be talking about I/O Kit threading. And then finally, I shall discuss synchronous teardown of device drivers when you get a hot unplug. It's not really threading, but if you don't know how to do it properly, it's very easy to get nasty explosions.
So what I'm hoping you'll learn is a better understanding of how threads schedule on Mac OS X and how I/O Kit does its synchronization. I/O Kit's synchronization model is very unusual. I've never seen it in any other operating system, and probably because it's my invention, I think it's very, very cool. And also Mac OS X's hot unplug. We took two attempts at it before we got it right, and I don't think it's really been presented before. I think it works very well, so this is its presentation.
Okay, so the first part of the presentation will be on threading, specifically how threads work inside the operating system. I won't really be talking about threads inside the kernel so much as how threads interact, how the scheduler interacts with them, and how the dispatcher works. Mac OS X doesn't differentiate between kernel threads and user threads.
Yes, we have different priority bands, but we don't really differentiate the way they operate. Over the next few slides I'll be discussing the thread priority bands, how the dispatcher works, what the scheduler does, what it means to take high priority, because it's different from what most programmers think, and also priority inversions, which is a very common problem we're seeing now.
So these are the priority bands. As you can see, there's quite a few of them, and there's a few there that aren't really threads at all. The primary interrupt and idle threads aren't really threads. But you can consider them thread contexts, at least to the extent of when a primary interrupt is running, no thread is running. And also, when the idle thread is running, by definition, no thread is running.
The bands that we would like you to be in is the regular user area, but most high-end hardware, and let's face it, that's what I/O Kit is all about, tends to have tighter requirements and likes to go for higher priorities. Now we're experiencing a lot of what I call priority arms races.
People are going for higher and higher priorities and it's really degrading the overall system performance. So I'm hoping in this presentation to convince you to get out of the real-time band, if at all possible, and down into the top of the user band, because I think that's probably the best place for most of us to be. Okay, so Mac OS X is a dispatcher-based system. There is a subtle difference between what a dispatcher is and what a scheduler is. We do have some sort of scheduling, but we're basically a dispatcher-based system.
What that means is that a dispatcher takes the current thread that's executing, blocks it, then selects the next thread and runs it. So what does blocking mean? Well, generally a thread in our operating system, as I said earlier, I had 140 threads running. But in fact, they were all blocked waiting for some user event, waiting for some I/O to complete, waiting for the lid to close or open, or for the battery to run dead, for all I know. And that's what we mean by blocked in our operating system. Those threads are put onto wait queues and we really just ignore them. So yes, we have 140 threads running, but they're really all asleep, which is the best way to have a thread in my opinion.
The next thing that can happen to a thread is a preemption. So what is a preemption? It's essentially you've used your quantum and the system says, "Okay, I'm going to put you on the end of your priority band's run queue." We'll discuss that in a second. And then it will select the next guy to run.
Every time the dispatcher is invoked, it selects the highest priority thread available and runs it. So that really has some very potentially nasty side effects. If you have an infinite loop at high priority, and that high priority thread runs to completion, then nothing in lower priority will run. So let's say you overload the real-time band. Remember earlier the real-time band is the highest priority band in the system.
If that overloads, then you're not going to get any time at all for the I/O band. And the I/O band is probably where you're trying to store your data onto a disk or take it off a disk. So you've just stopped the system from doing what you need to have done. And that's probably not what you're after.
So what does the scheduler do? Well, its job is sort of an oversight committee. We do have a scheduler. It will get better in the future. The scheduler we have right now is essentially for timeshare. When your thread has run for long enough, we will change your priority down a little.
That doesn't necessarily mean you're going to start running slower, or at least not straight away. If there's no other thread that's competing with you that's runnable, then you're going to continue going and it won't make any difference if your priority is down. However, if you are competing with another thread, the scheduler's job is to try to make sure that the system balances its loads appropriately.
The other thing is that aforementioned spinning real-time thread. We had this problem early on with the system. Spinning real-time threads will not give any time to the system, including to the keyboard, so that you can stop it. So one of the jobs of the scheduler is to say, "Hey, this real-time thread is taking far too much time," in which case it will change it over to timeshare and say, "Oh, by the way, you've been running for eight seconds." So its priority gets depressed very quickly, which is a good thing because it means that you can use kill -9 and get rid of the thing.
This came quite late, really. Early in the original development of Mac OS X, when we first got real-time threads, this was quite a common problem, because everybody has written infinite loops. And an infinite loop in a real-time thread meant taking the big hammer out and hitting reboot. And that's really, really painful when we didn't have journaling file systems.
Okay. So what does the scheduler do? Essentially the communication mechanism between the scheduler and the dispatcher is the run queues. Now, earlier I mentioned that the dispatcher finds the highest priority thread that's runnable in the system and then runs it. Well, that's the run queues. Logically, you can think of it as one run queue per priority in the system.
If a thread is runnable, it's at one location in the run queues, and the scheduler just manipulates the locations in the run queues. The scheduler also collects statistics, so that tools like top and latency and a number of others can find out what the system is really doing for and on your behalf.
So an example of what the scheduler does is the timeshare thread. As I mentioned, if your thread has run for sufficient quanta, we will drop your priority. Dropping your priority isn't, as I said, really a bad thing. It's only sometimes bad if you're using a lot of CPU power and you need that much CPU power, and another thread comes up, say the user launches another task, and then your thread will fall out.
Well, you know, the user did launch that other task. Perhaps he really does want that task to run. So let the timeshare do its job, except for when you are certain that the user really, really cannot afford to let any CPU go to the other guys, in which case you would use different things. And as I said, the infinite loop misbehaving real-time threads is another example of what the scheduler does.
So what does high priority really mean? There's nothing really that can make slow code go fast. If your code is slow, high priority will not make your code go faster. You will get slightly more CPU time, but it really is measured in percent, maybe 1 or 2 percent more CPU time. Higher priority won't give you faster code. The only way to get faster code, I'm afraid, is to run your code through a performance analysis and clean it up. It's very easy to write bad algorithms, unfortunately.
What high priority does give you is a reasonable chance of running with a very low latency. So your thread is blocked, a MIDI event comes in, for example, and Mac OS X will probably get your thread running in less than a millisecond on average. Our max jitter, I haven't run this for a while, but the last time I saw it, the max jitter for the real-time band with no competition was running at about 600 microseconds or something. Unfortunately, in the real world, there is always a little bit of other competition at that high band. So I think we're running our jitters at around 3 milliseconds. Again, I'm not sure exactly what those numbers are.
Of course, when you have high priority, then you can very easily end up using so much CPU time that you're not allowing the low-level parts of the system, like disk I/O, any time at all. It's a bit of a shame. We recently had a developer raise a problem: they were using so much high-priority time that the FireWire work loop, which is an I/O thread, wasn't getting sufficient time to even acknowledge packets on the bus.
And when that happens, you start getting weird little disk errors, and the system itself hasn't really got time to clean up, because you're using all of the... time at high priority. We call that a priority inversion, and priority inversions are really hard to get rid of. And this is really the biggest problem with arms races. If you're in a priority inversion, that's going to cause some problems, and it may take a very fundamental redesign to the way you've set up your workloads.
So how do you decide your thread priority? Really, it comes down to exactly what your latency requirements are. It's not what performance you're after, it's what your latency is. User interface events, for instance a keyboard for a MIDI sequencer or something like that, really does need very low latency because a human has said, "I will move my finger," and if the human system, the detection system as it were, can't hear it, the sensors don't hear it within a certain amount of time, then the keyboard feels wrong. And that time is very short. I mean, for computer time, it seems enormous. It's about five milliseconds. But five milliseconds isn't really very, very long on a modern operating system, especially because our standard quantum is 10 milliseconds.
If you really do need that extremely low latency, that's when you go for high priority. But you really want to be certain that you need a very low latency. If you're reacting to data off the Internet, well, frankly, who cares what the performance is. You're dealing with 30-second timeouts anyhow. Now, I'm not suggesting that you go timeshare. I don't think timeshare is appropriate if you're doing some sort of stream-based information processing. However, you probably don't need to be real time, because the Internet itself is arbitrary.
And finally, there's sort-of-low latency. That's when you're waiting for local results, something off the disk or a local FireWire device or something like that. It's low latency without being ultra-low latency. So that's how you would use your bands. For extremely low latency, I would suggest you use the time-constraint stuff, the real-time band. For "who cares," I would probably use the high user band, possibly below the Carbon async threads, but you can play with it a bit. And then for low-latency disk stuff, I would suggest that you go to the top of the user band and disable timeshare altogether.
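As a rough illustration of what those choices look like from user space, here is a hedged sketch of the Mach thread-policy calls involved. The calls themselves (thread_policy_set and friends) are real; every number in the example is made up, and you should derive your own from measurement:

```cpp
#include <stdint.h>
#include <pthread.h>
#include <mach/mach.h>
#include <mach/mach_time.h>
#include <mach/thread_policy.h>

// "Extremely low latency": the time-constraint (real-time) band.
static void becomeTimeConstraintThread()
{
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);
    double msToAbs = 1.0e6 * (double) tb.denom / (double) tb.numer;  // ms -> abs-time units

    thread_time_constraint_policy_data_t policy;
    policy.period      = (uint32_t) (10 * msToAbs); // illustrative: woken every 10 ms
    policy.computation = (uint32_t) ( 1 * msToAbs); // needs about 1 ms of CPU per period
    policy.constraint  = (uint32_t) ( 3 * msToAbs); // and must finish within 3 ms
    policy.preemptible = TRUE;

    thread_policy_set(pthread_mach_thread_np(pthread_self()),
                      THREAD_TIME_CONSTRAINT_POLICY,
                      (thread_policy_t) &policy,
                      THREAD_TIME_CONSTRAINT_POLICY_COUNT);
}

// "Low latency, but not ultra-low": stay in the user band but disable timesharing.
static void becomeFixedPriorityThread()
{
    thread_extended_policy_data_t ext;
    ext.timeshare = FALSE;                          // opt out of priority aging
    thread_policy_set(pthread_mach_thread_np(pthread_self()),
                      THREAD_EXTENDED_POLICY,
                      (thread_policy_t) &ext, THREAD_EXTENDED_POLICY_COUNT);

    thread_precedence_policy_data_t prec;
    prec.importance = 10;                           // illustrative bump within the user band
    thread_policy_set(pthread_mach_thread_np(pthread_self()),
                      THREAD_PRECEDENCE_POLICY,
                      (thread_policy_t) &prec, THREAD_PRECEDENCE_POLICY_COUNT);
}
```

The time-constraint numbers tell the scheduler how much CPU you need and how soon; ask for no more than you can justify with real measurements, or you're back in the arms race.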
These are all things that you can look up on the ADC website to find out how to do. Then there are priority inversions: I want the highest priority, except for when I don't. Priority inversions can really happen almost anywhere in the system, but the most common ones we're seeing are, again, in the real-time band. Mac OS X's real-time band is very, very good. It's extraordinarily powerful. But unfortunately, to get a really good low maximum jitter, we've had to give you enough power to hang the system, effectively.
And that means that your code now has to be far more complicated because you have to work out how to back out your high priority thread to give the rest of the system some time. Now traditionally on most operating systems, Mac OS 9 and Windows for instance, I/Os are really high priority.
There is nothing you can do to get them out of the way, and you would have to take whatever jitter is around. With Mac OS X we have deliberately chosen to make the real-time threads the highest priority threads in the system, even higher priority than I/O, which gives you extraordinarily good jitter characteristics, but it comes at the cost of complexity.
So there's a couple of priority inversion strategies. The best strategy of all is to get out of that high priority band. If you're experiencing priority inversions, drop your priority if you can, if the jitter is appropriate. Examine it. There are some wonderful tools in the system. My favorite is latency. Latency will show you a histogram of the scheduling latencies in the system. If you can evaluate and have really hard numbers for what performance you need, latency will let you know what priority band will work well on your target system.
So if you can, lower your priority. That's the best thing ever. If not, you're going to have to complicate your algorithm. You'll need to split into a producer-consumer model, where you have small amounts of work done at very, very high priority and larger amounts of work done at low priority. So for instance, if you're streaming off a disk, you would have a low-priority thread in your system. And yeah, I don't usually recommend having multiple threads, but this is the time to use them.
You'd have a lower priority thread in the system that's feeding a high priority thread, but at the cost of introducing some latencies. The high priority thread would just take whatever data it needs when it's available. And that way you get a producer-consumer. It's pretty good. It's complex. It works very well indeed.
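A minimal user-space sketch of that split, using pthreads. The names, the queue, and readNextBufferFromDisk are all illustrative stand-ins for whatever slow work your producer really does:

```cpp
#include <pthread.h>
#include <deque>

struct Buffer;                                  // hypothetical payload type
Buffer *readNextBufferFromDisk();               // hypothetical slow producer work

static pthread_mutex_t gLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gCond = PTHREAD_COND_INITIALIZER;
static std::deque<Buffer *> gReady;             // buffers filled and ready to consume

// Low-priority thread: do the slow, bursty work (disk reads) well ahead of time.
static void *producer(void *)
{
    for (;;) {
        Buffer *buf = readNextBufferFromDisk(); // may block for a long time; that's fine here
        pthread_mutex_lock(&gLock);
        gReady.push_back(buf);
        pthread_cond_signal(&gCond);
        pthread_mutex_unlock(&gLock);
    }
    return 0;
}

// High-priority (or time-constraint) thread: a tiny amount of work at very low latency.
static Buffer *takeReadyBuffer()
{
    pthread_mutex_lock(&gLock);
    while (gReady.empty())
        pthread_cond_wait(&gCond, &gLock);      // rare if the producer keeps far enough ahead
    Buffer *buf = gReady.front();
    gReady.pop_front();
    pthread_mutex_unlock(&gLock);
    return buf;
}
```

The cost is the read-ahead latency the speaker mentions; the benefit is that the high-priority thread never touches the disk and never needs to hog the CPU.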
The worst choice, and it's really bad because it doesn't give you 100% of the CPU, is to deliberately say, "I'm going to let the system have some time." So approximately every 10 milliseconds or thereabouts, or one buffer in every two buffers, or however it is that your workload is divided, go to sleep for a millisecond.
Because you sleep for a millisecond, you will guarantee the system, or at least some other threads, some time. I don't like that. The problem with this solution is that it's not going to give you a good result. If you're not competing with anything else, then that extra 10% or 8%, after the system has used it, is gone, and you can't use it.
And you're only doing that to save yourself the complexity of a good producer-consumer queue or lowering your priority in the first place. You see, if you lower your priority, you can use 100% of the CPU. There's a really nice anecdote: iTunes has done its ripping at a lowered timeshare priority since early on. I'm not sure if you've done it; I've recently done it with the AAC encoding. I was ripping my entire record collection over to 128-kbit AAC.
And the system was very, very responsive while it was going, and I never had any idle time on the system at all. It was just 0% idle. And that's because the ripping was at a very low priority. So look out for that: at lower-than-regular priority you actually get a lot of work done, because you can still use 100% of the CPU.
Anyhow, that's just the introduction to threading. There is a lot more to be said. I could talk for hours, but unfortunately I don't have them, so we'll have to move on. The next section is work loops. This is essentially how I/O Kit does its synchronization. I shall be discussing the work loop and the event sources in this part. If you're a traditional I/O Kit driver, this is the mechanism we're recommending, and it's really quite hard to avoid now.
Unfortunately, work loop itself is an unfortunate name. The way it was originally designed, we did have a thread that all I/Os went through, and we could guarantee single-threaded access to hardware because we only had one thread that talked to the hardware. But the difficulty is that the I/O systems were taking context switches, which were slowing down all I/Os. So we came up with this idea of the gate.
And the gate allows us to schedule I/O on hardware directly without having taken a context switch. And, you know, what's a gate? Well, a gate's a lock. It's a recursive lock. It's not really very complicated at all. And it's sort of obvious, but it took us a while to come up with it, and it made a big difference in our performance.
So what WorkLoop really is in our system now is it's a container for the gate, which is a recursive lock. It's a list of event sources that need to synchronize with respect to that lock. And, by the way, it has a thread. Yes, okay, it has a thread. In fact, the thread's optional.
One day in the future, I'm going to get rid of the thread and only create it if you have interrupt event sources. So the single threading is provided by the work loop's gate being closed across all event source action routines. I shall define what that term means in a little while.
The traditional Unix solution for MP is to have one big lock, one über lock that protects the whole operating system. So whenever you need to do anything, you would take the über lock and then you would be safe until the über lock gets dropped, and there would only be one lock. And naturally you get contention, and only one thread can run in the kernel at a time.
The other end is Mach. Mach has hundreds and hundreds of micro locks and an extraordinarily complicated locking hierarchy, so that you can make sure that you take locks in the right order. It's got lots and lots and lots of tiny little locks, which is great, but they're very heavy. It's also extraordinarily complicated. Locking hierarchies are nasty and they have to be done in one direction, which means for I/O systems, completion routines are painful.
So we needed to come up with something different. What we came up with is the work loop. We have one work loop, one gate, as it were, per major interrupt delivery part of the system. So a PCI SCSI card, for instance, has a work loop. A USB controller has a work loop. A Firewire controller has a work loop. So on a typical running system, we have maybe 13 work loops.
This is a compromise between the hundreds of micro locks that Mach uses and the two über locks that BSD uses. It turns out it's very, very powerful, because this allows us to deliver completion routines. All of our drivers stack on top of this one lock. So by far the majority of I/O Kit drivers, as I say, don't create their own work loop.
They use their provider's work loop. Now, if you've used I/O Kit for any length of time, you would have seen the client-provider model and the client-provider stacking. And you will see that this statement is recursive. If I call my provider and the provider also doesn't implement getWorkLoop, it calls its provider, and eventually you get down to the bottom of the system and it says, hey, here's the work loop, use this. So high-level drivers always synchronize against the bottom of the system.
As I mentioned earlier, only PCI devices and motherboard device drivers tend to create work loops. In most cases, your hardware will not need a work loop, and it's probably better if you don't create one. In fact, if you do create a work loop that builds on top of another work loop, you can be in for a whole world of hurt. I'm sure we'll have a RAID developer around here; if you want to see somebody who really experiences pain, discuss device teardown with a RAID developer.
So you can use the system's work loop. Because the statement is recursive, there has to be a way of terminating the recursion. There is a system work loop that you can grab hold of. Just by walking down the stack, eventually you hit the roots of the provider tree, and bingo, there's a work loop.
It's not a bad work loop to use, and we really do encourage you to use it, because we like to limit the number of threads in the system. This is a good thing for system performance. However, it's a shared resource, so don't be too greedy with it. If you expect a lot of interrupts or you have very tight timing requirements, it's probably better not to use the system work loop, but to create your own.
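As a sketch of what that looks like in code (class and member names here are made up for illustration, and the class declarations are assumed to live in the driver's own header):

```cpp
#include <IOKit/IOService.h>
#include <IOKit/IOWorkLoop.h>

// 1. Most drivers: borrow the provider's work loop in start().
bool com_example_Driver::start(IOService *provider)
{
    if (!super::start(provider))
        return false;

    fWorkLoop = getWorkLoop();          // recursion bottoms out at the controller below us
    if (!fWorkLoop)
        return false;

    // ...create event sources and addEventSource() them onto fWorkLoop here...
    return true;
}

// 2. A PCI or motherboard controller driver that really owns interrupt delivery creates
//    its own loop once in start() (with IOWorkLoop::workLoop()) and publishes it like this:
IOWorkLoop *com_example_PCIController::getWorkLoop() const
{
    return fControllerWorkLoop;
}
```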
So an event source. An event source has an action routine, which I'm now going to define. But essentially it's an action routine that's synchronous with respect to the work loop. All event sources have an action routine and an owner, and usually are registered on a work loop. In fact, an event source is really only meaningful when it's registered on a work loop, but of course people can temporarily register it and then remove it and register it and remove it, because it's a fairly lightweight operation registering an event source.
An action routine is just a call-out function. When you create an event source, you're saying to the system, "I expect this event to occur at some time in the future, and when it does, call this function." And that's what an action routine is. All action routines in the system are synchronous with respect to all registered event sources on a particular work loop. If you're familiar with Java, you may have seen Java's synchronized method concept, where you can have a number of routines in a class and you say, "These routines are synchronized with respect to each other; only run one of them at a time." That's how I think of event source actions. All of the event sources up and down the entire stack are synchronous with each other. Now that sounds as though it's a recipe for contention, but it hasn't proved to be so far. But there are some tricks there that you need to be aware of.
In general, don't go to sleep while you're in an action routine. Very bad things happen. Again, we recently found a driver which was going to sleep in an action routine for 8 milliseconds, and that introduced 8 milliseconds worth of latency. We do have ways of pointing fingers in the system so you won't get away with it for any length of time.
Oh, okay. And when you register an event source with a work loop, you generally just call IOService::getWorkLoop, and that's the mechanism that gives you the entry into the recursive statement I was describing. That's how you find the work loop. One of the things... actually, we'll cover that later.
Okay, so the first event source that most PCI hardware developers, I was about to say real hardware developers, which is a side of my background, unfortunately, think of is, okay, how do we get interrupts? Because it's one of the fundamental things that vary from OS to OS. Our filter interrupt event source is the mechanism we recommend for PCI hardware. The event source is used to deliver hardware interrupts to a driver.
It takes the interrupt, causes the work loop to schedule. This is the only thing that causes the work loop to schedule, in fact. So at primary interrupt time, it's very quick. It just comes along and increments a number, and it says, "Hey, work loop, you've got some work to do. Kick." And then it goes back to sleep again, which automatically gets back into the dispatcher that I mentioned earlier. The dispatcher says, "Hey, look, I'm looking for the highest priority thread in the system," and it's a work loop. The work loop starts scheduling.
So the latencies are very, very short, and the filters generally don't have to do any work at all. But we do recommend that you must always implement a filter because you don't know if your hardware is going to be in a shared chassis or not. And when you're sharing interrupt event sources, it's a very good idea for you, if your hardware supports it, to say, "Hey, this wasn't me." Just return false from the filter.
Now, the action routine is synchronous with respect to the work loop. You're going to see this statement a lot. But the filter is totally asynchronous. It's a primary interrupt. You have to do special things to synchronize with it, which is why I would recommend single-producer, single-consumer queuing or something of that nature with the filter routine. If you need to synchronize with the filter routine, you've got to be very careful.
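A minimal sketch of wiring that up, assuming a hypothetical com_example_PCIDriver whose class declaration (with these static member callbacks and a register read that can report whether the interrupt was ours) lives in its header:

```cpp
#include <IOKit/IOFilterInterruptEventSource.h>

bool com_example_PCIDriver::setUpInterrupt(IOService *provider, IOWorkLoop *wl)
{
    fInterruptSource = IOFilterInterruptEventSource::filterInterruptEventSource(
        this,
        &com_example_PCIDriver::interruptAction,
        &com_example_PCIDriver::interruptFilter,
        provider);
    if (!fInterruptSource)
        return false;
    return wl->addEventSource(fInterruptSource) == kIOReturnSuccess;
}

// Primary interrupt context: do as little as possible. On a shared interrupt line,
// returning false means "this wasn't me, don't schedule my work loop".
bool com_example_PCIDriver::interruptFilter(OSObject *owner, IOFilterInterruptEventSource *)
{
    com_example_PCIDriver *me = (com_example_PCIDriver *) owner;
    return me->hardwareInterruptPending();      // hypothetical register read
}

// Work loop context: the gate is closed, so this is synchronized with every other
// event source action on the same work loop.
void com_example_PCIDriver::interruptAction(OSObject *owner, IOInterruptEventSource *,
                                            int count)
{
    com_example_PCIDriver *me = (com_example_PCIDriver *) owner;
    me->serviceHardware(count);                 // hypothetical
}
```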
Okay, so now the other major event source is the timer event source. There's lots of reasons to use a timer. Poll mode drivers, which we don't recommend, but people are doing it, so that's one of the reasons for using it. But the most common one is hardware timeouts. Oh dear, nothing has responded in 30 seconds. I have to do something.
I/O Kit timers, the timer event source, are built on top of the kernel's thread_call APIs. They're wonderful APIs; I just love them. They're very, very lightweight and they're a great solution. There is a problem though. If you remember back to my earlier diagram, thread-call threads are very high priority. They're higher priority than work loops. Which means if your timeout and your interrupt occur at exactly the same time, the timeout will schedule first.
So, best thing, check to see if your hardware is done in the timeout code. And if it is, fine. You've beaten the interrupt before it got delivered. If not, timeout has occurred. Okay, here I have to make an embarrassing admission. This is my bug. It's been my bug for a long time now and I will fix it soon. There is no synchronous way of cancelling a timeout.
Really, it's just painful, it's embarrassing, I'm turning red up here. The safest way to delete a timer is to let the timer expire and then on another thread delete it. Don't rearm the timer. Sorry, I have to give you the warning because it is the big caveat with these things, but it's really a problem.
And I'm hoping to fix it, but I can't go backwards in time and fix it in Jaguar and Cheetah, so I'm afraid if your drivers have to run on Puma and Jaguar systems, then you are going to have to let the timer expire. And guess what? The timer's action routine is synchronous with respect to the I/O work loop. Same as usual.
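Here is a hedged sketch of a hardware timeout that respects that caveat; fTimerSource, hardwareIsDone and abortHardware are hypothetical members of the same illustrative driver class as above:

```cpp
#include <IOKit/IOTimerEventSource.h>

bool com_example_PCIDriver::setUpTimer(IOWorkLoop *wl)
{
    fTimerSource = IOTimerEventSource::timerEventSource(
        this, &com_example_PCIDriver::timeoutOccurred);     // static member function
    if (!fTimerSource)
        return false;
    return wl->addEventSource(fTimerSource) == kIOReturnSuccess;
}

void com_example_PCIDriver::startIO()
{
    // ...kick the hardware, then arm the timeout...
    fTimerSource->setTimeoutMS(30 * 1000);                  // 30-second hardware timeout
}

void com_example_PCIDriver::timeoutOccurred(OSObject *owner, IOTimerEventSource *)
{
    com_example_PCIDriver *me = (com_example_PCIDriver *) owner;
    // The thread-call thread runs at higher priority than the work loop, so the timeout
    // can win the race even though the interrupt has already fired: check the hardware.
    if (me->hardwareIsDone())
        return;                                             // lost the race; nothing to do
    me->abortHardware(kIOReturnTimeout);                    // a genuine timeout
}
```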
Okay, the command gate. The command gate is rather interesting. A lot of people think it's a lock. It isn't really. It's just a sort of container, a pointer to the lock that is in the work loop. Remember I said the I/O work loop should really be called a work gate? Well, the command gate gives you access to that gate.
So for all command gates on a particular work loop, there's still only one gate. Command gates allow you to run code synchronously with respect to the work loop, but without a thread switch. It just takes the gate, allows you to run some code, and then you drop the gate fairly quickly. Now, I admit that the runAction/runCommand API is clunky, especially if you're used to writing locks and just saying, "Hey, take the lock, drop the lock, take the lock, drop the lock." It turns out runAction has really come to our rescue several times.
First of all, debugging recursive locks where you mismatch the lock/unlock pair is really painful. So with run action, you can't get it wrong because it's a subroutine. It just says take the lock, call the subroutine, return the lock on the exit path. There is no avoiding it. So you can't get it wrong.
The other thing it gives you: show all stacks is a really wonderful command for tracking down deadlocks and other problems in a running system, and show all stacks will show up run actions. They will be there in the stacks. And we have caught so many deadlocks because of show all stacks.
Whereas if you just take a lock, you have to memorize everybody else's drivers, even ones you didn't write, and say, "Oh, look, this routine, 15 levels down in the stack, takes a lock, and I know that because, well, I can read minds." With runAction, you don't have to read minds. There it is. It's in the backtrace.
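A small sketch of runAction with a C-style static member action; updateStateGated and fState are illustrative names only:

```cpp
#include <IOKit/IOCommandGate.h>

// fCommandGate is an IOCommandGate created in start() and added to the work loop.
IOReturn com_example_PCIDriver::updateState(UInt32 newState)
{
    // Takes the work loop's gate, runs the action, and drops the gate on return --
    // no thread switch, and the lock/unlock pairing cannot be mismatched.
    return fCommandGate->runAction(&com_example_PCIDriver::updateStateGated,
                                   (void *) (unsigned long) newState);
}

IOReturn com_example_PCIDriver::updateStateGated(OSObject *owner, void *arg0,
                                                 void *, void *, void *)
{
    com_example_PCIDriver *me = (com_example_PCIDriver *) owner;
    // Gated: safe to touch state shared with the interrupt and timer actions.
    me->fState = (UInt32) (unsigned long) arg0;
    return kIOReturnSuccess;
}
```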
Okay, this is the really cool part about command gates: commandSleep and commandWakeup. It's another thing that came a bit late. When a client thread is calling into your driver, it often says, "Hey, I want some data." And your hardware hasn't got any data available yet, for whatever reason, like the device you're talking to is slow. So what you can do is block the client thread by calling commandSleep, and it will block until some event occurs.
Now, this is in fact the mechanism I was talking about that the dispatcher uses. This is how you block a thread until some event occurs. Now, there's lots of other ways of doing it, but this is the one that's built into the way that CommandGate does its job.
Data acquisition drivers are a typical case for this. We don't really have hardware direct call outs. One of the most common requests we got is, "Oh, we can't write our application because your interrupt routines don't call out into user land." Well, no, we're not going to call out into user land. We can't allow that thread to disappear into some code that we don't trust.
But commandSleep and commandWakeup give you something that is very, very close to that. If you have a sufficiently high priority thread blocked in commandSleep, then when you take your interrupt and your hardware turns up and says, "I have some data available," the scheduling that commandWakeup does to wake up the thread is just so fast, it's amazing. So you can use commandSleep to emulate interrupt call-outs to user land. Have the user provide a thread. It's your application; you provide the thread. Block it in your kernel extension using commandSleep. Very lightweight. Wake it up using commandWakeup.
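A sketch of that pattern, with a hypothetical frame queue: the client thread (for instance one your application supplies through a user client) enters readFrameGated under the gate and blocks in commandSleep until the interrupt action queues data and wakes it. Frame, fFrameQueue and fOffline are illustrative names:

```cpp
#include <IOKit/IOCommandGate.h>

IOReturn com_example_PCIDriver::readFrame(Frame *outFrame)
{
    return fCommandGate->runAction(&com_example_PCIDriver::readFrameGated, outFrame);
}

IOReturn com_example_PCIDriver::readFrameGated(OSObject *owner, void *arg0,
                                               void *, void *, void *)
{
    com_example_PCIDriver *me = (com_example_PCIDriver *) owner;

    while (!me->fOffline && me->fFrameQueue.isEmpty()) {
        // Blocks the calling thread and opens the gate while it sleeps.
        IOReturn ret = me->fCommandGate->commandSleep(&me->fFrameQueue);
        if (ret != THREAD_AWAKENED)
            return kIOReturnAborted;
    }
    if (me->fOffline)
        return kIOReturnOffline;

    me->fFrameQueue.dequeueInto((Frame *) arg0);    // hypothetical
    return kIOReturnSuccess;
}

// ...and in the interrupt action (already on the work loop, gate closed):
//     fFrameQueue.enqueueFromHardware();           // hypothetical
//     fCommandGate->commandWakeup(&fFrameQueue);   // the sleeper schedules almost at once
```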
So that's it for I/O work loops. What we're about to do is look at how we use this stacking model of work loops to synchronously tear down device drivers. Oops, sorry, off by one slide. Okay, so remember I was mentioning BSD locking. BSD currently does its locking with funnels.
Mostly it doesn't affect I/O Kit developers. However, kernel extension developers generally must be aware of the funnels. There are two funnels in the system. On a dual-CPU machine you can issue an I/O under the network funnel on one processor and under the system funnel on the other processor; it's our compromise on the traditional BSD über lock. Funnels are good, but they're really not locks, and this is not the right forum to discuss funnels. Writing funnel code that can switch between the system funnel and the networking funnel is difficult.
I can't say impossible because NFS works, but it's bloody close to impossible. Funnels can cause long delays on work loops though, so you do have to be aware of that. If you've got a piece of hardware that's delivering into BSD, be it the serial ports, the disk drive system, or the networking system, you must be aware that those completion routines will probably try to take a funnel. Those funnels are going to cause some sort of latencies because there's only two of them.
Okay, now we can do synchronous device teardown. So, oh no, my device is gone. This can cause nightmares. Tearing down a stack is just so hard, and this animation, I'm hoping, will demonstrate what's going on. As you can see here, I'm just trying to emulate the stacking that we have in our system. On the far left is where your bus is. Let's call it a USB bus, and on the right you have the client thread running.
So the first step is we got a teardown. The bus detected that the device is gone, and this is how we implemented it first, and it didn't work real well. We disappeared the device. But at the same time, we're on an MP system. A client thread has just come down, and it's issued an I/O request.
It's a bit of a problem because they're going to meet eventually, and when they do, you get a panic, and very bad things happen when that happens. Blue screen of death, whatever. Panics are really hard to debug, and this particular one is nasty because everything looks perfectly all right, but your HUD has crashed, and it's not really obvious.
So, how do we deal with this? Well, we do it synchronously. I guess that's obvious. Our solution is to use the work loop stacking. This is why drivers really can't opt out of the work loop system, not if they want to do dynamic unloading. And most of our developers like the idea that they can unload their drivers. So it means you have to be at least partially aware of work loops to do unloading.
What we do is when we get an unload, we will tell the nub that has disappeared to terminate. And the terminate does a few things, like it goes recursively up the stack marking everybody as inactive. It does that through requestTerminate. But basically it calls a function called doTerminate.
And doTerminate is a recursive function, as you can see, as I've implemented here in pseudocode. It essentially does head-first recursion with the willTerminate calls and tail recursion on doTerminate. You can rely on willTerminate messages turning up in your driver before any of your clients get willTerminate. And you can rely on didTerminate arriving after all of your clients have got their notifications.
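The slide's pseudocode isn't reproduced in this transcript, so here is a rough sketch of the shape being described, not the real IOService implementation:

```cpp
#include <IOKit/IOService.h>

// Sketch only: willTerminate goes head-first (the victim before its clients),
// didTerminate goes tail-first (only after every client has been told).
static void doTerminateSketch(IOService *victim, IOService *provider, IOOptionBits options)
{
    victim->willTerminate(provider, options);

    OSIterator *iter = victim->getClientIterator();
    if (iter) {
        while (OSObject *obj = iter->getNextObject()) {
            IOService *client = OSDynamicCast(IOService, obj);
            if (client)
                doTerminateSketch(client, victim, options);  // everybody stacked on top of us
        }
        iter->release();
    }

    bool defer = false;
    victim->didTerminate(provider, options, &defer);
}
```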
So your responsibility in willTerminate sort of depends on where you are. If you're an intermediate driver and you have a series of commands that you know are outstanding and they're in your own queues, you haven't handed them off to the next driver down, then it's your responsibility to return those I/O requests with errors immediately.
If you have client threads blocked in your driver in commandSleep, you should return those immediately with an error as well. Wake them up and notify them that they're going to wake up with an offline error. The error we use is kIOReturnOffline, and by the way, if you're higher in the stack and you start seeing offline errors coming by, you know what's happening now: somebody's got a willTerminate, and you can expect a willTerminate fairly soon.
By the time you get to the top of the driver stack, it should be expected that all outstanding I/Os and blocked threads, ideally, have been returned. So that makes the top-of-stack driver's job much easier. Notice we haven't torn anything down yet. All of our pointers are valid. One other thing with willTerminate is that you should be returning errors, if possible given the API, immediately. If any other I/O commands come down while you're doing this, you should be returning errors after you see willTerminate.
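As an illustration, a sketch of an intermediate driver's willTerminate. The pending queue, the completion helper, the sleep event and the fOffline flag are all hypothetical names; the point is the ordering: fail queued work, wake sleepers, then let the superclass run:

```cpp
// #define super IOService   -- the usual I/O Kit superclass convention, assumed here
bool com_example_Driver::willTerminate(IOService *provider, IOOptionBits options)
{
    fOffline = true;                            // from now on, fail new requests immediately

    // Return every request still sitting in our own queues with an offline error.
    while (PendingRequest *req = dequeuePending())
        completeRequest(req, kIOReturnOffline);

    // Wake any client threads blocked in commandSleep inside this driver; when they
    // wake they will see fOffline and report kIOReturnOffline to their callers.
    if (fCommandGate)
        fCommandGate->commandWakeup(&fDataAvailable);

    return super::willTerminate(provider, options);
}
```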
Okay, if the driver is on the top of the stack, you're expected to implement didTerminate. Now, top of the stack varies. You are top of the stack because there is nobody on top of you, which means when you're tearing down, eventually you're going to be top of the stack again.
Now, in didTerminate, you must stop all future calls down to your provider. You must wait, asynchronously, and that's a bit subtle, for all provider calls to return. So, if you have client threads that have gone through you, then you should be aware of those threads, and you should not call close on your provider until all of those calls have come back. Now, unfortunately, you have to do that asynchronously.
You have to return from didTerminate. Your primary responsibility, though, is to close your provider as soon as you reasonably can. As soon as you know, synchronously, that no client threads will get through you and no client threads have already gone through you, then you can call close on your provider, but not before.
If you cannot make that determination, because you have to wait for some threads to return, then you must return from didTerminate immediately anyhow. It's a bit subtle: what then happens is, when the client thread does return, you can call close on your provider. So you return, and later you take the command gate and call close.
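And a sketch of that top-of-stack didTerminate logic, with a hypothetical outstanding-I/O counter; in a real driver that bookkeeping (and the deferred close) would be done under the command gate:

```cpp
// #define super IOService   -- again, the usual superclass convention, assumed here
bool com_example_TopDriver::didTerminate(IOService *provider, IOOptionBits options,
                                         bool *defer)
{
    fOffline = true;                       // no new calls will be issued to the provider

    if (fOutstandingIOs == 0) {
        provider->close(this);             // nothing in flight below us: safe to close now
        *defer = false;
    } else {
        *defer = true;                     // wait asynchronously; the last returning call
    }                                      // closes the provider (see below)

    return super::didTerminate(provider, options, defer);
}

// ...and on the path where an outstanding call returns from the provider:
//     if (fOffline && (--fOutstandingIOs == 0))
//         getProvider()->close(this);
```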
It's really tricky to implement well. In general, you don't have to worry about it. You can make certain assumptions if you're an intermediate driver. The only drivers that really have to be aware of this are top-of-stack drivers, and we write those, usually. Apple writes those. We've got the user clients for USB and FireWire. We have the media BSD client. And I would like to say the serial BSD client, which we own, but it's broken. I own that one as well.
And that's about it really. In conclusion, I guess threading comes down to please lower your priority. We don't want an arms race and the system will work a whole lot better if you use a lower priority. The other thing that was interesting is work loops. Work loops are way cool. They integrate well with the system. You can't get deadlocks if you're on a work loop unless you're a RAID driver.
And if you are a RAID driver, heaven help you. That's what Darwin is for, I guess. And finally, synchronous teardown, please implement it properly. Will terminate, did terminate. And by the way, synchronous teardown applies even to PCI devices. I mean, you could be PC card. But also, whenever you do a KEXT unload, you're essentially going through device teardown.
So, further things that might be interesting. We have an open source presentation where we'll be discussing how xnu works, among other things. That's coming up tomorrow. We have kernel programming interfaces on Wednesday. And we have writing threaded applications on Mac OS X. Writing threaded applications isn't a direct hit on what we're trying to do, it's very, very high level, but it should be interesting. And also there's a series of hardware talks coming up tomorrow: Bluetooth, USB, FireWire, and some feedback forums. Who to contact is Craig Keithley, and I think I'll hand over to him.