WWDC05 • Session 505

Developing Reliable and High-Performance Drivers for Mac OS X Tiger

OS Foundations • 1:02:42

Learn I/O Kit driver best practices and how to handle common challenges in this in-depth session on working with I/O Kit. In addition, Apple's engineers will demonstrate the new diagnostic tools and logging support available in Tiger.

Speaker: Dean Reece

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper and has known transcription errors. We are working on an improved version.

Good afternoon. There we go. My name is Dean Reece. I'm the manager of the I/O Kit team. Some of you have probably seen me do a similar presentation in the past, some maybe a first year here at WWDC. I'd like to have a quick show of hands. How many here have written a kernel extension or I/O Kit driver for Mac OS X already or has one in process? Okay. How many people have not written one and are wanting to learn a little bit about how to write them? All right, good. We have a mix of some veterans and some new folks. Hopefully I've got some good information for you. And let's go ahead and get started.

So sort of the top level outline of what I'm going to be talking about. I've broken the slides down into four major groups. The structure for today's talk is really, I think of it kind of like a travel journal. Over the last year, year and a half, I've reviewed a number of kernel extensions both inside Apple and third party extensions. And I've been sort of collecting issues that I've seen. I've also been listening to questions.

So this is really just a lot of these questions and issues put together. Hopefully this will give you some tips as to how you can improve your own kernel extensions. And again, the categories are kind of broad. So if I ramble a little bit, I apologize. But again, hopefully it will be useful for you.

So first we'll talk a little bit about binary compatibility. Obviously with the Intel announcement on Monday, this is something that will be interesting. I've got a little bit to talk about regarding Intel. I don't have any slides on it, but I do have some things I want to tell you. Then I'm going to talk a little bit about kernel extension debugging. We have some new stuff there, and I'll also talk a little bit about what's already available.

And I spend a little bit of time talking about performance. Hopefully you'll be able to take some of this back and look at your kernel extension and maybe be able to make it a little bit tighter. And also reliability. There's a few issues we've seen and there's also some techniques you can use to make your kernel extension more reliable.

So this is the obligatory "You are here" slide, just to give you an idea of sort of the layer of technology that we'll be talking about today. If you see here, the large purple box represents the entire kernel, and of course it hosts a number of libraries and frameworks that applications link against.

But within the kernel, where I'm really going to be talking, it's divided into three major plug-in spaces. We have file systems and networking, which fit within the BSD universe, and then we have device drivers and families, which fit within I/O Kit. And of course, underpinnings for all of that include the Mach kernel and libkern, which I think is the closest analog to libc that we have. For those of you new to kernel development on Mac OS X, the only framework that you can use for developing kernel extensions is the kernel framework. Any other libraries or headers you get anywhere else are not suitable for kernel extension development.

So we'll start talking about binary compatibility a little bit. The good news is, KEXT management hasn't really changed in a significant way for Tiger. And in fact, it isn't changing in a significant way for Intel either. We have spent some time improving the KEXT cache and validation logic. Some of you may have encountered bugs where you would install a kernel extension and the caching mechanism didn't notice it.

The publicly described way of making sure the caching mechanism updates correctly is to touch /System/Library/Extensions. The modified date on that folder is all the cache system needs in order to know that it needs to rebuild its caches. But I've also seen a number of people say, "Well, if you go and remove the caches, everything will be fine." And in the past, there have been releases where that was a necessary workaround.

But in fact, the caching system has changed in some way for every major release of the OS.

And so removing the caches yourself may be an incomplete solution. I have seen people ship products that had incomplete cache removal, which led to some interesting problems, because it worked fine on Panther and then didn't work somewhere else. So touching /System/Library/Extensions is the approved way of letting the system know you've installed or modified a kext. And if you find that that's not invalidating the caches correctly, file a bug report. It always, in every circumstance, needs to work.

Now, of course, you normally need to restart the system after you've installed a kernel extension, but as of 10.3, and of course in Tiger as well, you can actually signal kextd and have it rescan the Extensions folder. And what'll happen is, if the driver that you installed or modified can be loaded against the hardware, it will be.

So this is an opportunity for you to have a better user experience if you have a kernel extension being installed for the first time, or the user has the ability to say unplug a USB device and plug it back in where the driver would be unloaded and reloaded, then the system can basically immediately make use of your updated driver.

If you have a driver for say a PCI card that you're rooted off of and it can't be unloaded, then you are going to have to reboot to get your kernel extension in. So there's a Q&A note there, QA 1319, that you ought to have a look at if you're interested in this. It will give you the details of how to go about doing it.

Now I'm sure you've heard about KPIs. How does this affect you? Well, one of the issues that we had with the kernel is we didn't really have formal interfaces. We had a number of symbols you could link against, a lot of things you could call, data that you could query, and it was a bit of a maintenance nightmare for us because every time we changed something that was necessary to improve the system or bring out a new feature, we would invariably break some number of kernel extensions. Now I/O Kit was designed from the ground up. It was a brand new thing that Apple invented, and so those interfaces were fairly clean and we've been able to stand by those fairly well. With the BSD interfaces, however, there's a lot of legacy code there.

We've obviously upgraded it tremendously at Apple, and we needed to come up with a nice clean interface for you, and the KPIs are the way to do this. So within I/O Kit and Libkern, the functionality is pretty much the same, or in Libkern's case, it's actually expanded a good bit. So if you were limiting yourself to I/O Kit and Libkern before, you'll probably be absolutely fine in Tiger and going forward.

Within BSD and Mach, however, we've actually eliminated a large number of symbols because they represented functions we couldn't maintain going forward. So I'm only going to talk a little bit about KPIs, but the main thing as far as rules of engagement for KPIs: the two namespaces, the pre-KPI namespace, which we call the compatibility namespace, and the KPI namespace, are not mixable.

So your kernel extension either needs to link entirely with the older compatibility symbols, that would be versions prior to version 8.0, or entirely with the new KPI symbols, version 8.0 and later.

I should point out that there's a lab as well that will help you port file systems specifically, but it will also help you explore the new KPIs if you haven't yet. So that's tomorrow at 3:30. Now, of course, this brings up one very obvious question: if I have to support multiple releases of Mac OS X, how can I do that? Can I do it with one kernel extension? And the answer is, of course, it depends. So we'll talk about that.

So with KPIs, how do I release one kext across multiple OS releases? If you can use the compatibility libraries, version 7.9.9 and prior, then that one kernel extension should be able to work on both Tiger and previous OS releases. Obviously, you'll have to try it and see.

But if it does fail to load, the most likely reason that it will fail to load is an undefined symbol. If you use kextload with one of the verbose options, it'll tell you exactly what symbols. And probably what's happened is we've made that symbol unavailable in Tiger through the KPIs, in which case you will need to port your driver.

[Transcript missing]

So again, just to kind of graphically illustrate the nesting here, these are two separate kernel extensions. As you can see the dependencies here: the red one is on 8.0 and the blue one is on 7.0. You can nest the older driver inside the newer driver, or the other way around if you prefer, by simply creating a PlugIns folder. There are ample examples of this on Mac OS X itself. You can certainly examine the Extensions folder and see how this is done.

And to talk a little bit about what we mean when we say binary compatibility, this question's come up quite a bit. I know I've addressed it in mail lists and so on, but I'll state it here just for the record. So we consider backward compatibility the ability to load an existing kernel extension on newer OS releases, right? So as we roll along and come out with new software updates or new major releases, ideally your kernel extension continues to load and be useful to the user. And we certainly put a lot of effort in not breaking you.

We call forward compatibility the opposite of that: the ability to load a new driver on an older release of the operating system. I/O Kit does not provide forward compatibility. We have some active binary compatibility mechanisms, specifically vtable related, but the classes that you're loading in your kext need to be less than or equal to what's already present in the kernel.

So we don't support forward compatibility with kernel extensions. We also don't guarantee source compatibility. This is very important. Again, it's a question that's come up before. We go to some great lengths to avoid breaking you. We really understand that it's painful to bring your source forward if we break you. And in fact, we have to bring all of our own drivers forward as well if we change something that's going to break them. That's a lot of effort for us. So we only do it if there's a good reason.

One of those reasons would be changing the compiler. This does happen, and has in fact happened with Tiger. So what compiler should you use for kernel extensions? Well, for the Intel development systems that we announced yesterday, GCC 4 is the compiler to use. That's, I believe, the only compiler we're going to be supporting for development on that platform.

But for PowerPC, you can of course use GCC 3.3 as well. I'd recommend using GCC 4 if you can get your sources to build. There are a few subtle changes that you'll have to make. They're typically not very hard. Also, a GCC 4-built kext should be able to load on older releases all the way back to 10.2. So we don't expect to have problems there.

Now, if any of you are still compiling with GCC 2.95, please be warned: we're going to stop supporting GCC 2.95-built drivers at some major release in the near future. It's basically provided now only to support kernel extensions that were compiled in the 10.1 or 10.0 timeframe. There are some extra steps we have to go through to modify the binary at load time, and that particular system, the remangler, will be going away. So please move forward to GCC 3.3 or, ideally, 4.0.

As I said before regarding forward compatibility, what you need to use is the oldest SDK that you plan to make your driver available for. So if you want to be compatible back to 10.2.8, please use the 10.2.8 SDK. Another very important point for binary compatibility, and this one's subtle and has bitten a number of people, is the OSDeclare and OSDefine macros (OSDeclareDefaultStructors and OSDefineMetaClassAndStructors). These are very important. This is part of our binary compatibility system. The vtable patcher looks at the information laid down by these macros, and that's what it uses to patch up your vtable and make it work with the running system.

If you don't put these in correctly for every class you define, at least the ones that inherit from OSObject, then at some point in the future your kernel extension will fail to load. The vtable patcher won't have any work to do if you're building and running on the same release, but if you're building on an old release and running on a new release, it does have work to do.

So in this regard, the IOService class did change in Tiger. So if you have built a kernel extension on Panther and you try to load it on Tiger and you get this "your class is not compatible with its superclass" message at kextload time, then you're probably being bitten by this. You need to make sure you've implemented these macros correctly.
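To make the point concrete, here's a minimal sketch (with a hypothetical class name) of where those macros go; OSDeclareDefaultStructors and OSDefineMetaClassAndStructors are the libkern macros that lay down the metaclass information the vtable patcher consumes:

    #include <IOKit/IOService.h>

    // Hypothetical driver class using the recommended reverse-DNS naming.
    class com_example_driver_MyDriver : public IOService
    {
        // Declares the metaclass plus default constructors/destructors.
        OSDeclareDefaultStructors(com_example_driver_MyDriver)

    public:
        virtual bool start(IOService *provider);
    };

    // In the implementation file: defines the metaclass, naming the
    // superclass so the vtable patcher knows what to patch against.
    #define super IOService
    OSDefineMetaClassAndStructors(com_example_driver_MyDriver, IOService)

    bool com_example_driver_MyDriver::start(IOService *provider)
    {
        return super::start(provider);
    }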

So moving to GCC 4.0, as I said, is recommended. It is a little bit more strict with C++ code. A few things that were warnings in the past are now fatal. There's some new warnings. So it may take you a little bit of work to do, but it's not too bad. One common problem, though, is casting member functions to C function pointers. And we use this in I/O Kit a lot for callbacks, for our target action parameter callbacks.

So we've introduced in Tiger a new API called OSMemberFunctionCast. And basically, this is something that's there to allow you to do the casting without having to worry about the compiler details. So this eliminates the otherwise unavoidable build failure in GCC 4.0. As a caveat, I will say you will see one warning. With the set of headers that we shipped in Tiger, there will be one warning generated.

Don't worry about it. It's actually OK. And if you really want to get rid of every warning in your code, send out email on the Darwin Dev list, and we'll tell you the minor tweak to get rid of that. But it's harmless. Also, it allows you to get rid of the -fpermissive flag. If the only reason you had to have that flag before was the casting, you can finally get rid of it.
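As a rough sketch of the kind of target/action callback registration being described, assuming a hypothetical driver class with a timerFired member function, the cast looks like this:

    #include <IOKit/IOTimerEventSource.h>

    // Somewhere in the hypothetical com_example_driver_MyDriver::start():
    // a C-style cast of &timerFired to a C function pointer is what
    // GCC 4.0 rejects; OSMemberFunctionCast does the conversion portably.
    IOTimerEventSource *timer = IOTimerEventSource::timerEventSource(
        this,
        OSMemberFunctionCast(IOTimerEventSource::Action, this,
                             &com_example_driver_MyDriver::timerFired));
    if (timer) {
        getWorkLoop()->addEventSource(timer);
    }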

[Transcript missing]

All right, let's move on to kernel extension debugging. So a lot of people continue to do kernel development on a single machine. It's not what we recommend. We are aware, of course, that people do this, and we've actually added some stuff in Tiger to hopefully make it easier.

Just sort of to sell you away from single machine debugging: it's definitely a slower cycle time if you're developing on a machine where you load the kext and have to reboot. I would very strongly urge you, whether you're doing single or dual machine debugging, as soon as you get your driver up and kind of limping along, make it a priority to make it unload cleanly.

And at every stage of your development, make sure it unloads cleanly because that's a good sign that you're not leaking things. Basically, if you can't unload your kernel extension and you don't know why, it probably means you've got some references outstanding that you really ought to be cleaning up.

But certainly on single machine debugging, that's a real win, being able to load and unload. Now, of course, you won't always get the opportunity to unload because it will panic some. And when that happens, of course, you're risking data corruption. We try very hard, of course, on a panic to not lose your data, but if you have uncommitted data in a write buffer somewhere, it will not get to disk.

So, if you're doing development on the same system that you're testing on, back up your data. Just, you know, good common sense there. But also, as you're probably well aware, we don't have any local kernel debugger support. You can't debug the kernel on the machine that you're running it on.

If we can save the questions for a Q&A at the end, I'll be happy to address that. So a little more on single machine debugging. A lot of you have used I/O Log for debugging. It is very useful to kind of figure out the flow of code and, you know, is this function getting called and if so, what's the parameter? It's certainly a first line of defense for any kind of effort to try to corner a problem.

But keep in mind, it really is intended as a logging service. It's not I/O debug, it's I/O log. And it's intended to be very, very low cost to the system. So that's why there's only a 4K buffer and syslogd only runs periodically to pull data out of it and store it off to the file. So the reason that you can overflow it pretty easily is that's by design.

It's a low bandwidth logging service. And it's also not synchronous. It's not, you know, it's within a few seconds, but it's certainly not synchronous. Now if you really want to use it synchronously, you can. You can make it synchronous with a boot-arg, which you can set with the nvram command.

The boot-arg is io=0x00200000. And actually, you can look in IOKit/IOKitDebug.h, and there's a whole list of bits in there for subsystems that will log additional information if you care to see it. But this particular bit, 0x00200000, will turn IOLogs, and I believe also printfs, into synchronous writes to the console.

That's very important. Writes to the console means the actual screen that you're using. So this is the, you know, white text on the black background that we used to see with panic messages. This is back in the day. Now if you want to get to the console, there's a bit that you can set in the boot arg again to force that all the time. But it's very simple.

At the login prompt, you just type ">console" for the user name, and that will exit the window server and put you into console mode. And then you can do whatever you need to do to trigger the problem. And you'll see synchronous IOLogs coming out on the console, scrolling.

We also added two new APIs. This has been requested on and off. The primary one here is called OSReportWithBacktrace. What this does is give you a backtrace of the four prior stack frames, that is, prior to calling it, and it allows you to see how you got into a particular function. Let's say your function keeps getting called with some bogus value and you want to understand what the call graph was that got you there. This will just log those four stack frames as a normal IOLog, but it returns, unlike panic.

So if you need more stack frames deep, or if you want to do some more complicated logic, you know, if you're trying to implement a logic analyzer in your kernel extension, you can use OSBacktrace below. That's really the guts of OSReportWithBacktrace. And this simply gives you an array of pointers, an array of stack frames. So you can pass in as deep as you want. You could get 20 frames, and if they're there, it'll give you the address of all of them. So you could use that to do statistical analysis on your call graph. You could do all sorts of things.

But the top one there is great, particularly if you are single machine debugging, because you don't have to take a panic to cause something interesting to get logged out. Now again, these functions have no place in commercial code. Make sure that they get #ifdef'ed out, compiled out, in production code.
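Here's a sketch of how these two calls might be used during bring-up; the driver method is hypothetical, and the DEBUG guard is exactly the kind of compile-out being asked for:

    #include <IOKit/IOLib.h>       // IOLog
    #include <libkern/OSDebug.h>   // OSReportWithBacktrace, OSBacktrace

    void com_example_driver_MyDriver::handleValue(int value)
    {
    #if DEBUG
        if (value < 0) {
            // Logs the message plus the prior stack frames, then returns
            // (unlike panic).
            OSReportWithBacktrace("handleValue: bogus value %d\n", value);

            // For deeper or custom analysis, collect frames yourself.
            void *frames[20];
            unsigned count = OSBacktrace(frames, 20);
            for (unsigned i = 0; i < count; i++) {
                IOLog("frame %u: %p\n", i, frames[i]);
            }
        }
    #endif
        // ... normal processing ...
    }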

[Transcript missing]

Now you need to start by, and this is true really for any kind of a GDB work that you're going to be doing on Mac OS X in the kernel, you need to download the appropriate kernel debug kit from the Apple Developer Connection website. We get these out as quickly as we can after each new release or software update that has a new kernel. And it has what we call a symboled kernel.

For those of you that don't know, it's sort of a byproduct of the build process. It's the same binary that's on the system, but it has a tremendous amount of debugging information in it, the line numbers and everything. The kernels are huge, but again, it's loaded with information.

If you debug using the kernel that's on your system already, what you're going to see is addresses relative to the nearest global function. So you're going to see some global function like bcopy, and the offset you see might be, I don't know, several hundred k off of that because there are several intervening static functions that got stripped out, and those symbols are not in the shipping kernel. But the symboled kernel has all those, so it'll give you, tell you exactly what function you're dying in.

Now, of course, you'll also have to build symboled versions of any kexts that you've built, right? For any kernel extension that you're loading yourself, there are a variety of ways to do this. You can have the kext loader drop the symbols at the time that it loads them. If you're debugging a panic after the fact, you can generate them given the load address. But again, the tech note gives you the details on that.

Once you have those pieces and you run GDB on it, you can start to examine things. x/x means basically examine memory as an integer, and x/i means examine it as an instruction: you give it the address and it shows you the instruction as well as what function it's in. Remember though, look at PC and LR; the program counter and the link register are the most recent things that would have been executed.

But for all of the addresses you're getting from the stack crawl, remember that's the return address. That's the address that the processor would execute next when it returned. So remember to subtract four; that's the branching-off point that got you there. And then, because we're talking about I/O Kit, we're also talking about C++, so you're going to be getting symbols that look like this, the underbar-underbar-Z symbols. Those are the mangled symbols. They basically contain encoded information about arguments and so on. So if you run them through c++filt, it can convert them into a nicely readable form. So that's a handy thing to do.

Now for two-machine debugging, this is really the way to fly. If you can at all do this, I strongly recommend it. We've added some capabilities in Tiger recently, we'll talk about that. But obviously you have an expendable target machine, your development machine is safe and secure. I've known some people that set up an environment where they actually NFS export the root volume from their build machine, and so their target machine isn't even running on a local disk, it's actually rooted off of that NFS.

So it has no opportunity to corrupt anything when it crashes, and it makes it easy to copy kernel extensions up. So what'll happen is, as you move into a two-machine environment, you'll develop some scripts and some techniques for moving the kexts around, because it does involve rcp'ing or moving kernel extensions over in your favorite way, but it really is the way to fly. Now there are three ways to attach to a target machine.

Ethernet is the tried and true way that we've been talking about for years. I'm not going to spend long on that. But we've added FireWire debugging. If you look into Developer Extras, kernel debugging, you'll find what you need; you have to install, of course, the software developer kit. But this allows you to connect to the target machine via the FireWire cable.

Now you have to install a piece both on the target machine and on your development machine, and you have to set up the link prior to the panic. So this doesn't save you if the machine is panicked and you're trying to get the data off of it.

Also with two-machine debugging we have remote kprintf. You can connect over serial if you have a machine with a serial port. Not many machines available like that today, but you could, maybe with a G4 with a stealth card. FireWire is really the way to do it nowadays.

Coming soon, I believe in the very next FireWire SDK, there'll be the pieces that you need in order to set this up. And you can have the target machine's kprintf output appear in a viewer on the development machine. If you attended the FireWire session previously, they would have talked a little bit more about this. But I believe that's coming in FireWire SDK version 20.

That's similar to local logging, except that it has some advantages. One of the big ones is you can save your session. If you've logged 50 things, you can cut and paste it into a bug report, for example, or you can do searching on it, which you can't do with a local console. Just yet another reason to go two-machine here. It's also a good bit faster. If you're going over FireWire, it's a pretty fast bus.

Now, to talk a little more about kernel core dumps, this is a network connected core dump. And the idea here is you can set up one collection server system for pretty much as many machines as you want. It's pretty common to have one departmental server. If you're a small company, you could have one for the whole company.

And you configure each of the client machines every time they panic to dump core back to that machine. Now it does take a little bit of work to set up. It's not tremendous. There's a tech note here, tech note 2118, that will walk you through all the steps. But once you've got it set up, it's a very low investment in time to maintain. And think about it. You collect every panic that you have.

Every panic that occurs on your network, you now have a full core image of. So if you're looking for those hard to reproduce panics and you hate it because you missed it and you couldn't attach to the machine, well, this is a way to collect those. And you can look at them at your leisure. The core files are obviously fairly large. But at least you can collect them.

And you can move them around as you need. Now of course, you're not talking to a real live running machine. You're talking to the memory image. So you're not setting breakpoints and continuing or single stepping. But still, pretty much everything is there. So you can look at the stacks, for example.

All right, let's move our topic here to performance. Performance is kind of hard to pin down. It means different things to different people. So throughput is a classic measure: how much data can you cram through the system? Latency, now in this day and age with AV being so important, is maybe a little bit more interesting. And jitter is also tossed around. Latency is basically, from the first bit of data in to the last bit of data out, how quickly can you move a transaction through the system? And obviously with AV, it's important.

Jitter is really how constant that time is. If you can get data through your system in a millisecond, is it a millisecond plus or minus a millisecond? Or is it a millisecond plus or minus a microsecond? Low jitter is very important because it allows you to keep relatively small buffers. Personally, I like efficiency as a measure of performance. Ultimately, what it comes down to is how well you are utilizing the hardware that the customer has bought. So optimize for low CPU cost.

When you do this, you'll find after the fact that it usually is improved either your latency or your throughput or sometimes both depending on exactly how your code is structured. So, you know, it's certainly a good way to attack the problem. Throughput and latency are really external measurements and I think they're observational. Efficiency is an internal measurement and I think that's something that you can really get behind and go through and look at. It reduces the footprint of your system.

This is very, very important, right? You're reducing memory. If you optimize by reducing your code path or reducing your data, it's going to decrease the amount of memory that you get wired down. And in the kernel, that's very important. But the interesting thing here is CPU utilization directly tracks with battery life. and temperature and therefore fan noise as well. So you can actually say, "Well, I'm making my driver run quieter by making it more efficient." And that is actually true.

Let's talk a little more about wired memory. This is very important. Wired memory is the most expensive kind of memory on your system because it's permanently used. It's not available to the VM system to page out, to have somebody else make use of. So all in-kernel allocations by default are wired memory. You can get pageable memory in the kernel, but most of the common allocators are wired. Now, when you load your KEXT, all of the text and all of the data associated with that KEXT are also in wired memory.

There are some rounding errors here that actually make it a little bit worse than that. When you create a subclass of I/O service, because of the infrastructure you're inheriting, particularly the Vtable and the binary compatibility patch-up stuff, there's approximately 2K, it's a little bit less than that, about 2K used, just getting you there.

I don't want to discourage you from using classes. I'm a very strong believer in object-oriented design. I think it has real value. But be aware that there's this tax. You don't want to go and create 50 classes if you can avoid it. If you can get by with two or three classes with maybe some parameters that would help describe the distinctions. Also, subclass from the highest point that you can. The higher up you go in the class hierarchy, the less you're inheriting, the less the permanent tax.

And this is something you might not have considered. If you make use of internal helper classes, these are maybe data representations that you use that never migrate outside of your kernel extension. These objects are never put into a dictionary. They never get passed off to anybody else. They're just helpers. You don't even need to make those inherit from OS object.

You can actually create your own root C++ class. And you obviously don't need the binary compatibility logic if everything is contained within your kext. And that can even be true if you have a suite of kexts that are always shipped together. You can share classes between them as long as they're always shipped together. You don't have to worry about incompatibilities.
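For instance, a purely internal helper could be a free-standing C++ class like the sketch below: no OSObject inheritance, no metaclass macros, no vtable-patching tax. This is only safe under the assumption just stated, that instances never leave your kext:

    #include <IOKit/IOTypes.h>   // UInt32

    // Internal-only helper: never placed in the registry, never handed
    // to another kext, so it needs none of the OSObject machinery.
    // Embed it by value in the driver or place it on the stack, so there
    // is no separate allocation to manage either.
    class RingCounter
    {
    public:
        void   reset(UInt32 size) { fSize = size; fIndex = 0; }
        UInt32 advance()          { fIndex = (fIndex + 1) % fSize; return fIndex; }

    private:
        UInt32 fSize;
        UInt32 fIndex;
    };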

Now another thing I'll talk about: because of the way kernel extensions load, there's approximately a 4K tax of wired memory. And the reason is this: we load the text segment and the data segment of each kext at the beginning of a page. And on average that means we're going to waste about half a page at the end of your kext. Of course, if your kext has exactly 8K of text and 4K of data, then we're not wasting any.

But on average, if you look at a system, it's about 4K per kernel extension. And this is so that when it comes time to unload them, we don't have a page partially pinned down by some other kext. Now you can help minimize this tax by refactoring your kext suites.

Here, I'll show you some diagrams that give you an idea of what I'm talking about. Let's say you have two kexts that you ship as a suite. You've got an actual driver and then you've got a support library. In a real-world example, you'd probably have more than one driver.

You'd have one library and, say, three drivers that were maybe bus-specific or something. Well, if they're always loaded together, then combine them. And by doing that, you just saved one page of wired memory without having to change any code. This is just changing your project to combine them.

Splitting can also make sense. If you have a suite that you're putting together in a single kernel extension, but for most users, most of the time, the pieces aren't used together, you can split them up. So in this example here, the red kernel extension represents the portions of your kext that are talking to the HID system and the portions that are talking to the USB family. Let's say those are used very frequently, but maybe you don't think the ADB functionality is used very frequently on your product. Since we don't have anything that you can plug ADB into these days, that's probably true.

And as a result, you still have ADB compatibility with your product here, but you don't have the penalty of loading all of that code when it's not used. The red kext would get loaded the majority of the time when your device is present. The blue kext sits out on the file system, pretty much never getting loaded. So again, you save not only 4K, you save the entire cost of that functionality.

The good general rule for memory footprint is: keep only what you need. I can't stress enough how many times I've seen kexts go out the door with development and debugging information in them that isn't necessary. And I don't want to pick on third parties; Apple does this too. It's very easy to forget. So it's something that you really need to be mindful of.

One thing you might not be aware of: the I/O registry is a wired data structure. It's in wired memory. So when you put that 64K block of data in there, that's wired memory. So don't put anything in there that you don't absolutely need. Of course, the I/O Kit personalities in your kernel extension get put into the registry when you load the kext. So one source of large blocks of data would be having them in the personalities. Avoid that if you can.

If you have a firmware image or something that you're going to need to download to your hardware once, put it in a throwaway kext, a kext that can be unloaded. And depending on the details of what you're doing, there are a variety of ways to do this. Probably the easiest way is to have the main kext for your device match and load against the device. And when it knows that it needs some firmware, it can register itself for matching, or it can create a nub that will match that firmware kext in.

And in fact, it gives you the opportunity to have multiple firmware KEXTs, right? You could say, "Oh, this is version B of the board, so we're going to get this particular KEXT that we need for version B firmware." And it will download it to the board and then will evaporate and unload and free up that memory.
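A very rough sketch of that nub technique, assuming a hypothetical custom nub class and property name; attach() and registerService() are the standard IOService calls that kick off matching:

    // Publish a nub that the firmware kext's personality matches against.
    // The firmware kext loads, downloads the image, terminates, and can
    // then be unloaded, giving the wired memory back.
    void com_example_driver_MyDriver::requestFirmwareDownload()
    {
        com_example_driver_FirmwareNub *nub =   // hypothetical IOService subclass
            new com_example_driver_FirmwareNub;

        if (nub && nub->init()) {
            nub->attach(this);
            nub->setProperty("BoardRevision", "B"); // lets rev-specific kexts match
            nub->registerService();                 // triggers matching
        }
        if (nub) {
            nub->release();  // the registry holds its own reference once attached
        }
    }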

[Transcript missing]

I'm going to spend just a few seconds here talking about memory fragmentation. If you're writing an application, memory fragmentation is annoying, but when your app exits, all that fragmentation is gone. When you're running in a kernel, fragmentation is kind of like a permanent plaque in the system. So think of the kernel as an app that never exits, or at least you hope it never exits.

So, yeah, and the fragmentation problem is that as you allocate a bunch of small pieces of memory, typically smaller than a page, and maybe keep them for a long time, and you've also got some allocations that you only keep for a short time, you can cause pages to be pinned down that don't really need to be. The efficiency of that memory goes down. You have 4K per page, and maybe only 2K of it is actually being used. The other 2K is memory that can't be freed up.

So, a couple things you can do. For a small amount of data, and I want to emphasize small, dozens of bytes, put it on the stack if you can. There are no allocations involved, no locking, no atomic operations; very efficient. If you're doing string processing, say an 80-character string, that's OK to put on the stack. If you're talking about larger allocations, hundreds of bytes up to a page, and you're doing something very briefly, you might even consider doing a whole-page allocation. We do zone allocations in the kernel, so if your allocations happen to be power-of-two sizes, that aligns very well with the zone allocators.

But I'll talk about that in a minute. Also, cluster your long-term allocations. This is actually a very good technique. Think about your average case. How many buffers are you going to need? How many command buffers? Whatever. Allocate them in a pool. We have an example of this in IOCommandPool, and I believe that's in Darwin. So you can have a look at how we've done that.

What you can do is have your average case allocated and waiting for you. And then you can also allocate additional resources on demand. So if you have a spike where you need some additional resources, you can grow the pool, and you can even have some logic in there to drop it back down if you want to. I would not recommend allocating for your worst case, because that's, of course, permanent memory now that you've pinned.
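A sketch of that average-case pooling idea using IOCommandPool; the MyCommand factory and the pool depth are hypothetical, while withWorkLoop(), returnCommand(), and getCommand() are the IOCommandPool calls:

    #include <IOKit/IOCommandPool.h>

    enum { kAverageCommands = 8 };   // hypothetical average-case depth

    bool com_example_driver_MyDriver::setupCommandPool(IOWorkLoop *workLoop)
    {
        fPool = IOCommandPool::withWorkLoop(workLoop);
        if (!fPool) {
            return false;
        }

        // Seed the pool with the average case so the data path normally
        // never has to allocate.
        for (int i = 0; i < kAverageCommands; i++) {
            IOCommand *cmd = MyCommand::create();   // hypothetical factory
            if (!cmd) {
                return false;
            }
            fPool->returnCommand(cmd);
        }
        return true;
    }

    // On the data path:
    //   IOCommand *cmd = fPool->getCommand(true);  // blocks only if empty
    //   ... use it ...
    //   fPool->returnCommand(cmd);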

Performance is always a trade off. You need to decide what's going to be most important and design for it. But think about it. That's the most important thing. Make a conscious decision. And if you have questions, ask us. We're here to help. Now, leaks. Leaks. These are evil, particularly in kernel extensions, because again, we're leaking wired memory.

This isn't a page of memory that's going to go cold and then get sort of forgotten about off in a swap file somewhere. This is memory that's permanently allocated and will never be touched again. So the symptom is the machine gets slower and slower and slower as available memory goes down, and then it'll eventually crash or the user will get fed up and reboot it. It seriously damages the user experience.

And the reason they're so evil is they often don't get caught in QA because you're doing maybe short cycle time on testing or you're not looking specifically for leaks. So everything can work correctly. Your driver may be perfect. But if it's leaking, that's a problem for the entire community.

[Transcript missing]

Locks, atomic operations in general are expensive, particularly on an MP system. Our systems are cache coherent. So you're not only taking time on the processor that you're running on, you're taking time on the other processor because you're messing with its cache. So you want to avoid taking more locks than necessary and you want to avoid taking them on the data path because the data path is where you want your low latency. Anytime you have to do an atomic operation, you're introducing jitter because you don't know if that resource is going to be available for you to grab.

Also be aware, any kernel function that says in the documentation that this may block, it's blocking because it's doing an atomic operation. It's acquiring a lock, it's doing something that is atomic and that's expensive. So that's something to be aware of as you're writing your code. Context switches are expensive compared particularly to just a direct function call. So you want to design your code to avoid that as necessary. And there's an interesting thing we call the ping pong effect where you can have one thread running on processor zero and it wakes up another thread.

That other thread then schedules and runs on processor one and starts working with the data that the prior thread created. Well, that's on the other processor and the data is not in its data cache. So you can have these two threads ping ponging between the processors and blowing cache affinity.

So if you can do what you need to do without a context switch, you avoid that, right? Because you're still running in the same thread of execution. You're still on the same thread. You're still on the same processor. And of course, disabling interrupts is evil for a number of reasons because you're basically taking the CPU away from the scheduler and saying, "No, I'm going to decide what runs, not you." And that affects our real-time performance. So disabling interrupts.

That also causes jitter. So we of course have IOWorkLoop. Everybody who's looked at I/O Kit has encountered this. It's central. It's one of our core concepts here. But it really is designed from a systemic view. For those of you that have looked at some of the APIs of IOWorkLoop, if they perhaps don't make immediate sense when you're approaching them from a driver writer's perspective, it's because they were designed from a systemic perspective.

How does this affect the whole system? How does this work when you stack multiple layers of drivers on top of each other? The main thing it tries to do there is eliminate context switches in the data path. As you start from a user request and you go all the way down to the hardware, we tend to inherit the I/O work loop along the client provider chain in the registry, which tends to be the data flow. So, by and large, a request can go all the way down to the hardware on the client's thread with no context switch.

And likewise with interrupt event sources, you're not doing any work, you're not doing any interrupt processing in the interrupt handler. All you're doing is writing a filter to make sure it really is your interrupt. And then the thread schedules, your work loop thread will schedule to run. Then it can process the data, but again, that will be passed up the stack all the way up to the highest level on a single thread. You're minimizing context switches.
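A sketch of that filter/action split using IOFilterInterruptEventSource; the register reads and handler bodies are hypothetical, but the registration call is the real one:

    #include <IOKit/IOFilterInterruptEventSource.h>

    // Filter: runs at interrupt time; do nothing except answer
    // "is this interrupt mine?"
    bool com_example_driver_MyDriver::interruptFilter(
            IOFilterInterruptEventSource *source)
    {
        return (readStatusRegister() & kInterruptPending) != 0;  // hypothetical
    }

    // Action: runs later on the work-loop thread, where real work is safe.
    void com_example_driver_MyDriver::interruptOccurred(
            IOInterruptEventSource *source, int count)
    {
        processPendingEvents();  // hypothetical
    }

    // Registration, typically in start():
    fInterruptSource = IOFilterInterruptEventSource::filterInterruptEventSource(
        this,
        OSMemberFunctionCast(IOInterruptEventSource::Action, this,
                             &com_example_driver_MyDriver::interruptOccurred),
        OSMemberFunctionCast(IOFilterInterruptEventSource::Filter, this,
                             &com_example_driver_MyDriver::interruptFilter),
        provider);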

IOCommandGate. This is an interesting one. I've had a number of requests for us to publish the lock that's associated with this. And the reason we don't do this is really two reasons. One is sort of a methodology, a design methodology that we like to talk about. I call it a gated community. The idea here is you have a set of functions within the driver that all know they're running behind the command gate.

They can call into each other without having to do any additional locking. Remember, locking equals atomic operation. And all of the functions outside of that gated community know that they are not running with the command gate held. So when you want to call into a gated function, you take the function pointer to it using that nice macro that I showed you earlier, and you call through the command gate to it. It also has a very interesting debugging property associated with it, because all calls through the command gate show up on the stack as ordinary function calls.

So if you have a deadlock, a show all stacks will point exactly where it is because every access to one of these locks is on the stack. Whereas if we publish the lock, you'd be able to take the lock in your code and release it directly. Any imbalances there would cause deadlocks and there would be basically nothing

[Transcript missing]
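Where the transcript drops out, the gated-community pattern just described would look roughly like this sketch; the request type, the fCommandGate member, and the method names are hypothetical, but runAction() is the real IOCommandGate entry point:

    // Ungated entry point: closes the gate, then calls the gated function.
    IOReturn com_example_driver_MyDriver::submitRequest(void *request)
    {
        return fCommandGate->runAction(
            OSMemberFunctionCast(IOCommandGate::Action, this,
                                 &com_example_driver_MyDriver::gatedSubmit),
            request);
    }

    // Gated function: runs with the command gate closed, so it may call
    // other gated functions freely with no additional locking. Every call
    // arrives here as an ordinary function call on the caller's stack,
    // which is what makes "show all stacks" so useful for deadlocks.
    IOReturn com_example_driver_MyDriver::gatedSubmit(void *request,
                                                      void *, void *, void *)
    {
        enqueueRequest(request);   // hypothetical
        return kIOReturnSuccess;
    }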

If you haven't used the utility Top, I urge you immediately after the session is over to whip out your laptop and run it. Top is one of these essential little utilities to understand what is going on in the system.

And it isn't necessarily directly relevant to what drivers do, but it does help you understand, when things go wrong, maybe what is going wrong. You've got a daemon spinning out of control. You can use it a little bit for seeing what's going on with drivers. The kernel and the collection of all kernel extensions are accounted for under PID 0, the kernel; it's the very last line on the slide.

And if you're running a stress test, if you're really beating on your driver, maybe you've got a gigabit Ethernet driver and you're really pushing data through it, that will show some numbers there. And so you can actually start to look at memory usage and CPU usage collectively for the whole kernel.

And as your driver becomes more lean, those numbers should get better. So if you run this before and after a tuning exercise, you might actually see those needles wiggle a little bit. I would hope you would be able to.

We also have the ability to profile the kernel. Again, the whole system really is being profiled by Shark. But if you're running a load test on your driver, this actually will probably yield some interesting results. Now, first and foremost, you really do need to get the Kernel Debug Kit, because Shark will give you, again, very peculiar data if you don't have the symboled kernel: you're going to be getting whatever the nearest global symbol is, and that may not be very close to the actual function that you're landing in. So if you have the symboled kernel, you run Shark, and you particularly put your driver under load, it will help you identify the hot call paths through your driver, and you will understand maybe a little bit better what you need to optimize.

Now this is something that has come to our attention in the last year and this is why I'm bringing it up here for the first time. The I/O registry is a large sprawling data structure that is protected by locks. If your driver is directly setting or modifying a container in the registry that is talking directly to the container, the collection object,

[Transcript missing]

to the registry, particularly to collection objects in the registry, need to go through and take this lock. And there's a couple ways to do this. The easiest way, and I've got another slide that shows the way to remedy this, but the easiest way to tell if you're doing this is with this sysctl that we've added to Tiger.

This turns on strict checking. And what this means is, anytime one of these container collection objects gets modified while it's in the registry, if that lock isn't taken, you take an instant panic. Now, we shipped this off by default because the system was not reliable with this on.

Okay, this is how important it is. Now it's truly a lottery ticket, right? You're modifying the dictionary that contains your statistics information. What's the likelihood that somebody's reading it at that very instant? It's pretty small, but it's non-zero. So we see a number of data corruption panics and we believe most of them are this. They usually will involve an object that's been modified or freed from the registry recently.

Now there are a few types of objects that are safe to modify, and there's one in particular I want to talk about because it's a handy technique to have. The OSData object points to a block of memory. You can modify the memory that's being pointed to by it. That's not something that's going to affect any pointers or any reference counting anywhere.

And I know some people will use that to communicate statistics information out. They'll have a struct, they wrap that as an OSData, and they put that in the registry so an app can read it. That's safe to do. But if you're modifying a collection, or modifying anything that would affect a reference count or the existence of an object, you're at risk.
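A sketch of that statistics technique; the struct, the fStats member, and the property name are hypothetical, but OSData::withBytesNoCopy() is the constructor that wraps existing memory without copying it:

    #include <libkern/c++/OSData.h>

    struct MyDriverStats {
        UInt64 packetsIn;
        UInt64 packetsOut;
    };

    bool com_example_driver_MyDriver::publishStats()
    {
        // The registry entry points at driver-owned memory, so later
        // updates to the struct's fields touch no reference counts and
        // no registry structure.
        OSData *data = OSData::withBytesNoCopy(&fStats, sizeof(fStats));
        if (!data) {
            return false;
        }
        setProperty("Statistics", data);
        data->release();   // the registry keeps its own reference
        return true;
    }

    // Later, on the data path, no registry lock is needed for:
    //   fStats.packetsIn++;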

Okay, so what do you do? When you find one of these, you need to make sure that the modifications to the registry go through IORegistryEntry. That is basically your superclass: IORegistryEntry is the parent of IOService, which is the parent of basically all the I/O Kit driver classes.

There's a setProperty function in there that will take the correct lock, make the changes to the registry, and then release the lock and return. If you need to do complex changes, then you've got a couple options. You can copy the dictionary or the collection that you want to modify, make the changes there, and then set them back in atomically using setProperty. And by the way, for Tiger, we've added two new APIs to help you with this.

One is copyCollection. This does a deep copy of the containers, but it does not copy the leaf nodes. What it does is copy all of the objects that would be at risk for this type of corruption. So if you've got an object that you want to modify that's maybe multiple levels deep, you can use copyCollection on it. It gives you a new collection object that's safe for you to mess around with.

You make your changes, then you use setProperty to put it back. Let's say you want to make a larger set of changes, or you need to do maybe some read-modify-write type stuff, and you want to make it atomic. We've added a new entry called runPropertyAction to IORegistryEntry. And this, again, sort of like the gated community concept, allows you to run a function in a context that has that lock held.

And of course, the act of returning from that function causes the lock to be released. So runPropertyAction will allow you to modify anything in your registry entry. It needs to be fairly quick, because while that function is running, nobody can read or modify the registry. So you shouldn't be blocking in there.
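Putting those pieces together, here's a hedged sketch of a safe read-modify-write of a registry dictionary; the property names are hypothetical, while copyCollection() and setProperty() are the APIs just described:

    void com_example_driver_MyDriver::pruneDeviceState()
    {
        // Never modify the live collection in place; copy the at-risk
        // containers out of the registry first.
        OSDictionary *live = OSDynamicCast(OSDictionary,
                                           getProperty("DeviceState"));
        if (!live) {
            return;
        }
        OSDictionary *copy = OSDynamicCast(OSDictionary,
                                           live->copyCollection());
        if (!copy) {
            return;
        }

        // Modify the private copy freely...
        copy->removeObject("StaleEntry");

        // ...then publish it back atomically under the registry lock.
        setProperty("DeviceState", copy);
        copy->release();
    }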

Again, I want to talk about cyclers. I won't spend a lot of time talking about this. This is something that we found tremendously useful at Apple and we're expanding our use of them. But it's really cheap QA, right? Find a way to exercise different paths through your driver, particularly what I call bookended functions.

Close, open, load, unload, anything that has some set of states that you can walk through. And build a cycler around it and as part of your QA program run those cyclers. And while you're running them, look for leaks. This is very easy to do. It doesn't take that much time once you've invested the time to build the cycler, that is. And I think you'll learn a lot about your driver that you might not have realized.

So just a few points about cyclers. It's nice to make them issue the cycles at controllable rates, because sometimes you want to do race condition checking and you want to really run them at top speed. Sometimes you don't want to really stress it; you just want to say, you know, cycle this thing once every five seconds, and you're trying to isolate each action separately.

But, you know, again, no matter how you run it, look at the memory as you do this. And also make sure your cycler logs something on every pass because when something does go wrong, you're going to have to do it again. You're going to run your cycler over the weekend when you go home. You come back and realize it died Saturday afternoon. You're going to want to know why. So log as much as you can.

Okay, I have two checklists for you. This is something you should be doing before you ready your drivers for release. The source checklist here: I think there's a lot of kernel extensions that could have been a lot cleaner when they shipped if they had gone through this checklist. And again, this is made up of real problems that I have seen. Make sure your classes use the correct naming. We have a recommended reverse-DNS naming idiom. Make sure it's not MyClass. Make sure it's not anything that has Apple in it. That's for us.

[Transcript missing]

Please make sure you don't have any debugging log messages in there: printfs, kprintfs, IOLogs that spam the console. Remember, IOLogs get written to a file, /var/log/system.log. So if you're logging 50 messages a second, and I have seen this, those are going to fill up your customer's hard disk.
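One way to make that checkout mechanical is to route all debugging output through a macro that compiles away in release builds; a minimal sketch (the macro name is hypothetical):

    #include <IOKit/IOLib.h>

    #if DEBUG
        // Development builds: tag and forward to IOLog.
        #define DLOG(fmt, args...)  IOLog("MyDriver: " fmt, ## args)
    #else
        // Release builds: expands to nothing, so nothing can spam
        // /var/log/system.log.
        #define DLOG(fmt, args...)
    #endif

    // Usage: DLOG("start: provider=%p\n", provider);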

So please just check that as part of your final checkout. Look for the log messages and see what's out there. And any other debugging-specific stuff: panics, asserts, IOLogs, all those things need to be only in your debug code. So make sure you're looking at that. Binary checklist. This is something that maybe your QA team would want to do, because you can do it without really having to look at the source code. A very simple one: kextload -tn. That means test the kext, but do nothing.

So it's not loading the kext, it's just acting as if it were going to. It runs all the validation checks. And it will issue some diagnostic messages describing what's going wrong. You can add the -v flag to get more verbose output, but that's a very easy check. This is kind of a silly one: run a find on your kext and see what all is in there. I've seen a lot of things like GDB backtraces. I've seen, of course, the .DS_Store files and other dot-files like that. I've even seen source code in there.

I don't know how it got in there. I'm not sure how somebody set up their project. But you really only ought to be shipping the folders and the Info.plist and the binary and maybe if you have any other resources. But make sure every file in there is something you know you're shipping.

Also make sure your binary has been stripped. We recommend strip -S. If you don't do this, guess what? You're shipping line numbers and source file names. If you want to keep your source code closed, run strip, because you're actually putting a lot of symbolic information in there otherwise. If you actually ship it unstripped, it's also very large, which is why we personally care. We don't mind if you open up your source code, but you might.

Look at your Info.plist. The IOKitDebug property needs to be gone, or it needs to be zero. This causes additional logging when your driver's loaded. There's no point in this being in a shipping kext. In fact, at one point we played around with making those only load under certain specific debug situations, but we actually found that would break a number of shipping kexts if we did that.

So make sure your version numbers match everywhere. I know there's several places it has to be checked. We're working on reducing that number, but make sure they're right. Now also, check your copyright string. This is not something we personally care about, but I've seen a lot of things like "copyright 1999 my company".

[Transcript missing]

I'm working on a utility called Kextpert. It's basically a big honkin' Perl script. And what I'm trying to do is capture a lot of these tests, a lot of these simple problems that can be automated into a test. That's what I'm trying to put in there.

It's a lot of complexity because I'm also trying to build in version awareness to it. So you'll be able to basically run Kextpert with a set of versions you care about, OS releases you want to run on, and point it at a KEXT. And it will go through it, and for every file in there, it will tell you what it thinks.

It will give you particularly errors. These are things that you probably would have already found, but if you didn't, it'll tell you about them. Warnings are things that, depending on context, may or may not be a problem, but they're certainly something you want to track down. Suggestions is an interesting one. This is one we're going to use to promote new and improved ways of doing things.

It's going to say, "Hey, maybe if you use this property instead, you could eliminate some complexity." So it's going to look for telltale signs of maybe suboptimal ways of doing things. And info is really in there to give you a sense of what's going on: it's going to tell you what the parsing engine is thinking about this file.

So it just kind of tells you how it's analyzing the kext. Now it's also going to contain a section at the end that, for every one-liner it spits out at the top, hopefully will give you a link to more detailed information.

So for more information, there's obviously a lot of information about these topics. You can go to developer.apple.com. There's a special page there, WWDC 2005, and that basically has all of these links on it that will help you find the resources you need. Just a few quick words here. The Apple Developer Connection website has a few specific mail lists that you want to know about. Lists.apple.com is the site you go to to set them up.

With the ATA and SCSI development list, tremendously helpful to folks working in that space. This is also true for FireWire and USB. We have dedicated lists. Obviously, we can't talk too much about future products or future plans, but if you're trying to get something going and you're getting stuck, those are great places to go. We also have Darwin as a tremendous resource. The Darwin Kernel is open sourced.

I don't know the exact percentage, but it's something like 99.99% of the kernel and the surrounding pieces are made available as open source. In addition to that, there are a number of mail lists that I personally frequent, and so do a number of other people: general development, Darwin dev, USB. That looks like a cut-and-paste... Sorry, I need to debug my slides.

So there's general development, there's driver development, and there's also kernel development. So I believe it's [email protected] is what that next to last line should have been. I'll make sure this gets updated on the WWDC site. And of course, there are several related sessions. Session 508 is about open source, and obviously we're talking about the kernel there. USB session 509, both of those are tomorrow.

And then on Thursday, we've got the BSD talk, which obviously if you're working on non-I/O Kit pieces, or even in some cases bridges, that'll be important. And then there's two labs. Tomorrow afternoon, we've got a KPI lab that's specifically for file system development. And then Thursday afternoon, we have a general kernel extension porting lab for Tiger. and Craig Keithley is the person in Developer Relations who is responsible for the I/O Kit area. And I, again, am the manager of the I/O Kit team.