Networking and Security • 1:02:25
The Mac OS X kernel has a powerful networking architecture that offers numerous ways to extend kernel capabilities. Learn how to exploit this architecture to develop advanced networking products such as firewalls, VPNs, and content filters.
Speaker: Laurent Dumont
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it may have transcription errors.
Good morning. My name is Tom Weier. I'm the Network and Communications Technology Manager in Developer Relations. I'm going to introduce Session 304, which is Extensible Kernel Networking Services. Hopefully many of you have made it over from the previous session in Room A1. So with that, I'd like to introduce Laurent Dumont, who is one of the CoreOS networking engineers.
Good morning. I'm going to talk about the extensible kernel networking services. In our case, what we're mainly talking about here is kernel extensions, and how to create a kernel extension that adds some networking functionality to the kernel. As an introduction: the Mac OS X networking architecture is extensible, so there are mechanisms that let you add firewalls, VPNs, content filters, or network drivers. The goal is a little bit like Mac OS 9, where you had a mechanism for extensions: here you can add your networking extension without having to recompile the whole kernel, as you would in a regular FreeBSD type of environment.
What you'll learn in this session: we'll talk in some detail about the network kernel architecture, that is, what we have for Mac OS X and Darwin. We're going to use both terms, because everything here is in Darwin, so it's all open source, and you're really welcome to look at the source yourself in the Darwin kernel. We'll see how to filter and intercept packets at different points in the kernel, from the socket layer and from the lower levels, and also how to add network interfaces and drivers to the kernel. One of the interesting things you may learn here is some tips about the specifics of the Mac OS X kernel, and some of the changes coming from Mac OS 9 or from FreeBSD, I mean, things that are different in Darwin from FreeBSD. There are different behaviors, and some tips here that you may be interested in.
So this is the picture that you have probably seen many times by now. Networking in Darwin is part of the BSD kernel. I think it's always interesting to say that there are really three parts in the Mac OS X kernel. You've got the BSD subsystem, which provides a lot of the APIs, like sockets, which we'll talk about later, and the file system. We've got the Mach kernel, which is the base for VM and a bunch of other core services, scheduling and so on. And there is the I/O Kit subsystem, which is used for the drivers. Because of this, networking has some different interactions with the rest of the kernel, at the I/O Kit layer or with the Mach kernel, and we get some differences here.
If we go a little deeper into the networking subsystem, we see that we can roughly decompose it into four layers. We get the socket layer; you were probably in Vincent's session earlier, and you saw that sockets are the main networking API in Mac OS X. We'll see that there is a mechanism to extend the socket layer: you can plug yourself into it to do filtering and things like that. The second layer from the top is the network protocols. This is where things like TCP/IP, AppleTalk, or your own protocol live. The third layer is something added in the Darwin kernel that you won't find in FreeBSD: the data link interface layer. We'll talk about it in more detail; basically, it is an abstraction layer where protocols and network interfaces meet, for extensibility. And finally we get the network interface layer, which on the bottom side communicates with I/O Kit and pseudo drivers, and on the upper side talks to the DLIL, the data link interface layer. We'll go into detail on those four layers.
So networking on Darwin, to summarize, is currently based on FreeBSD 3.2. We get the TCP/IP stack, sockets, and a bunch of other networking services from the FreeBSD 3.2 implementation. If you're a little bit aware of this, it's not the latest version of FreeBSD, but it's a very solid base for what we're doing here. It's a very robust and proven implementation; FreeBSD is used in many places, in servers and things like that. What that gives us is a stack which has a lot of life behind it, and a lot of problems have already been flushed out. So we inherit all those improvements in the Darwin kernel.
However, we've got some Apple enhancements there. For plenty of reasons that we'll detail, the FreeBSD stack by itself was not completely meeting our needs. We're trying to have a more dynamic approach, where people can load and unload without rebooting and without recompiling. We also have support for MP. The FreeBSD stack right now is not really MP savvy, so we've got some mechanisms here for multithreading and MP efficiency in the networking subsystem in Mac OS X. As Vincent talked about a little bit before, we also tuned some buffer allocation, both for clients and for high performance, for gigabit and things like that. So that's also some of the modifications we've done to the stack. And we get this famous data link interface layer, which is a mechanism for extensibility at the bottom of the stack, for adding protocols and adding network families. One of the biggest problems with FreeBSD is that when you get a new driver, or a new class or family of driver, you more or less have to go pretty deep into the kernel, recompile the kernel, and get it integrated into your stack. With Mac OS X and DLIL, we have a mechanism to do this on the fly, basically: you can load and unload your drivers, and add a different type of family, without having to restart the machine. And as we said, it's extensible because of those plugin architectures and the network kernel extensions we're just going to talk about.
So the network kernel extensions, what are they? They add extensibility to the kernel networking. They are basically pieces of code that link against the kernel dynamically and become part of the kernel. So this is a big responsibility. You as a developer are going to make, I don't know, a filter, a new protocol, or something, and using the network kernel extension you'll link your code to the kernel code.
As you know, in Mac OS X, and it's been said in all the sessions, there is a pretty hard boundary between user mode and kernel mode. In user mode, basically, whatever you do, if you crash your application, you're not going to bring down the machine. When you run in the kernel, you're in privileged mode and you have access to all the goodies and all the resources, which means that if your code does something wrong, it's going to panic, and it's going to be a very bad experience. So kernel extensions, in that sense, should be used only if what you're trying to do cannot be done in user mode. If what you're trying to do as an application using networking can be done in user mode, without running in the kernel, it's always best to do so. Why? Because if your application crashes for some reason, you'll bring down your application, and the user may have to restart it, but life will be fine. If something goes wrong in the kernel, it's a reboot, and that's really something we're trying to avoid at all costs. So this is a first word of caution; we're going to have several of them during this talk. Kernel extensions have the potential to crash the machine, so you have to be really careful about what is done there.
So what can you do in a network extension? Basically, you can write filter NKEs. Filter NKEs can see, modify, inject, or drop packets. We're going to go through the different types of filter NKE you can make at those different layers we've seen before.
But those filters will intercept packets at some point, and you'll be able to do whatever you want with a packet. You can decide to let it go through. You can decide to modify it. You can decide to swallow it, or duplicate it. Whatever you want to do, you'll be able to do it through a filter NKE. So this is a very powerful mechanism. However, as I said, you have the potential to do harm: if you're sending back a packet which is bad, it's going to crash somewhere up the stack.
That's a bad thing, so you've got to be really careful about what you're doing. Network extensions use the I/O Kit functionality, so through the I/O Kit mechanism you can dynamically load and unload the NKEs, and there is no need to reboot. This is something that you can do by yourself, or you can have your NKE loaded through a set of dependencies. We'll see that in the next slide, I think. So again, you're running in the kernel, so it's a big liability. Be careful about your pointers. Be careful about what you're doing: when you're unloading, you have some special things to do to make sure that you're leaving a clean state after your module unloads.
So, proceed with caution. Another important point is that there is no real API. The networking subsystem is wide open. You're linking against all the symbols in the kernel, so there is no guarantee of binary compatibility in the future. Just a simple example here: if we change a structure, or even worse, if we change a macro, and you have linked with an old version of this macro, somehow this is going to work. You won't have any symbol resolution problem, because this is a macro, but the implementation of the macro compiled into your NKE is going to be different from what is in the kernel. Your NKE may work for a while, but at some point, when it hits this macro, it's going to do the wrong thing, and the consequences might be harmful. You can potentially panic, or even worse, panic two hours later because you freed the wrong thing. So this is a problem, and you've got to be aware of it. Binary compatibility is not guaranteed in the future, because we may have to fix the stack, we may have to fix some things. There is a mechanism in I/O Kit that lets you declare dependencies, and that will make sure that your NKE won't load if the kernel version has changed, and things like that. Those are things that you need to look into when you're doing an NKE.
Darwin is not BSD. Darwin is based on BSD, so if you're used to BSD, most of the rules apply, but not all of them. We'll go into some of them: we'll talk about things like funnels, or places where you have to be a little bit careful with MP. So don't expect kernel code that you just got from open source to just compile and run in Darwin. It probably will, but you have to be careful to look at some of the aspects we'll talk about, get a pretty good eye on it, and see which areas may give you problems in Darwin. So don't expect that you can just compile it, link it into Darwin, and it will work. You may have to do a little bit more.
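The macro hazard described here can be illustrated in user space. The following is only a sketch under stated assumptions: `pkt_v1`, `pkt_v2`, and the accessor are hypothetical names invented for the illustration, not Darwin structures. The accessor stands for what a length macro turns into once it has been compiled against the old layout: a fixed byte offset baked into the extension's binary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout the extension was compiled against ("old kernel"). */
struct pkt_v1 {
    int flags;
    int len;
};

/* Hypothetical layout in the "new kernel": a field was inserted before len. */
struct pkt_v2 {
    int flags;
    int refcount;
    int len;
};

/* Stand-in for an old PKT_LEN(p)-style macro after compilation: it reads an
 * int at the byte offset that len had in pkt_v1, whatever it is handed. */
int pkt_len_compiled_against_v1(const void *pkt)
{
    int value;

    assert(pkt != NULL);
    memcpy(&value, (const char *)pkt + offsetof(struct pkt_v1, len),
           sizeof value);
    return value;
}
```

Handed a `pkt_v2`, the stale accessor links and runs without complaint, but it returns the contents of `refcount` rather than `len`. Nothing fails at load time, which is the reason declaring a kernel version dependency matters.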
Okay, so NKE dynamic loading. As we said before, this is provided by I/O Kit. Basically a kernel module, whether it's a filter, your protocol, or whatever you do as an NKE, will have two required entry points: a start routine that is called when your module is loaded, and a stop routine which is called when your module is unloaded. This is basically it in terms of the I/O Kit side of an NKE. All the rest is going to be the filter, or whatever mechanism you use to plug yourself into the kernel. The NKEs are not automatically loaded. You have to load them, or invoke their loading, through a startup script or through a dependency. You can have a chain of dependencies that will require your NKE to be loaded. This is pretty much what's going on for some of the NKEs we've got in /System/Library/Extensions as well, where the Apple NKEs live. Some of them are loaded through scripts and some of them are loaded through dependencies. There are three command line tools that you should be aware of, especially if you're debugging your NKEs: kextload is going to load your NKE, kmodstat will give you a list of all the NKEs loaded, and kmodunload will unload your NKE. Your NKE won't load if there is a symbol collision with what's in the kernel. One of the things I wanted to add here is that when you write an NKE, you should use some prefix for your symbols, because you're going to be in the same address space, in the same symbol space, as the kernel. So if you declare a function which is, I don't know, tcp_connect, you're going to have a problem, because this function is already inside the stack in the kernel, and your module won't load.
And also, we would like people to prefix their symbols with their own kind of prefix, I don't know, maybe their vendor code, the Apple OS type, to be sure that they're not going to conflict with anything else, or with any other vendor's NKEs that might load. So that's something to think about.
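The two entry points and the prefixing advice can be sketched as follows. This is a user-space model, not kernel code: the `acme_` prefix, the return type, and the return codes are assumptions made up for the example, and a real module would take its types and entry point signatures from the kernel headers.

```c
/* User-space stand-ins for the kernel's return type and codes (assumptions
 * for this sketch only). */
typedef int acme_return_t;
enum { ACME_SUCCESS = 0, ACME_FAILURE = 5 };

/* Every global symbol carries the hypothetical vendor prefix "acme_", so it
 * cannot collide with a kernel symbol or with another vendor's module. */
int acme_filter_ready = 0;

/* Start routine: called when the module is loaded. */
acme_return_t acme_filter_start(void)
{
    if (acme_filter_ready)
        return ACME_FAILURE;   /* already started */
    acme_filter_ready = 1;     /* allocate state and register hooks here */
    return ACME_SUCCESS;
}

/* Stop routine: called when the module is unloaded. */
acme_return_t acme_filter_stop(void)
{
    if (!acme_filter_ready)
        return ACME_FAILURE;   /* nothing to tear down */
    acme_filter_ready = 0;     /* unregister hooks and free state here */
    return ACME_SUCCESS;
}
```

In this model a loader would call `acme_filter_start()` once after linking the module and `acme_filter_stop()` once before removing it; everything else the module does hangs off the hooks registered in between.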
We alluded to this a little bit, but there are different types of NKE. There are four main types. Socket filters: socket filters sit at the socket layer, and we'll go into detail on this. Basically they let you intercept socket calls. Network protocols: network protocols can be implemented as NKEs also, and there is a mechanism to add those network protocols to the Darwin stack dynamically.
Data link interface layer filters: there are different types of filter here, for more of a lower level type of filtering. And also network interfaces: if you add a new family, a new type of interface, you might have to write an NKE to describe it and let the protocols and DLIL know about your new family. So we'll go into those. First of all, the socket layer. For people who are familiar with the Unix or the FreeBSD way of things, the socket layer is right under the user to kernel boundary. And it's got a pretty interesting role here, which is that the socket layer is basically doing all the copying of buffers in and out of kernel space.
So when your application under Mac OS X does, let's say, a send call, it's trying to send something on the network. The buffers live in your user thread, in your application, and they need to be put into kernel memory. This is done at the socket layer: the socket layer takes your buffers and puts them into the socket buffer queues (we'll go into a little more detail about this), so the copy to kernel space is done at the socket layer. One thing to note, for people who are more familiar with FreeBSD, is that here the execution of your thread continues in the kernel, so the send is done from your user thread. On the other side, the socket layer is where the protocol stacks send the buffers on the way up. Those come in on another thread, the input thread, and by some mechanism the application thread is awakened and the buffer is copied from the socket layer back to user space. If you remember what Vincent was talking about this morning, about the XTI send buffer size: well, this is where those buffers are, basically. We've got two sets of queues here, one for sending and one for receiving, and their size is configurable. The socket filters actually live in the socket layer.
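From user space, the size of those two per-socket queues can be read and adjusted with the standard `SO_SNDBUF` and `SO_RCVBUF` socket options. A minimal sketch, assuming a plain TCP socket; the helper names `queue_size` and `resize_send_queue` are made up for the example:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the current size in bytes of one queue (SO_SNDBUF or SO_RCVBUF)
 * of a freshly created TCP socket, or -1 on error. */
int queue_size(int which)
{
    int size = 0;
    socklen_t len = sizeof size;
    int rc;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    rc = getsockopt(fd, SOL_SOCKET, which, &size, &len);
    close(fd);
    return rc == 0 ? size : -1;
}

/* Asks the kernel for a send queue of the given size on an existing socket.
 * Returns the size the kernel reports afterwards, or -1 on error. The kernel
 * is free to round or clamp the request. */
int resize_send_queue(int fd, int bytes)
{
    socklen_t len = sizeof bytes;

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof bytes) != 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, &len) != 0)
        return -1;
    return bytes;
}
```

The value read back is whatever the kernel settled on, which may differ from the request, so code that cares about the queue size should use the returned value rather than the one it asked for.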
So they're a way to intercept packets, both coming from the application and going up to the application. There are different sets of calls; if you are familiar with the socket calls, most of them have some mechanism here for filters. When you create an NKE socket filter, we have a mechanism to know that your filter, your code, needs to be called. Depending on which call you added your filter to, you will be called, you'll receive the packets, and you'll decide to modify them, drop them, or, most of the time, do nothing. But you have the flexibility to do something with packets coming in and out of the socket layer. So this is one way of building a content filter. Sockets, again, are the API for networking. This is the native API, so everything goes through the socket layer. The socket layer is the glue between the application and the network protocols; we talked about this.
This is where you cross the kernel to user boundary. It sits above the network protocols, so the network protocols are not in the socket layer; the socket layer decides which protocol needs to be called. A socket is a kind of file structure. It follows the Unix idea that everything is a file, more or less, so it's a subset of that. It's got some specific calls, but it follows some of the file system calls: you can do a read or a write on your socket.
So we talked about this: a socket has a pair of packet queues, for incoming and outgoing packets. Your packets are going to sit in those queues, especially on the way up, until your application is awakened and grabs those packets. As we said, Darwin sockets have plugins. Each of those calls, let's say soreceive, for example, is going to run through all the filters that are associated with it before it actually does the call. So you have the opportunity to just discard the packet if you want, or change it.
That's one way. Okay, so socket filter NKEs; yeah, we just talked about this. The other thing is, you have two types of socket filters here. You can have a global socket filter, which is probably the simplest one. The filter that you create is going to be invoked for each socket, so every socket is going to run through your filter. Even if you don't care, you'll say, "Okay, well, I don't care," and you'll return, the packet won't be touched, and the processing will continue. There is a slightly more complex way of doing things, where you write programmatic socket filters that only apply to a certain type of socket. Let's say you just want something for web traffic: you have a way to decide that your socket filter will apply only to that.
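The difference between a global filter and one that applies only to certain sockets can be modeled in a few lines. This is a user-space sketch, not the NKE interface: the structures, the verdict names, and the idea of selecting sockets by port number are all assumptions chosen to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

enum verdict { VERDICT_PASS, VERDICT_DROP };

/* Hypothetical packet: just the socket's port and a payload length. */
struct packet {
    int port;
    size_t len;
};

typedef enum verdict (*filter_fn)(struct packet *pkt);

#define ANY_PORT 0          /* marks a global filter */
#define MAX_FILTERS 8

struct filter {
    filter_fn fn;
    int port;               /* ANY_PORT, or the only port this filter sees */
};

struct filter_chain {
    struct filter entries[MAX_FILTERS];
    size_t count;
};

/* Appends a filter to the chain. Returns 0, or -1 if the chain is full. */
int chain_add(struct filter_chain *chain, filter_fn fn, int port)
{
    assert(chain != NULL && fn != NULL);
    if (chain->count == MAX_FILTERS)
        return -1;
    chain->entries[chain->count].fn = fn;
    chain->entries[chain->count].port = port;
    chain->count++;
    return 0;
}

/* Runs a packet through every filter that applies to it, in order. The first
 * filter that drops the packet ends the processing. */
enum verdict chain_run(const struct filter_chain *chain, struct packet *pkt)
{
    size_t i;

    for (i = 0; i < chain->count; i++) {
        const struct filter *f = &chain->entries[i];
        if (f->port != ANY_PORT && f->port != pkt->port)
            continue;       /* selective filter, not attached to this socket */
        if (f->fn(pkt) == VERDICT_DROP)
            return VERDICT_DROP;
    }
    return VERDICT_PASS;
}

int packets_seen;

/* A global filter that does not care: it looks and lets everything through. */
enum verdict count_everything(struct packet *pkt)
{
    (void)pkt;
    packets_seen++;
    return VERDICT_PASS;
}

/* A content-filter-like rule that swallows large packets. */
enum verdict drop_large(struct packet *pkt)
{
    return pkt->len > 1000 ? VERDICT_DROP : VERDICT_PASS;
}
```

Registering `count_everything` with `ANY_PORT` and `drop_large` with port 80 gives a chain where every packet is counted but only large web packets are swallowed.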
You have to register your NKE handle with Apple. For people familiar with Mac OS 9, those are the OS types, the vendor types. That lets us identify the socket filters, so there is no collision when you insert your socket filter. Your socket filter can be run after some other vendor X socket filter, and you don't know that.
So we need to have some way to identify all socket filters, and you need to use the NKE handle. So see DTS to get your handle. As we said, an example of this is a content filter, where you decide, based on your own criteria, what you want to let through in the packets that you receive each time you get a call. Either you change them, you swallow them, or you duplicate them: whatever your content filter or your socket filter is meant to do.
An important point for the socket NKEs: there is no built-in reference tracking. What we mean by this is that in your NKE you need to keep track of where your socket filter is inserted. If you're inserted in 15 different sockets, you need to know that, because if somebody asks you to unload, you've got to have kept track of those insertions. Let me explain briefly what's going on: when we run the socket filter code, basically there is a pointer to your socket filter handler, where you're going to receive packets.
And if you don't correctly remove all of this when you unload, we're going to call into some code that doesn't exist, and guess what? It's going to panic. So it's your responsibility to take care of this, and to refuse to unload if necessary. It's pretty reasonable to refuse to unload if your socket filter is in a state where it doesn't know, or for some reason cannot really unload. People won't be able to unload your socket filter, but that's much better than panicking five minutes later. So that's something you need to be aware of. Another warning: you're in the kernel. It's pretty much wide open. You have wide open access to all the structures, but you're part of the kernel. You're just a function. When you go away, you've got to do the right thing, plug the hole, basically, and not leave any pointers and things like that hanging around, because that's going to be a pretty bad user experience. So that's another word of caution about using kernel extensions: you've got to be really careful.
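The bookkeeping this implies can be sketched as follows. It is a user-space model under stated assumptions: the `acme_` names are hypothetical, sockets are represented by opaque pointers, and there is no locking, which real kernel code would need.

```c
#include <assert.h>
#include <stddef.h>

#define ACME_MAX_ATTACHED 32

/* The sockets this (hypothetical) filter is currently inserted in. */
static const void *acme_attached[ACME_MAX_ATTACHED];
static size_t acme_attached_count;

/* Record an insertion. Returns 0, or -1 if the table is full. */
int acme_note_attach(const void *sock)
{
    assert(sock != NULL);
    if (acme_attached_count == ACME_MAX_ATTACHED)
        return -1;
    acme_attached[acme_attached_count++] = sock;
    return 0;
}

/* Record a removal. Returns 0, or -1 if the socket was never recorded. */
int acme_note_detach(const void *sock)
{
    size_t i;

    for (i = 0; i < acme_attached_count; i++) {
        if (acme_attached[i] == sock) {
            /* Fill the hole with the last entry and shrink the table. */
            acme_attached[i] = acme_attached[--acme_attached_count];
            return 0;
        }
    }
    return -1;
}

/* Stop routine: refuse to unload (-1) while any socket still points at us. */
int acme_stop(void)
{
    return acme_attached_count == 0 ? 0 : -1;
}
```

Refusing in `acme_stop` is the conservative choice the talk recommends: the module stays resident, but no socket is left holding a pointer into code that has gone away.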
Okay, so now the second layer in the picture of networking in Darwin: the network protocols. Two examples that come to mind here are TCP/IP and AppleTalk. This is pretty close to what you'll find in FreeBSD. The way to register and add protocols is pretty much the same. We've got a mechanism to let you do that dynamically, to add and remove your protocols and your domain dynamically. So this is the second layer we're going to talk about.
What's important to know here is that a domain defines a protocol family. One of the big examples here is PF_INET. This is where TCP, UDP, and the related protocols live. Another one that we have in Darwin is PF_INET6, which is the family for IPv6. This is something that is pretty well covered in a bunch of BSD books: the protocol family, and how the protocol handlers work. So this is what we have in Darwin here. From the socket layer, we decide which protocol handler to call depending on the type of socket and the address family you put in your socket. If you have TCP and you do a connect, we're going to call the TCP connect routine, and this is done through those structures, where you can add your own protocol. If you come up with the next killer protocol family that's going to replace IP, this is where you're going to add it. And if you do that, please put it in Darwin; we'll be happy to take it. So this is extensible, same thing: you can add your own protocol, and there is a mechanism to declare it from your NKE. Again, you can remove it. When you remove it, be careful and clean up after yourself. Otherwise very bad things will happen and the kernel will panic.
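The idea of a table that the socket layer consults by family and socket type, and that a module can extend or shrink at run time, can be modeled like this. It is a simplified user-space sketch: the entry layout, the constants, and the function names are assumptions for illustration and are much simpler than the real BSD domain and protocol switch structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical family and socket-type numbers, standing in for values such
 * as PF_INET and SOCK_STREAM. */
enum { FAM_INET = 2, FAM_KILLER = 99 };
enum { TYPE_STREAM = 1, TYPE_DGRAM = 2 };

struct proto_entry {
    int family;
    int type;
    const char *name;
    int in_use;
};

#define MAX_PROTOS 16
struct proto_entry proto_table[MAX_PROTOS];

/* What the socket layer would do at socket creation: find the handler for
 * this (family, type) pair. Returns NULL if nothing is registered. */
const struct proto_entry *proto_lookup(int family, int type)
{
    size_t i;

    for (i = 0; i < MAX_PROTOS; i++) {
        if (proto_table[i].in_use && proto_table[i].family == family &&
            proto_table[i].type == type)
            return &proto_table[i];
    }
    return NULL;
}

/* Adds a protocol. Returns 0, or -1 if the pair is taken or the table is full. */
int proto_register(int family, int type, const char *name)
{
    size_t i;

    assert(name != NULL);
    if (proto_lookup(family, type) != NULL)
        return -1;
    for (i = 0; i < MAX_PROTOS; i++) {
        if (!proto_table[i].in_use) {
            proto_table[i] = (struct proto_entry){ family, type, name, 1 };
            return 0;
        }
    }
    return -1;
}

/* Removes a protocol so that no later lookup can reach it. Returns 0, or -1
 * if it was not registered. */
int proto_unregister(int family, int type)
{
    size_t i;

    for (i = 0; i < MAX_PROTOS; i++) {
        if (proto_table[i].in_use && proto_table[i].family == family &&
            proto_table[i].type == type) {
            proto_table[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}
```

Unregistering before unloading is the "clean up after yourself" step: once the entry is gone, a lookup can no longer hand out a pointer into a module that has been removed.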
The third layer that we have in the Darwin kernel is a new one. It's called DLIL, the data link interface layer, and we'll go into detail about it. It sits in between the network protocols and the network interfaces. The goal of this layer is to be a central point. Basically, DLIL lets us do extensibility: you can add protocols, you can add network interfaces and network interface families, and you can also put in filters. DLIL is really the central point that's going to be the abstraction layer for all this. This is where the protocols attach and say, "Hey, I'm IP and I'd be interested in receiving IP type packets. I'm AppleTalk, I want to receive AppleTalk packets. I'm IPv6, I want IPv6." All those guys are going to talk to DLIL and register themselves to request those types of packets. On the other side, you have the drivers or pseudo drivers coming from the network interface layer. They can be I/O Kit-based drivers. We're not going to talk in great detail about those here, but they're basically drivers that move stuff off some kind of wire, or wireless; they move physical bits, if we can say that. And you've got pseudo drivers, like tunneling devices and things like that, which just add things to or remove things from frames that have been generated somewhere else. They can generate their own frames, but they don't go to a medium. All those go to DLIL, register themselves, and give some information about the type of framing they do and the specifics they have. So DLIL is here as a central point to handle those.
And added to this, if I go back to the previous diagram, in between those we've got two types of filter again. Under the network protocols, we've got the protocol filters, which register with DLIL and say, "I want the IP packets for interface en0," you know, your built-in Ethernet. And you also have interface filters, which register with DLIL and are lower level.
Those guys will see all packets for one interface. Say you want to see everything coming from en0, or from the AirPort, en1, or whatever, PPP. You put an interface filter there, and it will not be protocol dependent: you will get all your AppleTalk plus IP plus whatever packets there. Whereas, as a protocol filter, you specify that you're interested only in IP packets, or only in AppleTalk packets, whatever you want.
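The matching rule for the two kinds of filter can be written down in a few lines. This is a user-space sketch with made-up structures, not the DLIL interface: a filter is registered for one interface, and either names one protocol (a protocol filter) or uses `PROTO_ANY` (an interface filter). The protocol numbers are the usual EtherType values for IPv4 and AppleTalk.

```c
#include <assert.h>
#include <string.h>

enum { PROTO_ANY = 0, PROTO_IP = 0x0800, PROTO_APPLETALK = 0x809B };

/* Hypothetical frame summary: where it arrived and what it carries. */
struct frame {
    const char *ifname;
    int proto;
};

/* Hypothetical filter registration: always per interface. */
struct dl_filter {
    const char *ifname;
    int proto;          /* PROTO_ANY makes this an interface filter */
    int hits;           /* how many frames the filter has been shown */
};

/* Returns 1 if the filter should see this frame, 0 otherwise. */
int filter_matches(const struct dl_filter *f, const struct frame *fr)
{
    assert(f != NULL && fr != NULL);
    if (strcmp(f->ifname, fr->ifname) != 0)
        return 0;       /* registered on another interface */
    return f->proto == PROTO_ANY || f->proto == fr->proto;
}

/* Shows one incoming frame to every filter that matches it. */
void deliver(struct dl_filter *filters, int count, const struct frame *fr)
{
    int i;

    for (i = 0; i < count; i++) {
        if (filter_matches(&filters[i], fr))
            filters[i].hits++;
    }
}
```

With an IP protocol filter and an interface filter both on `en0`, an AppleTalk frame on `en0` reaches only the interface filter, and an IP frame on `en1` reaches neither, which is why a filter that wants traffic from two interfaces has to register on both.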
This mechanism is different from BSD, and it's something you need to be aware of if you take something like a driver, or a pseudo driver, let's say a tunneling driver. It's got to follow different rules than it would in FreeBSD 3 or 4. It will have to declare some DLIL modules to register with DLIL, and do things differently. There are some examples in the NKEs and in the Darwin code of how to do that. AppleTalk is an example. IP does that. IPv6 does that.
The interface layer is for, as I said, I/O Kit type drivers or pseudo drivers. They attach to DLIL and basically tell the networking stack that they're available. So dynamically, your AirPort is turned on or something, and en1 is going to tell DLIL, "Hey, I'm an Ethernet type of driver, and this is where I am." So it's a more flexible mechanism for attachment and detachment of interfaces on the fly. In the same way, when you load your new protocol dynamically, it's going to tell DLIL, "Hey, I'm a protocol of this type, and I'll take whatever SNAP ID for my packets. I want all those packets on this interface and that interface." You'll see the dl_tag around, which is the cookie or the handle that you get for filters and for protocol attachment to an interface. That is basically what identifies your unique connection to DLIL.
To go a little bit into detail about DLIL: there is an interface family module, one per type of interface, to handle framing. What I mean by this is that for IP over Ethernet, you need the address resolution protocol. ARP is not really part of IP; it sits more in between IP and Ethernet. There is a way in DLIL for an interface family to specify that kind of particularity. So here the ARP resolution is registered with DLIL. If you're on Ethernet and you're using IP, you don't have to do anything. If you come up with your own media and it needs its own address resolution protocol, you may have to look into this and declare your module with DLIL. So that is the protocol module. The DLIL modules also handle the framing for your type of interface. If you're a pseudo device, it's pretty simple: usually it's just moving your data pointer and putting a few bytes in there. But this is what is handled here. There are a lot of details in this; if you're adding a new type of interface, you really have to go into the documentation and look at this. The filters are a little bit easier to do.
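"Moving your data pointer and putting a few bytes in there" can be shown with a small buffer that reserves headroom in front of the payload. This is a user-space sketch with a made-up `struct buf`; a real pseudo device would do the equivalent on the kernel's packet buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical packet buffer: fixed storage, with the valid bytes starting
 * at data, so there can be free room in front of the payload. */
struct buf {
    unsigned char storage[128];
    unsigned char *data;    /* first valid byte */
    size_t len;             /* number of valid bytes */
};

/* Places the payload in the storage, leaving headroom bytes free in front. */
void buf_init(struct buf *b, size_t headroom, const void *payload, size_t n)
{
    assert(b != NULL && payload != NULL);
    assert(headroom + n <= sizeof b->storage);
    b->data = b->storage + headroom;
    memcpy(b->data, payload, n);
    b->len = n;
}

/* Framing: move the data pointer back and write the header in the freed
 * space. The payload itself is not copied. Returns 0, or -1 if there is not
 * enough headroom left. */
int buf_prepend(struct buf *b, const void *hdr, size_t n)
{
    assert(b != NULL && hdr != NULL);
    if ((size_t)(b->data - b->storage) < n)
        return -1;
    b->data -= n;
    memcpy(b->data, hdr, n);
    b->len += n;
    return 0;
}
```

Because the header is written into space that was already reserved, encapsulation costs a pointer adjustment and a few bytes of copying, regardless of how large the payload is.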
The DLIL filters: as we said before, there are protocol filters on top of DLIL and under the protocols. What they do is see all the datagrams of one protocol for an interface. The little trick here is that you register your filter per interface. So if you register for IP on en0, you'll see all valid IP frames for en0. You won't see any AppleTalk packets there, and you won't see IP packets for en1, your AirPort, let's say, so you need to register on both. So they're kind of low level. However, you've also got the interface filters, which give you even more flexibility, because at that point you'll see the whole frame.
So the demultiplexing, looking at whether it's an IP packet or an AppleTalk packet, won't have been done yet in the interface filter. This is between the interface and the family. You'll get access to the valid packets as they come from the driver, so you'll see full frames.
So if you're trying to do something like a VPN solution or a firewall, this is one place where you could put your NKE. You could do that as a protocol filter or at the interface filter level; it really depends on what you're doing. But those are good points for getting all the traffic. Things at the socket layer are more for getting things destined to an application. Here you'll get all the traffic for all the sockets, basically, even packets that don't go to any socket. If they are just dropped in the stack, you'll still see them at this point, because it's before any processing by the stack. So those are good solutions for VPNs and firewalls.
Network interfaces: there is still the FreeBSD, the BSD ifnet structure, and you have one per interface. It's I/O Kit based. You have a lot of things for Ethernet; there are a lot of samples, and the family shows how to do that for I/O Kit. So if you're doing an Ethernet driver, or some driver that really talks to a medium, you really need to look into I/O Kit, because those drivers don't live in these four layers; they live in I/O Kit. However, if you're doing a pseudo device, or something which is a little bit in between, like PPP, you may have to do some work in I/O Kit for your driver, dealing with the media side, and also at the network interface layer here, where you will have some DLIL work to get your stuff registered with DLIL and known by the network stack. So the networking subsystem is not part of I/O Kit, but I/O Kit has a way to hand over the packets and call some DLIL functions to make the interface known. There is one case, if you're doing a pseudo driver like a tunneling device, where you may do everything in the networking subsystem.
You don't have to go to I/O Kit. If all you do is take packets, add a header, do some encapsulation of some sort, and send them back to an Ethernet or another interface, you won't have to go through I/O Kit. That's a thing to know.
We just mention this next point because there is some confusion here: the filters we're talking about are at a different level from something that people coming from Unix, especially from the BSD side, may know, which is BPF. BPF is really an I/O Kit kind of tap. If you're not aware of it, it's the standard way on FreeBSD to build things like sniffer type applications, network traffic analyzers, and those kinds of things. Those are taps from the driver, which copy all the frames to BPF and get them to your application, which is asking for BPF traffic. The big difference with a DLIL filter is that in DLIL, when you put in, let's say, an interface filter, you get the packets, and you decide to let those packets go through, or to make a copy and do your own kind of tap functionality, but by default there is one instance of the packet. With BPF, a copy of the packet is made. And also, for Ethernet, that's pretty true for Ethernet, opening the BPF device will put the driver in promiscuous mode, which means you will get all the packets seen by your interface, not just the ones for your MAC address and the various multicasts or broadcasts that you can get. You will get everything that is physically seen on your segment. That's something you won't see from a DLIL interface filter. A DLIL interface filter will get only the packets that are valid for your address; it's not going to be in promiscuous mode. And again, there are some standard hooks in the I/O Kit network interface for this, so that's a good model to follow if you're building your own drivers.
It's a neat facility to have, to be able to use tcpdump. There's a bunch of services on top of that. And it's pretty little work to get BPF support in your new driver. So we just mention it here because there's some confusion between BPF and the DLIL interface filters. They're not exactly at the same level.
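The difference between a copying tap and an in-path filter can be sketched in a few lines of user-space C. This is only a toy model of the idea described above, not DLIL or BPF code: the `packet` struct, `tap`, `deliver`, and the two example filters are all hypothetical names made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy packet; real packets would live in mbufs. */
struct packet { char bytes[32]; size_t len; };

/* An in-path filter sees the one and only instance of the packet and
 * returns a verdict. It may also modify the packet in place. */
enum verdict { PASS, DROP };
typedef enum verdict (*filter_fn)(struct packet *p);

/* A tap only ever gets its own copy, so it cannot affect delivery. */
static struct packet tap_copy;
static int tap_seen;

void tap(const struct packet *p) { tap_copy = *p; tap_seen++; }
int tap_count(void) { return tap_seen; }
const struct packet *tap_last(void) { return &tap_copy; }

/* Simulated input path: the tap is fed a copy first, then the filter
 * decides whether the original continues up the stack.
 * Returns 1 if delivered, 0 if the filter dropped the packet. */
int deliver(struct packet *p, filter_fn filter, struct packet **delivered)
{
    tap(p);
    *delivered = NULL;
    if (filter != NULL && filter(p) == DROP)
        return 0;
    *delivered = p;   /* same instance, no copy */
    return 1;
}

/* Example filters (hypothetical). */
enum verdict drop_short(struct packet *p) { return p->len < 4 ? DROP : PASS; }
enum verdict mark_first(struct packet *p)
{
    if (p->len > 0)
        p->bytes[0] = '#';   /* in-place change is visible to the stack */
    return PASS;
}
```

In this model the tap still records a dropped packet, while the stack never sees it, which mirrors the point that the two mechanisms sit at different levels.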
So, non-I/O Kit interfaces need to do the work to support BPF themselves. That's it for BPF.
Okay, another important thing we wanna talk about in the networking subsystem is the mbuf. mbufs are the memory buffers that we use all over the networking subsystem to hold all network data. If you're coming from Mac OS 9, you're probably aware of the mblks which are used in OT or the STREAMS modules. It's more or less the same thing in the BSD world. What we do with mbufs is hold either the packets coming from the socket layer or the things that are coming from drivers. So I/O Kit is gonna create some mbufs with the packets received by, let's say, an Ethernet driver, and send them up to DLIL, and DLIL will route those packets to TCP/IP or AppleTalk, whatever. But those are all mbufs that are manipulated. mbufs are interesting because, like mblks, they manipulate pointers to data. So once you're in the kernel, there is no more copying of data. Everything is done through mbufs. The drivers copy their data from their ring buffer to an mbuf (actually, there's an mbuf there already), and the mbufs are passed up until they get to the socket layer. Then, when the application is awakened and gets its data, the data is copied from kernel space to the user application's memory, and the mbufs are released at that point.
So what's interesting in the mbuf is, if we take the example of a packet going down, a send: you're sending some TCP traffic. What's gonna happen at the socket layer is that your data from user space will be copied into the kernel, into some mbufs or mbuf clusters, and those mbufs will sit in the socket queue (remember the socket queue we were talking about earlier). And what we're gonna do to send the TCP/IP packets is prepend an mbuf with the header to a part of the data that you sent, and this will logically point to the data in your socket buffer.
This goes down to the driver, and the driver will send it and say it's done, but the data in the socket buffer won't be released until it has been acknowledged by the other side of the TCP connection.
So if we need to do a retransmit of this packet that we took from your data, we'll do that by prepending a new header and pointing to your data, but your data will be the same in there. So there is no copy here. It's only once we know that all the data has been acknowledged, and we don't need it at the socket layer anymore, that it's gonna be released. So that's something to know about mbufs. For you, as somebody who's gonna write a network kernel extension, you have a bunch of functions to access mbufs, to allocate mbufs, to manipulate mbufs. There is pretty much everything you can think of to do with mbufs; mbuf.h in the sys directory is a good place to start. However, there are a bunch of macros dealing with mbufs. Try to avoid using them if possible, for the problem we talked about before: if we change the implementation underneath the macro, that may give you some kind of panic if you're using it. So try to avoid the macros if possible.
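The send and retransmit behavior described above can be illustrated with a toy user-space model. This is a sketch under stated assumptions, not Darwin's `struct mbuf` or its API: `struct buf`, `struct send_queue`, `make_packet`, `free_packet`, `packet_len`, and `ack` are hypothetical names invented for the example. The point it shows is that each packet gets a fresh header buffer while the payload buffer only references the bytes held in the socket queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an mbuf: a small descriptor that
 * points at bytes it does not own. */
struct buf {
    struct buf *next;   /* next buffer in the same packet */
    const char *data;   /* start of this buffer's bytes */
    size_t      len;    /* number of valid bytes */
};

/* Socket send queue: keeps the user's bytes until the peer acknowledges. */
struct send_queue {
    char   bytes[64];
    size_t len;
    int    released;    /* set once the data has been acknowledged */
};

/* Build a packet: a new header buffer chained to a buffer that references
 * a slice of the queued data. No payload bytes are copied.
 * Returns NULL if the slice is out of range or allocation fails. */
struct buf *make_packet(const struct send_queue *q, size_t off, size_t len,
                        const char *hdr)
{
    if (off > q->len || len > q->len - off)
        return NULL;
    struct buf *h = malloc(sizeof *h);
    struct buf *p = malloc(sizeof *p);
    if (h == NULL || p == NULL) {
        free(h);
        free(p);
        return NULL;
    }
    p->next = NULL;
    p->data = q->bytes + off;   /* points into the socket queue */
    p->len  = len;
    h->next = p;
    h->data = hdr;
    h->len  = strlen(hdr);
    return h;
}

/* Free the descriptors only; the queued payload is untouched. */
void free_packet(struct buf *m)
{
    while (m != NULL) {
        struct buf *n = m->next;
        free(m);
        m = n;
    }
}

/* Total length of a chained packet. */
size_t packet_len(const struct buf *m)
{
    size_t total = 0;
    for (; m != NULL; m = m->next)
        total += m->len;
    return total;
}

/* The peer acknowledged everything: only now is the queued data released. */
void ack(struct send_queue *q)
{
    q->len = 0;
    q->released = 1;
}
```

A retransmit in this model is just a second `make_packet` call over the same offset with a new header; both packets end up pointing at the same queued bytes.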
The other thing to know, which is a little bit different from BSD, is that we have a different VM subsystem underneath, and the way we allocate memory is kind of different. In FreeBSD you're pretty much guaranteed that you'll get memory when you allocate an mbuf. It's not the case in Darwin. So be aware that your allocation of an mbuf can and will fail. There are two modes for allocating an mbuf.
You can ask for an mbuf with "don't wait", which means: give me an mbuf if you have one to handle my packet, my data; if you don't, just return. This is what you use on the fast path, let's say on the transmit and receive paths in your NKE. But be warned that you will get a null back at some point, and probably the best thing to do in that case is to drop your packet and do whatever cleanup is appropriate there. You can also ask for wait mode, but don't do that on the fast path, because you're gonna block the thread while we're trying to allocate memory.
So you're gonna block, potentially. That's not very good, so use it only for low-bandwidth kinds of things, like when you start your protocol and need some mbufs up front, or something like that. But don't do that on the fast path. So yeah, the rule is: do not block. As for mbufs, you have to release them. There are some rules, but it's not completely fixed; it depends on the socket buffers and everything. You'll see that either you have to release them or somebody's gonna release them for you. There are no real preset rules.
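The "don't wait" rule can be sketched as follows. This is a user-space toy, not the kernel's mbuf allocator: the fixed-size `pool`, `buf_get_nowait`, `buf_free`, `fast_path_input`, and the counters are hypothetical names chosen for the illustration. What it shows is the shape of a fast-path handler that never blocks, checks for a null result, and drops the packet when no buffer is available.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_SIZE 2   /* deliberately tiny so exhaustion is easy to trigger */

struct pkt_buf { int in_use; size_t len; char data[128]; };

static struct pkt_buf pool[POOL_SIZE];
static unsigned long drops;

/* Non-blocking allocation: hand back a free buffer or NULL right away. */
struct pkt_buf *buf_get_nowait(void)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return &pool[i];
        }
    }
    return NULL;   /* the caller must cope with this */
}

/* Release a buffer back to the pool. */
void buf_free(struct pkt_buf *b)
{
    if (b != NULL)
        b->in_use = 0;
}

/* Number of buffers currently held; a value that only ever grows
 * suggests a leak. */
size_t bufs_in_use(void)
{
    size_t n = 0;
    for (size_t i = 0; i < POOL_SIZE; i++)
        n += pool[i].in_use ? 1 : 0;
    return n;
}

unsigned long drop_count(void) { return drops; }

/* Fast-path handler: returns 1 if the packet was accepted, 0 if it was
 * dropped. It never waits for memory. */
int fast_path_input(const char *bytes, size_t len, struct pkt_buf **out)
{
    *out = NULL;
    if (len > sizeof pool[0].data) {
        drops++;
        return 0;
    }
    struct pkt_buf *b = buf_get_nowait();
    if (b == NULL) {
        drops++;      /* no buffer: drop the packet rather than block */
        return 0;
    }
    memcpy(b->data, bytes, len);
    b->len = len;
    *out = b;
    return 1;
}
```

The in-use counter plays the same role in this sketch that a buffer-usage statistic plays when hunting for a forgotten release.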
Yeah, maybe some other thing about mbufs. No, I think that's it. Just the warning: be careful with mbufs, and be careful about the use of macros. And there are commands like netstat -m that tell you how many mbufs you're using in your system, and this is a really good way to see if you have a leak in your NKE. If you see the number of mbufs in use going up, you probably forgot to release one somewhere. So you might want to check that when you're debugging your NKEs.
From inside the kernel, from GDB, you can look at mbstat; it's gonna give you some stats about the allocations, the number of drops, and things like that. And again, it's gonna give you some information if you forgot to release some mbufs in your NKE.
Another thing that we have in terms of kernel extensibility is kernel events. What we have is basically a new domain, PF_SYSTEM, which gives you a way, by listening on a socket, to see some kernel events. So it's going to report events from the kernel to user space. Those are pretty low-bandwidth events. Usually the kinds of events we have are things like: on your interface, the link is up, the link is down, things like that. Or the IP address changed, so you will receive an event if you listen on this socket. This is used mainly by System Configuration, which is pretty much the main user of these events. You can also add your own events to this mechanism. If you have some specific driver, or some specific driver family that you added, and you wanna add your events and have an application listening to those events, you can do that. It's not meant for high traffic, it's just low-bandwidth stuff. So that's PF_SYSTEM.
There is another thing that we added in Darwin, which is PF_NDRV. And I wanna mention, for people coming from Mac OS 9: it has nothing to do with the NDRV you know, the native driver from Mac OS 9. Here it really means network driver. What it gives you, from the socket level, is access to all the packets, to the raw packets. As we said before, if you can, try to avoid doing your protocol, or the things you are trying to do, in the kernel, for all the reasons we stated before. If you wanna do your own protocol in user land, you could use PF_NDRV to get some packets there. As an example, in Darwin you can look at SharedIP, which is the name of the kernel extension we're using for doing the port sharing with Classic. Classic networking is an example: basically Classic listens on PF_NDRV sockets to get its packets back and emulates the OT driver, the DLPI driver, that way. So the DLPI driver is really talking through a socket to PF_NDRV. That's an example for you if you're trying to do those kinds of things. Right now it works on Ethernet.
Okay, funnels. Funnels are a mechanism that we introduced in Darwin. This is not something you'll find in FreeBSD. Why do we have this? Basically, as you know, we're an MP system, so we have multiprocessors. And what we want is to have a mode where we can have performance in MP, and the networking stack that Darwin inherited from BSD is not completely MP safe, let's say. So funnels are a mechanism to give some mutual exclusion, to make sure that when we run the networking code, from I/O Kit up to the socket layer, to the system calls, nobody else is gonna be running in that code at the same time.
So we've got a mutex that we take at the socket layer or at the packet level that's gonna make sure that nobody else can be running in, let's say, the TCP code and doing something at the socket. One problem to think about: you're sending some data on a TCP socket, and at the same time we're getting a disconnect. If we didn't have a system like funnels, in a multiprocessor environment, you could be doing your TCP send while the state of your TCP connection is being modified by the incoming packet. And we don't wanna do that. We are not prepared for that. So the way to do this is to have funnels. Basically, in the Darwin kernel, there are two funnels. There is the network funnel, which is used by the network stack, and the kernel funnel, which is used by the rest of the BSD subsystem. If you remember the diagram from before, inside the Mac OS X kernel we've got the BSD subsystem, which is more or less networking plus file system. So the kernel funnel is used by the file system. What it means is that in the file system, or in the networking, you cannot have two processors at the same time. However, with this mechanism, you can have one processor dealing with packets and processing packets while the other processor is doing a file system sync. So that gives us some good performance in servers in an MP environment, where you can have your Apache or AppleShare server have one processor do some networking activity while the other is flushing stuff to the disk or doing some file access. So, the problem with those funnels is that you need to be aware of them. And the rule I stated, that basically we have one lock at the top of the networking stack at the socket layer, I mean the system call layer, and one at the bottom, is not completely true, so we're gonna go into some of the details of what you need to do in your NKE to deal with funnels.
But that's a difference from FreeBSD, so that's something you need to be aware of. And the thing here is that you need to deal with funnels. You cannot say, I don't care about funnels. Your system is gonna have problems if you don't deal with that right.
Okay, so when to use them. You need to set the network funnel, and specify that you wanna work in the network funnel, in your module start and stop. Why? Because you're called by I/O Kit, basically, when your module starts, and I/O Kit is not running under a funnel. I/O Kit and the Mach part of the kernel don't need funnels; they're already completely MP safe. So when you're gonna be called, you need to basically say, I'm gonna use the network funnel, and there are calls to do that. Same thing for the stop. Timeouts are another case where you need to switch funnels or take the network funnel.
Why? Because the timeouts are called as a direct mapping from the Mach subsystem, and you're not under any funnel. So to run your code, you need to explicitly say, hey, I'm gonna run the networking code, so I need to grab the network funnel first. And there are also things to do with threading: if you create a new thread, you need to explicitly say that this thread has to run under the network funnel. And yes, taking a funnel is a preemption point.
So each time you're trying to grab the funnel, or you're gonna leave the funnel, you can be preempted. Another thread in the kernel can run. So be aware of that. Your state might change when you come back from getting a funnel or leaving a funnel. And if you're familiar with FreeBSD and splnet, splx: in Darwin, they are more or less no-ops. I'm saying more or less because all they do right now is make sure that you're under either the network or the kernel funnel. They don't do any active nesting or anything like that. I mean, in FreeBSD you would protect yourself in a critical section by raising splnet and saying nobody else can enter at this point. On Mac OS X they are more or less no-ops. But if you're under the network funnel, nobody's gonna get in there.
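The "take the network funnel before touching networking state" rule for timeouts can be pictured with a small single-threaded model. This is only a sketch: it does not model real mutual exclusion or preemption, and `funnel_set`, `funnel_held`, `net_update_state`, `net_state`, and `my_timeout` are hypothetical names for the illustration, not the Darwin kernel's funnel calls.

```c
#include <assert.h>

/* Which funnel, if any, the simulated thread currently holds. */
enum funnel { FUNNEL_NONE, FUNNEL_KERNEL, FUNNEL_NETWORK };

static enum funnel current = FUNNEL_NONE;
static int net_value;   /* stands in for protocol state guarded by the funnel */

/* Switch to funnel f and return the one held before, so the caller can
 * restore it afterwards. In a real kernel this is a preemption point. */
enum funnel funnel_set(enum funnel f)
{
    enum funnel prev = current;
    current = f;
    return prev;
}

enum funnel funnel_held(void) { return current; }
int net_state(void) { return net_value; }

/* Networking code: only legal under the network funnel.
 * Returns 0 on success, -1 if called without the network funnel. */
int net_update_state(int v)
{
    if (current != FUNNEL_NETWORK)
        return -1;
    net_value = v;
    return 0;
}

/* Timeout callback: it is entered without the network funnel, so it takes
 * the funnel, does its work, and puts back whatever was held before. */
int my_timeout(int v)
{
    enum funnel prev = funnel_set(FUNNEL_NETWORK);
    int rc = net_update_state(v);
    funnel_set(prev);
    return rc;
}
```

Calling `net_update_state` directly from an unfunneled context fails in this model, which is the toy equivalent of the trouble you get when a timeout runs networking code without grabbing the funnel first.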
So that's it for funnels. It's something you need to look at, and look at our kernel extensions in Darwin to see how to use them appropriately.
NKE control: there is a way, PF_NKE, to control an NKE from a process, from user mode. So let's say you inserted a socket filter or a data link interface layer filter, and you wanna have some control over it; you can have a special mechanism, like a conduit, to control your NKE through this NKE control, PF_NKE. The NKE manager is not loaded by default, so it's something you need to do. And we're in the middle of changing that, so right now it's a work in progress. It's a character device. And there are some other ways that we encourage to talk to your NKEs. You can go through ioctls: you intercept ioctls and use that as a control mechanism for your NKE. Or also socket options. This is what we do if you look in SharedIP. We use socket options as a control mechanism for the NKE.
So VPN, yeah, we talked about a way to implement a VPN. You could do that as a pseudo device, depending on the type of VPN you're doing. Or you could do that as a DLIL filter. It really depends on how your code is organized and which level you think is appropriate for you to plug in.
Be aware that KAME IPsec is coming to Mac OS X. So if your VPN solution is using IPsec: right now, KAME IPsec is in Darwin. You can take a peek at that, and sometime in the future this will be part of the Mac OS X kernel. So you can build your own Darwin kernel with IPsec right now and use that as a base for your VPN solution if you're using IPsec. And talk to us if you're interested. We're really interested in knowing how we can help you with that.
Summary: I just want to go over it again, because the message here is about the rules for NKEs. You have to be really careful about your dependencies. You have an I/O Kit mechanism to say, "Hey, I'm linking against that kernel. I don't want to be loaded if, you know, it's version 15 of Mac OS X." You've got to keep track of your resource usage. You know, nobody's going to clean up after you. You have to do it. Do not block on the fast path; you're gonna block the whole networking stack here. You're part of the networking stack, so you just have to behave and follow those rules. You have to know your split funnels. Be really careful about that. Remember that timeouts, and anything that is coming out of the kernel funnel or the rest of the kernel, are probably not under the network funnel. So just check that, and be aware of binary compatibility.
In the future, as we said, IPsec and IPv6 are coming; they're part of Darwin. You can look at that in the Darwin kernel; we're gonna be based on that. PPP extensibility: there will be a way to get some plugins for PPP, if, say, you wanna have PPP over ATM or PPP over some other new cool media you just invented. The NKE control via the socket API is in flux, so it's gonna change a little bit. And also in the future, instead of funnels, we're planning to get some finer-grained locking of sockets and things, so we don't have to use the funnel mechanism. So that's pretty much it. I hope you got some information there. You can find additional resources here. Mac OS X, of course, you've seen those. I'm gonna go fast through those. Darwin.org and FreeBSD are also good points for information.
Resources: the Stevens book that we talked about. The Design and Implementation of the 4.4BSD Operating System is also pretty interesting. However, be aware of the differences we talked about, like funnels, DLIL, and things like that. The book gives you a pretty good idea about what's going on, but it's not exactly what we have in Darwin. And there's the network kernel extension PDF file about all the types of DLIL filters and socket filters you can do. This is where you're gonna dig into all the details about how to do your NKE. And also the kernel extension tutorial from I/O Kit, which is gonna tell you how to build NKEs with Project Builder and what the rules are for dependencies and things like that. And tell us what you need from us. We're trying to make this extensible. We've got some mechanisms that we think cover some ground. So tell us if you need more, or what you think we should change here. We're really interested in getting your feedback on this.
And the roadmap: this morning was the networking overview session, which is done already. And this afternoon, pretty interesting, is session 303, network configuration and mobility. We'll see how those guys are using the kernel events to get some state information from the stack. And Thursday morning we'll all be there for the network feedback forum, which is in room J1, just next door. It's always fun. So with that, who to contact? Contact my boss, Vincent. Thank you.