Daemon Wrangling with Launchd - WWDC 2006

OS Foundations • 34:28

If you write software that runs in the background, you need to become best friends with launchd. Introduced in Tiger, launchd provides a central facility for managing all the various daemons and timed services on your machine. Sysadmins and service developers alike will gain a deeper understanding of the how to utilize the full power of launchd.

Speakers: Dave Zarzycki, Kevin Van Vechten

Unlisted on Apple Developer site

Check out Bezel, our iPhone mirroring app →

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Welcome, everybody. This is the last session of the day, Daemon Wrangling with Launchd. It's actually more generic than that. We're going to be talking about process management in general. So first, what is launchd? It's a rocket launcher. Catapult, now. It's a program to bootstrap and manage other programs.

Now, I'd like to know, before we get too deep into the session, are you a developer or are you a sysaadmin? Let's do developers first. All right. Now let's do sysaadmins. OK, so some of you do both daemon. Awesome. Like I said, we're going to cover issues for both of you. Launchd is both technologies.

Now, why did we do launchd? Isn't this just another reinvention of the wheel? We had extensions. We had Unix's etc.rc.d directory. And we had Mac OS X startup items. What are the advantages? Well, I'd like to bring Kevin Van Vechten on stage. He worked on System Starter in the Mac OS X days beforehand, and he's had a lot of experience with other operating systems. He's my sounding board and my manager, and he will explain the big picture. Welcome, Kevin.

Thank you. So remember about five years ago now, there was a huge push for Mac OS X. And Mac OS X was billed as Unix for the rest of us. It had a lot of great features. It had a lot of things that made it modern compared to Mac OS 9. Virtualized address space, protected memory, inter-process communication between those protected address spaces. Preemptive multitasking was a big one. And then, of course, we pushed to a callback model as opposed to a polling model. for event delivery.

So it was, all these things are true and it was a very modern system. But a couple years ago, I remember we had Year of the Notebook. And an interesting thing happened in the Year of the Notebook. It really made it obvious that hardware had become more modern than Unix was. Hardware had a lot of hot plug devices like FireWire, USB, Ethernet, AirPort, even SATA.

So devices can come and go all the time. You have dynamic protocols, DHCP, LDAP, so a user group that might have been defined a moment ago on your system, could have just been revoked, or perhaps somebody new was added. Or your IP address could get changed because your DHCP lease expired. And these were things that Unix is not accustomed to dealing with historically.

So there were a lot of historical artifacts that were really no longer applicable in this new world of completely dynamic mobile computers. Taking, for example, the etsy.fs.tab file. This is a text file that describes what disks should be mounted on your system. Well, this really doesn't make very much sense when you have FireWire disks and USB disks that can be plugged and unplugged at a moment's notice.

Applications like the Finder, the system, they really need to adapt dynamically to look for files that are on these disks when they appear and to gracefully forget about these files when the disks disappear. Or ask the user to reattach a disk that was attached not a moment ago so you can finish doing whatever you needed to do.

So if fstab wasn't really applicable in this new world, is etcrc any different? I mean, this is a script that choreographed the startup of the system. But if disks come and go, well, you can't really assume that they're going to be the same disks there when you start up. Then 10 minutes later, why assume that they're there at startup at all?

And another big difference with Mac OS X is unlike other Unix systems, there really isn't a dedicated system administrator for most Mac OS X systems. Most Mac OS X systems are used by a single user, no administrator, really. There's nobody who can go in and edit etcrc to adapt to whatever needs to be done. It's something that we need to do dynamically. So launchd comes into the picture. What launchd provides is an unambiguous way to declare dependencies.

It's based on the actual primitives that processes need to communicate with one another. If there's a dependency between two processes, they must be doing something that allows them to collaborate. Maybe they're sending information across a socket. Maybe they're both looking for the existence of a common file. Maybe they're sending Mach messages over Mach ports. But the fundamental idea here is it's not good enough to just say this process depends on that other process.

You really need to think about why does it depend on that process? What is the OS? What is the Mac OS 10's best mechanism that's using? What is the manifestation of the dependency? And if you can describe the dependencies in those terms, then they're not ambiguous and you can be very certain about when that dependency has been fulfilled or not.

In a sense, launchd is really kind of like a process factory for processes, if you're used to object-oriented programming. You issue requests either by sending data to a socket or creating a file, and launchd will notice this activity and then automatically launch on demand a program for you. It's a factory for these processes.

So this means that we can now virtualize our processes. You don't actually have to have this MySQL daemon running or the X11 server running. You just have to have a socket there that could be written to. And as soon as an application writes to it for the first time, launchd can notice and launch that application on demand.

This means that you're using less resources at idle time. And only when you're actually using a service does it become available. But it also means that you don't have to worry about the service not being available. Because it's always there virtually. It just may or not be backed by an actual process at any given time.

In this respect, launchd is really a process superclass. It takes care of a lot of the things that background processes need to do. No longer do you need to daemonize. It can take care of all the steps of changing your standard out, standard error, standard in. And really, you just go right out the gate from the main function in your program and you're ready to be run as a background process and you've been launched on demand by launchd.

And launchd manages all these processes. It can keep them alive if that's what you've requested to do, so if your process exits, maybe it'll get launched again automatically. Or maybe it's only supposed to run once and exit, and that's fine, launchd won't run it again. It's taken a very central role in Mac OS X.

Launchd is not just a process manager like iNetD or xInetD for which it has compatibility, but it's actually PID1 in Mac OS X. We don't have init anymore. Launchd has subsumed all the functionality that init provides on other Unix systems. And now in Leopard, we've really enhanced the integration so much that there's at least 100 different jobs in launchd for the different services that the system provides.

Things like the Windows server or SSH daemon, that type of thing. And we're growing this list every week. More and more services within Apple are being converted over to using launchd. So it's becoming very pervasive in the system, and we think it really makes sense for you to use launchd when deciding how and when your applications should be launched automatically.

So launchd is open source. It does play a very pivotal role in Mac OS X. We want you to see how it works. We want you to be very familiar with launchd. And we're also very pleased to announce now that launchd is available under the Apache 2.0 license. And we would love to see launchd-- We would love to see launchd be adopted by other operating systems, whether it be FreeBSD or Linux or anyone else who's interested.

And of course, we would love to see standard system daemons that are found on all these various platforms adopt launchd for their process management needs. And there's a new project page, launchd.macosforge.org. There's a blog there. That's where we'll be posting links to updated sources to launchd and news about launchd. And hopefully, if any other systems are interested, that's where you can find resources about porting it to other systems.

So again, why launchd? Well, previously, OS designs were really kind of a house of cards. You were either relying on handcrafted ordering. So some administrator had to think in their head, OK, what's the order all these things should happen, encode that in a file. And if anything changed, you'd have to remember to go into the file and edit it and update it. And so that's a bit tedious. It also means that if one step got wrong, then that failure could cascade quickly, and you might end up in a completely different place than you wanted to be just because you forgot about one step.

So Apple had some goals. We needed resiliency. We needed recovery from failure. We needed to not rely on assumptions about what the hardware would look like, because with FireWire and USB and 802.11 wireless, we really don't have much we can assume. Things can look very, very different from one minute to the next. So we're really encouraging a model where applications are robust, they respond to errors gracefully, they try to perform some action. And if it doesn't work at the time, they have to deal with that. And hopefully, you can try again later and succeed.

Also, as I've mentioned before, we really encourage a model of using callbacks for notifications of when services and events happen. We really don't want polling to occur on the system. So, again, you may be thinking, well, what was wrong with System Starter? System Starter didn't require you to edit a net CRC file, and you could just install your startup item, and that worked, right? Well, it kind of worked.

There were some problems with it, though. One is that System Starter used what we call explicit out-of-band dependencies. There was a list of strings in a startup item that said, here's what I require, and here's what I provide. And System Starter would order those in a reasonable fashion.

But this is error-prone, because the dependency strings, well, they could be overstated. So, perhaps somebody in their startup item says, I depend on the network, I depend on the LDAP server, I depend on the X11 server. And guess what? They didn't really depend on the LDAP server. But it's listed there. System Starter has no way to know that that's a false entry. So, it'll affect the ordering of the items. Sometimes dependencies were understated.

Sometimes somebody would say, well, I depend on the web server. And at that time, the web server also depended on the network. So, the network would be up when the web server item launched, and then eventually your item would get launched, and everything would be happy. But if we ever changed that, if we ever said, oh, well, we can go ahead and launch the web server before the network comes up, because it'll be accessible on the local host IP, right? Well, you don't have that network dependency captured anymore. So, maybe that's not really what you wanted to have happen.

So, assuming dependencies by virtue of other dependencies is really error-prone, and it really tied our hands in changing what the startup items did, because we didn't know what parts of the startup item people were really depending on or not. And, believe it or not, System Starter actually had some pretty bad race conditions. What do I mean by this? Well, historically, most Unix background processes called the daemon call to background themselves.

If we show a little timeline, this could cause a problem. If you have System Starter running, and you go ahead and launch a startup item, then the example daemon's in its main routine. And that is a child directly of that startup item. However, if it goes ahead and calls daemon, System Starter's waiting for the child process. But now the child process is just about to exit, and it's the daemon that's going to continue living.

Which means, at the same time the child starts running, the parent reclaims the process. And so now what you have is System Starter possibly concurrently spawning the next startup item, at the same instant that the actual previous startup item started running. Which means the dependency really wasn't fulfilled.

Now, through a bit of luck and the fact that usually on Unix, the child process is the first one to get scheduled, things tended to work, but it really is coincidence that things worked at all. Launchd on the other hand, doesn't want you to use daemon. It'll launch your process, it's already backgrounded, all of the daemonization has been taken care of for you.

So Launchd can keep track of when your process is ready, when your process is alive, relaunch it if necessary. It's much more explicit, it's much more controlled. So, like I said, these are possibly running at the same time, which was not necessarily the results that you wanted. So, I'll hand it back to Dave to talk about the daemon. about the details of using LaunchD.

All right, thank you, Kevin. All right, now to get into the meat of things. I'm going to start with the big picture. What are we going to cover? For developers, we're going to first talk about where your code fits in to the system, because you've got to start somewhere. When does your code get involved? How does your code resolve dependencies at a real nitty gritty level? And how does your code and your daemon check in with launchd to solve the other end of the equation? For system administrators, why launchd adoption is good for you?

Now fitting into Mac OS X. We're cleaning the OS up into basic layers. We were a little bit fast and loose before. We're going to try and push people into some very concrete layers. These are the layers you're familiar with today. We kind of have per machine things and per session things. You might find Safari, iTunes, System UI Server running in your session. And you might find, you know, diskRBD, kexD, configD running as a daemon.

These are pretty defined buckets, but, you know, through code, you can kind of blur the layers. And we're trying to actually begin to enforce them and make sure people cleanly design their code. We're also investigating in Leopard, but it isn't there yet, but we're investigating it per user layer, which you may see someday. We expect things like iTunes agent, iChat agent, Spotlight plugins to live there. Essentially, these programs would transcend login sessions.

So, you know, just because you've logged out doesn't mean that some programs running in the background changing disk stuff and we need your plugins to analyze them for Spotlight or, you know, you might want to share your iTunes songs after you log out. These are all, you know, ideas we're experimenting with, but it's a layer that makes sense. And we hope we can help you fit into that someday. Now, the relationship between the layers.

Peer-to-peer or downward inter-process communication is what we're encouraging to help you create better design products. What does that look like? We have three basic layers. Recession at the top, per user and per machine. So you have apps running at the top. You might have agents running in the middle and daemons running on the bottom. And yeah, side to side, that's fine. You're in the same layer. Everybody's all friendly, buddy-buddy, good stuff. But what we only want to see is downward communication.

You know, an app can talk all the way to a daemon, or it can talk to an agent, and the agent can talk to a daemon. The important part is it's down. Well, why is that? Why only downward? Well, we have a machine-wide context. There's one of those. Uh-oh. We have more than one user on the system, possibly at the same time. We can have Bob's background problems and Sally's background programs.

There are many layers above a program. Knowing which one to talk to is a problem. Whereas if you go downward, it's implicit. You only have one per user context. You only have one machine-wide context beneath you. In fact, if we look, we have GUI sessions, we have command line sessions, and we may have other sessions in the future.

So trying to deal with the scenarios where a new session is created above you that you may have never heard of might throw you for a loop. And it's best if you can just push as much code as possible up into the top so everything becomes implicit. If you need to draw UI, have code running in a session. If you don't, maybe move it down a level.

Exactly. Downward IPC encourages good design. So what are the purpose of these containers? The per session? Thank you, interaction. Or programs that need to come and go with the session. They just don't make sense otherwise. The per user? don't need UI at all. And if you need to transcend a log in or log out, this is the perfect container for you. But you're still running as the user in the user's context. A lot of things are still implicit when it comes to permissions and access control and access to resources.

And finally, per machine. This is one thing I'd really like to stress. We're really trying to push people, as they evolve their code bases, to keep the minimal code required to arbitrate or share hardware in this layer. If it's doing anything more, consider moving it up a level.

This will be the best for security in the long run for the product because that's the most risky container of all. So getting started. All right, that's great and all that there's this high level vision. Let's get started. We need to install a property list describing our job into the right location.

Well, which location? If you went to the Unix talk before this, you may have heard us talk about the Next style directories. One of them, for example, is /library. That's a local machine configuration. You can put per machine jobs in library launch daemons. You can also put per user jobs or per session jobs in network library if you want to put it across all your machines out on the network.

And you can put it in the local machine if you only want the agents to apply to the users on that system. Or as a user yourself, you can give yourself per user or per session agents that only apply to you. Think like cron jobs and whatnot, all these little jobs that just apply to you. You can put them in your home directory and launch, and it'll pick them up.

Well, what does a property list look like? It looks like this yummy XML goodness. But we're not going to dwell on that slide for a second. The real meat of that is you need three keys and three values. You need a label, which uniquely identifies the job to launchd and a string that we can actually print out on the command line. And it's just better for everyone if we can actually give it a name. We have program arguments. It's the pre-tokenized array of strings.

It corresponds exactly to that second argument of main. This trips people up occasionally. So they might, in one of the array arguments, say dash c space some file, not realizing it needs to be dash c is one element in the array. And then the actual next argument is the next one. So you've got to map how you think about things in your head as arguments to how they actually come out in the argument vector.

And finally, on demand. The most simplest case of a launchd job just keeps a program running all the time. There's no on demand-ness to it. So you need to actually tell launchd that, because the default is to have jobs be on demand. So set it to false. Launchd goes, OK. And just keeps the job alive as long as it's loaded into launchd.

[Transcript missing]

Net dependencies. Kevin talked about them at a big picture level. Let's talk about them at the low level. Forget everything you knew about Bootstrap. System starter and all that? No. We're doing it differently. Programmers work with contracts. That's our bread and butter. Contracts define expectations and promises between the producers and the consumers. There are two kinds of contracts when programming. There's programmatic, what most people are familiar with, and it's every API you've ever used. There's also data-driven.

There's a lot of data. And all the P lists we're talking about, that's the contract you have with launchd about your job. What you're doing is you're telling launchd whatever you want about yourself, be it the multitude of keys that are available in the launchd P list, or maybe what you're not. Maybe you're going to let launchd implicit some things.

So for example, if you don't set standard out or standard error in your P list, you'll just inherit it from launchd. Again, think object-oriented. And that can be really powerful if you're working with launchd. You can you can do some neat inheritance tricks. But those are the two contracts we're going to be dealing with. And how do they pertain to dependencies?

For networking changes, we'll tell you to use the system configuration framework. For disk changes, the disk arbitration framework. For device discovery, the I/O Kit framework. There's a pattern here. Other changes require other API frameworks. APIs are in fact fluid. Let's give an example. Goes all the way back to the 1970s. Get PWNAM. Not a pretty name, but it works. Takes a login name, returns a structure representing the account in a Unix system.

It previously read ETSI password directly. That's fine. An API is allowed to do that. That was the contract. But later, we changed it to talk to lookupd via IPC. Okay, well, how did we avoid a race between getpwnam being called and lookupd vending the backend information via IPC? All right, time to rewind. Back to the launchd plist. If you notice here, we've added something to the job describing it. Lookupd, in fact, was implemented using Mach. And we told launchd about it.

Hey, we vend a Mach service. The name in the Mach namespace is lookupdaemonv2. And well, because dictionaries are key value based, we need some values, so we stuck true in there. But the important thing is this ended up telling launchd to create a Mach port that can be found. So later, when When getpwnam was implemented, it went out, found the communication for lookupd, and started sending it a message. It really didn't matter whether lookupd was running or not. The communication was prepped.

So what launchd promises you in the race-free IPC setup? We're going to make sure all the communication handles are set up before any program beyond launchd itself starts running. And the nice thing about that is that when they do start running, all the possible services that can be vended on the system are all known. All per-machine jobs will be loaded before any of them start. Again, for the per-user context, per session, it's all the same. We load the handles up, then open the floodgates and let program start running.

But we don't promise you some things. We don't promise networking interfaces are found. We don't promise that auxiliary file systems are mounted. We don't even promise that hardware is done being probed yet. Any detail about the configuration besides what's necessary to bootstrap communication between processes, we don't promise that. So yeah, talk amongst yourselves as processes, have a blast, but it's between you and the various API frameworks to negotiate your dependencies.

And best of all, if you can make fewer assumptions, this will all just work. Now talking to launchd, well how does that work? We have some APIs. They're pretty simple. If you've used core foundation or foundation, things will look familiar. We're dealing with boxed objects. Here we're allocating a new string into a boxed object and then we're extracting it. Real simple container APIs. They're not meant to be high performance. I wouldn't recommend actually writing your app around these container APIs. They're just meant for IPC with launchd.

So the other side of the XML contract when it comes to these APIs, you need to get your IPC handles from launchd. Well, how do you do that? Again, we have this boxed objects. We're going to allocate a string, particularly known string. And we're going to check in with launchd. So hey, launchd, I want to check it. That's what that launch message does. It takes an arbitrary object graph in and returns an arbitrary object graph back. Here we're checking for failures. If it actually doesn't return something, the communication apparently failed for something.

But part two, checking for secondary failures. We can actually extract the type of the object. In this case, we're checking to see if it was an error. Just because you could talk to launchd, doesn't necessarily mean what you asked it to do worked. So here, we're trying to see if there's an error. We extract it. You can review the slides later, but the important part is there's some sample code in launchd project. You can see this in action.

Pass the error checking. Let's actually see something interesting. Here's how to iterate results using the APIs. We have a dictionary lookup, much like you have in core foundation and foundation. We take the response. Here we're looking to see if launchd provided us an advisory timeout. Again, we like launch on demand, and the next bonus points, gravy, extra credit, if you will, is to actually idle exit when you're not needed. So here we can extract an advisory idle timeout.

Don't have to, but if that key exists and if we care, we can use it. The next thing that we're checking for here is if any Mach services are to be vended by this job. The neat thing about this, too, is you can actually compare your own program's notion of what it does versus what the plist said.

It allows for some interesting potential for adding things after the fact into the plist that the original program wasn't expecting. That can be to your advantage sometimes. And then the next one we're looking up are sockets. And again, may or may not be there. It's good to check. But let's say you did get some sockets. We've got to do something with the result of that particular case.

Well, sockets come back as a dictionary, and we want to iterate them. You can provide a callback for the dictionary iterate API. It looks like this. It's your standard iteration API. You have the key. You have the value. And you have a little cookie you can pass through. Here we're just going to print out the results.

And once you iterate the full results coming back from launchd, you will have the descriptors right there as a part of your message payload. And you can add that to your event loop and carry on just as normal. Now, we actually made a demo. This is actually in the stock Leopard system. And with a small little tweak, you can enable this. But we made SSH agent launch on demand within your GUI session.

That means every one of your Leopard systems right now, if you make the small little config tweak, can just type SSH add without any prep work and have an SSH agent launched for your session. And that's really neat because then all the GUI apps, again through inheritance, can get the SSH agent values from it. So if we can switch to the demo machine, I'll show how this works.

So we made a small tweak. Essentially, we did this, sudo touch var db use ls. This enables some code that wasn't quite ready for WWDC, but it's where we're trending in the OS. And what this did was actually make all the GUI apps launch via the per session launchd. So if we look at all the programs running around on the system, we have the per session launchd here at the bottom. It's PID 192. It's parent PID is 34. 34, if we look at the list, is-- oh, right above it-- it's login window. That's our session.

And if we look at all the apps running within the session, this block right here, notice that Spotlight's running as a child of the per session launchd. The dock is a child. The finder's a child. Everything running within the session is a child. Well, that's great. Now it's inheriting things from launchd. Environment variables, standard I/O, inheritance is good.

How does this actually affect things? Well, if we look at the list closely and grep for the SSH agent, we see that it's not running. But if we talk to launchd and say, list your jobs, we actually see that the SSH agent is loaded into launchd. It doesn't have a PID assigned to it yet, because it's not running.

If we actually look at some of the configuration details about it-- So you remember the check-in I was talking about? You'll get a socket back. It's a dictionary. In this case, it's using listeners as its own key for its protocol. And then it has a descriptor that it'll get back.

But what you don't see here is that, again, due to inheritance, every process launched from launchd is getting this environment variable. In fact, we can ask launchd to export its list of environment variables that it's exporting. And what do you know? Oops. All right, well, there's a bug. But it did work because terminal did get the variable. And because we got it, we can now say ssh add. Oh, and probably it should generate a key first.

And now when we say ssh add, we added it. And if we look, it's now running. And if we check the parent-child relationship, We noticed again 192. 192 was, if we scroll all the way back, the per session launchd. So now we can open new terminal windows, say SSH add. And what this means is we can SSH to systems.

And in the case of Xcode, which has support for CVS, it means that when it does a CVS check in or check out, it can use your SSH keys now as far as talking to your CVS server, which some of you, I would imagine, might actually use. So this is a nice feature. We'd like to actually see more processes adopt launch on demand so they can vend their services as they're needed. And that is our demo. If we can go back to the slides. Thank you.

So this concludes our talk. I am putting my email up here. I can't promise that I'll get back to you, but I do try very hard to help people in their questions. Launchd is the bootstrap daemon, so it would be in my best interest to help bootstrap people with their questions.

So if you need, you can try and email me. But if that doesn't work and you find a bug and you just want to report it, we have a mailing list you can send messages to and that's another way you can get a hold of us. Also, there's documentation in the