Monitoring Your System with ARD and Open Source Tools - WWDC 2005

Enterprise IT • 48:47

Knowing what your systems are doing is essential to keeping them up in both 24/7 enterprises and 9-to-5 workgroups. Mac OS X provides hundreds of pieces of information about itself, from the Apple Remote Desktop 2 SQL database, to log files and output from utilities like fs_usage. Learn how to access this data, and build custom tools, workflows and reporting mechanisms (using tools like Automator, PHP, Jabber, and the open source tool Nagios) to proactively monitor and manage your systems.

Speakers: Mike Lopp, Todd Dailey

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

Good morning. My name is Michael Lopp. I'm responsible for desktop management technologies at Apple. We're going to talk to you about a lot of different stuff today, focused around monitoring of systems using Apple Remote Desktop as well as a bunch of open source tools. To kick it off, an interesting thing happened yesterday at the server feedback forum.

We had one gentleman over here saying, I'm a command line guy. Make sure I can do everything from the command line. Make sure the man pages are there. And we're like, yes, absolutely. We want to do that. And then someone from K-12 stood up over here and said, we want ARD to be really simple to use and do all this.

So there's clearly a need for both of those. And I'm here to tell you, you can have it both ways. And that's what we're going to do during this session right now: starting with ARD, we're going to give you some good demos, and then we're going to start going deeper and deeper into the foundations of Mac OS X. So to kick it off and give us some context, I'm going to bring Nader Nafisi up. He's the product manager for server and storage software.

Great. Thank you, Michael. Good afternoon, everybody. So I'm going to give you a quick overview about what monitoring is all about. Hopefully you all know why it's important, but just kind of review it very quickly, and then we'll let Michael and team take it away with more demo and content.

So basically, monitoring, you know, why is it good, why is it interesting, it's really there to help you figure out what state your systems are in. You know, if they're up, if they're down, maybe somebody's abusing it, that kind of thing. So you really need to have really great tools and capabilities to be able to detect that.

And the really cool news is that Mac OS X has a lot of capabilities built into it, either at the command line level or at the GUI level, to let you do that, and Mac OS X Server gives you even more. And what's even better is that there's a lot of open source tools. Just because we're a Unix-based operating system, a lot of the existing open source tools have been ported over very easily to Mac OS X as well. And so what we're gonna do today is talk about how you can extend Apple Remote Desktop to better monitor your systems. And this is more out of the box type thing, you know, it's not exactly within the GUI, some of these capabilities are a little more hidden.

And we're also gonna talk about how you can use Server Monitor to extend that as well. Again, as Michael said, there's a GUI, which most of you guys have used, there's also a command line version of that, which some of you may not have used, and we're gonna talk about how you can extend that using Python. And last but not least, we're gonna also give you an example, again, there's a lot of open source projects out there you can use for monitoring, but we've picked one of the more popular ones, and we're gonna talk about that a little bit.

So monitoring, if we look at it, there's four main categories of ways you can monitor your systems. And probably the top of the list is assets, monitoring your assets, and essentially monitoring your hardware, you know, things like has memory been removed from systems. I was at a presentation on Sunday where Stephen Doyle from Edith Cowan was talking about actually somebody had added memory to their systems. So you might see that as well. So you might want to let those kind of things go by.

But other things as far as monitoring goes, security. Again, that's kind of a very important thing these days. So making sure that nobody's abusing systems, you know, as far as intrusion, those kind of things. Something that's very common when it comes to servers is monitoring the services that that server is providing. So, you know, monitoring the health of a web service, making sure that it's chugging along fine publishing web pages. And last but not least is hardware.

We've got a rack of Xserves here, and again, monitoring them, making sure they're all nice and cool and happy is also important. And so if we look at the desktop side, there's a couple of technologies and tools available there to let you do that. Close and dear to my heart is Apple Remote Desktop 2, but there's also a lot of other GUI tools out there as well as command line tools. And if we look at server, again, a lot of the tools on the desktop are also relevant to the server, but there's also some server-specific tools out there, like RAID Admin and Server Monitor, that you can use as well.

And as I said, there's a whole bigger universe out there as well, and that's the open source monitoring tools. And here again, we've got a couple of examples here, and later on we're going to be talking about Nagios and how you can use that on Mac OS X and Mac OS X Server to monitor your systems. So with that, I'll hand the keynote thing back to Michael.

Thank you, Nader. OK, so back to ARD, a product near to my heart as well. So as Nader said, I'm going to give you guys some demos here about how to extend ARD a bit. But I do want to talk about what other things it does as well.

You hear me say this all the time. It's a Swiss Army knife to me. It has all these little doodads on it, all these different things that you can actually do not only for monitoring, but for software distribution, for blasting bits out to all your machines, for asset management, doing a lot of reporting against all of your infrastructure, remote assistance. The reason that we have those binoculars on there is being able to obviously see your machines, as we'll be seeing here in a second, and then remote administration as well. So ARD has it all.

But there's other things that you want to do. You have problems, like my mobility-- I have mobile computers and they're wandering all over the place. How do I manage these machines? How do I monitor these machines when they're not there, and that's a problem. You've got, I want this specific piece of information, it doesn't happen to be in one of the many reports that ARD currently has. I'm missing a task. You've got all these different tasks that you can do in ARD, but you've got something else that you want to do.

And then lastly, I want to see, as Nader was talking about, I want to see changes in my lab over time. The current reports are a snapshot of what's on my machines right now. So let's go take apart each one of those and give you some demos, as well as maybe inform you about one feature that you may not know about in ARD right now, which is offline collecting.

So here we are, a very basic diagram of ARD in the enterprise. You have the admin and you have your machines there on wireless. How do I do a report against these when these machines aren't here? It's great when everything's live and everyone's sitting on the machine, but when it's gone, it's no longer there.

What we have in ARD 2 is offline collecting. What it allows you to do is define a reporting policy and send it down. This is a task in ARD. You actually send that down to all your clients. It says to these clients, kind of like a cron job, every night at 4 a.m., send up the system overview to my machine, to my admin.

That's great if your admin's online. Your admin's not always there. So what we have is the offline collector. It's basically a server process that's up all the time. This is the agent in our Postgres database that's sitting there all the time.

So when these machines aren't there, or when the machines aren't there, let's do that again. When the machines aren't there, obviously they don't report in. When they get to the next time that they're on the network and they can report in, they actually send that information up to that database. So the admin doesn't need to be online to do that, and the clients don't need to be online to do that as well.

So what happens when you run this report from your admin and you've got this cache is, if the machine's not there, you'll actually see the most recent set of data for it. So it's a really handy way to do a lot of reporting against deployments that have a lot of portables. So let's keep moving. So the other problem is I've got all of this other information that I want to get to, and I can't actually do it.

Or maybe you've got a specific set of information you want to get at. Here's the good news about ARD. We've got two things that we're built on. First off, as I've already alluded to, Postgres is built right into ARD. It's a SQL database. That's where we put all of our reporting information.

As you'll see here in a second, you can get in there. You can get access to that information to develop lots of different tools against it. And, of course, we're Unix. We have millions of command line utilities. So let's actually go to demo three here. I'm going to give you guys a couple examples of doing some monitoring here against some more machines. Demo three.

[Transcript missing]

Click on Send Unix. And what I'm going to do is I want to find all the PDFs on these machines. Why do I want to do that? Well, I know there's PDFs on these machines. But what I want to do is I actually want to use Spotlight.

Now right now, Spotlight's sitting up here on the right, so the question is, how am I going to actually use Spotlight on these machines? There's a command line utility, which is called mdfind, which goes and makes a call straight into the Spotlight cache. So what I'm doing here is mdfind. I'm going to look for PDF, and then I'm going to go ahead and send this command to my Xserves.

It's running, and here's all the information. I just ran a remote Spotlight search using ARD 2. So, there's a million other command line utilities there, but what you can see here is that with the Send Unix command, you can actually extend a lot of it. So, you can see all the PDFs that are associated, and you also saw how fast it was because Spotlight's really fast. So, that's one thing. Excuse me? Oh, you want to zoom in? I'm actually not set up to do that. It's a long list of PDFs. Sorry.
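
As a rough sketch of what sits behind that Send Unix command, the snippet below wraps `mdfind` in Python. `mdfind` and its `-onlyin` option are part of Mac OS X; the helper names `build_mdfind_command` and `find_files` are made up for illustration, and the search itself only works on a Mac with a Spotlight index.

```python
import subprocess


def build_mdfind_command(query, only_in=None):
    """Return the argument list for a Spotlight search from the shell."""
    command = ["mdfind"]
    if only_in:
        # -onlyin restricts the search to one directory tree.
        command += ["-onlyin", only_in]
    command.append(query)
    return command


def find_files(query, only_in=None):
    """Run mdfind and return the matching paths (requires Mac OS X)."""
    result = subprocess.run(
        build_mdfind_command(query, only_in),
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

On a Mac, `find_files("pdf")` would return one path per match, which is the same list the demo shows coming back from each machine.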

Excuse me? Command, option, plus. Command, option, plus. Zoom. No, it's not set up. I told you. OK, afterwards, come by. I'll show it to you right in person. So the other one I want to show right now is actually-- Nader was talking about security. So we don't have a security report, per se, in ARD. But what we do is, again, we have all these command line utilities available. So what I want to do is I want to see who's logging into my machines. So again, using the Send Unix command, we go ahead and use last.

What last is, it says, show me everybody who's logging into my machines, whether via SSH or via login window. Again, huge amount of information coming in that you're not going to be able to see in the back. I apologize. But again, here you are. I can see that people have been logging. Apple's been logging in via the console. They've been logging in via SSH. Again, the point is simple.
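
Once the `last` output comes back from many machines, a small script can summarize it. The sketch below is one way to do that in Python; the sample text is invented to resemble typical `last` output, and the split into "console" versus "terminal" sessions is a simplification (a terminal line may be SSH or a local Terminal window).

```python
from collections import Counter

# Invented sample resembling what `last` prints; not captured from the demo.
SAMPLE_LAST_OUTPUT = """\
apple     ttyp1    192.168.1.20     Tue Jun  7 09:14   still logged in
admin     ttyp2    192.168.1.31     Tue Jun  7 08:40 - 08:55  (00:15)
apple     console  localhost        Tue Jun  7 08:02   still logged in
reboot    ~                         Tue Jun  7 08:00

wtmp begins Mon Jun  6 10:00
"""


def count_logins(last_output):
    """Count (user, session kind) pairs in the text printed by `last`."""
    counts = Counter()
    for line in last_output.splitlines():
        fields = line.split()
        # Skip blanks, the trailing "wtmp begins" line, and pseudo-users.
        if len(fields) < 2 or fields[0] in ("wtmp", "reboot", "shutdown"):
            continue
        user, tty = fields[0], fields[1]
        kind = "console" if tty == "console" else "terminal"
        counts[(user, kind)] += 1
    return counts


for (user, kind), total in sorted(count_logins(SAMPLE_LAST_OUTPUT).items()):
    print(f"{user:10} {kind:10} {total}")
```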

Send Unix allows you to multicast commands out to all your machines, and there's a huge amount of command line utilities out there. So to keep on going, let's go back to slides. Slides. I'm actually going to bring up Tony Graham. Tony's actually written a solution in REALbasic to actually do the asset tracking and system trend reporting. Tony's one of our system engineers.

Come on up, Tony. Can I get demo 2 up, please? So Michael said the Apple Remote Desktop tool collects system information reports, and it stores them in a Postgres database, but you only get that last bit of information you collected from each client. How many of you in the room would like to be able to track this stuff over time? Many. Okay. So it's actually pretty easy.

I chose REALbasic as a developer tool because it's got a very shallow learning curve. It's well suited for system administrators who don't want to learn C or Java. It's very graphically oriented and has built-in plugins to talk to databases. So I can go to my File menu in REALbasic, add a data source. I'm going to specify Postgres.

The database name would be ARD, username would be ARD. Any guesses as to the password? ARD. And I can now double click on this database here and see all the tables in it. Any experienced database administrators, people who use SQL and that sort of thing? Yeah? Alright, great.

The system information table has just about everything you're interested in. It's all the stuff the report has pulled. So you can hit a basic query there. You can also do a more advanced one. Here's one. That's going to get the distinct list of property names from the database. These are the things you can track. How's that? Things like Ethernet address, your computer name, IP addresses, all that good stuff is in there.

I should back up and say that the database is not available to you until you unlock or roll back some of the security mechanisms that are in place. There are two configuration files that you can edit to do that, or you can download a tool written by Mike Bombich called Atom. Click a button, put in a password, and it will unlock that database for you.

So let's build something very simple here. I'm going to take a pop-up menu, drag it onto my standard window, stretch it a little bit wider, and a database query control. And the database query control over here on the right, I can say that is the ARD database. And I can say select distinct property name from system information. This is standard SQL. The Postgres book by O'Reilly is an excellent source for some of this stuff.

And I can do a binding to that pop-up menu and say I want that pop-up menu to show everything that my database query has returned. Now we'll drag a list box onto the window. A two column list box, I want that to grow with the window, and I want in fact the window to be able to grow.

And I need a new database query control for that list box. The pop-up menu that's going to give me the unique list of all items is going to feed the selected item into the second query. So the database, the second database query will take that and do something with it.

What it's going to do is use the ARD database. This one's going to get the computer ID, which is the MAC address, and that's kind of the unique flag that you're tracking all of your fields with, and the value from the system information table where the property name is equal to, and then this is a REALbasic thing, %1, which will be passed to your database query by that last pop-up menu.

Finally, I want to bind the list box and say any results that come from that database query are going to go in there. So let me just double check and make sure all that stuff is there. We run it. Should have a unique list of all of the things that ARD tracks. Kernel version.

There we go. So you've got tremendous power under the hood, but you can expose this now to end users. This application will compile and run on multiple platforms.
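
The two queries in this demo can be sketched outside REALbasic as well. The example below uses Python's built-in `sqlite3` with an in-memory table as a stand-in for the ARD Postgres database, so it runs anywhere. The table name `systeminformation` and the columns `computerid`, `propertyname`, and `value` are guesses based on how they are spoken in the session, not a documented schema, and the rows are invented. The `?` placeholder plays the role of the `%1` parameter fed by the pop-up menu.

```python
import sqlite3

# Stand-in database: sqlite3 in memory instead of ARD's Postgres.
# Table and column names are assumptions, and the rows are made up.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE systeminformation ("
    "computerid TEXT, propertyname TEXT, value TEXT)"
)
connection.executemany(
    "INSERT INTO systeminformation VALUES (?, ?, ?)",
    [
        ("00:0a:95:aa:bb:01", "KernelVersion", "8.1.0"),
        ("00:0a:95:aa:bb:02", "KernelVersion", "8.0.0"),
        ("00:0a:95:aa:bb:01", "ComputerName", "lab-01"),
        ("00:0a:95:aa:bb:02", "ComputerName", "lab-02"),
    ],
)


def property_names(conn):
    """First query: the distinct properties that feed the pop-up menu."""
    rows = conn.execute(
        "SELECT DISTINCT propertyname FROM systeminformation "
        "ORDER BY propertyname"
    )
    return [name for (name,) in rows]


def values_for(conn, property_name):
    """Second query: one (computer, value) row per machine for a property."""
    rows = conn.execute(
        "SELECT computerid, value FROM systeminformation "
        "WHERE propertyname = ? ORDER BY computerid",
        (property_name,),
    )
    return rows.fetchall()


print(property_names(connection))
print(values_for(connection, "KernelVersion"))
```

Passing the selected property as a query parameter, rather than pasting it into the SQL string, keeps the second query safe even if a property name contains quotes.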

[Transcript missing]

An XML text file in RSS format, and so you can generate your own reports very easily. This is just Safari RSS loading a file that's on the local hard drive, which you'd be able to sort by date, by time. The article length slider here will allow you just to show the most recent value. And again, this is something you can use to publish this data to other folks in your organization that perhaps aren't running Apple Remote Desktop or maybe even not using Macs.
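
The part of the demo that produced the RSS file is missing from this transcript, so the following is only a guess at the general shape: a minimal RSS 2.0 feed built with Python's standard library. The function name `build_rss`, the channel link, and the sample readings are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, readings):
    """Build an RSS 2.0 document from (timestamp, headline, detail) tuples."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = "http://localhost/"  # placeholder
    ET.SubElement(channel, "description").text = "Values collected over time"
    for timestamp, headline, detail in readings:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = headline
        ET.SubElement(item, "description").text = detail
        # RSS expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(item, "pubDate").text = format_datetime(timestamp)
    return ET.tostring(rss, encoding="unicode")


# Invented readings: the same property reported on two consecutive nights.
sample = [
    (datetime(2005, 6, 7, 4, 0, tzinfo=timezone.utc),
     "lab-01 memory: 512 MB", "Reported by the nightly system overview."),
    (datetime(2005, 6, 8, 4, 0, tzinfo=timezone.utc),
     "lab-01 memory: 1024 MB", "Memory changed since the previous report."),
]
feed = build_rss("Lab asset changes", sample)
print(feed)
```

Because each item carries a `pubDate`, a feed reader can sort the entries by date, which is what makes the format convenient for values tracked over time.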

Thanks. Good use of Zoom there, Tony. Looked good. All right, so real quick, what did we learn? So the point is that ARD is extensible. You can actually sit down. If we don't do that task for you right now via the user interface, there's probably a way that you're going to figure out to actually go do it. We're designing ARD and future versions of ARD with portability in mind. We know that there's a huge amount of mobile computers out there. We'll continue to support that.

And our open database infrastructure, it's great. It's SQL. You can go and use it. You can do the applications Tony demonstrated. You can pick whatever tool you want. But let's keep going. I'm going to bring Todd Dailey up here. We're actually going to go under the hood a little bit deeper. Todd? Todd's one of our system engineers. Todd?

Okay, so we've looked through a couple of things with Apple Remote Desktop now, and now we're going to go a little bit deeper into more details on how we can operate things from the command line. So I'm hoping to speak to two groups of people here that I think are going to be interested in this.

First of all, if you're interested in the Apple Remote Desktop part of the presentation and you're thinking about, you know, how could I do a little more here? How could I learn a little more about what I can do with that send Unix command? Then hopefully this will get you started with scripting. If you're already an experienced scripter, then hopefully we'll show you a couple ideas here about how you can use some of the tools that are included with OS X Server that are unique to OS X and OS X Server to do what you're already doing.

So we're going to go through the process of writing some scripts. When to use a script is actually a pretty simple decision process. So the main thing that scripts do is they automate repetitive tasks. The script is going to run the same way every time, all the time. No matter what you do, it's going to run the same way.

Maybe something changes on the back end but that's probably what we want to monitor with the script. The second thing a script does is it outputs data consistently. So if you're going to build an Excel spreadsheet, if you're going to feed this into a SQL database, you want a common comma separated value format, a common text format, a common XML format that this can spit out so that you can then import that as Tony was showing into Safari RSS so that you can do things like that. And the third thing that a script does that's probably most important to those of you that are system administrators is that a script runs when you aren't there.

So when the system dies at 2:00 in the morning, a script can run and send you a message. So you can see the information about that. When a RAID, as we'll show you, degrades, when part of a mirror degrades and you've got half of the system broken, a script can run and show you that that system's broken rather than having the CIO call you and go, why is the system down? You knew long ago that part of the mirror was broken and maybe you made that decision at 2:00 a.m. to drive in and replace the drive. Maybe you didn't but it's your choice. You're empowered now by the information that we're gathering and collecting.

So scripting is great in OS X Server. You've got a really wide range of choices, and if you're familiar with the Windows world, or if you're familiar with the Linux world, we've really got a great set of tools for you in OS X Server because of everything that's built in.

You've got a guaranteed baseline in OS X Server that includes things like Python and things like Perl, a whole wide variety of scripts, and even some newer languages like Ruby are all built in, and what that means is that when you're deploying something onto a Tiger server, you can count on a minimum version being there, and it gives you a lot of power.

When you're deploying a Linux system, when you're deploying a Windows system, you don't have the same guarantee of a consistent base system being there because of all the different Linux distros, because on the Windows side, you have to install something like Cygwin, or you have to install Perl from ActiveState, you have to install Windows Scripting Host. You don't have the same consistent environment.

So there's a lot of choices that you have, including AppleScript, including Automator. I think the best choice is actually pretty obvious. The best choice is whatever you already know. If you already know Perl, if you already know shell scripting, if you already know Python, if you already know anything, it's probably best to stick with what you know.

If you're new to this and you're just trying to learn, get into writing your own scripts, doing your own tools, I think Python is a fantastic starter language. O'Reilly over in their booth has some great books to get you started. The nice thing about Python that you'll come to hate later about it is that it enforces a consistent structure. It's white space sensitive. You have to indent things.

It's self-documenting, easy to read later, which can't be said of tools like Perl. So Python is a great tool to start with because it enforces good habits on you. As you get better and you want to use bad habits, you'll come to hate it and you'll learn something else.

Okay, so we're going to start out with a very small shell script. And so this is great because what the very small shell script is going to replace is someone being there at 2 in the morning when one half of the RAID fails and you've got to change something.

So the problem is that software RAID 1 is commonly used on Xserves and it is enabled for it to be monitored on Xserves. So when half of a RAID fails, you may be able to notice, those of you on this side over here, the actual broken RAID, which you can tell is broken because we actually physically popped the drive out. You can see we've got an alert light blinking over here. And if we were running Server Monitor, we'd actually see the identifier light on there.

And we'd actually see that there was an amber light and we'd actually get notified that there was a problem on the server. What we want to enhance on that is that within Server Monitor, you can send out generic alerts. So you can send out an alert that says, "There's a problem on Server 4."

What we can't do within Server Monitor, although it's a great tool and there's great monitoring built in, is that we can't send a specific alert that says, "There's a problem on Server 4 and the problem is that one side of the RAID mirror is broken." So, we've got to do that.

So, what we're going to do is add a little bit more detail so that when you get that page in the middle of the night to your phone, instead of emailing your phone through Server Monitor, we're going to email it through this shell script and it's going to say, "Hey, one half of this mirror is down." So that way you don't have to get up, you don't have to VPN in, you don't have to look at the server to see what's going on. You can take a bleary look at your cell phone and go, "Well, half the RAID is down, the system's still up and going. I'm okay."

So, what we're going to do, and I'll show you this in the demo as well, is we're going to wrap around some command line output. So there's a tool within OS X Server, diskutil checkRAID, that puts out some text here that tells us what's going on on the RAID.

So, this is actually a ripe area for writing a script to, and the reason for that is that the text output here is consistent. So, when we look at this through the eyes of the scripter, what we're looking for is the places in which we can grab some text and look at it and see what's wrong.

So, as you can see on here, we can see on the script that there's two areas here where it says the system is okay. So, if we were going to write a script to see if anything was wrong, we'd probably want to look at those two fields and see if they've changed. And so, that's exactly what we're going to do with our script.

So the script, while this looks intimidating, is actually very simple. All we're going to do is we're going to look at those OK fields, so we're going to strip all of the text out except for those fields that say OK. We're going to take a look at them and see if they say anything other than OK.

And if they say anything other than OK, we're going to kick off an email to the system administrator's cell phone, and you're going to get it, and you're going to see that something is wrong. Okay, so let's go through that in a demo here on Demo Station 1.

Okay, so let me go ahead and show you the script again.

[Transcript missing]

Okay, so let me build this up for you. So here's diskutil checkRAID. And what we're going to do now is we're going to send this through a tool called grep. What this grep line does is it grabs any lines that begin with 0 or 1. Assuming that we have a mirror, online drives within the mirror are going to start with the 0 or the 1, drive 0 or drive 1.

And so what you see here is that unlike the output that there was before, we've now grabbed just the line here that starts with one. So that's the only output that we're grabbing with this. And then the final step I'm going to show you for the demo here, is that we pipe this through awk, which grabs the third text field and just outputs that.
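
The script shown on screen is missing from this transcript, so here is a hedged Python rendering of the same two steps: keep the lines that begin with 0 or 1, then take the third whitespace-separated field. The sample text is hypothetical and only follows the shape described in the session; the real `diskutil checkRAID` layout and status strings may differ.

```python
import re

# Hypothetical sample shaped like the demo's description: member lines
# start with the drive index, and the third field is the status.
SAMPLE_CHECKRAID_OUTPUT = """\
RAID SETS
---------
Name:   MirrorSet
Status: Degraded
0  disk0s3  OK
1  disk1s3  Missing/Damaged
"""


def member_statuses(output):
    """Return the third field of every line that starts with 0 or 1."""
    statuses = []
    for line in output.splitlines():
        if re.match(r"[01]", line):         # same idea as: grep '^[01]'
            fields = line.split()
            if len(fields) >= 3:
                statuses.append(fields[2])  # same idea as: awk '{print $3}'
    return statuses


print(member_statuses(SAMPLE_CHECKRAID_OUTPUT))
```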

So now we've done all the hard work that we need to do to write this script. We've taken the output of the diskutil checkRAID tool, and we've taken that down to just the line that we want to compare to and see if that is okay or not. And so when we go back to the script here, you can hopefully follow the logic here. If that third text field says anything other than OK, then we're going to email that information out. So let's just run that real quick. Somewhere here.

Could have typed it by now. And you see the output here. We're now just outputting that the RAID has the status of missing damage. And again, we get emailed in production here on this. And so that's it. That's the very small shell script that runs this. Normally you'd run this as a cron job every two hours or so, and so if anything ever did go wrong, you'd get emailed with a direct alert. So that's a very small, very simple example for demonstration purposes of how you can go beyond what Server Monitor reports and start customizing your environment to your needs. So let's go back to the slides.
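
The decision step, "anything other than OK means send mail", might look like the sketch below. It assumes the member statuses have already been extracted as a list of strings; the function name `build_alert`, the host name, and both addresses are placeholders. It only builds the message, and the commented lines show one way to hand it to a local mail server.

```python
from email.message import EmailMessage


def build_alert(hostname, statuses, recipient):
    """Return an EmailMessage if any mirror member is not OK, else None."""
    bad = [status for status in statuses if status != "OK"]
    if not bad:
        return None
    message = EmailMessage()
    message["From"] = f"raidcheck@{hostname}"
    message["To"] = recipient
    message["Subject"] = f"RAID problem on {hostname}"
    message.set_content(
        f"One side of the mirror on {hostname} reports: {', '.join(bad)}"
    )
    return message


alert = build_alert("server4.example.com", ["OK", "Missing/Damaged"],
                    "admin-pager@example.com")
if alert is not None:
    print(alert)
    # To actually send it, hand the message to a mail server, for example:
    #   import smtplib
    #   with smtplib.SMTP("localhost") as smtp:
    #       smtp.send_message(alert)
```

Run from cron every couple of hours, as suggested in the session, a healthy mirror produces no mail at all, so a message on your phone always means something changed.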

So let's look through a little more sophisticated problem now. And so the problem that we're going to go to solve now is that Server Monitor gives us a ton of information. It's a really great tool. It's great that it's free with OS X Server. It's great that you can go to any Xserve and you can pull a ton of data off of it from what kind of RAM chips you have in there and what slots they're in to what the voltage levels are over time on the system to the example that we're showing here, which is what are the CPU temperatures. The Xserve G4 has two, and the Xserve G5 has 11 temperature sensors built within the system that you can actually go in through Server Monitor and you can see.

The issue here is that there's no way to take that data and trend it over time other than the graphing that's within Server Monitor itself. So if you have a large cluster, if you have a whole bunch of systems, if maybe you're having some overheating problems in your department, you might want to graph the temperature of these systems over time. And so what we're going to write is a small Python script that will wrap around the serveradmin tool, which is another command line tool that you have within OS X Server.

And we're going to take that data, we're going to parse it out, we're going to display it on the screen, and we're going to also write it to a comma-separated text file so that then you can load that into Excel, you could load it into a REALbasic script, you can load it into Nagios that we're going to show you next. You can do a lot with that data because we've now got a lot of data that we can show you right now. Rather than show it on the screen, we're going to actually write it out to a text file.

So how do we get data with serveradmin? So serveradmin is a command line tool that lets you see everything that's going on with the system. And I'll show you a demonstration of this command line in just a second. But it spits out about 500 lines of text that's all the data that you see in Server Monitor in text form. One thing we added in Tiger, which is great too, is that we added a -x switch, which outputs in XML.

So as Tony was showing you, there's a lot of stuff like Safari RSS and a lot of scripting tools, a lot of programming tools that understand XML natively. A little bit easier to work with than just parsing the raw text like what we're going to do today. And so let's go ahead and go into that script.

Thank you. Okay. So let me show you the script first, and then we'll step through and show you exactly what goes on here. So what we've got here is a Python script. As you can see, Python is a very well-structured language. It's very easy to follow what's going on.

If you've done any programming at all, basically we've got a somewhat more sophisticated script than what we were showing you before. But the core of what it's doing is actually pretty simple. The core line that we're concerned about here is this one, in which we're running this command and capturing the output. And then we're processing that in the script. That's basically all this script is doing. So let's go ahead and take a look at that.

So here, we're going to actually run that command that I just showed you that the script is running. And what you can see here is this is just a whole bunch of text about your system. And this exactly corresponds to the data that you see in Server Monitor. Just Server Monitor has it pretty with graphs, with lights, with green lights, with red lights. But it's actually the exact same data. You can sit and compare this with the data in Server Monitor and you'll see it's exactly the same.

So what you'll see here is some data here. Just for example, we've got the fan RPM on the system. We've got information about the RAID. We've got information about the actual drives and what kind they are. Ton of information that's in here. So what we're going to do is extract some data out. And before I do that, let me show you too the XML output, just so you know. And so, same data done in XML. Again, if you know XML, if you have a tool that understands it, you can go in through this as well.
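
If the XML output is an Apple property list, which is an assumption here since the transcript does not show it, Python's `plistlib` can turn it into ordinary dictionaries without any hand parsing. The key names `fanRPM` and `raidStatus` below are hypothetical stand-ins, not real `serveradmin` keys.

```python
import plistlib

# Hypothetical fragment: the real key names depend on the command being
# run and are not shown in this transcript.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>fanRPM</key>
    <integer>5200</integer>
    <key>raidStatus</key>
    <string>Degraded</string>
</dict>
</plist>
"""

# plistlib.loads parses the bytes into native Python types.
report = plistlib.loads(SAMPLE_XML)
for key in sorted(report):
    print(f"{key}: {report[key]}")
```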

Okay, so let's go back to the script. So this is the data we're collecting. And so we're going to basically load all of this data into an array, which is what we're doing here. And for each value, we're going to assign it. So these are the 11 temperature sensors that are on an Xserve G5. We're going to assign all these to a variable. We're going to display those variables that we assigned.

And then the last section here, we're going to log those values to a comma-separated value file, /var/log/xservtemp.log. And so that's what we're doing within the script. Again, we're just grabbing data that's provided by the command line tool, and we're just going to write that out into a comma-separated value file and give you a way to view that.
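
The script itself is not reproduced in this transcript, so the following is a sketch of the same three steps (parse, display, append to a CSV log) under stated assumptions. The sample text uses an invented "key = value" layout with made-up sensor names, and the log is written to a temporary file rather than under /var/log so the example is safe to run.

```python
import csv
import os
import tempfile
from datetime import datetime

# Hypothetical sample in a "key = value" layout; the real command and key
# names used in the session are not in the transcript.
SAMPLE_OUTPUT = """\
sensor:cpu-a-temperature = 48.5
sensor:cpu-b-temperature = 51.0
sensor:fan-1-rpm = 5200
"""


def parse_temperatures(output):
    """Collect every 'key = value' pair whose key mentions temperature."""
    readings = {}
    for line in output.splitlines():
        key, separator, value = line.partition(" = ")
        if separator and "temperature" in key:
            readings[key.strip()] = float(value.strip().strip('"'))
    return readings


def append_log(path, readings, when):
    """Append one CSV row: timestamp, then values ordered by sensor name."""
    with open(path, "a", newline="") as handle:
        row = [when.isoformat()] + [readings[k] for k in sorted(readings)]
        csv.writer(handle).writerow(row)


readings = parse_temperatures(SAMPLE_OUTPUT)
for name in sorted(readings):
    print(f"{name}: {readings[name]}")

# A temp file stands in for the log path mentioned in the session.
log_path = os.path.join(tempfile.gettempdir(), "xservtemp-example.log")
append_log(log_path, readings, datetime(2005, 6, 7, 14, 0))
```

Writing the values in a fixed, sorted order means every row has the same columns, which is what lets a spreadsheet or another script chart the file later.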

And so we just ran the script, and as you can see, there's the output. So that's the actual temperature of the system as it's running right now. And we also at the same time wrote a log entry for that. Normally you'd set this up as a timed job, a cron job, run it every hour, every 15 minutes, every 10 minutes, whatever you wanted to do.

And then you could go back and check that comma-separated value file at any time and find out exactly what was going on in the system. And again, any of the data that's in there, not just the temperature data, but there's data about the memory, there's data about the status of the system, you can go in and check that at any time and be able to monitor basically anything that Server Admin reports with this general framework of this Python script. So let's go back to the slides.
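Going back to check the file can itself be scripted. Here is a small sketch that scans such a log for samples hotter than some limit. It assumes each row is a timestamp followed by numeric readings; the function name and the limit are hypothetical, not something shown in the session.

```python
import csv


def rows_over_threshold(path, limit):
    """Yield (timestamp, hottest_reading) for each row whose maximum exceeds limit."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip empty or malformed rows
            hottest = max(float(cell) for cell in row[1:])
            if hottest > limit:
                yield row[0], hottest
```

For example, `list(rows_over_threshold(path, 60.0))` would give you just the samples worth looking at, instead of reading the whole file by eye.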

So what did we learn in this section? What we learned is that scripts are a really powerful extension. You can do a Send UNIX script, which does some of the tricks that we do here. But you can write your own scripts, too, that can collect information off the system and put them wherever you want. You can put those into a MySQL database. You can put those into the ARD Postgres database.

The sky's the limit as to what you want to do. And so if you manage a bunch of servers, or even if you only manage a few, and you need to run reports, you need to get data off of the system, using a script is a great way to save you from having a 10-page process that you have to run through every quarter to get a quarterly report together for your manager.

There are many tools available. New to AirPort is an AirPort command line tool that collects data off the client side. So you can see what, for example, your students are connecting to in terms of access points, things like that. Server Admin, we showed you. System Profiler is a great tool on the client as well. It shows you all of the information about the client and how it's configured, what the hardware is, what the software is, everything. And you can write scripts that can make your life easier by automating tasks.

So that's it on the scripting section. Now we're going to get a little geekier. In fact, Nader wanted me to have a beard and wig that I could put on right now and show myself as a true Unix geek. But what we're going to do now is we're going to take that same concept, and we're going to integrate Mac OS X into an open source network management system. And we're going to do that using an open source management system called Nagios, which is really easy to set up compared to any other network management system.

And so we're going to integrate into that and show you how we can do the exact same thing we just showed you with the small shell script inside of Nagios itself. So that same RAID monitoring that we were doing, we're going to do now within the context of a network management system.

Okay, so there's a lot of monitoring solutions that you potentially have out there. There's open source systems, and there's commercial systems. On the open source side, a lot of people use very simple tools like MRTG, which is a simple graphing tool that generates pretty pictures. A lot of people use Nagios. A lot of people use HP OpenView. A lot of people use Tivoli, CA Unicenter. Microsoft MOM is out there as well.

There's a lot of tools that are out there that people use to monitor their system. So you have a lot of choices, and you have a lot of client options on many platforms, including OS X. And you have open source, standards-based solutions like SNMP that can report back into something like HP OpenView or something like Tivoli.

On the servers, you have many choices as well on platforms to run the server management, including OS X, like we're going to show you today with Nagios running on OS X Server. But Nagios is what we're showing. But you have a lot of options. There's a lot of ways in which you can do this. We're going to show you one way to do it that works for us.

So, what is Nagios? So, Nagios is an open source management framework. It's free. It's widely deployed within education, within business, within corporate enterprises, within very large deployments. It's got a fully web-based infrastructure, so you don't have a separate sort of application that's the network management console. The network management console is the web browser, and you pull everything up through there.

It's got a great plugin architecture, both for monitoring and for notification, meaning that it's very easy to take any device that you have and plug it in to be monitored by Nagios. And also, if you have something new that you want to be notified, if you want to notify your Palm handheld or whatever new gadget is on the market, your Sidekick, your LifeDrive, your whatever, there's a way to do that within the notification plugin framework.

One very important thing that you get for free with Nagios, and that you pay a lot for with something like Tivoli, is hierarchical monitoring. And so, what that means is that you can set up the alarms in such a way that the system knows that if the router is down, the servers behind it are also down, and it doesn't need to tell you that.

And so, if you have a lot of servers, this is very important, because if you have 500 servers that are behind a router, and the router is down, and you don't have some way to do the monitoring in a hierarchy, you're going to get 501 pages, which is bad.

So, what you really want to do is you want the system to be smart enough to go, "Well, if the router goes down, all the servers are behind that, so just tell me the router's down, and I'm going to know that something really bad is going on, and I'm going to go check out the router." And so, you can do that with Nagios. Again, that's something that you can do for free with Nagios that usually you have to pay a lot of money for.
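To make the 501-pages point concrete, here is a small sketch of the suppression idea. It is not Nagios's own code (in Nagios you express the hierarchy in the host configuration). It is only an illustration, and the function name and data layout are assumptions.

```python
def hosts_to_notify(parents, down):
    """Return the down hosts that are worth paging about.

    parents maps each host to the host it sits behind (None at the top).
    down is the set of hosts that failed their checks.
    A down host is suppressed when any host on its path to the top is also down.
    """
    notify = []
    for host in sorted(down):
        ancestor = parents.get(host)
        seen = set()  # guards against an accidental loop in the parent map
        blocked = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in down:
                blocked = True
                break
            seen.add(ancestor)
            ancestor = parents.get(ancestor)
        if not blocked:
            notify.append(host)
    return notify
```

With one router and 500 servers behind it, all of them failing their checks, this returns only `["router"]`, so you get one page instead of 501. If the router is up and a single server fails, that server is the one reported.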

That's really what Tivoli made their name on, was the ability to do things like that. And other tools like Unicenter as well. And within the contact and the notification system, you can do a lot within that, too. You can notify via email. You can notify via pager. You can notify via any way that you want to set up, any gadget that you have.

[Transcript missing]

So let's show you how that works. So let's flip back to the demo system. So, what we've got here is a movie that's going to show you exactly how it works. The demo system that we configured and ran is back in Cupertino, so it was easier rather than cart all that here, just to show you a quick movie of how this would actually work. But before I go into that, I'm going to show you the command script that we ran to do the monitoring, and just show you how that lays out.

So what you're gonna see here, so here's the original RAID script, check script that I wrote, and you can see here, again, there's some core logic here that goes out and checks the system. And let's go into this script, which is the Nagios monitoring script. This is a little bit bigger, but actually all this stuff is just Nagios stuff that's the same on every script that you run.

What you can see here is that the two are actually almost exactly the same. This is the only stuff that we added to enable the software RAID check for Nagios. So you can see this looks almost exactly the same. We changed a couple lines to change how it reports back. There's a Nagios thing here with the exit status, saying that this is a critical problem.

And the output here is saying that we're gonna log this to Nagios instead of sending it to the pager like we did before. And that's really all the changes that we made. And so we changed the script, we set up the system to be monitored by it, and I'll go ahead and show you how that looks. Thank you.
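The demo script is not shown in this transcript, but the convention it relies on is simple: a Nagios check prints a one-line status and returns an exit code, where 0 is OK, 1 is warning, 2 is critical, and 3 is unknown. Here is a sketch of that reporting step only. The disk names, the status strings, and the function names are hypothetical, and a real check would query the system instead of taking a dictionary.

```python
import sys

# Exit codes that Nagios plugins conventionally return.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_raid(disk_states):
    """Map a {disk_name: status} dictionary to (exit_code, message)."""
    if not disk_states:
        return UNKNOWN, "RAID UNKNOWN - no disk status available"
    failed = sorted(name for name, status in disk_states.items()
                    if status.lower() == "failed")
    if failed:
        return CRITICAL, "RAID CRITICAL - failed disk(s): " + ", ".join(failed)
    return OK, "RAID OK - %d disk(s) online" % len(disk_states)


def run_check(disk_states):
    """Print the status line for Nagios, then exit with the matching code."""
    code, message = evaluate_raid(disk_states)
    print(message)
    sys.exit(code)
```

Keeping the decision in `evaluate_raid` and the reporting in `run_check` mirrors what the demo describes: the core logic stays the same, and only the lines that report back change between the pager version and the Nagios version.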

So to start out with here, we're at the Nagios main screen. This is the main network management console that you'd see when you're running Nagios. And you can see here on the left side, there's various options that you have. There's various reports and screens that are pre-set up.

This all comes with it. None of that needs to be set up. It's automatically built. And what you can see here is that we've actually got an unhandled problem here. And so, we'll click through to the unhandled problem, and what you see here is that we've had an error. This is not really pausing.

What you can see here is that we've had an error when we were checking the software RAID. So the check software RAID failed. We've got a critical problem. You can see we've got the information here that the Xlab XS04 server has a disk with status failed. And so we'll click through to that. And you can see here's a lot more information about this specific alert.

And Nagios also tracks a lot of information about the specific host that we have as well. And so we're going to go ahead and click through to that. And we can see the state of the host, and we can see how things are running. And we'll go ahead and look at the status of all the services that are running on the host as well.

And so this is pretty cool, too. What we can see here is everything that Nagios is currently monitoring. These are all the different agents that are collecting data here. You can see there's data here about the software RAID. There's also data about the load average, so it's tracking statistics, whether NFS is running, whether there's any errors, how much disk is in use on the root volume. Some of these are off the shelf. Some of these are things that we customize. But there's a lot of data you can see here that we're collecting. And, again, this is a free network management framework.

This doesn't cost you anything. It's pretty easy to set up. There's a really good Mac OS X how-to guide that we listed on the WWDC site. But if you just Google for Nagios how-to Mac OS X, you'll find it on the Internet. It's on guysmac.com page. And so, let's go on here and show you, go back to the main status page.

And so you can see here we've only got two systems in here, but Nagios does build you a standard status map that you can see here. And we can very quickly go in if we had a bunch of servers and see the status of all the servers. So there's a ton of data that we collect in here, a ton of good information.

And so to end up back on the overview screen here, as you can see, Nagios is great because it really does a good job of giving you the pulse of your network, lets you see everything that's running. I love this tactical overview screen because it gives you just a quick view of how your system's running and how things are going. Again, this is running actually on OS X Server, and it's monitoring OS X Server.

You don't have to run Nagios on OS X Server, although it runs very well there. And then from a client perspective, you don't have to just monitor OS X Server clients, Xserves. You can monitor Xserve RAIDs. You can monitor Windows machines. You can monitor other Unixes. You can monitor other devices. It's actually very easy to move things into it. So let's flip back to the slides.

Other resources are on the WWDC site. With the sample scripts that we showed you today, we didn't have time to get those through the official submittal system, but Jason Anthony Guy, who owns this track, guaranteed me that we'd get these up on the developer site in some way, shape, or form.

Because we're towards the end of WWDC, we don't have a lot of related sessions to tell you about, but there's one that if you're still here on Friday, you really shouldn't miss, and that's Steve Hayman's Building Automator Actions for System Administrator session. Steve rocks. He's a really entertaining presenter. Don't ask him to list the digits of pi, but Steve is just a great guy, and it's a really interesting session. I think you'll learn a lot of great stuff in there.