Information Technologies • 44:37
Xgrid 2, part of Leopard and Leopard Server, makes Apple's groundbreaking distributed resource technology for clusters and grids more powerful and even easier to use. We will explore the newest features of Xgrid 2 while surveying best practices in building and deploying applications for your IT infrastructure, from loosely coupled 'ad hoc networks' to tightly integrated clusters.
Speakers: David Kramer, Steve Simon
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it may have transcription errors.
of the conference. I'm glad you chose to come to my session. I think you'll be pleased that you made this decision. I think you'll find that all the rest of the sessions were just leading up to this one. This is really the capstone of the entire conference. So I hope you enjoy it. My name is David Kramer. I am with Xgrid Engineering. Steve Simon will be joining me later to demo one of the new features that I'll be talking about now.
We're going to start with an overview. This is going to be an introduction for those of you who haven't used Xgrid before, haven't attended one of my sessions. Show of hands, how many people have never heard of Xgrid or haven't really learned anything about Xgrid? All right. For the four of you, this is going to be great. The rest of you, just a refresher course.
And we're going to get through this real fast and move on and talk about new features. So let me get back to this. The main thing we want to let you know is that Leopard includes Xgrid 2. Xgrid 1 already was included with Tiger, and we've made some improvements in Leopard, and we're calling it Xgrid 2. All of the primary components of Xgrid are included with base Leopard systems, but the administrative tools that make it easy for you to deploy your grids are included with Leopard Server. And my recommendation is that to get these administrative tools, the best way is to buy an Xserve, or 8 or 16, because it's going to come with Leopard Server, especially one of those new Xeon Xserves, quad core. Sounds great. Buy a bunch, please. You get the administrative software, and it will be really easy to set up your cluster.
So as I said, we're going to start with the overview. Then we're going to move on to developing for Xgrid, talk about some of the new API features we've added, and talk about the administration. And those two sections are going to be completely all about new features in Leopard. Stick around.
So why would you want to use Xgrid in the first place instead of something else? There's lots of options out there. I mean, you can use SSH. You could hire some grad students to run around and start stuff on a bunch of computers. You can do it any way you want, but we think you should use Xgrid, and here's why. Well, one is it makes distributed computing painless. And I'll talk a little bit more about that later, but the goal here is to make it drop dead simple for you to deploy clusters for your users to use the distributed resources that you have.
The next point I'd like to bring up about why you should use Xgrid is that it allows people-friendly sharing of resources. And so we have a mechanism for letting people's computers get used when they aren't using them. But once they return and they are using the computer, Xgrid backs off and it's no longer monopolizing the computer. And so people can feel good about sharing their resources with you. And one more reason why Xgrid is great in our opinion is that it supports a wide variety of styles of distributed computing. So one style is sort of the Beowulf style, where you just buy some components, put them in a rack, hook it all together with Ethernet, and you have a dedicated cluster, a closet or somewhere that just contains all these computers, and you use it just for your one task. Maybe you get this cluster because you write up a grant proposal, and you get this nice cluster. You set it up, you run Xgrid on it, and you do your work. But we also support other styles. At the other end of the spectrum, we have the more SETI@home style. And this is where you have machines all across the wide area network, across the internet. And they're connecting to the central authority for Xgrid and getting their work and doing it that way. And so with that SETI@home style, you can get a lot of people involved. Charles Parnot's Xgrid@Stanford project is a great example of that kind of thing. And then in the middle, we have the sort of cycle recovery at an organization. And that's like where you have a computer lab or just a bunch of computers on desks. And at night, they remain unused. And so in that case, it would be nice to make use of those computers for computing purposes.
So that's why we think you should use Xgrid, but who should be using Xgrid is the next question. And there's a number of different people we'd like to be using Xgrid. We think institutions and organizations can get a lot of benefit out of using Xgrid because we're making setting up distributed computing resources much easier and much simplified. So setting it up is just a few clicks in Server Admin. Maintaining it is easy. There's some really nice graphical utilities for doing that. And then once it is set up, the grid becomes a service. And it's a service like mail or print or file sharing. And the IT department maintains it. So the people who are good at maintaining servers maintain the servers. And the people who are good at doing distributed computing get to submit their work to the grid and do it. And they don't have to worry about rebooting the servers, about access controls, about security authentication, whatever. They just get to use it. It's very easy. So scientists are clearly one group of people that have a lot of computing to do, engineers as well. I was trained as a physicist, and now I'm a software engineer. In both of these professions, I have found that it's very useful to have a lot of computing resources when you have a lot of data that needs to be processed. So one of the reasons that Xgrid is nice for these people, and this gets back to the painless distributed computing concept, is that we have a persistent job queue in Xgrid. So this means that you can fire and forget your work. So you imagine the researcher at work on Friday, and he's just collected a bunch of data from his equipment.
And he wants to process it, maybe come up with a really nice visualization of it, some sort of animation nicely rendered. So he sets up the job, and he submits the job up to the cluster. But he doesn't want to stick around all weekend to wait for this to get done. He needs to go home. He wants to spend time with his family, go to Disneyland, something, have a good time. And so what happens with Xgrid is he can submit the work from his laptop and then disconnect, close the lid of his laptop, take the laptop home, and the job stays running at work, at school, wherever he did the submission.
And it'll just stay there. It'll keep working. Xgrid will manage it, make sure everything goes right. If anything goes wrong, it'll try to fix it. And then when the researcher returns to work on Monday, he just reconnects to the controller, and the job results are waiting for him to download at his convenience. So we think that makes it much easier to use grid resources, and it's one of the benefits of using Xgrid. And then there's a third group where we're seeing more people getting interested in Xgrid: these people are content producers. And so one example of content producers are people using Apple's Podcast Producer, which we announced on Wednesday. And so clearly here they get to use Xgrid as the distributed processing engine for doing the video transcoding, for running the workflows. And so they can get a lot of benefit out of having Xgrid set up without actually needing to know how to write software for Xgrid. And then there's other third party applications out there that use Xgrid. For example, VideoHub uses Xgrid to-- I think it runs FFmpegX across your network to do transcoding of video as well.
So before we get too far into this, I wanted to talk to you about terminology for those of you who aren't familiar with the terms I'm going to use. Actually, it's simple, and there's only a few terms here, but I just want to make sure we're all on the same page so there's no confusion. We have clients, controllers, and agents. These are the primary operators in this process, the primary people that are working together. So the client is the one that has the work that needs to be done. This is the scientist, the researcher. The controller is the process in the middle that maintains the job queue. And it receives the work, and it distributes it out. And then finally, the agents are the ones that do the actual work.
It's a little bit dry. I have a picture here to make it a little clearer. So over here we have the clients. These are people using their computers, laptops, iMacs, G5s, Xeons, whatever you have. They're the ones with the work. They set up the jobs, and they run an application to submit the work. And the work gets submitted to the controller.
The controller is in the middle tier of this three-tier architecture. And it's the one that manages everyone on the other two tiers and makes sure everyone's playing together right. So finally, once your work's been sent to the controller, then the controller finds available agents and sends the work out to them to get it done. So agents can be any kind of computer. It could be the same computers as the clients. You could have an Xgrid setup where all of the clients were also agents. You can have a controller that's an agent. You could have all of these tiers in one computer and just have a one computer grid. Not so exciting. I think you should buy those 16 Xserves I talked about before. But this is what we've got. So there's three tiers, all in one or distributed across lots of machines.
So just a little bit more, and this is kind of like a dictionary. I'm almost done with the terminology. What we're actually managing here in the controller are grids, jobs, and tasks. And so the grid is the collection of agents that the controller is maintaining. And you can have multiple grids that you set up, and you can drag the specific agents you want to each separate grid. And then you can tell people, OK, well, you guys, you submit to this grid, and you guys submit to that grid. And then you both get some of the resources that we have on the network. And then the grid is also the job queue. And this is what manages the list of work that needs to be done and makes sure that it gets done in the right order. So a job, which is what is stored in the job queue, is the set of tasks that you want to get done on the agents. And it also includes the input and output data of the tasks. So once the tasks are finished, the job includes the output data, which can be retrieved by the client.
And then finally, the smallest piece of work that's done by an agent, and the single unit that gets sent out by the controller to the agents, is a task. And so a task, it's pretty simple. It's just sort of a Unix command line tool, basically. It's an executable, some arguments, its working directory files, its standard input, and all of this blob of task gets shipped over, and then the agent runs it and sends the results back to the controller. I have another picture here to show you what this looks like in action. This is the Xgrid workflow. The basic idea here is that the client submits, and then it monitors, and then finally, once the job is done, the client can retrieve the results.
So we begin by submitting the job, and then once the controller gets the job, it schedules the tasks. It breaks the job up into tasks and figures out which agents are available. So once it has determined which agents are available and which ones are suitable for the tasks in question, it submits them off to the agents, and the agents run them. And so they just run the task. They capture all the output from the task and all the output files that are generated by the task in its working directory, bundle them all up, and ship them back to the controller. But in the meantime, the controller is monitoring what these agents are doing. And one of these agents might go offline or get disconnected: someone tripped over the power cord, or someone shut the computer down. For whatever reason, if one of these agents goes offline, the controller will notice this and will know what work that agent was working on and will make sure that work gets scheduled someplace else. And so it's constantly monitoring the status of the agents and then resubmitting the work as necessary. Once the agents have all completed, they return the results back to the controller and the controller collects the results. As I said, the controller holds on to these results until the client is ready to grab them. And then finally, the client retrieves the job results.
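The workflow just described (submit, schedule, monitor, reschedule on agent loss, collect) can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not Xgrid's implementation: the `Task` and `Controller` classes and all of their method names are hypothetical and do not correspond to any Xgrid API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical model of a task: a name plus the command line to run.
    name: str
    command: list


@dataclass
class Controller:
    agents: set                                   # names of agents currently online
    queue: deque = field(default_factory=deque)   # tasks waiting for an agent
    running: dict = field(default_factory=dict)   # agent name -> task in progress
    results: dict = field(default_factory=dict)   # task name -> collected output

    def submit(self, tasks):
        """A client submits a job; the controller queues its tasks."""
        self.queue.extend(tasks)

    def schedule(self):
        """Hand queued tasks to idle agents, one task per agent."""
        idle = sorted(self.agents - set(self.running))
        for agent in idle:
            if not self.queue:
                break
            self.running[agent] = self.queue.popleft()

    def agent_lost(self, agent):
        """An agent went offline: put its task back at the front of the queue."""
        self.agents.discard(agent)
        task = self.running.pop(agent, None)
        if task is not None:
            self.queue.appendleft(task)

    def agent_finished(self, agent, output):
        """An agent returned results; hold them until the client retrieves them."""
        task = self.running.pop(agent)
        self.results[task.name] = output
```

In this sketch, calling `agent_lost` followed by `schedule` is what moves interrupted work to another agent; a real controller would also persist the queue and results to disk.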
So that was the workflow, but what else does Xgrid do? One thing it does that I didn't mention before is that it enforces authentication and authorization policies. So this means that you can rest assured that if you set up your service ACLs, only authorized people have access to your resources. As I said before, it groups agents into grids, it monitors agent availability, manages the queues of the jobs, and then one thing I didn't mention was that jobs can have dependencies. So you can submit multiple jobs and say that these following jobs should not even be started until the first job completes. And so the scheduler schedules runnable jobs when there are available agents to run. Xgrid also handles data staging. As I said, the jobs and the tasks include the input data and then collect the output data. Xgrid will move all the data around for you if you would like it to. And then the really important thing that I think Xgrid does that I like, my favorite feature, I think, is that it recovers from failures. And so that's the agent going offline. It can deal with that and send the work over to another agent. And then another way it can recover from failure is that if the actual controller were to crash, heaven forbid, or if the machine were to be shut down, or maybe you need to restart for a security update, you can reboot.
The jobs will persist. They're stored on disk in a database. And when the server comes back up, the jobs are all there, and they will continue running. And the pieces that have been finished ahead of time, all those results are still there. The tasks that were in process at that point will get resubmitted once the agents reconnect to the controller.
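The job dependency rule mentioned above (a job is not started until the jobs it depends on have completed) can be illustrated with a small Python sketch. The function name and the data layout are assumptions made for illustration; the talk does not describe how the controller represents dependencies internally.

```python
def runnable_jobs(jobs, finished):
    """Return the names of jobs whose dependencies have all finished.

    jobs:     dict mapping job name -> set of job names it depends on
    finished: set of job names that have already completed
    """
    return sorted(
        name
        for name, deps in jobs.items()
        if name not in finished and deps <= finished
    )


# Example: a three-stage pipeline where each stage waits for the previous one.
pipeline = {"render": set(), "stitch": {"render"}, "encode": {"stitch"}}
first = runnable_jobs(pipeline, set())          # only "render" can start
second = runnable_jobs(pipeline, {"render"})    # now "stitch" can start
```

A scheduler along these lines would call `runnable_jobs` whenever a job completes or an agent becomes available.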
So that was the overview. And now let's talk about what is new here in Leopard for Xgrid. So we've addressed three areas of improvements here. Ease of developer adoption. And for that, we have three new features-- Xgrid Anywhere, Xgrid Scoreboard, and Task Feedback. And I'll talk more about those in a moment. We've also tried to address ease of setup. And to this end, we've created a service configuration assistant for Xgrid to make it even simpler to deploy Xgrid on your servers. It was pretty simple before. There was only, I don't know, six clicks. I think we've got it down to three or four now. So it's pretty easy to get going with Xgrid. And then the final piece that we've addressed is ease of porting. And this isn't exactly an Xgrid feature, but OpenMPI will be included with Leopard. So if you have existing distributed computing applications that make use of the MPI API, you can use OpenMPI in Leopard, and it will automatically use Xgrid if you have Xgrid configured.
And I see we have some OpenMPI fans, so that's great. So here's the good stuff. What's new in Leopard for writing your Xgrid software? So three features-- Xgrid Anywhere, Xgrid Scoreboard, Task Feedback. Xgrid Anywhere is a feature that allows your Xgrid-enabled software to run no matter what. And by no matter what, I mean even if there are no controllers or agents on your network.
So you're just on a standalone machine. You're flying at 20,000 feet, heading back home tonight, and you can start writing Xgrid software on your desktop or your laptop, your portable machine. And you don't need to set up any controller or agent. And you can just start actually testing the job submission and results retrieval right then and there. Xgrid Scoreboard is a really exciting new feature that we've developed based on a lot of feedback we've gotten on the mailing list and at previous sessions and Q&A, where people have wanted to be able to specify which resources are used for a particular job. And so we've addressed that with Xgrid Scoreboard, and we'll be talking about that more soon.
And then Task Feedback is the third new feature that we've added, which lets the scheduler make more informed decisions about where you want your tasks to run based on the current conditions. And that might sound a lot like Scoreboard, and I'll explain the differences between those two. These are just the three features that we're talking about today and that are included with your seed. We have more enhancements for Xgrid that we intend to get in there, but we're going to talk about these today.
The first one is Xgrid Anywhere. And basically, the problem that has come up is that we'd like everyone to use Xgrid. And everyone's like, well, yeah, if I had five computers, everything should just go five times faster. That sounds like a good plan. And they say, so can I just drag my application onto the Xgrid icon in the dock, and it'll go faster? And that sounds good. I'd like that. But that's unfortunately not how it works right now. Sorry.
So there's no free ride. You do need to adopt Xgrid. And the problem there is that adoption is hard. And I don't mean that Xgrid in particular is hard. I just mean that adopting things is work. You're going to have to learn something new. There's a new API. You're going to have to change your code. And then once you do change your code, the real problem here is that you still need to maintain all of the code that you wrote before to do it all locally. Because you don't know if your users are going to have Xgrid set up ahead of time, if they're going to have a cluster. So with Xgrid Anywhere, we've decided to solve this by making it so that Xgrid is available anywhere you are, no matter what version of Mac OS X Leopard you are using, desktop or server. And you can just run the application, and it'll work. So now you can rewrite your code to Xgrid, and you only need one code path. It's always going to work, even if there are no controllers or agents available on the network. So there's two pieces that make up the Xgrid Anywhere feature. The first one is Xgrid Here, I'll call it.
And this is the private controller. And basically, there's a new API, private controller, on the XGController class, which you can call, which provides your application with its own private in-memory-space controller and agent. This controller and agent are basically running inside your application's process space. And you can use them as if they were a network controller or network agent. Everything is going to work exactly the same: job dependencies, submission, monitoring, retrieval, all exactly the same. So you just need one code path that always will work whether or not there's any network setup. You also don't need to use this new API. If you already have a host name entry field in your application, you can just type in this special host name, colon private colon (":private:"), and it will instantiate this private controller and connect the application to it.
The next piece of Xgrid Anywhere is Xgrid There. And this is the default controller. So what we've done is created a system and a user preference that you can set that says, if I don't know what controller to use, use this one. And so this makes it easier for end users at a large organization who don't know how to configure Xgrid to connect to a particular controller. They don't know which one they're supposed to use. You can push these settings out, and then these users will just automatically connect to the right one. And so to make use of this feature, your application should call the default controller API. And this will return the default controller that's been connected up based on what this setting has been set to. And then the really interesting part here is that if no default controller has been set up, presumably because there are no controllers on the network, you just automatically get the private controller.
So you take Xgrid Here, and you take Xgrid There, and you put them together, and you get Xgrid Anywhere. This makes your application easier to test and easier to use. And I've really enjoyed using this new feature myself because when I was developing some of the demo code for this session, I didn't have this rack in my cube. So I needed to make sure the job submission and results retrieval was working. And I didn't have to set up a grid. I just could start the application, type in colon private colon, and I could do a test of all of my application code without having to set up a grid, without having to worry about Kerberos or passwords or anything like that. So if you use the default controller in your application, it's always going to work the way your users want it to. Either it's going to use a private controller and it's just going to work without a network set up, or it's going to use their default controller, which is the one they want.
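The fallback behavior described here (use the configured default controller, otherwise fall back to the private one) amounts to a simple selection rule. The Python sketch below is an illustration only: the function name is hypothetical, and the order in which the user and system preferences are consulted is an assumption, since the talk only says that both a system and a user preference exist.

```python
# Special host name from the talk that selects the in-process private controller.
PRIVATE_CONTROLLER = ":private:"


def choose_controller(user_default=None, system_default=None):
    """Pick a controller host name.

    Assumed precedence (not confirmed by the talk): the user preference,
    then the system preference, then the private controller.
    """
    return user_default or system_default or PRIVATE_CONTROLLER


# On a laptop with nothing configured, the application still gets a controller.
offline_choice = choose_controller()
# At an organization that pushed out a system-wide setting (hypothetical host name).
managed_choice = choose_controller(system_default="grid.example.edu")
```

The point of the design is that application code asks one question and always gets a usable answer, so there is a single code path for networked and standalone use.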
So this is Xgrid Scoreboard. This is the next feature that we've added. And this one is really cool. I like it a lot. And what the problem was here was that we designed Xgrid to be simple. And to simplify things, one of the things that we did was make it so that you didn't really know anything about where your work was going to be run. You basically just give it to the controller, and the controller manages all the effort of which agents are going to run it and dealing with the failover, all that business. You totally don't need to worry about it yourself. But it turned out maybe we were wrong about you not wanting to care. And so we've reconsidered this. And in fact, I've been doing some work myself and found that it would be really nice to choose a subset of the computers to use for some of my jobs. So now what we've done is give you an option to care. But you don't have to care. I mean, you can still use Xgrid in the very simple form, just say, here's some work, get it done, I don't care where. But now you can care. And so, for example, one thing that was mentioned to me that really hit home to me why we would need this is if you're using a GPU, a hardware-accelerated rendering engine, to render portions of 3D scenes that you're going to stitch together, or maybe frames in a movie. You're going to want each of these pieces or frames to be rendered on the exact same kind of GPU, because different GPUs from different vendors have slightly different qualities about them. And so not everything is going to be rendered exactly the same, and you're going to notice some odd irregularities between these frames or the pieces if they're on different GPUs. So you really want to make sure that your work gets done on computers that all have the same kind of GPU. And so, I mean, that sounded good when someone explained to me that they really needed that feature and I understood why they needed that feature.
I thought, well, I could just add a flag to job submission that said, well, this is the kind of GPU I care about. But that doesn't really scale, and that only works for that one issue. And we can't predict everything you are going to care about. So we came up with a way for you to tell the controller what it is that you yourself care about. Thank you.
The way that you specify to the controller what you care about is by including an agent ranking tool with your job submission. And this is a tool that you write. It can be a really lightweight script or an executable. And you just include it along with the job. And what the controller will do is make sure that this ranking tool gets run out on all of the agents in your grid before it schedules any of your actual tasks, your computational tasks out there.
And the agent ranking tool gets to run. And it evaluates the conditions that you care about. So it could look at sysctl. It could do benchmarking. It could look for specific hardware devices, licenses. Whatever you care about and whatever's important to you, your tool can evaluate these and then generate a score.
So once the score has been generated and returned to the controller, the controller decides which agents to use and which ones to use first. And so first, it looks at all the zero-scored agents, the ones that have a score of zero returned, and it just eliminates those. Those will not be used for your job. And then it looks at the different scores, and it'll arrange and prioritize the agents based on the scores you give. Higher scores are preferred.
So, for instance, if your score was the clock speed of the processor, you could use this to prefer machines with faster processors. If it was the memory that you were returning, you would be preferring computers with more memory. Now, in the case of the GPU, it's not that you want at least some number of GPUs or anything like that. You want a specific GPU. So jobs can specify conditions, criteria about what range is acceptable or what exact score is acceptable. So in the GPU case, you'd want to give each GPU a different numeric score. And then in the job, say only run this job on computers that have exactly the right score for the GPU I care about.
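The selection rules described above (drop zero-scored agents, prefer higher scores, and optionally require an exact score or an acceptable range) can be sketched as follows. This is an illustrative Python model, not the controller's code: the function name and the `exact` and `minimum` parameters are hypothetical stand-ins for whatever the job specification actually uses to express conditions.

```python
def eligible_agents(scores, exact=None, minimum=None):
    """Order agents for a job based on the scores a ranking tool returned.

    scores:  dict mapping agent name -> numeric score
    exact:   if given, only agents with exactly this score qualify
    minimum: if given, only agents scoring at least this much qualify
    """
    kept = {}
    for agent, score in scores.items():
        if score == 0:
            continue  # a zero score always eliminates the agent
        if exact is not None and score != exact:
            continue
        if minimum is not None and score < minimum:
            continue
        kept[agent] = score
    # Higher scores are preferred; ties are broken by name for determinism.
    return sorted(kept, key=lambda agent: (-kept[agent], agent))


# Hypothetical scores, e.g. one numeric code per GPU model.
gpu_scores = {"xserve1": 3, "xserve2": 1, "imac": 0, "xserve3": 3}
any_gpu = eligible_agents(gpu_scores)
only_model_3 = eligible_agents(gpu_scores, exact=3)
```

With `exact=3`, only the agents whose ranking tool reported GPU code 3 remain, which is the GPU-matching case from the talk.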
And then you can use more than one agent ranking tool and include them with your job submission. And then what we'll do is we'll just take all those scores and we'll multiply them together and normalize them. So that basically, if any of your ARTs return a zero score, then that agent's not going to be used. But otherwise, we're going to multiply them together and do the prioritization that way.
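Combining several ranking tools by multiplying their scores can be sketched like this. The talk does not say how the normalization is done, so dividing by the largest product is purely an assumption here, as are the function name and the data layout.

```python
from math import prod


def combined_scores(per_tool_scores):
    """Combine several ranking-tool scores per agent into one normalized score.

    per_tool_scores: dict mapping agent name -> list of scores, one per tool
    Normalization (an assumption): divide each product by the largest product.
    """
    products = {agent: prod(scores) for agent, scores in per_tool_scores.items()}
    top = max(products.values(), default=0)
    if top == 0:
        return {agent: 0.0 for agent in products}
    return {agent: value / top for agent, value in products.items()}


# Two tools per agent; a zero from either tool rules the agent out entirely.
combined = combined_scores({"a": [2, 3], "b": [4, 0], "c": [1, 3]})
```

Because the scores are multiplied rather than added, a single zero from any tool acts as a veto, which matches the behavior described above.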
Kind of a dry description. So I have a picture here to show you how this works. So you start out, the client has the job and the agent ranking tool. The agent ranking tool is that nifty looking, spinning magnifying glass. I'm not quite sure what that is, but it looks pretty cool. So it gets sent off to the controller. And the controller monitors and looks for the agents that are available. And once it's found them, it sends the agent ranking tool out to them first. So the ART is out at the agents and it's running. And it generates a score. And so in this case, it's returned a non-zero score for the first two computers, but not the last one. Apparently, this ART doesn't really like iMacs. I don't know. But so once the controller has retrieved these results, and it knows which scores have come back from which agents, it can then send the tasks out. And so in this case, the tasks only go out to the ones that have been shown as acceptable by the agent ranking tool. And then finally, once the agents are done, the results go back to the controller, and then the client can retrieve those results. So that's the basic Scoreboard workflow. It's pretty handy to be able to use this to limit where your job runs to a very specific subset of the grid.
The third feature that I want to talk about today that we've added for developers is task feedback. And the issue here is that the scheduler doesn't know everything about what's going on on the agents. I mean, it knows whether the agent's there or not, but it doesn't really know about the environment, and it doesn't know about how the task is running in that environment.
So sometimes the task itself can detect a condition that makes that particular agent that it's running on at that moment unsuitable. For instance, maybe the disk fills up. You're doing some sort of transcoding, and then you realize, hey, I don't have enough space to actually finish writing out this file. That's not going to work. And so right now, your task could just fail, and then your job would fail, and then the user would see the job had failed, and that wouldn't be very exciting or make anyone happy. So what you really want to have happen there is to have the task restart. But what's going to happen here is that if we just told the controller to restart the task, it would say, oh, well, that computer that was just running the task is available now, I'll just send the task right back to that same guy. And that's not what you want, because the disk is still full over there. So you want to make sure that you can say, well, let's restart this task, but don't do it here, please. And so now you can provide this information back to the controller from your task about the current situation on the agent and give the scheduler more information about what you want it to do with your task to make better decisions about where to send it in the future. So how this works is that agents can, to their standard out, send-- in addition to any logging information or output that you're sending there, they can enclose some data in a property list inside an xgrid XML element.
And the content of this element is a property list. It says, hey, here's what's going on on this agent. It tells the controller what it needs to know. And so there's a few messages you can send here. And so the first one sort of gets back to that example I gave, which is that you could say this agent is unsuitable for this task.
The task has determined that something's wrong with this agent as far as it's concerned, and it just doesn't ever want to run on that agent again. And then if you know that all of the tasks in your job have very similar characteristics and they're going to have the same opinion about that agent, you can say, well, just don't run any more tasks from this job on this agent. Or if that's not the issue, maybe you just detected a transient error condition in your task. And so you could just say, just retry it, and then the controller will have the opportunity to retry it wherever it wants. It may want to retry it on that same agent. You could also say retry job, which basically throws out all of the results of all of the tasks that have completed so far, and then rewinds the job back to the beginning and resubmits all of the tasks out again, and they start working again. Now, if your tasks keep telling the controller to retry the tasks or retrying the jobs, eventually the controller will say, hey, I've retried too many times.
We're just going to give up. So that's a configurable value. You can say how many retries are acceptable to you. And then finally, you can just fail the job. And this is already the behavior, but we've made it explicit where if the task just says, you know what? Things are messed up. I was trying to connect to this file server, and apparently it's not even available. Let's just fail it so we can let the user know as soon as possible that something is really wrong that they need to address out of band from Xgrid.
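The feedback mechanism described above (a property list enclosed in an xgrid XML element on standard output, with a handful of possible messages and a retry limit) can be sketched in Python. Everything specific here is an assumption for illustration: the talk does not give the property list keys, the message spellings, or the default retry limit, so the `action` key, the action names, and `max_retries=3` are all hypothetical.

```python
import plistlib
import re


def emit_feedback(action):
    """Build the text a task might print: a plist wrapped in an <xgrid> element.

    The "action" key and its values are hypothetical, not Xgrid's real schema.
    """
    body = plistlib.dumps({"action": action}).decode("utf-8")
    return "<xgrid>" + body + "</xgrid>"


def parse_feedback(stdout_text):
    """Extract every feedback dictionary embedded in a task's standard output."""
    chunks = re.findall(r"<xgrid>(.*?)</xgrid>", stdout_text, flags=re.DOTALL)
    return [plistlib.loads(chunk.encode("utf-8")) for chunk in chunks]


def decide(action, retries_so_far, max_retries=3):
    """Return what a scheduler might do next for one feedback message."""
    if action == "fail-job":
        return "fail job"
    if action in ("retry-task", "retry-job") and retries_so_far >= max_retries:
        return "give up"  # too many retries, per the configurable limit
    return {
        "agent-unsuitable-for-task": "reschedule task on another agent",
        "agent-unsuitable-for-job": "exclude agent from job and reschedule task",
        "retry-task": "reschedule task on any agent",
        "retry-job": "discard results and resubmit all tasks",
    }[action]


# Ordinary log output and feedback can share the same stream.
task_output = "wrote 10 frames\n" + emit_feedback("retry-task") + "\ndone\n"
messages = parse_feedback(task_output)
```

Only the retry messages count against the limit in this sketch; whether the real controller also limits the "unsuitable" messages is not stated in the talk.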
So Task Feedback and Scoreboard are pretty similar. And in fact, you could get most of the features of Scoreboard by using Task Feedback. Basically, your task would go out there, it would start running, it would use sysctl or do its benchmark, whatever it needed to do, and determine, hey, this computer is unsuitable. Never run any of the job's tasks here again. But that's not the only thing you could use it for. And in fact, you can actually use these together. And I have an example here of why. So it might be that memory integrity is really important to you, like critical to you that just no bits get flipped ever. And so you've already bought some Xserves that have ECC memory, but not all of your Xserves have it. And so you want to use Scoreboard to say, only run my tasks on the computers that have ECC memory. And so you'd have an ART that just looks at that system profile value and determines what kind of memory there is, and then returns either 1, yes, it's ECC memory, or 0, no, don't run the task here. But once the tasks do go out to those agents, the tasks may also have their own sort of checksumming or error correcting.
Or maybe they just run the calculation five times, because you just absolutely have to be sure that cosmic rays did not give you the wrong answer. And so what you're going to do here is that you're going to do it a bunch of times and compare the results. And if you determine that, well, the results aren't the same across all these runs, maybe this agent isn't really as reliable as we were hoping.
Maybe this ECC memory, there's something more to it than just ECC memory, and something's going wrong here. So let's just not use this agent anymore for this job, just to be safe. And so this would be a way that you can use these two features together to really control where your tasks get sent to initially, and then also where they remain and where they get scheduled at in the future.
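The ECC example can be sketched in two parts: a score function in the style of an ART, and a redundancy check that a task might run on itself. This is a rough Python illustration under stated assumptions. The function names `ecc_score` and `run_redundantly` are hypothetical, and the `ECC: Enabled` line format is an assumption about what the system profile text looks like, not something taken from the session.

```python
def ecc_score(profile_text):
    """ART-style score: 1 if the profile text reports ECC memory, else 0.

    Assumes (hypothetically) that the profile contains a line such as
    "ECC: Enabled"; anything else, including a missing line, scores 0.
    """
    for line in profile_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "ecc":
            return 1 if value.strip().lower() == "enabled" else 0
    return 0


def run_redundantly(compute, runs=5):
    """Run the same calculation several times and compare the results.

    Returns (result, agreed). If any run disagrees, result is None and
    agreed is False, which is the point at which a task could tell the
    controller not to use this agent for the job anymore.
    """
    results = [compute() for _ in range(runs)]
    agreed = all(r == results[0] for r in results)
    return (results[0] if agreed else None), agreed
```

The score decides where tasks are sent initially; the redundancy check is what lets a running task change its mind about an agent afterwards.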
So at this point, I would like to invite Steve Simon up on the stage. He's going to show you a demo of Scoreboard in action. And so we're going to be using this rack down here. I realize not all of you can see what's going on with this rack. So we have set up a video camera that will be showing on the screen what exactly we have going on this rack. And you'll be able to see which computers are being used when.
Okay, so for the demo, the first thing that I would like to show you is that all we've done is add one line to the job specification. We get the ART data and insert it into the job, very simple. And then there's one other step, which is later on, when we set the conditions, we just insert one that says we want them to be exactly equal.
And if we take a look at what an ART controller looks like, this is a very simple one. In the rack of Xserves here, we have three types. We have red, green, and blue. And each of them will return a score based on this program here. So if it's red, it will return one. Greens are two, and blue is three. And if it failed to return a score, it would give us a zero, and then nothing would run on that one.
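The scoring in the demo can be sketched roughly as follows. This is a Python illustration only, not the program shown on screen: `COLOR_SCORES`, `color_score`, and `eligible_agents` are hypothetical names, and how an agent learns its own color is left out entirely.

```python
# Scores as described in the demo: red is 1, green is 2, blue is 3.
COLOR_SCORES = {"red": 1, "green": 2, "blue": 3}


def color_score(color):
    """Return the score for a color, or 0 when no score can be produced."""
    if not color:
        return 0
    return COLOR_SCORES.get(color.strip().lower(), 0)


def eligible_agents(agent_colors, wanted_color):
    """Pick the agents whose score is exactly equal to the wanted score.

    A score of 0 never matches, so nothing runs on an agent that failed
    to return a score.
    """
    wanted = color_score(wanted_color)
    if wanted == 0:
        return []
    return sorted(name for name, color in agent_colors.items()
                  if color_score(color) == wanted)
```

For example, `eligible_agents({"x1": "red", "x2": "blue"}, "blue")` selects only `x2`, which mirrors submitting a job against the blue profile.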
Back in Xgrid Admin, you can see we have a grid here. And I have Grid Color. Grid Color is a small application that's going to run our jobs. So first I'm going to pick all the Xserves with the blue profile and submit a job. And you can see over here that, in fact, the blue Xserves are running and the others are not. And then you can also fire off a green job. And now the green ones are lighting up. You can see what's going on in the agent. The green ones are still working.
And there they go. And that's really all there is to it. It's very easy to use Scoreboard. And that's the demo.
Thank you, Steve. So that's Scoreboard. We're really excited about it. It's based on your feedback. Like I said, we didn't really think you wanted to care where your stuff ran. But if you want to run all your stuff on red Xserves, be my guest.
So the last section of this talk is to discuss the new features that we've added to make it easier for you to deploy Xgrid. But first, let's say what other administrative tools already exist for Xgrid that are already in Tiger and continue in Leopard. So first of all, there's the Sharing preferences. This is where you configure the agent on Leopard systems, on non-server Leopard systems and also Tiger systems. There's a little check box, Xgrid. There's a Configure button. You can use it to set up Xgrid. There's also Server Admin. And we still have this, although the UI has changed a little bit for Leopard. This is used to configure both the agent and the controller on Leopard Server. You can turn it on and off, set what kind of authentication you care about, and a couple of settings.
And then we have Xgrid Admin, which Steve used briefly there. Once you've deployed your grid, this lets you monitor the actual state of your grid, look at the job queues, arrange agents into grids, and see their status and see if they're offline or if there's some problem. So what we've done now is added a new simplified Server Admin setup for Xgrid to just make this easier for you. So there are fewer steps to deploy Xgrid on your machines.
So basically, this gives you the choice of setting a couple of behaviors. You can either say you want a particular machine to host the grid, or you want it to participate in a grid. And then we also give you the option of turning Xgrid off using the same mechanism. So the simplified Server Admin setup does require existing network infrastructure. We don't set that up for you.
You're going to need DNS. You're going to need Open Directory, because we're going to be using Kerberos authentication. So we need Open Directory or some other directory service available. But one thing that we have done, though, is in Server Admin, you can use Service Configuration Assistant to set up OD. So you can actually go up to the menu in Server Admin, choose Service Configuration Assistant, and choose to set up OD. You check the Open Directory box, and then you check the Xgrid box, and it'll set up OD for you. You just say, make this an Open Directory Master, and click Continue, and it'll set that up for you. And then it'll run you through the Xgrid setup. And so at this point, as long as you have DNS set up somewhere on your network, you can get your box hosting a grid and running an Open Directory Master with just a few clicks using the system.
The one more thing that the simplified Server Admin setup does is it sets up a shared file system for you. And so let me digress for a moment here to explain what this is all about. So what we're doing here is we're adding a new data staging mechanism for Xgrid. As I said before, you can include your job executables. You can include input data with your job submissions, and they'll get moved out across the network to the agents. And this is very convenient, but it's not the most efficient mechanism. And so there can be a lot of memory overhead for the controller to do this. It's parsing XML that contains this data.
And it's going to have to copy this data to all of the agents. And it's actually going to have to copy it to the agents before the agents even begin running the task. So if this is a large file, a one gigabyte file, you're going to have to wait until the entire thing gets there before the task starts, even if the task maybe just wanted a little piece of that large file, or maybe just wanted to start reading at the beginning before the end arrived. So rather than trying to solve that within Xgrid, we want to acknowledge that other people have solved this problem before and they've created network file systems. And so what we're doing is giving you an area where you can put your job data. You just have to put it there and then refer to it at that path in your job submission. And we're going to make sure that all the agents that you've configured to connect to your controller using the simplified setup and that are bound to this directory domain will get the mount records and will be able to access the shared file system in exactly the same place across all the computers. And so now you can put your large data there, and when it's time for the tasks to run, they just get to grab and stream back the data that they care about from the file server. So we'd like you to use Xsan. This is a great shared file system. It's high performance, but we know it's not for everyone. And so we're giving you this other alternative.
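The benefit for a task that only wants a little piece of a large file can be sketched like this. It is a minimal Python illustration, assuming the data sits at some path on the shared file system that every agent sees in the same place; `read_slice` and its parameters are hypothetical names, and nothing here is specific to Xgrid.

```python
def read_slice(path, offset, length):
    """Read only `length` bytes starting at `offset` from a file.

    On a network file system the task asks the file server for just this
    range, instead of waiting for the whole file to be copied to the
    agent before the task can start.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

A task working on one chunk of a one gigabyte input would call this with its own offset, so the cost scales with the chunk it needs and not with the size of the file.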
So what does it look like? First of all, when you begin the Service Configuration Assistant, you're given an explanation of what the prerequisites are. And so this is the DNS and Open Directory. And you're just told what's going to happen. And when you click Continue, it goes out and checks to make sure that all those prereqs have been met. And then you're given the option to choose what exactly it is you want to do. Do you want to host a grid? Do you want to join a grid? What do you want to do? So in this case, we're going to host a grid. And then we just need to enter our directory administrator username and password. And we're going to use this credential to create the export and the mount record in the directory so that everyone can get the shared file system.
Once you've entered that, that's basically it. Now we're giving you an option to confirm all the settings you've chosen. There weren't a lot to choose. And as soon as you click Continue here, we're going to actually go out and do the work. And so what work do we do? Well, we enable the controller, which is like checking the box in the old Server Admin UI. We're also going to enable the agent, and we're automatically going to point the agent at that same controller that we just configured on that computer. So you get both. You get a fully operating Xgrid controller and agent connected to each other using Kerberos authentication just by doing this one step. As I said, Kerberos authentication is enabled so you can feel secure that your data is private and confidential. We're also going to create the export record on the controller and we're going to put the mount record in the directory so all of your agents that are bound to that directory will automatically mount that file system. We're sharing it over NFS in the Leopard seed and we'll be evaluating what we're going to do with that as we move on with the Leopard schedule. So the file system is created, there's a folder created at /xgrid and then the file system is mounted on all of the agents and the controller at network/xgrid.
Joining a grid is very similar to this. You just choose the second behavior, join a grid. In this case, you're going to have to choose which controller to use on your network. You can either browse Bonjour discovered controllers, or if you have a controller on another subnet, you could just enter the host name or the IP address right there. Although you actually do need a host name, because Kerberos is going to use that host name to make sure that it's connecting to the right server.
So then if that server that you're configuring hasn't already been Kerberized and doesn't already have Kerberos principals, we're going to do that for you. And so we need the directory administrator password again. But if that server has already been set up with Kerberos principals, then we're going to skip this step. And then finally we give you the option to confirm. And again, there weren't really many choices to make. So you're just confirming that you chose the right controller here.
And finally you hit Continue and it does the work for you. The agent is enabled and pointed at the controller you chose. Kerberos authentication for the agent is enabled, and we made sure that the principal was created if necessary.
So that's about it for this talk. Those are the new features that we have. So in summary, we've added Xgrid Scoreboard. That's the big new feature I think you guys are going to be excited about, to let you choose which of your machines to use for a particular job. We've added task feedback to evaluate the conditions as they change there. We have Xgrid anywhere, so it's easier to write your software for people to use so that they don't need to set up Xgrid to get started with it. And then once they're ready to scale beyond single computer performance, they just need to set up Xgrid and your application is going to work and act exactly the same. And then we have this new service configuration system to make it even easier to deploy Xgrid. So all that together, that's what we have for you in the Leopard Seed. You can look forward to some additional improvements. And I hope you guys enjoy. Thank you.