Enterprise • 55:15
Whether it's the PowerBook G4 – for UNIX to go, the Power Mac G5 – for serious computational horsepower on the desktop, or the immense power and scalability of Xserve G5 – and Xserve RAID, Apple delivers an ideal platform for scientific computing with an exceptional price/performance ratio. View this session to learn about how Apple products are driving momentum in scientific markets and hear how science professionals are using Mac OS X to increase productivity.
Speakers: Bud Tribble, Elizabeth Kerr, Chen Peng
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it may have transcription errors.
All right, welcome to Apple Solutions for Science at WWDC for 2004. Thank you. I'm Bud Tribble, Vice President for Software Technology. And what I'd like to do is go through a few of the trends we're seeing in scientific computing. And for those of you who've been watching Apple in scientific computing over the past few years, it's been an incredible explosion for us. We're seeing every year more and more scientists adopting the Mac. And I occasionally attend scientific conferences and you just look around the room and see how many people are using the Mac. And these days, a typical bioinformatics conference, maybe 30%, 40-plus percent of people with Macs. So it's really gratifying to see.
My original background is in science, and so I sort of warm up every time I see a scientist using the Mac because I know that they're just being a lot more productive. Trends that we're seeing. And any of you who are in the scientific community, there'd be nothing really surprising about these, because I'm sure you are seeing them too every day. The first is exponential growth of scientific data. Next is clustering for cost-effective performance. And for those of you who are running clusters, you know this already. But if you're not running a cluster, a cluster is probably soon in your future. Strong focus on application optimization. And getting to results more quickly is what science is all about, getting your results published quickly. Tuning your app so you can crunch through the data more quickly can be incredibly productive. And I'll talk about that a little bit. Ease of deployment and administration. If you're a scientist, your job is not to tinker with the computer. Your job is to do science. And one of the things Apple does best is ease of use. And in the case of scientific computing, that includes ease of management, ease of administration, ease of setting up. Portable Unix. Taking your laptop or your PowerBook with you with your complete environment on it. You know, scientists are a very mobile population. They're on planes, going to conferences. Being able to take your environment with you is incredibly productive. And Apple does a great job at that. I'll spend a little bit of time on 64-bit. And 64-bit, I think you're seeing the first instance of that was 64-bit address space in Tiger. It's in your preview release. There's a 64-bit compiler there.
I think this is going to be a big deal for the scientific community. And then finally, open standards-based tool development. And let's just get started. So exponential data growth. This graph shows over 18 months, the top curve is the size of the genomic database that's out there available on the web, and the lower curve is actually Moore's Law. You can see that the growth of data, and, you know, bioinformatics is just one example of where this is happening, but the growth of data is huge. Disk storage is exceeding Moore's Law in terms of what we can offer you in terms of disk storage. And I was saying yesterday we're down to about $3 per gigabyte with the Xserve RAID. So this kind of changes the equation. In fact, it also changes sometimes the algorithms you use, the fact that you can have huge amounts of storage online. One of the trends that's not noted as a trend here that I'm noticing is that many problems are becoming more amenable to a brute force approach that in the past would take very arcane algorithms. And these days, you don't spend a lot of time if you can just brute force it because the hardware and the storage costs have come down so much. An interesting case study here, Swinburne University Center for Astrophysics and Supercomputing in Australia. They chose Xserve RAID for price and performance. Now, they have over 13.5 terabytes of astrophysical data accessed by a 130-node cluster. They have over 1 terabyte of data generated daily. So this is a huge amount of storage, a huge amount of bandwidth being consumed.
Xserve RAID is connected by Fibre Channel to their server cluster, and a quote from Professor Matthew Bailes says, the performance of Xserve RAID is quite exceptional. It can easily handle sustained read and write operations at 100 megabytes per second on a single channel, which is twice as fast as the previous generation RAID equipment that they were using. So Xserve RAID with Fibre Channel, and now if you need a SAN storage solution, Apple has that too with Xsan. Really a complete, very cost-effective solution for performance-- I mean, for storage.
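The figures quoted for storage price, data volume, and channel throughput can be checked with some quick arithmetic. The sketch below is only a back-of-the-envelope illustration: the function names are made up for this example, decimal units (1 TB = 1000 GB) are an assumption, and the cost covers raw disk at the quoted price only, not the cluster or networking.

```python
# Back-of-the-envelope storage arithmetic using the figures quoted in the talk.
# Decimal units (1 TB = 1000 GB, 1 GB = 1000 MB) are an assumption.

GB_PER_TB = 1000
MB_PER_GB = 1000
SECONDS_PER_DAY = 24 * 60 * 60


def storage_cost_usd(terabytes, usd_per_gb=3.0):
    """Raw disk cost at a flat price per gigabyte."""
    return terabytes * GB_PER_TB * usd_per_gb


def average_rate_mb_per_s(terabytes_per_day):
    """Average sustained rate needed to absorb a daily data volume."""
    return terabytes_per_day * GB_PER_TB * MB_PER_GB / SECONDS_PER_DAY


def hours_to_write(terabytes, mb_per_s=100.0):
    """Time to stream a volume through one channel at a fixed rate."""
    return terabytes * GB_PER_TB * MB_PER_GB / mb_per_s / 3600


if __name__ == "__main__":
    print(f"13.5 TB at $3/GB: ${storage_cost_usd(13.5):,.0f}")
    print(f"1 TB/day average: {average_rate_mb_per_s(1):.1f} MB/s")
    print(f"1 TB at 100 MB/s: {hours_to_write(1):.1f} hours")
```

Under those assumptions, 13.5 terabytes comes to about $40,500 of disk, a terabyte a day averages roughly 11.6 megabytes per second, and a single 100 megabyte per second channel could stream that terabyte in a little under three hours.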
Clustering. And, you know, clustering is just exploding in the scientific community. There's a move from supercomputer custom architectures to clusters, fairly inexpensive 1U systems. Xserve, we actually offer a special configuration of Xserve specifically for building clusters. So it's sort of a stripped-down version, a little bit less expensive, and the licensing is sort of tuned for building clusters. I forget exactly what we call it, but it's on the web page. Combine that with inexpensive storage with Xserve RAID. We have a product called Xgrid, Xgrid 1.0. It's on your disk. You get it with Tiger. It's included with Tiger. And it's a grid computing solution, distributed computing for the rest of us. It's produced by our Advanced Computation Group led by Dr. Richard Crandall. It's basically an easy way to submit and run computational tasks. And as you know, grid computing is typically amenable to situations where you're even using spare desktop cycles around your institution. So grid computing lets you run tasks on computers you don't necessarily manage or own versus clustering where everything is managed very, very closely. Xgrid supports, as I mentioned, either dedicated managed resources or ad hoc resources where someone just offers up some of their desktop cycles. Xgrid handles the hard work of connecting the nodes into a cluster, monitoring the node activity, scheduling the tasks on the nodes, copying the executables and data to the nodes, and staging the output and collecting the results. And most recently with Xgrid 1.0, we now have MPI support. So this is grid computing for the rest of us. There are other solutions, commercial solutions and otherwise on the market. But this is one with the sort of Apple added value of it just works out of the box. It's incredibly simple. Nodes can use Rendezvous to find each other on the network. So a very simple, easy-to-use grid computing solution.
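The pattern described here, splitting a job into independent tasks, scheduling them onto whatever workers are free, and collecting the results, can be sketched in a few lines. This is not the Xgrid API: a local thread pool from Python's standard library stands in for the networked nodes, and the task itself (counting G and C bases) and the names `count_gc` and `run_job` are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def count_gc(sequence):
    """One independent task: count G and C bases in a DNA string."""
    return sum(1 for base in sequence if base in "GC")


def run_job(tasks, workers=4):
    """Hand each task to a free worker and gather results in submission order.

    The thread pool is a local stand-in for the networked nodes described
    in the talk; it is not the Xgrid API.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_gc, tasks))


if __name__ == "__main__":
    chunks = ["GATTACA", "GGCC", "ATAT"]
    print(run_job(chunks))
```

The point of the sketch is that the caller only describes the tasks; which worker runs which task, and in what order they finish, is left to the scheduler, which is the division of labor the talk attributes to Xgrid.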
I'll mention Princeton Center for the Study of Brain, Mind, and Behavior. And this group is using clusters. They have a 64-node cluster, one head and 63 nodes in the cluster doing computation. They do brain activity mapping and neural net simulation. So what they do is take MRI scans and look at the spin decay to figure out what the blood flow is through the brain and have the subject do different activities, I don't know, watching movies or whatever, and look at what parts of the brain get activated in terms of blood flow. They also do some neural net simulation. The data sets from a single MRI scan will be 2 to 10 gigabytes, so huge data sets that they're dealing with. On their cluster, they use a variety of applications, MATLAB, iNquiry, FSL, SUMA, BrainVoyager. This is a mix of both commercial applications like MATLAB as well as open source applications. So kind of the full combination of applications, whether proprietary or open source, is available on a Mac cluster. One of the things they found in moving to this cluster was that a single G5 at 2 gigahertz was up to 10 times faster than the previous SGI Origin 2000. So this was a huge step up for them. The other thing they noticed is that they had been running in their server room Dell PowerEdge. They had a couple Dell, two Dell PowerEdge servers running, and they brought up the 64-node Mac cluster. And they were standing there listening to it, and it seemed really, really loud. And it turns out that all the noise was coming from the Dell PowerEdge. The Dell PowerEdge was way louder than 64 1U nodes from us.
So interesting advantage there. By the way, I'll just go back quickly. You know, that is a big advantage for G5 clusters in that the power and cooling that you need for a G5 cluster is actually, we do a lot to make sure that that is optimized. And power and cooling can be a major cost in deploying clusters.
Easily up to 20%, 30%, 40% of the cost of deploying a cluster can actually be power and cooling. So, you know, when you price these things out, one of the things you have to look at is that aspect as well. And we do quite well there. I mentioned application optimization. And this is an area that we spend a lot of effort in providing tools. Xcode itself, of course, the key there is fast turnaround so that you can modify your code and make it faster and try it out. But we also have a number of tools that plug into Xcode and can be used with Xcode. One of those is called the Shark set of tools. Shark allows you to do kind of extreme profiling of your application, find out exactly where the bottlenecks are, where the application is spending its time, so you can focus in your efforts on either redoing that algorithm or hand-coding that inner loop or whatever it takes to get that code running as fast as possible. We have CHUD tools, C-H-U-D.
CHUD tools are kind of the next level down, at actually looking at what's going on at the instruction pipeline level, at the cache hit and miss level, so you can really optimize that inner loop. And that's very important because as you do this sort of activity you find that there are huge performance gains. We also have third parties that have tools that help you optimize, so IBM now supports XL C and XLF, their highly optimized C and Fortran compilers for Mac OS X with the G5. There's Crescent Bay, which is an auto-vectorizing compiler. And then NAG with NAGWare, their optimizing Fortran compilers. So a ton of tools to help you squeeze the most performance out. And just an example of the kinds of improvements you can see by doing optimization. PyMOL, which I'm sure a lot of you are familiar with, molecular visualization tool. Warren DeLano achieved a 400% speedup using Shark for profiling in just a few days.
So zeroing in on those bottlenecks and fixing those algorithms or fixing that code, 400%, with a few days' investment. HMMER, hidden Markov modeling for protein sequence analysis. Erik Lindahl, in just a few hours using the CHUD tools that I mentioned, achieved 650% speedup. So the message here is the tools are very, very simple and easy to use. They're there on the disk.
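The workflow described above, profile first, find where the time goes, then rework only that hot spot, is not specific to Shark or CHUD. As a rough illustration, the sketch below uses Python's standard `cProfile` and `pstats` modules as a stand-in profiler; the functions `slow_pairs`, `fast_pairs`, and `profile_report` are hypothetical examples and have nothing to do with PyMOL or HMMER.

```python
import cProfile
import io
import pstats


def slow_pairs(n):
    """Deliberately quadratic inner loop: the hot spot we expect to find."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total


def fast_pairs(n):
    """Same result in closed form: (0 + 1 + ... + n-1) squared."""
    s = n * (n - 1) // 2
    return s * s


def profile_report(func, *args):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(slow_pairs, 300))
```

The report names `slow_pairs` as the place where the time is spent, and `fast_pairs` shows the kind of algorithmic replacement that profiling can justify. Actual speedups depend entirely on the application.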
If you're using the G5 or the G4 at all for performance computing, take the time to run the tools, do the optimization. It won't be a huge investment in your time, but it will be a huge payback in terms of performance. Ease of deployment and administration. So Apple has a number of tools that make administration of especially clusters very easy, very simple. We have, as I mentioned, Xsan for storage area network solution. And we also offer our Workgroup Cluster for Bioinformatics, which is that box shown in the corner there, which is a pre-configured cluster with everything on it you need for bioinformatics, preloaded with applications. It's kind of a turnkey solution to get your lab up and running on a clustered system for bioinformatics, so kind of the ultimate in ease of use.
Portable Unix, I can't stress enough the niceness of this if you're in scientific computing. You've got not only a portable version of your complete system that you would use in your lab, but all the tools, whether it's Perl scripting or Xcode or the optimization tools I mentioned. But you've got the panoply of things you need to keep connected. So iChat, iChat AV, as you saw yesterday, Safari, Mail. FileVault is an interesting one. If you're in-- especially if you're in the commercial market or government market and you've got sensitive data on your system, you can encrypt your home directory, you can encrypt another volume on the system, and make sure that if you happen to leave your PowerBook on the seat of a taxi, that that data doesn't fall into the wrong hands. VPN for communication back to your home network, SSH, et cetera, et cetera. You saw the new AirPort product yesterday. That's going to be great for people in the hotel room. You just plug that in when you get to the hotel room, and it's about the size of the power module for the PowerBook. And you can take your laptop around the hotel room, share your network with other people in the hotel room if you want to, only if it's legal.
Example of the power of taking Unix with you, Dr. Jamie Cate at Berkeley does crystallography, and he had a large set of applications for crystallography that ran on Unix workstations, and when he left the lab, he had to basically leave his research behind. That's no longer true. So what he says is, "Apple's PowerBook G4 and Mac OS X have allowed me to use the same tools on an airplane on my way to a conference that I could only use before on my lab workstation." So for those of you who have not had the pleasure of a PowerBook and taking it on a plane and using it, I'd highly recommend that.
I'm going to spend a little bit of time talking about 64-bit computing. So 64-bit computing is not necessarily for everyone, but if you have massive data sets that you need to iterate over as part of a problem, you're going to find yourself in need of a 64-bit address space. And, you know, today you can, with Panther, you can put 8 gigabytes onto a Mac OS X G5 system, but an individual process can only use up to 32 bits of address space. And, in fact, you can probably get up to about 2 gigabytes of address space before things start to bump into limits. With Tiger, you will be able to compile and build a 64-bit address-space application. So we have a 64-bit version of GCC that will compile 64-bit code. We have a 64-bit version of libSystem. We're taking a staged approach here. So we did libSystem first. So that's libc and libm and other libraries. Those are converted to be 64-bit versions of the libraries.
You're compiling against those libraries when you compile 64-bit. What you don't have are the GUI libraries. So, for example, Cocoa has not been, or Carbon has not been converted to 64-bit. So what you will do to leverage 64-bit with Tiger is build a computational section of your application as a single process running in up to 8 gigabytes of memory. And that process will communicate with a front end of your application if you have a need for a graphical user interface. And certainly on a cluster system, the code running on the cluster nodes typically does not have a GUI anyway, and so that's a great fit.
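The split described here, a headless compute process that does the heavy work and a separate front end that talks to it, can be sketched with ordinary pipes. Everything in this example is an assumption made for illustration: the JSON-over-stdin protocol, the names `WORKER_SOURCE` and `ask_worker`, and the use of a Python child process in place of a compiled 64-bit executable. The talk does not say which inter-process mechanism an application should use.

```python
import json
import subprocess
import sys

# Stand-in for the headless compute process. In the arrangement the talk
# describes, this would be the 64-bit, non-GUI part of the application.
WORKER_SOURCE = """
import json, sys
request = json.load(sys.stdin)
values = request["values"]
json.dump({"mean": sum(values) / len(values)}, sys.stdout)
"""


def ask_worker(values):
    """Front-end side: start the worker, send a request, read back the reply."""
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=json.dumps({"values": values}),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


if __name__ == "__main__":
    print(ask_worker([1.0, 2.0, 3.0, 6.0]))
```

Because the two sides share nothing but a byte stream, the front end can stay a 32-bit GUI application while the worker is built separately, which is the property the staged approach relies on.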
Eventually in the future, we'll expand the number of 64-bit libraries, but for Tiger, it's confined to libSystem and the non-GUI libraries, and specifically targeted at scientific computation like very large modeling and simulation, things that really need to have a full 64-bit address space. We're using what's called LP64. That means that longs and pointers are promoted to be 64 bits.
Integers, int, will stay at 32 bits. This is the standard for Unix systems, so if you have 64-bit code running on other Unix systems or Linux, it should be easily portable to Mac OS X. The compiler has been outfitted so that, if you turn on warnings, it will give you a complete set of warnings if your app is not 64-bit clean. For example, if you're depending specifically on the size of int or the size of a pointer, it will flag that for you. So I highly recommend that if you have large data sets, you take a look at the Tiger preview release, try out the compiler. One caveat to note is that the binary format for 64-bit apps, 64-bit executables, will be changing with the final release of Tiger. So if you've compiled something for the preview release, you will have to recompile once the final Tiger comes out. 64-bit apps run right alongside 32-bit apps, so it's flagged in the executable, whether this is a 64-bit executable or a 32-bit executable. In fact, you can build your apps fat if you want, so that you can launch either as a 64-bit app or as a 32-bit app.
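To make the LP64 point concrete, the sketch below tabulates type widths for the common 32-bit model (ILP32) and for LP64, and shows why code that stores a pointer in an int is not 64-bit clean: the upper 32 bits are lost. The table reflects the standard definitions of the two models; the helper names and the sample address are invented for illustration, and `native_sizes` simply reports whatever the local platform uses, so its output varies by machine.

```python
import ctypes

# Type widths in bytes for two common C data models.
# ILP32 is the usual 32-bit model; LP64 is the one named in the talk.
DATA_MODELS = {
    "ILP32": {"int": 4, "long": 4, "pointer": 4},
    "LP64": {"int": 4, "long": 8, "pointer": 8},
}


def addressable_gib(model):
    """Size of the address space a pointer of this model can span, in GiB."""
    pointer_bits = DATA_MODELS[model]["pointer"] * 8
    return 2 ** pointer_bits // 2 ** 30


def truncate_to_int(address, model="LP64"):
    """Show what survives when a pointer value is stored in an int."""
    int_bits = DATA_MODELS[model]["int"] * 8
    return address & ((1 << int_bits) - 1)


def survives_int_cast(address, model="LP64"):
    """True if the address round-trips through an int without loss."""
    return truncate_to_int(address, model) == address


def native_sizes():
    """Widths used by the local C toolchain, as reported by ctypes."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "pointer": ctypes.sizeof(ctypes.c_void_p),
    }


if __name__ == "__main__":
    high_address = 0x1_0000_2000  # hypothetical address just above 4 GiB
    print(addressable_gib("ILP32"))
    print(hex(truncate_to_int(high_address)))
    print(survives_int_cast(high_address))
    print(native_sizes())
```

A 32-bit pointer spans 4 GiB in total, which is why a single process cannot use all 8 gigabytes in the machine. The address `0x1_0000_2000` comes back as `0x2000` after truncation, which is exactly the class of bug the compiler warnings mentioned above are meant to catch.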
Example where this might be useful: Vertex Pharmaceuticals. They use Power Mac G5 to accelerate their drug development, which targets viral diseases, inflammatory diseases, and cancer. And Tiger is going to allow them to transition their critical molecular modeling application, which really has 64-bit addressing requirements, transition that to a G5. A quote from Joshua Boger, who's the chairman and CEO: "Mac OS X 64-bit memory management will allow Vertex to rapidly interact with huge libraries of chemical structures and advance our drug discovery process."
Leveraging open source, I can't stress this enough, we have over 100 open source technologies that are projects that are incorporated into Tiger, everything from Apache to Perl to Python to OpenLDAP to Berkeley DB, MySQL, JBoss, you name it, we pretty much have it. And those packages are included in the release so that the code runs out of the box and, more importantly, when updates come out from those projects, we incorporate them into a system update for Mac, so you get the most recent versions or the security patches kept up to date for these projects. Makes it incredibly easy to use these. Beyond the ones we package, of course, there's a huge number, thousands really, of open source packages available from SourceForge or from Fink, and those applications are basically, you know, the doubling rate is about doubling every year in terms of number of open source projects available on the Mac. So that's an incredible resource for you so that you don't have to reinvent the wheel. If you need to get something done, the first thing you should do with Mac OS X is go look, see if we've got an open source package that accomplishes the task you want to accomplish so you don't have to write code from scratch. Examples of applications available include things like NCBI Toolkit, EMBOSS, PyMOL I mentioned earlier, Globus, WU-BLAST. We also have a version of BLAST that has been optimized by Apple and Genentech, A/G BLAST, highly optimized for the G5, Amber. So a huge number of tools available out there for scientific computation.
So to summarize the trends we're seeing, number one, huge growth in scientific data. So you can expect Apple to continue to focus with products like Xsan and Xserve RAID, continue to focus on providing cost-effective storage, very high bandwidth to storage. Clustering for cost-effective performance, and that is our strategy. We have 1U servers. We don't make huge big iron. We don't make 64-way SMP. We are all about optimizing 1U form factor for building clusters, making sure that we can give that to people as inexpensively as possible. Xgrid for building ad hoc clusters. Strong focus on application optimization. I mentioned the Shark tools and CHUD. You can expect to see additional performance-related tools coming from Apple. This is the way to squeeze the most performance out of your G5, make sure you're getting the absolute most you can get for your application. Ease of deployment and administration, out of the box, turn it on with the workgroup bioinformatics cluster and you've got a cluster under the desk in your office if you want.
Portable Unix. I can't stress enough the productivity gain that you get from being able to take your entire lab software with you wherever you go. 64-bit address space. And this is something that has been requested from us, and we are very pleased to be able to offer the non-GUI 64-bit app address space with Tiger. And this is an area where please, if you try it out on a preview release, please give us your feedback on what you're finding. You're the guys that have the 64-bit apps and we're committed to make this the best 64-bit system we can. And then finally, open standards-based tools development. And there basically is not an open source tool out there that has not, at this point, been ported to Mac OS X. And that's a huge leverage point for you to not have to reinvent the wheel.
So Apple really, in my mind, is the best platform for scientific computing today. If you look at all the tools oriented around it, available on it, the things that we're doing to enable clustering, the things that our partners are doing to enable scientific computing in a variety of areas, I really can't point to a system today that makes a better scientific computer. Thanks a lot. I'm going to turn it over at this point to Dr. Liz Kerr, and she's the Director of Scientific Marketing. She's going to talk about Apple in the SciTech market. So, thank you, Liz.
Thanks, Bud. It's really my pleasure to be here, and it's great to see so many faces out here this morning. What I'm going to talk to you about for the next 20 minutes is how both my team, the SciTech marketing team, and many other groups at Apple are working towards providing solutions and awareness out to the market to help really drive adoption of the Apple platform for scientific computing. I mean, we really think this is a perfect solution, and we want to help get that message out there.
One of the most important aspects of that-- oh, sorry, wrong way-- There we go. One of the most important aspects of that is really driving the awareness and communicating both to our customers and hearing from our customers. I'll go through some of the ways that we're doing that. One of the most simple ways is through trade shows. And we've done a number of these and plan to do more this year. BioIT World is one, and that's what this image is from, our booth there. Also ISMB, which is coming up in Glasgow, Scotland, a big bioinformatics show, and Drug Discovery Technology, which is a show that focuses more on the commercial aspects of science, biological sciences, pharmaceutical, and biotech. These shows are really important to us because they allow us not just the ability to talk to our current customers, but to let other people know that this is an area we're interested in and to hear from people who maybe we don't normally talk to. Thank you.
Another type of event that we're doing are focused customer events, where we actually go to a customer site and give them hands-on experience with some of our newer tools. These events shown in the images were to promote the Power Mac G5 and the performance of those computers for applications in scientific computing.
Another thing we're doing is focusing on advertising that goes specifically to our scientists. This is a little tongue-in-cheek. Obviously, this isn't an iPod, and the point isn't that we're focused away from that, but just that in many cases, our consumer advertising overtakes what our scientists see, and they don't think of us as a company that maybe makes computers that are really specific for the scientific market. So this is an example of an advertisement that's currently running in both peer-reviewed trade journals like Science and Nature as well as magazines like The Scientist and Genome Technology. And it really is a great ad because it focuses specifically on the Power Mac G5 and customers talking about why it's great for their use. We're also doing some online advertising. This is another great way to reach people who maybe don't normally think of Apple. And this is an online ad for the work group cluster for bioinformatics that Bud alluded to. And we'll talk a little bit more about this solution later.
Another thing that we're really pleased about is launching a science website on apple.com/science. This is the home page. This is really geared towards up-leveling all that information that's more technical and more geared towards both our scientific developers and our scientific customers so that they can find sort of a home for that information and find it more easily. We have lots of downloads and focus on both Apple solutions as well as our third-party solutions, but we also have success stories that focus on how customers are using Apple products as examples and to serve as an example for people who are interested in how they might use our technology to help solve their problems.
This is just a blow-up, because what I'm going to do is focus in on a couple of these areas and dig down a little bit just to show you what type of information is there. So in the upper right-hand corner, we're going to look at the applications for research. I wanted to point this one out because this is where most of the information from third-party developers and open-source developers live on the website. So we've got featured applications on this part of the web to raise awareness for particular applications, and this rotates on a regular basis, so we don't play favorites or anything.
We try to give everybody a chance to focus on their applications. There's also the Macintosh Products Guide, which is the comprehensive list of all the applications that are available that run on Mac OS X, both scientific and otherwise. There's also a Download section, so if you find an application you're interested in or somebody wants to download your application, for example, they can go to either the Math and Science part or the Open Source and Unix part, and all these have specific download sections.
So another part of this that's interesting, I think, is the resource section. And here we have it broken down into different categories. So if you are looking for a particular type of information, for example, high performance computing or software development, there's a part for Darwin resources, third party products. There are mailing lists and communities.
So if you're interested in joining a mailing list or a community to discuss your challenges or throw something out there and get a response back, that's right there. And there's also a lot of links to technical information. You can see-- maybe you can't see-- on the right hand side, where you can download PDFs about the Apple technology.
We also have been doing what we call SciTech initiatives and solutions, and also I want to talk a little bit about how we're judging the momentum that we're getting in the scientific market. Thank you. So one of the cornerstones of this is the Apple Workgroup cluster for bioinformatics. We're really, really pleased with this, because it really ties together the highly technical aspects of what we're providing to the scientific marketplace, plus the ease of use that Apple's known for.
And the idea is, I think as Bud alluded to, is to take the setting up the computer cluster out of the hands of the scientists and make it just really easy. Make it so that they can have the compute power without having to know how to manage a cluster, how to code in Linux, how to do any of that. This is geared to be something they can take out of the box, set up themselves, and have it running in no time. We announced this at Macworld in January, and we were really pleased that it won the Best in Show award at BioIT World for IT infrastructure. We're just really proud of that, and I think it really speaks to how the scientific community is viewing this. It's really being adopted for many uses. It's a bioinformatics workgroup cluster, but people are using it for biological research. They're also using it for application development. And interestingly, they're using it to develop curriculum and teaching programs for bioinformatics at the university level.
Just a couple examples. This one's from the Naval Medical Research Lab. Dr. Michael Schutt is using his workgroup cluster for bioterrorism research, and he installed and maintained this himself. He has no computer science background whatsoever. His favorite thing is to say all he needed was a screwdriver, and he was able to set the whole thing up himself. He had it up and running in 30 minutes. They really like the security aspect of this cluster because, of course, they're working on something that's very critical to the security of the country. They also liked having the applications which come with the workgroup cluster with the web-based interface. They liked having the accessibility of that without having to know command line because a lot of bench scientists don't know how to do that, and it's much easier for them to have a familiar GUI interface. The other thing about the workgroup cluster, which a lot of our customers like, and which is one of the things that was a deciding factor for the Naval Medical Research Center, is the scalability of the cluster. You can always add to this. If you find that your eight nodes isn't enough, you can double that or add two more nodes, or whatever you need.
Another example is from Idaho State University. Dr. Mike Thomas set out to design a bioinformatics curriculum for the university. They bought a five-node workgroup cluster. What happened was they set it up so much faster than they had planned that they were able to offer their bioinformatics course an entire semester early. The other thing that he had done, he had hired a person, a headcount to manage the cluster. Well, once it was set up and it was running, the guy had nothing to do, because it was just going and it was working. So they reassigned this person 75% of his time to do something else.
So they're using this to teach the very first course in bioinformatics at Idaho State University. A quote from him which talks a little bit about how this bleeds over into other areas of the university is, "I think the cluster is going to have a huge effect in our research environment, and I think it will help scientists here generate additional funding." So he sees this as a way of other scientists at the university referring to this resource and being able to hopefully boost up the value of their grant applications.
So one of the things we did to raise the awareness of this solution, the workgroup cluster for bioinformatics, out to the marketplace, is my team and the higher education marketing team put together a workgroup cluster awards program to recognize innovation in research. The goal was to give away five fully provisioned clusters with four dual processor Xserve G5s with two gigs of RAM in each. Comes with the software included, the BioTeam iNquiry package with over 200 informatics applications, the hardware infrastructure, the power supply, the cables, etc., and AppleCare support for three years. This is a great thing to win. The applicants were tremendous. We had hundreds of applications come in from all over the U.S. and we were just blown away by the quality and the -- just the time and effort it took -- people took to put these together. And from all aspects of research, from higher ed, government, nonprofit, as well as commercial customers.
I'm like, I hope it's not pink on the screen because it's pink there. Okay, we'll go with pink. So first I'd like to say, of the hundreds, we picked five winners, but we also picked five honorable mentions because, again, the quality of these was so incredible that we felt we wanted to extend the acknowledgment to at least ten of the applicants. So just very quickly, these are the five honorable mentions. The first from University of Washington where they're doing HIV evolution research. At Yale University, Dr. Kevin White doing genomic research on model organisms.
Caltech, Dr. Barbara Wold doing gene regulatory networks. At University of Pennsylvania, Dr. David Roos and colleagues are studying parasites and genomics of parasites. And at the Institute for Genomic Research, or TIGR, Dr. John Quackenbush is doing all kinds of things, but also software development and a lot of genomic database work. So now to the very pink winners for the Workgroup Cluster Awards. The first one on the list, and these are not in the order of first, second, third, fourth, or fifth, they're all winners.
UCLA, Dr. Christopher Lee for doing work in comparative genomics, an incredible application, incredible project. At Duke University, Dr. Simon Lin who is representing a group of scientists doing oncology research, an enormously extensible project that he's looking at doing with lots of software development that would be used by the entire oncology research community. At MIT, we have to have Dr. Edward DeLong for environmental microbial genomics, a really interesting topic, very unique.
And at University of Wisconsin, Mike Newton, Dr. Mike Newton, he's developing statistical techniques for genomic research to really, it's like a light show, genomic research to really expand the types of algorithms and such that people can use for that. And then finally, at Children's Hospital in Oakland, the research institute there, Dr. Deborah Dean is doing really, really state-of-the-art chlamydia genomics research, much more in the healthcare area. So those are our five winners of the Apple Workgroup Cluster Awards, and I'd like to just stop here and give a round of applause to all the applicants and winners. Thank you, guys.
Okay, moving right along. And just talking again about the momentum and awareness, we have gotten an enormous amount of press coverage, both from this awards program, but really primarily starting when we launched the workgroup cluster for bioinformatics and started showing up at things like Bio-IT World. And it's been really nice to see the press, both the Mac trade press as well as more general press and scientific press, really want to hear what Apple is doing in this space and paying attention to the efforts we're making to provide really great solutions to our scientific customers. I want to turn a little bit to talk about the developers and some of the work that you all have been doing. I think the number of new applications that have come on to Mac OS X and continue to come on to Mac OS X is overwhelming.
The list just keeps growing. These are four that are relatively new, either updated or new to the platform. From the Chemical Computing Group, we have the Molecular Operating Environment, or as we like to call it, MOE. MATLAB 7, an enormously popular program for our physical science customers. Geospiza is a company that does the Finch Sequencing Center, a great tool for managing sequencing labs.
And Gene Codes with Sequencher, another really popular program for managing DNA sequence data. What really drives that, I think, is the amount of developer support that our worldwide developer group provides to our scientific developers as well as others. And I just wanted to highlight a few things that we have on offer for our developers. There are at the Apple Developer Connection software development tools, hardware support, technical support and services, as well as business services. And that kind of moves back into my area a little bit, but co-marketing programs and program discounts.
This is a blow-up. I'm not sure how well you can see that, but this is what, especially now, this is what you would see for a particular application on our website, and it just is a nice highlight with a description of the program and information about where they get it, who the company is that makes it, or the individual. These all live on apple.com/science. They also are all in the Macintosh Products Guide. We do press release support for developers that are doing a big release. We'll help with promoting that. This year at all of our scientific conferences, we're inviting partners, specific partners, to join us in our booth to help show the solution of Apple hardware and Mac OS X with some of the key scientific applications for that particular audience that we're addressing. And then success stories. We're not just doing success stories of our customers, but we really want to focus on how our developers are using Mac OS X, for other scientific developers to look at and use as examples for their own work.
So I'll finish here, and this is a quote that came off the ad, which I'm sure you couldn't see because the type was so small, from Dr. Sean Morrison at University of Michigan. He said, "The Power Mac G5 is the fastest computer I have ever used. I can have eight different memory-intensive applications open on my desktop at the same time with no problems whatsoever. In my personal opinion, the system is so reliable, user-friendly, and powerful that I don't understand why people endure PCs." Now, yes.
I think I'd like to just close by saying what's not really covered there is really the key of matching the really powerful hardware and operating system that Apple makes with the really incredible applications that our developers provide. Because those things have to go hand in hand to provide the right solution to our scientists. And I feel like it's just so tremendous to see the people here really focused on developing and working towards scientific apps, maybe just for personal use, but also for commercial use because I really believe that those two things together really make the solution that helps address the needs of our scientific community. So with that, I would like to introduce our next speaker. Chen Peng is from the Temasek Life Sciences Laboratory in Singapore. They have a 75-node Xserve cluster. It's the largest cluster currently in Asia, Apple cluster in Asia. And he both installed it and managed it. And he's going to tell you all about his work there. Please welcome him.
Thank you, Elizabeth. Good morning, everyone. It's my pleasure to be here to share with you our experience of building and using the Xserve cluster for bioinformatics at Temasek Life Sciences Laboratory in Singapore. Paul. So our group is involved in creating a computational biology division that will focus on comparing DNA between different species. Our current research project is the genome annotation of a sea squirt species and the study of non-coding regions across chordate genomes. In parallel to the annotation project, we are furthering the development of a workflow management software, BioPipe, to suit large-scale cluster-based computational needs as well as smaller workflows suited for other projects in TLL. Inside TLL, we work actively with other scientists to provide computational biology support. For instance, we work with a reproductive genomics lab on the automation of filtering, clustering, and annotation of in-house generated sequence data and its integration with public databases.
The foremost large-scale project we are doing is Ciona savignyi genome annotation. The genome size of the sea squirt in question has been estimated at 360 million bases with approximately 15,000 genes. The 400 million bases of raw data delivered from the sequencing lab are organized into 66,000 contiguous reads. We typically run a series of programs, including some well-known algorithms like BLAST and in-house developed solutions, to analyze each of the 66,000 sequence pieces. Each analysis program generally takes somewhere between five minutes and two hours to complete.
As a large amount of data has to be passed from hard disk to memory, the data I/O speed is extremely important for us. So for the annotation project, we need to set up a cluster that can meet our requirements as listed in the slides. So the cluster must be able to deliver tremendous computational power.
It should be easy to install and ready to extend in the future. And we require high-quality hardware and a robust operating system that allows most of the bioinformatics tools to run without any problem. In addition, these applications should be optimized to achieve the best performance on that platform. We also require sophisticated software to manage distributed resources and thousands of computational jobs. And finally, the hardware and software solution must be cost-effective.
This is the Xserve cluster we built in 2003. It has 75 Xserve units running Mac OS X Server. Each Xserve unit has dual G4 processors, two gigabytes of memory, fast disk storage, and gigabit Ethernet. Our cluster hosts more than 20 terabytes of disk storage, and the cluster nodes are managed by Platform LSF. Thanks.
So with the help of BioTeam and Apple, we figured out a way to conduct a mass rapid installation. We booted up an Xserve unit from an external hard disk, which contained a prebuilt disk image. During the boot-up period, a script automatically restored the image and built the operating system on the local storage. We parallelized the installation with four external hard disks, and set up the 64 cluster nodes within three hours.
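As a rough back-of-the-envelope check on those figures, here is a minimal sketch. It assumes each external disk installs one node at a time and that all rounds take about the same time; the talk does not state either, so the per-round figure is only an upper-bound estimate.

```python
import math

def install_rounds(nodes: int, disks: int) -> int:
    """Number of sequential rounds when each disk installs one node per round."""
    return math.ceil(nodes / disks)

# Figures quoted in the talk: 64 nodes, 4 external disks, within 3 hours.
nodes, disks, total_minutes = 64, 4, 180

rounds = install_rounds(nodes, disks)
# Assumption: rounds are equal in length, so this is time available per round.
minutes_per_round = total_minutes / rounds

print(f"{rounds} rounds, at most {minutes_per_round} minutes per round")
```

Under those assumptions, booting from the disk, restoring the image, and moving the disk to the next node would have to fit in a little over eleven minutes per node.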
So Mac OS X is a BSD-based operating system, and we feel it is very friendly to the bioinformatics tools originally designed for Linux or Unix. This slide shows the bioinformatics tools available in our cluster. Most of the tools are compiled directly from source code by ourselves. Although some of them need to be modified a little to cope with the differences between BSD and Linux, it is not difficult if you have some experience with C programming.
After the basic system is up, we spend a lot of time optimizing the performance. So as explained in the previous slides, we focus on improving data I/O speed. For each Xserve node, we stripe the two local hard disks to build a RAID 0 set so that it provides 240 gigabytes of local storage at an average speed of 66 megabytes per second. We store most of the BLAST databases locally on each node to reduce NFS traffic, and connect all the Xserve units with gigabit Ethernet.
On the software level, we try to find an MPI-enabled version to replace the normal version. If the application itself supports multi-CPU execution, we instruct the users to run it with the proper options. For example, to specify -a for NCBI BLAST, so that it runs in multi-threaded mode. In addition to these efforts, we also optimize at the compiler level. With proper GCC options, a lot of bioinformatics tools can speed up by about 40% if they were originally built with the default configuration.
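To make the -a option concrete, the sketch below builds a command line for the legacy NCBI `blastall` program, where `-a` sets the number of processors to use. The database path, file names, and the choice of `blastn` are hypothetical placeholders, not values from the talk; the thread count of 2 simply matches the dual-processor nodes described earlier.

```python
import shlex

def blastall_command(program, database, query, output, threads=2):
    """Build a legacy NCBI blastall command line as an argument list.

    -p program, -d database, -i query file, -o output file,
    -a number of processors to use (multi-threaded mode).
    """
    return [
        "blastall",
        "-p", program,
        "-d", database,
        "-i", query,
        "-o", output,
        "-a", str(threads),
    ]

# Hypothetical paths: a database stored locally on the node, one sequence piece.
cmd = blastall_command(
    "blastn", "/local/blastdb/nt", "piece_00001.fa", "piece_00001.out", threads=2
)
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run` on a machine where `blastall` is installed.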
Different from other simple biology analysis, our sea squirt genome annotation involves running a series of programs for each of the 66,000 sequence pieces. Each step of the analysis must be automated so that the entire process won't stop in the middle. BioPipe is an open source workflow management software maintained by the OpenBio community. It was designed to address some of the complex issues in large-scale biology analysis. Our group contributed to that project and used BioPipe to manage our genome annotation project. BioPipe is entirely written in Perl, and the Mac OS X Developer Tools CD provides all the necessary tools we needed for development.
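The idea of running a chain of programs per sequence piece without letting one failure halt everything can be sketched in a few lines. This is an illustration only: it is not BioPipe's API (BioPipe is written in Perl), and the two step functions are toy stand-ins invented for the example.

```python
from typing import Callable, Dict, List

Step = Callable[[str], str]

def run_pipeline(pieces: List[str], steps: List[Step]) -> Dict[str, str]:
    """Run every step in order on every piece.

    A failure is recorded for that piece and the loop moves on,
    so one bad input does not stop the whole run.
    """
    results: Dict[str, str] = {}
    for piece in pieces:
        data = piece
        try:
            for step in steps:
                data = step(data)
            results[piece] = data
        except Exception as err:
            results[piece] = f"FAILED: {err}"
    return results

# Toy steps (hypothetical): drop unknown bases, then "analyze" what is left.
def strip_unknown_bases(seq: str) -> str:
    return seq.replace("N", "")

def analyze(seq: str) -> str:
    if not seq:
        raise ValueError("empty sequence")
    return f"{len(seq)} bases analyzed"

results = run_pipeline(["ACGTNNACGT", "NNNN"], [strip_unknown_bases, analyze])
for piece, outcome in results.items():
    print(piece, "->", outcome)
```

A real workflow manager adds persistence, retries, and job scheduling on top of this loop; the sketch only shows the per-piece chaining and failure isolation.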
This screenshot shows the job status in our cluster in April 2004. There are more than 40,000 jobs in the queue, and over 100 are running. This is the situation we need to deal with almost every day. We use Platform LSF to manage the thousands of jobs generated by BioPipe effectively. LSF is the most robust distributed resource management software we have ever used. With Mac OS X Server and LSF, we are able to perform large-scale biology analysis without worrying about system stability.
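For readers who have not used LSF, jobs are submitted with the `bsub` command. The sketch below generates one submission per sequence piece using the standard `-J` (job name), `-q` (queue), and `-o` (output file) options. The queue name, script name, and file names are hypothetical, and this is not how BioPipe itself talks to LSF.

```python
from typing import List

def bsub_command(job_name: str, queue: str, log_file: str, command: List[str]) -> List[str]:
    """Build an LSF submission: bsub [options] command [arguments]."""
    return ["bsub", "-J", job_name, "-q", queue, "-o", log_file] + command

# Hypothetical inputs: three sequence pieces and an analysis script.
pieces = [f"piece_{i:05d}" for i in range(1, 4)]
submissions = [
    bsub_command(
        job_name=piece,
        queue="normal",
        log_file=f"{piece}.log",
        command=["./analyze_piece.sh", f"{piece}.fa"],
    )
    for piece in pieces
]

for args in submissions:
    print(" ".join(args))
```

Scaling the list of pieces up to tens of thousands gives a sense of how a queue like the one in the screenshot fills up.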
Setting up the cluster is a one-time task, and the maintenance is the administrator's everyday work. Luckily, we have a few effective tools that help us a lot in daily system administration. One tool I would like to mention is Server Monitor. It took us only two hours to set up Server Monitor so that it provides an overview of all the 75 Xserve units. We only need to configure the monitoring server with the IP address of each cluster node and the administrative account.
Server Monitor retrieves all the important hardware information for us in a few seconds. If we were using another Unix system, the administrator would have to manually log into each node for configuration, which would take much longer to complete. Server Monitor also features a hard disk pre-failure warning that is very useful for us to quickly identify the disks with potential problems. And we also use Server Monitor to collect per-node information, such as the serial number or MAC address for each network adapter.
Another important GUI tool is Apple Remote Desktop, which enables the administrator to operate a remote machine as if it were local. This tool is neat for headless Xserves, especially the new Xserve G5 without a VGA card. The most charming feature of Apple Remote Desktop is the ability to install software packages by drag and drop simultaneously on multiple nodes. We find this feature extremely useful when doing cluster-wide system upgrades. We are able to update the 64 cluster nodes to a newer version within 30 minutes. Our previous experience of managing a similar-size Alpha system involved doing updates from the command line, and it took us at least half a day for the same task.
There are other command-line tools we use frequently to facilitate cluster management. We are glad to find that Panther has great support for command-line tools. Almost every GUI application has a corresponding command-line interface. Just to mention a few of my favorites: ssh is used to log into the remote nodes every day. Bash has been set as the default in Panther. rsync is the core utility for data synchronization, and we use dsh for distributed shell.
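As a small illustration of how these tools combine, the sketch below builds the `rsync` and `ssh` command lines needed to push a directory to every node and then run the same command on each one, which is roughly the fan-out that a distributed shell automates. The host names and paths are hypothetical, and the commands are only printed, not executed.

```python
from typing import List

def rsync_commands(source_dir: str, dest_dir: str, hosts: List[str]) -> List[List[str]]:
    """One rsync per node; -a preserves permissions, times, and directory structure."""
    return [["rsync", "-a", source_dir, f"{host}:{dest_dir}"] for host in hosts]

def ssh_commands(remote_command: str, hosts: List[str]) -> List[List[str]]:
    """Run the same command on every node over ssh."""
    return [["ssh", host, remote_command] for host in hosts]

# Hypothetical node names and paths.
hosts = [f"node{i:02d}" for i in range(1, 4)]
sync = rsync_commands("/shared/blastdb/", "/local/blastdb/", hosts)
check = ssh_commands("df -h /local", hosts)

for args in sync + check:
    print(" ".join(args))
```

Each argument list could be handed to `subprocess.run` on a machine with key-based ssh access to the nodes.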
So to summarize our experience with the Xserve cluster in TLL, basically the Xserve units provide the superior computational power we expected. The cluster was quickly set up, and we are able to run and optimize most of the bioinformatics tools. The entire cluster is robust for our genome annotation project, and the daily administrative work is made easy with sophisticated Mac OS X monitoring tools and open source command-line tools. Thank you.
Thank you. Well, thank you very much. And I'm just going to point out a few places you can get more information while we're bringing some of the Apple people up here for Q&A. And in terms of questions, if you could please use the microphones if you've got any questions. And in terms of contacts, Elizabeth Kerr, our science partnership manager, and Elia Stupka, the TLL Bioinformatics Program Manager, or Chen Peng, Senior System Engineer. Other resources, Liz mentioned the website. There's the Apple Science website. And other related sessions you may be interested in, I just want to mention specifically the Science Lounge on the fourth floor. You should check that out. There's going to be roundtable discussions ongoing throughout the conference there. All right, so let's take the first question over here.