Enterprise IT • 1:12:03
Rolling out large-scale storage with Xserve RAID can be daunting without the proper tools and methodology. Learn from deployment experts who explain how to bring these systems online quickly and efficiently.
Speakers: Alex Grossman, Harry Aine
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Great. That's fantastic. How many of you, and I want to get a little bit about where you're from, how many of you are from education? Wow. Let's switch hands now. How many from the commercial market? Government? Mars? A few of you, okay. Well, we are at Apple, so why not, right? Okay, fantastic. So, with no further ado, let me introduce Harry Aine, the president and CEO of SAN Solutions.
Thank you very much, Alex, for that great introduction. So I'm Harry Aine from SAN Solutions. Today we're going to cover a few things. I'll go through what SAN Solutions' background is. We're going to talk about SAN planning and the deployment process. We're going to talk about what a large-scale deployment is versus a small-scale deployment, and what we consider large-scale versus small-scale. In that part, we're going to talk about how you build out a large-scale deployment.
Well, we start with a small scale as our building block. So we're going to talk through a small reference design, and we're going to run tests on that. So you're going to see some real live data going through a reference design. Then we're going to scale that reference design up to a large-scale design.
Okay, and then we're going to show some reference implementations, okay, for the large-scale. So we're going to have a question and answer period and I have a few guests with me today, okay, that we're going to bring up and hopefully we'll have time to answer questions and if not, come up to the podium here.
SAN Solutions was founded in 1998 as an integrator, or a top-end integrator, of storage area networks. And in 1998, just about nothing worked. Today we have products that work, but back in those days, when we did the Sony Pictures DVD Center, it took over 300 traces for Emulex to get their host adapter drivers right. So we've come a long way since those early days. My prior life was at MountainGate, which eventually got acquired by ADIC, whose file system you guys are probably pretty familiar with as the Xsan file system.
We do largely SAN and NAS integration work. We specialize in rich media. How many in the audience really deploy rich-media-type solutions? Okay, so you guys are probably going to be more entertained by this presentation than some of the enterprise guys, but all the things we're going to go through today, the same disciplines and the same process, apply to databases and IT applications just as they do to rich media applications, except that rich media applications tend to stress things out a lot more than some of the database applications do.
Anyway, so we have high-performance infrastructures out there today. We have high-availability infrastructures in both. Some of the customers we have are Sony Pictures Entertainment. We did a lot of work for Microsoft. We put in the infrastructure at the Johnson Space Center in Houston that does all the simulation for the astronauts for the shuttle missions. We've done a lot of work for Skywalker Sound, putting in the entire audio facility there. We do the Tonight Show at NBC. We do the majority of all Warner Brothers television today.
Most of our SAN installs are 10 clients and above, or 10 host computers and above, and 10 terabytes to 100 terabytes today. We're really good at tuning real-time applications. In these SAN environments, most of our infrastructure, as a matter of fact all of our infrastructure, has some Apple component in it, from the Power Mac workstations to the Xserves. And all our customers are really highly dependent on uptime. These generally are not engineering deployments; they're deployments that actually make our customers' money. So when you've got the cash register going, they can't take any downtime.
Okay, the first part we're going to go through pretty quickly, because the more interesting stuff is towards the end of the presentation, when we actually get into test results. But for completeness' sake, we're going to talk about the planning process. That involves the requirements assessment first. And then how do you design a SAN? How do you implement it? How do you document it? What services do you need to be able to offer your end customers as far as training and support?
Okay, when we go out to a site, we really start with what are the key business issues. Okay, and this ranges, okay, depending on who the customer is or what their problems are today. But they may say that, okay, we can't find anything. Everybody's got media all over the place, but it's, you know, on their desktop or on a removable drive or whatever. Okay, and we need to get this thing consolidated.
Okay, but usually within the first hour or so, you have a pretty good, clear understanding of why they think they need a SAN, or why they need a new look at how they put in their storage infrastructure. We also have to assess what their level of risk aversion is.
Is this a customer that wants proven things that have a lot of track record on them? Or is this a guy that wants the latest and greatest thing, that pushes his technology a little bit to the edge? We have a mix of customers. Some of them won't touch anything that hasn't been out there for three years, and some just want the latest and greatest thing to make them as efficient as possible.
Okay, we look at the implementation scope. Is this today a department that's going to have to scale out to the full facility? Is it going to have remote sites that need to get tied in? So we have to define what the scope of the project is going to be.
We've got to assess what the customer really is in terms of how much support he has, because some things may get too complicated for one class of customer, while another type of customer has a staff and has the people that can be trained to manage this thing once it's in.
Okay, we have to look at the workflow because as we change their storage infrastructure, we change the way in which the people work. Okay, a lot of our sites are shared media, and that's a different workflow, or they can change the workflow, okay, versus a site that has been using removable disk drives, for instance.
Okay, we have to look at what the technical requirements are, what applications have to run on this infrastructure. Is there anything strange about those applications? Because in a lot of cases, especially in the rich media space, the applications have been written for a system that has a 12-inch cable between the disk and the computer motherboard. And when you put those applications in a network, you run into a whole new set of issues, and they generally fall into the latency area. So we have to take a serious look at what application is going to run on this new infrastructure.
Okay, then this follows up with how we're going to support this site, both for now and in the future, and how much downtime is allowed. Some customers work from 8 to 5. Some sites are 7-by-24 with no downtime ever. So you have to have these things in mind when you lay out the design of these infrastructures. Then there are your data migration requirements: how are they going to get from what they're using today onto the new system without taking them down? So those things have to be explored.
And then we have to look at, okay, what is the phasing, what is the timeline, when does a customer have any kind of window to switch over, and then how are we going to meet their budgets? So that basically outlines what we look for on the assessment level.
[Transcript missing]
But this new customer wants Final Cut Pro across 10 machines, and then he wants to have real-time replication. Okay, so that real-time replication becomes incremental over a reference design that we've already created and tested.
Okay, so it's important at this stage that we document physically the physical layout of the architecture and the logical layout. And logical things are how is the thing zoned? Okay, what are the volumes? Okay, how does failover occur? Okay, those kinds of things go into the documentation package.
Okay, then we talk about, we got to look at what the future requirements of this facility are. And a lot of things are phased, okay, that they first want to get, you know, the graphics department up, okay. And then next year they're going to look at, they'll have budget for, say, editorial, okay. And then they may want to look at distribution, okay. So we have to look at it from a broad spectrum early on because we don't want to put something in that we're going to have to replace later if we can avoid it going in.
The last thing we look at, especially in the traditional IT area, is that a lot of customers today don't want to have downtime, and they want proactive management tools. What a proactive management tool can do for you is monitor the site, monitor your I/O, and, based on policy, alert you of certain things that are occurring.
And those types of things are, well, you just added five more users; how did that affect the usage or latency or bandwidth of the infrastructure? And when those utilizations get to a certain level on average, then, before we start losing real time or bandwidth for our current customers, we want to upgrade to handle the additional load.
So there's a class of monitoring tools that we're going to go through today that have that capability, and that we deploy in the infrastructure early on so we can monitor for these things. They're also great for fault isolation and for predicting failures of the optics, for instance if you have SFPs or GBICs going out, or you have a particular storage system giving you problems that you're unaware of.
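To make the idea concrete, here is a minimal sketch of the kind of policy check such a tool performs: sample per-link utilization, average it over a window, and raise an alert when the average crosses a threshold. The 80 percent threshold, the link names, and the simple print-out are illustrative assumptions, not values or behavior from any particular monitoring product.

```python
# Minimal sketch of policy-based proactive monitoring: average each link's
# utilization over a window and flag links that cross a threshold. The 80%
# threshold and link names are illustrative, not from any real product.
UTILIZATION_LIMIT = 0.80

def check_links(samples):
    """samples: link name -> list of utilization readings between 0.0 and 1.0."""
    alerts = []
    for link, readings in samples.items():
        average = sum(readings) / len(readings)
        if average > UTILIZATION_LIMIT:
            alerts.append(f"{link} averaged {average:.0%} over the last window")
    return alerts

# A real tool would email or page the administrator; here we just print.
for alert in check_links({"switch1-port4": [0.72, 0.85, 0.88],
                          "switch1-port5": [0.41, 0.38, 0.44]}):
    print(alert)
```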
Okay, in the implementation phase, we normally put together a full implementation plan. That outlines what our responsibilities are and what the customer's responsibilities are, because they kind of have to get stepped through the process, or they don't really understand what's involved in what we're going to do. Usually there are third parties involved that have to get on the same page; we have to be on the same page, and our customer has to be on the same page.
Okay, we have a list of deliverables that we give, okay, and the customer has a list of their deliverables. And then we try to create a project schedule and also what the final test completion criteria is. It's important kind of to nail down how we're going to test this thing. How do we know when it works? Okay.
As you go through the process, you find that, you know, you might be into a project for a month or so, and then all of a sudden, new applications just all of a sudden popped up. Okay, so you might have to have a change control process in place.
What do you do if they want to change or deviate from the plan that you started with? Documentation is also a key component of an installation, and here you have to meet the standards of that facility, of what they expect to see.
And a lot of requirements actually come from procedural manuals. This is an area of, how do I fail over a server, for instance? Or how do I shut down this thing if I need to shut it down? So these kinds of procedures need to be tested and documented.
And then finally, the training and support side of it. We specialize in training on Fibre Channel, on the basics of the infrastructure: how do you manage this infrastructure from the storage management side and also the operational management side? On the support side, we've got to be able to handle break/fix. We've also got to be able to handle what happens if, a year from now, they want to integrate a new application, and how that changes the original design.
Okay, so now we're going to talk briefly about what we consider a large-scale deployment versus a small-scale deployment. This classification is really not a function of how much capacity or how many servers you have. It's really about what functions are in the infrastructure and what kind of data sets you're dealing with.
Okay, are you dealing with real-time video? Are you dealing with real-time audio? Are you dealing with both real-time audio and video? Okay, do you have databases? Okay, that you're dealing with? Okay, or do you have a bunch of shared media? Or is it all allocating storage to individual workstations? Okay, generally, real-time characteristics are much more difficult to deal with than non-real-time. And we're going to see that a little bit later on here.
How much redundancy do you need for high availability? Most of the large-scale deployments need a lot of redundancy built in. Is it a homogeneous client base? Is everything Mac, or do you have Macs and Windows clients that all have to live together in the same infrastructure? Well, in a large deployment, you're going to have both, and you'll have your Suns or other Unix flavors too.
[Transcript missing]
So we start small and we start scaling up. And that's sort of the process that we've always gone through to stabilize a large SAN environment. Some of the things that you've got to know about when you do this is that every disk system has a number of logins that it will accept.
When we were at the supercomputer center down in San Diego, they bought 800 terabytes of storage from one of those three-letter companies, and they had 384 servers that wanted to log in to one array, and they couldn't do that. You have to understand that a storage device typically will allow 64 logins. A lot of integrators out there, or people in the storage business, don't necessarily know what the limitations of their arrays are. But that's one area in large scale that you've got to be aware of.
Also, as you get more devices out there, you add latency, because as you have more logins, you need more resources, both on the storage controller and on the host adapter, to handle those connections. So as you start fanning out, you get a lot more latency involved in your infrastructure.
The other thing that we look at is the queue depth, and this is the outstanding I/Os that a host adapter can have. When we tune a system, we'll limit the allowed queue depth to what the application needs, because if we don't, you can get into a hogging mode where one client decides to do a copy and consumes all the potential bandwidth off an array, for instance, and all your real-time streams die. So we have ways to throttle back I/O so we can balance it across the facility.
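As a rough illustration of why that works: on a saturated array, each host's share of the command slots, and so roughly of the bandwidth, tracks its allowed queue depth. This back-of-the-envelope sketch uses made-up numbers; it is not a measurement or a tuning recipe from the talk.

```python
# Rough model: a host's share of a saturated array's bandwidth scales with the
# queue depth it is allowed. All numbers here are illustrative.
ARRAY_BANDWIDTH_MBS = 180.0   # assumed total MB/s the RAID set can sustain

def bandwidth_shares(queue_depths):
    total = sum(queue_depths.values())
    return {host: round(ARRAY_BANDWIDTH_MBS * depth / total, 1)
            for host, depth in queue_depths.items()}

# Untuned: a bulk-copy client with 32 outstanding I/Os starves the editors.
print(bandwidth_shares({"edit1": 4, "edit2": 4, "copy": 32}))
# Throttled: capping the copy client's queue depth rebalances the shares.
print(bandwidth_shares({"edit1": 4, "edit2": 4, "copy": 4}))
```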
You also have to be aware of fabric design, to make sure that you have redundant fabrics, or isolation in your fabric, so that if you lose a switch, it doesn't bring down the whole facility. Or if you need to do firmware upgrades later on, when the thing's in production, how is that going to affect the users that are on the system? And the third item on the slide is really operational: storage provisioning.
We have some customers that they provision storage out to their clients once a year. Okay, and they may revise that provisioning, okay, you know, minor, make minor adjustments after a year. They may add more storage or something. And then we got clients that have to provision or reprovision storage 30 times a day. Well, you got to have the management tools in place, okay, that allow you to do those types of provisionings.
Okay, backup, recovery, and archive. Okay, this is a great topic today, okay, because this paradigm of going disk to tape is kind of fading away. Everybody wants to be -- if they're going to back up, they might as well minimize the time it takes to restore. So we're seeing a lot of requests out there today for continuous backup.
This is an infrastructure where every I/O is mirrored in real time and at any point in time they can say, "Okay, I want to roll back to what this was 10 hours ago. Give me a new volume with what it looked like 10 hours ago." So we're getting involved in a lot of those types of integrations today. Snapshots are where they want a point-in-time image every couple hours of what the data looked like and be able to manage those things.
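Here is a toy sketch of that continuous-backup idea, just to make it concrete: journal every write with a timestamp, and rebuild a volume image as of any earlier moment by replaying only the writes before it. The class, the block-map representation, and the byte strings are invented for illustration; real continuous data protection products work at the block-I/O level.

```python
import time

# Toy sketch of continuous backup: every write is journaled with a timestamp,
# so a volume can be rebuilt "as of" any earlier point in time. Illustrative only.
class WriteJournal:
    def __init__(self):
        self.entries = []                      # (timestamp, block number, data)

    def record(self, block, data):
        self.entries.append((time.time(), block, data))

    def volume_as_of(self, when):
        image = {}
        for stamp, block, data in self.entries:
            if stamp <= when:
                image[block] = data            # the latest write before 'when' wins
        return image

journal = WriteJournal()
journal.record(7, b"frame as captured")
checkpoint = time.time()
journal.record(7, b"accidental overwrite")
print(journal.volume_as_of(checkpoint))        # block 7 rolls back to the captured frame
```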
And as it used to be in an IT facility, if you lost your data, you'd go to the IT manager and say, could you roll back that old backup tape, if you can read it, and bring it back online so I can get that file off; I just deleted it. Well, in today's world, that stuff's getting pushed more to the user, so the user can restore his own files instead of relying on a central IT department to do that.
And finally we have the management tools. This again is where we start introducing proactive management versus reactive management. Proactive management can help predict when we're going to start seeing problems out there, instead of reacting when a problem does come up, which usually leads to downtime.
And those tools also help eliminate the finger pointing. Next we're going to start with a small reference design and run some real live data through it on the Xserve RAID, so we get an idea of how an Xserve RAID performs and how that's going to scale. We're going to run some high-def data through this thing. So we're going to start with a thing called a test generator, and I have one here. It's a small little package that you can basically program with any type of video that you want to test your SAN with.
And it goes from standard def all the way up to dual-link, film-res video. It saves us a ton of time out there, because without that thing we have to get a tape and a tape deck and a video deck, and we've got to get a tape with the right format and things like that. Well, this thing can be programmed to handle basically any format that we might see.
The other thing that may be new to you is that we're going to put taps in our small reference design implementation. What a tap is, is an optical device that allows us to plug an analyzer in without getting in the direct way of the traffic. What it does is split the light from your GBICs, basically, or your SFPs, and allow us to monitor that connection
without getting in the path of the connection. And Finisar has an extensive line of these taps that we use. Then we have a Finisar NetWisdom probe, and what this probe does is collect traffic. It collects all the exchanges that go on; an exchange is equivalent to a SCSI command. It collects basically all of the traffic that goes on that connection. And then there's a thing called a portal, and the portal can consolidate a lot of probes and centralize all that data to help present it.
We also have the capability, if the data indicates something or we're getting errors someplace, for that portal to email us. So it's used not only for this test infrastructure, but also for proactive management. And this is what we see from the views of the portal. Views is an application that shows us whatever we want to look at. We can select megabytes a second.
We can select the latency. We can select any kind of errors that go on. We can look at the queue depth of those conversations. Just about anything that you want to know about the SCSI traffic, we can find out from the probe and the management utility.
So in the small reference design, we have essentially two Power Macs and two Xserves. We're going to run Xsan on this thing. We've got three Xserve RAIDs and a Brocade switch, and it looks kind of like this. We have our test generator producing data that goes through the Power Mac, which has an AJA video card in it, down to the Brocade switch. We have two arrays dedicated as storage for Xsan data, and then we have another Xserve RAID as our metadata device. And then we've got the two Xserves as our metadata server and a backup metadata server.
And the first test we run is really just I/O, direct I/O on the disk, because we want to understand what the disk performance looks like. It's incredibly important at this stage to understand what it looks like with known patterns, so when we get to scaling, this data will help predict how it's going to scale. So our first three tests are sequential reads.
The pattern over on the left-hand side reflects the data pattern. So let's take the first one, for instance: we have four threads, slightly offset, at the beginning of the disk that are doing reads. The prefetch was set at 64 kilobytes; that's the smallest the Xserve RAID allows. And in this mode, we were achieving 110 megabytes a second.
And then we also have some parameters listed here that reflect what the exchange times were, or the amount of latency it took to complete those things, and the data that we're going to be using to do the scaling. We also document what size of I/O we're doing, because we can vary that.
And the Xserve RAID is really optimized for transactions that happen at one megabyte; it has a megabyte of cache in it, so that's the size of I/O we selected for this. Then we show the pending exchanges, and that's how many outstanding I/Os there are: if we have four threads with one I/O each, that's how we get to four. So, stepping down through the tests:
You see that at 64K prefetch we got 110 megabytes a second. At 512K, which is the next level up, we're getting 165. And at 8 meg read-ahead, or prefetch, we're getting 191. Those are outstanding numbers for a 6-plus-1 configuration, where you've got six drives and a parity drive. Those are really outstanding numbers in the industry for read performance across a 6-plus-1 configuration.
The next thing we did is look at, okay, those are the fast tracks. How does it perform on the slow tracks? Because when we design a large scale and have to scale it up, we kind of have to take into consideration that we might have data way out there.
And how is that going to affect the overall performance? So you'll see four tests in that area, and the difference in those tests is that there's now a new 2 terabyte LUN limit being broken. Tiger came out and Xsan 1.1 came out, and it broke through the 2 terabyte LUN limit. To do that, the logical block address went from 4 bytes to 8 bytes, and this 4-byte to 8-byte logical block address is kind of new to the industry.
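For context, the arithmetic behind that old limit is straightforward, assuming the conventional 512-byte logical blocks: a 4-byte address can only name 2^32 blocks. The little calculation below is just that arithmetic, not anything from the talk's test data.

```python
# Why a 4-byte (32-bit) logical block address caps a LUN at 2 TB,
# assuming the conventional 512-byte block size.
BLOCK_SIZE = 512                               # bytes per logical block
old_limit = (2 ** 32) * BLOCK_SIZE             # largest LUN a 32-bit LBA can address
print(old_limit / 2 ** 40)                     # 2.0 TiB
# The 8-byte LBA used by the 16-byte SCSI read/write commands removes that ceiling:
print((2 ** 64) * BLOCK_SIZE / 2 ** 40)        # astronomically large, in TiB
```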
So we wanted to see how the array handled being out there in those extended tracks. That's what those tests were, and as you can see, the performance dropped off slightly when we were dealing with tracks way out beyond the 2 terabyte limit.
They dropped down from 165 and 191 to about 110. We've got to know that when we look at how we're going to scale this thing. The next two tests are where we go from sequential I/O to random I/O, because random I/O is always going to be slower than sequential I/O. It doesn't matter whose disk system it is; when you add in the seek times, your performance goes down. So we had two threads at the beginning of the disk and two threads at the end of the disk. And in fact, at 64K, we went from 110 down to 44, which is explainable.
And that was at the 64K prefetch. At 8-meg prefetch, we went from 190 all the way down to 135. I should comment here that these are half the array; this is just one 6-plus-1 portion, not the full array, so for the full performance of the array you have to double it. But these are excellent numbers, generally speaking.
Then we went into write tests and what the performance on writes was. What we found at 64K is that it didn't matter what the prefetch size was, because that matters more on reads; what does matter is whether the write cache is enabled. And it didn't matter if it was random or sequential: because of the write-back cache, we were getting the same performance, and it all ranged between 87 megs a second and 90 megs a second. That's the total performance for writes on that segment. So it doesn't matter how many threads you have; the total performance is going to be roughly 90 megs a second on writes. So this is predictable.
And believe me, not all arrays are predictable. The fact that this is predictable allows us to scale this thing, okay. If we were getting numbers all over the place, okay, we would be lost as to how many arrays it takes to achieve, you know, a number of users.
So anyway, the next series of tests was what happens when you have a reader and a writer in the infrastructure. Because in an Xsan environment, or any disk environment, you're going to have both readers and writers simultaneously. How does that affect things? You also want to make sure, in captures, that your writes have precedence over reads.
Because the last thing you want to do is lose data on acquisition. Okay, if you're out there, okay, in the oil and gas industry and you're recording data that can't be re-recorded, okay, to your disk subsystem, you don't want to lose it because a read came in and slowed up your disk so you didn't have enough bandwidth to write. So we wanted to look at how well it's going to behave in that area.
So in read-write performance, we found out that with one writer and one reader, at 64K your read performance went way down; writes totally took over at the small prefetch setting. In the mid-level prefetch area it was modest. And at the high prefetch, the reads and the writes started to balance out some, and that's because it was reading ahead all the time; when it reads ahead all the time, you're likely to get more read-back bandwidth. So anyway, that's what we determined there.
And then what happens when you add multiple readers and writers? What happens? Okay, and this area, what we found was if you absolutely positively have to have your writes, okay, then go with the lower, go with the 64K, okay? If you want to balance your writes and your reads, then go either to the 512 or the 8 megs, okay? But the 8 megs can drop your reads down to levels that may not be acceptable.
So we found that 512 was a good medium, okay, in that. And then the last test was what happens when you turn write cache off, and you never want to turn write cache off, okay? Because it dropped not only the writes performance, but it also dropped the read performance.
So if we look at the Finisar traces of these, we're going to pick test number 16, which is the 8-meg prefetch, and we want to see what's actually going on as we add these threads. This graph is showing the I/O size, which was set up for 1 megabyte.
The yellow at the beginning shows that one-megabyte reader out there, and then when the red part comes in, that's where the second thread came online, and later on there are another writer and reader, but they're all the same I/O size, so basically that's why it's covered.
Then we're going to look at pending exchanges. A pending exchange is basically showing us how many outstanding I/Os are out at any given time, and it steps up from one to four over time as we added those users. So this is really documenting what the test was. Then we're going to look at megabytes a second; now, this is the interesting part.
At the beginning of the test, when we had one reader, we had 125 megs a second coming off this thing. When we added the writer, the writer took precedence over the reader, and it's running at roughly 70, and the reads are running at roughly 35 or so. And that was fine, because that's your precedence: the writes have precedence over the reads.
Then what happens when we put on the second reader, which was the third thread? In that case, because the read-ahead was so large, the reads started overwhelming the writes, and the write performance went down. So we have to understand this for how this thing is going to scale.
Okay, then we add the fourth one, and it stayed constant. Okay, so we can start predicting that. In that eight-meg prefetch, okay, our writes are going to settle down at a level, and our reads will settle down at a level. Okay, this is incredibly important to understand how these things scale.
This is another thing that you normally don't see: latency. Just because you're getting the same bandwidth doesn't mean the data doesn't take a little longer to come in. This graph measures read and write latency; the red portion is your reads, and the purple portion is your writes. So even though the bandwidth stayed constant for those last two threads, what really happened was that the write latency actually went up.
And that means it's just a little harder to get that write data in, but the rate stayed constant. And this is more evident when we look at not the max but the average latency, which takes into consideration the mins and the maxes during those time periods.
The other thing we found when we did this testing was how Xsan stripes, and this is going to be important for how we scale. There are two parameters that you specify when you create a file system: one is the stripe breadth, and the other one is the block allocation size.
Well, the block allocation size is how much it reads or writes at a time, and the stripe breadth is how many of those it does to one target before it advances to the next. Typically they advise something that multiplies out to one meg. So we took an example of four:
a stripe breadth of four and a 256-kilobyte block allocation. So this is how the transfers take place: you have four to the first device, followed by four to the second device, and then the third and the fourth before it comes back to the first. Well, when you're reading, a 64K prefetch isn't going to do you much good when your transactions are at 256K. An 8-meg prefetch isn't going to do you much good either, because that kind of overwhelms the reads on the read-back and takes more time. So 512K is just about right on the prefetch size for this file system.
And what happens is that by the time the I/Os go through from one to two to three to four, by the time it gets back to one, it's already prefetched the next two I/Os when you set that to 512. So we found that the 512-kilobyte prefetch is ideal for at least this Xsan file system configuration.
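To make that layout concrete, here is a small sketch of how a stripe breadth of four with a 256 KB block allocation walks sequential data across four LUNs, one megabyte per LUN before moving on. The helper and constants are illustrative, not the actual file system code.

```python
# Sketch of Xsan-style striping using the example from the talk:
# 256 KB allocation blocks, stripe breadth of 4, across four LUNs.
# Illustrative only; this is not the real file system implementation.
BLOCK_ALLOCATION = 256 * 1024     # bytes per allocation block
STRIPE_BREADTH = 4                # allocation blocks sent to one LUN before advancing
NUM_LUNS = 4

def lun_for_offset(offset):
    block = offset // BLOCK_ALLOCATION        # which allocation block this byte falls in
    group = block // STRIPE_BREADTH           # which 1 MB group (4 x 256 KB) that block is in
    return group % NUM_LUNS                   # groups are laid round-robin across the LUNs

# The first megabyte lands on LUN 0, the next on LUN 1, and it wraps every 4 MB:
for mb in range(8):
    print(f"offset {mb} MB -> LUN {lun_for_offset(mb * 1024 * 1024)}")
```

Seen this way, the 512K prefetch argument is just that by the time the reads cycle back to a given LUN four megabytes later, the controller has already read ahead the next two 256 KB transactions on it.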
Okay, the next part of what we did was we started running high-def data through it. Because high-def, or any kind of video source that's a constant data stream, shows up all kinds of array issues, because most arrays out there were never designed for video. Fortunately, the Xserve RAID handles video extremely well; it's designed for everything else too, but it handles video extremely well. So we changed our test reference implementation a little bit, and we moved the probe up to the host side, because we're going to run high def through one channel.
And we're going to stripe across four of the 6-plus-1 RAID groups in two arrays, and then we have a metadata server. This is running Tiger, and it's running the 1.1 Xsan file system. The first thing I want to show you is a slide of what we should be looking for and what is bad. This is a trace of a popular array out there that competes with the Xserve RAID, probably a lot more expensive than the Xserve RAID, and how it handled a standard-def stream. What you see is three bands.
The top band is total latency, the middle band is bandwidth, and the bottom band is its maximum write exchange, and that's the longest time an exchange took during that second. And what you see is that the latency is constant for a little while, and then you get a little blivet, and then a little later on you get a big blivet.
Well, that big blivet is really bad, because if you get too many of those big blivets that stack up together, or they just become too big, your stream will fall off in terms of acquisition. This is common throughout the industry for these subsystems. So that's why we're an Apple reseller; that's why we use Xserve RAIDs.
It's because of this kind of problem on competitive products out there. You have to have stability or you cannot scale; you have to have predictability or you can't scale. This kind of latency problem comes from housekeeping, and housekeeping is the functions that the controller does in its spare time to clean up its cache, or whatever it's going to do to manage itself. If you're doing a lot of video stuff, this is the thing you need to look at: how well that array performs with constant data going through it. And the way you test it is pretty straightforward.
So anyway, this is bad. This is the capture that we had for high def on the Xserve RAID. The top line measures 158 megs a second; that's the rate for high-def data. That line is constant. The next two down are latency curves, and those are completely flat.
That little blivet you see is a zoomed-up event that happens once a minute in Tiger and Xsan 1.1, at least from what we found. It's zoomed up here, and we're going to see how much of an effect it has a little bit later on. Basically, that's a 4K read that happens once a minute. But the point of this slide is that the latency curves are flat; it's very well behaved with streaming media going down the thing.
So we're going to look a little bit closer. Again, I/O size: when the AJA card produces video and the file system takes over and moves it to disk, it runs at a one-meg I/O size when you go through one channel. These are the pending exchanges; they vary up to about four.
The megabytes per second is constant; very clean. Here's the latency curve; there are no spikes in this curve. We're talking about the top one; that's the write exchange latency. There are no spikes in that. You can go for hours and hours and hours and not get spikes. That's predictability.
[Transcript missing]
And as you can see at the bottom left, for two or three streams, you can pretty much predict how many arrays you're going to need: two arrays, maybe three arrays. Generally, the Xserve RAID actually works closer to that middle line than to the high-variance line, the green line, because it's fast. But given the wrong I/O pattern, you could be out there in the green area. So all we know is that it's fairly predictable at a low count of streams, but can become less predictable at a high count of streams.
Incidentally, the black line is a one-array-per-stream line, so eight streams, eight arrays. Well, that just so happens to come out to that 87-megabytes-a-second average bandwidth. So I don't know how they did it, but remember that write performance we saw of 87 megs a second no matter what we did? Well, that translates into that black line, and a very linear curve as to how many arrays we're going to need to maintain writes at that bandwidth. So if you need eight writers and they're not doing any reads, you can do that on eight arrays,
at 158 megs a second. But no environment that I know of is just writes; they've got to read, and the writes and reads interact, and that's what gives us this variance. Again, what we found from this is that it's much more deterministic at the lower number of streams than at the higher number of streams, and we're going to use that to our benefit a little later. And that's where we started getting into how we take this knowledge and turn it into a large-scale reference design. Well, we're going to introduce some new toys.
And this is a new class of product called a network storage controller. It looks like a switch, it smells like a switch, but it's not quite a switch. It scales from 16 ports all the way to 128 ports today; these are 2-gig versions.
And it adds some functionality that helps us work with media and also enterprise. First, it has storage virtualization: it has the ability to do provisioning on the fly, so we can stack a bunch of arrays, set them up one way, and do the carving up and allocating from one point of management. It does mirroring, it does replication, it does snapshots, and of course there's that single point of management. In an enterprise, for you enterprise customers out there, it allows you to take tier-one storage and do your replication to Xserve RAIDs,
or do your mirroring to those Xserve RAIDs. The Xserve RAIDs are probably faster than your tier-one storage, so they can probably keep up. It also allows for remote replication, so you can go off-site with your data. And it allows you to do data-mining applications, because you can snap volumes off and send them to another server to look for particular data, and so on.
Okay, so how are we going to use this? How do we benefit from this in rich media, okay? Some of these things in enterprise we don't need for rich media. Okay, well, we're going to show you how to use this for increasing performance. We're going to show you how to do workgroup isolation.
We're going to show how we can improve redundancy in the system, and again, we have that single management point. It also has a function that's not listed here: the ability to let an administrator go out and zero out a LUN. So for those post houses that need to reprovision when a new customer comes in and zero out the data, it can be done at the switch level, at the network storage controller level, and you don't have to dedicate a host to it. It's one click versus dealing with a workstation to do that.
So this is the design we have. Basically it replaced the Brocade in the environment, and we've got four arrays. We're going to talk about a new mirroring method and the nomenclature for it: we call them two-by-three sets, or one-by-two sets in this case.
The first number represents how many volumes, and the second number represents the number of mirror members that you have. So for the orange set, we have basically two volumes and they're mirrored three ways. For the metadata, we have one volume that's mirrored two ways. From the host perspective, when we provision those out to the clients, they see two volumes at 2.2 terabytes, and one volume at 2.2 terabytes for metadata, or we can shrink that with the switch; we can set that size.
But basically, the mirroring goes on behind the scenes, on the other side of the switch. When a write goes out, and this is how we're going to increase performance, the switch takes that write and does a multicast to all three disk subsystems. So basically, you're getting a three-times multiple of that write performance from the client to your disk subsystem.
Okay, same applies to the metadata. You're getting your mirrored on the fly. Okay, on readback, okay, it's going to take the performance from each array and assign it back to one of its own clients. So edit one reads from the top array, edit two reads from the middle array, edit three reads from the third array.
Well, what's this going to do? What this does, when you're dealing with sequential data, for instance, is eliminate what happens when you put multiple streams on the same array: when you put multiple streams on the same array, you get random access. But in this case, because each array is dedicated to one edit station, they can maintain their sequential data patterns.
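Here is a toy model of that behavior as described: writes fan out to every mirror member, while each client's reads are pinned to its own preferred member, so each member only ever sees one sequential reader. The class and names are illustrative, not the controller's actual interface.

```python
# Toy model of the network storage controller's mirroring as described in the talk:
# writes are multicast to every mirror member; each client's reads are pinned to
# its own preferred member. Names and structure are illustrative only.
class MirrorSet:
    def __init__(self, members):
        self.members = {name: {} for name in members}    # member -> block map
        self.preferred = {}                               # client -> preferred member

    def assign_reader(self, client, member):
        self.preferred[client] = member

    def write(self, block, data):
        for blocks in self.members.values():              # one host write, N mirrored copies
            blocks[block] = data

    def read(self, client, block):
        member = self.preferred[client]                    # each member keeps one sequential reader
        return self.members[member][block]

mirror = MirrorSet(["array1", "array2", "array3"])
mirror.assign_reader("edit1", "array1")
mirror.assign_reader("edit2", "array2")
mirror.write(0, b"frame 0")
print(mirror.read("edit1", 0), mirror.read("edit2", 0))    # both clients see the same data
```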
And we all know what that meant from our test results in the first part of this: our performance goes way up. So not only are we getting the multiplication from the multiple mirror members, but our data patterns are also more predictable. Okay, so next we're going to test this thing with high def.
So we moved the probes around a little bit. Now we've got two probes up there, and we're going to look at one. Instead of piping the high def down one channel, we're going to pipe it down both channels of the host adapter, but we're only going to monitor one side. And we're going to write to two arrays in a two-by-two configuration, two mirror members and two volumes presented up, and the Xsan file system is going to stripe between the two,
and the metadata is the same configuration, okay? On reads, we're gonna set it up so that, okay, the array 2 reads to that editorial 2, so this should be no problem. We got--we've used Finisar, we've optimized what the file system should look like, okay, in terms of what its stripe breadth is and block allocation, and we got hi-def running reliably on one array, so this should be fine, okay? Next thing we're gonna do is we're gonna over-commit that array, okay, with the second reader, okay? And here we should see the thing fail, and then we're gonna put it into this preferred read mode where the data's coming off of two streams in parallel, two arrays in parallel.
So, from the I/O perspective, the first thing we see is your write performance. Your write performance is at that 158 level, and actually your latency is lower than what we had on one stream. The reason it's lower is because we're going through two channels. When you go through two channels, it's more stable on Xsan, or on the Apple driver, because it breaks the exchanges out in half; it's moving less data per exchange but can respond back quicker.
So we actually have a lower latency at this point on writes. The second part is where you see that little blivet and then a red flat line. That is the read-back from that second array to the second workstation.
Very constant. Again, you don't see any strange things going on with latency. Then it hits the second reader on that disk, and this is where the exchange times go out of whack and everything starts falling apart, because you've overcommitted the storage system. Well, we can look at Finisar traces and determine what events caused the streams to fall off by doing these exercises. But in this case, we overcommitted, oversaturated the storage system, and your red one falls off. Your purple one came up and started playing, but eventually it fell off.
Then on this following slide, what you see is we reset the storage controller into its preferred-read mode. If you see those little red diamonds out there, those are where we physically had to do some stuff, so you got a check condition; Finisar was really good about reporting that. And then we brought up the first reader, and then we brought up the second reader. Both are now reading in parallel the high-def information that was written once and mirrored by the storage controller, and we're maintaining two streams.
We're going to look at this again a little bit closer with the Finisar traces. What you see here is the same thing, except we're looking at the I/O sizes, and you see your one megabytes across the board. You've got to get your binoculars out now, because you might see those little blivets.
And those little blivets are the ones we saw in that first high-def trace, where every minute there's a 4K read. They're down at the very bottom. You're starting to see how much the Xsan metadata server, or the Xsan file system generally, interferes with bandwidth.
Here's the megabytes per second. We don't even see the blivets, they're so small in this graph. In this case you see 80 on this graph, because that's one half; we're only looking at one channel. Then you see the read come back, very consistent, and then you see the place where it falls apart, and then you see two reads.
Here, this graph is the exchange times, and what we're most interested in is: did the switch, the Maranti box, contribute any more latency? And the answer is no. Those reads are coming back as if they were native reads from their own disk.
I don't know if you can see it, but, I mean, we noticed that in the megabytes per second, that those metadata interactions were not even visible. Okay, we might be able to see it in the average in the total read exchange. But if anybody tells you that the metadata servers are adding, okay, to scalability problems or any of that kind of stuff, okay, they can forget it because you just don't see it in the traces.
Here is the maximum read exchange. Again, we're maintaining full bandwidth at the same latency levels as when it was doing one read-back from one machine. So how do you scale high-def streams using the Maranti? In this case, what we did was: if you want four streams, then you have four banks of two arrays each. We've been able to optimize the file system so that we can reliably get two streams, and even three streams, of high def read back off of two arrays.
And so it scales linearly as you put those two arrays together and add them in as a new mirror set. So if you need 12 streams of one high-def channel, that would be the first column; if you need two high-def channels, that would be the second column; and if you need three or four, it just keeps scaling. We can scale further than that if we take our two-array set and create a three-array or a four-array set.
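As a quick sizing sketch of that linear scaling, take the rule of thumb from the tests above: two (up to three) high-def read streams per bank of two mirrored arrays, with banks added linearly. The helper below just does that arithmetic with the conservative two-streams-per-bank figure; it's an illustration, not a sizing tool.

```python
import math

# Quick sizing arithmetic for the scaling described in the talk: each bank of
# two mirrored Xserve RAIDs reliably carries two high-def read streams, and
# banks are added linearly. The helper itself is only for illustration.
STREAMS_PER_BANK = 2
ARRAYS_PER_BANK = 2

def arrays_for_streams(read_streams):
    banks = math.ceil(read_streams / STREAMS_PER_BANK)
    return banks * ARRAYS_PER_BANK

for streams in (2, 4, 8, 12):
    print(f"{streams} high-def streams -> {arrays_for_streams(streams)} Xserve RAIDs")
```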
So, no matter what the requirements are, okay, we have a solution set on how to scale high def streams. Okay, here's a reference implementation. Here's a group that they have editors and they have graphic artists and they want to share media, but they don't want to have interaction between the groups.
Well, in this case, we have a four-by-three. One of the mirror groups is dedicated to the graphic artist department, and the other two are for video. We split the metadata so that we have a redundant mirror of that data. And then the bottom two arrays that are mirrored allow the video side to see the graphic artist side.
So in conclusion: to scale large-scale SANs, or to create a large-scale SAN to begin with, you've got to have the customer needs well established. What are they trying to accomplish with the project? You've got to leverage reference designs. Xserve RAID is a great building block because it's extremely predictable in its performance, much more predictable than your competitors' disk subsystems out there.
You've got to know your products inside and out. We do that using Finisar gear; by the way, this is the same gear that banks and investment bankers monitor their infrastructures with. We build a small system to optimize I/O, then we add the concurrent users to it, then we scale out to the full deployment, and then we test it. Then we add in the proactive management products, and that way the thing is supportable for the future.
That's basically the end of the talk today. If you're interested in getting the full Finisar reports, we have a bundle that has all of the traces we've done, bundled up together, and we can email that back to you if you send us a request. You just need to send an email to wwdc at sansolutions.com. And thank you for coming out today.