
WWDC09 • Session 620

Performance Tuning for Mac OS X Server

IT • 44:29

Apple's advanced hardware, including Mac Pro and Xserve, paired with Snow Leopard Server, delivers substantial performance and helps to ensure that critical applications and services perform optimally, even under high-utilization conditions. Get the latest benchmarks and details on how to maximize performance and enhance scalability of Snow Leopard Server installations across various workloads and configurations in your environment.

Speaker: Steve Parker

Unlisted on Apple Developer site

Downloads from Apple

SD Video (70.5 MB)

Transcript

This transcript has potential transcription errors. We are working on an improved version.

My name is Steve Parker. I know that I caught you right after lunch and just a little while before the beer bash, but I think what I have is pretty exciting so I hope it keeps your attention and keeps you interested. In particular, I'm going to talk about the general server performance results in Snow Leopard which I think are outstanding.

I want to suggest some things that you can do to help plan for and create a system that performs just as well as some of the performance information that I want to share with you. And I am going to mention a couple of things that are very important if you happen to be a developer, and you're building a service that is either large in scale or high performing. And then finally, I'm going to spend a bunch of time talking about how you evaluate the performance of your system, how you identify bottlenecks and in general how to make sure that it's performing its best.

So, let's begin with the server performance data, but I want to start by talking a little bit about benchmarks. I'm about to present a bunch of benchmark data, how fast the system goes, and in particular this is important because we'd like our performance not to be a surprise.

We'd like to be able to know that if I get this kind of a system, I can get this kind of performance out of it; we'd like to be able to use that in planning as we purchase our systems; and we also need to be able to know when the performance is actually wrong.

When is it not performing as well as it should be? So I'm going to start by talking about a couple of SPEC results. SPEC.org is an industry consortium that does standardized benchmarking. They write some really excellent benchmarks. One of the ones that I'm going to talk about today is SPECjbb2005; that's a general Java performance benchmark.

This is something that is based on an order entry and inventory system written entirely in Java. They have chosen that the benchmark not actually utilize any file systems, so it runs entirely from in-memory tables; and it basically scales up this order entry and inventory system until at the end it finds the very peak level of performance as it scales up threads in Java and so on: the very peak performance that you can get out of that engine. So Snow Leopard, with a combination of improvements to the OS and the Java virtual machine, is 30% faster than it was in Leopard.

I think this is a really excellent result. You might want to know how fast your file system is, so SPECsfs is actually a very long-standing benchmark; it was originally only an NFS benchmark, and in particular what it does is simulate both NFS and Samba clients. And it is very carefully written, in the SPEC style, to simulate the real activity of users. Sometimes you can build an artificial benchmark that's like, if I copy files, how long does it take? Well, those benchmarks are usually scripted and very sequential.

And in fact when you have a group of users using a file server, their requests actually arrive at a somewhat randomized interval to the server, and that is a little more challenging load than you sometimes see with a simple scripted test. So SPECsfs is careful to simulate that behavior. It's a metadata-heavy benchmark, so it realizes that in fact a lot of what you're doing is searching file trees, looking things up; it does a combination of read and write.

It also makes sure that there is 120MB worth of data behind every op/sec, so this is to make sure that you've actually got a full size file tree; because in fact if you're measuring a large busy server it's not just going to have a tiny number of files that might all get cached.
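To put that scaling rule in concrete terms (the target rate here is just an illustrative number, not one from the session): a run aiming for 10,000 op/sec has to sit on roughly 10,000 × 120 MB, about 1.2 TB of file data, which is far more than a server could quietly hide in its RAM cache.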

So this ensures that there is a file tree that grows in size; it reflects the real-life behavior on disk. If I look at my disk at the beginning of the day and I look at it at the end, there's always more data on it. So the course of the run actually grows the data set.

And the other thing is, because the goal of this benchmark is to measure the server's performance, it's careful to provide its own NFS and CIFS client code so that it can make sure to do exactly the same thing. Sometimes a client implementation can interact badly with the server and can make it either look better or look worse. So a combination of the K64 kernel, VM scalability, and IP route locking work in Snow Leopard has increased this by well over 2X for Samba performance, which I think is outstanding.

NFS performance likewise is benefiting from all those same things. That one is a full 2X. So, those are slightly simpler workloads. Another key workload, some of you have e-mail, yes, would be your mail server. So for the mail server there is also a SPEC benchmark, SPECmail 2009; we've been very active in helping SPEC develop this benchmark. It does a sophisticated simulation of IMAP. IMAP is a very complex protocol, as some of you may know.

There are a lot of different ways to accomplish the mail operations. SPECmail has a huge array of these different ways of doing things, and it balances a randomized distribution across them all so that it simulates an average amalgam of a wide variety of IMAP clients. It tracks the message contents and makes sure to keep in mind that some of the operations ought to be against messages that have big attachments; many of them are actually much smaller, and those will have different performance properties. So in Snow Leopard I'm pleased to announce that we have posted a new world record in SPECmail 2009, well above the Sun Java messaging server, so you can treat us as a contender for mail performance.

I'm incredibly excited about that. So, those are great results. Let's talk, however, about how you actually get that performance on your system; and first of all I do want to start by acknowledging and talking a little bit about benchmarking versus reality. So a show of hands: every one of you who has ever once found that a benchmark suggested that you could make a change and get a certain performance gain, and you got something different.

How many of you had that experience? Right. So, one of the things that I am going to focus on in this talk is how you do an evaluation that can be specific to your system and your workload, and try and answer questions about how much faster your system will be. And if there's one thing that you take away from this talk, it should be that this is an end-to-end system problem, and it is necessary to make sure that all of the plumbing for whatever services you are providing is really up to snuff.

And the way I like to think about it is, we are in this world that is filled with incredibly fast and powerful multicore CPUs; and they are connected by what are by and large small pipes to the network and to the disks of the system; and it is getting the balance right, of how your workload needs to flow across all of those things, that's necessary.

If you set it up so that you have too small a pipe, what you end up having is that the flow of your requests backs up, and that is what you have to try and work to avoid. What you really need to do is make sure that there are parallel paths adequate for the network and the disk, paired to the system that you're using and the workload that you have. So let's talk a little bit about this NFS score.

So it is absolutely a disk-bottlenecked benchmark, and in fact there's a whole lot of extra CPU power. If I add more disks, I can probably get a higher score. That's what we did. That is still a really great example of the incredible speed that's available, but one of the things that you should definitely take away, if you go and look for example on SPEC.org at that publication, is that it lists this big pile of Fibre Channel storage. That's a key part of achieving that benchmark score. So keep in mind that these SPEC postings are what I like to call the "Speed of Light".

If I tune this all the way up, if I get the biggest and best, what can I do with it? So in this particular case, it really is that array of 64 15K SAS drives; there is 8 megabytes' worth of battery-backed RAM caching it. In fact a lot of disk I/Os are actually only taking 100 microseconds from the point of view of the server, because it's delivered to the RAM in the Promise unit. The Promise unit makes sure that there are redundant copies in two sets of RAM, and then it reports back across the Fibre Channel that it's good to go. So that's really important to providing that level of service. However, that's not for free.

That's $80K worth of disk drives. So how many people actually have a system with that many drives connected? Wow, all right. Wow. OK. Excellent. So I'm kind of surprised. I think the majority of people, though: how many of you have nothing but a SATA drive internal in your equipment? Internal RAID, OK. So, the other thing that you'll notice in that posting, I use 32 separate HFS+ file systems across that array.

It turns out that's also important, and I'm going to talk about that later on. If you are actually going to get that performance level, though, you are going to need to do all of those things to plan for getting adequate storage behind it. Well, so, if you haven't got the $80K worth of disks behind it, you might be asking why this is actually good news. Well, the really good news about it is that this is an indication of the ability of the server to scale up. It's telling you that the plumbing underneath it, the OS and the system services, can handle it; it turns out during the peak of that run there are tens of thousands of I/Os in parallel, for example.

There are paths capable of handling really large volumes of data. That particular set of postings is using 8 Gigabit NICs, and is driving, at peak, 840 megabytes per second of random I/O to that array. So the good news about that is that you do have this big multicore chip that does a lot. Unfortunately, back in Leopard, you were only able to put some of the cores to work in some important areas.

What's different is that all those cores are now available and if you are not in a position to be driving this to that kind of level, then you should be considering putting a lot more than just NFS on this box. You should be running some web services; you should be running maybe Wikis, Podcast Producer, but in general all of those workloads will operate in parallel because this is just available and spare machine capacity; and there were a number of limitations in Leopard that did not allow you to get at that especially for server workloads. So the good news is that where you might have in the past used a server for a single purpose, that's no longer necessary.

So, definitely understand that all of the focus that I've given on disk performance is because the CPUs have gotten faster and faster, sort of Moore's Law, kind of every three years or so twice as fast; the clock speed's not faster now, now we're adding more cores, but it's still going faster. However, you definitely do need to pay attention to disk storage, which has not gotten that much faster.

And keep in mind that all of those factors that I've mentioned are absolutely necessary for that level of performance. So now I want to drop it back, however, and I want to talk about what I think is actually a little more common scenario which is I've just got a SATA drive, maybe some hardware RAID. So to put it in perspective, you should be expecting to be like 1/40th of that score.

It is definitely going to go a lot slower, and that will be because the disks of the system can't carry it. Now, a lot of us do have that because we care about having a lot of bytes. My MP3 collection doesn't get any smaller; I'll bet yours doesn't, and neither does the data in your data center. And that is a lot less than 1/40th of the cost; it's about 1/400th of the cost.

But keep in mind those differences and design for them. How many of you use RAID5 underneath your storage? Lots of people. So the slowest of those options that I listed is RAID5. RAID5 requires that I actually lay down three different stripes of the data so I get the reliability.

That means that it's not actually delivering additional speed; so this chart is intended to give you sort of a reference for the general difference between what you can achieve, like in a single Xserve with traditional drive technology, and all of that is still definitely at least 10% of that score against the Promise Fibre Channel array. So, I have several slides here, and I definitely recommend that after the conference you pick up my slides.

This is reference information that I think you'll find very handy because it breaks down what these choices are, and I'm not going to go through them in great detail, but it's pretty simple. At the 1 to 3 terabyte range you do still have the option of simply SATA and SAS drive modules.

If you need very large data sets you're going to be on Fibre Channel anyway, and keep in mind that those are in order of speed, so each one of the steps you take down that list is providing you a significant jump up in performance. Now, before I go through two other sets of important tables about disk drive choices, I want to take a second and talk about what the really tough challenge is around file systems and storage; and it's a big mouthful of geek speak: Metadata Rate of Change.

So, what do I mean by Metadata Rate of Change? Well, anything that creates, removes, or renames a file, or changes ownership, permissions or attributes: all of those things require synchronous round trips to the disk, and that volume basically has to stop other people who have updates to that file system until that change is committed reliably to the disk.

So, for example, on HFS+ it needs to rebalance the B-tree if you've done a rename, create, or delete; changes need to be committed to its journal, which is a synchronous operation; and it has a single catalog file that has to be updated when the volume is changed in these significant ways. Now, this is the opposite of the data inside the user file: none of those things actually require me to stop other threads of work.

So this is one of the reasons why I mentioned that I use 32 separate HFS+ volumes. The reason was to get the Metadata Rate of Change of the system higher; and you should definitely keep that in mind as you lay out volumes and choose your technology. So in this particular case I wanted to highlight that between HFS+ and Xsan, you should keep in mind some of the different choices. One of the things Xsan is fabulous at: if you have files that tend to be accessed by only a particular client, it can deliver incredible bandwidth, incredible read and write speed.

On the other hand, HFS+ will work particularly well if many different processes on a single system are sharing it, and it will use RAM as a cache to help prevent your needing to go to disk, and definitely both of them do have reasonably good data write and data read rates. But now what I want to do is I want to pull it all back in to what you actually do and what you actually set up. That was all abstract about disks and technologies, yes whatever.

So, what I have done here is sort of connect those different things and bring it back down to a set of specific recommendations for particular services. So for example, web service is actually a very low disk-load service; that's something where if all you have is a built-in SATA drive, that's not a problem. Mail, like file services, is at the other end.

Those are areas where if you're doing a high performance system, you definitely need to consider something like Fibre Channel solutions. So, the details again will be available in slides after the conference. Definitely pick that stuff up and keep it handy. It will help you to plan for how to get a system that scales and performs well.

All right so I've talked about how fast it goes; I've talked about the things that you need to be thinking about in planning for performance. Now I want to talk about how you create services that are going to go fast. So this is really absolutely targeted for the developers in the audience.

So, in particular, the three areas that I'm going to talk about: the first is, if you are using select and poll, and if you are using more than 100 or so files, don't. Give it up, switch to kqueue. Select and poll are at best O(N) algorithms. They will never scale; kqueue is an O(1) interface, and it's an interface which is also incredibly convenient because you simply create one descriptor for all the activities of the service, and you just collect events back with kevent off that service.

It can give you a handle that points into a record of your choosing that helps drive the service; so for example if I'm writing an IMAP server, and in fact Dovecot uses this, it calls kevent, and kevent returns a file descriptor and a pointer to information about the IMAP session. And in fact that mail score would probably be 1/3 of what it is if all I did was switch back to select. It is a significant factor: don't do it.
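As a rough sketch of the kqueue pattern he's describing (this is not code from the session; the function name, buffer size, and omitted error handling are all illustrative), a kqueue-based service loop looks roughly like this:

    /* Minimal kqueue event loop: one descriptor collects all activity. */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void run_event_loop(int listen_fd)
    {
        int kq = kqueue();                    /* one descriptor for the whole service */
        struct kevent change;

        /* Register interest in readable events on the listening socket.
           The last argument (udata) can point at a per-session record of your choosing. */
        EV_SET(&change, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &change, 1, NULL, 0, NULL);

        for (;;) {
            struct kevent events[64];
            int n = kevent(kq, NULL, 0, events, 64, NULL);   /* wait for activity */
            for (int i = 0; i < n; i++) {
                int fd = (int)events[i].ident;
                if (fd == listen_fd) {
                    /* New connection: add the client socket to the same kqueue. */
                    int client = accept(listen_fd, NULL, NULL);
                    EV_SET(&change, client, EVFILT_READ, EV_ADD, 0, 0, NULL);
                    kevent(kq, &change, 1, NULL, 0, NULL);
                } else {
                    char buf[4096];
                    ssize_t len = read(fd, buf, sizeof(buf));
                    if (len <= 0)
                        close(fd);            /* closing also removes it from the kqueue */
                    /* else: handle the request that just arrived on this connection */
                }
            }
        }
    }

Unlike select or poll, the cost of each kevent call does not grow with the number of registered descriptors, which is why it keeps scaling to thousands of connections.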

You've got a better interface: use it. Resource limits: services launched by launchd, and I do encourage you to look into launchd, it's very easy to use and very powerful. By setting up the plist for your service you can help launchd do its job of playing traffic cop. So one of the aspects of being the traffic cop is to protect the system from a runaway process, from a process that has actually failed in some way and is effectively its own denial-of-service attack on your system.

And in particular, it will allow you to set this variety of resources; but in particular pay attention to the largest file size and open files at peak. Those are the two that I see services most often get wrong. If you're going to have a large number of open files, nominate how many you want, and get yourself launched from launchd.
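As a hedged illustration (the label, program path, and limit values below are made-up examples, not anything from the session), a launchd job can nominate those two limits in its plist with the SoftResourceLimits dictionary described in launchd.plist(5):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
      "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
        <key>Label</key>
        <string>com.example.mydaemon</string>        <!-- hypothetical service -->
        <key>ProgramArguments</key>
        <array>
            <string>/usr/local/sbin/mydaemon</string>
        </array>
        <key>SoftResourceLimits</key>
        <dict>
            <key>NumberOfFiles</key>                 <!-- open files at peak -->
            <integer>8192</integer>
            <key>FileSize</key>                      <!-- largest file the service may create -->
            <integer>1073741824</integer>
        </dict>
    </dict>
    </plist>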

I definitely encourage you to look into that and use that. One of the things which is improved in Snow Leopard: if you are launched from launchd there is a new, higher listen queue depth. The listen queue is important for that Thursday after lunch when everyone hits refresh on their browser; in particular, what it does is make sure that spikes in traffic are dealt with smoothly. And it does that by making sure that the kernel doesn't simply drop a connection because the service is ramping itself back up and doesn't respond quickly enough.

And in particular that is very common because Apache for example is going to scale itself way down if there's no activity. It takes a little while for it to ramp up and have enough daemons available to answer connections; and it may be spawning that process but you still want a response back to the client that lets the client know the server is coming with a response. So, those were the three specific recommendations I really wanted to make for our developers so now I want to turn to the final point which was the section on Diagnosing Performance Problems.

If you haven't hit the speed of light, why not? So I wanted to begin by showing you these two graphs, and these two graphs are server performance in a nutshell. Basically, pretty much every server performance problem has these two graphs that are key to understanding how the workload behaves and how you study it.

And in particular, the thing to notice here is that most server workloads, which are composed of relatively small amounts of work but lots of clients and lots of separate requests, step up in a more or less linear fashion until they hit the system peak; and it should do so linearly, so if you have this much additional client load, you should see this much additional CPU, this much additional network traffic, this much additional disk I/O.

That general guideline is the first level of triage in understanding how the system is behaving; and finally, the other thing to be aware of is that once you've reached the point where you're presenting as much load as the system has capacity for, you want to pay attention to how it falls off in overload.

You hope it falls off in the manner that I've graphed here. If the system is badly behaved, if it has scalability difficulties, that curve will drop back down. So you want to keep in mind that as increments of client load are added that you should see equivalent increments, proportional increments in throughput.

So a similar evaluation, it's basically like two sides of the same coin really, is the latency of these different operations. And the general curve for most all server benchmarks looks like this. So the latency is constant under low load so what might be a good example? I'm perhaps logging into a secure web page so I connect to the web server, I provide it some credentials, it does some processing, it sends it back.

So that takes a certain length of time and no matter whether I've got 16 cores or a little Mac mini, there's a certain amount of time that's going to take. Now on my server machine as I add more of those, they probably are going to get separate cores, they're going to run in parallel and I'm not actually going to see the latency increase until I hit the peak.

So when I hit the peak, what's going to happen, in the example I mentioned, is that there now isn't going to be any available CPU to immediately run this request; so the latency is going to be the latency of queuing for whatever resource is being contended here.

And that latency is basically just going to continue to grow in overload, also in a linear fashion. So why are these graphs important? These graphs are important to understand because if you've got, say, a lab, or you're providing AFP to your customers, what you're looking for is where that peak is.

That is the point at which the system is going to go non-linear and some resource or other is no longer available adequately. How many of you came to my talk last year? A few of you? So in that talk last year I introduced a tool that will be available as part of the session materials, probably later this afternoon but no later than tomorrow.

And it's something that we use in-house all the time because it gives us the network, the CPU and the disk utilization at a glance; it allows me to break down whether my CPU has headroom, whether or not my network traffic has peaked out, have I reached 940 megabits a second out of that 1 gigabit NIC; because if I have, that's a problem, I've hit the peak point, and I would be identifying network reconfiguration to improve that performance.

So I will be demonstrating this tool in our lab afterwards, as well as a bunch of the rest of these techniques, and will be happy to go into very specific examples about how you use this and exactly what you're looking for. But basically what you want to do is look at that information for your server, and in particular, look for those opportunities when you can measure the system at different loads and do that same comparison: see where it fits on that graph.

And in particular, the example that I've got here would be if you find this opportunity where you can identify I've got 200 AFP clients and I've hit 60% CPU, and I'm wondering wow you know how is that doing? The best thing that you can do is to save that data off, look for an opportunity when perhaps there are 50 more clients connected and doing work, and then repeat that measurement.

And what I've done with the math in there is to look for whether or not it was linear. And in fact what I see in that example is it's gone non-linear, and that would tell me that I should be looking for lock-thrashing; that something in the system is preventing using the CPU effectively, and that is my bottleneck.
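To make the arithmetic concrete with the numbers from this example (the second measurement here is hypothetical, just to illustrate the check): 200 clients at 60% CPU is 0.3% of CPU per client, so linear scaling predicts that 250 clients should land near 250 × 0.3% = 75% CPU. If the repeated measurement instead comes in well above that, say 90%, the cost per client has grown, and that is the non-linear, lock-thrashing signature just described.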

So, also from last year's talk, there is a KBase article, HT1992, which is a great place to start. Not only does it have a version of the top-guide script there for Leopard as well, it has a breakdown of some of these different areas; in particular, in the case of CPU bottlenecks, there's a list of three techniques. One is using DTrace plockstat providers. Did any of you go to the first talk on advanced debugging and performance this morning? OK, somebody. So those DTrace providers are excellent ways to get insight into whether there are kernel resources being contended that are preventing CPU scalability.

And there are a couple of scripts attached to that KBase article. In addition, plockstat -C is a DTrace-based command that will tell you if pthread synchronization thrashing is happening in a process. The article also talks in a little more detail than I can cover today about evaluating network traffic, making sure that you've actually stayed within both packet count and data rate limits, as well as talking a little bit about disk I/O.

Disk I/O is a little bit harder to evaluate directly, but I am going to talk about some very specific techniques that you can use to try and identify whether or not a disk I/O bottleneck is actually present. So, just briefly, plockstat -C output gives you information on pthread mutex blocks, and reader/writer lock blocks for read as well as for write.

If you're a developer this means that it can actually give you an insight into whether your application is multithreaded well and using the pthread synchronization primitives well. And the other thing is that count field-- what that is giving you is the number of times when you went after a lock, you were blocked from getting the lock because some other thread owned it, and one of the beauties of running plockstat in this fashion is zero overhead. Many times an instrumentation methodology will have overhead that affects the behavior. This has zero overhead unless you were actually blocking on a mutex. That is the only overhead for this DTrace-based primitive. So those are the three methods for going after when the CPU might be a bottleneck.
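For reference, a typical way to run it against a live service (the process ID here is just an illustrative placeholder) is:

    sudo plockstat -C -p 1234

which uses DTrace to watch process 1234 and report the contention counts described above.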

One of the other cute tricks that I wanted to show you: I talked about HFS+ and metadata updates, and in particular you can also use DTrace in a clever way to find out if the journal on HFS+ is overloaded. So the mtxheat.d script that is part of that KBase article: it turns out that there's one particular path, and this is showing a stack backtrace inside the kernel, and when you see that journal start transaction, that specifically means that you are seeing another process being blocked because someone is doing a journal update. So that only becomes a relatively common occurrence when you have overloaded a volume.

The address of the lock which is printed by the script there actually correlates to a specific volume, and in the lab I'll show you how you can identify the specific volume that is associated with that. And that number that's being recorded at the bottom is the number of times per second; so in that particular sample, 179 times someone couldn't request a journal I/O because the journal was locked for a current write. That is a file-system-constrained performance problem right there; that's bad. Too big a number.

All right, so per-volume disk usage is also available. This is also a DTrace-based tool, iopending; it's in all 10.6 server systems, and it's telling you how deep the disk queue is. How many of you are familiar with the concept of load average? Yes. So that's essentially how many processes are waiting to be scheduled on the CPU. This is telling you the exact same thing.

This is telling you how deep the disk queue is here, and with the -m option you're able to focus it on a specific volume, which is also very handy; because it allows you to look at the configuration of your system and perhaps identify a way to partition your workload, partition your data sets, so that you get a more effective use of your disk subsystems. So in this particular case this is actually a pretty good curve, because we see that the distribution is mostly at about zero, so that means that most of the time the disk controller is idle and ready to accept a command for that disk.

If it were up around, sort of, 3, that would be a little bit warm, and if it were, say, up around 5 or 6, or it can get much higher, that would definitely mean you are disk constrained. So that is a way to look, per volume, at whether or not your disk I/O is performing well. In addition, disk_r and disk_w are an indication of the amount of the activity that is read or write; notice that in this particular case it was entirely write activity.

There was just a single 4K read. All right. So another thing that you can do that's actually really easy is check whether or not Spotlight is an impact on your system performance. So it does keep an index, and it's maintained on the fly, and in particular, if you do have a very write-heavy load, that may end up being a performance difficulty. In particular, you can watch for mds or md importer processes using ps or top; if they have a significant amount of CPU usage, then it would probably be a good idea to try this technique.

And I find the easiest one is to just throw the big switch. So that command just stops it, and so if you're running top-guide in one window and you throw the switch, if you don't see a reduction in CPU and perhaps disk I/O then Spotlight is not impacting you. In that case, turn it back on. That's not a performance problem in that case.
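For reference, the big switch he's describing is typically thrown with the mdutil command; a hedged example (the volume path is illustrative, and the per-volume form anticipates the next point):

    sudo mdutil -a -i off                   # stop Spotlight indexing on all volumes
    sudo mdutil -a -i on                    # turn it back on if nothing improved
    sudo mdutil -i off /Volumes/MailStore   # or disable it on a single busy volume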

In addition, if it turns out that that was a performance problem, one thing that you can do is disable Spotlight per volume. Perhaps you can identify that on certain volumes it's less critical, they have high write activity, and you can simply disable it on that volume. In addition, Spotlight is able to be turned on and off per file name, or per node as I phrase it in the slide.

So basically, if you have a directory foo and you create a foo.noindex right beside it, Spotlight will leave it alone. So if you have specific directories that don't need to be indexed, where that would not be a value add, that would be the way to let Spotlight know. OK, so for more information, Mark Malone is our Evangelist. Again, the KBase article is excellent, as well as DTrace.