Java • 59:58
Get the best performance from your Java application. Topics in this session include advances in VM technology, graphics and interface optimization, and coding strategies. Focus is on tools and techniques for analyzing your application's performance.
Speakers: Andy Belk, Ivan Posva, Jim Laskey, Scott Adler
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
I will not waste any time. So the goal of this session is to get the best performance out of your Java application on Mac OS X. Whether you're writing a new one or whether you already have one that you're porting from Mac OS 9 or some other platform. So what we'll cover in this session is try and give you an understanding of the characteristics of the Mac OS X Java performance.
Some techniques and patterns you can use in your code to try and get optimal performance on 10. We'll point out why measurement is critically important whenever you're trying to do optimization. And then lastly, we'll have a demonstration of some performance analysis to give you an idea of what sort of tools will be available on Mac OS X. So without further ado, I'll introduce Ivan Posva from the Java VM team. He's our tech lead. Thank you, Ivan.
Good afternoon. Well, I wanted to ask first: what are the Java performance factors? Well, first of all, it's your application design and implementation. If you have algorithms that don't scale to your problem set, there is no performance tuning we can do in the VM that will improve your app. So you have to make sure that you don't have N-squared algorithms or anything like that in there.
The second factor is the amount of memory your application is consuming. The more memory you use, the more likely you are to be swapping out, to be paging, the more stress you're putting on the VM memory subsystem, and the less likely you are to benefit from caches, data cache and instruction cache. Then the next factor down is bytecode execution: at what speed does the Java VM execute your Java code?
Jim will talk about tips and tricks, do's and don'ts, in the second part of his talk on what you can do in this area. I will concentrate more on Java VM efficiency in the first part of this talk. There are two other factors that influence Java performance. One is the speed of the underlying OS, which we won't be covering here. Then there is obviously the speed of the hardware you're running on. Let's look at what Java VM efficiency means.
The two most cited issues that influence Java performance at runtime: one of them is memory management. That is the footprint of your application, the footprint of your Java process running. That includes your Java heap. That includes all the supporting Java VM memory structures (we are actually using the Java heap for those as well). It includes parts for the OS, parts for the code, for the VM and so on. The second is the speed of allocation. Java is very object heavy.
You allocate a lot of temporary objects, so you need to be able to allocate those objects very quickly. But to keep your footprint down, you also have to reclaim very efficiently; if you just allocated very quickly and never reclaimed, your footprint would grow to infinity.
I will touch briefly on what we do there and how you, as Java programmers, can actually help us in that area. The second part is synchronization. Java has built-in support for multithreading in the language. To make that viable to use, we actually have to make sure that the synchronization primitives are implemented in a very efficient manner inside the VM.
Another part is startup performance. There must be hundreds of Java benchmarks out there on the web, small and big, some useful, some less so. But I've never seen one that actually measures Java startup performance. You don't want your application to take a minute or two to start up while the user watches the bouncing icon in the Dock, so I will touch on some of the things we have done to address that issue. Most notably, reducing class loading. Class loading is about 40% of your Java VM startup. I will touch on what you can do to help us out with startup performance.
So let's go to memory management first. The Hotspot VM that we ship on Mac OS X has an accurate, compacting, generational copying garbage collector. What do those buzzwords mean? Accurate means we know at all times where in the VM you have references to your objects. We can distinguish between real references to objects and memory locations that merely look like references to objects. Compacting means that we compact the heap. We don't leave any holes in the heap, meaning all the memory you use is basically moved together, compacted, as it says, and that improves both your footprint and locality of reference.
Generational copying means that we allocate objects for the first time in a new generation, and only when they survive for a certain amount of time do we actually copy them into an older generation, which we handle much less frequently than the newly allocated objects. So we spend a lot of cycles collecting newly allocated objects; for older objects, we don't spend that time.
So there is one thing. Lifetime of your objects is important. If you have objects that you're not going to use, you have to null out the references. If you have object hierarchies that you're not using anymore, null out the reference to that object hierarchy to actually give the garbage collector an opportunity to remove all the space that is allocated for those object hierarchies.
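As a minimal sketch of that nulling-out idea (the Document and Editor classes here are made up for illustration, not from the session):

    // Hypothetical example: dropping the only reference to a large object
    // hierarchy so the garbage collector can reclaim all of it.
    class Document {
        byte[] pages = new byte[1 << 20]; // stands in for a large object graph
    }

    class Editor {
        private Document current = new Document();

        void closeDocument() {
            // ... flush, save, and so on ...
            current = null; // let the collector reclaim the whole hierarchy
        }
    }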
Then the second part is, if you use some caches or have objects that can be recreated at a later time, you can use the weak reference class or soft reference class to give the collector a hint, or an opportunity, to remove objects out of your working set if memory is getting tight.
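A rough sketch of that soft-reference cache pattern; the ImageCache class and the loadImage placeholder are invented for illustration:

    import java.lang.ref.SoftReference;
    import java.util.Hashtable;

    // Cache entries are held through SoftReferences, so the collector may
    // clear them when memory gets tight; we simply recreate them on demand.
    class ImageCache {
        private final Hashtable cache = new Hashtable(); // name -> SoftReference

        Object get(String name) {
            SoftReference ref = (SoftReference) cache.get(name);
            Object image = (ref != null) ? ref.get() : null;
            if (image == null) {                       // cleared or never loaded
                image = loadImage(name);               // recreate the object
                cache.put(name, new SoftReference(image));
            }
            return image;
        }

        private Object loadImage(String name) {
            return new byte[64 * 1024];                // stand-in for real work
        }
    }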
Avoid finalizers where you can. Objects that can be finalized need to be handled specially in the VM. We need to actually call the runtime from interpreted or compiled code to allocate objects with finalizers because we have to keep track of them. That makes it very hard to allocate quickly.
All other allocation is done inline in interpreted and compiled code. For finalizable objects, we first have to keep track of them by allocating through the runtime. Then, when we are throwing them away, we have to make sure to call the finalizers. If you can avoid finalizers, please do.
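One common alternative, sketched below with made-up names, is an explicit close method that the caller invokes instead of relying on finalize:

    import java.io.FileInputStream;
    import java.io.IOException;

    // Explicit cleanup instead of a finalize() method, so instances can be
    // allocated on the fast inline path and reclaimed like any other object.
    class LogReader {
        private FileInputStream in;

        LogReader(String path) throws IOException {
            in = new FileInputStream(path);
        }

        void close() throws IOException {
            if (in != null) {
                in.close(); // caller releases the resource deterministically
                in = null;
            }
        }
    }

    // Typical use:
    //   LogReader r = new LogReader("/tmp/app.log");
    //   try { /* read */ } finally { r.close(); }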
To reduce your footprint, it is helpful to do lazy initialization and allocation. That way you reduce your footprint as well as your startup time, increasing startup performance. What did we do in memory management to improve some footprint issues? Well, we introduced a shared generation that stores the often-used classes and methods, including their bytecodes and so on. This shared generation is mapped in from a file, so we use the underlying Mach virtual memory system to just bring that memory into the VM. We reduced the GC time because we never try to collect or reclaim that memory.
It is basically free. Because it is mapped from a file and we never modify it, we can share it across multiple running Java processes. The thing you can do to help us there is, well, don't break it. Don't change your boot class path. Do not modify the system jar files that are installed on the system. That way we can always use the shared generation when we start the VM.
It has an additional benefit in startup time, which I'm going to touch on now. Avoid eager class loading and class initialization. It has an effect on both startup time and your memory footprint. If you don't load classes when you don't need them, you're not using the memory, and you're not spending cycles to actually load them from the class file, decode them, bring them into memory, initialize them, and so on. If you want to see which classes are loaded at what time, you can use the -verbose:class flag on the command line to see what is loaded at what time and in which order.
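As a small, hypothetical illustration of deferring work (and the class loading that comes with it) until it is actually needed:

    import java.util.Vector;

    // The dictionary is only built the first time someone asks for it, so
    // neither the memory nor the loading work is paid at startup.
    // (Not synchronized; a multithreaded version would need to be.)
    class SpellChecker {
        private Vector dictionary;

        private Vector dictionary() {
            if (dictionary == null) {
                dictionary = new Vector();
                // ... load the word list from disk here ...
            }
            return dictionary;
        }

        boolean contains(String word) {
            return dictionary().contains(word);
        }
    }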
So as I mentioned before, the shared generation reduces class load time, because most of the classes you're going to use out of java.lang, java.net, Swing, and so on have been preloaded for you in that shared generation and are mapped in from the file when you start the VM. So when you run the command I mentioned, java -verbose:class -version, you don't see any class loading going on at all. So the other part was synchronization.
The Hotspot VM we are using has very fast synchronization in the uncontended case. What does uncontended mean? It means that only one thread at a time is synchronizing on the object, which is most often the case. There's no real locking going on, but you want to be protected in case some other thread happened to be in that code. For this, we have a constant-time overhead.
The inline implementation in the compiler or in the interpreter takes about 8 to 10 instructions. It has very low memory overhead. We don't allocate any space for these object locks on the heap; they're all stack allocated, in the stack frame of the method that is synchronizing on the object. We don't use any of the underlying OS resources, because that is expensive. That ties down memory, sometimes even in the kernel, which you want to avoid.
The contended case is rare, but since it does happen, we use Mach primitives directly there to get as much performance as we can. Before I hand this talk over to Jim, I wanted to tell you what's new in 2001. We shipped Mac OS X.
It has a shared generation. That is one of the big improvements we made to the Java virtual machine we got from Sun. In the 1.3.1 developer preview we're working on, which should be the first one, we did inline interpreter allocation; this was not in the Mac OS X release. We're working on thread-local allocation. We have a faster instanceof and a faster array copy that is tuned to the G4 with the Velocity Engine.
The code we use for our array copy we are also using inside the GC to copy the objects between the generations, so we're making use of that code in multiple areas. So now I want to hand this over to Jim. Thanks.
So memory management and synchronization is a crucial part of what affects the performance of your application. But I think fundamentally when we're working with Java code, we think in terms of the speed of the code execution as being the key factor in determining what is actually slowing down our particular application.
So what I'm going to try to do in this talk is describe how things get interpreted, how they get compiled, when they get interpreted, when they get compiled, and give you some ideas on some of the coding hints that you can use to speed up the performance once you find out what kinds of weaknesses you have in your code. And I'm also going to run through a few things that have changed in the last year, since our talk last year.
So first of all, I want to point out that people often question, well, why don't we compile everything? Obviously, if you turn everything into native code, it's going to run a lot faster than if we interpret it. But because the Java environment compiles code on the fly using just-in-time compilers, there's a certain amount of cost, both in CPU and memory usage, in order to get things compiled.
And when you're doing some analysis of the actual VM, you find out that it's actually cheaper to interpret the code, because the interpreter is fairly fast, cheaper to interpret the code than actually go off and compile it and try to run it. So we have to get a balance there on what actually gets interpreted and what gets compiled.
[Transcript missing]
So using the criteria of the number of invocations and the number of times it loops, we can actually find out what is hot in your code. And we find that it's only about 5% of your application's methods that need to be compiled and hence are hot in your application.
And Andy will be going over some of the tools which will allow you to determine which of those methods are actually getting compiled in native code. And we can, once we've got that information, we can start tweaking those particular methods because those are the methods that are going to be problematic.
So I'm just going to go through and discuss a few things that you can do to get the best performance out of your application, and the types of things that you can concentrate on once you find out what's hot in your application. The first, most important thing, as I said, is that we can't just keep compiling your code.
And if only a small portion of your method is actually being used all the time, maybe you have some exception code in there or some special case code, it's doing a disservice to the compilation process by having it embedded in your method. So you should try to break that code out into separate methods and try to keep your method focused, small and focused, so that it can compile quickly and then go off and execute. And then you get good locality also of the execution of the code.
So as I say, separate rarely used code out into separate methods. You may say to yourself, well, if I put it into a separate method, then we're going to incur the cost of calling that method, basically pushing parameters and so on and so forth. But you'll find that in the VM, we actually inline things that make sense to inline, if we can make more optimal use of the code by inlining it as opposed to having it separate.
So don't worry about that. And in particular, accessor methods are always inlined, so you don't have to worry about the fact that, well, I've got a very tiny method and all it does is extract a field. And with the 1.3.1 code, we actually have a much tighter implementation of accessors, so they are very fast.
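A sketch of the shape being described, with invented names: the hot path stays small, the cold path lives in its own method, and the tiny accessor is left for the VM to inline.

    // Keep the frequently executed path small and focused; move the rarely
    // taken special-case code into a separate method.
    class Parser {
        private int errors;

        int errors() { return errors; }       // tiny accessor, inlined by the VM

        void parseLine(String line) {
            if (line.length() == 0) {
                handleEmptyLine();            // rare case handled elsewhere
                return;
            }
            // ... the common, hot parsing work stays here ...
        }

        private void handleEmptyLine() {
            errors++;
            // ... logging, recovery, and other cold code ...
        }
    }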
One of the things that I like to do is actually try to find which methods that are used fairly frequently in the class libraries and tweak the code specifically to handle those because those are the routines that are used a lot by everybody. And we want to try to get good performance for those particular methods.
So trust the supplied classes. You may have the urge to go out and rewrite the vector class because it synchronizes every time you access it and does an object, you know, a check cast every time you do extraction from it. You know, these are the sorts of things that we notice that are used a lot. So we hand tool, or not hand tool, but we provide special service for those particular types of methods that are used fairly frequently.
So instead of going off and writing your own, trust the supplied classes. So classes such as String, StringBuffer, Vector, and the collection classes: use what's there, because we're going to get the performance up for you. And we've added some more optimizations, or more special cases, for those in 1.3.1. If you're copying from one array to another, use arraycopy because, as Ivan mentioned, we're using G4 acceleration.
So you can do that in arraycopy, and hence it's going to be the fastest way of doing it. So instead of having a loop that iterates through, use arraycopy. And then, of course, there are certain functions like sine, cosine, and tan, which have native implementations. So it's best to use what's supplied and not go off and write your own JNI routines to deal with that.
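A minimal example of the arraycopy advice (the array sizes here are arbitrary):

    // System.arraycopy instead of an element-by-element loop; the VM's tuned
    // copy routine does the work.
    class CopyExample {
        public static void main(String[] args) {
            float[] source = new float[1024];
            float[] dest = new float[1024];

            // Instead of: for (int i = 0; i < source.length; i++) dest[i] = source[i];
            System.arraycopy(source, 0, dest, 0, source.length);
        }
    }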
Make the best use of the native data types. Again, the G4 is not a 64-bit processor, so whenever you do long arithmetic, it has a certain amount of cost associated with it. Some of the basic operations like add and subtract or bitwise AND and OR are reasonably cheap, but when you get into shifting or divide or that sort of thing, it can be fairly costly.
So if you don't need all that precision, stick with ints for the time being. And then also consider using floats instead of doubles. Not necessarily in your computation, because sometimes you just need the precision, but when you're dealing with arrays of values, it's best to keep the size of your arrays down by using floats.
And there are quite well-known techniques for actually keeping precision, even though you're using a 32-bit value. And then, new in 1.3.1, we've added better register allocation for longs, floats, and doubles. So you'll find that, especially if you're doing a looping-type calculation, performance will be improved.
Try to avoid using the generic data types, because there is a cost in assigning, say, a generic data type to a specific data type. We have to go through a checkcast. Ivan mentioned that we've done some performance improvements in 1.3.1 to deal with instanceof and checkcast. But it's still a cost.
Instead of a simple assignment, we have to go off and do this to make sure that it's the right class for doing that. So try to avoid using generic types and use subtyping or subclassing in these circumstances because then that way you can avoid making assignments that require these checks.
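A small sketch of the difference, with made-up classes: pulling elements out of a generic Vector forces a checked cast on every access, while a strongly typed array does not.

    import java.util.Vector;

    class Point {
        int x, y;
    }

    class GenericVsTyped {
        // Generic container: every extraction needs a checkcast.
        static int sumXGeneric(Vector points) {
            int sum = 0;
            for (int i = 0; i < points.size(); i++) {
                Point p = (Point) points.elementAt(i); // checkcast each time
                sum += p.x;
            }
            return sum;
        }

        // Typed array: no cast needed on access.
        static int sumXTyped(Point[] points) {
            int sum = 0;
            for (int i = 0; i < points.length; i++) {
                sum += points[i].x;
            }
            return sum;
        }
    }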
Try to work with local copies. Now, one of the things people have been asking about is why the code generated by the HotSpot client compiler is slower, let's say, than the server version and so on and so forth. Well, some of the optimizations you get in the server version of the compiler are very sophisticated, and they're not there in the client because, again, we want to try to compile things fairly quickly and get them up and running. So if you have an array access and you're working with that array value, it's best to make a copy of that value, work with that copy, and put it back in again.
In this particular example, you have three accesses to that array. That means we have to do three bounds checks and three null checks on the table itself, whereas if we make a copy of it, we only have to do two in this case. Plus, you get the locality benefits of working with the value in registers: in this case, the value would be assigned to a register as opposed to going back to the array, so you get a performance boost there as well.
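In code, that pattern looks roughly like this (the scale method is invented for illustration):

    class LocalCopy {
        static void scale(int[] table, int i) {
            // Three accesses would mean three bounds/null checks:
            //   table[i] = table[i] * table[i];

            int value = table[i];      // one read
            value = value * value;     // the work happens in a register
            table[i] = value;          // one write back
        }
    }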
This next one is sort of a de-optimization of your code. One of the things that people run into, especially on MP machines with a lot of threading, is that if they have global values accessed from several threads, they wonder why the values are changing, or not changing, from underneath them. Make sure that if you have a global value, a global static that's being read from several different threads or written to by other threads, you use the keyword volatile.
One of the optimizations a compiler will do is say, well, this is a value that I've already got a copy of. Why should I go back and get the original? If you put the keyword volatile, this will guarantee that things will get reloaded every time you access the variable.
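A minimal sketch of that volatile-flag idea (the Worker class is made up, and it uses an instance field rather than a static, but the same keyword applies):

    // One thread requests a stop; the worker thread polls the flag. Declaring
    // it volatile guarantees the worker rereads it instead of using a cached copy.
    class Worker implements Runnable {
        private volatile boolean stopRequested;

        public void requestStop() {
            stopRequested = true;          // called from another thread
        }

        public void run() {
            while (!stopRequested) {
                // ... do a unit of work ...
            }
        }
    }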
Use static finals. This basically specifies that the variable, say in this case buffer size, is a constant. And the compiler can treat that as a constant all the way through the code and optimize, do constant folding. In this example, we know that the character array that we're allocating is a fixed size. We know that the initialization of the buffer in the loop is going to iterate a fixed number of times. So take advantage of that by making sure that you declare your statics as final if they're going to be constant throughout your execution.
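A sketch of the example being described, with invented names:

    // Declaring the size as a static final constant lets the compiler fold it
    // through the allocation and the loop bound.
    class Buffer {
        static final int BUFFER_SIZE = 4096;

        char[] makeBuffer() {
            char[] buffer = new char[BUFFER_SIZE];   // fixed-size allocation
            for (int i = 0; i < BUFFER_SIZE; i++) {  // fixed iteration count
                buffer[i] = ' ';
            }
            return buffer;
        }
    }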
There's a certain cost involved in invoking anything which is a virtual call or an interface call. In the case of virtual calls, we have to do an index into a table to find the address of the method that we want to dispatch to. In interfaces, it's a little bit more complicated because we actually have to do a match to make sure that we match the class of the method that we want to invoke.
So virtuals are a little bit cheaper than interfaces. So if you have a choice, try to stick with subclassing as opposed to creating interfaces, and you'll get better performance that way. In the Hotspot VM, we actually cache the call so that from any particular call site, we know which method worked for us last time. We try to reuse that so we don't actually do a lookup each time. But there is still a cost in that initial lookup and whatnot, so try to use virtuals versus interfaces.
One of the optimizations that we've done in the Hotspot compiler is dealing with switches. You can create switch statements with fairly sparse values in your cases. In traditional compilation, what would happen in those situations is that they would create a big if-then-else. We were using a technique of double indexing, which will allow us to actually just dispatch basically fairly quickly on any switch. It's not a nested-if combination. So if you're comparing a single variable against an integer data type, utilize switches over if statements.
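A small, made-up example of comparing one integer against several sparse constants with a switch rather than an if/else chain:

    class Dispatcher {
        static String describe(int code) {
            switch (code) {            // sparse case values are still handled well
                case 3:    return "open";
                case 47:   return "close";
                case 1031: return "flush";
                default:   return "unknown";
            }
        }
    }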
As more and more people learn to program, some of those new to programming have a tendency to use exceptions to control their program flow. You really should try to use exceptions for the exceptional cases and not for the actual flow of your program, because there is quite a bit of cost in the VM to actually handle an exception.
So if there's a good likelihood that the routine you're calling is going to produce an error, then you should probably use error codes and test the result when you come back, as opposed to throwing an exception. That would be faster than actually throwing the exception and having the VM deal with it.
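A hypothetical sketch of that split: return and test a sentinel for the common failure, and reserve the exception for the genuinely exceptional case.

    import java.util.Hashtable;

    class Lookup {
        private final Hashtable table = new Hashtable();

        // Common-case miss: return null and let the caller test for it.
        Object find(String key) {
            return table.get(key);     // null simply means "not found"
        }

        // Exceptional case: a missing key really is a programming error here.
        Object require(String key) {
            Object value = table.get(key);
            if (value == null) {
                throw new IllegalStateException("missing required key: " + key);
            }
            return value;
        }
    }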
And finally, I think you should think pure Java. In the 1.3.1 code, we've implemented something called compiled natives, which allow you to call JNI code fairly efficiently. We don't have to go through any kind of marshaling code, which marshals up the parameters and then goes off and calls the routine. What happens with these compiled natives is that we actually have a thunking layer, which already knows what the parameters are going to look like and assembles the parameters for a call to the JNI routine.
So on one side, your JNI calls are going to be a little bit more efficient and faster on the 1.3.1. But on the other side, there still is a cost in using JNI or JDirect, which is built on top of JNI. There's this translation layer involved, and it costs.
And then also, if you're dealing with callbacks, it's going to require some kind of lookup. So you should try to use Java wherever you possibly can, and try to avoid going off into native code. And as time goes on, we're going to get the compiler faster and faster, and you can forget about C and C++. Okay, with that, I'll pass it on to Andy.
Hi. So I get to do my little bit now where I talk about how important measuring is. So all of this, all of the information that we've been giving you is kind of useless if you then go and apply it willy-nilly to your entire 60 megabyte code base. It's really important. All the textbook advice that says don't optimize prematurely.
It's really true. What you should be doing is measuring, finding the major bottlenecks, optimizing those bottlenecks, and making sure that it actually worked, because we've seen optimizations that have actually slowed things down. I'll also go through what you should try and measure, how with the 1.3.1 Java VM we've improved things, and how you've actually got tools to enable you to measure those things. And then I'll just cover a few little myths that are still around in all those textbooks that are not quite true on Mac OS X.
So the first obvious thing that you always think of is how fast is my program going? You look at the CPU meter on Mac OS X and it's pegged at 100%. So obviously you should be looking at where the CPU time is going. Your program, whatever it's doing, is CPU limited.
The first thing that we do for you in Hotspot is we compile all those hot methods. We're counting which ones get used and compiling the ones that get called most frequently. So, obviously, look at the hot methods, look at the ones that are being compiled. And I'll cover how using -Xprof will actually tell you which ones have been compiled.
Now secondly, depending on where that CPU meter is reading, you might be using system CPU and not user CPU. In which case you might be paging. And the poor old OS is trying to just read and write things from disk and shuffle things around in the VM system.
Paging is really expensive. So if you're running on a 128 meg system and you set your heap to 256 megs, well, we think you've got 256 megs so we'll happily go and allocate and we won't do full GCs until we think we've run out of heap. But in the meantime you'll be paging madly. So think very much about controlling your footprint and heap usage.
Now other times you get into situations where your CPU isn't pegged, and in fact, at first glance your program seems to be doing nothing. And that's probably what it is doing. It's probably waiting for the disk or the network to reply. So there are some tools on Mac OS X, some of which are covered, I will mention here, some of which are covered in the performance tools talk later, that allow you to look at what your program is doing I/O wise and network wise.
And then lastly, one of the things that we talked about: synchronization. Monitor contention can get very expensive. And the reason for that is that, if you're used to Mac OS 9, switching threads and processes there was relatively inexpensive because it didn't have the memory protection and preemption behind it.
Whereas on 10, when we switch a thread, there's all the state in the processor that has to be saved out to memory. And when we switch processes, i.e. threads between processes, you've got to save all of that context as well. So it's a lot more expensive than 9, so that's one thing to bear in mind.
So how do you go about measuring all of the things that I've talked about? The best thing is, from your perspective, is to use a commercial performance tool. One example of which is Optimizeit, which Scott will be demoing just after I've talked. It provides CPU profiling and/or sampling.
So profiling is a way of tracking each and every time methods get called. Sampling does a statistical analysis. There are pros and cons of each. Profiling gives you a very precise measure of exactly how often things get called. Sampling is less invasive. Your program doesn't slow down so much. So depending on what you're doing, one or the other is better.
You can also look at object allocation, which objects are getting allocated, where they get allocated, etc. Scott will cover a lot of that in the demonstration, I think. The other thing you can do with Hotspot that we provide in the 1.3.1 developer preview, HProf is now functional. It wasn't in Cheetah.
HProf is implemented as a library loaded at runtime that uses the JVMPI interface in Hotspot. Secondly, you can use -Xprof, which is a per-thread kind of measurement. And there's -Xaprof, which gives you allocation information. So as I mentioned, HProf comes with Developer Preview 1, which is available on the website.
It's a basic CPU and monitor profiling tool. It gives you a lot of nitty-gritty detail and not a lot of analysis. There's a relatively simple UI available from JavaSoft's website that gives you a primitive GUI on top and lets you drill down a little bit. I've used that to a certain extent, and it's quite helpful. It's relatively simple to use: you just pass a couple of command-line parameters and tell it whether you want to sample or look at monitor contention, etc.
It turns out that the PerfAnal tool only works with CPU sampling; it doesn't work if you use profiling, so you should use the first example. The monitor contention option will give you a little bit of information about how much time each thread spends waiting on a particular monitor.
So if you're seeing an application where you can't really see why it's slow, but there seems to be a lot going on, probably one of the first things you should do is look at monitor contention. You can see dramatic performance improvements there, because the difference between the contended case and the cases where we don't contend, as Ivan explained earlier, is going through ten instructions inline in the interpreter or the compiler versus several thousand cycles going into the kernel and doing context switches and the like. So that's why it's expensive.
-Xaprof will give you a simple allocation profile. So you run your program, and right at the end, when it exits, it'll spit out this dump of all the objects that got allocated, how much space they took up, the average instance size, etc. And just from that information, you can say, well, maybe I shouldn't be allocating so many vectors or hash tables, etc. But it doesn't give you any information about where they got allocated, which is why OptimizeIt or something like that is much more useful.
-Xprof is of somewhat limited use because it gives you per-thread information. So if you have a program that forks 400 threads, like VolanoMark or something that I tend to run on and off, at the end of the program when it exits, it spits out 400 copies of the information, which is not very useful.
But it is the only way that I know of where we actually list out the methods that got compiled versus the ones that get interpreted, and how much time we spend in interpreted code versus native code versus compiled code, and how much time we spend in GC, etc.
So that can give you some very useful insight on, first of all, which methods got compiled. And you might look at it and say, hang on a minute, I expected method A to get compiled because I was under the impression that this was my most expensive method. But it turns out that in fact we didn't go anywhere near that, we didn't compile it, or maybe we did and we couldn't compile it because it's got some funny assembler or some construct or it's too big, etc. So that will pinpoint which methods are getting compiled. You can sanity check that the ones that are getting compiled are the ones you expect.
Once you know the ones that are getting compiled, you can then focus your optimizations on those methods. And as a little example use down at the bottom, it's very simple to use. But like I said, don't try it with 400 threads. Just do it on something with a minimal number.
Now measuring memory is a little harder because the Java VM has several different perspectives on what memory is. As far as you are concerned, the only memory you can really have any control of is the memory in the Java heap. The tips that Ivan explained whereby you null out references, you try and avoid using finalizers, etc. That's the kind of thing you can control.
Other than that, you can watch the heap as it grows and shrinks using the -verbose:gc flag. As we mentioned, -verbose:class will show you classes as they get loaded. You might see classes getting loaded before you think you should be using them, and that's an example where you should go in and pinpoint why they're getting pulled in, and maybe you can load them a bit later.
There's a command called top which will give you an overall memory view of the whole system. And that's good for splitting out memory that's being shared. For example, when you run multiple Java processes, some of the memory that we pull in from the shared generation is shared between several processes.
And you can tell the difference between memory that's privately allocated, used in the heap for you, versus memory that's being shared in the shared generation or is being shared because of dynamic libraries that are being pulled in by native code, either your code or ours. vmmap is another command-line utility. It gives you a lot more specific insight into the intricacies of the virtual memory being used, and it's relatively complicated.
If you want to learn a bit more about it, they might cover it in the performance tools talk. It turns out that Java VMs are improving faster than the books about performance in Java can be written. So there are quite a few books out there, most of which contain extremely good advice. But some of their tips just become outdated with time as the technology rolls on.
Traditionally in 1.0 and 1.1 VMs, allocation was very slow; they had a malloc-based allocation scheme or something. Our allocation is now extremely cheap. Now, the initialization of an object may not be, but allocating it is a few instructions. As a result of that, and as a result of the scavenging, short-lived objects are very cheap to GC, because we essentially don't do anything with them. We just throw them away at the end of their life cycle. The cost of synchronized methods is small, as Ivan mentioned.
And the contended case is still expensive. Now, lastly, as I hinted at before, system calls which involve entry into the kernel, just because of the whole context switching and a little bit more weight involved than with Mac OS 9, they are expensive. There are certain things that we do on your behalf as part of the Java APIs that involve system calls, network operations, I/O operations, things like that, thread yield. All of those are system calls. So if you don't need to do things like that, avoid them.
So here's a quick graph which you've seen before. This is the peak allocation performance from various different technologies. I think Blaine showed it in his talk. You can see that this includes the garbage collection side of things, so it's not just allocation. And you see that compiled Java, which is the tall bar, is just way faster than any other technology.
Now, here's an example that I pulled from a performance book published a year or two ago. They gave an example of one thing you can do to improve performance: pooling objects, so that by recycling them you avoid the cost of allocating and GCing them. So I wrote a little benchmark and I ran it on my G4 PowerBook.
And I got this sort of distribution. So as I increased the number of threads, you can see for the single and two-threaded cases, the pooling is just slightly faster. So I'm allocating 100,000 of them, filling them up, and then throwing them away, etc. But when you get to a larger number of threads, you can see that the time taken to actually recycle these vectors is actually longer than it took to create them and GC them.
Now, in the dual-processor case, the moment you go to anything other than single threaded, the simple allocate-and-throw-away mechanism is faster. And the point is that you don't have to incorporate any complicated pooling code if you just do the brain-dead thing and just allocate and throw it away.
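This is not the benchmark shown in the session, but a rough sketch of the two strategies being compared; note how the pooled version has to synchronize on the shared pool, which is where the contention comes from as threads are added.

    import java.util.Vector;

    class PoolVsAlloc {
        // Strategy 1: just allocate and let the garbage collector scavenge.
        static void allocateAndDiscard(int count) {
            for (int i = 0; i < count; i++) {
                Vector v = new Vector(16);  // cheap to allocate, cheap to collect
                v.addElement("x");
            }                               // v becomes garbage immediately
        }

        // Strategy 2: recycle objects through a shared pool.
        private static final Vector pool = new Vector();

        static void recycleFromPool(int count) {
            for (int i = 0; i < count; i++) {
                Vector v;
                synchronized (pool) {       // contention grows with thread count
                    v = pool.isEmpty()
                            ? new Vector(16)
                            : (Vector) pool.remove(pool.size() - 1);
                }
                v.addElement("x");
                v.removeAllElements();
                synchronized (pool) {
                    pool.addElement(v);
                }
            }
        }
    }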
So this is one example where the technology has just moved on, and that little truth about pooling things is not quite so true. Now, on the other hand, you may have an object, and in this particular example I'm talking about a Java thread, that is extremely expensive to create and initialize. The expense of creating a thread involves a couple of kernel trips to create the internal data structures.
[Transcript missing]
The corollary is that sometimes, especially with something like a thread, which involves a kernel data structure, it can actually be costly to keep them around as well. So you have this trade-off: some things get more performant, but on the other hand, you have to pay the penalty of keeping the kernel wired memory around, and the extra stack, etc. So this little graph: there's an example web server on Sun's site, brain-dead simple, it's just a web server.
sits in a socket listening for a request, gets a request, it hands it off to a thread to respond to it, and sends back a response. So I took that example and I produced three variants, one of which forks a new thread for every request, the second one which uses a pooled collection of threads, and then a third one where all the worker threads themselves actually sit and accept and handle the request directly, so there is no listen thread.
And the purple, sorry, the line... Well, this is the response time as seen by the client. So with the version that forks a different thread for every request, it just doesn't scale as the number of requests go up. You just see a response time that degrades with n squared according to the number of clients.
The others degrade as well, but they degrade much more gracefully. Now interestingly, I had kind of expected when I did this exercise that the version running multiple threads in Accept would scale even better than the pooled version. And lo and behold, it isn't actually that true. So that's actually a--I really wanted to include this slide because it's an indication of why you should be measuring.
Because my expectations were dashed. Now, on a dual processor it's really interesting as well, because the thread-per-request one seems to be doing almost as well as the pooled versions. That was somewhat unexpected. But then I realized that what's happening in the pooled versions is that I'm getting a lot more contention because I'm on an MP system. So the version where I'm running multiple workers in accept is actually the best performing one. And the reason for that is that all of the contention is handled right in the kernel, right at the accept call, rather than everything coming out and fighting over the socket.
So our conclusion is very simple. Your application design is paramount. That's the most important part of the performance of your app. There's a lot of new stuff in recent VMs and in Hotspot and on Mac OS X that have improved some of those bottlenecks. If you follow our advice, you'll get better compiled code, and so your app will run faster. And where you are seeing bottlenecks, just keep measuring and improving those things and you'll see results.
So what we're going to do now, Scott's going to come up and give a demonstration of Optimizeit. So I'm going to show you Optimize It. I showed the memory portion of Optimize It at the Java development tools session. So I'm going to concentrate on the profiling portion. It's a tool written by a company called VM Gear. They used to be called Intuitive. It's a pure Java tool. It uses just a tiny little bit of native code, so it's mostly pure.
The way it works is it just sort of instantiates its own hooks between you and the VM, and then you run your application on top of it, and then you're able to just look at all the same kind of profiling stuff that you see in like Xprof and memory profiling and all that stuff.
It's really cool. We use it at Apple. We've been helping them get it up and running, and we've been using it to actually work on all of our AWT work and swing work to find all of our bottlenecks. I mean, it saved us tons of time, and we're hoping that we can convince them to get it. We've got a developer release out for you guys. They've committed to a fourth quarter release, so that's good that we have it coming out eventually. So let me just go right to my demo machine.
There's not that much I have to say about it, but this is my canned sorting table demo. This is right out of SwingSet. I just put in names of people on our team, and I added some sorting to it. I didn't use any of the collection classes. I wrote my own sort.
So I wrote the worst sort possible. I do a little bubble sort here. So it's kind of slow. So I click here, and I sort by first name, and that's sorted. And that's only 58 items that I sorted by first name, and I sort by last name. So that's not really good, and I actually pause all of the UI while I'm sorting. So I want to figure out what's going on. Why is this taking so long?
So what I'm going to do is I'm going to go over to Optimize It, which I've already launched, and if I want to hook into this, I started this other app using Optimize It stubs, and I'm going to do this through remote debugging. You can launch it all through here, but I kind of like doing the remote thing because it shows you can do it on a separate machine. So I'm going to go to Remote Application. It's already been set up on this machine on this port. I have my source path set up, so I'll just attach to that.
and it'll just take a second for it to connect up. It gets, I had this all set up for my demo this morning. So this is the memory, the memory profile of everything, all the objects that have been instantiated and there's a lot of cool stuff in here.
But I'm going to go into the CPU profiler. So the CPU profiler hasn't been profiling until you tell it to. That's one of the big differences between this and something like -Xprof: you get to profile just the segment of the application that you want, and you can turn it on, do your work, and turn it back off. So what I'm going to do is I'm going to press on the button, go to my application, click sort, and click sort again. Let's do another one. Then I'll go back here and I'll stop.
So here's one of the cool things. Here are all of our threads. Red is idle, green is active. And there are even groups of threads, like main, system. So if I just start generally at the main thread and I'm looking around trying to see what's going on. And let me flip this around into the normal execution path. So what we have right now is 49% of this happened through event dispatch, which makes sense because we clicked on buttons to do most of our work. And then 34% of it was in thread run.
Now, I know I wrote this application, and I have a separate thread that gets spawned off every time I sort. I actually created a new thread. I wrote this really badly. So I spawned a new thread and I run my sort. So if I look through here, I can see that I have my sort, and it ends up calling greaterThan, because I'm doing an excellent single-directional bubble sort. And I have a greaterThan, and I have some of my time in toString.
So let's see what else is going on inside of greaterThan. I've got a compare inside of greaterThan. And there's something called toLowerCase. So immediately, there's something going on here. And I can look; if I just click here and I double-click on this, it'll bring up my source code viewer.
And I see there's some toLowerCase that's inside of the AWT code, and I don't care that much about it. But toLowerCase is taking up a lot of time. And that's inside of... there's a whole bunch of things inside of AWT. But I want to see my stuff. So SortData is my class, and let's see what's going on in my compare.
And I can see right here that I get two strings. And to compare them, I turn them into lowercase first because I want them to be case insensitive. And then I compare them character by character. And, okay, so you know that's really bad. You know, actually, I've had an engineer who's done this before at another company I was at.
So there are a lot of things here that immediately jump out at you. Now, I sort of dove around here because I wanted you to see this graph. It's pretty cool. It shows you each entry point and how much time is spent. You can even get sub-percentages.
So if I select here, I think, if I mouse over here, it says that 99.25% of the time is spent inside of this compare. And that's just my compare. That's not anything else going on. If I want to see right down here, this would tell me immediately what the problems are. These are my hotspots.
And this is just taking the individual methods, no matter who called them, from whatever direction, and showing you what percentage of your time is spent in those hotspots. And if you flip the graph back around, you can start from the hotspots and go back down, and you can see who's calling each hotspot. And you can look and see that toLowerCase. Okay, who's calling that? And it really is only from one place: it's from my sorter. So that's kind of cool. And I found that pretty easily.
Here I was just on the main thread, but you can also go into your individual threads. And I can say, let's just look exactly at the sorter. I looked at the whole main thing, and that included all my event loops and stuff. So if we look at my sorter, my sorter's even worse: it's got 60% spent in there. So if I actually had a big list I was trying to sort or something, I could do it a little better. So I'm going to reprofile again.
Let me go back here. So I have this thing called Fixit. Now, this really shouldn't be called Fixit; this should be called "don't be so stupid." What this is doing is it's not doing a whole toLowerCase; it's getting each character, lowercasing the characters, and testing against those. That's a little better. So I'll turn that on and I'll reprofile again.
Do a sort by first. Oh, I already did that. So I'll click these a couple of times back and forth. And we'll see that now we don't even have anything about toLowerCase in our... Let me get up here, actually. Sorry. We'll see some toLowerCase in here, but it's not going to be as huge. I can't even find it right now.
So that's pretty cool, is that now our compare is no longer the huge portion of this whole thing. We see that there's something about app context and graphics, which if you're using the hardware accelerator would be a lot lower than this. So this is a really cool way for you to find out what actually is your bottleneck.
Obviously, you wouldn't be writing these really bad sort routines, but who knows where things might pop up like this. If you're using a library from someone else, you'll actually see what parts of their library, as long as it's not been obfuscated or something like that, what parts of their library are slow. You can even see into our libraries and see what's going on inside of graphics and things like that, but you don't necessarily want to do that. But sometimes it's fun to do. And we've actually used this.
A lot of our graphics code is written in Java, so we use this all the time. I mean, I have engineers, other coworkers of mine are coming in and saying, you know, I made some changes over the past week and everything just slowed down. Like JBuilder doesn't run very fast. What's going on? And we haven't changed anything. And we run it through here and we find that, yeah, it was someone did a really bad draw circle or something like that. And so we optimize that and we get back our 10 times improvement.
So it's pretty cool. Let me just show you another thing. Which is useful. This is the VM statistics. I've had this running the whole time. And it shows you things that lead into what the other people before me all were talking about, which is you don't want to load all your classes right away.
So you can turn this on at startup and you can see your classes being loaded and it'll show you as you do different things. So if you actually have dynamically loading classes, which is what you want, you want to load them slowly as users get to different portions of your app, you'll see your classes going up and up and up. Plus threads active. So if I were to go over here and I actually say sort first name, oh wait.
[Transcript missing]
or something that's going to be really slow here like resizing these things. And you can see what's changed since you did that. And you can see that a whole bunch of char arrays were allocated. Rectangles, obviously we use a lot of rectangles in graphics. And they should mostly go away when you run a garbage collector. So I hit the garbage collector and we see that rectangle went down to none so we did a good job. There's something going on where there's one string and one character and we might go hunt those down for references or something.
But that's basically what you have. There's a lot of different things in this sampler. You actually can do, let's see... there it is. So you have different types of profiling. I did all this profiling using sampling, so every five milliseconds it checked what routine we were in. I could crank this down or up.
I could also go to this mode called instrumentation, where every single call is being counted, so that you don't miss anything just because it only happened for half a millisecond and you happened to always miss that half a millisecond. Sampling usually works pretty well. Instrumentation will slow your app down even more.
A couple of things about running this. It requires the 1.3.1 interpreter or Hotspot version that's in DP1. I'm running this all on the interpreter from a pre-DP1 release; that's why my app is even slower, but it worked pretty well for this sort demo. The other thing I wanted to mention again is that this UI was done in IFC. The guy who writes OptimizeIt wrote IFC, so he loves it, but that's why it's not an Aqua look.
I'm going to show you how it looks like in a second. This is his own UI inside of there. If you're interested in finding out when it's going to be available and how much it's going to cost and all that, contact VMGear. It's VMGear.com. I'm sure they'd love to hear from all you guys because they got this up and working and they're excited to have a whole bunch of sales to Java programmers. That's about it.
There's a little slide with a roadmap of the relevant talks that are coming up following this one. There's a demonstration and talk about JBuilder that you might want to go to in the Civic Center just after this talk. And then some of the other ones: QuickTime for Java, and, as I mentioned in my bit, Apple Performance Tools, which will give you more information about the performance tools if you're specifically interested in that. So all we'll do now is we'll have a quick Q&A session. I'll invite the rest of the...