WWDC02 • Session 407

Java Performance

Java • 57:49

This session presents the performance opportunities available with J2SE. Topics include optimization for file handling, drawing, compiler usage, and faster debugging. Learn what should and should not be done to ensure the best performance of Java applications.

Speakers: Jim Laskey, Victor Hernandez

Unlisted on Apple Developer site

Transcript

This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.

Good morning, ladies and gentlemen. Welcome now to session 407, Java Performance. Please welcome now Jim Laskey and Victor Hernandez. My name is Victor Hernandez, and I'm a member of the Java Runtime Technologies team, and I will be giving this talk. We're hoping to round off your week of Java sessions with some hints on how to write code for better performance, and also a preview of upcoming technologies that will be available to you in the coming JDK 1.4.

So what is the goal of this session? The goal of this session is for you to have a better understanding of exactly how and why your application performs as it does. We'll be discussing improvements that are being made to the Hotspot JIT compiler, especially in the JDK 1.4 timeframe that will improve the performance of your application. And we'll also be introducing to you new APIs that are available in JDK 1.4 that will be improving the performance as well if you use them.

So the structure of the presentation is that Jim will be talking about Hotspot compilation. I'll be introducing the new APIs available to you in JDK 1.4, which are also available in the developer preview that we're making available from the developer.apple.com website. And we'll have a quick summary, and then we'll bring up all the members of the Java Runtime Technologies team to answer your questions. So I'm going to pass it off to Jim now.

Good morning, survivors. One of the main factors in determining the performance of a Java VM is how well its just-in-time compiler can convert Java bytecodes into native machine code. In past presentations, we've talked about various elements of the Java language, how the compiler takes those elements and converts them to machine code, and how we can potentially rework code so that you can get the best possible optimization from the compiler. This year, because we're switching over to 1.4 and we've been working on 1.4, I want to talk a bit about some of the new technologies that have been introduced in the compiler and some of the new optimizations that are being done.

Specifically, I want to talk about six new or improved technologies that are coming along with 1.4: class hierarchy analysis, dynamic compilation, full-speed debugging, aggressive inlining, an expansion on intrinsics, and something totally new, which has been sort of the crux of the new work that we've done, called the low-level intermediate representation. Class hierarchy analysis is new. Dynamic compilation is partially new. Full-speed debugging is a new feature. We've had inlining before, but aggressive inlining is a new feature.

First of all, we'll talk about class hierarchy analysis. One of the things that, if you're working in the Java environment, you'll eventually figure out is that in the class hierarchy of all the classes that you're working with, about 80% of them are never subclassed. They're never overshadowed by other classes. There's only about 20% of your working classes that are actually overshadowed. You can think of those 80% as being leaf classes, or in Java terminology as final classes, even though you don't have them tagged as final.

And within those classes that are leafs, more often than not, you only overshadow a few of the methods that are in the parent class. And because you're only overshadowing a few of them, a lot of those methods could be considered as final as well, even though they're not tagged as final. And this influences the optimization that's done by the compiler.

Another factor that's involved here is instanceof testing. Instanceof is one of the more expensive operations in the VM, because what we need to do to test whether an object is a member of a class is to extract the class field that's in the object and compare it against the class that you're trying to test against. And if it doesn't match, then you have to search up the class hierarchy to see if you can find a match, to see if it's actually an instance of that class. So it's a fairly expensive operation.

Fundamentally, what we're trying to do here is to improve performance. We find that virtual calls cost a lot more than your final calls or your static calls because there's a lot of overhead in searching for the matching method, doing a lookup, and so on. What we want to try to do is reduce as many of our virtual calls into the same state as a final call, a call to a final method.
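
As a minimal sketch of the kind of code this helps (the Point class and its methods are invented for illustration, not taken from the talk):

    // A class that is never subclassed behaves like a final class under CHA,
    // so its "virtual" calls can be compiled as direct calls or inlined.
    class Point {
        private int x;
        private int y;

        Point(int x, int y) { this.x = x; this.y = y; }

        int getX() { return x; }   // simple accessor, an easy inlining candidate
        int getY() { return y; }

        int distanceSquared(Point other) {
            // These look like virtual calls, but while no subclass of Point is
            // loaded, CHA lets the compiler treat getX()/getY() as final.
            int dx = getX() - other.getX();
            int dy = getY() - other.getY();
            return dx * dx + dy * dy;
        }
    }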

Okay, so this is sort of the background of what we're trying to do or trying to influence. In the 1.3.1 VM, we have very conservative optimizations in relation to the class hierarchy and the influence of how classes relate to each other, and also with methods. When we're doing an instanceof, we check to see if the class is a member of the -- or sorry, the object is a member of the class, and then we search up the class hierarchy to see if we can get a match.

It can be fairly expensive. Then we also have to make the assumption that any method that's being called has to be thought of as -- in a virtual sense, has to be treated as virtual. We can't do any kind of optimization around that because we always have to assume that that method can be overridden and we might introduce a new class that might force a virtual dispatch.

So in 1.4, we have this new technology called class hierarchy analysis. And what it does is a more detailed analysis of how classes relate to each other, both in an inheritance chain and in how methods that implement interfaces relate to each other. And in doing that, we can do a few things that will improve the optimization when we're compiling a method.

The first thing is basically a faster instanceof: knowing that a class is final allows us to do a very straightforward test to see if an object is a member when we're doing instanceof. You're already aware of java.lang.String being a final class.

So if you're checking to see if something is a string, the test internally is basically: pull the class out of the object, compare it against the java.lang.String class, and see if there's a match. It's a very simple compare. If it doesn't match, then it's not a member; there's no searching up the hierarchy, because it's a final class. That's a very simple test.

With class hierarchy analysis, we can actually perform the same simple test on any class which is considered a leaf class in the hierarchy. So we went through and annotated each of the leaf classes as being final, at least for that point in time.

There's also a secondary type of mechanism for things that are actually in a class hierarchy. Instead of searching up a linked list, we construct an inheritance table for all classes. In that way, we can assign an inheritance depth for a class and index directly in that table. Again, instead of searching up the chain, we can do a simple check, but we pull it out of the inheritance table and make it a lot faster.
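
As a small illustration of what the fast case looks like from the source side:

    class FastInstanceOf {
        // java.lang.String is final, so this test compiles down to: load the
        // object's class pointer, compare it against the String class, done.
        // No walk up the class hierarchy is needed.
        static boolean isString(Object o) {
            return o instanceof String;
        }
    }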

In the class hierarchy analysis, we also do a method-level analysis. In 1.3, we only check on a per-class basis whether a method is overshadowed by a subclass or not. In 1.4, we look at every individual method and ask, is this actually overshadowed or not? And this is true also for interfaces: we see how many implementations there are of a particular method. If there's only one implementation, then we don't have to dispatch using an interface-type call. We can make a direct call to that function.

Okay, so what does this analysis allow us to do? It allows us to make direct calls instead of virtual or interface-level calls, and then it's much faster. You just basically do a direct branch to the routine. You don't have to worry about checking to see if it's the right class or whatever.

It also allows us to do inlining, and I'll be talking about this a little bit further in other slides. It allows us to inline across virtual and interface calls. So where we couldn't inline a method call before, we can now do that because we basically can say, at this point in time, we can think of this as being a final method. We're not going to do a virtual dispatch. We can inline that code directly into the call point. Thank you.
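
A sketch of the interface case (the Shader interface and its implementation are made up for illustration):

    interface Shader {
        int shade(int pixel);
    }

    // Suppose this is the only loaded implementation of Shader. Method-level
    // CHA then knows there is exactly one shade() implementation, so the
    // interface dispatch below can become a direct call and even be inlined.
    class BrightnessShader implements Shader {
        public int shade(int pixel) {
            return pixel | 0x00202020;
        }
    }

    class Renderer {
        static int apply(Shader s, int pixel) {
            return s.shade(pixel);   // looks like an interface call, can be devirtualized
        }
    }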

The second technology is something called dynamic compilation, and we've always had some kind of dynamic compilation in Java. It kicks in initially when you're basically converting from an interpreted execution of a particular method to its compiled version. We have a transition from interpreted to compiled. But as you'll see in the next slide, we've gone a little bit further, which allows us to do a little bit more substitution of compiled and interpreted code.

So, with dynamic compilation, what we're trying to do here is basically get better code generation, or better optimization, based on the current state of things. As I said in the previous slide about CHA, we can think of leaf classes, or methods that are in leaf classes, as final. That means we can do better optimization, knowing exactly what method we're calling. So we can make assumptions about the current state.

We'd like to make assumptions about the current state. And then we could do that, but then we'd be sort of disrupted by the fact that dynamic loading of classes may introduce a subclass and basically force us to throw away all of our assumptions. So if we have to throw away our assumptions, we have to throw away the code that we've generated as well.

And that's where dynamic compilation kicks in. So in 1.3.1, we are very conservative. The compiler can make no assumptions about classes. It basically just has to compile them, assuming that new classes will be loaded and that if I have a virtual dispatch in the code, I have to leave the virtual dispatch there and I can't do any optimization around it.

In 1.4, we can compile for the current state and then do a dynamic replacement. So we can compile assuming that a particular method is final or a particular class is final based on CHA. But then if a class is loaded and changes that relationship, we can dynamically replace that method on the fly. There are two technologies that we use to do that. The first one has always been there: on-stack replacement, which allows us to replace something that's currently being interpreted with its compiled version. So that's basically interpreted through to compiled code. OSR specifically deals with methods that are looping for long periods of time. What's been introduced in 1.4 is something called deoptimization, where we're allowed to replace a compiled version of the code with its interpreted equivalent.

Now, how is that connected with replacing it with compiled code? What we can do is initially interpret a method, then replace it with its compiled version, then load a new class which forces recompilation of that method. In the meantime, while it's being recompiled, we replace it with its old interpreted version.

And then once it's finished recompiling in its new form, we can switch it back to the compiled form. So this is a transition state. The advantages of deoptimization are that we can correct for changes in the class hierarchy. We could also introduce higher levels of optimization: if a method is running for long periods of time and we think we can get better optimization on it by spending more time compiling, we could replace it with a more highly optimized version. And finally, we can debug compiled methods. That gets us into the next slide, which is something called full-speed debugging. In 1.3, in order to enable debugging, what we do is debug only interpreted code.

And the reason we do that is that it's very easy to map the interpreted bytecode back to the source code. So if you want to do source-level debugging, there's a very easy mapping there. The problem with that is that when you're running interpreted, it can be too slow. I'll just give an exaggerated example. If you have a program that calculates 20 million data points before it craps out, then you could be sitting there for quite a while getting those data points calculated.

So if you could actually run it in compiled mode and then hit the bug, then you're not sitting there for as long a time. Now why can't you debug compiled code directly? Well, because of the optimization that occurs on the compiled code, it's not that easy to map from the compiled code back to the source code. So we need to transition from the compiled code back into the interpreted code. As it says on the bottom line there, debugging in 1.3.1 forces interpretation across the board. So you're forced to interpret.

In 1.4.1, when you turn on debugging using -Xdebug, we don't disable the compiler. The compiler still compiles everything. But then if you hit a breakpoint or you start stepping through the code, those methods get converted over into interpreted methods. So then we can step through the bytecode and debug it.

The mechanism for doing that is quite interesting. I won't get into any details here, but it's basically when we want to de-optimize a particular execution, we just set it up so that when we return back to that execution, we go into some handling code, which does the actual substitution of the current compiled frame with some interpreted frames.

Now, in the first technology, we talked about class hierarchy analysis and the ability to sort of temporarily, at least, think of methods as being final methods. Knowing that a method is at least temporarily thought of as being final, this allows us to say, okay, we can make direct calls to that method, or if the method is sufficiently simple, we can inline it at the call point.

So more often than not, if you're writing small, concise methods, the cost of actually calling the method far outweighs what's actually going on inside the method. It may be a simple accessor to return the value of a field, or do a simple calculation, or get the value of a static.

[Transcript missing]

In 1.3.1, because we didn't have these earlier technologies, we can only really inline simple accessors and statics and final methods. So the level of inlining is very small.

With Hotspot 1.4, we can increase the complexity and depth of the types of methods that we inline, primarily because of thinking of methods as being final, so we can have accessors that call accessors that call other accessors, and they can all be brought in and inlined. We also allow methods that have exception handling to be inlined. So we've allowed for more types of methods to be inlined.
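
For instance (a made-up example), a chain of accessors like this can now collapse into a few loads at the call site:

    class Vertex {
        private float x, y, z;

        float getX() { return x; }               // simple accessor
    }

    class Mesh {
        private Vertex origin = new Vertex();

        Vertex getOrigin() { return origin; }    // accessor returning an object

        float originX() {
            // An accessor calling an accessor: with aggressive inlining the
            // whole chain is inlined here; in 1.3.1 only the innermost simple
            // accessor could be.
            return getOrigin().getX();
        }
    }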

It takes advantage of the CHA to specify candidates as being final, knowing that we can always deoptimize those methods and replace them. And this gives us a significant improvement in performance. I think we've already seen one of the Grand Canyon examples where we've actually doubled the frame rate purely by using aggressive inlining.

Intrinsics are something that we've had for a while. Basically, they're core methods that we realize that are used a lot and that we could do better by hand coding some assembly code for those routines. The types of things you can do are, say, reduce the JNI call overhead if it's a JNI support routine.

There may be a better implementation in native code, like, for instance, sine or cosine. Sometimes there are performance bottlenecks, like maybe java.lang.String charAt, which is used a lot, and we know that we could hand-tune it a little bit better. These are the types of things that we would like to convert into intrinsics.

What happens is that, instead of calling out to the method, we insert specific code to deal with that particular operation. Some examples are sine, cosine, square root, and String charAt.

With 1.4.1, we've expanded the number of intrinsics we've implemented. The most important one probably is new I/O. In new I/O, when you're calling one of the native buffer accessors, we actually insert the code that directly accesses the buffer. So that's a lot faster than going through the equivalent Java code.

And one of the big performance improvements would be eliminating the per-byte accessors. So in the old I/O code, whenever you wanted to read an integer from a buffer, you had to assemble the integer value one byte at a time, and that would take about 21 native instructions to do that. But because we can treat that as an intrinsic, we can actually read one integer value directly from the buffer. So there's a big performance improvement there.
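
A rough sketch of the difference being described (the instruction counts are the speaker's figures, not measured here):

    import java.nio.ByteBuffer;

    class ReadInt {
        // Old style: assemble the int one byte at a time out of a byte array.
        static int fromBytes(byte[] buf, int off) {
            return ((buf[off]     & 0xFF) << 24)
                 | ((buf[off + 1] & 0xFF) << 16)
                 | ((buf[off + 2] & 0xFF) << 8)
                 |  (buf[off + 3] & 0xFF);
        }

        // New I/O: getInt is compiled as an intrinsic, so on a direct buffer
        // this becomes essentially one native load.
        static int fromBuffer(ByteBuffer buf, int off) {
            return buf.getInt(off);
        }
    }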

Thread-local allocation is inlined, so when you do a new, the code for allocating the object is right inline with your code, because minimally it's just incrementing a pointer to do that allocation. So we don't need to call a subroutine to deal with that. The same is true with instanceof and checkcast: in the simple cases where we're just doing a compare to see if the object's class matches the class we're looking at, we don't call a subroutine to do that. We do it inline. And similarly with monitors.

And the final technology I want to talk about is something called the low-level intermediate representation. It's probably not really of much interest unless you're into compiler technology and understand compiler technology. But whenever you do compilation, you have some intermediate representation that you do your manipulation on. It takes either the bytecode in Java or, let's say if you're compiling a C program, it converts the text of the C program into some intermediate representation.

The problem with the intermediate representation in 1.3.1 was that it was too high level and didn't allow us to do detailed optimizations specifically for PowerPC. And they were having the same problem with the other platforms. So Sun introduced this low-level intermediate representation, or LIR. The LIR is actually much finer grained and is closer to the native instructions. There would be one LIR instruction for an add, and there would be one LIR instruction for a load from an array, or actually a load from a memory location.

There is some higher level functionality to deal with some support things like virtual calls or type checks or intrinsics. But for the most part, you can deal with everything at the lowest level. And what does this allow us to do? Well, it allows us to do peephole optimizations.

And peephole optimizations would be things like: you might have a shift operation and a mask operation at the higher level, or in the LIR. And on the PowerPC, shift and mask is something that's all done in one specific PowerPC instruction. So we can take those two instructions and merge them into a single PowerPC instruction. We couldn't do that at the higher level before; it was much more complicated.

Since the LIR represents the native instructions, we can do better register allocation, because we can do the register allocation right in the LIR. We don't have to do it in a platform-specific way at the higher level, where we weren't doing very good register allocation, so that's much improved.

Because things have been broken down a little bit, whenever we access an array, we don't have to worry about the range check and getting the base address of the data and so on and so forth. That's all broken down at a finer grain. So now we can do unsafe array access where, once we've got the base address for an array calculated, we can get multiple entries out of that array without going through those other calculations. So that improves performance as well.

Okay, so what I want to do now is talk about how to utilize some of these improvements. The first one, of course, is write small, concise methods. I've always sort of promoted that, but before, when you do that, you make much cleaner code and it's a lot easier to compile because you're only compiling specific code at different points. But now, you don't really have to pay as much penalty for writing small, concise methods because of inlining. If your method is small enough to fit in its caller, it'll get inlined. You don't pay the call overhead, so it's to an advantage to continue that process.

I always think of accessors as being inlined, so try to keep all external access of fields and statics through accessors. That way you have the advantage of being able to change how those fields are accessed, and you can put restrictions on them. This also makes it easier, when debugging, to find out who's accessing those fields. You can write those accessor methods knowing that, for the most part, they will be inlined. There will be no cost for putting them in accessor wrappers.

Final is somewhat superficial at this point, now that we have CHA. With CHA, we basically take classes that we can assume are, at least temporarily, final. And the optimization levels that you would have gotten before by specifying final basically propagate to as many cases as the compiler can figure out. So final is superficial. It's really unnecessary at this point.

Because new is inlined and we do this per thread allocation, it doesn't make really that much sense to keep pools around for specifically small objects. Larger objects, it's okay. If you have buffers that are 64K bytes or something like that, it makes sense to have pools for those maybe. But for smaller objects that are going to be just around for a short period of time, new is pretty fast and you don't really need to keep pools for those.

On the java-dev mailing list, there was a discussion recently about programming by exception. Try to avoid using exceptions as your programming model. Do your tests in code and use exceptions for exceptional cases. That's really what they're designed for. As an example, if you were to rely on the system to do your null checking for you, when you hit the null pointer exception, it actually throws a Mach kernel exception. We go into an exception handler, it assembles the exception, and we have to do a traceback up through the stack. It can be pretty expensive. In your code, you could have just said, if the object's not equal to null, which is pretty lightweight.
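
A small sketch of the two styles being contrasted:

    class NullHandling {
        // Exception-driven: relying on the VM's null check means a thrown
        // NullPointerException, which on this VM involves a Mach exception and
        // a stack traceback -- expensive.
        static int lengthViaException(String s) {
            try {
                return s.length();
            } catch (NullPointerException e) {
                return 0;
            }
        }

        // Explicit test: a simple compare against null, very lightweight.
        static int lengthViaCheck(String s) {
            if (s != null) {
                return s.length();
            }
            return 0;
        }
    }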

Exception handling also inhibits optimization. Within your try/catch area, it can be pretty hard on the compiler as far as what it can keep in registers, as an example, because if a value is sitting in a register and an exception is thrown, then that value in the register is lost. So the compiler has to make sure that all values it's currently working with are stored out to memory, and that can add to the overall cost. So try to keep your try/catch around the minimal amount of code, so that you can get optimization around it.
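
For example (a made-up method), keeping the try block tight around the one call that can actually throw leaves the rest of the loop free to be optimized:

    class Totals {
        static int sum(String[] values) {
            int total = 0;
            for (int i = 0; i < values.length; i++) {
                int v;
                try {
                    // only this call can throw; the try covers nothing else
                    v = Integer.parseInt(values[i]);
                } catch (NumberFormatException e) {
                    v = 0;   // the exceptional case
                }
                total += v;
            }
            return total;
        }
    }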

On the other hand, if you have exception handling in there, it doesn't cost anything until it's actually used. So when you enter a try block, there's no cost as far as additional instructions are concerned. And finally, the Hotspot compiler has been optimized for fairly clean code. Two things that disrupt that a bit are Jikes and obfuscators.

They rearrange the bytecodes in such a way that it's hard to pick up patterns. So what we recommend is that you use Jikes for development, which basically gives you speed of compilation, a quick turnaround in the compilation. But then if you're actually going to deploy and you want performance from the resulting code, use javac.

The obfuscators, some of them are very mean as far as compilers are concerned. I've seen some of them that actually put exception handlers around every sequence of byte codes just to totally confuse the decompilers. That can be really hard on the compiler as far as getting optimization is concerned. With that, I'm going to pass it back to Victor.

Thank you, Jim. New API in Java 2 Standard Edition 1.4. It's pretty straightforward. With every major revision from Sun, you're going to get a lot of new packages, a lot of new APIs, some of which you'll be interested in, some of which you're not going to use, but they're all going to be there because they've been standardized into the Java platform.

What are the new APIs? So I will be introducing the new APIs, and then I will be concentrating on one API in particular that is very special and exciting because it provides a performance opportunity and not just added functionality. I will also be introducing changes that are going to be made to the Java native interface with 1.4, and I will be showing a demo highlighting this new performance opportunity and also some of the work that Jim has been discussing in the first half of this talk.

So we know we get tons of new API. Basically, I divide the new non-GUI API into two categories, plus a bunch of miscellaneous packages. There's the new I/O packages, which let you do I/O in a much more optimized way, and I will be going into much more detail on those. Then there's all the XML packages, which have been available before as standard extensions, but have now been brought into the standard platform. If you've attended any of the WebObjects and Web Services sessions, you'll know a lot about these.

Then there's a few miscellaneous packages. java.util.logging gives you the ability to do logging of your application while it's running. java.util.prefs lets you have user and system-wide preferences for your Java application. And java.util.regex, regex meaning regular expressions. So basically you now have support for regular expression processing, which is something you can do very easily in Perl, for example, and is now being brought in as a standard part of Java.

And there's also extensions to the Java security package, which can be found in the javax.security package. So lots and lots of new functionality. But specifically, there's an opportunity for you to improve the performance of your application by using new I/O. So I'm going to be concentrating on specifically that.

So that requires you to have a pretty good understanding of what exactly is wrong with the present state of Java I/O, how those problems are solved by new I/O, and then exactly how you go about using new I/O. There are basically four parts to the new I/O packages: there's native buffers, there's channels, there's Unicode support for the I/O, and there's also JNI support with newly added functions. So, we all know that Java distinguishes between the Java heap and the native heap, and the Java heap is only for Java objects and Java arrays.

And it makes this distinction to be able to do the precise garbage collection that we are able to do. This unfortunately has the side effect that if you ever need to gain access to a buffer that's residing in native memory, and you want to do processing on that buffer from Java code, you need to copy it into the Java heap.

And you do that via JNI. That has the added cost of the JNI overhead, which, if you can avoid it, would be great. There's also another problem with this, which is the fact that since it's now residing in the Java heap, the garbage collector treats it as a Java object and actually moves it around during garbage collection.

Even though we know that native buffer is not containing any objects, moving it around is simply overhead. So both of these costs add up to bad performance of your I/O if you need to use native buffers, which you need to do if you're doing any file I/O or sockets.

So, why is it that these problems exist? The problems exist because you cannot get access to native memory directly from Java code. Every buffer has to have a native copy. And as I explained before, there's incredible overhead. I don't know about incredible, but there is overhead associated with the JNI, and CPU overhead during the GC. You might say, well, there is another way of actually doing this, which is basically getting access to a buffer residing in your Java heap from a native method using the GetPrimitiveArrayCritical JNI function.

That is true, but you must be aware of the fact that whenever you do a GetPrimitiveArrayCritical, the garbage collector is actually paused until you release that primitive array. While you're doing the processing in your native method, you might be pausing any of your other threads, if any of them have gotten into a situation where a garbage collection needs to happen.

There's also the problem that garbage collection moves the buffers around, so you have the CPU overhead of that. There's also limited functionality, due to the fact that the Java heap copy of the buffer is actually moving around. So you can't do things like memory-mapped files, and you also can't do non-blocking I/O. All of that leads to the fact that this does not scale well, and anyone who has written a server trying to use Java I/O knows that you have to be rather careful as to, you know, how many threads you open for all of your connections and that sort of thing.

So how is Java new I/O going to fix this? Pretty straightforward. You now have the ability to both allocate and directly access native memory from your Java method.

You can allocate the native memory. The way that this is introduced into the Java platform is that it's actually not an extension to the language, but actually done via an API. So, you actually are creating an object wrapper that contains a pointer to the native memory. The really nice thing about this is that you also have direct access without having to go into a native method.

And that is done via get and put method calls. Those get and put method calls are actually integrated into the Hotspot JIT compiler as intrinsics, which Jim spoke about earlier, so we can actually get incredibly good performance out of this. We also avoid the garbage collection and JNI overheads, simply because you don't actually need to have any extra copies.

You have one copy which you can directly access from both your native methods and your Java code. And you actually can do non-blocking I/O now, since you have the ability to do asynchronous operations on that one location. And the goal has been achieved, which is basically to get native I/O performance.
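
A minimal sketch of what this looks like from Java code:

    import java.nio.ByteBuffer;

    class DirectBufferSketch {
        public static void main(String[] args) {
            // The storage behind a direct buffer lives in native memory,
            // outside the Java heap, so the collector never copies or moves it.
            ByteBuffer buf = ByteBuffer.allocateDirect(1024);

            // get/put reach straight into that native memory from Java code;
            // Hotspot treats them as intrinsics.
            buf.putInt(0, 12345);
            System.out.println(buf.getInt(0));
        }
    }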

One of the things that's interesting about this is that since it's been introduced as an API, it actually is supplementing the pre-existing Java I/O. It doesn't replace it. So you can continue to use your code, which uses Java I/O. But it might be interesting, it might be worthwhile to change it depending on the situation. I'll go into those situations later.

But also it's interesting to note that it is not fully used throughout the Java runtime environment. In particular, a lot of graphics operations still require Java arrays as their underlying representation for the graphics and are not yet using native buffers. Support for that will be coming in future releases of the JDK.

So there are three parts to the new I/O APIs. The first one is the java.nio buffers. These are containers for data of primitive types. There's a java.nio buffer for every primitive Java type. There's an int buffer, there's a char buffer, there's a long buffer. But underneath all of that is a byte buffer.

And all the I/O operations are actually done on the byte buffer. As I said earlier, you access these via get and put methods, and these are integrated into the Hotspot compiler. Since the I/O operations are actually done on byte buffers, you need to create a byte buffer, use that byte buffer for all your I/O operations, and then, when you want to use it from Java, you call asIntBuffer or asLongBuffer on the byte buffer to get access to the Java types.
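
A small sketch of the view-buffer pattern being described:

    import java.nio.ByteBuffer;
    import java.nio.IntBuffer;

    class ViewBuffers {
        public static void main(String[] args) {
            // The byte buffer is the thing the I/O operations work on...
            ByteBuffer bytes = ByteBuffer.allocateDirect(4 * 256);

            // ...but Java code can view the same storage as a typed buffer.
            IntBuffer ints = bytes.asIntBuffer();
            ints.put(0, 42);
            System.out.println(ints.get(0));
        }
    }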

The byte buffer has support for being direct. By direct, I mean that any I/O operation is done trying to use the native buffer. You have the ability to add a backing array that is resident in the Java heap, but if it's direct, the native buffer is used.

Since the byte buffer represents a native buffer, you actually can do memory mapping. One of the things that's interesting about this is that the allocation cost of creating a Java NIO buffer is actually greater than that of creating an array. Hotspot does a great job of optimizing allocation directly into the Java heap, both of objects and of arrays, so it's not possible to get as good performance as that.

Actually, I wouldn't make that assertion anyways. But the reality is that you should be aware of this. And it's not just simply, oh, anywhere where I was using an array, I now need to use an NIO buffer. You basically need to check to see if it applies to your application. The best rule of thumb is that if it's going to be long-living and it's large, you want to avoid all those copies. You want to avoid all the CPU overhead. And therefore, it's probably worthwhile to use a Java NIO buffer.

So where are the actual I/O operations defined? They're defined inside of the java.nio.channels package. The java.nio channels allow non-blocking and interruptible operations. There is support for file system access, and that's in the file channel. And that actually has, as I've said before, memory mapping support and also file locking support, so you can actually lock certain sections of a file from other threads within your application. There's also support for inter-process communication. Remember, these are interruptible. There's the socket channel and the pipe channel. These are equivalent to the old input and output stream classes that you were accustomed to before.

The third part of the new I/O APIs is the Unicode support. This one's rather straightforward. Since you might want to be doing text processing on that input and output, you want to have the support for mapping directly from the bytes to Unicode, and Java supports that. It supports all Unicode encodings, and that actually is done specifically on the byte buffer and character buffer classes.
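
A quick sketch of decoding bytes to characters with the NIO charset support:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;

    class DecodeSketch {
        public static void main(String[] args) {
            ByteBuffer bytes = ByteBuffer.wrap("new i/o".getBytes());

            // Map the raw bytes directly to Unicode characters.
            CharBuffer chars = Charset.forName("UTF-8").decode(bytes);
            System.out.println(chars);
        }
    }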

The final part of the NIO functionality is actually additions to the JNI functions. They've basically added three JNI functions so you can use new I/O. First, there's the NewDirectByteBuffer function. That lets you create a byte buffer from your native code, so that you can actually wrap any pre-existing native buffer in a direct byte buffer and then pass it back to Java. Then there's the opposite operation, which is getting access to the native buffer from the encapsulating Java object, the byte buffer.

You might be concerned that since there's new functions available in the 1.4 Java Native interface that you will have to update your old JNI libraries. That is actually not the case. Your old JNI libraries will be compatible, but there are other reasons to maybe update yours. The first one is actually to be able to use the new I/O functionality.

The other reason is, as was mentioned in the Java VM internals talk, we've actually made changes to the Java implementation in Jaguar so that we actually support JNI libraries that are dylibs and not just bundles. It might be a good opportunity to switch over to doing that with the 1.4 transition. So, APIs are great, but there's nothing better than actually looking at some code.

I have here some example code that I was using to check to see how good the performance actually is. There are three main things I want to highlight as far as the code is concerned. The first one is how to do memory-mapped file input and output. And that's pretty straightforward. You still use the old FileInputStream. And now FileInputStream actually has a getChannel method, which gives you access to the channel. At that point, you can call the mapping function, and you can say read-only or read/write and whatnot.
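
A hedged sketch of the pattern being described (the file name here is made up):

    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    class MapSketch {
        public static void main(String[] args) throws Exception {
            FileInputStream in = new FileInputStream("pixels.dat");  // hypothetical file
            FileChannel channel = in.getChannel();

            // Map the whole file read-only: no read loop, no copy into a Java array.
            MappedByteBuffer map =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            System.out.println(map.getInt(0));   // read straight out of the mapping
            channel.close();
        }
    }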

This is where it gets interesting. In this actual piece of code, what I'm doing is reading in a file of pixel information. I'm making myself a copy, and then I'm creating a server so that I can send it over to another application.

So the reason I'm making a copy is because I'm then expecting the other guy to send it back to me, and I need to have some place to store the pixels he sends back to me. So I'm going to actually throw away the reference to the memory mapped file.

So you can see that I actually allocate the byte buffer as a byte buffer, not as an int buffer, even though we know that that's the correct representation for a buffered image. And then here's a few operations. You need to mark the buffer because basically a buffer needs to know what its beginning position is, what its limit is, and where the present location is, which is actually known as the position.

By default, when you create a buffer, it does not have a beginning location. What you need to actually do is call mark on it to make the beginning of it the actual beginning location. Otherwise you'll get an exception later when you try to revert to the beginning location. Then reading the file contents is as simple as calling put. And that advances the buffer to the end. So I need to do a reset to bring it back all the way to the beginning.
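
A tiny sketch of that mark/put/reset dance:

    import java.nio.ByteBuffer;

    class MarkResetSketch {
        public static void main(String[] args) {
            ByteBuffer buf = ByteBuffer.allocateDirect(16);

            buf.mark();                          // remember the starting position
            buf.put(new byte[] { 1, 2, 3 });     // put() advances the position
            buf.reset();                         // jump back to the marked position

            System.out.println(buf.get());       // reads the first byte again
        }
    }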

The last operation you can see right now is actually copying into a Java int array. As I said before, this isn't integrated throughout the Java runtime environment. I actually still need to do the copy, which would be great if I could avoid, but unfortunately the APIs don't support that yet. That's basically what's going on there.

[Transcript missing]

There are selectors inside of the java.nio.channels package. You create a server socket channel. And then here's a key one right here: by default, it's actually set to blocking, and you can get non-blocking I/O simply by setting blocking to false. Then you do your standard binding. You need to bind that channel to an address.
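
A minimal sketch of that setup (the port number is arbitrary, chosen just for illustration):

    import java.net.InetSocketAddress;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;

    class NonBlockingServerSketch {
        public static void main(String[] args) throws Exception {
            ServerSocketChannel server = ServerSocketChannel.open();
            server.configureBlocking(false);                     // channels block by default
            server.socket().bind(new InetSocketAddress(9000));   // bind to an address

            // One selector can wait on many channels from a single thread.
            Selector selector = Selector.open();
            server.register(selector, SelectionKey.OP_ACCEPT);
            selector.select();    // blocks until a connection is ready to accept
        }
    }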

[Transcript missing]

So this is actually the iterator, and this is where I actually accept the connection.

And now what I'm actually going to do is write my first transmission, sending the pixels over to the other guy. And you can see right here, where is it? There it is. So this is how you write to the channel. And you'll see that I'm doing that directly from my byte buffer and not having to do it from my int array, even though I still needed that to actually draw the pixels to the screen. I use this position stuff to figure out how far I've gone, but you see that every time you write, you actually update the present position in the buffer.
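
A short sketch of writing a byte buffer to a channel; note how each write advances the buffer's position:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    class SendPixelsSketch {
        // Write the whole buffer; write() may send only part of it, and each
        // call advances the buffer's position, so loop until nothing remains.
        static void sendAll(SocketChannel channel, ByteBuffer pixels) throws IOException {
            while (pixels.hasRemaining()) {
                channel.write(pixels);
            }
        }
    }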

So that gives you a general idea. There's really good example code available at the Sun website. Just look for new I/O and you'll see plenty of great examples. Now I want to give you a demonstration of all this technology we've been talking about. You might have seen this before: the Grand Canyon demo.

But what I want to emphasize this time is actually what part of the technologies that Jim and I have been talking about today is actually in use during this demonstration. The Grand Canyon demo is a great flight simulator written by Ken Russell, who is a hotspot engineer over at Sun.

And what we have here is a real-time renderer. And right now, we have the notion of, all right, we're flying in 1.3.1 mode, and we're flying in 1.4 mode. The reality is that we're running this on the 1.4 VM that is available to you in the developer preview.

Let me actually get it so it's looking at something. There we go. And then I'll fly down. So, where am I? There we go. Or somewhere. I've actually never flown a plane. All right, so this is actually in 1.3.1 mode, and the exciting thing is to show you that... All right, if I can fly a little farther down... There we go.

You see that there's a choppiness, but if I switch over to 1.4 mode, it suddenly gets smooth, and then I go back...

[Transcript missing]

So, but both modes are running on the 1.4 VM. So there are actually benefits to the 1.3.1 mode that wouldn't even be gotten if you were running under a 1.3.1 VM.

Primarily, all the deep inlining and compiler optimizations that Jim talked about are available to it. So if we were running this on a 1.3.1 VM, even though I don't know how well we could since it needs to be able to run the 1.4 mode, it wouldn't be running even this fast.

What the 1.4 mode means is that the inner rendering loop is using NIO; in contrast, the 1.3.1 inner rendering loop is actually using the typical copy out to a native buffer, and then copying out from there. All right, I'm going to leave this alone if I can and just talk about it.

Let's see, is it in 1.3.1 mode? Let me put it into... Oh, no, that is 1.3.1 mode. That is 1.4 mode. There we go. So... The inner rendering loop needs to get the pixels over to the graphics card because this is actually built on top of GL for Java.

It reduces the amount of copying overhead by using Java new I/O. That's the main part of where we get the improvement in the drawing. The other part where we get the improvement is the fact that, by using all the deep inlining, we actually are able to have no virtual function calls inside of that tight rendering loop, which actually makes a huge difference. If you turn off the deep inlining, the performance of this is an order of magnitude worse. That is basically highlighting the fact that all of the technologies we've described today are going to be benefiting your application. That is that demonstration.

So in summary, what we've done is we've discussed the performance of your Java application. We've told you what improvements we're making for you in our changes to the Hotspot JIT compiler, and also what changes you might want to make to your application when writing in Java 1.4 that will improve your performance. What else? Who to contact? There's the contact information for our technologies evangelist, our product manager. That's Alan Samuel and Alan Dennison, Jim and myself.

Here's some more information. Where to find us. Where to find information about 1.4 directly from Sun. Also this great website that actually gives you a pretty good synopsis of all the other APIs that I didn't go into too much depth about, why you would be interested in using them.

We also have performance tips from last year's talk. Last year's talk we actually talked about how to profile your application, that sort of thing. We have handouts that might be useful to you, so please come and get those. Here's a few more information on how to find documentation, how to contact us. A lot of us read the Java dev mailing list. And how to get technical support, DTS at Apple.com.