Enterprise IT • 1:08:05
Learn the best tips and techniques to increase the performance of your Java application on Mac OS X. We cover making use of NIO in 1.4.x, using the java -X options, and how to get faster graphics performance.
Speakers: Jim Laskey, Victor Hernandez, Gerard Ziemski, Ken Russell
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
So welcome to the last Java session of the week. It's maximizing-- yeah. Maximizing Java performance for Mac OS X. My name is Victor Hernandez. This talk will be given by three of us, Jim Laskey and myself. We're from the Java runtime technologies team. And Gerard Ziemski, who is from the Java platform classes team. And we're going to be splitting the talk into three parts. But the goal of the overall talk is to give you a better understanding of why your Java application performs as it does on Mac OS X.
Jim is going to be talking about performance improvements that we've made, specifically targeting the G5 processor. Then I'll be talking about performance opportunities that have arrived with Java 1.4.1. And then Gerard will be talking about Java graphics performance on Mac OS X. And stay tuned specifically for that part, because there's a lot of great demos to be seen there. So we've got a lot of material, so let's get right to it. Here's Jim.
Thanks. So my part of the talk, I'm going to talk specifically about what changes were made to the Hotspot VM to target the G5, which was pretty exciting because we only got to see some of these new machines a few weeks ago and play around with some of the prototypes. First of all, I'll just give an overview of my section of the talk. I want to talk specifically about some of the details of the G5 to give you a sense of what sorts of things that we could actually exploit on the G5.
Then I'll do a little performance comparison between the G4 and the G5 to give you some kind of sense, as well as a benchmark can, of what kinds of improvements you might see in your application. And then I'll go into some detail on some of the changes that we made specifically to the Java VM interpreter and the JIT compiler. And then quickly at the end, I'll go through a couple of the changes that we made to the Hotspot runtime.
Okay, so what does the G5 mean to Java developers? Well, the main thing you should note is, I guess, or you should understand, or it should be obvious to you, that the G5 is going to make your application generally run faster, and you would expect that from a faster processor. It's about 40% faster than the highest-end G4 currently shipping, faster bus structure.
There's also been some architectural changes to the way the G5 processor works over the G4, which actually improves the performance of various types of operations, and very specifically, floating point. You'll find that floating point is faster than that 40%, or typically faster than that 40%, projected by just the change in gigahertz on the machine.
Now, we could have left the VM alone and not done anything to it, and you would have gotten a gain in performance running on the G5, but we like to tinker, and there's all these really cool instructions on the new processor that we wanted to take advantage of. Specifically, there's the introduction of 64-bit operations, and if you have any longs or long ints in your Java application, we now use a much simpler and quicker set of instructions to do those operations, and I'll go into some detail on what was actually done.
[Transcript missing]
And finally, the main thing that you can walk away from this session feeling is that you don't have to do anything to your application to gain these benefits, improvements in performance. We've modified the VM. As soon as you run your application on a G5, you're going to gain all the benefits of having 64-bit arithmetic, the faster processing yourself. So you have to make no changes to your application.
Now, just to start it off, I want to show you some comparisons of running some applications on the G4 versus the G5. I'll be using SciMark. We use several different benchmarks internally to test various things. We would normally have used SPECjvm, but there's a fair use policy which requires that you post the scores on a public forum before you can actually use them. And we're still working with prototypes, and we don't have our final values and whatnot.
So we chose to use SciMark, which is a fairly good benchmark, and it will give you a good sense of where we're going. The other thing about SciMark is that it's a scientific and engineering benchmark. People have often said, well, you know, the client VM is very slow when it comes to computation. Well, the SciMark score should give you a sense of where we're headed with computation.
SciMark can be found at the National Institute of Standards and Technology website. There's the URL. And if you go there, there's a whole list of current standings. They're fairly up to date. I think the most recent one is the May-June time frame. If you look at the list, you'll see us way down there somewhere, actually in 61st position. This was run by somebody back in the fall using the 1.3.1 version of the VM on a 1.2, sorry, a 1.25 GHz dual processor G4. So note the score there. The score is 78.253. That is what's called the composite score. SciMark is actually five separate tests, such as fast Fourier transform, sparse matrix multiplication, and Monte Carlo. The composite score is computed from those five results, and that's the score which is used to actually rank you in SciMark.
This graph shows the current high-end G4, which is a 1.42 GHz dual G4, against a 2 GHz dual G5. Focus on the first column, because that's the one the main score is based on. So currently, our score would be around 111.
Okay, composite score. And you can see each of the subtests there. I'm not sure whether there's a normalization of these. I haven't seen anybody actually hitting 100 on any of those, but that's basically what you would find currently. Now if you took a straight 40% increase in performance on each of those tests, this would be the projection that you would get.
Okay, so we did this to sort of get a sense of where we should be headed once we ran it on one of these G5s. Okay, so a score of 158. Again, focus on the composite score because the other one is going to vary a little bit. So, as I said, this would have been the best we would have expected.
Well, we were kind of surprised when we actually ran the test. And we got a 232, which is pretty significant. So it's more than just gigahertz. It's also the system itself and the changes that we've made to the VM. And here's an overlay of the projections just to show you where we're at. So our score is basically more than doubled on the G5.
So where does this put us? Well, if we were to do this today, this would put us about 12th position. And what's interesting is that this is up in the high end there with all the high IBM servers running 3 GHz, and we're running a client VM. So that gives you a sense of the power of what the G5 is, and also the potential that we could have as we make further improvements on the VM.
Okay, so let's go over some of the changes that took place in the interpreter and the JITC. We enhanced them with G5 instructions. We can do this because the interpreter, the JIT compiler's code generation, and also the runtime are all constructed on the fly when you launch the VM.
So this gives us an opportunity to choose which instructions that we want to apply. So if we ran on a G3 versus G4, we would choose different instructions. If we were running on a single processor or a double processor, we would choose different instructions. And now that we're running on a G5, we can actually choose 64-bit instructions.
So Java long int support now uses 64-bit operations, and I'll show some of the details of that. There's also been improvement in float and double support using some of the new floating point instructions. They're not actually new instructions, they're just instructions that are only made available because of the 64-bit support.
So let's talk a little bit about the details of what 64-bit means. In a G4, or G3 for that matter, everything that goes through the processor has to go through in 32 bits. That's because the data bus and the width of the registers is only 32 bit wide. So if you wanted to do an operation on a 64 bit long integer value, you would require two registers to deal with each of the operands and with the result.
In this case, if we had foo = x + y, foo would require two registers to hold the result. In this case, in this example, R3 and R4, we'd also need two registers for x and for y, so we'd need six registers just to perform a long add operation, or a long subtract operation.
In the G5 world, our registers are 64-bit wide. We can still treat them as though they're 32-bit wide, and some operations still deal with them as 32-bit wide, but the long integer operations, we can deal with them as full 64-bit. So in the previous example where we had foo equals x plus y, foo only needs, the result only needs one register.
And x and y only need one register each, so we cut down the number of registers that are needed for each operation, and that makes more available for other operations. So we get a general win by having more registers available for operations. So let's look at the specific operations that we can improve on.
So in your Java code you have an expression like long x = y. On a G4, this would actually require four steps to perform. We need two steps to load each half of the 64-bit value, the high 32 bits, then the low 32 bits. And then we need two steps to store those back out into memory. So in the 32-bit world, we almost always have to use at least two instructions where one would do. On the G5, we only have one instruction for each of those.
So one instruction for load, one instruction for store. This is also used for moving data. We have a 64-bit data bus, we can get better throughput through the system. So when we're doing memory copies, we're also getting performance boosts there. Let's look at some of the simple operations like add, subtract, and negate.
Again, because we only have 32-bit wide registers, we have to do everything in two steps on the G4. So in this case, if we want to add two long ints, we have to add the low halves of the two operands, bring the carry forward, and then add the two halves and add the carry in. So that would be the two steps that are highlighted here. I've thrown in the load operations as well to give you a sense of that. Well, it's not just the operation itself, it's also the things that go on around it.
So it takes eight instructions to perform that. On the G5, it only takes one instruction to do the add, and again, each of the loads only takes one instruction, the store only takes one instruction. So we've cut the number of instructions required in half, and you can think in terms of fewer instructions, faster code.
Now the more interesting things, and this has been the most trouble for us in implementing the Java VM, are dealing with longs and some of these more complex operations like multiply and divide and remainder and shifts and even comparisons. They can take many instructions, and a long int divide can literally take hundreds of instructions or hundreds of steps in order to complete the operation. Remainder takes a few more. Shifts can take eight. Comparisons can take up to 12. They're fairly expensive operations. Each of these has been reduced to a single operation, and I'll take multiply as the simplest example.
On the G4, to do a long int multiply, it takes six steps to do the cross multiply of the low and high parts of the operands. On the G5, it only takes one instruction. So you can see where this is going, is that if you have a lot of long int computation in your code, where it took many steps before, it's only going to take a few now.
Let's take a look at float. When I say float, I mean float and double. In the G5 implementation of the Java VM, we have taken advantage of some of the newer instructions that can convert longs to double, and doubles back to long, and same is true with float.
In the G4 implementation, it has to make a library call, which takes several hundreds of steps. So it speeds up the performance of casting or conversion of longs to doubles. There's also been some improvements in the float and double bit extraction routines, such as double to long bits.
These are used primarily when you're converting doubles to strings and back again. The most interesting of the changes is square root. On the G5, there's a built-in square root function. On the G4, the square root is implemented as a trig library routine, and it can take several steps. It's in the order of about 40 steps to complete.
So what I did was I took a little micro benchmark where I'm iterating through 100 million data points and applying a square root to each of the data points and producing a result. And just to make it interesting, I took a little bit more complex operation where I had 100 million xy points on a coordinate plane, and I wanted to compute the distance, so it's a little bit more complicated equation. And just to see how long it would take to do on each of the processors. The first processor is the G4 running at 1.42 GHz. So it takes about 12 seconds to do all those computations, and 13.5 for the distance formula.
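For a sense of shape only, here is a minimal sketch of that kind of distance loop; the class name, array names, and point count are made up for illustration, not the actual benchmark code.

    // Hypothetical sketch of the square root / distance micro benchmark described above.
    public class SqrtBench {
        public static void main(String[] args) {
            int n = 1000000;                     // stand-in for the 100 million points
            double[] x = new double[n];
            double[] y = new double[n];
            // ... fill x and y with data points ...
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                // Math.sqrt is an intrinsic; on the G5 it can compile down to the
                // hardware square root instruction instead of a library call.
                sum += Math.sqrt(x[i] * x[i] + y[i] * y[i]);
            }
            System.out.println(sum);
        }
    }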
Now if I was just to take a straight port over and use the library routine on the G5, it would be reduced to 7.7 seconds and 8.1 seconds. And this is actually better than the roughly 30% reduction in time you would project from the clock speed change alone. So the floating point processing is better on the G5, and you're going to get a better result.
On the G5, running with the square root instruction built into the code or inlined in the code, it only takes two seconds. So you've got, say, six times improvement in performance. This is a micro benchmark. It's just going to give you a sense of the increase in performance of the square root itself. So your actual example may take a little bit longer, but it gives you a sense of the magnitude of the improvement there.
Finally, I just want to quickly run through some of the changes to the runtime. In a 32-bit world, we have a little problem where two threads may want to share a long int value, say a static or a field in an object. And while they're writing to that long int, the upper and lower halves of those values might get slammed by one or the other thread, depending on how the thread switching is going on. To avoid that problem, you can annotate your field with the volatile keyword. And what that does is force the VM to coordinate how that field is accessed, and make sure that we don't run into that problem.
In the G4, we did a little fudge using a 64-bit double register, and using that as an atomic access, and copying it through some memory, and so on and so forth. So it took several steps in order to make that work. In the G5, 64-bit loads and stores are atomic, so you don't have that problem. So there's no overhead when you're dealing with volatile fields on the G5.
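As a quick illustration of what that annotation amounts to on the Java side, here is a small sketch with a hypothetical field; the incrementing method is just for context.

    class Stats {
        // On a 32-bit PowerPC the two halves of a long can be written
        // separately, so another thread could observe a torn value.
        // Marking the field volatile makes the VM coordinate the access;
        // on the G5 the 64-bit load/store is already atomic, so the
        // volatile long costs nothing extra.
        private volatile long byteCount;

        void add(long n)  { byteCount += n; }   // note: += itself is still not atomic
        long current()    { return byteCount; } // reads a whole 64-bit value
    }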
One of the problems that the G5 introduces is the fact that the hardware itself is a little bit more complex and has more stages when it's doing its computation. This is where it gets its speed. So when you're running on a dual processor, there needs to be some coordination on how memory is being accessed. In the G4 world, we use something called the sync instruction, and this allowed the two processors in the dual processor environment to sync up the data that's shared between the two processors.
But the problem with the sync instruction is that it somewhat freezes the state of the processors until they're both coordinated before it continues on. So there's a bit of an impact there, and sometimes it can be actually fairly serious. With the introduction of the G5, they brought in a new instruction called Lightweight Sync, which doesn't require as much handshaking between the processors to determine whether the data is in sync. And we use these when we're doing memory allocation, when two threads are trying to allocate memory at the same time, or when you're using a synchronization of an object.
And finally, the last major change that we made in the runtime to deal with the G5 is atomic long access. There's a class in sun.misc called AtomicLongCSImpl, which allows you to do atomic access of long values. And this is primarily used in the net operations, like when you're setting up sockets and so on and so forth.
In the G4, we had to actually use full Java synchronization, and we just used the Java implementation to provide the synchronization. So we lock out the access to that particular object field, and then we make the assignment and release it through normal synchronization. On the G5, we use lightweight load and reserve, which is an instruction that we use for the G5. It allows us to reserve access to that word, and it can be done fairly quickly.
So in summary, the Java 1.4.1 that ships with the G5s, once they start shipping, will automatically adapt to the G5 processor. And we're only going to be shipping one version of the VM from that point on. It's not one that runs on the G4 and one that runs on the G5.
It's one that runs on all platforms but adapts to the G5. And this is one of the great things about the Hotspot VM. You're going to get significant performance changes somewhat across the board, specifically in floating point. That's the main thing. If you're doing scientific computing, you're going to see bigger wins there. And then also with the long int arithmetic, if you're using it.
The main thing that I want to point out is that you don't have to make any changes to your own code. The VM does the adaptation for you. This is where you're one up on all the C and Objective-C programmers, because if they want to take advantage of the G5 processor, they're going to have to recompile their application, and they're going to have to ship a separate version of their application for the G5 and one for the G4. Java is automatically going to take advantage of that. Okay, and that's all I have to say, I guess.
Okay, Victor? There you go. So for those of you that don't think in terms of bits and instructions, we'll take it at a higher level now. My name is Victor Hernandez, in case you don't remember. And here we go. So basically what I'm going to be talking about is updates to Hotspot that have been made with Java 1.4.1.
Specifically, one of the features that we've added in being able to optimize your code, and that's specifically aggressive inlining. And also, one performance opportunity that you can take advantage of yourself in Java 1.4.1, which is the new IO APIs. And finally, I'm going to kind of wrap it up with a bunch of conclusions on tips that you can take advantage of to improve your hot methods.
Okay, so one of the performance bottlenecks that has plagued Java, well, I don't know if plagued, but that Java has encountered since the early days, is the fact that there's a large cost in the overhead of actually invoking a Java method. So the way we minimize that cost is by dynamically inlining the method calls made by your method when we compile your method.
What is inlining? That should be pretty straightforward, but I'll give a quick example. Here you've got average and sum, average call sum. Of course, this could avoid the call to sum if it just simply did the A plus B itself, but of course you don't want to do that in your code because that limits the reusability of that method. Good thing is that we're able to do that for you. You don't need to change your code, we just do it for you on the fly.
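In code, that example looks something like this; it's a sketch with made-up names, not the actual slide code.

    class MathUtil {
        static int sum(int a, int b) {
            return a + b;
        }

        static double average(int a, int b) {
            // HotSpot can inline sum() into average() when it compiles this
            // method, so the generated code is equivalent to (a + b) / 2.0
            // with no call overhead, and sum() stays reusable in the source.
            return sum(a, b) / 2.0;
        }
    }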
In 1.3.1, there was limited ability to do inlining. We were able to inline your accessor methods to your fields. We were able to inline your call to create new instances of your objects. And we were able to inline certain intrinsics. Intrinsics being methods that we actually don't need to look at the bytecodes to know what it's supposed to do.
We actually know what it's supposed to do and actually have a finely tuned implementation of it. For example, sine, cosine, also the identity function. And then also, but one of the main issues with inlining in Java 1.3.1 was the fact that we were actually not able to inline virtual methods.
Why are virtual methods difficult to inline? Well, the reason they're difficult to inline is because there could be multiple implementations of that method, so when you actually get to the invocation, we don't know which implementation to inline. So how do we go about inlining those virtual methods? We do that with a technique called class hierarchy analysis.
The goal of class hierarchy analysis is to determine if a method is monomorphic. And a method is monomorphic if there is only one implementation of that particular virtual method that has actually been loaded. If we know there's only one that has been loaded, and you go to call it, that's got to be the one.
And Hotspot in 1.4.1 attempts to aggressively inline all monomorphic methods. That's the main feature we've added beyond 1.3.1. So what are the benefits of this? Well, clearly the fact that now we can actually inline virtual methods. There are certain situations where those methods don't get inlined, and even in that case, we can avoid the virtual table lookup when invoking the method, because we know that there's only one entry in the virtual table.
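A rough sketch of what monomorphic means in practice, using hypothetical classes that are not from the talk:

    abstract class Shape {
        abstract double area();
    }

    class Circle extends Shape {
        double r;
        Circle(double r) { this.r = r; }
        double area() { return Math.PI * r * r; }
    }

    class Report {
        // As long as Circle is the only loaded subclass of Shape, class
        // hierarchy analysis sees shapes[i].area() as monomorphic and the
        // VM can inline it here. If a second subclass gets loaded later,
        // this compiled code has to fall back to a real virtual call.
        static double totalArea(Shape[] shapes) {
            double total = 0.0;
            for (int i = 0; i < shapes.length; i++) {
                total += shapes[i].area();
            }
            return total;
        }
    }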
This also provides us the ability to do a faster implementation of certain bytecodes, because the class hierarchy analysis has a data structure which actually tells us the full hierarchy information of all the classes that have been loaded. So when you're doing things like instance of or check cast, which are bytecodes used when casting your objects between various classes, we can actually use that data structure, and it actually performs a lot faster.
OK, so what is another performance issue that has affected Java in the past? Well, this one actually has two parts to it. One is the fact that if you ever wanted to operate on native data structures from your Java methods, you actually had to have them residing in the Java heap.
Why would you actually need to have native data structures in your Java heap? Well, if you ever want to interact with any system APIs, you actually need to have those data structures to pass down once you drop down into native methods. That adds the other heavy cost, which is the fact that those JNI transitions to do those method calls, the native method calls, are quite expensive.
I mean, in the previous section where I was talking about the inlining, we're trying to minimize the amount of method calls. And those method calls are even pretty quick compared to these JNI transitions. Not only that, but these JNI transitions definitely cannot be inlined at all because we're crossing ABIs, and we don't totally control a lot of the issues between calling from Java to C. But those JNI transitions are still necessary as of Java 1.3.1.
The other thing to keep in mind is that once you have all those native data structures in the Java heap, they actually need to be copied around during garbage collection. And yet they don't contain any actual Java pointers, which is what the garbage collection algorithm needs to keep track of.
So what is our approach to actually improving this bottleneck? We want to remove this JNI dependency altogether by giving you the ability to actually access that native memory from your Java methods. You might be familiar with this. This is basically the new I/O APIs that were provided in 1.4.1. They're available in the java.nio package.
And there's basically a buffer class for every single one of the Java scalar types, including the byte type itself. All allocation actually happens at the byte level, but you can view that as an int buffer or a long buffer and actually operate at the Java type level.
And one of the things you need to keep in mind here is that even though the goal is to have direct access to native buffers that are not located inside the Java heap, you can actually trick yourself into still having a copy residing in the Java heap, accessing that, and having it copied outside of the heap. Even though that might perform better than before, since you don't have to drop down into a JNI native method to do it, it's still added overhead, and you need to be careful about that.
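For reference, the distinction looks like this in code; the sizes are arbitrary, and this is just a sketch of the two allocation calls.

    import java.nio.ByteBuffer;

    class Buffers {
        static void allocate() {
            // A direct buffer lives outside the Java heap, so native code can
            // read and write it without an extra copy.
            ByteBuffer direct = ByteBuffer.allocateDirect(4096);

            // A heap buffer is backed by an ordinary byte[]; handing it to the
            // OS generally means its contents get copied out of the heap first.
            ByteBuffer heap = ByteBuffer.allocate(4096);
        }
    }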
There are a few other issues I want to bring up. This is a pretty straightforward code example that shows an allocation of a byte buffer of size 400, and you're basically zero filling it with a for loop. One of the things you need to be aware of right here is that that for loop is not as optimal as it can be, because the call to the put method does not get inlined, because it's not determined to be monomorphic. This is a caveat of the actual class hierarchy of the java.nio package, and it affects all your calls to get and put. In the case of byte buffer, if you're doing something like this, there is actually one way you can get around it.
And that's just simply by using a MappedByteBuffer. The MappedByteBuffer get and put methods are actually determined to be monomorphic, and they do get inlined. But you need to keep that in mind. And this is something that we're going to be tracking in the future to see if it can be improved.
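A small sketch of that workaround, assuming the data lives in a file; the file name and the 400-byte size are placeholders.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    class ZeroFill {
        static void zeroFirst400(String path) throws IOException {
            RandomAccessFile file = new RandomAccessFile(path, "rw");
            FileChannel channel = file.getChannel();
            // map() hands back a MappedByteBuffer, whose get and put are
            // treated as monomorphic, so this loop can be inlined.
            MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, 400);
            for (int i = 0; i < 400; i++) {
                buf.put(i, (byte) 0);
            }
            channel.close();
            file.close();
        }
    }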
So how do you actually do high level I/O with the new I/O? That's using channels, found in the java.nio.channels package. And the main thing it provides beyond what was available in traditional Java I/O as of 1.3.1 is the ability to do non-blocking and interruptible operations. No longer is there any need to have one thread per socket. That's a thing of the past.
The other thing it provides is improved file system support. It gives you a lot more of the system level primitives that you would come to expect from a robust operating system, things like file locking and also memory mapped files. Just like in the case where you needed to make sure you had a direct buffer sitting behind your byte buffer, this is another example where you not only have access to direct memory, you are actually accessing the memory mapped file itself.
So let me go into a little more detail about the socket channel. I don't want this to be a-- I'm not going to go into enough depth for this to be a tutorial on this sort of thing, but I do want to bring up a few issues that tutorials might miss on occasion.
This is an example of how to create a server socket channel and bind it to a particular address for it to be listening on. One of the things is that, by default, it is not set to be non-blocking. So you actually have to do that by calling configure blocking and passing it a value of false. You can-- it's a pretty straightforward thing, but it can be missed. And it definitely makes a huge difference.
And then how do you actually communicate with your clients using this model? You use the selector model, which you might be familiar with from programming patterns. You can see in the code right here, basically what you're doing is registering for a particular key, and then once you've done that, you can iterate over all of your clients, who communicate to you via keys and who pass you the new channel you'll be communicating with them on, in a big while loop if you want, by iterating over all of the keys. And that way you're abstracting away all of the different sockets that you're actually talking to, instead of doing the traditional thing of having to block until your client talks back to you along that particular socket.
One of the things to keep in mind is that the socket channel that is actually returned each of the times you access one of these keys is different than the one you had originally. So if you want to continue the non-blocking I/O, you actually need to state that you're doing non-blocking I/O once again with the configure blocking set to false.
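Putting those pieces together, here is a minimal sketch of the non-blocking server pattern being described; the port number and the buffer handling are placeholders, not the actual slide code.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    class NonBlockingServer {
        static void serve() throws IOException {
            ServerSocketChannel server = ServerSocketChannel.open();
            server.socket().bind(new InetSocketAddress(8000));
            server.configureBlocking(false);              // easy to forget, and essential

            Selector selector = Selector.open();
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buf = ByteBuffer.allocateDirect(8192);
            while (true) {
                selector.select();                        // wait until something is ready
                Iterator it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = (SelectionKey) it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);  // the new channel blocks by default
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        buf.clear();
                        client.read(buf);                 // handle the data here
                    }
                }
            }
        }
    }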
Okay, so what do you need to keep in mind when using NIO? Well, it's definitely not free. The cost of allocating those native buffers is definitely much larger than allocating Java arrays. It's pretty hard to match the performance of allocating Java arrays, because we actually do a very good job of doing that as quickly as possible.
The other thing to keep in mind is that the get-put methods of the native buffers are not inlined. You can use the trick to get at least that fixed for a few cases, but there's nothing you can do for int buffer and for some of the other scalar buffer types.
But the gains definitely outweigh the cost in the cases I've been talking about, where you have heavy use of system APIs with native data structures. One of the good examples of actually taking advantage of that win is the re-architecture of the AWT done by our team for Java 1.4.1. We actually took advantage of the new I/O API to talk to Core Graphics and minimize the number of JNI transitions.
Basically, we told the classes team, try to minimize JNI transitions as much as possible, and they did that as much as they could. You definitely see the performance improvement there, and we're hoping that the shared classes on the whole will be seeing more use of new I/O wherever that can be done in the future. The other thing is that clearly, if you have server I/O with multiple clients, you definitely want to be using this, because the overhead is definitely costly.
So, what can you take away from all of this? Well, the main thing you need to keep in mind with the server I/O is just simply use it in those cases. And with what I told you about inlining, what you need to do actually is maximize the opportunities where we can inline your methods. This is mainly important in your hot methods.
When you do a profile, you want to figure out what the methods you're mainly calling are, and make sure that for the hottest method, all of the things it's calling are hopefully being inlined. This can only be done at a high level; there are actually no flags to notify you when your methods aren't being inlined, that sort of thing.
But the general rules of thumb are: if all those methods are small, that helps, because we have a certain limit at which point we bail on any further inlining in the method we're trying to compile. Feel free to use accessor methods; those have been inlined since Java 1.3.1. Also, there's no need to use the final qualifier on your methods. That's superfluous for performance tuning. It's definitely not superfluous for object oriented programming, but we don't particularly get any benefit out of it.
And also, keep in mind that a lot of the JDK methods do get inlined, so you can keep that in mind if that's a lot of what your hot methods are doing. There are a few things that we're still unable to inline, and you've got to keep that in mind.
Mainly synchronized methods, obviously large methods, and if you have an exception handler in your method, that can cause it not to be inlined. So keep that in mind. The last tips I want to leave you with are ones I always like to reiterate, which are things that still live on from the days of Java 1. Avoid object pools. There's absolutely no need for them in modern Java.
Our new is completely fast. It's also inlined. We also have thread local allocation. So now there's minimized contention between multiple threads allocating in the Java heap at the same time. And we also, I mean, we have precise garbage collection. So let us do the work for you in terms of figuring out when an object needs to go away. You don't need to take care of it in terms of the object pool and all that.
And also, avoid programming by exception. There definitely are situations where you want to program by exceptions. There's the case where you want to go down a tree and then jump all the way back up to a certain branch in the tree. Sure. But Hotspot is definitely not optimized to compile those cases as well. For example, it can cause inlining to be prevented. Also, the actual creation of the exception is expensive, but that creation only happens if the exception is actually thrown. So, I hope that gave you some tips for your application, and now I'll bring up Gerard.
Hello, welcome. My name is Gerard Ziemski. I'm an engineer on Java Classes team. And I'll be talking to you about graphics performance. First, I'll give you a short introduction of the state of Java graphics on Mac OS X. Then I'll give you a few actual tips and techniques on what you can do to your application to make your Java app run faster. And finally, we'll have some cool demos to show you.
In Java 1.3.1, one really interesting thing that we did was a Java 2D hardware accelerated implementation that sat on top of OpenGL. That was a really terrific implementation. It was fast. It was incredibly fast. However, the problem with it was that when it worked, it worked, and it worked only 90% of the time. And getting the rest, that 10%, was really difficult for us. We were making strides, we were continuing, and we were making progress. However, we really could not nail down the correctness. So when we moved to Java 1.4.1, we completely re-architected our code.
We moved from Carbon to Cocoa. And the lessons that we learned in 1.3.1 was, first of all, if we cannot do hardware acceleration, we need to have something we can fall back on. And that is something called a software renderer. So when we moved to 1.4.1, we decided, let's start, let's nail it down. Let's have terrific software implementation as far as correctness is concerned. Then, in the future, once we have that done, we'll be looking for new technologies emerging right here within Apple.
And then we'll evaluate them. And then we'll see which one works for us the best. And then we'll adopt that technology. So we are right now, at this point, we are still at the transition point where we're in 1.4.1. We have brand new code. There is not even one line of code that we share with 1.3.1. It's brand new. Everything is written from scratch. But we want to nail the correctness first. And, of course, we are keeping our eyes open on what is going on around us and what technologies we can use later.
So in 1.4.1, the Java update that you guys have access to, first of all, our priority was correctness. Second, we also didn't really want to neglect the Java graphics optimizations. As you all know, 1.4.1 is not a speed demon as far as graphics is concerned. So we worked on very basic architectural optimization techniques that we could put in there, and right now we came up with three of them. That's lazy drawing, lazy pixel conversion, and lazy state management.
What lazy drawing is about is we simply collect all your primitives that you want to draw. We put them aside in a queue in a cache. And when the time comes to draw them to the screen or into your image, it's only then when we transition from Java, we go to the native, then we process that queue.
The good thing about this lazy drawing implementation that we have right now is that it's future compatible. Whatever technology we choose to use next, this lazy drawing implementation will work with it. And we work with Core Graphics guys, and we make sure that whatever we do with our lazy drawing optimization will not break them in any way. Second, lazy pixel conversion.
There are certain image types that Java provides access to that are not supported natively. What that means is that if we want to do something with such an image, the pixels are in a format that is not understood natively, and we have to convert them. If we didn't do this optimization, drawing of images or drawing into images would be terribly, terribly slow. So lazy pixel conversion is simply a technique of converting the pixels only when it's necessary.
And then, thirdly, lazy state management. A graphics context has multiple different states that you can set: transformations, color. What this optimization technique does is simply let us set only those states that have actually changed. We are not quite done with this optimization; we are only part way. So unfortunately, at this point, whenever you change most of the graphics states, we have to slam all the other ones as well at that time. That is terribly inefficient, but we are working on it.
So here is one benchmark, a micro benchmark, to show you. The scores show basically the performance of the lazy drawing optimization. So what you had in your hands with our initial 1.4.1 release was the base of 100, and what you have right now is a score of 175, which is a 75% increase. That's not too bad. We're not done with this by any means. And second, Robocode.
That's real world application. There's interesting story behind this. At the time when we were working on 1.3.1, Robocode was running pretty darn slow. And we went to the developer, and I think we made a mistake, because we told him, look, the image format that you're using is not fast with our current implementation in 1.3.1.
Why don't you use the image format that we support natively, and we'll speed up your application? Well, they listened to us. They changed it, and yes, they saw the performance improvement in 1.3.1. However, when we moved to 1.4.1, underneath we used a different implementation, different techniques, different technology. And the Robocode score plummeted. The problem was that the image format was hard coded.
So there were two things that went wrong in Robocode. First of all, our lazy pixel conversion was not very efficient for that image format, and we fixed that. We're getting much closer to the full frame rates that I've seen before. And if you remember, on the 1.4.1 release we were getting about four frames a second.
And that's because we were doing all that conversion right on the fly. And now, if I just start this battle up, we should be getting something closer to around 30 or so frames per second. We're getting about 32 and we may even get above there. Right now it's hard locked to 30.
So if I go up to maximum, we might even break 30 and get some more. But we're right around 30. And you can see here that we're not even using both of the processors. And there's something else going on here that you don't actually know about, which is that we're running one of the really cool demos in the background, one that you'll see later in the session, on this same machine. So it's actually got a lot of extra time. I could start my word processor and we'd still keep our 30 frames a second. And I also like to do this. Let's restart it again and turn on my favorite option, which is showing where the robots are scanning.
[Transcript missing]
And the future. First of all, we have lazy state management to finish. This should give us pretty nice improvement. Then there's more optimization that we can do. We have already tried implementing some of our lazy pixel conversion filters using the multiprocessors that are now available in many of our computers.
And we're not done with it. We're just testing, playing around, seeing how much improvement that can give us. But that's one of the things we're looking at. Also, AltiVec optimizations. So there are still quite a few technologies that we can use to make Java graphics go faster.
And then we are talking to Sun's engineers. When we were working on our lazy drawing optimization, we actually went to them. We went to the Java 2D graphics engineers, and we told them, listen, guys, this is what we're thinking of doing. What do you think? Is it going to work? Is it not going to work? Do you have any other cool ideas? And they loved our ideas, and they said, yeah, go for it, we don't even have that. So we've done something that they wished they had. So we are definitely doing some interesting things.
[Transcript missing]
They're looking into OpenGL. They're looking into a 2D API that sits on top of OpenGL. That's very similar to the hardware acceleration that we had in 1.3.1, but better, because we would not be the only client of it. We wouldn't have to support it. It would be system-wide, and if they come through, if they get that working, then that's definitely something that we would like our code to work with and take advantage of. So we'll be looking very closely at what the Core Graphics guys will be doing, and certainly taking advantage of any core technologies that they have to offer.
If there's one thing that I would like you guys to take out of this session, it's what to do about images. If you have to draw into a buffered image, how do you determine what is the correct, what is the fastest image type that you should use? This is the most important thing: please do not hard code particular image types. That's an example of code that would be hard coding it. What you can do is ask the system for a compatible image. Now, if you do this, then no matter what technology we use in the future, you're guaranteed that you will be given a buffered image type that will be the fastest on our platform. That is very, very important. And one more point here: if you have the choice, ask for a volatile image. Those will hopefully be hardware accelerated soon. So if you have the choice, get a volatile image.
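In code, asking the system looks roughly like this; using the default screen configuration and passing a width and height are just for illustration.

    import java.awt.GraphicsConfiguration;
    import java.awt.GraphicsEnvironment;
    import java.awt.image.BufferedImage;
    import java.awt.image.VolatileImage;

    class Images {
        static GraphicsConfiguration defaultConfig() {
            return GraphicsEnvironment.getLocalGraphicsEnvironment()
                    .getDefaultScreenDevice()
                    .getDefaultConfiguration();
        }

        static BufferedImage makeCompatibleImage(int w, int h) {
            // Whatever the fastest format is on this machine, this is it;
            // nothing is hard coded.
            return defaultConfig().createCompatibleImage(w, h);
        }

        static VolatileImage makeVolatileImage(int w, int h) {
            // Prefer this when you have the choice.
            return defaultConfig().createCompatibleVolatileImage(w, h);
        }
    }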
Now, there's one misconception among some of you with respect to indexed color formats. On other platforms, they're very fast and they also conserve memory. So using an indexed format is a way of compressing the pixel data, using less memory. Unfortunately, on Mac OS X, they're not supported natively. So what we have to do internally to support that image format is to create a brand new buffer just to have those pixels converted into a format that we can understand natively, and then that's the way we can use them. So indexed color format images on Mac OS X do not use less memory. On the contrary, they use more, and they're slower.
If you have to use them, you have no choice. But if you don't have to use them, use other image formats. And it's very easy, very often, for you to just try it. You don't need to do a lot; just change the buffered image format type. And second, and this is very important, the most optimal image format is not fixed. It can change. It can vary from machine to machine.
If we were to move again to a technology that uses OpenGL, that is very dependent on a video graphics card you have in your system. Also, it is very important then to us to see what is the resolution of the monitor you're running, what is the depth of screen you're running on. So there will not be one and only one image format that is the best, the fastest. It will change. And you need to keep that in mind if you're writing for Mac OS X.
Now, if you really need to know what the natively supported image formats are at this particular time, and this may not hold even in the next few months, this may change, but at this point, only four image types are supported natively. Those are the fastest. And those are the image types into which you can draw, which means they're the destination: you can create a graphics context from the buffered image with getGraphics. So only those four are natively supported. Those are the fastest at this point.
If you need to draw an image somewhere else, meaning the image you have, the pixels, are the source, then the natively supported image formats are a superset of the destination ones. We have one more image format that we can support natively as a source, and that is ARGB with alpha non-premultiplied. That is, by the way, the image format that Robocode uses.
And we have added a special optimization in our lazy pixel conversion that actually allows us to know whether the pixels are in native format or in Java format. Based on that, we have two different CGImageRefs, and we can switch very quickly between the two of them and choose the pixels that are up to date.
And here are some techniques for the rendering. This is important on our platform as well. One thing we have missing, one of the technologies that we had in 1.3.1, allowed you to draw very...
[Transcript missing]
Now, this is for those of you who really need the fastest access to the image pixels, for whatever reason.
If you're writing an image manipulation program, something like Photoshop, then, unfortunately for you, there's no way to determine whether a certain image type is supported natively or not, and it may change in the future; it's only constant at this point. However, if for some reason you need to do that, and you know the image format is supported natively, what you can do is grab the pixels directly from the data buffer.
On a non-natively supported image, you do not want to access the pixels directly. If you do, we have to turn the lazy pixel conversion optimization off, because the second you touch the pixels, those pixels are yours. You have access to them, and we do not know when you look at them or when you use them, so we have to do the conversion from native to Java on every single operation.
So you do not want to touch pixels directly on a non-natively supported image. Go through the graphics object and draw to it that way. Now, other optimization tips and techniques. These come from my old work on a DNA sequencing application, and when I was trying to optimize it, here are a few things that I found helped that application.
First of all, avoid creating new objects in your paint method. This is obvious, and it applies to all platforms. Don't create new fonts. Don't create new rectangles; if you need them, say for determining the clip, create them once and manipulate them later on. Don't create any objects in the paint method. Use simple primitives instead of shapes.
Our lazy drawing optimization actually attempts to do that automatically for you. However, if you have the choice on Mac OS X, it is faster to draw the primitives directly using fillRect with x, y, width, height, as opposed to creating a Rectangle2D object.
Use a polyline instead of drawing lines one at a time. It simply avoids the crossing from Java to native, and it's simply faster with the current Core Graphics implementation, because we can build one complex path if you have a polyline. Otherwise, we have to draw the lines one at a time, and Core Graphics is not terribly good at that.
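Both of those tips together, in a small sketch; the coordinates are hypothetical and the method is just an illustration of the calls, not code from the session.

    import java.awt.Graphics;

    class Painting {
        static void paintFast(Graphics g, int[] xPoints, int[] yPoints) {
            // Primitive call instead of building a Rectangle2D and drawing it.
            g.fillRect(10, 10, 100, 50);

            // One polyline call instead of a drawLine per segment:
            // a single Java-to-native crossing, and one Core Graphics path.
            g.drawPolyline(xPoints, yPoints, xPoints.length);
        }
    }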
This probably will not apply to most of you, but if you have a limited alphabet and you know you will not be drawing complex characters, so if you're writing a text editor kind of application, this will not apply. However, if you have a limited alphabet, say four letters, then maybe you can do this.
And the optimization is you can use bytes, not chars. Chars are 16-bit, and we do not know whether a char could be a Unicode character or not. If it is, then we have to go through the more complex path to draw Unicode characters. If it's a byte, then we know it falls within the ASCII range, and then we can bypass some of the complex text drawing routines and go straight to Core Graphics to blit those characters. Use double buffering for the static portions of your application. That will apply to all platforms.
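To illustrate the bytes-instead-of-chars tip, here is a sketch assuming the text really is plain ASCII, such as a four-letter DNA alphabet; the method and parameter names are made up.

    import java.awt.Graphics;

    class SequenceView {
        static void drawSequence(Graphics g, byte[] letters, int x, int y) {
            // drawBytes tells the renderer the characters are single-byte,
            // so it can skip the more complex Unicode text path.
            g.drawBytes(letters, 0, letters.length, x, y);
        }
    }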
And we have added, with this release, we have added tons of runtime options for you guys to play around, to turn them on and off. You can turn off the optimizations that we provided for you guys. You can turn on and off rendering of lines or rectangles or shapes. You can use all of those runtime options to narrow down and to find out what is the problem with your application if you have one. So now for the demo, I'd like to welcome Ken Russell from Sun Microsystem.
A couple of weeks ago at Java 1, Sun announced the new Java Gaming Initiative. And one of the products of this initiative is a new OpenGL binding for the Java platform called JOGL. And JOGL is open source, and you can download the source code right now on java.net.
So just go to java.net, search for the project name, and you can get it. And thanks to Gerard and a couple of all-nighters, JOGL is now running on OS X. It's running on the developer preview that you've got with your 10.3 CDs, and it's not going to run on any earlier versions of Java for OS X, so keep that in mind.
But going forward, it will work, and it will be fast and robust. So we've got a couple of very cool demos to show you. This one is very special. This is Doby the dog, and Doby was developed by the Synthetic Characters Group at the MIT Media Lab. And Doby is completely autonomous.
He perceives his environment. He has his own internal motivations and desires. And you can actually train Doby in the same way that you would train a real dog. You can sort of lure him around and show him new motions to do. You can reward him by giving him a little click with a clicker.
You can, in some sense, scold him by ignoring him when he does a behavior that you don't like. And basically, Doby represents, or at least it's safe to say that Doby is, pretty much the state of the art in interactive animated characters that can learn. And you can read more about Doby in the paper on him in SIGGRAPH 2002. Now, Doby, it turns out, is written almost entirely in the Java programming language, with a little bit of native code around the outside to get the custom device inputs.
He uses some of the more advanced OpenGL techniques, like vertex shaders, to do the shadow that you see here, and the cartoon-like shading around the edges of the dog. This demo runs at over 50 frames a second on a dual processor G4. And I should mention that the synthetic characters group is a big OS X development house, and so they do all of their development at this point with Java on OS X. This is the first demo that they've actually had to slow down, because it was running too fast. So it's actually slowed down to 30 frames a second, and because the G4s are so fast, the G5s will be even better.
So we're not actually going to train Doby right now, he's just going through his paces, but you can sort of see what's going on. There's skinning going on, there's action selection, and this is running on top of the JOGL binding for OS X. And also notice the CPU usage in the bottom left corner. There's almost nothing going on there; everything goes through the graphics card. Yep. So, cool stuff. Okay, so now here's another demonstration. This is a demo by NVIDIA Corporation that we've ported from C to Java.
Okay, now this is not real-time ray tracing. This is using a couple of tricks to get hardware acceleration for this technique of rendering glass with prismatic effects. Many of you, I'm sure, are familiar with the technique of ray tracing, where you send a ray of light out from the camera into the scene. And that is in fact being done at every vertex on this wireframe model, but the trick is that it's being done on the graphics card by what's called a vertex shader, or a vertex program.
This is a tiny little assembly language program that is actually uploaded to the graphics card when the demo starts up. It tells the card, okay, we're going to take the camera's position and the vertex's position and the surface normal and figure out where the reflected ray should go, and where the refracted ray should go as it passes through the object. And basically it looks up in the surrounding environment, this street scene, the right texture coordinate for where the ray intersects the world.
And basically what it's doing is distorting the background texture in such a way on a per vertex basis that it looks like the thing's made out of glass. So it's not doing it at every pixel, it's doing it at every vertex, but it's close enough that it's really indistinguishable.
Another cool trick here, and you'll notice that I just turned off the fringe effects: what's going on is that we're rendering the scene three different times, each with a slightly different refractive index for the glass. And that makes the refracted ray go to a slightly different position in the surrounding environment each time.
Then those three things are added together, again, on the graphics card, and you get the -- what basically looks like a prism. So, I'd like to point out that the same binary for this demonstration runs on OS X, it runs on Linux, and it runs on Windows. And it runs at 100% of the speed of the analogous C++ code. Remember, this was a port, not a new demo. So, basically, we are here with respect to OpenGL performance in Java, and it's running on OS X. It looks great. So, go out and develop cool stuff.
We don't have Java 3D for you guys yet, but if you really need to use 3D graphics, then please use JOGL. Some of you might be familiar with GL4Java, which is a very similar technology, also an OpenGL binding. GL4Java doesn't support the latest OpenGL standard and doesn't give you access to pixel shaders. JOGL does. So if you want to have 3D graphics on Mac OS X using Java, you can.