Digital Media • 23:26
Tuning for Velocity Engine and MP
Speakers: Glenn Fisher, Bob Campbell, Ken Ryall
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Good afternoon ladies and gentlemen, and welcome to the session Tuning for Velocity Engine and MP. I'm Glenn Fisher. I run the performance marketing group in worldwide product marketing at Apple Computer, and as such I have been involved with Velocity Engine and MP technologies for the last several years. Delighted to have all of you here today. We have a great crew of presenters from Metrowerks, and I would like to thank them for the work they've done in preparing this session, and also Chris Cox from Adobe, who has been instrumental in helping us with some of the demos that you'll see today.
So, why are we here? Two key technologies for Apple going forward are Velocity Engine and MP. The processor technologies that we've been developing make those technologies available to our customers or will in the future, and we need to make sure that you're on board supporting those technologies to get the best performance out of your application and to take advantage of the performance that's there in the machine for our customers.
At the same time, we recognize that it isn't always easy to take advantage of these technologies, so we're looking for ways to make it as easy as possible, and we have been working closely with Motorola and Metrowerks over the last few years to help you take advantage of these powerful technologies.
For our information, how many of you have actually written Velocity Engine code, or AltiVec code? How many of you have played around with it? A few more, okay. And how many of you are here to find out how to get started? Great, so it's wonderful to have new recruits to the crowd.
So what we're covering today: how to debug Velocity Engine applications using Metrowerks CodeWarrior, and we'll give you some specific examples of tuning code for Velocity Engine using Metrowerks CodeWarrior. We'll also give you a brief introduction to CodeWarrior support for MP coding and debugging. And to do that, I'd like to bring up Bob Campbell, who's the lead compiler engineer at Metrowerks.
Bob? I have to admit that I don't actually write a lot of AltiVec code, and that's one of the reasons why we're very grateful to Chris Cox for providing us one of his examples. I look at a lot of AltiVec code, but my tendency is to critique other people's and not actually write it. Along that line, changing from scalar code to vector code requires that you put some thought into what you're doing.
And standard things that apply to scalar code still apply to Velocity Engine. So you should profile your code. You should make sure that if you're going to make something run faster, make something that you're actually spending time doing run faster. So find your hot spots and look at them and think about ways to rewrite them.
Another issue for AltiVec is that alignment is important. The engine will run much faster if you have 16-byte-aligned data than it will if you've got things misaligned; if your data is misaligned, you're going to have to waste cycles realigning it. Another thing is that because you're going to work on things in big chunks of 4, 8, or 16 elements, you can't really have an if statement that does something different on one element than you do on another element. So you need to take advantage of instructions like vec_sel (vector select) that allow you to basically do conditional assignments within straight-line code with no ifs.
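To make that select point concrete, here is a rough sketch, not code from the session; the function and variable names are made up, and it assumes 16-byte-aligned data with a count that's a multiple of 4:

    #include <altivec.h>   /* intrinsics header; on CodeWarrior the vector
                              extensions may be enabled by the compiler instead */

    /* Scalar version: a branch per element. */
    void clamp_scalar(float *data, long count, float limit)
    {
        long i;
        for (i = 0; i < count; i++) {
            if (data[i] > limit)
                data[i] = limit;
        }
    }

    /* Vector sketch: every load and store is an aligned full vector. */
    void clamp_vector(float *data, long count, float limit)
    {
        union { vector float v; float f[4]; } u;
        vector float vlimit;
        long i;

        u.f[0] = u.f[1] = u.f[2] = u.f[3] = limit;   /* splat the limit */
        vlimit = u.v;

        for (i = 0; i < count; i += 4) {
            vector float    v = vec_ld(0, &data[i]);     /* aligned load        */
            vector bool int m = vec_cmpgt(v, vlimit);    /* per-element compare */
            v = vec_sel(v, vlimit, m);      /* where v > limit, take limit      */
            vec_st(v, 0, &data[i]);                      /* aligned store       */
        }
    }

For this particular clamp, vec_min would do the job directly; the point of vec_sel is that it handles any conditional assignment without a branch.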
And probably a third point is that one of the most powerful features of AltiVec is the data streaming instructions. So if you're writing code that's going to go ripping through memory, then you ought to tell the CPU: hey, I'm going to be ripping through memory, and I'm going to be reading every so often. Tell it how much you're going to be reading and what your stride is. That way it will read the data into the cache before you even get there.
As the example we're going to show you today illustrates, you can almost plan on restructuring your code. Algorithms that work well scalarly may be significantly different from the algorithms that work well in vectors. The example that we have from Chris Cox is a rotate example, and the rotate example is interesting because you change the way you move through your data in order to speed up. And we'll actually talk about that when we run the example. Another point about AltiVec is that, you know, you see this great instruction and it'll do a sum across; it'll let you add four values together and produce a result.
You can't just go stick that in every time you need to do that one operation. You need to be doing more than just one vector instruction, because there's setup overhead, and you won't get back your setup overhead if you're only running one instruction. So it's not a good idea to write a bunch of little tiny functions that each call one single AltiVec instruction.
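A hedged sketch of what that means in practice (invented names, not session code): avoid the tiny wrapper on the left, and instead keep the whole loop in one place so the vector unit processes four elements per iteration.

    #include <altivec.h>

    /* Anti-pattern: a tiny function wrapping one vector instruction.
       The call, load, and store overhead around it eats the gain. */
    vector float add_one(vector float a, vector float b)
    {
        return vec_add(a, b);
    }

    /* Better: keep the loop inside one function so the setup cost is
       paid once and the vector unit chews through four floats per
       iteration. Assumes aligned pointers and a count that is a
       multiple of 4. */
    void add_arrays(const float *a, const float *b, float *out, long count)
    {
        long i;
        for (i = 0; i < count; i += 4) {
            vector float va = vec_ld(0, &a[i]);
            vector float vb = vec_ld(0, &b[i]);
            vec_st(vec_add(va, vb), 0, &out[i]);
        }
    }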
And kind of on that point, one of the things that we've discovered is that there's some bookkeeping regarding saving AltiVec registers. So one of the comments has been: if you're going to do a bunch of AltiVec code, there's a pragma you stick in at the beginning that says, I'm going to be using AltiVec code.
Generate all your code with all the function calls, and then turn it off at the end. That saves some of the intermediate bookkeeping regarding saving the registers, because it sort of like sets it up at the beginning and cleans it up at the end. And last is, you need to look for parallelism in your algorithms.
Places where you can do four things at a time, or eight things at a time, instead of iterating through one at a time. At that point, I want to bring up Richard Atwell, and we're going to run through a demo of some rotate code from Adobe. Oops. And I'm in charge of the monitors, so... So here, basically, Richard has launched the program in the debugger. Are we at the beginning here? Yeah, we're at the beginning. You want to run to the first breakpoint.
The part that's interesting is that the original rotate algorithm just went through rows at a time, writing out columns at a time. So it's a 90-degree rotate, so you're going along like this and you're writing down like this. Can you make the font bigger, Richard? The change for the AltiVec algorithm is, instead of thinking about working in rows and columns, to take the input and break it into tiles, where each tile is basically a 16-byte-by-16-byte square.
So what we're going to do is load a 16-byte-by-16-byte square into AltiVec registers, rotate it in the registers, and then write it back out where it goes. So instead of going along a word at a time, we're going to grab a chunk, rotate it, and write it back out. So actually Richard is here at the function which is going to do this.
I'm not really sure how in-depth we want to go into the instructions, but that's essentially what it does. You want to step through it? Oh, we were going to show off -- Richard wants to show off some debugger features. These are what he works on. So we've got a nice register window. You can look at all the AltiVec registers. You can step through code. You can watch the registers as they change, as it loads them.
So basically at this point he's going to do one more step, and it's going to have basically loaded this whole tile into registers, and I'm sitting here looking at Chris going, you should explain this stuff. But it's actually a pretty neat algorithm, because instead of doing what I would have done, which is individually move the elements within the vectors around to the right spots, he used a trick with the merge instruction to sort of merge partway, and then a second merge puts everything exactly where it wants to go.
And if you really want to know, are we going to make this code available? Okay, so we won't promise that yet. But it's a pretty nice example of thinking about the problem differently and looking at what AltiVec can do. Instead of saying, you know, move my elements individually, I move half my elements halfway to where I want them, and then the second set of merge instructions moves them the other half. Richard, going down to the merges... so basically that set of eight instructions causes everything to rotate. And I think Richard wanted to show a few debugger features.
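For readers who want to see the shape of that merge trick: the following is not Chris Cox's code (his works on 16-byte-by-16-byte tiles and does a full 90-degree rotate); it's a smaller, hedged illustration of the same two-pass merge idea, shown as a 4x4 transpose of 32-bit elements.

    #include <altivec.h>

    /* Transpose a 4x4 block of 32-bit elements held in four vector
       registers. Pass one merges rows halfway; pass two finishes the
       rearrangement. Eight merges total, no per-element shuffling. */
    void transpose_4x4(vector unsigned int row[4])
    {
        vector unsigned int t0, t1, t2, t3;

        /* Pass one: interleave rows 0/2 and 1/3. */
        t0 = vec_mergeh(row[0], row[2]);   /* a0 c0 a1 c1 */
        t1 = vec_mergeh(row[1], row[3]);   /* b0 d0 b1 d1 */
        t2 = vec_mergel(row[0], row[2]);   /* a2 c2 a3 c3 */
        t3 = vec_mergel(row[1], row[3]);   /* b2 d2 b3 d3 */

        /* Pass two: interleave the intermediates into columns. */
        row[0] = vec_mergeh(t0, t1);       /* a0 b0 c0 d0 */
        row[1] = vec_mergel(t0, t1);       /* a1 b1 c1 d1 */
        row[2] = vec_mergeh(t2, t3);       /* a2 b2 c2 d2 */
        row[3] = vec_mergel(t2, t3);       /* a3 b3 c3 d3 */
    }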
Can you hear me now? OK. So one of the problems that we had was how to represent the vector registers because they're so large. So what we decided to do is make use of the struct paradigm that we already have in the variable view. And you can take a look at all of the scalar elements that live within the vector elements. And you can modify these things individually in order to help you with your debugging.
So because we had the ability to do that, we also wondered what we can do for breakpoints. And we have a conditional breakpoint feature in the IDE, but because of the way Motorola specified the structs as being anonymous, you couldn't access the elements inside. We had to invent some syntax for you to get inside those things.
So if you look down the breakpoints window here, we have a conditional breakpoint, and we're going to set a breakpoint at line 32, and it's going to be on the source-one variable. If you take a look at the syntax, it's SRC1 dot followed by what looks like an array notation. This is just something that we borrowed; we're using this array-like notation to indicate the scalar elements inside each vector. So what we can do is we can change the PC and move it up here.
If I go run, it's going to hit the breakpoint. If I change it back again and change the condition, it should bypass the breakpoint, as it did. Any other debugger features? The vector register window is so huge, so we made it scrollable. And if you're short on real estate on your PowerBook or whatever -- we don't have a G4 PowerBook yet, but when we do -- you'll be able to grow the window and inspect register values and then shrink it back down, and it remembers the position you had it scrolled to before, so you can kind of manage the debugging a bit better and not let the debugger get in the way of your programming.
Okay, so could we quit the debugger and run the non-debug version? Sure. I want to show -- because basically Chris was kind enough to set us up with an example that runs the scalar code several different ways and then runs the AltiVec code. So if we could just run that, and of course it's going to build it first. The scalar code was what I was talking about earlier, where it goes through memory by row, writing out columns.
Then an additional one tried to do it the other way: writing out by rows, but reading by columns, sort of inverting the way you do it, to look at the differences in the way it walks through memory. So then, you know... and I can hardly read that.
No, it's not your fault. It's our terminal window. The first attempt up there took, what, 4.3 seconds. The second attempt took 5.3, 5.2, and the third one was... 5.1. Those were all scalar attempts of different ways of doing it. And the one thing I like about this is the fact that Chris very religiously has a base case, he writes a new case, he keeps the base case, and he always keeps comparing the times to make sure that we're actually getting better.
The fourth case is the vector case, which is the one that we stepped through in the debugger. And that one took, what does it say there, Richard? 2.7. 2.7 seconds? So not quite half. But the last one, which is actually the interesting one, was where he combines all the features. He added in the data stream touch instruction, where he sort of tells the CPU which memory he's going to read ahead. And that one runs at 1.6, which is, what did we get to? We got better than half, right? Almost a quarter.
Yeah, so it kind of shows you that there are two parts to the Velocity Engine. One part is the vector unit, but to get full performance out of it, especially for data streaming operations, you really need to take advantage of the cache hints and start telling it what you're going to read ahead of time.
[Transcript missing]
There.
So it's basically the data stream touch instruction, dst. It takes the pointer; it takes a cache pattern. You need to read through the manual a little bit. And what's the one at the end, the last parameter? Oh, that's right, this is stream one. I should know this. So anyway, that's what I was going to talk about. I was actually going to save stuff for Q&A. And I want to bring up -- push the right button, push the right thing -- bring up Ken, who's going to talk about MP debugging. Thanks, Bob.
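As a rough idea of what that looks like in source -- a hedged sketch only; the control-word macro, block sizes, and stream number here are illustrative and are not the parameters from Chris's code:

    #include <altivec.h>

    /* Build a dst control word: block size (in 16-byte vectors), block
       count, and the stride between blocks in bytes. */
    #define DST_CONTROL(size, count, stride) \
        ((((size) & 0x1F) << 24) | (((count) & 0xFF) << 16) | ((stride) & 0xFFFF))

    void process_rows(const unsigned char *src, long rowBytes, long rows)
    {
        long r;
        for (r = 0; r < rows; r++) {
            const unsigned char *row = src + r * rowBytes;

            /* Hint: stream in 8 blocks of 2 vectors (32 bytes) each,
               one rowBytes apart, on hardware stream 0. */
            vec_dst(row, DST_CONTROL(2, 8, rowBytes), 0);

            /* ... vector work on this row ... */
        }
        vec_dss(0);   /* stop the stream when done */
    }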
We've been doing a lot of work with the IDE and the debugger, trying to carbonize it and get the tools ready for the Pro 6 beta CD that we've got here. But we've also kept Richard really busy working on other Apple technologies, too: all of the AltiVec debugging support you saw, and then also support for the newer flavors of the MP debugging, or MP APIs. And we had some MP support in the debugger before, but Richard's really revved it and made it work with the newer APIs. There's a new MetroNub extension and a newer plug-in, and some new user interfaces done, too.
Let's see, we've already seen some of this in the demo, but just to recap: there's the vector register window for AltiVec. You saw how vector registers and variables can be shown as structs for easy viewing. And then there's the work we've done to support that in the expression parser, and also for conditional breakpoints.
For MP, we already had sort of a paradigm of having separate thread windows in the debugger. You can also view all the threads in one window with a pop-up menu if you want to look at them that way. And so we just use the same paradigm for MP tasks. We also list the MP tasks along with the cooperative threads inside the processes window.
So you can look at specific processes and see exactly which MP tasks and which cooperative threads they have. And then we've also tweaked the registers window interface a bit so that you can look at registers for separate MP tasks in separate register windows. And so now we'll go back to Richard for a demo of some of the things I just talked about.
There we go. OK. So if you were in the session this morning on Mac OS 9 and multitasking, George Warner was up on stage and he showed you this little demo called CloseView. We've decided to jettison support for the MP 1.4 API, too. So we've combined the extensions together, so there's only one MetroNub extension for debugging, and you don't have to swap this thing out to do MP stuff. And it makes it a lot easier. So I'm going to debug this thing.
So the program starts up, and the first thing I want to show you is the preferences in the... So we can see CloseView MP under the control of the debugger here. We've got the main thread, which is a cooperative thread, and it's suspended. And I'm just going to step through into the code to the point where we're just about to create the MP task to do the work. So can anyone read the text OK? OK, so I'll make it a bit bigger.
Here we go. So what we're going to do is step over this line, and it's going to create an MP task. And MP tasks are created in the runnable state. In the main blit routine that's going to draw into that window, I've got a breakpoint set. So what should happen is I step over this thing, we get another thread window, and we hit the breakpoint. So here we go. Let me just resize the window here, because we've got a little bug.
So under the control of the CodeWarrior debugger, we have the cooperative task that was drawing the interface, and we also have control of the MP task. These things are running completely independently of each other. So what I can do is step in the main task.
and go back to the MP task, and I can step independently in the two. Let's get the main task running again. We've stopped at WaitNextEvent, so I'm going to take the breakpoint off and hit run. So we've got the program running back here again, except what's happening is we've still got the breakpoint in the MP task, so we're not getting any drawing happening. But the cooperative task is free to run, and that's what's drawing the window and letting us drag it around. So I'm going to go back to the window here.
So I'm going to move this around a bit so you can see in the corner. The blitter is running in a loop, so I'm going to set the CodeWarrior debugger to stop every time it hits the breakpoint. And you notice that the way this thing works is it updates the display underneath the mouse. So as I move the mouse around, hitting resume, we're going to get the display updating.
So let's go back to the processes window. We can see that we've got a few tasks. We've got the cooperative task. We've got our MP task, which reads as stopped. And the task above it is called the death watch task; it's something that the OS creates, and it tears down all the MP tasks when the application exits. So let's go back to the breakpoints window here, and let's temporarily disable the breakpoint. Get the MP task running again. And that's updating. Let's set the breakpoint again. Next time we blit, we stop. So basically we've got MP 2.0 support for debugging in CodeWarrior.
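For anyone new to the Multiprocessing Services calls being debugged here, a minimal sketch of spawning a preemptive task looks roughly like this. This is not Apple's CloseView code; the task body, stack size, and helper function are made up for illustration.

    #include <Multiprocessing.h>

    /* The task entry point runs preemptively, independent of the
       cooperative (main) thread, which is why the debugger shows it in
       its own window. */
    static OSStatus BlitTask(void *parameter)
    {
        /* ... blit loop goes here ... */
        return noErr;
    }

    OSStatus StartBlitTask(MPTaskID *outTask, MPQueueID *outNotifyQueue)
    {
        OSStatus err;

        /* Real code would first check MPLibraryIsLoaded(). */

        /* Queue that receives a message when the task terminates. */
        err = MPCreateQueue(outNotifyQueue);
        if (err != noErr)
            return err;

        /* Tasks are created runnable, so this may start running at once. */
        err = MPCreateTask(BlitTask,          /* entry point              */
                           NULL,              /* parameter passed to task */
                           65536,             /* stack size in bytes      */
                           *outNotifyQueue,   /* termination notify queue */
                           NULL, NULL,        /* termination parameters   */
                           0,                 /* no special task options  */
                           outTask);
        return err;
    }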
And that's it. Thanks, Richard. And that's in beta 6 right now? Mostly. This stuff is a little bit newer, and it's going to be finished up next week and then put up for download. Yeah, so if you get the beta 6 tool CD that we have here at WWDC and then just check our website in about a week, you can get the same stuff that Richard's been demoing here. We'll have them outside the door after this session is over.
I'm sorry? We have the same architecture for showing multiple threads, whether it's Java or MP. But it's up to MetroNub to actually figure out, you know, what the threads are doing. Okay, that's it for our MP demo. So now here's Godfrey. Howdy. I want to thank our Metrowerks friends for coming and showing us this stuff. We left a lot of time for Q&A today so that we could field your questions and just run it kind of informally from that point.
So without further ado, we have three mics set up. People should queue up at the mics. And a little roadmap of some more sessions this afternoon. We have Apple's performance tools for Mac OS X, and tomorrow we'll have a debugging session in the main hall at 9 a.m. We have people queuing up. Could all of our presenters please step up to the stage?