How To Translate Altivec to SSE - WWDC 2005

Application Technologies • 38:13

By using Tiger's Accelerate framework (the high performance math library), your application can access the power of Mac hardware without having to vectorize your code. This hands-on session will walk you through different scenarios using the Accelerate framework, including BLAS/LAPACK, vForce, vDSP and vImage. This is a must-attend session if you're doing anything with math.

Speakers: Eric Postpischil, Ian Ollmann

Unlisted on Apple Developer site

Check out Bezel, our iPhone mirroring app →

Transcript

This transcript was generated using Whisper, it has known transcription errors. We are working on an improved version.

What you'll learn in this session is what the Accelerate framework is about. Some of you already are using it, but for people who are not using it, this would be a good introduction. How to use the Accelerate framework? That would be Eric's section. And how to translate Altivec to SSC? We have a small section for hardware overview. We have the instruction overview of the SSC and the translation techniques of how to translate your Altivec to SSC.

First of all, I think the best thing to do for performance on Mac OS X on any platform for people who are hungry for that sort of thing for computational performance is to use the Accelerate framework. And what is inside the Accelerate framework? We have a state-of-the-art image processing library. We have vector mathematics, which is vForce and vMathlib.

We also have a state-of-the-art signal processing library. Some of you might be using our FFTs already. And our in-house technologies use also the FFT quite a bit. Then we also have the entire LA-PACK and the BLAS, which is the basic linear algebra subroutines. We also have large number computations.

I'd like to go over some of these briefly to tell you what these are about. The level 1, 2, 3 BLAS, the basic linear algebra subroutines, are vector-vector, vector-vector matrix and matrix-matrix computations. And if you're doing that sort of work, this is industry standard, it can be found on all sorts of platforms and all sorts of OSs. The APIs are exactly the same. LAPACK, the linear algebra package for eigensolvers, system solvers and things like that. If you're doing linear algebra work of any sort in speech or in signal processing, this would be the sort of library that you like to use.

Now a little bit in detail about what these libraries are all about. vDSP, the signal processing library that we have, it has Fourier transforms, 1D and 2D, real and complex, single and double precision. It has fast array math, convolutions and correlations, all the regular things that you can find in a signal processing library. vMathlib and vForce, we set out to produce exactly the same thing that we have on the scalar domain, which is the LibM, on the vector domain.

So if you are sitting in vector code, be it SSC or Altivec, and you are dealing with 128-bit vectors, hardware vectors here, not mathematical vectors that you point to, pointers, then you would like to have the same facilities that you have inside your scalar, like if you like to take the cosine of that vector or sine or things like that.

vMathlib is... vForce is that sort of facility. It will give you, it will allow you to stay on the vector domain, it will give you the answers back in the vector domain. So this is for people who are actually vectorizing their code, writing vector code. A lot of people who would like to benefit from vector computation in single precision on Altivec and single precision and double precision on SSC, would like to have this sort of facility, but they like to stay away from vector arithmetic.

vForce is for them. It turns out vForce has higher performance than the other libraries. vForce has higher performance than vMathlib, and the reason is that vForce operates on very large arrays. You can give it an array of 1024 elements or 1024k elements, lots of elements, and this allows the pipes to be saturated, so it will give you much better performance.

Using vForce does not make you use the vector engine directly, but underneath it decides to do it, and it will do that for you. So you don't have to be any vector programming to use vForce. Very, very capable of doing that. Very, very capable little library. We introduced that. This is the first incarnation of it for Tiger.

As we go along, before I go through this here, is that our aim is that vMathlib and vForce will mimic more and more the functionality of the Libem that we have on the scalar domain. So many of the functions which you may find not there will eventually show up there, so that we'll have parity pretty much between Libem, vForce, and vMathlib. So that's what we're planning on doing. The other two libraries are vBasicOps and vBignum. As you know, the hardware on SSC and Altevec allow you a certain amount of facilities to a certain amount of data types.

On Altevec, you have things up to 32 bits, but if you wanted to have higher than that, if you wanted to operate on 64 bits or 128 bits, let's say adds, multiplies, or shifts, or things like that, then the vBasicOps will do that for you. vBignum is essentially the same thing, but it takes it from 128 bit to 1024 and allows you to do multi-precision arithmetic on that. The last thing in here is vImage. vImage is our, as I said, state-of-the-art image processing library. It is used in all sorts of technologies within Apple and outside. It's very fast. We introduced this two OSs ago.

It has the regular things, the convolutions, the morphologies, geometry, transforms, and things of that nature. What I would like to do is I just gave you a small overview of what the Accelerate framework is. I'd like to pass this on to my colleague, Eric Pospischel, who is going to talk to you in depth about how to use the Accelerate framework by one example and go over one example. Eric. Thank you.

Good morning. Well, I just told you some of what's in Accelerate. I'm going to tell you why to use it and how. This slide shows the performance of one of our FFT routines on a G5 system. FFT stands for Fast Fourier Transform. We don't need to go into much detail about what it does, although coincidentally somebody just asked me before the talk about starting to use this.

It is a quite popular routine in signal processing. It's a famous routine and it provides a good mix of both calculation in the processor and motion of data in memory. So it's a challenging routine to optimize. The first table shows how long it takes to execute a single-position FFT on 1,024 complex elements on a G5.

If you open up Numerical Recipes and type in the 4-1 FFT routine there and run that, it's going to take about 130,000 CPU cycles to finish. If you're familiar with the FFT, you might have heard of FFTW, which stands for Fastest Fourier Transform in the West. And it's a highly optimized and self-tuning FFT developed at MIT by Matteo Frigo and Steven Johnson.

And it certainly runs faster than Numerical Recipes in about 32,300 CPU cycles. But it doesn't take full advantage of Altavec's capabilities. FFT does take advantage of vector processing to great effect. And as you can see, it runs a little faster than the fastest FFT in the West. It runs about three times faster in 11,200 CPU cycles.

So we got an advantage here out of using the Altivec capabilities. Double precision is harder because the PowerPC doesn't provide vector processing for double precision. And in spite of that, we still managed to beat FFTW quite handily from 40,000 CPU cycles for FFTW to 27,000 for ours. So that's a clear win. That should motivate you pretty strongly to want to use the Accelerate framework.

And next, here is how to use it. I'm going to go through this quickly, and then I'm going to look at a little bit of example code. So, in the source code, you need to include the Xcelerate header. Simple enough. And when you build the application, you need to tell the linker that you're using the Xcelerate framework.

And if you're using Xcode, as probably most of you are, certainly after Monday, I think a lot of you are, you do that just by adding the Xcelerate framework to your project. And once you do that, Xcode knows how to pass the framework to the linker. If you're using make or link commands directly, you just add the GCC switch, dash framework Xcelerate to the link command.

So those are the declarations you need. And here's some sample code. One reason we chose the FFT is it's a little more complicated to use than most xcelerate routines. A lot of routines, you can just call and do the work. The FFT as an alternative requires some setup.

In spite of that, it's still fairly straightforward to use. So this summarizes the code to do that. To do the setup, you call this routine, vdsp create FFT setup at the--you call it once at the start of your application, and it gets the memory it needs, and it does some calculations to prepare some constants for later FFTs, and it returns a value to you that is a handle to that data.

You do need to check that handle to see if allocation of memory failed. If it didn't fail, then you can use the FFTs as many times as you want later in your application without having to check the data anymore-- check the setup anymore. So once you've got the setup, you can go ahead and call vdsp fft zip.

And you can call this as many times as you want. This is a high-performance routine. The setup is not. You call the setup only once. You call the FFT as many times as you want while you're doing your image processing or system processing, whatever. And at the end of your application, you release the memory. You call vDSP destroy FFT setup to do that.

Could I have demo one on screen now? We'll take a look at the demonstration code. This is actual running code. It doesn't do too much because all the data is zero. You have to supply your own data. We just give you the routine. We'll take a look at some pieces of this. Obviously the include of the headers here. Another thing to look at here is we allocate the data, we allocate, excuse me, we allocate space for the data.

Malik is a common way to allocate large amounts of space and it has additional benefit of returning addresses that are aligned well for high performance. The AltaVec capability, the AltaVec processor wants its vectors to be aligned to multiples of 16 bytes and Malik will do that for you. If instead you allocate space by declaring arrays in C, they might not be aligned well and you won't get as good performance and that will, there will be some effect of that as well on SSE.

It doesn't require alignment in the same ways that AltaVec does but it still benefits from it in situations. So we do recommend using Malik to allocate the space you need. And then for complex data, we package the, We package the pointers to the real imaginary parts into a data structure here so they can be passed around as one unit. I just set the data to zero. As I said, you have to supply your own. We call the setup routine here and check its result and call the FFT.

When we call the FFT, that first parameter is that data that got returned by the setup. The second parameter is the structure that contains the pointers to your data, the signal data. The third parameter here isn't the length of the data. It's going to be the base-2 logarithm of the length. So that would be 10 to represent 1,024 elements. 2 to the 10 is 1,024.

And, the final parameter is the direction, which is forward for forward FFT and FFT_inverse for a reverse FFT. And, there's no result here, we don't have anything to test because we know it's going to work. This sample code does only show one call to the FFT, but typically this is going to be in some sort of processing loop and release our space and destroy everything at the end.

So, it's really pretty simple to use a lot of the accelerate routines. you can just insert them into your application pretty quickly on that. Now I'm going to give the floor to my colleague, Ian Ollmann, who will show you many things that you must do to optimize an application for high performance if you choose not to use our Accelerate framework or can't use it for some way.

So the reason we presented the Accelerate framework first is we really want you to investigate that first and save yourself a lot of time. You can write your own AlteVac code, and a number of you have, and that's great. There are a lot of functions that we just can't provide in the Accelerate framework because they're very specialized to your app. We'd like to build it as we go along and bring in more and more functionality.

But if you've got something really wild and crazy, you might have to write your own code. So this is always going to be something people will be looking at doing. And now that we have the Intel platform, for the same reasons you vectorized, you're probably going to want to write SSE code. It's a definite strong win over Scalar. So we encourage you to consider doing that.

So for the rest of this talk, I'll be talking about SSE. What I'm really talking about is SSE, SSE2, SSE3, and any future incarnations. It just gets to be taxing to say that all the time. So the good news about translating AlteVac to SSE is that they're actually quite similar.

They have the same vector size as 128-bit wide for the register. There's pretty good instruction homology. You'll generally find that the AlteVac series of instructions are larger than SSE in terms of numbers. So there'll be a few things that are used in AlteVac that aren't there, and you'll have to find ways around it.

In our experience, and we've ported a bunch of the Accelerate framework over to the SSE unit, most of the work that we actually had to do when writing the Altivec code was the parallelization, the DAO layout, planning and handling misalignment. The good news is you've already done that work if you have Altivec code, and you don't really have to redo it for SSC. The data's already there where you put it. It's already aligned.

It may already be padded out to multiples of 128 bits. So in our experience, and we hope that it carries over to you, you can expect the translation of Altivec to SSC to take a small fraction of the time that it took to write the Altivec in the first place, in most cases.

So as a overview of what you learned, we'll have a brief discussion of hardware. Now, Apple hasn't actually said what the specific chip is that's going to go out, so I'm just going to kind of gloss over it and show you what's on the developer transition system, but don't expect that that's necessarily going to apply to what customers get. There's a C programming model. You don't have to write an assembly. As its on-site of data types and intrinsics. And finally, I'll talk about difference in instructions, like how to map from ultimate to SSE and some of the tougher cases and what to watch out for.

So as far as the hardware overview, as I mentioned earlier, there's eight 128-bit XMM registers. If we, on the 64-bit architecture looking way out into the future, or if you're programming for Linux now, you'll find that you actually get 16 of them. The register file is flat like Altivec. It's not stack-based like the FPU on the X87.

There's no special purpose registers, so you can use all that space. The instructions generally take no more than two register operations-- operands. So you might be used to from Altivec being able to provide up to four operators, so you'll only get two here. And the instructions are typically destructive. That is, one of the inputs is destroyed and replaced with result data.

On the Pentium IV core, I just want to give you a sense that there's really a lot going on behind the scenes. You've got eight different units providing the vector support. So it's not very simple in the back end, but on the dispatch, everything goes through port ones. You kind of get this throughput of one vector instruction per cycle.

As far as the data types that are supported, they support the same sorts of types as Altivec. You can get 8-bit, 16-bit, 32-bit ints, 32-bit floats as well. In addition, there is support for 64-bit types, which means 64-bit ints and double precision floating point. The hardware provides aligned and misaligned loads and stores, so that will probably be welcome to a number of you. There is a limited permute capability on board. It's you can do predicated operation just like Altivec.

It has some saturated arithmetic support. It has this normal set of integer operations that you'd expect from Altivec, sort of multiply, add, subtracts, min, max. It also provides a sum of absolute differences instructions useful for video encoding. It does the various Boolean operations. Same thing for floating point, multiply, add, subtract, et cetera. In addition, there's hardware divide and square root, so that will save you a little time if you're tired of programming Newton graphs and opt-in.

The C programming model for SSE, there's an Intel C programming interface supported by the Intel C compiler and also various other Intel-based compilers. And it's a lot like the Altivec C programming interface. So to give you an idea, they provide a series of types. The Intel types, there's a double underbar M128I, which will stand for any packed integer. You can also use double underbar 128 for floats.

And the other type of type is the D. In Accelerate framework, we provide a whole series of types to better differentiate the various integer types you might put in a register. We actually suggest you use these things because it gives GDB more information on how to show the data. So you can actually view your data as a vector full of 8-bit types. So if you include accelerate.h, you'll be able to use the sort of VSNT8, for example, as shown here. And you can view your stuff as a series of cares.

As for the intrinsics, Altivec would have a vecadd, SSC will have a mmaddepi8. The convention here is they put the type of the data on the end of the instruction. E stands for extended, which means 128-bit register. It's not MMX. P for packed, I for signed integer. If you have an unsigned integer, it'll be EPU. 8 is the size of the element type.

Where Altivec uses vecadd over and over again for different data types, you'll see that SSC changes the function name or the intrinsic name as you go along. PES stands for packed float, packed single precision float. And in addition, the SSC unit supports 64-bit INS, which is the EPI64. You can do packed double precision. And also for the floating point stuff, it has a series of scalar operations that operate only one element in the vector.

So here's the sample function. This is what you might do in Altivec to calculate the distance from the origin to a point in 2D space. It's a simple square root of x squared plus y squared. Here's the same thing in SSE. And you'll notice as I flip back and forth that very little has changed here. You have to include an extra header. There's an xmm and trin.h, which on GCC will pull in everything. On Intel you might want to pull in one of the later headers like emm or pmm for the Prescott new instructions.

And there's a standard multiplier you're used to, mmolps. Altivec has a multiply add fuse. There is no multiply add fuse on SSEs, so what we do is we split it apart into add and multiply on the second line. And then finally, for the square root, we just call into Accelerate framework, which is present on both platforms, and you can call it with either Altivec vectors or SSE vectors.

If you're clever, you can sometimes write one piece of code that works on both platforms. So in the Universal Binary Programming Guide, you'll find an example of a matrix multiplication that's written using a shared set of code and some defines up at front that map it to either instruction set architecture. C provides a sort of leading edge support for standard operators with vectors like this example, which would work on both.

But that would tie you to GCC, so take caution before going that way. As far as what parts of the vector unit are guaranteed to be there, the hardware and the ABI supports MMX, SSC, SSC2 guaranteed. To detect SSC3, you go through the same sysketl by name interface that you use for Altivec, except you change the string that you use to test for SSC3 in particular. particular.

For Floating Point, you'll find that as far as the ISA goes, there's a very good matchup in general. Both are strongly influenced by IEEE 754. SSC is a lot like Altivec, except that you also get double precision operations, you get scalar operations, you know, the divide and square root in hardware, and it's a fully IEEE 754 compliant machine, which means you can actually get exceptions and set flags and set rounding modes. You don't get the multi-add fuse unit, and you aren't going to get vector compare bounds. I doubt any of you have used the vector compare bounds. And there's a log 2 to the x estimate on Altivec, which isn't there either, but you probably didn't use that either.

As for the Transcendentals, as mentioned earlier, just include Accelerate, accelerate.h. It's sort of like your vector math library. It will give you all of the vectors functions that you've come to expect from libm. And you get the rounding modes. If you want to change the rounding modes or exceptions or flag bits, they provide intrinsics mm_get_scr. You can read it in and you can use mm_set_scr at the bottom of this slide for changing things.

There's one additional hazard under SSC that you'll have to watch out for if you're doing floating point work. They take a trap under normal. Altivec does too. The big difference is that by default on a gValue, you can't do that. So if you're running a Mac OS X, the denormal handling is turned off.

You can turn it on if you want and you'll get a kernel trap. And that takes 1,100 cycles if you hit a denormal. On Intel, it's handled in hardware, but the denorms are on by default because we're doing most of the scalar arithmetic in your application using the vector unit.

And if you hit a denormal, it's a 1,500 cycle stall. So you may want to take special steps. You can do things like, for example, check with the integer comparers to see if you're actually a denormal and then handle it. handle those differently so you don't actually hit the denormal in the floating point segment of the vector unit.

If you want to turn denormals off, which might be advisable for real-time code, you can do that by using double underbar mmgetScr to load in the status and control register or in the appropriate bits to turn off the denormal support and then slam it back into the MxScr.

As for integer, you'll find the instructions are very similar to MMX, if you've ever done that for SSE2, except they're sort of stamped out twice in the register. You can do standard add and subtract. They provide 8, 16, 32 and 64 bits. For saturated ads, the support is more limited. You only get 8 and 16-bit saturated ads. You can do min and max operations with signed and unsigned, but only 8 and 16-bit integers. There's 5 multiplication operators.

Almost none of them look very much like Altevec, so if you're doing multiplication, you typically have to rewrite those segments, take advantage of these things. They do the low and high work 16-bit part of a 16 times 16-bit multiply. There's another one that will do two 16-bit multiplies to get a 32-bit products and add them together and give you a 32-bit result. That's a little bit like that. I can't sum. And there's some support for 32 by 32 gives you a 64-bit result.

There are various compare operations. They operate just like AlteVec. You can -- when you use them, they return minus one, where the compare returns true, and zero in the mask where it's false. But they don't set the condition register or any equivalent thereof on the x86 architecture, so what you -- you'll have to do that a different way if you want to branch directly based on what those things do. They provide a whole series of compares, actually a larger set than you'll get in AlteVec, and they're available in floating point in both scalar and pack varieties.

So the AlteVec set are here in blue, but you also get not equal to and unordered and ordered compares, which are occasionally useful for doing the right thing when NANDs come through your code. For integer compares, there are only signed compares. There are no unsigned integer compares, so that involves some dancing to get around, and there's no 64-bit integer compare.

For predication, if you actually want to branch based on the result of one of these compare tests, they have a move mask, a series of move mask instructions that you can use. I forgot to mention earlier that SSE is kind of designed with some segmentation between the data types.

The integer, single precision and double precision are different, so oftentimes you'll find instructions that do substantially the same thing. For example, the loads, which actually have three different variants. It's encouraged that you use the right load with the right data type. And here we have three different move masks. So we have one for integers, one for double and one for float.

So anyway, in this example, if you want to do a SSC version of vec any equals, which will branch if any value between A and B is equal to its counterpart in the other one, what you can do is do the regular compare, which provides you with the zero or negative one mask. Move the mask bits, it copies the high bit from each packed single into the integer register directly. And then you can test that against zero.

More commonly in vector you'll use something like a select instruction to trace through both sides of a conditional and then select between the two results based on the value of a test. There's no select instruction. So what you have to do is the and not and and then or the results together, which you might have been doing in scalar code. It's the same sort of thing.

For type conversions, you can do float to int conversions, but it only does signed ints, no unsigned support. In addition, it saturates differently. If you have a very large floating point value that's outside the range of a standard signed int, it'll saturate to a very negative number. So if you want to preserve the Altivec behavior, there's a three instruction sequence you can do that.

You first test against 2 to the 31 to see if it's an overflow and you get a mask back. Then you do the conversion and then finally you XOR the result with the mask to flip all of those overflows, which are giving a very negative number, back up to a positive saturated value.

There are various int conversions. Mostly these things boil down to kinds of permute or pack operations. In Altivec, it's the same sort of thing on SSC. There are a couple of conversions which are not there, such as the int to unsigned short saturated pack. I don't particularly use that one myself very much, so I don't miss it, but you might.

There is no unsaturated packing. That's just a permute operation, but if you need it and the permute doesn't quite do what you want, then one thing you can do is sort of shift left and then shift back right algebraically to chop off the high bits and then do the saturated pack. That will keep the saturating pack from saturating. From the small int to large int, usually you just use the unpack. They're a lot like the Altivec merge instructions. So you merge it with zero to go from a small unsigned to a larger unsigned value.

For the Permute unit-- oh, I'm sorry. This is-- I'm sorry. It says unpack, but this is really how to do a transpose. The merges are there, so you can just-- previously in Altivec you'd use merge to do-- eight merges to do a transpose operation. You can do the same thing with unpack things.

The only difference is that because the-- the data appears in the register, you swap low for high, which can be a little confusing. As for the permute shifts-- the shifts, there are shift supports for a variety of different types, but there isn't shift support for 8-bit integers. So if you're working on 8-bit ints and you need to shift them, you'll probably have to promote them to 16-bit ints, shift them, and put them back. They do have shifts for 64-bit types, which is handy.

There's a 128-bit shift like Altevec, but it'll only move in multiples of 8 bits. It's a little bit like vec slo or sro. And it's a shift by immediate only, so if you want to shift like we might have done in the permute unit for misalignment, there's no way to do that.

Finally, there's a series of shuffle commands which do some basic sort of arbitrary permutes. I wouldn't say that it doesn't cover all of the kinds of permutes you'd want to do with Altivec or that you could do with Altivec. Altivec will allow you to move any bytes anywhere.

The permute unit is a little bit more limited here. So for floating point, there is a floating point shuffle that will allow you to copy two floats from vector A and put them in the high two positions on the result. And then two floats from vector B and put them into the low two positions on the result. There's an integer permute which operates on 32-bit ints which will allow you to just randomly shuffle around within one vector.

And then there's a 16-bit thing which does the same thing except you randomly shuffle them around within the high or low 64 bits of the vector. And there's no way to do 8-bit stuff. The only thing there is is 16-bit shifts left and right. Or you can do 8-bit merges.

So as I mentioned earlier, there's a lot of things that can't be done with the permute. It's sometimes difficult to move data between the low and high 64-bit segments. The minimum permute level is the word, which is 16-bits on the x86 platform, not a byte. So if you want to do something like a byte swap, you're going to have to shift left, shift right, or the results together. There is the byteways on the pack.

But there's no good way to do data-dependent data movements. So if you are writing a matrix inverse and you need to find a pivot element and you want to rotate data around based on the pivot element, you don't know where it is at compile time, you might have to start out and use a scalar unit.

As to loads and stores, the scalar loads and stores always load into one end of the vector in a fixed position, which may be welcome. It's a little bit of a game to figure out where your data went on AltaVec if you're doing scalar loads and stores. And there's a whole series of instructions that can do that.

There's move_ss, move_sd for single and double precision, and move_d for integers. Move_d will also move directly to the integer unit, so that's kind of handy in some cases. So that doesn't really translate directly very well. And then-- The vector loads and stores on SSC, they have the align flavor like Altivec. It's very fast.

However, if your pointer is misaligned, Altivec would just return you the aligned vector. SSC will return an access exception, so it's suggested you don't do that. As for misaligned loads and stores, those are handled in hardware. There's a separate instruction for those, which is convenient. It's one and a half to two times slower in many cases. But misaligned stores are particularly costly, so I suggest you don't do that. You might want to iterate in scalar code up to the first the line boundary and then go forward with vector after that.

So in summary, the misaligned access is handled almost completely differently. It's almost orthogonal to Altivec. You can write some Altivec which does the same thing as SSC but is not a particularly fast implementation. It's suggested you always align your stores and be careful of those crashing problems. And there's one more thing when you load and store data from memory. It's little endian, so it gets bytes swapped.

The full 16 bytes, the entire vector is bytes swapped. And the end result of that is that the elements which should appear on the left side of the vector are now on the right side of the vector. So you're going to have to keep that in mind while you're programming because those shift lefts or shift rights are actually going the other way from what you think. And so there it is in the graphic.

Unrolling for performance in PowerPC you'd frequently do several things in parallel to try to keep the pipelines busy. PowerPC is especially deep pipelines compared to x86. So this was often required for performance. It's a bad idea on SSC in many cases. There aren't enough registers to hold that kind of parallelism and everything is reordered quite successfully and quite efficiently in the back end.

So you're actually better off with some serially unrolled code that just reuses the same small set of registers over and over. That'll save you some spills out onto the stack. And it all gets reordered for parallelism back in the back of the machine. So the machine stays full. You just don't have to write this massive body of parallel code. You just write the little loop.

So other performance tips. Overall, I would suggest you try to reduce or eliminate your need for the unit. There are many cases where people go to town with that thing and work to create an Altivec. It's not going to port very well. You may have to change data layouts. It's very -- it suggested you align your data.

That was always a good idea for Altivec, but it's a good idea here, too. Do not enroll as much for parallelism, as I mentioned earlier. And one other thing -- I mean, it's often a fun game in Altivec to synthesize your constants and code rather than looking them up from memory.

It doesn't really get you much here. There's no VEX splat to begin with, so it's hard to make them. And then the problem is that there's eight registers, so they'll just spill out to the stack anyway, so you'll have to load them in again. So if this is starting to sound like a bit of work, I will encourage you once again to use the Accelerate framework.

In summary, I'll say that the platform that you'll be porting to has fewer registers. It has a 64-bit ALU in the back end, so things are moving through in successive pipeline stages rather than one instruction per pipeline stage. And the pipelines are a lot shorter, so you have to enroll a lot less, or sometimes not at all, or preferentially serially in parallel. Nearly all the vector instructions dispatch through a single port, except loads and moves and stores. So there's not as much advantage to hand-scheduling stuff, so you get permuted against ALU activity, that kind of thing.

The Seq program model is there. The data types, you can use Accelerate types for both Altivec and SSC. They work interchangeably. And for intrinsics, you can include xmm.h to get them. And so we found in our case that some codes are not as good as others. In the case that some code can port over fairly cleanly, especially floating point code, if you're heavily dependent on permute or have difficult misalignment problems, then those could be more of a challenge. Remember that the element order is backwards in register from what you're used to. And I'll suggest again that you use the Accelerate framework to avoid porting worries.

So for more information, you can go to the standard WWDC website. And you can also go to various reference pages. Velocity Engine web page currently doesn't say too much about Intel, but it will. Intel reference documents are available. You can downline the, what are essentially the equivalents of the PIM and PEM from Intel. Here's the link to it. And the universal binary program guidelines will get you started. And it's a rather long URL, so we've given you a tiny one to go with it. And here's some contact slides if you need to get in touch with us.