Application • 41:40
The Accelerate Framework in Mac OS X is a powerful suite of platform-optimized libraries designed to provide high-performance mathematical facilities such as LAPACK, BLAS, vMathLib, and image processing operations. This session also discusses new advances and extensions to these libraries. The Accelerate Framework takes full advantage of G3, G4, and G5 processors to unlock best-of-class performance in your application. View this session to learn how the Accelerate Framework can take your application's performance to the next level.
Speakers: Ali Sazegari, Ian Ollmann, Robert Murley, Steve Peters
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Good afternoon. This is session 407, the Accelerate Framework. The Accelerate Framework was introduced at last year's developers conference. This year we'll rehash a little bit about what was introduced last year, but also give you part two. If you know anything about Ali's team, you know these are the guys who, in your calculus class, were the ones finishing the word problems or the homework before the class was even over. They love math. So I hope you guys enjoy this session. And with that, I'd like to introduce Dr. Ali Sazegari.
First, thank you very much for coming to our session. And what I'd like to talk to you about is the Accelerate framework and what we have done and we plan on doing for Tiger. The talk is in three parts. I'm going to give you a general overview and some snippets of results that we have.
And after that, I'm going to pass it on to my colleague, Ian Ollmann, and he's going to talk about the image processing library, which was introduced last year in the Panther OS. And after that, I'm going to pass it to my other colleague, Steve, and he's going to talk more about the numerics and the linear algebra results that we have. So let's get started.
So as you know, we have had this particular configuration, the Accelerate framework, which is a collection of all the computational underpinnings of Mac OS X in Panther. We've had the vecLib section of it for a while. Last year we introduced vImage. The vecLib section had the signal processing, the linear algebra, the matrix computations, the BLAS, the large-number computations, and the math libraries which took hardware vectors, 128-bit vectors. We added image processing.
I'm happy to tell you a lot of people are using our image processing, inside Apple and outside, and we're going to talk a little bit more about that. One of the additions to the new operating system is vForce. With vForce, we've had a lot of calls from people who wanted to pass an array of elements to the elementary functions and get the elementary function results back, not just pass them in hardware vectors — vMathLib works on one hardware vector at a time. vForce is our new library, which we will talk about in depth later on. Steve will talk about that.
What is delivered in Mac OS X Tiger? Basically, the Accelerate Framework is one-stop shopping for computational performance. Digital signal processing: we have expanded that in Tiger, with about 340 new functions in the vDSP subframework. Digital image processing: we have expanded that also, with added performance in some of the core routines such as convolution. The BLAS Levels 1, 2, and 3, if you're familiar with those — these are the basic linear algebra subroutines.
Again, these are the building blocks of the computations that people do for LAPACK. The entire LAPACK — single and double, real and complex, for all of the routines. Basically, this is the exact API that people who are using LAPACK are used to. vForce, the array version of the elementary functions, which we'll talk a little more about today — that's new in Tiger. And vMathLib, the counterpart of the regular LibM: LibM runs on the scalar units, and this one here runs on the vector engine.
I'm going to just touch on some of the performance improvements and the performance we have right now in Tiger, on the seed CDs that you've received. First, I'd like to talk about vForce performance. These are the elementary functions. We have single and double precision. They're highly accurate. They operate on arrays instead of single elements or 128-bit hardware vectors.
Monotonicity is observed over the entire domain of definition. That's pretty important, because there are competitors that do have functionality such as this, but they have cut lots of corners, and developers have to worry about pitfalls — where to call it, what not to call, or which elements to pass. Here, you're free to call anything.
Basically, if it's in the floating-point domain, it will work, it will not trip you up, and it will not give you the wrong results. I have a small table here showing you the benefits of vForce. I'm quite proud of this particular piece of work that our group has done, and Steve will talk about it a little further. Square root is over 3 times faster than the current one.
Exponential is over 6 times, and sine is 11 times faster. Square root was already pretty fast on G5s, but we have made it even faster with this. The reason these things are faster is that we are able to fill in the bubbles in the computational structure of the algorithms: the regular elementary functions just don't have enough data to work through, and you end up with a lot of empty cycles going by. This approach fills them up completely and gives stellar performance. The next thing I'd like to talk to you about is LAPACK performance.
LINPACK — a lot of people know about these results. I just have a little bit of it here: the DLP 1000, the double-precision LINPACK 1000-by-1000 benchmark. We're above 5 gigaflops for double precision, and single precision is over 7.5 gigaflops. This is on a 2.5 gigahertz PowerPC.
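As a rough illustration of the LAPACK interface behind numbers like these — a minimal sketch, assuming the Fortran-style entry points that the Accelerate framework exposes (dgesv_ and the __CLPK_integer type); the tiny matrix and the file name are illustrative, not from the session:

/* solve.c — build with: gcc solve.c -framework Accelerate */
#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void)
{
    /* Solve A*x = b with LAPACK's dgesv; A is stored column-major. */
    double A[4] = { 4.0, 2.0,    /* column 1 */
                    1.0, 3.0 };  /* column 2 */
    double b[2] = { 9.0, 7.0 };  /* right-hand side; overwritten with x */

    __CLPK_integer n = 2, nrhs = 1, lda = 2, ldb = 2, info = 0;
    __CLPK_integer ipiv[2];

    dgesv_(&n, &nrhs, A, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);   /* expect (2, 1) */
    else
        printf("dgesv failed, info = %d\n", (int)info);
    return 0;
}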
BLAS performance. The quintessential BLAS benchmark is really DGEMM, the double-precision general matrix-matrix multiply, which is an enhanced matrix multiply: a scalar times a matrix times a matrix, plus a scalar times a matrix. A lot of people like to look at that to see the prowess of the implementation. What we have here compares that with the Opteron, because I get asked how we compare with the competition. Higher numbers are better on this one.
Double precision at size 5,500 — so if you multiply matrices of size 5,500 — you get 12.8 gigaflops on the PowerPC and over 7 gigaflops for the Opteron. Now, the Opteron we had on hand was a 2.0 gigahertz machine; we were unable to get hold of a live 2.4 gigahertz machine to run this, so we simply scaled its numbers up by exactly 20%. Generally, when frequency goes up that much, performance doesn't go up that much, but we gave them full credit anyway. So it's 12.8 versus 8.55. What I have next is just for fun.
What would the SGEMM performance be? The SGEMM performance on our machine is 23 gigaflops. Some of you know I've been in this business for a while now, and 20 gigaflops used to require many millions of dollars to achieve. Now 23 gigaflops costs just a pittance: you can buy yourself a PowerPC at 2.5 gigahertz and get that.
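For readers following along in code, here is a minimal sketch of the DGEMM operation just described, through the C BLAS interface in the Accelerate framework; the tiny 2-by-2 matrices are illustrative, not the benchmark's 5,500-element sizes:

#include <Accelerate/Accelerate.h>

void small_dgemm_example(void)
{
    /* Row-major 2x2 matrices for readability. */
    double A[4] = { 1.0, 2.0,
                    3.0, 4.0 };
    double B[4] = { 5.0, 6.0,
                    7.0, 8.0 };
    double C[4] = { 0.0, 0.0,
                    0.0, 0.0 };

    /* C = 1.0 * A*B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* M, N, K       */
                1.0, A, 2,      /* alpha, A, lda */
                B, 2,           /*        B, ldb */
                0.0, C, 2);     /* beta,  C, ldc */

    /* C is now { 19, 22, 43, 50 } */
}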
vDSP performance: our FFTs. We continue to have a stellar collection of FFTs for our users — single and double precision, real and complex, 1D and 2D, in place and out of place, and radix 2, 3, and 5. We have them hand-tuned for the vector engine, and also hand-tuned for the dual scalar units. I'm comparing them with 3.2 GHz Xeons.
This time around, we're not looking at gigaflops; we're looking at timing, real microseconds, because signal processors really don't care very much about floating-point throughput. Because they're real-time folks, they like to find out exactly what the timing is. Single precision 1024 complex, which is always etched into my mind, is 4.56 microseconds versus 6.3.
The FFT is only 1.13 on the 3.2 GHz Xeon, and these are one-processor numbers, because an FFT doesn't do enough work to make it worth doling out to multiple processors. Single precision 1024 real is 2.3 microseconds versus 4.27 microseconds. So you would think this is fast enough — why would you want to make it any faster? Just one example, the quintessential example I always like to give, is iTunes.
iTunes uses our FFT to the tune of 1.2 MHz — it's that real FFT that gets used. The more we shave off of that, the faster your decoding and encoding will go; and the more we shave off of computational kernels like FFTs and IMDCTs, the better your battery life will be. So it's pretty darn important to make sure this always runs extremely fast.
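A minimal sketch of the kind of FFT call being discussed, using vDSP's split-complex interface; the names follow the current vDSP headers (vDSP_create_fftsetup, vDSP_fft_zip), and filling the signal is left as an exercise:

#include <Accelerate/Accelerate.h>
#include <stdlib.h>

void fft_1024_example(void)
{
    const vDSP_Length log2n = 10;            /* 2^10 = 1024 points */
    const vDSP_Length n = 1 << log2n;

    /* Split-complex storage: separate real and imaginary arrays. */
    DSPSplitComplex z;
    z.realp = calloc(n, sizeof(float));
    z.imagp = calloc(n, sizeof(float));

    /* Twiddle-factor setup; create once, reuse across many transforms. */
    FFTSetup setup = vDSP_create_fftsetup(log2n, kFFTRadix2);

    /* ... fill z.realp / z.imagp with signal data ... */

    /* In-place 1024-point complex forward FFT. */
    vDSP_fft_zip(setup, &z, 1, log2n, kFFTDirection_Forward);

    vDSP_destroy_fftsetup(setup);
    free(z.realp);
    free(z.imagp);
}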
Image Processing Library. I'm very, very proud of this particular set of work. We set out and worked on this for a year for Panther and delivered it, and it's used in a lot of applications, in-house and outside. I just have a couple of little things in here: we have planar and chunky — kind of a funny word — ARGB interleaved formats, native support for 8-bit and floating-point samples, and it can be used in real time.
It's multi-threaded so that you do better on large images. I have a small table here for performance: 8-bit image blurring, compared to IPP, the Intel Integrated Performance Primitives that some of you might have heard of. The 8-bit image blur is 5.5 times faster. The 8-bit image emboss is 2.2 times faster.
Also delivered in Mac OS X — let's not forget the underpinnings of regular computation. LibM: these are standards-conforming APIs for IEEE 754 and C99, single and double. New in Tiger is our long-lost 128-bit long double friend, which is going to make an appearance again. We have a really stellar implementation for that — very, very accurate computationally. All of these are numerically robust, highly accurate, careful about the environmental controls, and never mess anything up. We take a lot of care to make sure we conform to the existing standards.
Best-of-breed algorithms, basically. Coding to LibM in C is straightforward: you just call the compiler; you don't even have to say -lm. Using the Accelerate Framework in C is also straightforward: all you need to do is add -framework Accelerate. So basically what I've done in the last few minutes is give you a small sampling of what we have in image processing, signal processing, BLAS, vForce, and LAPACK, and we're going to go into some of the details of this work as we go along. Now I'd like to pass this on to my colleague Ian Ollmann, who is going to talk more about the image processing.
Thank you. vImage was introduced last year at WWDC and shipped with Panther, and since then we've gotten a lot of feedback on it. And we've taken your suggestions to heart, and so we've got more improvements for it now. vImage functionality remains much as introduced previously with some new features added on. We still have native support for 8-bit and floating-point samples. These can be arranged either in a planar, which is to say all one channel per array, or a chunky format, which would interleave several channels.
If you're doing 8-bit work on images, then we throw in saturated clipping, usually at the ends of functions that can overflow, so you don't get the white-goes-to-black or black-goes-to-white problem. We've put a lot of effort into thinking the design over to make sure that you can use these things in real time — we don't arbitrarily call malloc; we give you the opportunity to provide the temporary buffers so you won't block on that, that kind of thing. We're also re-entrant, so you can call us in a multi-threaded environment, and of course it's high performance, accelerated for AltiVec.
[Transcript missing]
We also do morphology operations where you can do kind of different shape changing operations, that kind of thing. So here would be an example where we've got a nice picture except for what looks like a power line up in the top left corner. Wouldn't it be nice if we could remove that? Well, there's lots of ways, but we'll just use morphology for this example.
And so we can apply a max filter. Max will go around and look at all the pixels around the pixel in question and take the maximum value. The power line is a fairly dark feature, so as we apply the max filter, it just goes away.
But you notice that some of the white highlights got bigger. So we can apply a min filter and kind of subtract them back out again. And so you have something that looks like your original image back except now that the power line is completely gone. So you can do these for interesting effects in addition to just shape changing and that kind of thing. So here's performance on that.
We've got a new algorithm for max, which works substantially better. Here you can see the 3.2 gigahertz dual processor Xeon results. Again, normalized to one is the red line across the bottom. And as the kernel size gets larger, you can see our performance relative to Xeon gets better and better. And we're up to four times faster for really large filters.
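A minimal sketch of the max-then-min sequence just described, assuming vImage's planar 8-bit morphology entry points (vImageMax_Planar8 and vImageMin_Planar8); the 5-by-5 kernel and the scratch-buffer handling are illustrative:

#include <Accelerate/Accelerate.h>
#include <stdlib.h>

void remove_thin_dark_features(const vImage_Buffer *src, const vImage_Buffer *dst)
{
    /* Scratch image the same shape as the destination. */
    vImage_Buffer tmp = *dst;
    tmp.data = malloc(dst->rowBytes * dst->height);

    /* 5x5 max filter: thin dark features (the power line) disappear. */
    vImage_Error err = vImageMax_Planar8(src, &tmp, NULL, 0, 0, 5, 5,
                                         kvImageNoFlags);

    /* 5x5 min filter: shrink the enlarged highlights back down. */
    if (err == kvImageNoError)
        err = vImageMin_Planar8(&tmp, dst, NULL, 0, 0, 5, 5, kvImageNoFlags);

    free(tmp.data);
}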
We do alpha compositing. We can support either premultiplied images or non-premultiplied images. We have functions to premultiply/unpremultiply data. We've now added a few new functions for Tiger. You can mix non-premultiplied into a premultiplied layer, which allows you to do multiple stacks as you go along. And we added in compositing with a scalar fade value, which allows you to sort of fade in the whole image without going through and writing over the alpha channel.
So those will be available. We also have new type-conversion features. This was actually surprising, at least to us: it was the number one requested feature. It seems everybody has their own data format they like to use, so we've got a lot of conversions to get that in and out of what vImage likes to use. We can now handle 24-bit, 8-bit-per-channel color, and also the older ARGB 1555 and RGB 565 16-bit-per-pixel formats. We also do 16-bit-per-channel integer support, in signed and unsigned flavors.
We've also introduced OpenEXR-compliant 16-bit floating-point conversion functions in case you need to work with video cards that use those. We've also added a few other things that let you insert channels into interleaved images or permute channels around — say you need to swap an ARGB image to RGBA or something like that. Those things will be there; they'll be fully vectorized, and they pretty much operate at bandwidth-limited rates.
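As one concrete example of the channel shuffling just mentioned — a sketch assuming the vImagePermuteChannels_ARGB8888 entry point, where the permute map reorders ARGB into RGBA:

#include <Accelerate/Accelerate.h>

vImage_Error argb_to_rgba(const vImage_Buffer *src, const vImage_Buffer *dst)
{
    /* For each destination channel, the index of the source channel to copy:
       destination R,G,B,A drawn from source channels 1,2,3,0 of A,R,G,B. */
    const uint8_t permuteMap[4] = { 1, 2, 3, 0 };
    return vImagePermuteChannels_ARGB8888(src, dst, permuteMap, kvImageNoFlags);
}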
We also added color space transforms. We originally didn't put these in because we thought we would leave them up to ColorSync. But now ColorSync wants to use our code, so we have them in there. We have matrix multiplication, with saturated clipping for 8-bit, of course, to prevent overflow, and we allow you to put in an optional pre- and post-bias.
Mathematically, the pre- and post-bias amount to the same thing, but it's a little easier to use that way, so we put that feature in. And again, like the convolution, this one only does work for non-zero elements, so you can safely pass in a rather sparse matrix and we'll only do the work we need to. We're also introducing a whole set of gamma-correction functions. These come in a variety of flavors.
You can get a generic power curve. We also provide a few specialty gammas like sRGB, which aren't exactly a generic power curve. These are available in two different formats: they're generally geared toward floating point, but you can get them in either full 24-bit or 12-bit precision variants. The 12-bit precision is obviously appropriate for data that was 8-bit integer data to begin with.
We also have a few functions that do simultaneous 8-bit conversion with clipping while doing the gamma correction, and we provide interpolated lookup tables for cases where your gamma curve is not nicely described by a power function. So I'd like to invite Steve Peters up to talk about the numerics improvements for Tiger.
I'm going to take some time this afternoon to present the credentials of our math libraries. Perhaps some of you have not used them before and would like to know a bit about the motivation, and also spend some time with performance. Hey, it works. Excellent. So, job number one for us is conformance — to make porting and building your applications correspond to the experience you've learned on other platforms, in the classroom, or from reading the standards.
Who does that anymore? At the base, we're delivering platforms based on G3, G4, and G5 chips, all of which have IEEE 754 compliant floating-point arithmetic, both single and double. When we move up one level to the elementary functions, the basic math libraries, these are also compliant with the C99 standard. All the required C99 APIs are present, for complex and long double as well, as we come into the Tiger world.
I'm going to have to use these. We build our linear algebra — the BLAS, the basic linear algebra subroutines — from ATLAS, the widely respected open-source package, Automatically Tuned Linear Algebra Software. We offer the full panoply of APIs in float, double, complex, and complex double. And similarly for the gold standard of numerical computing, LAPACK: all routines in float, double, complex, and complex double, with entry points for both C and Fortran.
After conformance, we're really concerned with performance. And the flagship of performance now at Apple is the marvelous G5 CPU, the PowerPC 970, which offers dual floating-point cores — my recollection, the first in Apple's line — and has given us really stellar performance. So on each 970 CPU, we find two floating-point cores capable of doing double-precision IEEE and single-precision IEEE.
On any machine cycle, both of those units can be pressed into action. We can start a floating point instruction down each pipe on both pipes in a single cycle. All the basic arithmetic operations – add, multiply, subtract, and divide – are present. We also get hardware square root in the PowerPC 970. That's a real boon to us.
And another class of instructions that has been present in the G4 and now as well in the G5 is called the fused multiply-add. Fused multiply-add takes three operands, multiplies the first two together, and adds the result to the third, all in the course of one instruction. So this ends up being a key operation, fundamental to linear algebra — the dot product is essentially multiply-accumulate, multiply-accumulate, multiply-accumulate.
It's fundamental to the FFT in much the same way. If you're doing a function evaluation by, say, polynomial approximation, you'll probably want to use Horner's rule, and if you think a little bit about the way Horner's rule works out, it's essentially a fused multiply-add win. And at the bottom line, we get to count two floating-point operations per fused multiply-add, so on a machine with two floating-point cores, we get four flops per cycle.
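A small illustration of why Horner's rule is a fused multiply-add win — a sketch using the C99 fma() function, which the compiler can map onto the hardware instruction; the polynomial is generic:

#include <math.h>

/* Evaluate c[0] + c[1]*x + ... + c[n]*x^n with Horner's rule. */
double horner(const double *c, int n, double x)
{
    double p = c[n];
    for (int i = n - 1; i >= 0; i--)
        p = fma(p, x, c[i]);   /* p = p*x + c[i], one rounded operation */
    return p;
}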
So let's see, four flops per cycle. I always have to do this in my head. Four flops per cycle, two CPUs in the dual G5, so that's eight flops across two CPUs, and we clock them at two gigahertz. So we top out at 16 double-precision floating-point operations, 16 gigaflops worth of floating-point operations on a two gigahertz G5. And now that we're using 2.5s, I have to update my thinking. It's 20. 20 gigaflops. Theoretical peak.
Theoretical peak. So how do you get to this performance, this great double-precision performance? If you've got an existing Mac OS X binary, perhaps built for G4, just bring it across. The scheduling in the CPU is really smart: as the instruction stream comes along and we start seeing floating-point instructions, they get dispatched off to the dual floating-point pipes, and they will finish faster than if they were sent down a single pipe. So part of the answer is you don't have to do anything, and you should see some performance gains in existing binary apps.
Second, if you're able to recompile your app – say it's an open-source application, a code you've developed – recompile with GCC, set the proper options that I'll point to in a tech note later, and let it schedule instructions in an even more optimal way for the G5, and you can see yet more gains.
It's also possible by paying special attention to algorithmic details to get even further gains. For example, if you're computing a rational function approximation, you may be able to arrange the calculation so that the numerator is computed simultaneously with the denominator on the two pipes, and at the end you just weld them together with the divide. This level of attention we've paid already to LibM, the basic math library, our BLAS, our LAPACK, and the vForce library.
Both our G4 and G5 platforms offer the AltiVec single-instruction, multiple-data processor. This is a four-way parallel single-precision engine. It doesn't do double precision — not at all. Ian keeps telling me this; it'll never do double precision. It's a single-precision engine with a huge appetite for floating point.
It really just rips through floating point calculations. All the basic operations are present, as well as a vector fused multiply add. So now we get two flops counted for the fused multiply add on four operands strung across the 128-bit vector. That gives us eight flops per cycle. Let's see. Can I do the math in my head for a 2.5 gigahertz G5? I think that tops out at 40 gigaflops. Thank you. Yes. 40 gigaflops tops. All right.
So how do you get to this performance? Well, sorry, you've got to do a little bit of work. You're going to have to learn a little about vector programming — there's an out that we've announced this week — but it helps to get in there with your code, understand where there's inherent parallelism in your algorithms, work those over with the SIMD instruction set, and pass them through the compiler. Our advice is always: profile first before you dig in.
Find out where the 10% of the code is, where you're spending 90% of your time, and go look at those. Shark is a wonderful tool for figuring out these cases. I hope you've seen Shark or plan to see a Shark Talk sometime this week. They're playing in a theater near you, I'm sure.
Auto-vectorization is an option, and this slide was actually written before the announcement that GCC 3.5 will offer some auto-vectorization features. Check those out; it may be a real boon for getting better use of the SIMD unit on G4s and G5s. There's also a third-party application called VAST that can analyze, I think, Fortran codes to discover inherent parallelism and emit the proper AltiVec code. At Apple, we've gone through and paid this kind of algorithmic attention, recasting algorithms for our vForce library, our single-precision BLAS, our single-precision FFTs and digital signal processing algorithms, and heavily in vImage.
Well, when you come to our platform as a developer and kind of come to that final step, you know, how do I access these wonderful libraries, link, load, and go, we try to make that as straightforward as possible. The library APIs generally will internally dispatch for the correct platform, so we won't go off and try to execute code that's appropriate for a G5 on a machine that's a G3, for example. Thank you.
Generally, the rule is this: LibM links by default — it's part of libSystem, so you don't need to say anything about that. For our long double and complex APIs, please add -lmx to your link line. And for vForce, the BLAS, LAPACK, vDSP, and vImage, the one-stop shopping place is the Accelerate framework: just add -framework Accelerate to your compile and link lines.
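Put together, the link story looks like this — a minimal sketch in which the file and program names are illustrative:

/* app.c
 *
 *   gcc app.c -o app                          LibM: links by default
 *   gcc app.c -o app -lmx                     adds long double / complex APIs
 *   gcc app.c -o app -framework Accelerate    vForce, BLAS, LAPACK, vDSP, vImage
 */
#include <math.h>

int main(void)
{
    return (int)sqrt(4.0);   /* LibM, no extra link flag required */
}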
I know that's a popular flag, so I'll let you copy that down. Well, what's new for math in Tiger? What have we been working on? Ali hit the highlights of the vForce library. Basically, we've been told people don't want to do one square root at a time; they'd really like to do 768 at a time.
And sure enough, there are advantages to be had when you can do many of these things at once. We also took a BLAS update, an update to Atlas 3.6. This helped us in a couple of places. We of course do additional Mac OS X specific tune-ups to that open source drop, and our compiler technology improved – thank you, compiler team – to give us some nice gains, and somewhat unexpected gains.
And because of the faster underlying BLAS and some improved compilation, our LAPACK is going faster too. Now, Ali always likes me to lead with the strongest graph, so I can give you a couple of performance numbers here. These are some numbers I collected for the 2.5 GHz dual-processor G5. It's a set of numbers you'll see quite a bit in the computational linear algebra community: it measures matrix multiply, DGEMM, and then the three decompositions — LU, and the symmetric decompositions, the L-L-transpose Cholesky and the Crout U-transpose-U.
For matrix multiply, we use various matrix sizes ranging from 500 up to roughly 9,000. We get our first plateau a bit over 11 gigaflops and then an interesting jump around size 5,000 as we push up beyond the 12 into the 13 gigaflop range. The decompositions are a little bit less jumpy, a little less of a step function, but look like they're hitting an asymptote at around 10 gigaflops.
Well, what's the competition up to these days? Let's just look at matrix multiply. Again in yellow is the dual 2.5 GHz G5, topping out at or above 12 gigaflops. On the bottom in blue is the Opteron, a 2.0 GHz Opteron, and it gets to about 7 gigaflops in the 2.0 model.
For the purposes of comparison, we know that they've got a 2.4 GHz part out there, and if they were allowed to perfectly scale, they'd hit that dashed white line and come in just a bit over 8 GigaFLOPS. We expect to see that when we measure those machines. Dual 3.2 GHz Xeon is the green. Gets up a little bit above 10, probably touches 11 in a couple of those places. So 2.5 GHz G5 seems to dominate in the matrix-multiply game quite handily.
This slide is a bit busier, but again, the color should be the guide here: yellow, again, is the G5; green is the Xeon; and the Opteron is in blue. And again, we've scaled the Opteron by 20% for the white dashed line. The G5 seems to dominate again. This next item looks a little bit out of place: I mentioned we did long double — I think Ali mentioned it too — and we'll also have the type-generic math functions, so that's good to know. So I want to come back to this vForce business.
And as Ali alluded to, the elementary functions in LibM – square root, cos, sine, arcsine – pass a single operand, do a fairly heavy amount of computation, and burp out a single result. It turns out that leaves bubbles in the modern RISC pipelines. So we say these C99 APIs are data-starved.
We're also required by IEEE 754 to have very careful control over the rounding modes and exceptions that might be generated in the course of such a computation, and that adds a fair amount of overhead. There are instructions that have to synchronize the pipe to get that stuff right, and we pay a pretty good price for that.
So the idea in vForce is: let's pass many operands through a single call; maybe we can get some advantage there. If we had 768 values in a vector x and we wanted to compute the single-precision floating-point sine of those things, we could call vvsinf, passing x, the count 768, and a place to stuff the answers, y. Or we might have 117 numbers we want the arctangent of, and there's a call for that.
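In code, those two calls look roughly like this — a sketch using the vvsinf and vvatanf entry points from vForce, with illustrative fill values for the operands:

#include <Accelerate/Accelerate.h>

void vforce_example(void)
{
    float x[768], y[768];    /* 768 sines   */
    float a[117], b[117];    /* 117 arctans */
    int n = 768, m = 117;

    for (int i = 0; i < n; i++) x[i] = 0.01f * (float)i;
    for (int i = 0; i < m; i++) a[i] = (float)i - 58.0f;

    vvsinf(y, x, &n);    /* y[i] = sin(x[i]) for all 768 elements */
    vvatanf(b, a, &m);   /* b[i] = atan(a[i]) for all 117 elements */
}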
We're going to insist on the IEEE default rounding modes, and we're not going to set any exception flags. So this is for, you know, close to the metal, high performance, go as fast as you can, we don't expect any big problems, and if there are any, well, we'll deal with them in some other manner than the IEEE approach.
So we also get some mileage here because, given multiple operands, we can pack them together into hardware vectors on the single-precision side and send them through the AltiVec engine. This is a very good thing. Similarly, on the G5, we can make sure to utilize the two pipes as effectively as possible.
We do a lot of software pipelining — let's just say arranging to fill all the available cycles on all the floating-point pipes. We unroll loops like crazy, and we also have taken some algorithmic approaches that favor calculation over table lookup and try to avoid branches like the plague. It makes these things go very, very fast. And as Ali pointed out, we have gains in square root of over 3x, e to the x of nearly 7x, and sine of almost 12x.
So, some caveats, right? This is close-to-the-metal programming. Generally, the results are as accurate as LibM, but they're not bitwise identical — don't expect to call and compare for equality on a list of arguments. We handle almost all the edge cases for the special functions according to C99 and the related specifications.
The exceptions are a few places around signed zeros — what happens when plus or minus zero is passed to one of these routines. We make no alignment requirements, although you will get best performance if you can 16-byte align your data; storage returned by malloc on Mac OS X is 16-byte aligned by default.
This stuff is tuned for the G5. I mean, that's the performance flagship here. But the good news is it runs quite nicely on G4 and G3. And of course, we dispatch internally to the appropriate routine. You don't need to worry about where you're running vForce routines. They just do the right thing.
So one final change of gears here is to come back to the elementary functions themselves, where we've done a bit of tune-up work. Here is a selected sample of probably the most used and most loved elementary functions in our library. We report the number of G5 cycles over a random selection of arguments across a wide range, averaged over the number of iterations: square root takes about 35 cycles per element, sine 52, and so forth.
If you look at what the competition publishes for the performance of x87 — these are essentially hardware implementations of these transcendental functions — their square root runs at about 38. Their exponential, depending on how you want to count, runs no less than 150 cycles to do the 2-to-the-x part, and there's a bit of massaging to get e to the x. Their logarithm is a winner, and otherwise we get all the wins in yellow.
Now those are just sort of raw x87 numbers. When you actually package these things into a library that takes account of rounding requirements and error flags, such as in GNU/Linux, the performance falls off a bit more. These G5 numbers are already in compliance with IEEE, so there's nothing further to say. That is LibM; that's GNU/Linux on Intel. On raw elementary function performance, I think the G5 wins — but I work on that stuff, so. There are some notes in our technical library: Tech Note 2086, tuning for the G5, and Tech Note 2087, a quick look at the G4 and G5 — if you're familiar with programming for the G4, that will get you bumped up to G5 in a hurry.
I see some note-takers finishing up on that. And there's some really nice documentation in the developer reference library for the Accelerate Framework and some of its individual components, vImage and vDSP, and a piece that Ian mainly maintains on the Velocity Engine that's a wonderful, general, gentle introduction to SIMD programming. Is there such a thing, Bob? I don't know. That's a good point.