Application • 41:40
The Accelerate Framework in Mac OS X is a powerful suite of platform-optimized libraries designed to provide high-performance mathematical facilities such as LAPACK, BLAS, and vMathLib, as well as image processing operations. This session will also discuss new advances and extensions to these libraries. The Accelerate Framework takes full advantage of G3, G4, and G5 processors to unlock best-of-class performance in your application. View this session to learn how the Accelerate Framework can take your application's performance to the next level.
Speakers: Ali Sazegari, Ian Ollmann, Robert Murley, Steve Peters
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it may have transcription errors.
Good afternoon. This is session 407, the Accelerate Framework. The Accelerate Framework was introduced at last year's developers conference. This year we'll rehash a little bit about what was introduced last year, but also give you part two. If you know anything about Ali's team, you know these are the guys who, in your calculus class, were the ones finishing the word problems or the homework before the class was even over. They love math. So I hope you guys enjoy this session. And with that, I'd like to introduce Dr. Ali Sazegari.
First, thank you very much for coming to our session. And what I'd like to talk to you about is the Accelerate framework and what we have done and we plan on doing for Tiger. The talk is in three parts. I'm going to give you a general overview and some snippets of results that we have. And after that, I'm going to pass it on to my colleague, Ian Ollmann, and he's going to talk about the image processing library, which was introduced last year in the Panther OS. And after that, I'm going to pass it to my other colleague, Steve, and he's going to talk more about the numerics and the linear algebra results that we have. So let's get started.
So as you know, we have had this particular configuration, the Accelerate framework, which is a collection of all the computational underpinnings of Mac OS X in Panther. We've had the vecLib section of it for a while. Last year we introduced vImage. The vecLib section had the signal processing, the linear algebra, the matrix computations, the BLAS, the large number computations, and the math libraries, which took hardware vectors, 128-bit vectors. We added image processing. I'm happy to tell you a lot of people are using our image processing inside Apple and outside, and we're going to talk a little bit more about that. One of the additions to the new operating system is vForce. With vForce, we've had a lot of calls from people who wanted to pass an array of elements to the elementary functions and get the elementary function results back, not just passing in one scalar at a time in hardware vectors. vForce is our new library, which we will talk about in depth later on. Steve will talk about that.
What is delivered in Mac OS X Tiger? Basically, the Accelerate framework is one-stop shopping for computational performance. Digital signal processing: we have expanded that in Tiger; we will have about 340 new functions in the vDSP sub-framework. We have digital image processing. We have expanded that also, with added performance in some of the core routines, such as convolution. The BLAS levels one, two, and three, if you're familiar with those: these are the basic linear algebra subroutines, the building blocks of the computations that people do for LAPACK. And the entire LAPACK: single and double, real and complex, for all of the routines.
Basically, this is the exact API that people who are using LAPACK are used to. vForce, the array elementary functions, we'll talk a little bit more about that today; that's new in Tiger. And vMathLib, the counterpart of regular libm: where libm runs on the scalar units, this one here runs on the vector engine.
I'm going to just touch on some of the performance improvements and performance numbers that we have right now in Tiger, in some of the CDs that you've had. First, I'd like to talk about vForce performance. These are vectorized elementary functions. We have single and double precisions. They're highly accurate. They operate on arrays instead of single elements or 128-bit hardware vectors. Monotonicity is observed over the entire range of definition. That's pretty important, because there are competitors which do have functionality such as this, but they have cut lots of corners, and developers have to worry about pitfalls of where to call or what not to call or what elements to call. Here, you're free to call anything. Basically, if it's in the floating-point domain, it will work; it will not trip you up, and it will not give you the wrong results. I have a small table here showing you what the benefits of vForce are. I'm quite proud of this particular piece of work that our group has done, and Steve will talk about this a little bit further. Square root is 3.1 times, over 3 times, faster than the current one. Exponential is over 6 times, and sine is 11 times faster. And square root was already pretty fast on G5s, but we have it even faster than this. And the reason that these things are faster is that we are able to fill in the bubbles in the computational structure of the algorithms: regular elementary functions just don't have enough data to go through, and you end up having a lot of empty cycles going by. This allows us to fill the pipeline up completely and have stellar performance. The next thing I'd like to talk to you about is LAPACK performance.
LINPACK. A lot of people know about these results. I just have a little bit of it here: the double-precision LINPACK 1000, a 1000-by-1000 matrix. We're above 5 gigaflops for double precision, and single precision is over 7.5 gigaflops. And this is on a 2.5 gigahertz PowerPC.
BLAS performance. The quintessential BLAS performance benchmark is really DGEMM, the double-precision generalized matrix-matrix multiply, which is an enhanced matrix multiply: a scalar times a matrix times a matrix, plus a scalar times a matrix. A lot of people like to look at that to see the prowess of the implementation. And what we have here is comparing that with Opteron, because I get asked how we compare with the competition. Higher numbers are better on this particular one. Double precision, size 5,500: if you multiply 5,500-size matrices, you will get 12.8 gigaflops on a PowerPC and over 7 gigaflops for the Opteron. Now, the Opteron that we had on our hands was a 2.0 gigahertz machine. We were unable to get hold of a live 2.4 gigahertz machine to run that stuff, so we just gave them exactly 20% up. Generally when frequency goes up that much, performance doesn't go up that much, but we did it anyway. So it's 12.8 versus 8.55. What I have here, just for fun, is: what would the SGEMM performance be? And the SGEMM performance on our machine is 23 gigaflops. I don't know if some of you know, I've been in computation for a while now, and 20 gigaflops before required many millions of dollars to achieve. But 23 gigaflops is just a pittance now.
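To make the DGEMM operation concrete, here is an unoptimized plain-C sketch of the arithmetic the BLAS call performs. The name `dgemm_ref` is hypothetical, for illustration only; the real tuned BLAS reaches its gigaflops through blocking, vectorization, and threading.

```c
#include <stddef.h>

/* Reference semantics of DGEMM: C = alpha*A*B + beta*C, with row-major
 * A (M x K), B (K x N), C (M x N). Illustration only; the tuned BLAS
 * in Accelerate is many times faster. */
static void dgemm_ref(int M, int N, int K, double alpha,
                      const double *A, const double *B,
                      double beta, double *C)
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            double acc = 0.0;
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

The "scalar times a matrix times a matrix plus a scalar times a matrix" description in the talk is exactly the alpha and beta terms above.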
You can buy yourself a PowerPC at 2.5 gigahertz and get that. Thank you. vDSP performance: our FFTs. We continue having a stellar collection of FFTs for our users to use. Single and double precision, real and complex, 1D and 2D, in place and out of place, and radix 2, 3, and 5. We have them hand-tuned for the vector engine, and we also have them hand-tuned for the dual scalar pipes. I'm comparing them with 3.2 gigahertz Xeons. This time around, we're not looking at gigaflops; we're looking at timing, real microseconds, because signal processing folks really don't care very much about what the throughput of floating point is. Because they're real-time folks, they like to find out exactly what the timing is. Single precision 1024 complex, which is always etched into my mind, is 4.56 microseconds versus 6.13 on a 3.2 gigahertz Xeon, and these are one processor only, because the FFT in floating point doesn't do that much work to dole it out to processors. Single precision 1024 real is 2.3 microseconds versus 4.27 microseconds. So you would think this is fast enough; why would you like to make this any faster? Just one example, the quintessential example that I like to give, is iTunes. iTunes uses our FFT at a tune of 1.2 million times per hour for your music. It's that real FFT that gets used. The more we shave off of that, the faster your decoding and encoding will go, and the more we shave off of computational things like FFTs and IMDCTs, the better your battery life will be. So it's pretty darn important to make sure this always runs extremely fast.
Image processing library. I'm very, very proud of this particular set of work. We set out and worked on this for a year for Panther and delivered it, and it's used in a lot of applications that we have in-house and outside. I just have a couple of little things in here. We have planar and chunky (kind of a funny word) ARGB interleaved formats, native support for 8-bit and floating-point samples, and it can be used in real time. It's multi-threaded, so you can do better on large images. I have a small table here for performance, comparing 8-bit image blurring against IPP, the Intel Integrated Performance Primitives that some of you might be familiar with. The 8-bit image blur is 5 and 1/2 times faster. The 8-bit image emboss is 2.2 times faster.
Also delivered in Mac OS X, let's not forget, are the underpinnings of regular computation. libm: we have standards-conforming APIs for IEEE 754 and C99, single and double. New in Tiger is our long-lost 128-bit long double friend, which is going to make an appearance again. And we have a really stellar implementation for that. Very, very accurate computationally.
All of these guys are numerically robust, highly accurate, worry about the environmental controls, and never mess up anything. And we take a lot of care to make sure that we conform to any existing standards. Best-of-breed algorithms, basically. Coding to libm in C is straightforward: you just call the compiler. You don't even have to say -lm. Using the Accelerate framework in C is also straightforward; all you need to do is add -framework Accelerate. So basically what I've done here in the last few minutes is just give you a small sampling of what we have in image processing, signal processing, BLAS, vForce, and LAPACK, and we're going to go into some of the details of this work as we go along. Now I'd like to pass this on to my colleague Ian Ollmann, who is going to talk more about the image processing.
Thank you. vImage was introduced last year at WWDC and shipped with Panther. And since then, we've gotten a lot of feedback on it, and we've taken your suggestions to heart, and so we've got more improvements for it now. vImage functionality remains much as introduced previously, with some new features added on. We still have native support for 8-bit and floating-point samples. These can be arranged either in a planar format, which is to say one channel per array, or a chunky format, which interleaves several channels. If you're doing 8-bit work on images, then we throw in saturated clipping at the ends of functions that can overflow, so you don't get the white-goes-to-black or black-goes-to-white problem. We put in a lot of effort thinking the design over to make sure that you can use these things in real time, that we don't arbitrarily call malloc; we give you the opportunity to provide us with a temporary buffer so you won't block on that, that kind of thing. We're also reentrant, so you can call us in a multi-threaded environment. And of course it's high performance, accelerated for AltiVec.
We provide a variety of image filters. We have convolutions; morphology functions that allow you to do edge detection or fill in holes, that kind of thing: min, max, dilate, erode. We do histogram operations with color balancing, and alpha compositing with some new functionality there. Geometric transforms: we do scales, rotates, shears, affine warps; you can distort the image in lots of different ways. We also do some color space conversions and data type conversions. So just to go over what you can do with convolution: depending on what kernel you provide the convolution filter, you can do all sorts of different operations, blurs, sharpens. You can do an emboss, which is essentially a first derivative over the image. You can do averaging or various other things. We've gone over and looked at the performance for Tiger, for G5 and for future processors, and we've done a lot of work to get the performance up. Now on your CD you'll find that the performance for the planar 8-bit case has substantially improved over what it was, and as the months go by in the near future we're going to push that forward on other things. So we've done a lot of work just on the brute-force computation of it, and we've also improved the algorithm a bit. It's a lot smarter about zeros in your convolution kernel.
Most people pass in a kernel that's 90% zeros, and they just kind of expect the library not to actually do work for the zeros. But it turns out, if you go look at all the high-performance convolutions out there, they do actually do work for the zeros. We've changed that around so we don't. So in many cases now, in a comparative study between our library and the other ones, you're going to see a very substantial improvement in ours over the others. So just to give you an idea, here's an example of a somewhat blurry image of Lisbon. You can apply a standard sharpening kernel, and it looks a little bit sharper; I don't know if it shows up well on this display. You can see the kernel there that we used, which accentuates the pixel in question over its neighbors. That's how you get the effect. And here's the kind of performance you can expect on that kind of thing. Here we have a competitive graph against Xeon. It's a little hard to read. It's a 3.2 gigahertz Xeon that we're comparing against, and we're looking at the Intel Performance Primitives library. Intel has already gone through and multi-threaded all this for you, so both of these are dual-processor results. Intel is the blue bar along the bottom; we've normalized its performance to one. And the speed of the G5, as you can see, with a dense kernel is the red line above it. So we're usually between one and three times faster than Intel for a dense kernel. And then for a sparse kernel like emboss, which is mostly zeros, we're up to eight times faster.
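The zero-skipping idea can be sketched in plain C: test each kernel tap and do no work when it is zero. This is only an illustration of the strategy (the function name `convolve_sparse` is hypothetical, and edge pixels are simply left untouched here); the real vImage convolution also handles edge modes, pixel formats, and vectorization.

```c
/* 2D convolution over a planar float image, skipping zero kernel taps.
 * src and dst are w x h; kernel is kw x kh (odd dimensions). Pixels
 * within half a kernel of the border are left as-is in this sketch. */
static void convolve_sparse(const float *src, float *dst, int w, int h,
                            const float *kernel, int kw, int kh)
{
    int hw = kw / 2, hh = kh / 2;
    for (int y = hh; y < h - hh; y++) {
        for (int x = hw; x < w - hw; x++) {
            float acc = 0.0f;
            for (int ky = 0; ky < kh; ky++) {
                for (int kx = 0; kx < kw; kx++) {
                    float kv = kernel[ky * kw + kx];
                    if (kv == 0.0f)
                        continue;        /* no work for zero taps */
                    acc += kv * src[(y + ky - hh) * w + (x + kx - hw)];
                }
            }
            dst[y * w + x] = acc;
        }
    }
}
```

For a mostly-zero kernel like an emboss, the inner loop does work only on the few non-zero taps, which is exactly where the speedup described above comes from.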
We also do morphology operations; we can do different kinds of shape-changing operations, that kind of thing. So here would be an example where we've got a nice picture, except for what looks like a power line up in the top left corner. Wouldn't it be nice if we could remove that? Well, there are lots of ways, but we'll just use morphology for this example. So we can apply a max filter, and max will go around, look at all the pixels around the pixel in question, and take the maximum value. The power line is a kind of dark feature, so as we apply the max filter, it just goes away. But you notice that some of the white highlights got bigger, so we can apply a min filter and subtract them back out again. And so you have something that looks like your original image back, except now the power line is completely gone. So you can do these for interesting effects, in addition to just shape-changing and that kind of thing. So here's performance on that. We've got a new algorithm for max, which works substantially better. Here you can see the 3.2 gigahertz dual-processor Xeon results, again normalized to 1, the red line across the bottom. And as the kernel size gets larger, you can see our performance relative to Xeon gets better and better; we're up to four times faster for really large filters.
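The max filter described above is simple to state in scalar C: each output pixel is the maximum over its neighborhood. This is a sketch with a hypothetical name (`max_filter`) and simple edge clamping; vImage's tuned implementation is far faster and follows its own API conventions.

```c
/* Grayscale max filter (morphological dilation): each output pixel is
 * the maximum of the (2r+1)x(2r+1) neighborhood, with the window
 * clipped at the image border. A min filter (erosion) is the same loop
 * with > replaced by < and m initialized to 255. */
static void max_filter(const unsigned char *src, unsigned char *dst,
                       int w, int h, int r)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            unsigned char m = 0;
            for (int dy = -r; dy <= r; dy++) {
                for (int dx = -r; dx <= r; dx++) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= h || xx < 0 || xx >= w)
                        continue;
                    if (src[yy * w + xx] > m)
                        m = src[yy * w + xx];
                }
            }
            dst[y * w + x] = m;
        }
    }
}
```

Running max and then min with the same radius is the "remove the dark power line, then shrink the grown highlights back" sequence from the example.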
We do alpha compositing. We can support either pre-multiplied images or non-pre-multiplied images; we have functions to pre-multiply and un-pre-multiply data. We've now added a few new functions for Tiger. You can mix a non-pre-multiplied image into a pre-multiplied layer, which allows you to do multiple stacks as you go along. And we added compositing with a scalar fade value, which allows you to fade the whole image without going through and writing over the alpha channel. So those will be available. We also have new type conversion features. This was actually surprising, at least to us: the number one requested feature. It seems that everybody has their own data format that they like to use, and so we've got a lot of conversions to get that in and out of what vImage likes to use. So we now can handle 24-bit, 8-bit-per-channel color.
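The core arithmetic behind premultiplied "over" compositing, which these functions build on, is tiny. Here is a one-pixel float sketch; the helper name, ARGB channel order, and [0, 1] value range are assumptions for illustration, not a vImage API.

```c
/* Premultiplied alpha "over": result = top + (1 - topAlpha) * bottom.
 * Channels in ARGB order (alpha first), values in [0, 1], both pixels
 * already premultiplied by their alpha. Illustration of the math only. */
static void over_premultiplied(const float top[4], const float bottom[4],
                               float out[4])
{
    float inv = 1.0f - top[0];          /* 1 - top alpha */
    for (int c = 0; c < 4; c++)
        out[c] = top[c] + inv * bottom[c];
}
```

Because the top layer is premultiplied, the same formula applies to the alpha channel itself, which is what makes stacking multiple layers work.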
Also the older ARGB 1555 and RGB 565 16-bit-per-pixel formats. We also do 16-bit-per-channel integer support, in signed and unsigned flavors. And we've introduced OpenEXR-compliant 16-bit floating-point conversion functions, in case you need to work with video cards that use those.
We've also added a few other things that allow you to insert channels into interleaved images or permute channels around; say you need to swap an ARGB image around to RGBA or something like that. So those things will be there; they'll be fully vectorized, and they operate at pretty much bandwidth-limited rates. We've also added color space transforms. We originally didn't put these in because we thought we would leave them up to ColorSync, but now ColorSync wants to use our code, so we have them in there. We have matrix multiplications, with saturated clipping for 8-bit, of course, to prevent overflow.
We allow you to put in an optional pre- and post-bias. Mathematically, the pre- and post-bias are the same, but it's a little easier to use that way, so we put that feature in. And again, like the convolution, this one only does work for non-zero elements, so you can safely pass in a rather sparse matrix and we'll just do the work we need to. We're also introducing a whole set of gamma correction functions, which come in a variety of flavors. You can get a generic power curve. We also provide a few specialty gammas like sRGB, which aren't exactly a generic power curve. These are available in two different formats: they're generally floating-point geared, but you can get them in either full 24-bit or 12-bit precision variants. The 12-bit precision, obviously, is appropriate for data that was 8-bit integer data to begin with. We also have a few functions to do a simultaneous 8-bit conversion with clipping while doing the gamma correction. And we provide interpolated lookup tables for cases where your gamma curve is not nicely described by a power function. So I'd like to invite Steve Peters up to talk about the numerics improvements for Tiger.
I'm going to take some time this afternoon to present the credentials of our math libraries. Perhaps some of you have not used them before and would like to know a bit about the motivation, and also spend some time with performance. Hey, it works. Excellent. So, you know, job number one for us is conformance: making porting your applications, building your applications, correspond to the experience you've learned on other platforms, learned in the classroom, learned from reading the standards. Who does that anymore? At the base, we're delivering platforms based on the G3, G4, and G5 chips, all of which have IEEE 754-compliant floating-point arithmetic, both single and double. When we move up one level to the elementary functions, the basic math libraries, these are also compliant with the C99 standard. All the required C99 APIs are present, for complex and long double as well, as we come into the Tiger world.
I'm going to have to use these. We build our linear algebra, the BLAS, the basic linear algebra subroutines, from ATLAS, the widely respected open-source package that is Automatically Tuned Linear Algebra Software. We offer the full panoply of APIs in float, double, complex, and double complex. And similarly for the gold standard of numerical computing, LAPACK: all routines in float, double, complex, and double complex, with entry points for both C and Fortran.
After conformance, we're really concerned with performance. And the flagship of performance now at Apple is the marvelous G5 CPU, the PowerPC 970, which offers dual floating-point cores (to my recollection, the first in Apple's line) and has given us really stellar performance. So on each 970 CPU, we find two floating-point cores capable of doing double-precision IEEE and single-precision IEEE.
On any machine cycle, both of those units can be pressed into action. We can start a floating point instruction down each pipe on both pipes in a single cycle. All the basic arithmetic operations, add, multiply, subtract, and divide are present. We also get hardware square root in the PowerPC 970. That's a real boon to us.
And there's another class of instructions that has been present in G4, and now as well in G5, called the fused multiply-add. Fused multiply-add takes three operands, multiplies the first two together, and adds the result to the third, all in the course of one instruction. So this ends up being a key operation, fundamental to linear algebra: the dot product is essentially multiply and accumulate, multiply, accumulate, multiply, accumulate. It's fundamental to the FFT in much the same way.
If you're doing a function evaluation by, say, polynomial approximation, you'll probably want to use Horner's rule. And if you think a little bit about the way Horner's rule works out, it's essentially a fused multiply-add win. And at the bottom line, we get to count two floating-point operations per fused multiply-add. So on a machine with two floating-point cores, we get four flops per cycle.
So let's see, four flops per cycle. I always have to do this in my head. Four flops per cycle, two CPUs in the dual G5, so that's eight flops per cycle across two CPUs, and we clock them at two gigahertz. So we top out at 16 double-precision gigaflops on a two gigahertz G5. And now that we're using 2.5s, I have to update my thinking: it's 20 gigaflops. Theoretical peak, theoretical peak.
So how do you get to this performance? How do you get to this great double-precision performance? If you've got an existing Apple Mac OS X binary, perhaps built for G4, just bring it across. The scheduling in the CPU is really smart: as the instruction stream comes along and we start seeing floating-point instructions, they get dispatched off to the dual pipes, and they will finish faster than if they were sent down a single pipe. So part of the answer is you don't have to do anything, and you should see some performance gain in existing binary apps.
Second, if you're able to recompile your app, say it's an open source application, a code you've developed, recompile with GCC, set the proper options that I'll point to in a tech note later, and let it schedule instructions in an even more optimal way for the G5, and you can see yet more gains.
It's also possible, by paying special attention to algorithmic details, to get even further gains. For example, if you're computing a rational function approximation, you may be able to arrange the calculation so that the numerator is computed simultaneously with the denominator on the two pipes. At the end, you just weld them together with the divide. We've already paid this level of attention to libm, the basic math library, our BLAS, our LAPACK, and the vForce library. Thank you.
Both our G4 and G5 platforms offer the AltiVec single-instruction, multiple-data processor. This is a four-way parallel single-precision engine. It doesn't do double precision, not at all; Ian keeps telling me this. It'll never do double precision. It's a single-precision engine with a huge appetite for floating point; it really just rips through floating-point calculations. All the basic operations are present, as well as a vector fused multiply-add. So now we get two flops counted for the fused multiply-add on four operands strung across the 128-bit vector. That gives us eight flops per cycle. Let's see, can I do the math in my head? For a two-and-a-half gigahertz G5, I think that tops out at 40 gigaflops, thank you. Forty gigaflops tops, all right.
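All the back-of-the-envelope peak numbers in this section come from one multiplication; a trivial sketch (hypothetical helper name) makes the arithmetic explicit:

```c
/* Theoretical peak: flops per cycle per CPU, times CPUs, times clock in
 * GHz. Scalar G5: 2 FP pipes x 2 flops per FMA = 4 flops/cycle.
 * AltiVec: 4 lanes x 2 flops per FMA = 8 flops/cycle. */
static double peak_gflops(double flops_per_cycle, int cpus, double ghz)
{
    return flops_per_cycle * (double)cpus * ghz;
}
```

This reproduces the 16, 20, and 40 gigaflop figures quoted in the talk for the 2.0 GHz dual scalar, 2.5 GHz dual scalar, and 2.5 GHz dual AltiVec cases respectively.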
How do you get to this performance? Well, sorry, you've got to do a little bit of work. You're going to have to learn a little bit about vector programming. There's an out that we've announced this week, but it helps to get in there with your code, understand where there's inherent parallelism in your algorithms, work those over with the SIMD instruction set, and pass them through the compiler. Our advice is always: profile first before you dig in. Find out where the 10% of the code is where you're spending 90% of your time, and go look at those. Shark is a wonderful tool for figuring out these cases. I hope you've seen Shark or plan to see a Shark talk sometime this week; they're playing in a theater near you, I'm sure.
Auto-vectorization is an option, and this slide was actually written before the announcement was made that GCC 3.5 will be offering some auto-vectorization features. Check those out; it may be a real boon to getting better use out of the SIMD unit on the G4s and G5s. There's also a third-party application called VAST that can analyze, I think, Fortran codes to discover inherent parallelism and emit the proper AltiVec code.
We've gone through at Apple and paid this kind of attention, algorithmic attention, recasting algorithms for our vForce library, our single-precision BLAS, our single-precision FFTs and digital signal processing algorithms, and heavily in vImage. Well, when you come to our platform as a developer and come to that final step, how do I access these wonderful libraries? Link, load, and go. We try to make that as straightforward as possible. The library APIs generally will internally dispatch for the correct platform, so we won't go off and try to execute code that's appropriate for a G5 on a machine that's a G3, for example. Generally, the rule is: if the API uses a hardware SIMD vector type, an AltiVec vector type, you're expected as a consumer of that API to know that you're on G4 or G5. Otherwise, we'll take care of that for you.
libm links by default; it's part of libSystem, so you don't need to say anything about that. For our long double and complex APIs, please add -lmx to your link line. And for vForce, the BLAS, LAPACK, vDSP, and vImage, the one-stop shopping place is the Accelerate framework: just add -framework Accelerate to your compile and link lines.
I know that's a popular flag, so I'll let you copy that down. Well, what's new for math in Tiger? What have we been working on? Ali hit the highlights of the vForce library. Basically, we've been told people don't want to do one square root at a time; they'd really like to do 768 at a time. Sure enough, there are advantages to be had when you can do many of these things at once. We also took a BLAS update, an update to ATLAS 3.6. This helped us in a couple of places. We, of course, do additional Mac OS X-specific tune-ups to that open-source drop, and our compiler technology improved (thank you, compiler team) to give us some nice gains, some nice and somewhat unexpected gains.
And because of the faster underlying BLAS and some improved compilation, our LAPACK is going faster too. Now, Ali always likes me to lead with the strongest graph, so I can give you a couple of performance numbers here. These are some numbers I collected for the 2.5 GHz dual-processor G5. It's a set of numbers that you'll see quite a bit in the computational linear algebra community. It measures matrix multiply, DGEMM, and then three decompositions: LU, and the symmetric decompositions, the LL-transpose Cholesky and the Crout, U-transpose-U.
For matrix multiply, we use various matrix sizes ranging from 500 up to roughly, I think, 9,000. We get our first plateau a bit over 11 gigaflops, and then there's an interesting jump around size 5,000 as we push up beyond 12 and into the 13-gigaflop range. The decompositions are a little bit less jumpy, a little less of a step function, but look like they're hitting an asymptote at around 10 gigaflops.
Well, what's the competition up to these days? Let's just look at matrix multiply. Again, in yellow is the dual 2.5 gigahertz G5, topping out at or above 12 gigaflops. On the bottom in blue is Opteron, a 2.0 gigahertz Opteron, and it gets to about seven gigaflops in the 2.0 model.
For the purposes of comparison, we know that they've got a 2.4 gigahertz part out there, and if they were allowed to perfectly scale, they'd hit that dashed white line and come in just a bit over 8 gigaflops. We expect to see that when we measure those machines. Dual 3.2 gigahertz Xeon is the green. Gets up a little bit above 10, probably touches 11 in a couple of those places. So 2.5 gigahertz G5 seems to dominate in the matrix multiply game quite handily.
This slide is a bit more busy, but again, the colors should be the guide here. Yellow again is G5, green is Xeon, and Opteron is in blue; again we've scaled Opteron by 20% for the white dashed line. G5 seems to dominate again. This looks a little bit out of place, but I'll mention it: Ali mentioned we did long double, and we'll also have the type-generic math functions. So that's good to know. So I want to come back to this vForce business.
As Ali alluded to, the elementary functions in libm (square root, cosine, sine, arcsine) take a single operand, do a fairly heavy amount of computation, and burp out a single result. It turns out that leaves bubbles in modern RISC pipelines. We say these C99 APIs are data-starved.
We're also required by IEEE 754 to have very careful control over the rounding modes and exceptions that might be generated in the course of such a computation, and that adds a fair amount of overhead. There are instructions there that have to synchronize the pipe to get that stuff right, and we pay a pretty good price for that.
So the idea in vForce is: let's pass many operands through a single call; maybe we can get some advantage there. So if we had 768 values in a vector x and we wanted to compute the single-precision floating-point sine of those things, we could call vvsinf, passing x, the count 768, and a place to stuff the answers, y. Or we might have 117 numbers we want the arctangent of, and there's a call for that.
We're going to insist on the IEEE default rounding modes, and we're not going to set any exception flags. So this is for close to the metal, high performance, go as fast as you can. We don't expect any big problems. And if there are any, well, we'll deal with them in some other manner than the IEEE approach.
So we also get some mileage here because, given multiple operands, we can pack them together into hardware vectors on the single-precision side and send them through the AltiVec engine. This is a very good thing. Similarly on the G5, we can make sure to utilize the two pipes as effectively as possible.
We do a lot of software pipelining, that is, arranging to fill all the available cycles on all the floating-point pipes. We unroll loops like crazy, and we've also taken some algorithmic approaches that favor calculation over table lookup and try to avoid branches like the plague. That makes these things go very, very, very fast. And as Ali pointed out, we have gains in square root to 3x, exponential to nearly 7x, and sine was almost 12x.
So some caveats, right? I mean, this is close-to-the-metal programming. Generally, the results are as accurate as libm, but they're not bitwise identical; don't expect to call both and compare for equality on a list of arguments. We handle almost all the edge cases for the special functions according to C99. The exceptions are a few places around signed zeros, what happens when plus or minus zero is passed to one of these routines. We make no alignment requirements, although you will get best performance if you can 16-byte align your data. Storage returned by malloc on Mac OS X is 16-byte aligned by default.
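If you are allocating buffers yourself and want that 16-byte alignment explicitly (rather than relying on Mac OS X's malloc default), POSIX posix_memalign is one way to guarantee it; a sketch with a hypothetical helper name:

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate n floats on a 16-byte boundary, the alignment the vector
 * unit and vForce prefer for best performance. Caller frees the result
 * with free(). On Mac OS X, plain malloc already returns 16-byte-
 * aligned blocks, so this is only needed for portability or emphasis. */
static float *alloc_floats_aligned16(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}
```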
This stuff is tuned for the G5; I mean, that's the performance flagship here. But the good news is it runs quite nicely on G4 and G3, and of course we dispatch internally to the appropriate routine. You don't need to worry about where you're running vForce routines; they just do the right thing.
So one final change of gears here is to come back to the elementary functions themselves, where we've done a bit of tune-up work. Here is a selected sample of probably the most used and most loved elementary functions in our library, and we report the number of G5 cycles on a random selection over a wide range of arguments, averaged over the number of iterations. Square root takes about 35 cycles per element, sine 52, and so forth.
If you look at what the competition publishes for the performance of x87 (these are essentially hardware implementations of these transcendental functions), their square root runs about 38. Their exponential, depending on how you want to count, runs no less than 150 cycles to do essentially the 2-to-the-x part, and there's a bit of massaging to get e-to-the-x. Their logarithm is a win for them, and otherwise we get all the wins, in yellow.
Now, those are just raw x87 numbers. When you actually package these things into a library that takes account of rounding requirements and error flags, such as in GNU Linux, the performance falls off a bit more. These G5 numbers are already in compliance with the prescribed IEEE behavior, so there's nothing further to pay. That is libm, that's GNU Linux on Intel, on the competitor's hardware, going quite a bit slower. So for raw elementary function performance, I think G5 wins. But I work on that stuff, so... Thank you. There are some notes in our technical library: Tech Note 2086, tuning for the G5, and Tech Note 2087, a quick look at the G4 and G5; if you're familiar with programming for G4, that will get you bumped up to G5 in a hurry. Okay.
I see some note takers finishing up on that. And there's some really nice documentation in the developer reference library for the Accelerate framework and some of its individual components, vImage and vDSP, and a piece that Ian mainly maintains on the Velocity Engine that's a wonderful, gentle, general introduction to SIMD programming. Is there such a thing, Bob? I don't know. All right. That's a good point. Okay.