Core OS • iOS • 41:31
The ARM architecture used in iPhone 3GS, iPad and iPod touch has a powerful vector unit that can be used to improve the performance and energy consumption of array computations. Learn how to use the Accelerate framework to take advantage of the vector unit while avoiding the hassle of writing processor-specific code.
Speaker: Steve Canon
Unlisted on Apple Developer site
Transcript
Good morning, everyone. My name is Steve Canon. I'm a Senior Engineer in the Vector Numerics Group. And today, I'm going to be talking to you about the Accelerate framework for iPhone OS. I'm going to start off with a little bit of an overview of the ARM architecture, then we're going to do a very high-level overview of what's in the Accelerate framework, and we're going to have a few examples of how to use it effectively.
What I really want you to take away from this is, first, what's in the Accelerate framework. It's new to iOS 4, so if you haven't developed for the Mac before, you're probably not familiar with it. I want you to learn where to look for references and documentation on the APIs that are in there, and I also want to give you some pointers on using Accelerate effectively. So let's start with a little bit about the ARM architecture.
Apple has shipped two versions of the ARM architecture. The first one we shipped was ARMv6. This was used in the original iPhone, the iPhone 3G, and the first- and second-generation iPod touches. ARMv6 has a general purpose integer unit with 16 registers, and it also has hardware floating point support, called VFP, with 32 single precision registers and 16 double precision registers.
So it's got hardware support for both single and double precision. Now, in the newer products, we're using an architecture called ARMv7. This is used in the iPhone 3GS. It's used in the third generation iPod touch, it's used in the iPad, and it's also used in the iPhone 4.
ARMv7, like ARMv6, has an integer unit with 16 registers. Unlike ARMv6, it's capable of executing two instructions simultaneously. It also has legacy support for the VFP instructions, which were the hardware floating point model on ARMv6. But it also has a new SIMD unit, which is called NEON. SIMD stands for Single Instruction, Multiple Data. The NEON unit has 16 128-bit registers, and all single precision floating point on the ARMv7 architecture executes on the NEON unit. NEON does not have double precision support, so double precision happens on the legacy VFP unit.
A NEON register, as I said, is 128 bits wide. But the operations that the NEON instructions perform don't treat it as a 128-bit block. Because these are SIMD registers, what the instructions do is treat the register as though it were several smaller fields, and perform the same operation on each of those fields.
So there are instructions that operate as though it were sixteen 8-bit fields, as you see here; there are instructions that operate as though it were eight 16-bit integer fields; there are instructions that treat it as four 32-bit integers, or as four 32-bit floating point numbers, that is, single precision floating point; and there are even a few instructions that treat it as two 64-bit integers. Not a lot, but there are a few.
So this is an example of a NEON instruction. This is the vadd.i16 instruction, a 16-bit integer add. What this instruction does is take two 128-bit registers, treat them as though they each contain eight 16-bit integers, and simultaneously add all eight pairs of integers to give you eight results.
Another example, a floating point example: this is the vmul.f32 instruction, and it takes two 128-bit registers, treats them as though they contain four 32-bit floats each, and simultaneously multiplies them to give you four floating point results. So those are a couple of examples from NEON. There are several different types of load and store instructions. NEON actually has a really rich set of loads and stores. It has good aligned and unaligned load and store support, it has strided loads, it's got all kinds of stuff.
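To make that concrete, here's a minimal sketch in C of those two examples, including the loads and stores, using the arm_neon.h compiler intrinsics. The wrapper function names are just illustrative; only the intrinsics and types come from the toolchain.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Eight 16-bit integer adds at once, like the vadd.i16 example. */
    void add_eight_i16(const int16_t *a, const int16_t *b, int16_t *result)
    {
        int16x8_t va = vld1q_s16(a);              /* load eight 16-bit integers */
        int16x8_t vb = vld1q_s16(b);
        vst1q_s16(result, vaddq_s16(va, vb));     /* eight adds in one operation */
    }

    /* Four single precision multiplies at once, like the vmul.f32 example. */
    void mul_four_f32(const float *a, const float *b, float *result)
    {
        float32x4_t va = vld1q_f32(a);            /* load four floats */
        float32x4_t vb = vld1q_f32(b);
        vst1q_f32(result, vmulq_f32(va, vb));     /* four multiplies in one operation */
    }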
It has single precision floating point arithmetic, all the basic operations and conversions, and it also has fast reciprocal and square root estimate instructions. As I said before, it does not have double precision arithmetic. And it also has a fairly complete set of integer operations. It's got 8-, 16-, and 32-bit, signed and unsigned integer operations, and it's got all your favorite things: multiply-accumulate, multiply, add, subtract, reverse subtract, wide multiplies. And then it also has a few more specialized things that are useful in fixed point arithmetic and digital signal processing applications.
NEON, however, does not have double precision. So on ARMv7, all double precision happens on the VFP unit, and that is much slower than the NEON unit. Another thing about NEON is that although NEON instructions typically consume more power than scalar or general purpose instructions do, they tend to consume less energy overall if your code is written well. The reason for that is that energy is power consumed over time.
If your code is using general purpose instructions and you graph the power consumed versus time, you typically see a graph like this: initially the processor is idle, it ramps up to do some work, it consumes power at a relatively steady rate for the time it takes to do the work, and then it ramps back down.
Now, if you use NEON effectively, if you take good advantage of it and write good code, what happens is you consume much more power, or sometimes only a little bit more power, but for much less time. And so, by using NEON efficiently, you're able to use less energy overall, which makes your battery last longer, which is something we all like. So that's a little overview of NEON. And now, we're going to dive into the Accelerate framework.
The Accelerate framework was introduced at the iOS 4 launch event as having 2,000 APIs for hardware accelerated math functions. Now, what kind of math functions are we talking about? The Accelerate framework in Mac OS X is an umbrella framework that consists of many libraries. Not all of those libraries are going to be available in iOS 4. We're bringing three of them to iOS 4 right now: the vDSP, LAPACK, and BLAS libraries.
Let's start with vDSP. It's the digital signal processing library, and it provides a lot of basic operations as well as more complicated operations. We're going to start with one of the absolute simplest functions in the vDSP library just to give you a feel for it: the dot product.
You may be familiar with the dot product already. If you're not, I'll remind you. The dot product is the sum of the products of corresponding elements from two vectors. So you have two vectors as your inputs, I've called them A and B here, and it gives you a scalar as an output.
And the way you get that is by multiplying the first element of one vector by the first element of the other vector, the second element of one by the second element of the other, the third element by the third element. Take all those products, sum them together, and that's the dot product of the two vectors.
So suppose we need to use this in our program. The dot product is a really basic operation, it gets used all the time: it's used heavily in audio processing, it's used heavily in lighting calculations for games, it gets used all over the place. So say you're writing a program, and you need to compute it.
You might just write a simple for loop to compute it. This is a simple piece of C99 code to compute a dot product: we just iterate over the length of the two vectors, take the products, and add them together. What could be easier?
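The slide code itself isn't in the transcript, but based on that description it's presumably something like this minimal C99 sketch:

    #include <stddef.h>

    /* Plain scalar dot product: one multiply and one add per iteration. */
    float dot_product(const float *a, const float *b, size_t length)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < length; ++i)
            sum += a[i] * b[i];
        return sum;
    }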
This is fine, but I think we can do better. So we can use Accelerate to compute the dot product. Here, I have included the Accelerate headers so that I get the definitions we need, and I've called this function vDSP_dotpr, which stands for dot product. It takes pointers to the two arrays, A and B; there are these two arguments that are 1, which we're not really going to worry about right now, but we'll talk about them later; it takes a pointer for the return value, the dot product; and it takes the length as an argument.
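Again reconstructing from the description, the Accelerate version is roughly this; vDSP_dotpr is the real routine, the wrapper around it is just for illustration:

    #include <Accelerate/Accelerate.h>

    /* Same computation, handed off to vDSP. The two 1 arguments are the
       strides for A and B; we'll come back to those later. */
    float dot_product_vdsp(const float *a, const float *b, vDSP_Length length)
    {
        float result;
        vDSP_dotpr(a, 1, b, 1, &result, length);
        return result;
    }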
Now, I haven't really saved a lot of code by doing this. My code is about as long as it was before. It's not a lot simpler. Maybe it's less likely that I introduced a bug, because I'm using a library function instead of writing it myself. But why should you do this? The first thing that most people would probably think of is performance. So let's take a look at the execution time on the ARMv7 architecture. What I've got here is the execution time for a length 1024 dot product on ARMv7. This is something that would typically be used in audio processing applications. And the vDSP_dotpr function is about eight times faster than that simple for loop that we started out with.
So that's great. What else did we get? This is my favorite: we used a lot less energy. This is, again, that same 1024 element dot product on the ARMv7 architecture, and vDSP_dotpr is consuming a little less than one quarter of the energy that the for loop would. So that means that, hypothetically, if your application were doing nothing but dot products, you could do more than four times as much work on the same battery. So that's fantastic.
Now, what about on older processors? What about the ARMv6 architecture? As I mentioned, it has a very different floating point unit than the ARMv7 does. And code that performs well on one may not perform well on the other. Here's the graph where we're looking at the performance on ARMv6.
And that same code, just calling the Accelerate framework dotpr function, gives us great performance on the ARMv6 architecture as well as on ARMv7. So you don't need to write a lot of architecture-specific code. Now, there may be some graphics programmers out there in the audience who are saying, "This guy is crazy. Why would you want to do a 1024 element dot product? I want to do a three element dot product."
So I'll tell you right up front that that's pretty much the worst use case possible for the Accelerate framework, because you're only going to do three elements' worth of work, and you're going to pay for a function call and all this overhead. The good news is that you still get pretty decent performance, more than twice as fast as that simple for loop would be.
Now, I would say you shouldn't use it in this case. But if you do, the performance will be good. Let's take a look at what makes the dotpr function faster. This is that simple for loop that we started out with, and this is the disassembly of the compiled code that resulted from it.
Now, I'm going to highlight the instructions that are doing the critical work in the loop, the actual floating point operations that we're interested in. There they are. One thing you'll notice right away is that the compiler didn't do a great job of structuring this loop, so it's using a lot of overhead for the computation. Now, let's take a look at what it is doing.
These first two instructions, the flds instructions, load the floating point value from each array. Then we multiply those two values together with the vmul.f32 instruction, and then we add that product to a running sum with the vadd.f32 instruction. I mentioned the overhead. There's another thing that makes this loop slow, which is that the multiply comes immediately after the loads, but it can't execute until the data that's being loaded is in a register.
So in the best case scenario, where those loads are coming from the L1 cache, it's still going to take about three cycles before that data is loaded into a register and the vmul instruction can execute. So your processor is doing nothing at all for three cycles. Now, three cycles isn't a lot, but if this is a loop that gets called a lot, your processor is doing nothing at all for three cycles every iteration of the loop. That's wasting energy.
The same thing happens on the add instruction. Again, we're waiting for the result of the multiply before we can add it to the accumulator. So for a couple of cycles, the processor is doing no useful work. It's just spending energy. Let's take a look at the vDSP_dotpr function and see what we did to do better.
This is the critical inner loop of the vDSP_dotpr function. Not all cases fall through this: if you have a very short vector, you won't end up in this loop, and if your vectors are misaligned, you won't end up here. But in general, most of the time, this is the core loop that's doing the bulk of the work.
Again, I'm going to highlight the instructions that are doing useful work here. The first thing to notice is that we have a lot less overhead. Whereas that simple for loop was computing one term of the dot product every iteration and had six instructions' worth of overhead, here we have only two instructions of overhead, but we're computing 16 terms of the dot product every pass through this loop.
Now, how are we doing 16 at once? This loop actually has four separate chains of computation interleaved with each other. I'm going to highlight one of them. This is going to compute four terms of the dot product using the NEON vector instructions that we were talking about before.
So first, we load four floats simultaneously from A, we load four floats from B, we multiply those four pairs of floating point numbers together to get four pairwise products, and then, this actually happens on the next iteration through the loop, we add those four products to the four running sums that we're maintaining.
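To give a feel for what that inner loop is doing, here's a rough sketch of one of those four-wide chains written with NEON intrinsics. The real routine interleaves four such chains, handles alignment and leftover elements, and is hand-scheduled in assembly, so treat this as an approximation of the idea rather than the actual implementation:

    #include <arm_neon.h>
    #include <stddef.h>

    /* One four-wide chain: four partial sums carried in a NEON register,
       four terms of the dot product per iteration. Assumes length is a
       multiple of 4; the real routine also handles the leftovers. */
    float dot_product_neon(const float *a, const float *b, size_t length)
    {
        float32x4_t sums = vdupq_n_f32(0.0f);
        for (size_t i = 0; i < length; i += 4) {
            float32x4_t va = vld1q_f32(a + i);    /* four floats from A */
            float32x4_t vb = vld1q_f32(b + i);    /* four floats from B */
            sums = vmlaq_f32(sums, va, vb);       /* sums += va * vb, elementwise */
        }
        /* Reduce the four partial sums to one scalar. */
        float32x2_t pair = vadd_f32(vget_low_f32(sums), vget_high_f32(sums));
        pair = vpadd_f32(pair, pair);
        return vget_lane_f32(pair, 0);
    }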
Now, you could do that yourself. You can write that code; it's not terribly hard, but it takes some work. If you use Accelerate, it's really easy. You get all the performance benefits of that, but you just have to write that one function call. I think that's great. What else is in the vDSP library? Besides dot products, we have lots of other basic operations on arrays.
You can add, subtract, and multiply arrays of floating point data, there are conversions between types, there's an accumulation function like the dot product, there are also things that sum arrays, and there's lots of other stuff. You'll have to look through the manual to learn about all of them. There are also more complicated functions that are not so easy to write yourself. There are convolution and correlation functions, and there's also a nice set of fast Fourier transforms.
There are both complex-to-complex and real-to-complex Fourier transforms, and there's support for any power-of-two size and some non-power-of-two sizes. The data types supported by vDSP include both single and double precision, real and complex data, and there's also support for strided data access. Those are the two 1s that we saw in the vDSP_dotpr signature before. What strided data access does is let you operate not on every element of an array, but on every nth element of an array.
So if I wanted to operate on the 0th element, the 2nd element, the 4th element, the 6th element, I would use a stride of 2. If I wanted to operate on every third element, I would use a stride of 3. Sometimes you need to do this because your data comes in in a format where it's laid out that way. Strided data access is there as a utility so that you can still use the Accelerate functions without having to deinterleave your data first, as in the sketch below.
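As a hedged illustration, assuming interleaved stereo samples (a scenario of my own, not from the session), a stride of 2 picks out one channel without a separate deinterleave pass:

    #include <Accelerate/Accelerate.h>

    /* Interleaved stereo: L R L R ... A stride of 2 touches every other
       sample, i.e. one channel, with no separate deinterleave step. */
    void channel_dotpr(const float *a, const float *b,
                       float *result, vDSP_Length frames)
    {
        vDSP_dotpr(a, 2, b, 2, result, frames);
    }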
Alright. So now, let's take a look at the FFT. Here, I've got the basic setup for calling an FFT. The first thing is, I'm including the Accelerate header, and I'm also including the standard library header because I'm going to use malloc. The FFTs in vDSP actually take the length as the logarithm base 2 of the length. I'm going to perform a 1024 element FFT.
The log base 2 of 1024 is 10, so I'm just constructing a variable to hold the length here, and also the log of the length. The FFTs take their input and output in these DSPSplitComplex structures, which have a pointer to the real parts and a pointer to the imaginary parts.
By laying them out this way, we can make things a little bit more efficient in the body of the FFT. Here I'm just allocating space for 1024 real parts of the input, 1024 imaginary parts of the input, 1024 real parts of the output, and 1024 imaginary parts of the output.
This last line is creating an opaque structure called an FFTSetup. The setup contains weights and other data that are needed by the FFT routines to do their computation. You have to create a setup before you can do an FFT, but you really only want to create a setup once. Creating a setup is expensive and slow, so what you want to do is make one setup and then use it to perform many FFTs. You can reuse it over and over again for lots of FFTs. Now, I have that setup.
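Pulled together from that description, the setup stage looks roughly like this; the helper function and constants are mine, while the vDSP types and calls are the real API:

    #include <Accelerate/Accelerate.h>
    #include <stdlib.h>

    #define LOG2N 10              /* log base 2 of the transform length */
    #define N     (1 << LOG2N)    /* 1024-point FFT */

    /* Allocate one split complex buffer: separate arrays of real and
       imaginary parts. The helper name is mine, not part of vDSP. */
    DSPSplitComplex alloc_split_complex(size_t n)
    {
        DSPSplitComplex v;
        v.realp = malloc(n * sizeof(float));
        v.imagp = malloc(n * sizeof(float));
        return v;
    }

    /* Create the setup once and reuse it; creation is the expensive part. */
    FFTSetup make_fft_setup(void)
    {
        return vDSP_create_fftsetup(LOG2N, kFFTRadix2);
    }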
Suppose I had put some useful data into my input array, something that I want to take a Fourier transform of. I'm going to call this function, vDSP_fft_zop. That stands for complex-to-complex, that's the Z, and out of place, that's the OP. So it's not going to overwrite its input; it's going to store the result of the FFT in the output structure.
So I pass it the setup that I created, I pass it the input with 1 as the stride, I pass it the output, again with stride 1, I give it the length as a log base 2, and this flag, FFT_FORWARD, tells it to perform a forward FFT.
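As a sketch, assuming the setup and buffers from above, the forward transform call is roughly:

    #include <Accelerate/Accelerate.h>

    /* Out-of-place, complex-to-complex forward FFT: reads in, writes out.
       Both strides are 1, and the length is passed as its log base 2. */
    void forward_fft(FFTSetup setup, DSPSplitComplex *in,
                     DSPSplitComplex *out, vDSP_Length log2n)
    {
        vDSP_fft_zop(setup, in, 1, out, 1, log2n, FFT_FORWARD);
    }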
That's all you have to do. Now, suppose I then want to do the inverse transform and get back to where I started. Here, I'm using the fft_zip function instead of zop. Zip is a complex-to-complex, in-place FFT, so it overwrites its input with the result of the transform.
So what this is going to do is take that same setup, and the call looks almost the same as zop, except it only has one DSPSplitComplex structure. It's going to take the output of the forward transform that we did and overwrite it with the inverse transform of the forward transform, and we give it the FFT_INVERSE flag to tell it that it's doing an inverse Fourier transform. Now, you might think that the inverse transform of the forward transform for a discrete FFT is the original data. And it almost is, except that you have to rescale. Everything gets scaled by a factor of N when you do a forward followed by an inverse FFT.
Here, I'm using another vDSP function, vDSP_vsmul, to undo that scaling. I create a scale factor of 1 over N, and vsmul stands for vector-scalar multiply. So the first line takes the real pointer and scales all of the real results by 1 over N.
The second one takes the imaginary pointer and scales all the imaginary results by 1 over N. So now, up to floating point rounding, we've gotten back the data that we started with. Usually, you would do some useful work with the data at some point here, rather than just going forward and back.
But I leave it up to you to come up with cool things you can do in your app with the FFT. That's not my department. This last line, vDSP_destroy_fftsetup, just cleans up the data in that opaque structure that was allocated by vDSP_create_fftsetup. It's basically free for an FFT setup.
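Putting those last steps together, a hedged sketch of the inverse transform, the rescaling, and the cleanup might look like this (in a real app you'd keep the setup around for reuse rather than destroying it right away):

    #include <Accelerate/Accelerate.h>

    /* In-place inverse FFT, then undo the factor of N picked up by a
       forward-then-inverse round trip, then clean up the setup. */
    void inverse_and_rescale(FFTSetup setup, DSPSplitComplex *data,
                             vDSP_Length log2n)
    {
        vDSP_Length n = 1 << log2n;
        float scale = 1.0f / (float)n;

        vDSP_fft_zip(setup, data, 1, log2n, FFT_INVERSE);       /* overwrites data */
        vDSP_vsmul(data->realp, 1, &scale, data->realp, 1, n);  /* scale real parts */
        vDSP_vsmul(data->imagp, 1, &scale, data->imagp, 1, n);  /* scale imaginary parts */

        vDSP_destroy_fftsetup(setup);   /* basically free() for an FFTSetup */
    }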
So that's an example of the FFTs. What about performance? Now, I can't really compare the FFTs to a simple for loop in C; they're a lot more complicated. And even if I did compare them to a simple C program, the C program would be so much slower that it wouldn't even really show up on the graph.
So instead, I'm going to compare to FFTW. FFTW stands for the Fastest Fourier Transform in the West. It's pretty much the best commercial portable FFT library. You can build it for PowerPC, you can build it for Intel, you can build it on ARM, you can build it on all kinds of platforms, and it gives you really quite good performance on all of those platforms.
If you want to learn more about it, you can go to fftw.org and download it there. It's a really nice library. So how do we compare? Here, I'm graphing the execution time for a 1024 element single-precision FFT. This is something that would typically be used in audio processing, but also in a lot of other things.
A lot of audio processing, both on the Mac and on the phone, uses exactly this case very frequently. And smaller bars are better here; we're looking at the actual time it takes to execute. vDSP_fft_zop is five times as fast as FFTW. FFTW is an excellent library, but we're better.
[ Applause ]
I'm not holding my breath for them to rename it, though. Maybe the Faster Fourier Transform in the East. So that's the FFTs. Let's talk a little bit about LAPACK, which stands for the Linear Algebra Package. Now, LAPACK is a very venerable library. Its roots go way back in time; they predate the C language, even.
It's been around a long time in one form or another. It provides high-level linear algebra operations. It can solve linear systems: if you have a system of linear equations, four equations and four unknowns, LAPACK can solve it. It can also solve it even if you have six equations and four unknowns, or two equations and eight unknowns, or whatever. It's got all kinds of solvers.
It provides matrix factorizations. It lets you compute eigenvalues and eigenvectors. So it provides a lot of the abstract linear algebra operations that you'd like to be able to do on matrices. LAPACK supports a bunch of different data types. It supports both single and double precision, it supports real and complex data, and it has lots of support for special matrix types. You can do operations on symmetric matrices, triangular matrices, banded matrices, Hermitian matrices, lots of stuff.
There's one gotcha, though. Data in the LAPACK library, because its roots go way back in time, is laid out in column-major order. Anytime you're building matrices that are going to interact with LAPACK, you're going to need to lay them out this way. And what does that mean? Those of us who speak western languages are used to reading left to right, top to bottom. So we might read this matrix as 1, minus 1, 1, minus 1, 1, 2, 4, 8. That's not the order that's used by LAPACK. Column-major order means that the first column is stored contiguously in memory first.
Then the next column, and the next column, and the next column. So each column is contiguous, whereas if you're reading across a row, the access is going to be strided. Let's look at an example of using LAPACK to solve a system of linear equations. This is one of the most basic things that LAPACK has support for; it also has support for lots of much more complicated stuff. So here, I've got a matrix A and a vector B, and I want to solve the equation Ax equals B. So I'm looking for a vector x that satisfies this.
First thing, I'm going to set up my matrix A. I've got two nested for loops here, iterating first over the rows and then over the columns. And here, I'm setting up the lower triangle; those are all the minus one elements of that matrix. Then I'm going to set up the diagonal elements, these are all one, and then the rightmost column, all ones as well, and then we have these inner upper triangle elements that are zeros.
So I've set up my matrix here. You'll notice that in the matrix access I have here, I'm using A[j][i]. Normally, if you took a linear algebra class, you're probably used to writing A[i][j] when you're describing the ith row and jth column of a matrix. They're transposed explicitly here so that we get that column-major ordering.
This is one easy way to get column-major access. You can't always use this trick; sometimes, you need to actually explicitly write out your accesses. But when you can use this, it works nicely. So now, let's look at actually solving that system. We've got our right hand side vector here, B. And then we make a call to the sgesv function. sgesv stands for single precision, general, solve.
So our data here is floats; that's single precision. General is what we use for just a normal rectangular matrix; it's not anything fancy, it's not symmetric, it's not banded. If you're just working with a matrix and you don't know anything special about it, you're always going to be looking at these routines that have the GE prefix. Now, what we pass to the sgesv routine is the size of the system and the number of right hand sides; that's the number of vectors B that we're solving for, which is nrhs. We give it the matrix A.
The next parameter here, n, is the leading dimension of the matrix. Most of the time, this is just going to be the dimension of your matrix. If you're working with a matrix that you built yourself, it's almost always going to be just the dimension of the matrix. In special circumstances, it may be something else; I'll leave you to look at the LAPACK documentation to learn about that.
We also give it our vector B that we're solving with. We give it this info parameter, which is a value that it uses as a flag to indicate whether it was able to solve the system. Not all systems have solutions, so if it can't solve it, info will return with a nonzero value.
There's also this one value, ipiv, which is a temporary workspace needed by the routine. On return, it contains some of the factorization information for the matrix. So that's all we do, and then I'm going to print out the result. Note that the result has overwritten the right hand side B that we started with.
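Here's a self-contained, hedged sketch of this kind of sgesv call. It uses a small 3 by 3 system I made up rather than the matrix from the slides; note the Fortran heritage, with every argument passed by pointer and the matrix stored column-major:

    #include <Accelerate/Accelerate.h>
    #include <stdio.h>

    int main(void)
    {
        /* Solve A x = b for a small made-up system (not the slide's matrix).
           A is stored column-major: each column is contiguous in memory. */
        __CLPK_integer n = 3, nrhs = 1, lda = 3, ldb = 3, info = 0;
        __CLPK_integer ipiv[3];

        float A[9] = {
            2, 1, 1,     /* first column  */
            1, 3, 1,     /* second column */
            1, 1, 4      /* third column  */
        };
        float b[3] = { 4, 5, 6 };   /* right hand side; overwritten with x */

        sgesv_(&n, &nrhs, A, &lda, ipiv, b, &ldb, &info);

        if (info == 0)
            printf("x = %f %f %f\n", b[0], b[1], b[2]);
        else
            printf("sgesv_ failed, info = %d\n", (int)info);
        return 0;
    }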
This is something that happens throughout LAPACK. Almost all the LAPACK routines overwrite some of their inputs to hold the results. This means that if you need to keep the inputs around, you're usually going to need to make a copy of them when you're working with LAPACK. That's something important to be aware of.
So I have compiled it here, and I'm going to run it. You'll note that I'm linking against the Accelerate framework; you have to link against the Accelerate framework in order to take advantage of it. Most of the time, you're going to be doing this in Xcode, so you just need to go to the Add Frameworks menu item, select the Accelerate framework, and bring it in. Here, I'm compiling on the command line, so I used the "-framework Accelerate" flag.
So I'm going to build this and run it, and it solves our linear system. This happens to be a very simple system. It has a very simple solution. But you can use it for all kinds of more complicated stuff. So that's a little overview of LAPACK. Let's talk about BLAS.
BLAS stands for the Basic Linear Algebra Subroutines. It's sort of the low-level underpinning beneath LAPACK, but you can also call those routines yourself. As I mentioned, it's low level: it provides low-level linear algebra operations. LAPACK gave you high-level, abstract operations on matrices: solve this system, factor this matrix. BLAS is the low-level stuff, really the basic arithmetic operations on matrices. It gives you vector-vector operations like dot products, scalar products, and vector sums.
It gives you matrix-vector operations, like if you want to multiply a vector by a matrix, or if you want to take the outer product of two vectors to update a matrix. And it also gives you operations on two matrices, like matrix multiply. BLAS, like LAPACK, supports lots of different data types: both single and double precision, real and complex data. And it has support for multiple data layouts; unlike LAPACK, it supports both row- and column-major order.
This is good news, because it means that if you're not working with LAPACK, if you're only going to use the BLAS routines, you can lay out your data in row-major order, which is the natural order for most C programmers, and just use it easily that way. If you're going to work with LAPACK as well, though, you're probably going to want to lay your data out in column-major order so you don't have to explicitly transpose it when you're passing data to the LAPACK routines. Like LAPACK, it has support for dense matrices, banded matrices, triangular matrices, lots of different matrix types.
Another nice thing about the BLAS routines is that they can operate on your matrix data as though it were transposed, or as though it were a conjugate transpose. So you often don't need to explicitly transpose the data yourself when you're working with them. This is really useful, and if you're using the BLAS, I encourage you to take advantage of this, because transposition is reasonably expensive, and in general, we can do it more efficiently if we do it as part of whatever other operation you're doing.
This is a really, really basic example of the BLAS. This is matrix multiply, and it's multiplying two 2 by 2 matrices, A and B, together here to get a matrix C. This is a 2 by 2 example; you can do 1000 by 1000, whatever you want. It's pretty easy to use, it's got a clean C interface, and it's a little nicer to work with than LAPACK.
Here, I'm passing it the CblasRowMajor flag, which tells it that my matrices are laid out in row-major order. The CblasNoTrans flag indicates that I don't want it to do a transposition as part of the multiplication. If I wanted to implicitly transpose one of the matrices, I would pass it a CblasTrans flag.
The sea of 2s in the call are the dimensions of the matrices; the first three are the dimensions of A, B, and C. The floating point 1 value is a scale factor to apply to the product. So I'm saying multiply A times B and scale that by 1. The 0 before the C is a scale factor to apply to C.
So here, it's actually going to compute A times B plus a scaled C; I'm scaling C by zero, so the result I get is just A times B. It's a basic matrix multiply. Putting the whole call together, it looks something like the sketch below.
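This is a hedged sketch with placeholder matrix values of my own; the cblas_sgemm call itself is the real interface:

    #include <Accelerate/Accelerate.h>
    #include <stdio.h>

    int main(void)
    {
        /* Row-major 2 by 2 matrices; computes C = 1.0 * A * B + 0.0 * C. */
        float A[4] = { 1, 2,
                       3, 4 };
        float B[4] = { 5, 6,
                       7, 8 };
        float C[4] = { 0, 0,
                       0, 0 };

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2,     /* M, N, K                        */
                    1.0f,        /* scale factor on the product    */
                    A, 2,        /* A and its leading dimension    */
                    B, 2,        /* B and its leading dimension    */
                    0.0f,        /* scale factor on the existing C */
                    C, 2);       /* C and its leading dimension    */

        printf("%f %f\n%f %f\n", C[0], C[1], C[2], C[3]);
        return 0;
    }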
That's a very basic example of what's available in the BLAS. Let's talk a little bit about how to use Accelerate effectively. First off, why should you use Accelerate? We saw some things already: Accelerate gives you great performance, assuming you use it properly. Another thing that's really nice about it is that it gives you functionality. The dot product is really easy to write; you can write your own dot product in a couple of seconds. But if you want to compute the eigenvalues of a matrix, or you want to compute a 1024 element FFT, those are not easy routines to write. If you write them a lot, maybe you can write them off the top of your head.
But most people are going to need to go look on Wikipedia or dig out an old reference book from college, something like that, find out how to implement it, and then spend a couple of days debugging it. You really don't want to do that. You want to spend your time writing a great app. So take advantage of Accelerate: all these routines are written for you already, and you can spend your time adding features and doing other cool stuff that users of your app will appreciate.
It's also easy to use. This goes back to what I was just talking about. You don't need to know a lot about what's going on behind the scenes. If you do know a lot, that's great; it will make it even easier for you to find what you're looking for.
But if you don't know a lot about these abstract mathematical operations that you're performing, you can still take advantage of Accelerate and get good performance and good energy usage, and you're also going to get architecture independence. Normally, if you want to get great performance and great energy usage, you have to write architecture-specific code. This goes back to what I was talking about with how the ARMv6 and ARMv7 architectures are very different.
If you want to support the iPhone 3G, which a lot of people have, and you still want to get great performance, you're going to need to have two separate code paths if you write this yourself. If you use Accelerate, we already have those two separate code paths.
We're going to give you good performance on both platforms, and it's going to work in the simulator too. You don't have to write a separate set of fallback code for your ARM assembly to use in the simulator. Those are some reasons to use Accelerate.
There are some design trade-offs that we make in working on Accelerate that you should be aware of so you can make sure you take good advantage of the library. We always try to make the common cases fast. Because they're the common cases, they're where most of the work is going to be done, and we want those to be as fast as possible.
So we spend a lot of effort making the common cases fast. But because we want to provide general functionality, we want to make sure you can use the Accelerate framework for all your stuff and never need to worry about writing an FFT yourself, we also support lots of cases that aren't the common case. Now, they'll usually be fast; we make them fast too.
But the uncommon cases are usually the uncommon cases because there's something weird about them. And the fact that there's something weird about them often means that they just can't be as fast as the common case can. Strided access will never be as fast as unstrided access on a vector processor. It's just the way it is.
So when you're writing your application, when you're laying out your algorithms, you want to use the simple, common case as often as you can. If you have to use the other cases that are supported, they'll work, but try to structure things so that you're using the common case. Continuing with that, you should be aware of the size of the buffers that you're using. If you're making lots and lots of calls with tiny buffers, then you're paying a lot of call overhead. There are registers that need to be saved and restored on every call.
That takes time. Calling a function takes time. You want to spend as little time as possible doing that; you want your processor spending as much time as possible doing real work. So this means that you don't want to make a thousand calls with length 3 vectors. It's much better to make three calls with vectors of length a thousand.
So really try to structure your algorithms so that you're not making tons and tons of tiny, tiny calls into the framework. At the same time, you don't want your buffers to be too large. If they're too large, then it's much harder for the algorithms to benefit from the cache that's available on the processor. Usually, if you call one vDSP function, you're probably going to call another vDSP function on the output, or your own code is going to take the output and do something with it.
You'd really like all of that output to be in the cache on the processor so that it's available for whatever comes next and doesn't need to be fetched from memory. So it's good to aim for a particular working set size. What I mean by a working set is the amount of data that's live at any point in time in your algorithm.
What you're really computing with at any moment. You want to aim for about 32 kilobytes, about the size of the L1 cache. That's a good working set to aim for. What this means is that if you're working with single precision data or integers, vectors of length a thousand to four thousand, that kind of range, is a really nice range to aim for. That's where you're going to get the best performance. I keep coming back to this, but I'm going to say it again: use contiguous arrays if you can. We have the strided access, but if you can avoid it, don't use it.
So how do you not use it? A lot of times, you can't change a data structure that you're getting as input. Some other guy wrote it and you can't go make him change all of his code, or it's coming from the internet and you can't change that, who knows. You're getting data in, it's strided, and you have to deal with it.
However, the fact that the input to one of these routines is strided doesn't mean the output has to be strided. So what you want to try to do is, on the very first call to a vDSP routine or a LAPACK or BLAS routine, let the input come in strided, but produce a contiguous, stride 1 output.
That way, all the subsequent operations that you do can be faster. If you have to produce a strided output at the end of your computation, keep everything stride 1 right up until that last step, and then use a strided output for the final result. Another thing to be aware of is the architecture. Assuming that you're supporting the ARMv6 products, make sure to build your binary for both ARMv6 and ARMv7.
This will let your compiled code get the best performance on both architectures. You should also prefer single precision floating point, floats, to double precision, doubles, whenever you can. Float is a little bit faster than double on ARMv6, mostly because you can fit more data in cache. But on ARMv7, on the current architecture, float is enormously faster than double.
The reason for that is that float executes on the NEON unit, which is much faster than VFP, while double executes on the legacy VFP unit. Also, because it executes on the NEON unit, float can benefit from vectorization. When you call the Accelerate routines, we can take advantage of vectorization on single precision data.
There aren't really many opportunities, not really any at all, for vectorization on the ARMv6 architecture with either float or double data. And the VFP unit, which is what's used for double on ARMv7, also doesn't really allow you to vectorize. So if you can use single precision, you're going to get the best performance that way on ARMv7, both in your own code and when you call the Accelerate framework.
You can't always use it. Sometimes, your algorithm is so numerically sensitive that you're just stuck with double. That's how it is. But a lot of times, you can use float. So check, see if you can use float; if you can, do, and you'll get better performance. If you want more information about Accelerate, you can talk to our technology evangelist, Paul Danbold, and you can also post on the Apple Developer Forums if you have questions.
We have some documentation available on the vDSP library. LAPACK and BLAS are industry standard libraries; probably the best documentation for them is the freely available documentation at netlib.org. There's also an excellent book on LAPACK, the LAPACK Users' Guide, and there are links on netlib.org to purchase a copy if you want one. And the headers for the system are also a great place to look. That's where all these functions are defined, so you can start poking around in there and see what looks interesting and what you might like to use.