Core OS • iOS, OS X • 46:22
The Accelerate framework contains signal and image processing, matrix and linear algebra computation, and now an optimized array-based math library for iOS. Find out how you can use the Accelerate framework to achieve dramatic improvements in performance and energy consumption.
Speakers: Geoff Belter, Luke Chang
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper, it may have transcription errors.
Good morning, everyone. Welcome to the Accelerate Framework session. My name is Luke Chang. I'm an engineer in the Vector and Numerics Group. I'm here today to talk about Accelerate Framework. Have you ever thought about making an image processing app or audio equalizer app or any computation-intensive apps? If you have, you're in the right place. Accelerate Framework can help you make those apps and more. That's why we like to call it your one-stop shopping for fast and energy-efficient libraries. We call it one-stop shopping for a reason. In the Accelerate framework, we support a wide range of computational functionalities. We have signal processing, vDSP. We have image processing, vImage. Linear algebra, we have LAPACK and BLAS. Transcendental math functions, we have vForce and vMathlib.
So this is really a wide range of computational functionalities. In this session, I'd like to talk about three things. First, why you want to use Accelerate Framework. Second, where in your code you could use Accelerate Framework; we'll help you recognize that. And lastly, we're going to show you how to use Accelerate Framework. So let's get started on the first one. You want to use Accelerate Framework because it can help you make great apps. In my experience, a great app needs a few things.
First, it has to be useful. It does something I need, so I will go ahead, download it from the App Store, give it a try. After I downloaded it, it has to work. I don't want my app to crash every five minutes or give me the wrong results. Those are just the two basic requirements for a good app. It's not a great app yet. To make it a great app, I want my app to be responsive. Responsive to my every touch, every swipe, or every pinch. A slow app is not a good app.
And last, I want my app to have a long battery life. If the app is draining the battery a lot, I probably would not use it when I don't have my charger around. That would significantly cut down my usage time. That's not good. So these are the things the users want. Now how about you? As a developer, what do you want? Well, I am a developer too. Here are the things that I want for my code.
I want my code easy to write so I can finish my app sooner and put my app on the App Store so people can start using it. After I finish the app, I want my code to be readable. So when I hand it over to the next developer for maintenance, they won't keep coming back asking questions. I could focus my time and energy on the next great app.
And if there's ever a need to port from iOS to Mac OS X, like many of the games do now, they support both iOS and Mac OS X, I want the same code to just work on both platforms. All I need to do is just rebuild the project for iOS or for Mac OS X, and that's it. So, now, how does Accelerate framework help you achieve these goals?
Well, Accelerate Framework has more than 2,000 APIs. So it's more than likely that something you're trying to do is already supported in Accelerate Framework. So you don't have to write your own. And using Accelerate is very easy, so your code is going to be easy to write.
Accelerate Framework is well tested and accurate. So if your apps are using Accelerate Framework, your results will be accurate, and your app is going to be robust as well. Accelerate Framework, as its name suggests, is fast and energy efficient. So your app is going to be responsive, it can do more work in a short amount of time, and it won't drain the battery.
Lastly, Accelerate Framework is available on both iOS and OS X. So the same code would just work if you want to port from iOS to OS X. So these are the reasons that a lot of people are already using Accelerate Framework. A lot of great apps are using Accelerate Framework. In fact, we did a survey recently on the Mac App Store. Here's what we found.
Nine of the ten top-grossing apps in the Mac App Store use Accelerate. So, again, a lot of people are using it, great apps are using it, and so should you. Well, I'm going to let you in on a little secret on how we made Accelerate framework fast and energy efficient.
We use SIMD instructions to optimize our code. SIMD stands for single instruction, multiple data. Let's say you have four floating-point numbers you want to add to another four. You can use SIMD instructions. In one instruction, you can do all four arithmetic calculations. On Intel, we take advantage of SSE and the latest technology, AVX. On ARM, we use the NEON instruction set.
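To make the four-at-a-time idea concrete, here is a minimal sketch that uses the GCC/Clang vector extension rather than any Accelerate API. The names float4 and add_arrays are made up for this illustration, and whether the `+` below becomes a single SSE or NEON add instruction is up to the compiler, so treat it as a picture of the idea and not as how Accelerate is implemented.

```c
#include <assert.h>
#include <string.h>

/* Four packed single-precision floats (16 bytes), GCC/Clang vector extension.
   The name float4 is illustrative and not part of Accelerate. */
typedef float float4 __attribute__((vector_size(16)));

/* Adds two arrays four elements at a time; n must be a multiple of 4. */
void add_arrays(const float *a, const float *b, float *out, int n) {
    assert(n >= 0 && n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        float4 va, vb, vsum;
        memcpy(&va, a + i, sizeof va);       /* load four floats, no alignment assumed */
        memcpy(&vb, b + i, sizeof vb);
        vsum = va + vb;                      /* one vector expression, four additions */
        memcpy(out + i, &vsum, sizeof vsum); /* store four results */
    }
}
```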
And we also write hand-tuned assembly for each processor Apple supports. So we carefully schedule our assembly instructions, and we do software pipelining or loop unrolling, sometimes both, to make sure the maximum parallelism in your code is exposed to the processor, so the processor can execute more than one instruction at a time. And lastly, we multithreaded our code using GCD. So if your system has multiple cores, which most of the Apple devices do now, you'll be using all of the cores on the system to achieve the maximum performance.
Here's how you link against Accelerate framework in Xcode. It's very simple. You click on the Build Phases tab right here, and you click on the "Add" button, the small "Add" button, right after the library list. Xcode is going to show you a list of frameworks that you can link against. On the top of the list, there's Accelerate Framework. So let's select that and click "Add." Now your project is linking against Accelerate Framework. In order to use the APIs in Accelerate Framework, all you need to do is include the Accelerate header in your source file. Then you can start using it.
If you're feeling more adventurous and you want to do it on the command line, it's just as easy. You pass the -framework Accelerate option to your compiler, and that's it. Your code will be linking against Accelerate framework and get the best performance. Now that you know how to use Accelerate Framework, I'm going to present an FFT case study that we did recently to show you exactly why you want to use Accelerate Framework.
FFT stands for fast Fourier transform. It is one of the most important signal processing operations out there. It's used in audio compression, it's used in image processing, even used in video, pretty much everywhere. So if you don't know how to write an FFT, you could use the one in Accelerate Framework or pick up this popular book, Numerical Recipes in C. It has the algorithm and an implementation of the FFT. So we're going to compare these two on two metrics. The first one is execution time. Execution time is easy to understand. You want your code to run as fast as possible, so you want the execution time to be as short as possible.
And then we want to compare the energy consumption of using Accelerate framework and Numerical Recipes in C. I'd like to spend more time on energy consumption to talk about what the energy consumption of a function is. Because there is a common misunderstanding out there: people think power consumption is the same as energy consumption. But that's not true.
In terms of battery life, what you care about is really the energy consumption, not the power. A battery can hold a certain amount of energy, so if your app is more energy efficient, the battery life is going to be longer. So let me put the relation between energy and power into an equation. Energy is the integral of instantaneous power over time. In reality, we cannot get a continuous measurement of instantaneous power. What we do is use a piecewise summation to approximate the integral. The equation looks complicated, but it's actually very simple. Let me put this into a graph.
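The piecewise summation described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function name energy_joules is hypothetical, the power samples are assumed to be taken at a fixed interval, and a simple rectangle rule stands in for whatever measurement setup was actually used for the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Approximates E = integral of P(t) dt with a piecewise (rectangle) sum:
   E ~= sum over i of P_i * dt, where P_i are power samples in watts taken
   every dt_seconds. Returns the energy in joules. */
double energy_joules(const double *power_watts, size_t count, double dt_seconds) {
    assert(dt_seconds >= 0.0);
    assert(power_watts != NULL || count == 0);
    double energy = 0.0;
    for (size_t i = 0; i < count; ++i) {
        energy += power_watts[i] * dt_seconds;
    }
    return energy;
}
```

With made-up numbers, a workload drawing 2 W for one second costs 2 J, while one drawing only 1 W for eight seconds costs 8 J, which is why higher instantaneous power can still mean lower energy.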
This is a typical energy consumption profile that we measure on the system. So at the beginning, the system is idle, so the system is consuming the idle power, P0. At time T0, the workload comes in, in this case the FFT. The system starts doing something, the processor is processing that data. The instantaneous power jumps to P1.
And at time T1, the workload finishes and the system comes back to the idle power P0. There are two things that you need to pay attention to in this graph. The first one is the time difference between T0 and T1. That's your execution time. You want it to be as short as possible.
The second thing is the area under the curve between T0 and T1. That is the energy consumption of this function. You want it to be as small as possible. So now we know what we're comparing. We use these metrics to compare the two, the Accelerate framework FFT and Numerical Recipes in C. Let's look at the competition. Our competition is Numerical Recipes in C. We asked an average programmer to write an FFT based on the book, so he did a straight-from-the-book implementation. It's about 50 lines of code. And it looked like this.
So this code is not terribly difficult to write. I know everyone in this room is capable of writing one of your own. But there are a couple of things you need to pay attention to. First, you have to be careful about the indexes and where you want to use add, where you want to use subtract. After you get all the signs and the details right, you're not done yet.
You still need to test for accuracy. You want to make sure the error tolerance is within your app's range. And you want to measure for performance, because if you're writing a real-time app, you don't want your app to lose audio samples or lose video frames. And last, you need to document your code. You have to document where you took the algorithm from and how you implemented it.
In our opinion, all this is just too much work. You could use the time to focus on your next project or next big feature and save the time by using Accelerate FFT. So here is how you use Accelerate FFT. There are three simple steps for you to use Accelerate FFT: setup, operate, and destroy. So, of course, you have to include the Accelerate header to have access to all Accelerate framework APIs.
Then you pass in the data and the FFT length. The FFT length is represented in terms of log2. So you can see log2n equal to 10. That means we're trying to do a 1024-point FFT. So once at the start, before you process any of your data, you call vDSP_create_fftsetup. You pass in the FFT length and the radix information. Here we have radix 2.
And then you go on to operate on your data. You call vDSP_fft_zip. It will do an in-place complex-to-complex FFT. You pass in the setup structure and where the data is. And this one, the third argument, is telling the FFT that my data is in contiguous memory, so you access one element after another. Then comes the length information, and we want to do a forward FFT. You can call vDSP_fft_zip multiple times to handle all your data with the same setup structure. You can reuse the same setup structure. And only once at the end, when you're done with all your data, you call vDSP_destroy_fftsetup. This is to reclaim the memory that's allocated to the setup structure so you can avoid a memory leak. So, using Accelerate FFT is really simple. Let's look at the performance of these two, Accelerate FFT and Numerical Recipes in C.
Here is the energy consumption profile for Numerical Recipes in C. And here is Accelerate FFT. As you can see, the execution time for Accelerate FFT is much, much shorter than Numerical Recipes in C. Even though the instantaneous power of Accelerate FFT is higher than Numerical Recipes in C, the time is so much shorter that the total area under the curve, which is the energy consumption, is much less than Numerical Recipes in C. Let me put this into perspective for you. Let's normalize the performance and energy consumption of Numerical Recipes in C to 1. Here's what Accelerate FFT looks like.
Accelerate FFT is more than nine times faster than Numerical Recipes in C, while it only consumes one-eighth of the energy of Numerical Recipes in C. So you can do more work while consuming less energy. This is exactly why you want to use Accelerate framework. Now, enough with the hard-core data. Let me tell you a brief history of Accelerate framework.
Accelerate Framework has been available on Mac OS X for many years. We have vDSP for signal processing, we have linear algebra libraries, we have image processing libraries, and we have the math functions. Over the years, we've been trying to bring every component from Mac OS X to iOS. We started with vDSP and linear algebra. Last year, we added vImage and vForce. It's been a huge success for us. So this year, I'm happy to announce that we added the last piece of Accelerate framework. We added vMathlib. So if your code is using Accelerate framework on Mac OS X, you can safely port it to iOS. All the components are there. We have a complete picture for both iOS and Mac OS X.
Now, since I introduced vMathlib, let's talk about it. vMathlib is the SIMD vector library. It operates on SIMD vectors. In the Vector and Numerics Group, we support math for every data length. For your scalar data, one floating-point input and one floating-point output, we have libm. For your array data, we have vForce. vForce operates on arrays, so it takes an array as the input and generates another array as output.
vMathlib is something in between. vMathlib operates on SIMD vectors. So in Accelerate framework, we define the vFloat data type that maps to the vector register in your processor. On most architectures, vFloat will have four elements, four single-precision floating-point numbers in the structure, and you'll be using vFloat when you're writing your own vector code.
For those who are not familiar with libm, here are a few words. libm is your standard math library in C. It has a collection of transcendental functions. Here are the familiar names: exp, log, sin, cos. vForce operates on arrays. It has "vv" as the function prefix, so you can see vvexpf, vvlogf, vvsinf, et cetera. vMathlib has only one "v" as the prefix. So there is vexpf, vlogf, vsinf, et cetera.
You want to use vMathlib when you're writing your own vector code. While you're writing your vector code, you probably sometimes want to take the sine or cosine of a value. What do you do in this case? Well, you could use libm to achieve what you want, something like this. So you include math.h to use libm, and you write a for loop to take each element in the input vector and then store the result into the output vector.
So there's an obvious problem in this code. You want to write a vector algorithm because you want to take advantage of the performance of vector units. But libm is not using vector units, so you're not getting the performance improvement. So how about we use vForce? vForce does use the vector unit, and the code will look like this.
So include the Accelerate header, and vForce operates on arrays, so you have to take the address of your input vector and output vector and then tell vForce the length. Well, it works, but it's awkward, obviously. And another thing is, because vForce is designed to work on arrays, it takes a pointer to that array. It involves memory access. So if your input and output vectors are already in registers, there's no need to do the memory access. We can use vMathlib to achieve that. Here's how you write the code using vMathlib. So it's very simple. We call vsinf and pass vx as the input argument, and you'll have the result in another vector, vy. The code is much cleaner, and there is no memory access. You will get the optimum performance.
So that's vMathlib. I briefly mentioned vForce in the previous slide, but without going into too much detail. Now is the time to do so. vForce is a vectorized math library. It operates on arrays, and in addition to the transcendental functions, we have rounding functions. All four rounding modes are supported. And we also have lots of other stuff, like square root, remainder, nextafter, et cetera.
Let's say you want to write a signal generator app, and you want to generate a frequency-modulated sine wave. Again, you can use libm to do that. You write a for loop to call sinf, which is the libm function. You go through each element in the input array and then store the result to the output array. That works, but it's not optimal. Let's use vForce for that.
It's very simple to use vForce. You simply replace the for loop with one function call to vvsinf. You pass the address of the output buffer and the input buffer and also a pointer to the length, and the frequency-modulated sine wave will be generated in the output buffer. It's just that simple. Now, let's look at the performance comparison of the two. One is using vForce, the other one is just using the for loop.
As you can see, vForce is more than twice as fast as using a simple for loop. At the same time, it consumes less energy. vForce is more than twice as energy efficient as using a simple for loop. So again, faster and energy efficient. That's the motto of Accelerate framework. We didn't just cherry-pick sine to be our example. There are a bunch of other functions in vForce as well. We have truncation, log, exp, power. Across the board, you can see a typical 2x speedup. So you can safely use vForce and expect great results.
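The other vForce functions follow the same calling convention as the sine example: output array, input array, and a pointer to the element count. As a hedged sketch, the hypothetical wrapper truncate_array below calls vvintf (vForce's truncation toward zero) on Apple platforms. The plain loop in the other branch is an assumption of this sketch so that it also builds without Accelerate, and it is only valid for values that fit in an int.

```c
#include <assert.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Truncates each element of input toward zero and writes it to output. */
void truncate_array(float *output, const float *input, int count) {
    assert(count >= 0);
#ifdef __APPLE__
    /* vForce convention: output first, then input, then a POINTER to the length. */
    vvintf(output, input, &count);
#else
    /* Portable stand-in (not Accelerate); valid while each value fits in an int. */
    for (int i = 0; i < count; ++i) {
        output[i] = (float)(int)input[i];
    }
#endif
}
```

Swapping vvintf for vvsinf, with the same three arguments, gives the sine call described in the talk.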
Here's some detail about vForce. vForce supports both single- and double-precision floating-point numbers. It handles edge cases correctly, so if your input has infinities or NaNs, positive zero, negative zero, you don't have to worry about it. You just pass them to vForce, and vForce will handle those cases for you.
vForce requires minimum data alignment. We only require the native data alignment. For a single-precision floating-point number, that's going to be four bytes. For a double-precision floating-point number, it's eight bytes. And we also support in-place calculation. You don't have to allocate a temp buffer to hold the results. We can just operate on your data in place.
And a lot of people ask us this question: "I only have 20 elements, I only have 20 numbers. Is it beneficial to use vForce?" Well, as a rule of thumb, you will see a performance improvement when you have more than 16 elements. Some functions have a higher threshold, some functions have a lower threshold. If you're interested in finding out the exact number, you can just write a simple app to do a test. But as a rule of thumb, if you have more than 16 elements in your array, you're good to go. You can use vForce and expect great results.
So that's vForce. We talked about the math libraries, vMathlib and vForce. I'm going to talk about another big block, vDSP. vDSP is our signal processing library. It has pretty much everything you need for signal processing. We have basic operations like add, subtract, conversion, and we also have the discrete Fourier transform. The FFT that we saw earlier in the case study is actually part of the discrete Fourier transform. For the discrete Fourier transform, we support radix 2, radix 3, and radix 5. And we also have convolution, if you want to do your own filtering on the signal. And we also have correlation, if you want to do signal analysis. They're there for you.
In iOS 6, we added two new features. The first one is the discrete cosine transform, and the other one is the biquad IIR filter. These two features were requested by many developers. And we do look at those feature requests. When the time is right, we'll add them in. So if you find something that you really need, something really great, and it's not available in Accelerate Framework, go file a feature request. Don't be afraid. We do look at them, and we're going to work on them when the timing is right. The discrete cosine transform is very similar to the FFT that we saw earlier, so I'm going to take the biquad IIR filter as an example.
Here is an N-stage cascaded biquad IIR filter. So you pass the input into N stages of second-order biquad filters, and you get the output at the end. This is very common in audio processing. So we added this in iOS 6. What it does basically is we optimize the biquad filter, because there is an inherent feedback loop in the biquad filter that's very hard to optimize. We did the work for you, so you can just use the biquad IIR filter in Accelerate Framework and get great results. Here's an example of how to use it. It is the same three steps: setup, operate, and destroy.
First, include the Accelerate header, as usual. And we specify that we want to do a 10-stage IIR filter. For each stage, the IIR filter requires five coefficients and two delay stages. You can think of the delay stages as the current state of the filter. There are two of them. And there's the input and the output. Once at the start of your program, you call vDSP_biquad_CreateSetup to create the setup structure that's needed by the operation. You pass in the filter coefficients and the number of stages. A setup structure will be created for you. And again, you can call vDSP_biquad multiple times to work on all your data with the same setup structure.
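The slide itself is not in the transcript, so what follows is a hedged sketch of the setup, operate and destroy steps for the biquad filter. The wrapper name biquad_cascade is hypothetical. On Apple platforms it uses vDSP_biquad_CreateSetup, vDSP_biquad and vDSP_biquad_DestroySetup, with five coefficients per section in the order b0, b1, b2, a1, a2 and a zeroed delay array of 2 * stages + 2 floats, which is the size the vDSP header asks for. The scalar loop in the other branch is an assumption of this sketch so that it builds without Accelerate.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Runs count samples through `stages` cascaded biquad sections, starting from
   a zeroed filter state. coeffs holds 5 values per section in the order
   b0, b1, b2, a1, a2 (a0 is taken to be 1), so each section computes
   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2].
   Returns 0 on success, -1 on failure. */
int biquad_cascade(const double *coeffs, size_t stages,
                   const float *input, float *output, size_t count) {
    assert(coeffs != NULL && stages > 0);
#ifdef __APPLE__
    /* Setup: once per filter design. */
    vDSP_biquad_Setup setup = vDSP_biquad_CreateSetup(coeffs, stages);
    if (setup == NULL) return -1;
    float *delay = calloc(2 * stages + 2, sizeof *delay); /* filter state */
    if (delay == NULL) {
        vDSP_biquad_DestroySetup(setup);
        return -1;
    }
    /* Operate: stride 1 for both input and output. */
    vDSP_biquad(setup, delay, input, 1, output, 1, count);
    /* Destroy: release the state and the setup. */
    free(delay);
    vDSP_biquad_DestroySetup(setup);
    return 0;
#else
    /* Portable stand-in (not Accelerate): one scalar pass per section. */
    for (size_t i = 0; i < count; ++i) output[i] = input[i];
    for (size_t s = 0; s < stages; ++s) {
        const double *c = coeffs + 5 * s;
        double x1 = 0.0, x2 = 0.0, y1 = 0.0, y2 = 0.0;
        for (size_t i = 0; i < count; ++i) {
            double x0 = output[i];
            double y0 = c[0] * x0 + c[1] * x1 + c[2] * x2 - c[3] * y1 - c[4] * y2;
            x2 = x1; x1 = x0;
            y2 = y1; y1 = y0;
            output[i] = (float)y0;
        }
    }
    return 0;
#endif
}
```

To process a stream in several chunks, an app would keep the setup and the delay array alive between calls instead of recreating them each time as this sketch does.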
And at the end, you destroy the setup to reclaim the memory allocated. So those are the same three simple steps that you use for the biquad IIR filter and the FFT. Now, here are a few more things about the data types in vDSP. We support single- and double-precision floating-point numbers.
And we also support real and complex numbers. So you can do a real-to-complex FFT or a complex-to-real FFT. Those are all supported. We also support strided data access. In the previous examples, I always used a stride of one, meaning the data in memory is contiguous.
We also support the case where you're getting data from somewhere else and you want to access every second or every third element. We do support that. However, it's our recommendation that if you have control over your data, you arrange your data in contiguous memory so we can fully take advantage of the vector unit and give you the best results. If you can't, don't worry about copying the data in memory to make it contiguous; we can just operate on the data with strided access. So that is vDSP, the signal processing library.
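As a small illustration of strided access, the hypothetical helper add_strided below forwards to vDSP_vadd on Apple platforms, where each array argument is followed by its stride. The loop in the other branch is an assumption of this sketch so that it builds without Accelerate.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Computes out[i * out_stride] = a[i * a_stride] + b[i * b_stride]
   for i in [0, count). A stride of 1 means contiguous data. */
void add_strided(const float *a, ptrdiff_t a_stride,
                 const float *b, ptrdiff_t b_stride,
                 float *out, ptrdiff_t out_stride, size_t count) {
    assert(a_stride > 0 && b_stride > 0 && out_stride > 0);
#ifdef __APPLE__
    vDSP_vadd(a, a_stride, b, b_stride, out, out_stride, count);
#else
    /* Portable stand-in (not Accelerate). */
    for (size_t i = 0; i < count; ++i) {
        out[i * (size_t)out_stride] =
            a[i * (size_t)a_stride] + b[i * (size_t)b_stride];
    }
#endif
}
```

For example, with interleaved stereo samples stored as L0, R0, L1, R1 and so on, passing the buffer with stride 2 as one input and the buffer plus one with stride 2 as the other sums the two channels into a contiguous mono output.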
Now, another big block in Accelerate framework is vImage, the image processing library. We're in the era of digital photography. People love to share photos, and before they share a photo, they actually like to do some post-processing on the photo, like removing red eye or increasing the saturation, all the things you can do. So vImage is a great help. vImage can do a lot of things. It can do convolution. Convolution can achieve various effects, like blur. I'm going to show you how to use convolution to blur right now. Here's the effect.
And you can also do geometry. You can rotate and scale your picture like this. You can do morphology. This one is very interesting. I'm going to turn each snowflake in this picture into a bigger one. There is also alpha, so you can blend two pictures together. I'm going to fade one out and fade another in.
We have transform functions that can change the hue in this picture. I'm going to send this little guy to outer space. And we have histogram. If you take a lot of photos, you're familiar with this. It is the intensity distribution of your RGB channels, like this. And lastly, we have conversion functions to convert your image from one format to another. We added a few things to vImage. In Mountain Lion and iOS 6, we have BGRA and RGBA support. And now we also support 16-bit integers.
Because convolution is such a versatile operation, I'd like to say a few words about it too. So I showed you blur, but it's not the only thing convolution can do. It can also do edge detection. You can use different kernels on each color channel to achieve a pretty fancy result that I'm going to show you right now. Here is blur, and edge detection, and a different operation on each color channel.
So it looks very fancy, but it's actually very simple. It's a very simple idea. Convolution is basically a weighted average of nearby pixels. Let's say you have a kernel like this. The center pixel is more heavily weighted than the side or corner pixels. And you have an image like this; you can see the purple pixels form a sharp edge in this image. Let's apply the weighted average on the center pixel. What you get is a lighter purple pixel. Same idea, you apply this kernel to all the pixels in this image.
Basically, the sharp edge formed by the purple pixels is replaced by a softer edge. Essentially, this is a blurring effect. Convolution is this simple, so you might be thinking, "I can write one myself." That's true. It's just nested for loops, four of them nested together. However, there are problems in this code. First, it does not handle the edge pixels correctly. When you're at the edge, you don't have all your nearby pixels for the kernel. You need to handle that case. And it doesn't handle the overflow case, so you might see an artifact in your picture due to overflow. And the most important thing here is it's really, really slow. In our experience, good convolution code takes more than 100 lines of code, if not more.
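To make the weighted average, the edge case and the overflow case concrete, here is a simplified scalar sketch in plain C. It is not vImage and not the optimized code referred to above: it handles one 8-bit channel, a fixed 3 by 3 kernel, and all names are hypothetical. Edge pixels reuse their nearest neighbor, which is only one of several possible edge policies, and results are clamped to the 0 to 255 range instead of being allowed to wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamps an index into [0, limit - 1] so edge pixels reuse their nearest neighbor. */
static size_t clamp_index(long i, size_t limit) {
    if (i < 0) return 0;
    if ((size_t)i >= limit) return limit - 1;
    return (size_t)i;
}

/* Applies a 3x3 integer kernel to a single-channel 8-bit image.
   Each output pixel is the kernel-weighted sum of its neighborhood divided by
   divisor. The kernel is applied without flipping, which makes no difference
   for symmetric kernels such as a blur. src and dst must not overlap. */
void convolve3x3_planar8(const uint8_t *src, uint8_t *dst,
                         size_t width, size_t height,
                         const int16_t kernel[9], int32_t divisor) {
    assert(src != NULL && dst != NULL && src != dst);
    assert(width > 0 && height > 0 && divisor > 0);
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            int32_t sum = 0;
            for (int ky = -1; ky <= 1; ++ky) {
                for (int kx = -1; kx <= 1; ++kx) {
                    size_t sy = clamp_index((long)y + ky, height);
                    size_t sx = clamp_index((long)x + kx, width);
                    sum += kernel[(ky + 1) * 3 + (kx + 1)] * src[sy * width + sx];
                }
            }
            int32_t value = (sum + divisor / 2) / divisor; /* rounded weighted average */
            if (value < 0) value = 0;       /* saturate instead of wrapping around */
            if (value > 255) value = 255;
            dst[y * width + x] = (uint8_t)value;
        }
    }
}
```

A blur kernel such as 1 2 1 / 2 4 2 / 1 2 1 with a divisor of 16 weights the center pixel most heavily, matching the description above.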
So I'm going to show you the performance comparison. I will show you how to use vImage first. It's a simple function call to vImageConvolve_ARGB8888. We pass in the source and destination, and we pass a little bit more information about the kernel. Your convolution result will be ready in the destination. So now I'm going to show you the performance result. This is a 7 by 7 convolution on a 1024 by 768 image. vImage is more than 14 times faster than using good scalar code.
And energy consumption, same story. It's very energy efficient. It consumes less than one-eighth of the energy. So fast and energy efficient. There are a lot of image formats, and we cannot support all of them. So we classify them into two categories. First are the core formats, and the rest are the non-core formats. There are four types of core formats. If you have a single-channel image, each pixel can be an 8-bit integer or a 32-bit floating-point number. We also support four-channel interleaved. So we have ARGB8888 and ARGBFFFF.
Any format that's not one of these four is considered a non-core format. We have BGRA, RGB, GBR, 16-bit unsigned integer, or 16-bit floating-point number. And the main difference between core formats and non-core formats is that vImage operations support core formats extensively, but not non-core formats. So you might be wondering, OK, now, if I have an image that's in a non-core format, and I want to do some operation on it using vImage, can I do that? Yes, of course you can. We have all the conversion functions to help you. Here is an example. I want to do a scaling operation on a premultiplied PlanarF image.
The core format expects the image to be a non-premultiplied image. So if you have a premultiplied alpha image, you want to convert it to the core format first. So you call this conversion function, vImageUnpremultiplyData_PlanarF. It works in place. And then you can operate on the core format, which is PlanarF.
And at the end, you convert it back to your desired image format. Now, you might be thinking, "Wow. It looks like three times the workload. I have to convert it, and I have to operate on it, and then convert it back at the end." Well, I'm going to show you the performance result to tell you that it is okay to do that. As you can see, the conversion at the beginning and at the end takes less than 2% of the execution time of that piece of code. So the majority of the time is still spent on the operation itself, not on the conversion.
So if you have an image that's in a non-core format, you can go ahead and convert it to a core format, do all sorts of operations you want to do on the image, and at the end, you convert it to the format that you want. So you're going to get the performance improvement that you want while saving energy.
So, a few more data requirements in vImage. Again, we require minimum data alignment; for a single-precision floating-point number, it is four bytes. And the data is not containerized. So you're not copying the entire image from one memory location to another. We're just passing the pointers around. So there is no copy involved in a vImage buffer. It's very efficient.
So that's vImage. And the last piece is the linear algebra library. I'm going to invite my colleague, Geoff Belter. He will tell us more about linear algebra.
Thanks, Luke. So for linear algebra in the Accelerate framework, we've got two great packages. We've got LAPACK, the linear algebra package, and we've got BLAS, the basic linear algebra subprograms. Let's start by taking a look at LAPACK.
LAPACK is the high-level linear algebra functionality. So if you want to solve a system of linear equations, there are going to be APIs to do that in here. Maybe you need to perform a matrix factorization, a QR or an LU factorization. There are APIs for that as well. There's also functionality to compute eigenvalues and eigenvectors. There are several hundred APIs in LAPACK alone, so there's a huge amount of functionality here. It's probably going to have what you need.
Let's take a look at an example of one of the really common uses of LAPACK, and that's solving a system of linear equations. One of the great things about LAPACK is it's got something for everybody. So if you want to do this with a single API, there's a routine that's going to do that for you. It's going to do the factor and the solve behind the scenes, and it's going to give you your result back. If you want to get your hands dirty a little bit, you can do it the way I have shown up here. So we've prepared our matrix in A, and we're going to do the factor first with dgetrf. That factor is going to be done in place, and then we're going to send that factored matrix to dgetrs to perform the solve. We get our result. It's pretty simple. It's pretty easy.
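The factor-then-solve sequence can be sketched as follows. The wrapper name solve_linear_system is hypothetical. On Apple platforms it calls dgetrf_ and dgetrs_ through the classic LAPACK interface in the Accelerate header, assuming the default 32-bit integer interface. The Gaussian elimination in the other branch is an assumption of this sketch so that it builds without Accelerate, and it leaves different contents in the matrix than LAPACK does.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Solves A x = b for an n-by-n column-major matrix a (element (i, j) is a[i + j*n]).
   a is overwritten during factoring and b is overwritten with the solution x.
   Returns 0 on success and nonzero on failure (for example a singular matrix). */
int solve_linear_system(int n, double *a, double *b) {
    assert(n > 0 && a != NULL && b != NULL);
#ifdef __APPLE__
    __CLPK_integer order = n, nrhs = 1, info = 0;
    __CLPK_integer *pivots = malloc((size_t)n * sizeof *pivots);
    if (pivots == NULL) return -1;
    char no_transpose = 'N';
    dgetrf_(&order, &order, a, &order, pivots, &info);        /* factor in place */
    if (info == 0) {
        dgetrs_(&no_transpose, &order, &nrhs, a, &order,       /* solve with the factors */
                pivots, b, &order, &info);
    }
    free(pivots);
    return (int)info;
#else
    /* Portable stand-in (not LAPACK): Gaussian elimination with partial pivoting. */
    for (int k = 0; k < n; ++k) {
        int p = k;
        double best = a[k + k * n] < 0 ? -a[k + k * n] : a[k + k * n];
        for (int i = k + 1; i < n; ++i) {
            double v = a[i + k * n] < 0 ? -a[i + k * n] : a[i + k * n];
            if (v > best) { best = v; p = i; }
        }
        if (best == 0.0) return k + 1;                         /* singular matrix */
        if (p != k) {                                          /* swap rows k and p */
            for (int j = 0; j < n; ++j) {
                double t = a[k + j * n]; a[k + j * n] = a[p + j * n]; a[p + j * n] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }
        for (int i = k + 1; i < n; ++i) {                      /* eliminate below the pivot */
            double factor = a[i + k * n] / a[k + k * n];
            for (int j = k; j < n; ++j) a[i + j * n] -= factor * a[k + j * n];
            b[i] -= factor * b[k];
        }
    }
    for (int i = n - 1; i >= 0; --i) {                         /* back substitution */
        double sum = b[i];
        for (int j = i + 1; j < n; ++j) sum -= a[i + j * n] * b[j];
        b[i] = sum / a[i + i * n];
    }
    return 0;
#endif
}
```

The single-call routine mentioned above does the same factor and solve in one step; this sketch keeps them separate to mirror the two calls named in the talk.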
There's a lot that goes on behind the scenes in LAPACK, and a lot of LAPACK is built on the other package, BLAS. Let's take a look at an example of how we use BLAS. So the matrix factor that we just saw there is going to spend a lot of time in the matrix-matrix multiply. That's what I'm showing here. The use case here is going to be the same. We're going to prepare our matrices A, B, and C, and we're going to call into cblas_dgemm here. GEMM stands for general matrix-matrix multiply, and the D prefix is double precision.
BLAS supports both row-major and column-major storage formats, so rows or columns are contiguous in memory, and we need to specify that when we make the function call. Also, it's going to support transposing and not transposing the data, so you don't have to manipulate your matrices before you use BLAS or LAPACK. It's all going to be done behind the scenes.
There's a lot of functionality in BLAS as well. As I mentioned before, LAPACK is built pretty heavily on BLAS, so BLAS tends to be low-level linear algebra operations. It tends to be categorized into three levels. So there are vector operations: dot product, scalar products, vector sums. There are matrix-vector operations: matrix-vector product, outer product. And then there are matrix-matrix operations: back solves, rank updates, and the matrix multiply that we just saw.
Let's take a look at some performance numbers. So for the double-precision matrix multiply we saw the example of, what I've got here is a graph. On the X axis I'm showing a range of matrix sizes, from 64 by 64 matrices up through 1,024 by 1,024. And on the Y axis, performance in gigaflops. So higher is better here. Let's look at that data.
It quickly climbs to some really great performance. Even in the small sizes we see we're getting some good performance. By about 500, we've reached a plateau and we're going to give you that great performance for as big as you want to get. I don't show small sizes here, but even down into 4x4 and 8x8 matrices, we spent a lot of time there too. You're going to get great performance for a range of matrix sizes.
For those of you familiar with performance numbers, what we're showing here is almost 90 double precision gigaflops. This is really impressive. This is an iMac, so a computer sitting on your desk giving you some really great performance. But don't take my word for it, let's put this into perspective. What I've got here is a comparison similar to what we did in the FFT case study. We've got a straightforward C implementation of the double precision matrix multiply, so a programmer finds the algorithm for matrix multiply in a book and codes it. We've normalized the performance and the energy to one here, and let's compare that to the DGEMM in the Accelerate framework.
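The baseline code itself is not shown in the talk. As a guess at what "the algorithm from a book" might look like, here is a minimal textbook triple loop. The name naive_dgemm and the row-major layout are assumptions made for illustration.

```c
/* Textbook matrix multiply: C = A * B, all matrices in row-major order.
   A is m x k, B is k x n, C is m x n.
   Illustrative baseline only; not the code measured in the talk. */
void naive_dgemm(int m, int n, int k,
                 const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; ++i) {          /* each row of C          */
        for (int j = 0; j < n; ++j) {      /* each column of C       */
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {  /* dot product of row i of A
                                              with column j of B     */
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```

The inner loop walks B with a stride of n elements, which is one reason a loop like this tends to use the cache and vector units poorly compared with a tuned library routine.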
We see huge performance improvements here, 158 times faster. The best part is we do this with a fraction of the energy, so one seventy-third the amount of energy to give you this really incredible performance improvement.
Let's look at some of the details of the data types that BLAS and LAPACK support. They both support single and double precision, both real and complex numbers, and a range of data layouts. BLAS, as I mentioned, is going to support row and column major. LAPACK only supports column major. But both BLAS and LAPACK are going to support dense matrices, general matrices, banded matrices, triangular matrices, even a few packed structures. And they're going to support transposes and conjugate transposes when appropriate. So again, you're not going to have to modify your data before you call into BLAS or LAPACK.
So that's a quick look at what's available in LAPACK, and with that I'm going to turn it back over to Luke to wrap up the talk.
Well, those numbers are amazing. One hundred and fifty-eight times faster. So right now I'm going to give you a quick summary.
Accelerate Framework is easy to use. Most of the time, it's just one function call to replace a for loop, or even to replace 50 or 60 lines of code. And it's accurate, so if your app is using Accelerate Framework, your result is going to be accurate, too. Accelerate Framework is fast with low energy usage, so your app will be responsive and have a long battery life. It's available on both OS X and iOS, so the same code will just work if you want to port from one platform to another.
Here are a few tips for using the Accelerate Framework. If you can, you want to arrange your data in contiguous memory and make it 16-byte aligned. We do support data that has strided access or just native data alignment, but in order to fully take advantage of the vector engine, we would prefer the data to be 16-byte aligned and contiguous. I know that sometimes you get the data from somewhere else. You have no control over it. That's fine. But if you do, try to make it contiguous and aligned to 16 bytes. You will have the maximum performance improvement in that case.
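One way to get a contiguous, 16-byte-aligned buffer when you do control the allocation is sketched below, using the C11 function aligned_alloc. The helper names are hypothetical, and this is only one possible approach, not something shown in the talk.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocates `count` contiguous floats at an address that is a multiple of 16.
   Returns NULL on failure. Release the buffer with free().
   Hypothetical helper for illustration. */
float *alloc_aligned_floats(size_t count)
{
    size_t bytes = count * sizeof(float);
    /* Round the size up to a multiple of 16, as C11 aligned_alloc expects. */
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    if (rounded == 0) {
        rounded = 16;
    }
    return aligned_alloc(16, rounded);
}

/* Returns 1 if the pointer is aligned to a 16-byte boundary, 0 otherwise. */
int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A check like is_16_byte_aligned can also be applied to buffers you receive from elsewhere, to see whether they already meet the preferred alignment.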
And you also want your data to be large enough. You don't want to do just one simple vForce function call on a small amount of data. That's not going to be effective. And whenever you have to make a setup structure, you only need to do the setup once. Perform all the operations you need on all your data, and then destroy the setup at the end to reclaim the memory allocated.
So just a quick refresher: we have digital signal processing in vDSP. We have image processing in vImage. There's linear algebra in BLAS and LAPACK. And we have transcendental functions in vMathLib and vForce. So this is really a comprehensive list of computational functionality. It is more than likely that you will find something you can use in your app. So on that note, I would like to say, ladies and gentlemen, let's accelerate!
I encourage everyone here, after the session, to go look at your code, find somewhere you could use Accelerate Framework, and give it a try. You will see the performance improvement with your own eyes. If you need more information, here are the contacts. So that's the end of the presentation. Thank you very much.