Core OS • iOS, OS X • 46:22
The Accelerate framework contains signal and image processing, matrix and linear algebra computation, and now an optimized array-based math library for iOS. Find out how you can use the Accelerate framework to achieve dramatic improvements in performance and energy consumption.
Speakers: Geoff Belter, Luke Chang
Unlisted on Apple Developer site
Transcript
This transcript was generated using Whisper; it has known transcription errors. We are working on an improved version.
Good morning, everyone. Welcome to the Accelerate framework session. My name is Luke Chang. I'm an engineer in the Vector and Numerics group, and I'm here today to talk about the Accelerate framework. Have you ever thought about making an image processing app, an audio equalizer app, or any computation-intensive app? If you have, you're in the right place.
The Accelerate framework can help you make those apps and more. That's why we like to call it your one-stop shop for fast and energy-efficient libraries. We call it one-stop shopping for a reason: the Accelerate framework supports a wide range of computational functionality. For signal processing, we have vDSP. For image processing, we have vImage. For linear algebra, we have LAPACK and BLAS. For transcendental math functions, we have vForce and vMathLib.
So this is really a wide range of computational functionality. In this session, I'd like to talk about three things. First, why you want to use the Accelerate framework. Second, where in your code you could use the Accelerate framework; we'll help you recognize those places. And lastly, we're going to show you how to use the Accelerate framework. So let's get started on the first one. You want to use the Accelerate framework because it can help you make great apps. In my experience, a great app needs a few things.
First, it has to be useful. It does something I need, so I will go ahead, download it from the App Store, give it a try. After I downloaded it, it has to work. I don't want my app to crash every five minutes or give me the wrong results. Those are just the two basic requirements for a good app. It's not a great app yet. To make it a great app, I want my app to be responsive. Responsive to my every touch, every swipe, or every pinch. A slow app is not a good app.
And last, I want my app to have a long battery life. If the app is draining the battery a lot, I probably would not use it when I don't have my charger around. That would significantly cut down my usage time. That's not good. So, these are the things the users want. Now, how about you? As a developer, what do you want? Well, I am a developer too. Here are the things that I want for my code.
I want my code easy to write so I can finish my app sooner and put my app on the App Store so people can start using it. After I finish the app, I want my code to be readable. So when I hand it over to the next developer for maintenance, they won't keep coming back asking questions. I could focus my time and energy on the next great app.
And if there's ever a need to port from iOS to Mac OS X, like many of the games do now, they support both iOS and Mac OS X, I want the same code to just work on both platforms. All I need to do is just rebuild the project for iOS or for Mac OS X, and that's it.
So, now, how does the Accelerate framework help you achieve these goals? Well, the Accelerate framework has more than 2,000 APIs, so it's more than likely that something you're trying to do is already supported in the Accelerate framework. You don't have to write your own, and using Accelerate is very easy, so your code is going to be easy to write.
The Accelerate framework is well tested and accurate. So if your app uses the Accelerate framework, your results will be accurate and your app is going to be robust as well. The Accelerate framework, as its name suggests, is fast and energy efficient. So your app is going to be responsive, it can do more work in a short amount of time, and it won't drain the battery.
Lastly, Accelerate Framework is available on both iOS and OS X. So the same code will just work if you want to port from iOS to OS X. So, these are the reasons that a lot of people are already using Accelerate Framework. A lot of great apps are using Accelerate Framework. In fact, we did a survey recently on the Mac App Store. Here's what we found.
Nine of the ten top-grossing apps in the Mac App Store use Accelerate. So, again, a lot of people are using it, great apps are using it, and so should you. Well, I'm going to let you in on a little secret on how we made Accelerate framework fast and energy efficient.
We use SIMD instructions to optimize our code. SIMD stands for single instruction, multiple data. Let's say you have four floating-point numbers you want to add to another four. You can use SIMD instructions. In one instruction, you can do all four arithmetic calculations. On Intel, we take advantage of SSE and the latest technology, AVX. On ARM, we use the NEON instruction set.
And we also write hand-tuned assembly for each processor Apple supports. We carefully schedule our assembly instructions, and we do software pipelining or loop unrolling, sometimes both, to make sure the maximum parallelism in your code is exposed to the processor, so the processor can execute more than one instruction at a time. And lastly, we multithreaded our code using GCD. So if your system has multiple cores, which most Apple devices do now, you'll be using all of the cores on the system to achieve the maximum performance.
Here's how you link against the Accelerate framework in Xcode. It's very simple. You click on the Build Phases tab right here, and click on the small Add button right after the library list. Xcode is going to show you a list of frameworks that you can link against.
At the top of the list, there's the Accelerate framework. So let's select that and click Add. Now your project is linking against the Accelerate framework. In order to use the APIs in the Accelerate framework, all you need to do is include the Accelerate header in your source file. Then you can start using it.
If you're feeling more adventurous and want to do it on the command line, it's just as easy. You pass the -framework Accelerate option to your compiler, and that's it. Your code will be linking against the Accelerate framework and getting the best performance. Now that you know how to use the Accelerate framework, I'm going to present an FFT case study that we did recently to show you exactly why you want to use it.
FFT stands for Fast Fourier Transform. It is one of the most important signal processing operations out there. It's used in audio compression, it's used in image processing, even used in video. Pretty much everywhere. So if you don't know how to write an FFT, you could use the one in Accelerate Framework or pick up this popular book, Numerical Recipes in C.
It has the algorithm and the implementation details for the FFT. So we're going to compare these two on two metrics. The first one is execution time. Execution time is easy to understand: you want your code to run as fast as possible, so you want the execution time to be as short as possible.
And then we want to compare the energy consumption of using the Accelerate framework and Numerical Recipes in C. I'd like to spend more time on energy consumption, to talk about what the energy consumption of a function is, because there is a common misunderstanding out there: people think power consumption is the same as energy consumption, but that's not true.
In terms of battery life, what you care about is really the energy consumption, not the power. A battery can hold a certain amount of energy, so if your app is more energy efficient, the battery life is going to be longer. So let me put the relation between energy and power into an equation. Energy is the integral of instantaneous power over time. In reality, we cannot get a continuous measurement of instantaneous power, so what we do is use a piecewise summation to approximate the integral. The equation looks complicated, but it's actually very simple.
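Written out, the relation he's describing is simply the following, where P(t) is the instantaneous power and Δt_i is the length of each measurement interval:

```latex
E = \int_{t_0}^{t_1} P(t)\,dt \;\approx\; \sum_i P(t_i)\,\Delta t_i
```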
Let me put this into a graph. This is a typical energy consumption profile that we measure on a system. At the beginning, the system is idle, so it's consuming the idle power, P0. At time T0, the workload comes in, in this case the FFT. The system starts doing something, the processor is processing the data, and the instantaneous power jumps to P1.
And at time T1, the workload finishes and the system comes back to the idle power, P0. There are two things you need to pay attention to in this graph. The first one is the time difference between T0 and T1. That's your execution time, and you want it to be as short as possible.
The second thing is the area under the curve between T0 and T1. That is the energy consumption of this function, and you want it to be as small as possible. So now we know what we're comparing, and we'll use these metrics to compare the two: the Accelerate framework FFT and Numerical Recipes in C. Let's look at the competition. Our competition is Numerical Recipes in C. We asked an average programmer to write an FFT based on the book, so he did a straight-from-the-book implementation. It's about 50 lines of code, and it looked like this.
So this code is not terribly difficult to write. I know everyone in this room is capable of writing one of your own. But there are a couple of things you need to pay attention to. First, you have to be careful about the indexing, and about where to add and where to subtract. And after you get all the signs and the details right, you're not done yet.
You still need to test for accuracy. You want to make sure the error tolerance is within your app's range. And you want to measure for performance, because if you're writing a real-time app, you don't want your app to lose audio samples or lose video frames. And last, you need to document your code. You have to document where you took the algorithm from and how you implemented it.
In our opinion, all this is just too much work. You could use the time to focus on your next project or next big feature, and save the time by using the Accelerate FFT. So here is how you use the Accelerate FFT. There are three simple steps: setup, operate, and destroy. So, of course, you have to include the Accelerate header to have access to all the Accelerate framework APIs.
Then you pass in the data and the FFT length. The FFT length is expressed as a log2 value, so you can see log2n equal to 10; that means we're doing a 1024-point FFT. Once at the start, before you process any of your data, you call vDSP_create_fftsetup. You pass in the FFT length and the radix information; here we have radix 2.
Then you go on to operate on your data. You call vDSP_fft_zip, which does an in-place complex-to-complex FFT. You pass in the setup structure and where the data is, and the 1 in the third argument tells the FFT that the data is in contiguous memory, accessed one element after another.
You also pass the length information and say that you want a forward FFT. You can call vDSP_fft_zip multiple times to handle all your data; you can reuse the same setup structure. Only once at the end, when you're done with all your data, you call vDSP_destroy_fftsetup. This reclaims the memory allocated for the setup structure so you can avoid a memory leak. So, using the Accelerate FFT is really simple.
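Putting those three steps together, a minimal sketch might look like this (the 1024-point size, radix-2 setup, and split-complex buffers are illustrative placeholders):

```c
#include <Accelerate/Accelerate.h>

// Hypothetical 1024-point signal, already in split-complex form.
const vDSP_Length log2n = 10;                // 2^10 = 1024 points
float real[1024], imag[1024];                // fill with your signal...
DSPSplitComplex data = { real, imag };

// Setup: once, before processing any data (length and radix).
FFTSetup setup = vDSP_create_fftsetup(log2n, kFFTRadix2);

// Operate: in-place complex-to-complex forward FFT, stride 1 (contiguous data).
// Call this as many times as you need with the same setup.
vDSP_fft_zip(setup, &data, 1, log2n, kFFTDirection_Forward);

// Destroy: once at the end, to reclaim the setup's memory.
vDSP_destroy_fftsetup(setup);
```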
Let's look at the performance of these two, the Accelerate FFT and Numerical Recipes in C. Here is the energy consumption profile for Numerical Recipes in C, and here is the Accelerate FFT. As you can see, the execution time for the Accelerate FFT is much, much shorter than Numerical Recipes in C. Even though the instantaneous power of the Accelerate FFT is higher than Numerical Recipes in C, the time is so much shorter that the total area under the curve, which is the energy consumption, is much less than Numerical Recipes in C. Let me put this into perspective for you. Let's normalize the performance and energy consumption of Numerical Recipes in C to 1. Here's what the Accelerate FFT looks like.
The Accelerate FFT is more than nine times faster than Numerical Recipes in C, while it only consumes one-eighth of the energy of Numerical Recipes in C. So you can do more work while consuming less power. This is exactly why you want to use the Accelerate framework. Now, enough with the hard numbers. Let me tell you a brief history of the Accelerate framework.
The Accelerate framework has been available on Mac OS X for many years. We have vDSP for signal processing, we have linear algebra libraries, we have image processing libraries, and we have the math functions. Over the years, we've been bringing every component from Mac OS X to iOS. We started with vDSP and linear algebra.
Last year, we added vImage and vForce. It's been a huge success for us. So this year, I'm happy to announce that we added the last piece of the Accelerate framework: vMathLib. So if your code is using the Accelerate framework on Mac OS X, you can safely port it to iOS. All the components are there. We have a complete picture for both iOS and Mac OS X.
Now, since I introduced vMathLib, let's talk about it. vMathLib is a SIMD vector math library; it operates on SIMD vectors. In the Vector and Numerics group, we support math for every data length. For your scalar data, one floating-point input and one floating-point output, we have libm. For your array data, we have vForce. vForce operates on arrays, so it takes an array as input and generates another array as output.
vMathLib is something in between. vMathLib operates on SIMD vectors. In the Accelerate framework, we define the vFloat data type that maps to the vector register in your processor. On most architectures, vFloat has four elements, four single-precision floating-point numbers in the structure. And you'll be using vFloat when you're writing your own vector code.
For those who are not familiar with libm, here are a few words. libm is your standard math library in C. It has a collection of transcendental functions, with the familiar names: expf, logf, sinf, cosf. vForce operates on arrays and uses vv as the function prefix, so you have vvexpf, vvlogf, vvsinf, and so on. vMathLib has only one v as the prefix, so there is vexpf, vlogf, vsinf, and so on.
You want to use vMathLib when you're writing your own vector code. While writing your vector code, you will sometimes want to take the sine or cosine of a value. What do you do in this case? Well, you could use libm to achieve what you want, something like this: you include math.h to use libm, and you write a for loop that takes each element of the input vector and stores the result into the output vector.
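Roughly like this minimal sketch, where the 4-element vectors are stand-ins for your own data:

```c
#include <math.h>

// Take the sine of each element of a 4-element vector using scalar libm calls.
float vx[4] = { 0.0f, 0.5f, 1.0f, 1.5f };
float vy[4];
for (int i = 0; i < 4; i++) {
    vy[i] = sinf(vx[i]);   // one scalar call per element; no vector unit involved
}
```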
There's an obvious problem in this code. You wrote a vector algorithm because you want to take advantage of the performance of the vector units, but libm is not using the vector units, so you're not getting the performance improvement. So how about we use vForce? vForce does use the vector unit, and the code will look like this.
So you include the Accelerate header, and since vForce operates on arrays, you have to take the addresses of your input and output vectors and then tell vForce the length. Well, it works, but it's awkward, obviously. And another thing is that because vForce is designed to work on arrays, it takes a pointer to the array, which involves memory access.
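Something like this sketch, which assumes vFloat is Accelerate's 4-element SIMD float type:

```c
#include <Accelerate/Accelerate.h>

// vForce version: it works, but the vFloat values have to round-trip through
// memory, because vvsinf takes pointers to arrays.
vFloat vx = { 0.0f, 0.5f, 1.0f, 1.5f };
vFloat vy;
const int n = 4;
vvsinf((float *)&vy, (const float *)&vx, &n);
```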
So if your input and output vectors are already in registers, there's no need for the memory access, and we can use vMathLib to achieve that. Here's how you write the code using vMathLib. It's very simple: we call vsinf and pass vx as the input argument, and you'll have the result in another vector, vy. The code is much cleaner, and there is no memory access. You will get the optimum performance.
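A minimal sketch, assuming the vsinf declaration that Accelerate provides for vFloat:

```c
#include <Accelerate/Accelerate.h>

// vMathLib version: vector register in, vector register out, no memory access.
vFloat vx = { 0.0f, 0.5f, 1.0f, 1.5f };
vFloat vy = vsinf(vx);
```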
So that's vMathLib. I briefly mentioned vForce in a previous slide, but without going into too much detail. Now is the time to do so. vForce is a vectorized math library. It operates on arrays, and in addition to the transcendental functions, we have rounding functions, with all four rounding modes supported. And we also have lots of other stuff, like square root, remainder, nextafter, and so on.
Let's say you want to write a signal generator app, and you want to generate a frequency-modulated sine wave. Again, you can use libm to do that. You write a for loop that calls sinf, the libm function, goes through each element of the input array, and stores the result into the output array. That works, but it's not optimal. Let's use vForce for that.
It's very simple to use vForce. You simply replace the for loop with one function call to vvsinf. Pass the addresses of the input buffer and the output buffer and also a pointer to the length, and the frequency-modulated sine wave will be generated in the output buffer. It's just that simple.
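In code, a minimal sketch could look like this (the buffer size and the precomputed phase values are placeholders):

```c
#include <Accelerate/Accelerate.h>

// Generate the sine of every element with a single vForce call.
const int n = 4096;
float phases[4096];   // fill with your frequency-modulated phase values...
float wave[4096];
vvsinf(wave, phases, &n);   // wave[i] = sin(phases[i]) for all n elements
```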
Now, let's look at the performance comparison of the two: one using vForce, the other just using the for loop. As you can see, vForce is more than twice as fast as the simple for loop. At the same time, it consumes less energy; vForce is more than twice as energy efficient as the simple for loop. So again, faster and energy efficient. That's the motto of the Accelerate framework. We didn't just cherry-pick sine to be our example. There are a bunch of other functions in vForce as well: truncation, log, exp, power. Across the board, you can see a typical 2x speedup, so you can safely use vForce and expect great results.
Here's some detail about vForce. vForce supports both single- and double-precision floating-point numbers. It handles edge cases correctly, so if your input has infinities or NaNs, positive zero, or negative zero, you don't have to worry about it. You just pass them to vForce, and vForce will handle those cases for you.
vForce requires only minimal data alignment; we only require the native data alignment. For a single-precision floating-point number, that's four bytes; for a double-precision floating-point number, it's eight bytes. And we also support in-place calculation. You don't have to allocate a temporary buffer to hold the results; we can just operate on your data in place.
And a lot of people ask us this question: "I only have 20 elements, 20 numbers. Is it beneficial to use vForce?" Well, as a rule of thumb, you will see a performance improvement when you have more than 16 elements. Some functions have higher thresholds and some have lower ones, and if you're interested in finding out the exact number, you can just write a simple app to do a test. But as a rule of thumb, if you have more than 16 elements in your array, you're good to go. You can use vForce and expect great results.
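A rough sketch of such a test, comparing vvexpf against a scalar expf loop; the element counts, buffer sizes, and gettimeofday-based timing are just one way to do it, and in practice you would repeat each version many times and average:

```c
#include <Accelerate/Accelerate.h>
#include <math.h>
#include <stdio.h>
#include <sys/time.h>

static double seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Compare a scalar libm loop against the equivalent vForce call for n elements
// (n is assumed to be at most 4096 here).
void compare(int n) {
    float in[4096], outScalar[4096], outVector[4096];
    for (int i = 0; i < n; i++) in[i] = (float)i / n;

    double t0 = seconds();
    for (int i = 0; i < n; i++) outScalar[i] = expf(in[i]);
    double t1 = seconds();

    vvexpf(outVector, in, &n);
    double t2 = seconds();

    printf("n = %4d: scalar %g s, vForce %g s\n", n, t1 - t0, t2 - t1);
}
```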
So that's vForce. We've talked about the math libraries, vMathLib and vForce. Now I'm going to talk about another big block, vDSP. vDSP is our signal processing library. It has pretty much everything you need for signal processing. We have basic operations like add, subtract, and conversion, and we also have discrete Fourier transforms.
The FFT that we saw earlier in the case study is actually part of the discrete Fourier transform support. For discrete Fourier transforms, we support radix 2, radix 3, and radix 5. We also have convolution, if you want to do your own filtering on the signal, and we have correlation, if you want to do signal analysis. They're there for you.
In iOS 6, we added two new features. The first one is the discrete cosine transform, and the other is the biquad IIR filter. These two features were requested by many developers, and we do look at those feature requests. When the timing is right, we'll add them in. So if you find something that you really need, that would be really great, and it's not available in the Accelerate framework, go file a feature request. Don't be afraid. We do look at them, and we're going to work on them when the timing is right. The discrete cosine transform is very similar to the FFT that we saw earlier, so I'm going to take the biquad IIR filter as the example.
Here is a series of N cascaded biquad IIR filter sections. You pass the input through N stages of second-order biquad filters, and you get the output at the end. This is very common in audio processing, so we added it in iOS 6. What we did, basically, is optimize the biquad filter, because there is an inherent feedback loop in a biquad filter that's very hard to optimize. We did the work for you, so you can just use the biquad IIR filter in the Accelerate framework and get great results. Here's an example of how to use it. It is the same three steps: set up, operate, and destroy.
First, include the Accelerate header as usual. We specify that we want a 10-section IIR filter, and for each section the filter requires five coefficients and two delay values; you can think of the delay values as the current state of the filter. And there's input and output. Once, at the start of your program, you call vDSP_biquad_CreateSetup to create the setup structure needed by the operation. You pass in the filter coefficients and the number of sections, and a setup structure is created for you. Then you can call vDSP_biquad multiple times to operate on all your data with the same setup structure, and at the end you call vDSP_biquad_DestroySetup to reclaim the memory that was allocated.
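As a minimal sketch (the section count, coefficient values, and buffer sizes are placeholders; per the vDSP documentation, the delay array holds 2*M+2 values for M sections):

```c
#include <Accelerate/Accelerate.h>

const vDSP_Length sections = 10;
double coeffs[5 * 10];              // five coefficients per section, from your filter design
float  delay[2 * 10 + 2] = { 0 };   // filter state; zeroed before the first call
float  input[4096], output[4096];   // your audio samples...

// Setup: once, at the start.
vDSP_biquad_Setup setup = vDSP_biquad_CreateSetup(coeffs, sections);

// Operate: reuse the same setup (and delay state) for every block of samples.
vDSP_biquad(setup, delay, input, 1, output, 1, 4096);

// Destroy: once at the end, to reclaim the setup's memory.
vDSP_biquad_DestroySetup(setup);
```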
So those are the same three simple steps that you use for both the biquad IIR filter and the FFT. Now, here are a few more things about the data types in vDSP. We support single- and double-precision floating-point numbers, and we also support real and complex numbers, so you can do a real-to-complex FFT or a complex-to-real FFT; those are all supported. We also support strided data access. In the previous examples, I always used a stride of one, meaning the data in memory is contiguous.
When you're getting data from somewhere else and you want to access every second or every third element, we do support that. However, our recommendation is that if you have control over your data, you arrange it in contiguous memory, so we can fully take advantage of the vector unit and give you the best results. If you can't, don't worry about copying the data around to make it contiguous; we can just operate on the data with strided access. So that is vDSP, the signal processing library.
Now, another big block in the Accelerate framework is vImage, the image processing library. We're in the era of digital photography. People love to share photos, and before they share a photo, they often like to do some post-processing on it, like removing red-eye or increasing the saturation. vImage is a great help here. vImage can do a lot of things. It can do convolution, which can achieve various effects, like blur. I'm going to show you how to use convolution to blur right now. Here's the effect.
You can also do geometry: you can rotate and scale your picture, like this. You can do morphology; this one is very interesting, I'm going to turn each snowflake in this picture into a bigger one. There is also alpha compositing, so you can blend two pictures together; I'm going to fade one out and fade another in.
We have transform functions that can change the hue in this picture; I'm going to send this little guy to outer space. And we have histograms. If you take a lot of photos, you're familiar with this: it is the intensity distribution of your RGB channels, like this. And lastly, we have conversion functions to convert your image from one format to another. We added a few things to vImage: in Mountain Lion and iOS 6 we have BGRA and RGBA support, and we now also support 16-bit integer formats.
Because convolution is such a versatile operation, I'd like to say a few words about it too. I showed you blur, but that's not the only thing convolution can do. It can also do edge detection, and you can use different kernels on each color channel to achieve a pretty fancy result, which I'm going to show you right now: here's blur, then edge detection, then a different operation on each color channel.
So it looks very fancy, but it's actually very simple. It's a very simple idea. Convolution is basically a weighted average of nearby pixels. Let's say you have a kernel like this. The center pixel is more heavily weighted than the side or corner pixel. And you have an image like this. You can see the purple pixels form a sharp edge in this image. Let's apply the weighted average on the center pixel. What you get is a lighter purple pixel. Same idea, you apply this kernel to all the pixels in this image.
Basically, the sharp edge formed by the purple pixels is replaced by a softer edge. Essentially, this is a blurring effect. Convolution is this simple, so you might be thinking, "I can write one myself." That's true. It's just nested for loops, four of them nested together, something like the sketch below. However, there are problems in this code.
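A naive scalar version, for a single 8-bit channel, might look roughly like this (this is the straightforward approach the slide describes, not the vImage implementation):

```c
#include <stdint.h>

// Naive convolution: four nested loops. Note what it does NOT do well:
// it skips the edge pixels entirely and does not guard against overflow.
void naive_convolve(const uint8_t *src, uint8_t *dst, int width, int height,
                    const int16_t *kernel, int ksize, int divisor)
{
    int half = ksize / 2;
    for (int y = half; y < height - half; y++) {            // output rows (edges skipped!)
        for (int x = half; x < width - half; x++) {          // output columns
            int sum = 0;
            for (int ky = -half; ky <= half; ky++) {          // kernel rows
                for (int kx = -half; kx <= half; kx++) {      // kernel columns
                    sum += kernel[(ky + half) * ksize + (kx + half)]
                         * src[(y + ky) * width + (x + kx)];
                }
            }
            dst[y * width + x] = (uint8_t)(sum / divisor);    // may wrap instead of clamping
        }
    }
}
```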
First, it does not handle the edge pixels correctly. When you're at the edge, you don't have all the nearby pixels for the kernel, and you need to handle that case. It also doesn't handle the overflow case, so you might see artifacts in your picture due to overflow. And the most important thing here is that it's really, really slow. In our experience, a good convolution implementation takes more than a hundred lines of code, if not more.
Before I show you the performance comparison, let me show you how to use vImage first. It's a simple function call to vImageConvolve_ARGB8888. You pass in the source and the destination, plus a little more information about the kernel, and your convolution result will be ready in the destination.
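As a minimal sketch (the 1024x768 size and the 7x7 box-blur kernel are illustrative; any kernel and edge-handling flag of your choosing works the same way):

```c
#include <Accelerate/Accelerate.h>
#include <stdlib.h>

// Hypothetical 1024x768 ARGB8888 source and destination buffers.
void *srcData  = malloc(1024 * 4 * 768);
void *destData = malloc(1024 * 4 * 768);
vImage_Buffer src  = { srcData,  768, 1024, 1024 * 4 };   // data, height, width, rowBytes
vImage_Buffer dest = { destData, 768, 1024, 1024 * 4 };

// A 7x7 box-blur kernel: equal weights, divisor 49 to renormalize.
int16_t kernel[49];
for (int i = 0; i < 49; i++) kernel[i] = 1;

vImage_Error err = vImageConvolve_ARGB8888(&src, &dest, NULL, 0, 0,
                                           kernel, 7, 7, 49,
                                           NULL,                 // background color (unused here)
                                           kvImageEdgeExtend);   // edge pixels handled for you
```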
Now for the performance result. This is a 7x7 convolution on a 1024x768 image, and vImage is more than 14 times faster than a good scalar implementation. And for energy consumption, it's the same story: it's very energy efficient, consuming less than one-eighth of the energy. So, fast and energy efficient. There are a lot of image formats, and we cannot support all of them, so we classify them into two categories: the core formats, and the rest are the non-core formats. There are four core formats. If you have a single-channel image, each pixel can be an 8-bit integer or a 32-bit floating-point number. We also support two four-channel interleaved formats, ARGB8888 and ARGBFFFF.
Any format that's not among those four is considered a non-core format: BGRA, RGB, GBR, 16-bit unsigned integers, or 16-bit floating-point numbers. The main difference between the core formats and the non-core formats is that vImage operations support the core formats extensively, but not the non-core formats. So you might be wondering: if I have an image in a non-core format and I want to do some operation on it using vImage, can I do that? Yes, of course you can. We have conversion functions to help you. Here is an example: I want to do a scaling operation on a premultiplied PlanarF image.
The core format expects the image to be non-premultiplied. So if you have a premultiplied alpha image, you convert it to the core format first by calling the conversion function vImageUnpremultiplyData_PlanarF; it works in place. Then you can operate on the core format, which is PlanarF.
And at the end, you convert it back to your desired image format. Now, you might be thinking, "Wow, that looks like three times the workload: I have to convert it, operate on it, and then convert it back at the end." Well, I'm going to show you the performance result to tell you that it's okay to do that. As you can see, the conversions at the beginning and at the end take less than 2% of the execution time of that piece of code. The majority of the time is still spent on the operation itself, not on the conversion.
So if you have an image that's in a non-core format, you can go ahead and convert it to a core format, do all the operations you want on the image, and at the end convert it back to the format you desire. You're going to get the performance improvement you want while saving energy.
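A rough sketch of that sequence, unpremultiply, operate, premultiply, assuming the image comes with a separate PlanarF alpha plane and that the buffers shown are placeholders:

```c
#include <Accelerate/Accelerate.h>
#include <stdlib.h>

// Hypothetical premultiplied PlanarF image (1024x768) with its alpha plane,
// scaled down to 512x384.
vImage_Buffer image       = { malloc(768 * 1024 * sizeof(float)), 768, 1024, 1024 * sizeof(float) };
vImage_Buffer alpha       = { malloc(768 * 1024 * sizeof(float)), 768, 1024, 1024 * sizeof(float) };
vImage_Buffer scaled      = { malloc(384 * 512 * sizeof(float)),  384, 512,  512 * sizeof(float) };
vImage_Buffer scaledAlpha = { malloc(384 * 512 * sizeof(float)),  384, 512,  512 * sizeof(float) };

// 1. Convert to the core (non-premultiplied) format, in place.
vImageUnpremultiplyData_PlanarF(&image, &alpha, &image, kvImageNoFlags);

// 2. Operate on the core PlanarF format: scale the color plane and the alpha plane.
vImageScale_PlanarF(&image, &scaled, NULL, kvImageNoFlags);
vImageScale_PlanarF(&alpha, &scaledAlpha, NULL, kvImageNoFlags);

// 3. Convert back to the desired (premultiplied) format at the end.
vImagePremultiplyData_PlanarF(&scaled, &scaledAlpha, &scaled, kvImageNoFlags);
```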
A few more notes about data requirements in vImage. Again, we require only minimal data alignment: for a single-precision floating-point number, that's 4 bytes. And the data is not copied into a container, so you're not copying the entire image from one memory location to another; we're just passing pointers around in the vImage_Buffer. It's very efficient.
So that's vImage. And the last piece is the linear algebra library. I'm going to invite my colleague, Geoff Belter, who will tell us more about linear algebra. Thanks, Luke. So for linear algebra in the Accelerate framework, we've got two great packages: we've got LAPACK, the linear algebra package, and we've got BLAS, the basic linear algebra subprograms. Let's start by taking a look at LAPACK.
Well, LAPACK is the high-level linear algebra functionality. So if you want to solve a system of linear equations, there are going to be APIs to do that in here. Maybe you need to perform a matrix factorization, a QR or an LU factorization; there are APIs for that as well. There's also functionality to compute eigenvalues and eigenvectors. There are several hundred APIs in LAPACK alone, so there's a huge amount of functionality here. It's probably going to have what you need.
Let's take a look at an example of one of the really common uses of LAPACK, and that's solving a system of linear equations. One of the great things about LAPACK is it's got something for everybody. So if you want to do this with a single API, there's a routine that's going to do that for you.
It's going to do the factor and the solve behind the scenes, and it's going to give you your result back. If you want to get your hands dirty a little bit, you can do it the way I've shown up here. We've prepared our matrix in A, and we're going to do the factor first with dgetrf. That factor is done in place, and then we send that factored matrix to dgetrs to perform the solve. We get our result. It's pretty simple, pretty easy.
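A minimal sketch of that factor-then-solve sequence, with a hypothetical 3x3 system as the data:

```c
#include <Accelerate/Accelerate.h>

// Solve A*x = b for a hypothetical 3x3 system, stored column-major as LAPACK expects.
__CLPK_integer n = 3, nrhs = 1, lda = 3, ldb = 3, info = 0;
__CLPK_integer ipiv[3];
double A[9] = { 4, 1, 2,   1, 5, 3,   2, 3, 6 };   // columns of A
double b[3] = { 1, 2, 3 };                          // right-hand side, overwritten with x

// Factor A in place (LU with partial pivoting)...
dgetrf_(&n, &n, A, &lda, ipiv, &info);

// ...then use the factored matrix to perform the solve.
char trans = 'N';
dgetrs_(&trans, &n, &nrhs, A, &lda, ipiv, b, &ldb, &info);
// b now holds the solution x; check info == 0 after each call.
```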
There's a lot that goes on behind the scenes in LAPACK, and a lot of LAPACK is built on the other package, BLAS. Let's take a look at an example of how we use BLAS. The matrix factorization that we just saw is going to spend a lot of time in the matrix-matrix multiply, and that's what I'm showing here. The use case is the same: we prepare our matrices A, B, and C, and we call into cblas_dgemm here. DGEMM stands for general matrix-matrix multiply, and the D prefix means double precision.
BLAS supports both row-major and column-major storage formats, so rows or columns are contiguous in memory, and we need to specify that when we make the function call. It also supports transposing or not transposing the data, so you don't have to manipulate your matrices before you use BLAS or LAPACK. It's all going to be done behind the scenes.
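As a minimal sketch, here's cblas_dgemm computing C = A*B for two hypothetical 2x2 row-major matrices:

```c
#include <Accelerate/Accelerate.h>

double A[4] = { 1, 2,
                3, 4 };
double B[4] = { 5, 6,
                7, 8 };
double C[4] = { 0, 0,
                0, 0 };

// C = 1.0 * A * B + 0.0 * C
cblas_dgemm(CblasRowMajor,    // rows are contiguous in memory
            CblasNoTrans,     // don't transpose A
            CblasNoTrans,     // don't transpose B
            2, 2, 2,          // M, N, K
            1.0, A, 2,        // alpha, A, leading dimension of A
            B, 2,             // B, leading dimension of B
            0.0, C, 2);       // beta, C, leading dimension of C
```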
There's a lot of functionality in BLAS as well. As I mentioned before, LAPACK is built pretty heavily on BLAS, so BLAS tends to be the low-level linear algebra operations. It tends to be categorized into three levels: vector operations, such as dot products, scalar products, and vector sums; matrix-vector operations, such as the matrix-vector product and the outer product; and matrix-matrix operations, such as back solves, rank updates, and the matrix-matrix multiply that we just saw.
Let's take a look at some performance numbers. For the double-precision matrix multiply we just saw an example of, what I've got here is a graph. On the x-axis, I'm showing a range of matrix sizes, from 64 by 64 up through 1,024 by 1,024, and on the y-axis, performance in gigaflops. So we're looking for higher is better here. Let's look at that data.
It quickly climbs to some really great performance. Even in the small sizes, we see we're getting some good performance. By about 500, we've reached a plateau, and we're going to give you that great performance for as big as you want to get. I don't show small sizes here, but even down into 4x4 and 8x8 matrices, we spent a lot of time there, too. You're going to get great performance for a range of matrix sizes.
For those of you familiar with performance numbers, what we're showing here is almost 90 double precision gigaflops. This is really impressive. This is an iMac, so a computer sitting on your desk giving you some really great performance. Don't take my word for it. Let's put this into perspective.
So what I've got here is a comparison similar to what we did in the FFT case study. We've got a straightforward C implementation of the double-precision matrix multiply: a programmer finds the algorithm for matrix multiply in a book and codes it up. We've normalized the performance and the energy to one here, and let's compare that to the DGEMM in the Accelerate framework.
We see huge performance improvements here, 158 times faster. The best part is that we do this with a fraction of the energy, 1/73rd the amount, to give you this really incredible performance improvement. Let's look at some of the details of the data types that BLAS and LAPACK support. They both support single and double precision, and both real and complex numbers.
They also support a range of data layouts. BLAS, as I mentioned, supports row-major and column-major; LAPACK only supports column-major. But both BLAS and LAPACK support dense matrices, general matrices, banded matrices, triangular matrices, even a few packed structures. And they support transposes and conjugate transposes where appropriate, so again, you're not going to have to modify your data before you call into BLAS or LAPACK.
So that's a quick look at what's available in LAPACK, and with that I'm going to turn it back over to Luke to wrap up the talk. Well, those numbers are amazing: 158 times faster. So right now I'm going to give you a quick summary. The Accelerate framework is easy to use. Most of the time, it's just one function call to replace a for loop, or even 50 or 60 lines of code.
And it's accurate, so if your app is using Accelerate Framework, your result is going to be accurate, too. Accelerate Framework is fast with low energy usage, so your app will be responsive and have a long battery life. It's available on both OS X and iOS, so the same code will just work if you want to port from one platform to another.
Here are a few tips for using the Accelerate framework. If you can, you want to arrange your data in contiguous memory and make it 16-byte aligned. We do support data that has strided access or just the native data alignment, but in order to fully take advantage of the vector engine, we prefer the data to be 16-byte aligned and contiguous. I know that sometimes you get the data from somewhere else and have no control over it; that's fine. But if you do, try to make it contiguous and aligned to 16 bytes. You will get the maximum performance improvement in that case.
You also want your data to be large enough; you don't want to make just one tiny vForce call, as that's not going to be effective. And whenever you have to make a setup structure, you only need to do the setup once: perform all the operations you need on all your data, and then destroy the setup at the end to reclaim the allocated memory.
So, just a quick refresher: we have digital signal processing in vDSP, we have image processing in vImage, there's linear algebra in BLAS and LAPACK, and we have transcendental functions in vMathLib and vForce. This is really a comprehensive set of computational functionality, so it is more than likely that you will find something you can use in your app.
So on that note, I would like to say, ladies and gentlemen, let's Accelerate! I encourage everyone here that after the session, go look at your code, find somewhere in your code you could use Accelerate Framework, and give it a try. You will see the performance improvement with your own eyes. If you need more information, here are the contacts. So that's the end of the presentation. Thank you very much.