System Frameworks • iOS, macOS, tvOS, watchOS • 38:43
Learn about Sparse Matrix Solvers in the Accelerate framework. In addition, find out how Accelerate's Basic Neural Network Subroutines (BNNS), vDSP, simd and other subframeworks give you fast, energy-efficient signal and image processing and handle large-scale mathematical computations.
Speakers: Eric Bainville, Steve Canon, Jonathan Hogg
Unlisted on Apple Developer site
Transcript
This transcript has potential transcription errors. We are working on an improved version.
Thank you. And welcome to the Accelerate session. My name is Eric Bainville. I'm with the CoreOS Vector and Numerics Group. Our group maintains the Accelerate framework. And this is our agenda for today. First I will introduce Accelerate -- what's inside, how to use it -- and through a few examples I'll show you why you want to use it. And then we focus on the new improvements and additions we have this year, first with Compression, a lossless data compression library, then BNNS, Basic Neural Network Subroutines.
And after that my colleague Steve will come on stage and tell us what's new in simd. And finally, our very own Jonathan Hogg will come on stage and introduce the sparse matrices package. It's the first time we ship a sparse matrix solver package in the OS. But let's start with Accelerate.
Accelerate is a low-level system framework dedicated to high-performance primitives on the CPU. Actually, it's everywhere when you have a significant workload to run on the CPU. Inside Accelerate there are a lot of libraries, like vImage for image processing -- the Swiss Army knife of image processing. We also have vDSP inside for DFTs, FFTs, and signal processing, and vForce for vector functions.
And then we have a whole lot of linear algebra libraries for dense and sparse matrices. So that's BLAS and LAPACK for dense vectors and matrices. And Sparse BLAS and Sparse Solvers for sparse vectors and matrices. Last year we also introduced BNNS, Basic Neural Network Subroutines, which is also [inaudible] for neural networks.
This is used by, for example, Core ML and also the Vision and NLP frameworks. Slightly outside Accelerate but still maintained by us, we have simd, which is a set of headers and functions allowing you to directly talk to the CPU vector units. And finally, Compression, a lossless data compression library.
How do you use Accelerate? Well, you import Accelerate or you include the Accelerate header, and then link against the Accelerate framework in your project. Now, I will show you a few examples of why you want to use Accelerate. Let's start with this one. Let's say you have this vector X of floating-point values and you want to scale them. So you have this scale factor, and you multiply each value and store it in the Y vector. It's a very simple loop. That's perfectly fine. But actually, there is a function for you in Accelerate doing the same thing; it's called vsmul.
So that's one single line replacing your code. And the good thing is that we optimize it for you on all the supported hardware. And of course, we maintain it for you, so you don't have your own loop to maintain anymore. And, of course, it's faster -- it's Accelerate, after all. So that's the reference speed and energy consumption for the loop, and this is what you get with Accelerate: six times faster.
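For reference, here's a minimal Swift sketch of the two versions side by side; the array contents and scale factor are just illustrative.

```swift
import Accelerate

// Scale every element of x by `scale`; the data here is illustrative.
let x: [Float] = [1, 2, 3, 4]
var scale: Float = 0.5
var y = [Float](repeating: 0, count: x.count)

// The plain loop:
for i in 0..<x.count { y[i] = x[i] * scale }

// The same thing with vDSP (stride 1 on both input and output):
vDSP_vsmul(x, 1, &scale, &y, 1, vDSP_Length(x.count))
```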
[ Applause ]
And six times less energy consumed, which is very important. Let me give you another one. So this time we still have this array X, and we want to clip the values between the low and high bound and store that in Y. Again, you could go with this code.
That's perfectly fine. That's good. But we have a function for you in vDSP; it's called vclip. Does the same thing. And, again, we maintain it for you. We optimize it for you. And, of course, it's faster. That's a reference for the loop. And this is what you get with Accelerate, eight times faster and also eight times less energy consumed.
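Here's a minimal Swift sketch of the vDSP version; the bounds and data are illustrative.

```swift
import Accelerate

// Clip every element of x to the range [low, high]; the data is illustrative.
let x: [Float] = [-3, 0.2, 0.8, 5]
var low: Float = 0
var high: Float = 1
var y = [Float](repeating: 0, count: x.count)

vDSP_vclip(x, 1, &low, &high, &y, 1, vDSP_Length(x.count))
// y is now [0.0, 0.2, 0.8, 1.0]
```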
Another one: matrices. Let's say you have two matrices A and B, and you want to compute the product of A and B and add the result into the C matrix. It may not sound simple, but actually it's very simple -- just three lines of code, as you can see. And we have, of course, a function for you in Accelerate doing that; it's called sgemm, from the CBLAS interface.
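As a rough Swift sketch of what that call looks like, with tiny illustrative matrices in row-major storage:

```swift
import Accelerate

// Compute C = A*B + C with CBLAS; the matrices here are tiny and illustrative.
let m = 2, n = 2, k = 2
let a: [Float] = [1, 2,
                  3, 4]            // m x k
let b: [Float] = [5, 6,
                  7, 8]            // k x n
var c: [Float] = [1, 0,
                  0, 1]            // m x n, the product is accumulated into this

cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            Int32(m), Int32(n), Int32(k),
            1.0,                   // alpha
            a, Int32(k),           // A and its leading dimension
            b, Int32(n),           // B and its leading dimension
            1.0,                   // beta = 1 adds A*B into the existing C
            &c, Int32(n))          // C and its leading dimension
```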
And really, you never want to write code computing matrix-vector products or matrix-matrix products or anything related to matrices. You don't want to write that code, ever. You want to call BLAS, LAPACK, or anything in Accelerate. Why? Well, first because we maintain that for you, and it will be optimized and multithreaded on all the architectures we support. And this time -- this is the reference for the loop -- this is what you get with Accelerate. I'm not sure you can see it.
Yeah. That's 100 times faster and 26 times more energy efficient. That's your battery here. Okay. One more example, this time in vImage. So let's say you have a 32-bit per pixel image with four components per pixel. That's alpha, red, green, and blue. And you want to apply a 4 by 4 transformation matrix to every pixel in the image.
You could write the code yourself, but it would be too long to show here. And really, you don't want to write this code. We have a function for you in vImage, MatrixMultiply -- it's one of the most used functions in vImage -- doing exactly that, and optimized to run as fast as it can on all the architectures we support.
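A minimal Swift sketch of that call might look like the following; the buffers are assumed to be set up elsewhere, and the matrix shown is just an identity transform.

```swift
import Accelerate

// Apply a 4x4 matrix to every ARGB8888 pixel. For 8-bit data the matrix is
// integer, and each result is divided by `divisor`. The buffers are assumed
// to be initialized elsewhere; the identity matrix below is illustrative.
var src = vImage_Buffer()    // assume: filled with the source image
var dest = vImage_Buffer()   // assume: allocated with the same dimensions

let divisor: Int32 = 256
let matrix: [Int16] = [256,   0,   0,   0,    // rows for A, R, G, B
                         0, 256,   0,   0,
                         0,   0, 256,   0,
                         0,   0,   0, 256]

let error = vImageMatrixMultiply_ARGB8888(&src, &dest, matrix, divisor,
                                          nil, nil, vImage_Flags(kvImageNoFlags))
```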
Last one: the convolution layer. That's the workhorse of convolutional neural networks. This layer takes an input image stack -- the stack of images on the left -- and produces an output image stack, the blue thing. Each pixel in the output is the result of a 3D convolution over the entire input stack, and we do that for every pixel of every output image. So at the end, that's a six-dimensional loop. And you really don't want to write this loop.
And even when the dimensions are small, you're multiplying all of them together, so that's millions or even billions of floating-point operations here. Of course, we have a function for you, this time in BNNS, doing that. And when you run a Core ML model, you will spend something like 80% of the time in this function.
All right. So that was a few examples. I could continue almost forever, because we have more than 2,800 APIs in Accelerate, so usually there will be a function for you inside. And every time you use an Accelerate function, the benefits are: it's less code for you to write; we maintain it for you; of course, it will be faster and more energy efficient; and it will be as close to optimal as possible on all the architectures we support, including the new ones. So when we release new hardware, your code will run as fast as it can from day one. Okay. That was it for Accelerate. Now let's focus on Compression.
Compression is a lossless data compression library. It's a very simple API with a few select compressors inside. They are represented on this little graph here. On the x-axis you see the relative compression ratio compared to ZLIB; ZLIB is in the center. And on the y-axis is the compression speed -- note that it's an exponential scale.
So inside the compression library we have a selection of compressors, as I said. We offer LZMA for better compression, optimized versions of LZ4 for fast compression. Of course we have ZLIB with the optimized ZLIB decoder, and our very own LZFSE, which compresses slightly more than ZLIB but much faster. And last year we open sourced LZFSE; it's on GitHub.
Okay, the API now. There are two APIs in Compression. The first one is a buffer API. That's when you have the entire data to compress: you have a buffer with the data to compress, you just call one function, provide the output buffer, and you get the output in one single call. That works for both encode and decode.
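For example, here's a minimal Swift sketch of the buffer API using LZFSE; the input data and output buffer size are illustrative.

```swift
import Compression

// One-shot compression of an in-memory buffer with LZFSE.
let input = Array("an illustrative piece of input data".utf8)
var output = [UInt8](repeating: 0, count: input.count + 64)   // some slack for small inputs

let compressedSize = compression_encode_buffer(&output, output.count,
                                               input, input.count,
                                               nil, COMPRESSION_LZFSE)
// A return value of 0 means the destination buffer was too small.
```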
And if the data is huge or you get it in small pieces, you want to use a stream API. In that case you will create a stream object and send data inside and get smaller data outside. And you will call that repeatedly until the entire stream is processed.
And here's what's new: we added a command-line compression tool, so you can compress with Compression from the command line. Okay, that's it for Compression. Now let's switch to BNNS, Basic Neural Network Subroutines. As I said, this is the engine running on the CPU and supporting all the neural network and machine learning libraries. So you use BNNS all the time: when you type on the keyboard or when you run face recognition, all these applications use BNNS. And BNNS provides low-level functions for this -- that's convolution layers, pooling layers.
We also have fully connected layers. And this year we added separate activation layers doing just the activation function. These layers can also do efficient data type conversions. Speaking of which: data types. When you train a machine learning model, what you get is megabytes or hundreds of megabytes of data, usually 32-bit floating-point data, for your model.
That's the convolution weights, etc. It turns out you can convert these guys into smaller types, like 16-bit floating point or even 8-bit integer signed or unsigned and still get the same precision for your model. But, of course, when you convert 32-bit floating point to 8-bit, your model is four times smaller.
And that's something you ship with your app, so you want to consider that. This year we optimized BNNS to support these types. This is what we support now, optimized, in the convolution layers. The green entries are new: you can see we added fp16 storage for inputs and weights, and also int8.
And this is what we support for the fully connected layers. So we're still accumulating to 32 bits, but we can take 16-bit or even 8-bit inputs and weights. Now, for the activation functions, this is what we had last year. And this year we added a few more, including the most requested, Softmax. So now we have an optimized Softmax in BNNS. And if you set the activation function to identity, then you can change the input and output types to different combinations. This is what we support now. And you will get optimized type conversion from BNNS.
Last but not least, performance. We worked a lot with the Core ML team and the Vision and NLP teams to optimize what they really use a lot. That includes convolutions with padding, stride-1 and stride-2 convolutions, and smaller kernels, because the new neural networks have a lot of layers with small convolutions -- 3 by 3 and 1 by 1.
So we optimized these cases. And especially for the 3 by 3 case, we have Winograd convolutions, which can be up to four times faster than the reference implementation. And that's it for BNNS. So let me invite Steve on stage, and he will tell us everything about simd. Thank you.
[ Applause ]
Thanks very much, Eric. Thank you, everyone. As Eric said, my name's Steve, and today I'm going to talk to you a little bit about the simd module. I don't think I'll get to cover everything, but we'll cover some of it. The simd module lives outside of Accelerate. It's a small collection of headers that you include in C, and a module you import in Swift.
And there's sort of three main use cases that are going to drive you to use simd. The first is if you're doing 2 by 2, 3 by 3, 4 by 4 vector and matrix arithmetic, the sort of thing that comes up all the time when doing graphics or geometry operations.
It also provides a bigger set of vector types, both integer vectors and floating-point vectors of lengths up to 64 bytes, for doing general vector programming. It lets you target all the architectures we support on all the platforms we support pretty easily, and you can write one piece of code that gets you good vector code gen for all of those architectures.
The last reason you'll use simd is that it's a great set of types and operations for interoperating between all of the various things that do 3 by 3, 4 by 4 operations on the platform. So things like SceneKit, SpriteKit, ARKit, Vision -- in all those different things you have lots of matrices and vectors flying around.
And the simd types are a great set of things to use with that. I should say that SceneKit in particular added a bunch of new stuff this year, so check out their session; there's some nice stuff for working with simd there. I'm going to show you a few examples of what you can do.
So let's say you want to multiply a three-dimensional matrix by a vector. You can do that using BLAS, which Eric talked about earlier; it looks like this. And this is fine, but BLAS takes all the parameters just as raw pointers, so we have to tell it these are the dimensions of the matrix, these are the dimensions of the vector, this is how the memory is laid out.
There's a lot of other information we have to pass, which makes the code a little harder to write and harder to read. If you're not fluent with BLAS already, it's not really obvious that this last line here is doing a matrix-vector product. So we'd like to have something simpler than that.
We could also write this with GLKit, and GLKit makes it considerably nicer. We have dedicated types for a three-dimensional matrix and a three-dimensional vector, and we call this GLKMatrix3MultiplyVector3 function. It's pretty clear that's a multiplication. But we can make this even nicer using simd. This is what it looks like in simd -- totally explicit that it's a diagonal matrix.
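In Swift, a minimal sketch along those lines might look like this; the values are illustrative, not the slide's.

```swift
import simd

// A diagonal 3x3 matrix times a 3-element vector.
let m = float3x3(diagonal: SIMD3<Float>(2, 2, 2))
let v = SIMD3<Float>(1, 2, 3)
let result = m * v        // matrix-vector product with the * operator
```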
And multiplying a matrix by a vector just uses the multiplication operator. It's really nice, it's really simple, and it's also really fast. All of simd, for the most part, is implemented as header inlines, so this is just going to give me three vector multiplications on my CPU. It's really fast: there's no call overhead, there's no parameter checking, there's nothing. I just get nice, simple code gen from this. So that's a Swift example. The next example I'm going to show you will be in C.
Here I'm going to show you an example of how you can use simd to write vector code. So let's say that we want to compute a logistic curve with a given midpoint and maximum slope. This is a function that comes up all the time in sort of computational mathematics, especially machine learning. So this is a useful function to be able to optimize. We care a lot about this. And what I've put in the comment in the body of the function here is sort of a typical scalar implementation, just mathematically what this looks like.
But we want to compute this on 16 floating-point values simultaneously, because we can get better efficiency that way. That's what this simd_float16 type in the function is -- it's just a vector of 16 floats. And we're going to see if we can implement the body of this function.
So vector code is complicated. I just break this apart into three pieces so we can write each one of them individually. Let's start with this first section, the linear section. So we're just subtracting a scalar from a vector. We're going to subtract it from every lane of the vector, and then we're going to multiply by a scalar. What does that look like in simd? We just subtract and we multiply. It's really, really simple. This works in C, it works in Swift, it works in C++, it works in Objective-C. It's really nice. It looks a lot like shader programming if you've done any of that.
And this last piece down here, where we take a reciprocal -- again, the same thing; we just write code that looks like the math. It looks just like the scalar code, it looks just like the math we're doing. What about this middle section? That's a little bit more complicated. What we want to do here is, for every element in the vector, compute the exponential function of that element and put it in the corresponding element of the result vector.
We can do that with a for loop, and this is fine, this works. But we have a nice new feature for you this year, which is that basically all of the math functions just work on vectors of floats and doubles -- any length of vector, float or double. You can call exp, you can call sine, you can call cosine, whatever. The math functions just work on them.
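Here's a rough Swift sketch of the same idea, assuming a SIMD16<Float> (simd_float16) argument; the exponential is applied lane by lane here, while in C the simd math headers now take whole vectors, as just described.

```swift
import simd
import Foundation

// Logistic curve evaluated on 16 lanes at once; a sketch, not the session's code.
func logistic(_ x: SIMD16<Float>, midpoint x0: Float, slope k: Float) -> SIMD16<Float> {
    // Linear section: subtract the midpoint from every lane, then scale.
    let t = (x - x0) * -k
    // Exponential of each lane (the C headers also offer a vector exp).
    var e = SIMD16<Float>()
    for i in 0..<t.scalarCount { e[i] = exp(t[i]) }
    // Reciprocal section: 1 / (1 + e), again on all lanes at once.
    return 1 / (1 + e)
}
```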
It's a really nice convenience feature when you're writing this kind of code. And this is going to target the NEON extensions when I compile for ARM, and when I compile for Intel, it's going to target AVX and SSE. So I'm going to get fast code on all the platforms.
This is really nice. We have one other big new feature this year that a lot of people have asked for, which is quaternions. I'm going to give you a real quick introduction to them. Quaternions extend the complex numbers in the same way that the complex numbers extend the reals.
So a complex number, you might remember from school, has a real part and an imaginary part. Quaternions have a real part and three imaginary parts; sometimes we call that the vector part. My mom always said that if having one square root of minus one is good, having three square roots of minus one must be great. She was a smart lady. You should listen to her.
There's a lot of fascinating mathematical structure about the quaternions -- we don't really care about that. We're interested in one thing. Quaternions have a notion of length that's just like the complex numbers. You sum the squares of the components and you take the square root, that's the length of a quaternion. Quaternions of length one, we call those unit quaternions, and they have this really nice property, which is that you can use them to represent rotations in three-dimensional space. That's all we care about. Forget about all the other math that I just mentioned.
So I'll show you a quick code example of that. Say we have a vector in the xy-plane, and let's build a quaternion that represents a rotation around the y-axis. Quaternions act on vectors -- that's the simd act function. It's not multiplication: when you rotate a vector by a matrix, you just multiply it, but with quaternions it's called an action. We don't care too much about the details of that, but that's why this is the act function.
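A minimal Swift sketch of that; the angle and vector values are illustrative.

```swift
import simd

// Rotate a vector in the xy-plane by 90 degrees about the y-axis.
let v = SIMD3<Float>(1, 1, 0)
let q = simd_quatf(angle: .pi / 2, axis: SIMD3<Float>(0, 1, 0))
let rotated = q.act(v)    // quaternions *act* on vectors rather than multiplying them
```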
So this is a nice way to represent rotations. There's a lot of other ways to represent rotations. Why do you want to use this one and when do you want to use this one? You know, you might use matrices instead, you might use Euler angles or yaw/pitch/roll, you can use axis and angle representation. There's lots of choices. Quaternions are an especially good choice for a certain class of operations that I'm going to tell you about.
The first nice thing about quaternions is that they take less memory than matrices do. That's nice, but it's not really why we want to use them. They are not especially fast for applying a single rotation -- you can see here they're about a third the speed of using matrices to actually compute a rotation. The better thing about quaternions is that when you want to do a sequence of operations -- when you want to combine rotations or interpolate rotations, anything like that -- they're the most natural setting to do those operations in. So when we want to multiply two rotations together, you can see quaternions are about 30% faster than matrices. But they also let us do things that are hard to do with matrices. So let's say we want to interpolate between two different rotated coordinate frames. That's a little subtle with matrices, but it's really natural with quaternions.
And the reason it's natural has to do with this sphere that I've drawn over on the side of the screen here. You might say, "Well, why are you drawing rotations on a sphere?" There's a good reason for that. Quaternions you can think of as points on the four-dimensional projective sphere.
And that sounds complicated and mathematical, and it is, but it's the natural space of rotations in three dimensions. So when I do operations on the surface of this sphere, that corresponds exactly to your natural intuition for what should happen with rotations. So for instance, if we want to interpolate between two rotations, we just interpolate along a great circle on the sphere -- that's the simd slerp function; slerp stands for spherical linear interpolation.
And that's really easy to use and it does exactly what you want. If we have a whole sequence of points that we want to interpolate between, we could call slerp repeatedly to interpolate between them, but that will have a noticeable jump in the rotation when you get to the corners.
Instead we can use the simd spline function to get a completely smooth interpolation through a whole sequence of rotated coordinate frames. There are a ton of other operations like this in the simd headers that you can use to do these kinds of operations. If you want to work with rotations, the API is really pretty full. I encourage you to check it out.
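For instance, here's a small Swift sketch; the rotations are made up for illustration.

```swift
import simd

// Two rotations about the y-axis to interpolate between.
let q0 = simd_quatf(angle: 0,           axis: SIMD3<Float>(0, 1, 0))
let q1 = simd_quatf(angle: .pi / 2,     axis: SIMD3<Float>(0, 1, 0))

// Halfway between q0 and q1 along the great circle.
let halfway = simd_slerp(q0, q1, 0.5)

// For a longer sequence, simd_spline interpolates smoothly between q1 and q2,
// using q0 and q3 as the surrounding control points.
let q2 = simd_quatf(angle: .pi,         axis: SIMD3<Float>(0, 1, 0))
let q3 = simd_quatf(angle: 3 * .pi / 2, axis: SIMD3<Float>(0, 1, 0))
let smooth = simd_spline(q0, q1, q2, q3, 0.5)
```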
And as I said, this was one of the most requested features from developers. So I encourage you to file bugs to say, "Hey, can you guys add this other thing as well?" We're really responsive to that. We love to get feature requests from our users. With that, I'm going to turn you over to Jonathan Hogg, who's going to tell you all about really, really big matrices.
[ Applause ]
Hi. So you've heard from Steve about simd, which is really great for very, very small matrices, and I'm going to tell you in a little while about very big matrices. But first I'm going to tell you about BLAS and LAPACK. These are libraries for handling dense matrices -- you can get up to 30,000 or 40,000 rows or columns here on a MacBook.
What do we have? So BLAS stands for Basic Linear Algebra Subprograms, and these do basic operations on matrices and vectors. We have BLAS 1, which does vector-vector operations, and then we move through BLAS 2 for matrix-vector, to BLAS 3 for matrix-matrix operations. And you've already seen from Eric that we can be 100 times faster on a matrix-matrix multiply than your simple loop. If you want to do things more complicated than this, then we have LAPACK. That does your matrix factorizations, your linear solves, finding eigenvalues, eigenvectors, SVDs -- pretty much everything you want to do.
That's all I'm going to tell you about dense matrices, because we actually want to talk about sparse matrices. What is a sparse matrix? James Wilkinson was one of the founding fathers of computational linear algebra, and this was his definition: a sparse matrix is any matrix where exploiting the zeros is useful to us. Let's see what a sparse matrix actually looks like.
So here are two sparse matrices. One's actually the Cholesky factorization of the other. And each pixel here represents multiple non-zeros. Where they are white, all of the entries behind that pixel are zero. Where there's blue, that means at least one non-zero's present. So you can see these matrices are mostly empty.
In fact, if you were to store this as a dense matrix, it's about 30,000 by 30,000. So it takes us 6.5 gigabytes. If we store it as a sparse matrix, we require 260 times less storage -- only 26 megabytes. If we wanted to multiply this matrix by a vector, we'd require almost 200 times fewer floating point operations. But if we want to do something more complicated like factorize this matrix, things get better, at least in the floating point. We require 2,000 times fewer floating point operations to factorize this matrix.
And while the dense matrix is still the same size, the factor has filled in a bit -- it's slightly less sparse than it was -- so we're only 30 times better on the storage. Now, to drive that point home, we set up a little race. We decided we'd run the sparse solver on a Watch and put it up against our best dense matrix solver from LAPACK on a MacBook Air.
And this is what happened. Now, this is running at five times real time. And you can see that the Watch is finished in only 16 seconds. You got to remember while watching this that the floating point throughput on the MacBook Air is about 50 times that available on the Watch.
[ Laughter ]
Now, there are actually two phases to this factorization in the sparse world. First we find the positions of the non-zero entries -- that's the symbolic phase. Then we have a numeric phase, which calculates the actual values. These times are just for the numeric phase, but the symbolic phase only takes about two seconds on the Watch.
So rather than 16 seconds, you could say it takes about 18 seconds. But if you're doing more than one factorization with the same pattern, you can skip the symbolic phase on the second and third factorizations. Even if you include it, we're still more than 10 times faster than the MacBook Air doing the dense computation.
Hopefully I've now convinced you that it's worth using sparse matrices. So let's tell you how to actually define one. So here is a very, very small sparse matrix. You'll notice it's missing entries -- those are zero. We are going to store this using a standard format called compressed sparse column, which uses these three arrays.
And we'll start off with the rowIndices array. So let's put the row numbers onto our matrix and just copy those up into the rowIndices array. And let's do something similar for the values. You can see we have a one-to-one correspondence here: that first entry is in row zero and has value two.
So let's put down the positions of those entries in the rowIndices and values arrays. The rule in compressed sparse column is that all these values have to occur in order of increasing column: all entries in column zero occur before those from column one. And then we get to the trick of this format, which is that we only store the position of the first entry in each column, plus one additional piece of information, the total number of entries in the matrix. That means we know how long that last column is.
If you're already using a sparse solver, you probably have your data in either this format or a coordinate format, and we provide a converter. To use it in Accelerate, we need to wrap it in some metadata, just telling it how many rows and how many columns, and we're going to say this one is an ordinary sparse matrix. I've got an example of a non-ordinary sparse matrix in a couple of slides.
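As a rough Swift sketch of that wrapping step -- the matrix values below are made up, not the slide's, and the arrays need to stay alive while the matrix is in use:

```swift
import Accelerate

// A 3x3 ordinary sparse matrix in compressed-sparse-column form:
// entries (0,0)=2.0, (1,0)=1.0, (1,1)=3.0, (2,2)=4.0.
var rowIndices: [Int32] = [0, 1,  1,  2]
var values:    [Double] = [2, 1,  3,  4]
// One start per column, plus the total number of entries at the end.
var columnStarts        = [0,     2,  3, 4]

let structure = SparseMatrixStructure(rowCount: 3, columnCount: 3,
                                      columnStarts: &columnStarts,
                                      rowIndices: &rowIndices,
                                      attributes: SparseAttributes_t(),  // ordinary matrix
                                      blockSize: 1)
let A = SparseMatrix_Double(structure: structure, data: &values)
```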
Now I've got my sparse matrix, what can I do with it? So you can do pretty much anything you'd expect. You can multiply a dense vector or a dense matrix by it; you can add two sparse matrices or sparse vectors together; you can permute the rows or columns; or you can find various useful matrix norms. All that functionality is provided by the Sparse BLAS, which we introduced a few years ago.
So what's new this time? The ability to solve sparse systems -- that is, given a matrix equation A times X equals B, where we know the matrix A and the right-hand side B, find the vector of unknowns X. We've got two approaches to this for you. The first is matrix factorization. This is exactly what happens in LAPACK: it's simple, it's accurate, it's easy to use. But mathematicians being mathematicians, they came up with a more complicated way of doing things.
That's iterative methods, and I'll tell you a bit more about those later. So, first, matrix factorization. For those of you who haven't come across this before, we want to take our green matrix on the left here and factorize it into the product of two triangular matrices on the right.
That's because we know how to solve a system with a triangular matrix very well. If we're not square, we have to do things slightly differently: we get an orthogonal factor and a triangular factor here. This is the QR factorization, if you've heard of that before. So let's see how to actually do this. Here is a sparse matrix equation.
And let's define that matrix to begin with. This is going to be very similar to what I just showed you, except this matrix is special: it's symmetric. That means the lower triangle is just the mirror reflection of the upper triangle, so we can take advantage of that and only store the lower-triangle entries.
Wrap it in that metadata and this time we're going to specify that this is not ordinary, this is a symmetric matrix. And we're going to tell it that we're passing the lower triangle. We could pass the upper triangle if we wanted, we've chosen the lower triangle here. So we got our matrix. Next let's look at that right-hand side. So this is a dense vector.
Let's just have a simple array, wrap it in a little bit of metadata telling us how long it is, and that's how easy this is. Which gets us to the interesting part: how do we actually find that vector X? Let's define some storage to put the answer in. This is exactly the same as the dense vector B we just saw, except we don't have to supply any values.
Then we're going to call SparseFactor. And I know this matrix is positive definite; therefore, I can tell it to use a Cholesky factorization. I've got a flowchart for you in a couple of slides which tells you how to pick a factorization to use, but here we're using Cholesky. That gets us an L times L-transpose factorization, which we then feed into SparseSolve, passing that right-hand side and the storage we specified. And we get our answer.
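Putting those pieces together, here's a condensed Swift sketch of the whole flow, with a made-up 2 by 2 symmetric positive definite system standing in for the slide's.

```swift
import Accelerate

// Lower triangle of a small symmetric positive definite matrix:
// [ 4 1 ]
// [ 1 3 ]
var rowIndices: [Int32] = [0, 1,  1]
var values:    [Double] = [4, 1,  3]
var columnStarts        = [0,     2, 3]

var attributes = SparseAttributes_t()
attributes.triangle = SparseLowerTriangle
attributes.kind = SparseSymmetric

let structure = SparseMatrixStructure(rowCount: 2, columnCount: 2,
                                      columnStarts: &columnStarts,
                                      rowIndices: &rowIndices,
                                      attributes: attributes, blockSize: 1)
let A = SparseMatrix_Double(structure: structure, data: &values)

// The right-hand side b, and storage for the unknowns x.
var bValues: [Double] = [1, 2]
var xValues = [Double](repeating: 0, count: 2)
let b = DenseVector_Double(count: 2, data: &bValues)
let x = DenseVector_Double(count: 2, data: &xValues)

// Factor (Cholesky, since A is positive definite), solve, then clean up.
let factorization = SparseFactor(SparseFactorizationCholesky, A)
SparseSolve(factorization, b, x)
SparseCleanup(factorization)
```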
We can put that back into the equation, and we can see this is correct. So what if A is not square? Well, this is where we have to use that QR factorization I mentioned before. But we've got two different cases here; it's not entirely simple. We can be overdetermined, that is, we have more rows than columns. Unless you've picked a very special system here, that probably means there's no exactly correct answer. In fact, you're in this sort of situation: put a straight line through these four points.
Clearly that's impossible, but if you remember back to your school days, you probably did some least-squares fitting: we pick the line which minimizes the sum of the squares of the errors. That's exactly what we do in this case. Remember, we want to solve A times X equals B, so the error is A times X minus B. We take the two-norm of that, which is effectively the sum of the squares of the errors in this example, and minimize it.
If we're underdetermined, that is, we have more columns than rows, we're in a slightly different situation. It's equivalent to saying put a line through this point. Obviously, there's an infinite family of lines which goes through that point. So how do we pick one to return to you? We give you the solution with minimum norm. Let's look at that on a code slide.
Here it's very, very similar to that Cholesky factorization we saw before. In fact, the only difference is that we say to use a QR factorization rather than Cholesky. And this will automatically pick whether to do the least squares or the minimum norm depending on the dimensions of your matrix. And I told you that we had a flowchart for you on how to decide which factorization to use. So the first question you have to ask is: Is your matrix symmetric? If it isn't, you have to use the QR factorization.
But if it is, we have another question for you: is your matrix positive definite? Now, if you don't know the answer, or if you're sure it isn't, you can do a symmetric indefinite factorization, LDL-transpose. But if you know that extra bit of information -- that you've got a positive definite matrix -- you can use the Cholesky factorization, L times L-transpose.
And that's all we have to tell you on matrix factorizations. Now, I said there was this other technique, iterative methods. So what is an iterative method? Well, we pick a starting point, our best guess at the solution before we start. And this can be zero if you don't have any idea what the actual answer is going to look like. And we want to get within some small radius of our actual solution. And the way we do that is we iterate through a series of points which converge to that solution. Now, there's a couple of caveats with using these.
Typically they're only going to be faster than that matrix factorization approach if you've got a really, really, really big sparse matrix. And further, to actually get to that performance, you need to know a bit mathematically about your problem. You need something called a preconditioner, which is a very approximate solution. And if you check the literature for your field, you'll probably find quite a number have been derived by mathematicians.
What does this actually look like to use? So here's that matrix equation we had earlier. This time I'm going to use an iterative method to solve it; in fact, I'm going to use conjugate gradients. This matrix is positive definite, so we just specify the conjugate gradient method. And you'll notice there are some brackets after this.
That's actually because this is a factory function which produces the method, and you can specify method-specific parameters in those brackets. The other thing I'm going to do is use a diagonal preconditioner. This matrix is diagonally dominant -- the entries down the diagonal are very large compared to those off the diagonal -- therefore, I know this diagonal preconditioner will work very well.
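A condensed Swift sketch of that call; the small, diagonally dominant system below is made up for illustration.

```swift
import Accelerate

// A small symmetric, diagonally dominant system, stored as its lower triangle:
// [ 10  1 ]
// [  1 10 ]
var rowIndices: [Int32] = [0, 1,   1]
var values:    [Double] = [10, 1, 10]
var columnStarts        = [0,      2, 3]

var attributes = SparseAttributes_t()
attributes.triangle = SparseLowerTriangle
attributes.kind = SparseSymmetric

let structure = SparseMatrixStructure(rowCount: 2, columnCount: 2,
                                      columnStarts: &columnStarts,
                                      rowIndices: &rowIndices,
                                      attributes: attributes, blockSize: 1)
let A = SparseMatrix_Double(structure: structure, data: &values)

var bValues: [Double] = [1, 2]
var xValues = [Double](repeating: 0, count: 2)
let b = DenseVector_Double(count: 2, data: &bValues)
let x = DenseVector_Double(count: 2, data: &xValues)

// Conjugate gradient -- the brackets are the factory call -- with a diagonal
// preconditioner, which suits a diagonally dominant matrix.
let status = SparseSolve(SparseConjugateGradient(), A, b, x,
                         SparsePreconditionerDiagonal)
```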
And indeed, if we look at the output of the algorithm, you can see that this error, A times X minus B, is decreasing each iteration, and we get to machine precision in four iterations. That's because it's a 4 by 4 matrix: mathematically, we should converge in at most N iterations, where N is the size of the matrix. So this is behaving as expected. But if you've got a much larger matrix, you probably don't want to run that many iterations, which is why you accept an approximate solution.
And you can see you get the same answer as before. Now, let's say we want to solve the least-squares problem instead. We offer an iterative least-squares solver. We don't offer an underdetermined system solver, however; in that case you can just pick a [inaudible] of zeros and call the solver on the square system instead.
And to use this, we use the LSMR method and a slightly different preconditioner, which is, again, problem-specific. And you can see that we get there in three iterations this time, because this is a 4 by 3 matrix. But there are some very cool things about this particular way of doing things.
The first is that I don't actually need my matrix explicitly. As long as I have a function which performs the mathematical operation A times X or A-transpose times X -- that is, I have an operator -- I can substitute a block of code in place of the actual matrix argument. The second is that you're not restricted to using our preconditioners: you can write your own and just provide a function pointer in this argument.
Now, you're probably saying, "How do I know which iterative method to use?" I've got another one of those flowcharts for you. This time our first question is not whether you are symmetric but whether you are square. If you're not square, you're going to have to do a least square solve.
However, if you are, we go straight to that question are you positive definite? If you're not, we have GMRES -- that will handle pretty much any square matrix. But if you know that extra bit of information, you can, of course, use the famous conjugate gradient method. Now, that's everything I have to tell you today about sparse matrices.
So we've got one thing we want to point out, and that is that you can now use Accelerate on the Watch -- we have provided it in the SDK. Now, the framework has always been there, and it's even better: using today's SDK you can back-deploy to previous watchOS versions. So that means you get everything we've told you about today on the Watch.
So let's just summarize what that is. By using Accelerate, your code will run faster, it will be more energy efficient, it will run across all our devices, and at the end of the day you have less code to maintain. You get everything we've told you about today -- the sparse solver library, the new compression tool, the changes to BNNS, the improvements to simd, and more -- and increased performance across the framework.
So if you want some more information, including some extensive sample code we've developed for the sparse solver, it's all available here. And you may be interested in reviewing some of these sessions, which have already happened, or going to the Metal session this afternoon. Thank you for your time.
[ Applause ]