2026 • iOS, macOS • 34:26
Discover how to take advantage of the M5 and A19 GPUs to accelerate machine learning. Find out how to use the Neural Accelerators inside each GPU core to perform accelerated tensor and matrix operations for AI training and inference, and learn how to optimize performance using the Metal debugger and Metal system trace.
Speaker: Zak Burka
Transcript
Hi, I’m Zak and I manage the GPU Driver Performance team at Apple. And today we're going to talk about the M5 and A19 family of GPUs and what they mean for AI and ML on Apple silicon. We've added some powerful new capabilities to the GPU specifically designed to accelerate machine learning workloads, and we'll be diving into what they are and how you can take advantage of them in your own applications. Now, AI and ML are not new to Apple platforms. They're already deeply integrated into the tools and experiences you use every day.
MetalFX uses ML for temporal upscaling and denoising, letting games hit higher frame rates at larger resolutions. AirPods use it for spatial audio, noise removal, and real-time translation. And the fusion camera system on iPhone powers features like Deep Fusion, cinematic video, and Portrait mode, all driven by ML.
For developers, we have a rich set of frameworks to bring ML into your apps. There’s Core ML for on device inference. Create ML for building and training models. And intelligent coding features built right into Xcode. These are just a few examples. ML is foundational across Apple platforms. And with the M5 GPU we're taking a major step forward, accelerating the workloads you're already running and making it possible to run more powerful models than ever before.
In this talk, we're going to cover four main areas. First, we'll look at the ML performance gains in the M5 GPU and what developers are already seeing in real-world applications. Next, we'll introduce neural accelerators: what they are, how they work, and why they matter for ML performance. Then we'll walk through the tensor APIs introduced in Metal 4, including recent updates that make it easier to get your ML workloads running efficiently on Apple silicon.
And finally, we'll show you how to use the developer tools to profile and optimize your ML applications. So let's jump into ML on the M5 GPU. This is a significant step forward for machine learning on Apple silicon. The M5 introduces neural accelerators inside the GPU, along with faster math rates, more bandwidth, and larger caches, all of which combine to deliver a major performance boost to your ML workloads.
And customers are seeing these performance gains in their existing applications today. For example, on the new iPad Pro with M5, AI image generation apps like Draw Things can produce images up to four times faster than on M4 using the latest diffusion models like Qwen-Image and Flux. And on the new 14-inch MacBook Pro with M5, AI video enhancement in Topaz Video is up to 7.7 times faster than on M1.
And for large language models, the gains are even more dramatic. Time to first token is up to four times faster, and token generation is up to 25% faster. We'll dive deeper into how the M5 GPU achieves this later, but the key point is that these speedups come from the new hardware, not code changes. So to summarize, this is a major performance update for ML on Apple silicon. With neural accelerators in every shader core, your inference and training workloads run faster right out of the box, using the same APIs and frameworks you use today.
And as you move up the M5 family, workloads that efficiently use neural accelerators will scale well with the increased core counts. We've also made substantial improvements to math rates and memory bandwidth, which combine with neural accelerators to deliver even greater speedups. And finally, we have new tooling support to help you profile your workloads, identify bottlenecks, and squeeze out maximum performance.
While this talk focuses on ML, we're also releasing a companion talk on M5 for graphics and gaming, which covers the rendering features and capabilities of the new architecture in detail. Both talks build on foundational concepts from our M3 architecture tech talk, like dynamic caching and occupancy management, so I'd recommend checking those out as well.
Before we dive into neural accelerators, let's take a quick look at what's new in the M5 GPU. Overall, this will help you see where the ML features fit in. At the top, all new neural accelerators for AI and ML. That's our main focus today. And we'll be spending most of our time here. But there's a lot more.
There are major improvements for graphics and gaming, which are covered in detail in the companion talk. Second-generation dynamic caching automatically improves performance by adjusting your occupancy based on cache usage. Universal image compression reduces memory bandwidth by compressing texture resources in more use cases, improving power, performance, and thermals.
Ray tracing gets faster, too, with third-generation hardware acceleration for Metal RT instanced workloads. And we have a new suite of developer tools tailored specifically to M5, which we'll see in action later. So that's the big picture. Let's look at the key rates that make this such a big leap for ML.
Beyond the new features, M5 also delivers significant rate improvements that directly benefit ML workloads. Matrix multiplication operations, often called GEMMs, are up to 4 to 8 times faster, depending on precision. We also have larger GPU caches and more memory bandwidth. Combined, these improvements make M5 our most powerful GPU ever for AI and ML.
That was a quick overview of the new M5 GPU. Now let's dive deeper into neural accelerators. The key takeaway is simple: neural accelerators dramatically speed up matrix multiplication. Combined with the rate and bandwidth improvements we've discussed, this is what makes ML workloads run so fast on M5. Before we go further, let's briefly cover matrix multiplication and why it matters so much for ML. Matrix multiplication, really matrix multiply-accumulate, is a straightforward operation.
Multiply two matrices A and B and add the result to a third matrix C. Simple concept, but matrix multiplication is the fundamental building block that modern AI and ML are built on. The large language models that power intelligent assistants. Vision models, which enable computational photography and diffusion models which generate AI images, are all powered by matrix multiplication.
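To make the operation concrete, here's a small, self-contained Python sketch of multiply-accumulate. This is illustrative only; the GPU obviously doesn't run Python, and real kernels operate on tiles, not element loops.

```python
def matmul_accumulate(A, B, C):
    """C += A @ B for row-major nested lists: A is MxK, B is KxN, C is MxN."""
    M, K, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            acc = C[i][j]           # start from the existing value (the "accumulate")
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
matmul_accumulate(A, B, C)
# C is now [[20, 22], [43, 51]]
```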
Under the hood, modern ML models aren't just multiplying a few small matrices. A single LLM inference pass can involve hundreds of matrix multiplications, each with millions of elements. An important thing to note here is that the size and shape of these matrices significantly impacts their performance characteristics. Let's explain what we mean.
Depending on the size and shape of the matrices, the performance bottlenecks shift between different GPU resources. For large matrices where both inputs have lots of elements, we have high arithmetic intensity, meaning that the math we need to do greatly exceeds the amount of memory we need to load. We call this being compute bound, and performance here scales with math rates, GPU frequency, and core count.
On the other end, there are cases where one of the matrices may be skinny, sometimes just a single row where we have low arithmetic intensity, meaning we only do a little math before needing to load more data from memory for the next operation, which becomes our bottleneck. We call this being bandwidth bound, and the performance of this scales with how fast we can move data in and out of memory.
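A rough way to see the compute-bound versus bandwidth-bound split is to estimate arithmetic intensity, the ratio of math done to bytes moved. This Python sketch uses an idealized model (each operand loaded exactly once, fp16 elements), so treat the numbers as illustrative rather than hardware-accurate:

```python
def arithmetic_intensity(M, K, N, bytes_per_elem=2):
    """FLOPs per byte moved for C(MxN) += A(MxK) @ B(KxN).
    Counts one multiply + one add per inner-product term, and assumes
    A and B are each loaded once and C stored once (an idealized model)."""
    flops = 2 * M * K * N
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)
    return flops / bytes_moved

square = arithmetic_intensity(4096, 4096, 4096)  # large square matmul: compute bound
skinny = arithmetic_intensity(1, 4096, 4096)     # single-row matmul: bandwidth bound
```

The square case lands over a thousand FLOPs per byte, while the single-row case does roughly one FLOP per byte loaded, which is why its performance tracks memory bandwidth instead of math rate.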
Keeping these scenarios in mind is key to understanding how neural accelerators, along with the other rate bandwidth and cache improvements in the M5 GPU, work together to accelerate your ML workloads. Let’s see these concepts in action with a real example - LLM inference. We’re not going to cover modern LLM architectures in detail here, but introduce a few key concepts to help showcase where the M5 GPU architecture improves the performance of these types of workloads.
LLM inference happens in two distinct phases. First is the prefill phase. This is where the model processes your entire input prompt using large matrix multiplication operations to compute the first output token. It's compute bound, meaning the GPU is doing heavy math. The prefill phase ends when we generate the first output token.
The time from submitting your prompt to receiving the first token is called time to first token. You can intuitively think of this as how long it takes your AI assistant to come back to you with a response after you've asked a question or assigned it a task to do.
The faster the time to first token, the more responsive the LLM feels. After prefill, we enter the decode phase. This is where the stream of output tokens is generated one at a time, until the complete response is produced by the model. The performance characteristics of this phase are at the other end of the spectrum.
Now, tokens are processed one at a time, represented by the tall, skinny matrices introduced earlier. The full weights still need to be loaded from memory to perform the multiplication, so in this phase the time spent generating each token is dominated by reading data from memory to feed the compute units, and is thus memory bound. The decode speed is typically represented as a rate in tokens per second, and you can intuitively think of this as how quickly the model completes a response.
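Because decode is memory bound, a simple roofline-style estimate for the token-rate ceiling is bandwidth divided by model size: every token requires streaming the weights once. The numbers below are hypothetical, not measurements of any Apple device:

```python
def decode_tokens_per_second(weight_bytes, bandwidth_bytes_per_s):
    """Roofline-style upper bound for memory-bound decode: the token rate
    is capped by how fast the full weights can be streamed from memory.
    Illustrative model only; ignores caches, KV reads, and compute time."""
    return bandwidth_bytes_per_s / weight_bytes

# e.g. a hypothetical 8 GB quantized model and 150 GB/s of memory bandwidth
rate = decode_tokens_per_second(8e9, 150e9)  # 18.75 tokens/s ceiling
```

This is also why the larger caches and extra bandwidth in M5 translate directly into faster token generation.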
Combining these together on M5: time to first token performance in the prefill phase is up to four times faster, thanks to the neural accelerators, and in the decode phase, the increased memory bandwidth and larger GPU caches in M5 speed up token generation by up to 25%. We can see these gains on M5 in real LLMs such as Qwen3 and gpt-oss, where both time to first token and token generation performance are dramatically improved. These performance gains are not limited to LLMs, but apply to any workload that makes heavy use of matmul operations, such as convolutions and diffusion models.
And as we go up the M5 family tree, we see great scaling as the neural accelerator capacities and GPU bandwidth increase. Now that we understand why matrix multiplication matters, let's look at what neural accelerators actually are and why they're so efficient for compute-heavy workloads. A few key points. Neural accelerators are dedicated hardware in M5, purpose-built for matrix multiplication. They're built into each shader core right alongside the other GPU pipelines, such as ALU, ray tracing, and so on. This makes them inherently scalable and efficient for algorithms that mix matrix multiplication with traditional ALU operations.
Here is a simplified diagram of the M5 GPU. It contains the shader cores arranged behind the last level cache, alongside the command stream processors that load vertex, fragment, and compute programs onto the cores. If we zoom in to a single shader core, we can see that it is composed of a few key units. First, the scheduler blocks which schedule and dispatch SIMD groups onto the core. Next, the ALU pipelines, which execute the math operations in your shaders. Memory pipelines for reading and writing to textures and buffers. And the dynamic cache memory, which keeps all shader storage on chip.
The neural accelerators sit right here alongside the ALU pipelines. This physical locality enables fast, seamless interoperation with code running on other GPU pipelines. It also means neural accelerator capacity scales directly with core count, so workloads that use them efficiently will scale well as you move up the M5 family. Zooming back out to our M5 diagram, note that each shader core has its own neural accelerator, and as we move up to the M5 Max, the same pattern continues: the neural accelerators, and the performance boost they provide, scale directly with the shader core count.
So to recap, neural accelerators are a new hardware block in M5 which dramatically accelerates matrix multiplication. They're inherently scalable, accelerate operations at the heart of ML, and interoperate seamlessly with other GPU, CPU, and ANE work. So how do you actually use neural accelerators? The good news is, if you're already running on Apple silicon with Apple frameworks, it just works. The APIs you're using today are automatically accelerated on M5, and if you're new to the platform, you'll find a rich set of frameworks and APIs to help get you started.
Here's how the stack looks. At the top, you have domain-specific frameworks like MetalFX that use ML internally; you get the benefit of neural accelerators automatically. In the middle, we have host-side frameworks like Metal Performance Shaders, MPSGraph, and Core ML. These are great for deploying ML models with minimal code, and you can expect great performance out of the box. For ML training and research, open source tools like MLX, llama.cpp, and PyTorch already leverage neural accelerators under the hood. And at the bottom, we have Metal Performance Primitives and TensorOps. This is the lowest level, and it gives you direct access to neural accelerators from your Metal shaders.
Most developers won't need to go this deep, but if you're building custom ML kernels, optimizing a specific operation, or doing research that requires fine-grained control, this is what you should be using. And if you're already writing your own custom kernels in Metal using the SIMD group matrix API, you should move your workloads over to adopt TensorOps instead.
So when would you actually use TensorOps instead of a higher-level framework? First, when you're building custom ML kernels and need specific optimizations that the frameworks don't expose. Second, when you need to mix matrix operations with other shader code; TensorOps integrates directly into your Metal kernels, so you can combine matmul with custom pre- or post-processing in a single pass. And third, when you need fine-grained control over how the work is tiled, how memory is managed, and how threads are scheduled. If any of these apply to you, TensorOps gives you direct access to neural accelerators with full control.
So that's neural accelerators: dedicated matrix multiplication hardware in every shader core, designed to dramatically accelerate ML workloads. The frameworks you're already using leverage them automatically. But if you need to go deeper, building custom kernels or optimizing specific operations, that's where TensorOps comes in. We'll cover TensorOps in three parts. First, a quick overview of the API and recent updates. Then we'll build up a tiled matrix multiplication kernel step by step, so you'll see exactly how to structure TensorOps code. And finally, we'll cover optimization techniques to get the most out of neural accelerators.
Let's start with the overview. TensorOps are matrix multiplication and convolution primitives purpose built to leverage neural accelerators on M5. The API is portable. The same code runs across Apple's entire GPU family from M1 to M5. On older GPUs without neural accelerators, TensorOps falls back to optimized shader implementations.
And because TensorOps lives inside your Metal shaders, you can seamlessly mix it with other Metal shader code: custom activations, data transformations, or whatever else you need. We introduced TensorOps at WWDC25 in the combined Metal 4 machine learning and graphics session. If you haven't seen it, I'd recommend checking it out; it covers the fundamentals we'll be building on here.
Since we introduced TensorOps, we've continued expanding the API in iOS and macOS 26. In 26.1, we added bfloat tensor support, critical for modern ML models that use bfloat16. In 26.3, we added support for cooperative tensors as inputs to matmul. This lets you build custom dequantization routines inside your kernel, essential for running quantized models efficiently.
And in 26.4, we added four-bit and eight-bit integer tensors, so quantized models can fully leverage neural accelerators. So let's put this into practice. We'll build up a tiled matrix multiplication kernel step by step using TensorOps. When performing large matrix multiplications, calling a single tensor op on the full matrices would not be efficient. The key to getting great performance is tiling: we slice the input matrices into smaller tiles and then perform tile-wise matrix multiplications using TensorOps. This approach allows the neural accelerators to work on data that fits well in the cache, maximizing throughput and data reuse.
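The tiling idea itself is independent of Metal. Here's a plain-Python sketch of the same structure, with each output tile accumulating tile-by-tile products along k; in the real kernel, each output tile would map to a threadgroup and the innermost loops to a TensorOps matmul:

```python
def tiled_matmul(A, B, tile=2):
    """Tile-wise matmul for row-major nested lists. Each (row-tile, col-tile)
    of C is computed by accumulating tile products along k, mirroring how
    threadgroups slice the inputs so the working set stays cache-resident."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):            # one "threadgroup" per output tile
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):    # loop over k tiles, accumulating
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Because the result is identical for any tile size, you're free to tune the tile dimensions purely for cache behavior.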
Let's look at some code. If you're using Metal 4, you can declare your tensors on the host side using MTLTensor and simply pass them directly into your Metal kernel. Here we're creating three tensors with fp16 precision. The dynamic extents, or dextents, value of two indicates that these tensors have a 2D coordinate layout, which is exactly what you'd expect for standard matrix operations.
There is also support for creating a tensor directly from a pointer inside your compute kernel, using tensor inline. To create a tensor this way, you specify the data type and the coordinate extents, and mark it as tensor inline. You then pass in the buffer holding your data, along with the extents and strides. This is particularly useful when you need to work with data that's already in a buffer format, or when you want more dynamic control over your tensor creation. For now, let's stick with the host-bound tensor approach to keep things clear.
To extract the slice of data that each threadgroup will work on, we need two things. The tile size we've chosen and the threadgroup IDs. Remember, we've partitioned our matrix multiplication across a 2D grid of thread groups, so each one is responsible for computing a specific tile of the output. We use the slice function on our input tensors to extract exactly the portion each thread group needs. As you can see here, we slice matrix A, matrix B and matrix C to get the relevant tiles based on our thread group's position in the grid.
Now that we have our slices, let's set up the actual matmul operation that each threadgroup will execute. First, we create a descriptor that defines the shape of our tile computation. Notice here that we're using a dynamic shape for the k dimension. This tells TensorOps to loop over the full extent of the tensor for you, rather than using a static k dimension.
We can also configure whether we want to transpose the left or right input matrix, which is useful for the many ML algorithms that require transposed weight matrices. Next, we specify how many SIMD groups will participate in this operation using the execution SIMD groups parameter. In this example, we're using four SIMD groups working together. And finally, we simply call the run function on our extracted tensor slices. That's really all there is to the basic setup.
But of course, in real-world ML applications, you often want to do something more sophisticated with your results. Let's look at that next. A very common pattern in AI and ML is to apply an activation function to the output of a matrix multiplication. These activation functions are crucial building blocks in neural networks. With the basic approach we just covered, you would need to write the output tensor to device memory after the matmul completes, then read it back in to apply the activation function, and finally write it out again. This double trip to memory is costly and can significantly impact your performance.
This is exactly the problem cooperative tensors solve. With cooperative tensors, the output of your matrix multiplication stays in fast on-chip memory, distributed across the threads participating in your operation. You can then modify these elements in place, applying your activation or any other operation while they're still in thread memory. Only after you've finished your modifications do you write the final result to device memory. This eliminates the inefficient round trip entirely and can make a substantial difference to your kernel's performance.
So what exactly is a cooperative tensor? It behaves just like a regular tensor, but with one key difference. The data is distributed across multiple threads in the threadgroup. Each thread owns a subset of the tensor elements. This minimizes how much thread memory each individual thread uses, while still allowing every thread to manipulate its portion of the data in place. This essentially keeps data local while spreading the storage burden across your thread group.
Looking at this diagram, you can see how the elements are laid out. Thread zero holds the first two elements, shown in blue; thread one holds the next two elements, in green; and so on. The data is interleaved across threads in a way that maximizes parallelism while keeping memory usage efficient. Let's see how we can use cooperative tensors to build a kernel that applies an activation function to our matmul output.
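The exact hardware layout is opaque to you, but the ownership idea can be sketched in Python: elements are handed out to threads in small interleaved chunks. The chunk size of two here just mirrors the diagram; it is not the real hardware distribution.

```python
def distribute_elements(num_elements, num_threads, chunk=2):
    """Round-robin interleaving of tensor elements across threads, chunk
    elements at a time: thread 0 owns elements 0-1, thread 1 owns 2-3,
    and so on, wrapping around. Conceptual sketch only; the actual
    cooperative-tensor layout is an implementation detail of the GPU."""
    owner = {}
    for e in range(num_elements):
        owner[e] = (e // chunk) % num_threads
    return owner

owners = distribute_elements(8, 4, chunk=2)
# thread 0 -> elements 0,1; thread 1 -> 2,3; thread 2 -> 4,5; thread 3 -> 6,7
```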
First, we need to create the destination cooperative tensor for our operation. In the template arguments, you provide the types corresponding to your input tensors; the decltype keyword helps you infer these automatically. The last argument specifies the data type for your destination tensor. Here we're creating a half-precision tensor. Next, we run the matrix multiplication just as before, but instead of passing the regular destination tensor, we pass in our cooperative tensor.
The matmul result now lives in thread memory, distributed across the threads participating in this operation. From here, we iterate over all the elements in the cooperative tensor. We use get capacity to find out how many elements this thread owns, then extract each element and apply our activation function, in this case a rectified linear unit, or ReLU, directly.
Once we've modified all our elements, we write the results back to device memory by calling the store function on the cooperative tensor, with our output slice as the parameter. Notice how we've gone from device memory to thread memory, applied our activation, and written back, all in a single efficient pass. These are the fundamentals of using TensorOps to implement tiled matrix multiplication with custom operations.
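The fusion pattern itself, keeping the accumulator local, applying the activation, then storing once, looks like this in scalar Python form. This is a conceptual sketch, not the TensorOps API:

```python
def matmul_relu_fused(A, B):
    """Fused matmul + ReLU: the activation is applied while the value is
    still 'in registers' (a local variable here), so the result is stored
    exactly once, with no intermediate write/read of the raw matmul output."""
    M, K, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            out[i][j] = max(acc, 0.0)  # ReLU applied before the single store
    return out
```

The unfused version would write `acc` to memory for the whole matrix, read it all back, clamp, and write again, which is exactly the double trip cooperative tensors avoid.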
But if you really want to squeeze every last bit of performance out of the neural accelerators, there are some important optimizations to consider. Let's walk through the main ones. The first optimization involves tile sizes. A fixed tile size won't be optimal for all input shapes; the performance characteristics shift depending on your matrix dimensions. Increasing the tile size in the M and N directions allows better data reuse among SIMD groups within the threadgroup, which means less redundant loading from memory.
On the other hand, increasing the SIMD group tile size can reduce traffic between cache levels, but be careful: if you go too large, you may start spilling registers, which hurts performance. Templating your kernel so you can easily adjust tile sizes for different workloads is a good idea. Refer to the programming guide for detailed examples of how to find the optimal tile sizes for your specific use case.
The second optimization is about keeping your SIMD groups in sync when processing the k dimension. TensorOps will tile and loop over it for you automatically, but there's a subtlety here that's worth understanding. As execution proceeds, SIMD groups within a threadgroup can start to diverge in their progress through those k tiles.
If we look at the matrix data actively being used by SIMD groups in a threadgroup, each SIMD group represented by a dark square in this diagram, we see that they start out synchronized, but over time they drift apart. When SIMD groups drift apart, you end up with larger, more scattered cache usage patterns, which hurts your cache hit rates and overall performance.
The fix is to manually synchronize your SIMD groups using a threadgroup barrier. To do this, you will want to tile the k dimension explicitly in your code so that you can insert barriers every few iterations. This keeps your SIMD groups moving together through the k dimension, maintaining good cache locality. Refer to the programming guide for examples of how to tune the barrier frequency for your workloads.
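Structurally, explicit k-tiling just means hoisting the k loop to the outside so there's a natural place for a barrier every few tiles. This single-threaded Python sketch shows the loop shape; the barrier is only a placeholder comment, since there are no SIMD groups to synchronize here, and the result matches a full matmul exactly:

```python
def matmul_k_tiled(A, B, k_tile=2, barrier_every=1):
    """Matmul with the k dimension tiled explicitly. After every
    `barrier_every` k-tiles there is a synchronization point; in a Metal
    kernel this is where a threadgroup barrier would keep SIMD groups
    marching through the k tiles together."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for t, k0 in enumerate(range(0, K, k_tile)):
        for i in range(M):
            for j in range(N):
                for k in range(k0, min(k0 + k_tile, K)):
                    C[i][j] += A[i][k] * B[k][j]
        if (t + 1) % barrier_every == 0:
            pass  # threadgroup barrier would go here in a Metal kernel
    return C
```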
The final optimization I want to highlight is about how your threadgroups traverse memory. The default approach is a linear raster-order traversal: threadgroups march through row by row. Simple and intuitive, but from the perspective of your last-level cache, this doesn't give you great data reuse in the Y dimension.
A better approach is to use a space-filling curve like Morton order or Hilbert order. These traversal patterns keep threadgroups that are close in time also close in space, which significantly improves cache locality and hit rates in the last-level cache. The performance difference can be substantial, especially for large matrices. Again, refer to the programming guide for examples of how to implement these traversal optimizations in your kernels.
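Morton order is easy to compute: interleave the bits of the tile's x and y coordinates. Here's a small Python sketch of the bit-twiddling; how you'd actually remap dispatch IDs inside a kernel is up to your implementation:

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of (x, y) into a Morton (Z-order) index.
    Visiting threadgroups in increasing Morton order keeps tiles that are
    neighbors in time also neighbors in space, improving cache reuse."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)       # x bits go to even positions
        z |= ((y >> b) & 1) << (2 * b + 1)   # y bits go to odd positions
    return z

# Walk a 4x4 grid of tiles in Morton order instead of raster order
order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda p: morton_index(p[0], p[1]))
```

The first four tiles visited are the 2x2 block at the origin rather than the whole first row, which is exactly the locality the last-level cache wants.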
So we've covered TensorOps and how to build efficient matrix multiplication kernels. Now let's talk about optimization: specifically, how to use the developer tools to profile your workloads and make sure you're getting the most out of neural accelerators. We're going to profile a real workload and see exactly what's happening on the GPU. I have a simple app that performs a 4K by 4K matrix multiplication, which is a large, compute-bound workload, three different ways. First, using SIMD group matrix, which is how you might do it today; this will serve as our baseline.
Then we'll look at an implementation using TensorOps in order to run on the neural accelerators. And finally, we'll add in Morton-ordered threadgroup dispatch, the optimization technique we just discussed. We'll use two tools. First, Metal System Trace, which is our live timeline viewer that shows you exactly what's happening on the GPU in real time.
Then we'll dive into the Xcode Metal debugger to profile the workload for deeper analysis. We'll look at the overall latencies of the different variants, as well as performance counter metrics to help us understand how well the workloads are utilizing the hardware. Let's take a look at the demo.
Here's the Xcode project containing our three 4K by 4K GEMM variants. When run, the app will execute each variant sequentially. Here we see our Metal shaders, which implement our variants: v1, the classic SIMD group matrix implementation; v2, which moves to TensorOps in order to use neural accelerators; and v3, which optimizes v2 with a Morton-ordered threadgroup dispatch.
Let's first build the project with Command-B, then launch Instruments with Command-I. When Instruments comes up, we'll choose the Metal System Trace template from the list, and then we can optionally select any parameters we'd like for the capture. For this example, I'll select the performance limiters counter set.
This gives us GPU performance counter data to understand what the workload is doing and how well it's utilizing the hardware. Next, we hit record, which will launch the program and capture a trace of the execution. I have an example trace already captured here, which we'll look at now.
Here's what a Metal System Trace timeline looks like. It captures everything happening on the system, including CPU, GPU, memory, and display, and shows what is happening and when. This is where you start when you want to quickly understand the execution dynamics of your workload. Each track in the main window represents a different set of events in the system, with time increasing in the X dimension. For our workload, we will focus on the M5 Metal device events, which I'll expand now.
Inside the M5 Metal events track, there are many different useful metrics to help us understand the workload. There are memory tracks, which tell us the wired footprint and allocation and deallocation activity. There's a CPU-side driver processing track, which shows us where the Metal driver stack is processing and submitting work to the GPU. And below this, we have our vertex, fragment, and compute tracks, which show us exactly when command buffers are actually being executed on the hardware. Using this flow, you can track work from your API calls down to the GPU hardware.
In this example, we have run our application and captured it on the Mac that we're using for development. Notice that we have vertex, fragment and compute work all happening at the same time. This is what is really happening on your system when you run workloads. UI is still updating. Other apps which may be using the GPU are still running, and this is all captured in the trace.
Despite this, we can still clearly see our three GPU compute command buffers from the example app running on the hardware right here in green. We have GEMM v1, which takes over two seconds to execute, followed by v2, which takes just over half a second to run, followed by v3, which takes around a third of a second to complete. This matches what we expect, given that v1 is not using TensorOps, v2 is, and v3 adds the Morton walk-order optimization technique we discussed earlier.
Let's pin this track so we can look at it in line with performance counters. You can see that there are many counters captured in line to help us understand our GPU activity. And you can search for specific counters using the track filter. I'll use this to find and pin the neural accelerator utilization counter.
Notice that during v1, our neural accelerator utilization is 0%. All of this compute work is happening on the ALU, which means the dedicated matrix hardware is sitting completely idle. As we move to v2, which uses TensorOps, we're doing the same math, but now we are well above 50% utilization and also over four times faster than v1. Still, the utilization percentage tells us that the neural accelerators could be doing more: they're waiting for data. And here's v3, which uses Morton-ordered threadgroup dispatch. Same TensorOps, but with a different traversal order; utilization jumps to close to 100%, and the workload gets substantially faster.
So Metal System Trace gives you that quick system-level view. You can see your workload in the context of everything else running on the system, which is great for rapid iteration and understanding the big picture. But sometimes you need to go deeper: you want to isolate just your GPU work, minimize external influences, and get more detailed metrics. That's when you use the Xcode Metal debugger.
With the Metal debugger, you capture a GPU trace of your workload and then replay it through Xcode. This isolates just your GPU work and removes outside system activity. I've captured a GPU trace of a single k-loop iteration for each variant. This keeps the capture small while preserving the performance characteristics we care about.
Here are the same three variants, but now in isolation. The Metal debugger gives you a much more detailed breakdown of what's happening inside each compute pass. We can again see that each variant is getting faster, and also that v2 and v3 are using neural accelerators and seeing much higher cache bandwidth.
Additionally, the cost graph view lets you see where you're spending time in your shaders, directly in line with your Metal source. You can pull up the runtime statistics to get a breakdown of register usage, divergence, and the types of instructions executed. Note that in the v1 example, which uses SIMD group matrix, the majority of our instruction types are math, while in the v3 example, almost all of the instructions are being executed by neural accelerators. This is where you go when you need to really understand exactly what your code is doing and how it is executing on the hardware.
I won't go deep into the Metal debugger today, but I encourage you to explore it, especially the per-shader performance counters and the cost graph views. Between Metal System Trace for the big picture and the Metal debugger for deep dives, you have everything you need to understand and optimize your GPU workloads.
So that's the workflow: start with Metal System Trace for the quick system-level view, and when you need deeper analysis, capture a GPU trace and use the Xcode Metal debugger. The results demonstrate the performance potential here. Across all three variants, it's the same 4K by 4K matrix multiplication running on the exact same hardware; the difference is almost seven times faster execution, just by understanding how to use and feed the neural accelerators efficiently.
All right, let's recap what we covered about the M5 and A19 GPUs for machine learning. First, the performance improvements in the M5 GPU family: you're getting up to four times faster time to first token for LLMs and 25% faster token generation, performance gains you'll see without changing a single line of code. Then, neural accelerators: the new dedicated matrix multiplication hardware in each shader core. Understanding compute-bound versus memory-bound workloads and arithmetic intensity is key to getting the most out of this hardware.
Next, the Metal tensor APIs, which give you first-class tensor support directly in your shaders. Cooperative tensors keep intermediate results in thread memory, and tiled matrix multiplication with activation fusion lets you build highly efficient ML kernels. And finally, optimizing your workloads: profile first to identify bottlenecks, and use Metal System Trace and the Xcode Metal debugger to understand how your workloads are utilizing the hardware. And be sure to check out the Metal Performance Primitives programming guide for more in-depth details on how to best use TensorOps, as well as the companion graphics talk for more details on the graphics and rendering capabilities of the GPU. Thanks for watching.