2020 • 25:30
Discover the advances in Metal performance and capability delivered with the Apple M1 chip on Apple silicon Macs. Apple M1 unites the top-end graphics and compute abilities of discrete GPUs with the features and power efficiency of Apple silicon, creating entirely new opportunities for developers of Metal-based apps and games on macOS. We’ll explore the Metal graphics and compute fundamentals of Apple M1, then take you through four important Metal features to make your Mac apps really shine on Apple silicon: tile shading, memoryless render targets, programmable blending, and sparse texturing.
Speaker: Luc Semeria
Unlisted on Apple Developer site
Transcript
Hello, I'm Luc Semeria from the GPU software team at Apple. In this video, I'm going to introduce you to Metal for the Apple M1 chip and show how you can make your Metal applications shine on the new Apple silicon Macs. First, we will dive into the features that Apple M1 supports for Metal apps on the Mac. And we will cover some of the key improvements we've made to increase graphics and compute performance for pro applications and Metal games.
Then, we will highlight a few key Metal features that you may now leverage in your applications to take full advantage of the Apple M1 GPU. Apple M1 is engineered to bring incredible performance and a whole new set of features to the Mac product line. And Metal gives you access to the GPU at the heart of Apple M1.
We have unified the Metal features of iOS and macOS to seamlessly support modern Mac applications and tap into the power of our Apple-designed GPU. And to take it even further, we've made key changes to improve responsiveness and increase graphics and compute performance for pro applications and Metal games.
In Metal, we've been exposing two distinct sets of GPU families: one on iOS, targeting our latest iPhones and iPads and leveraging our Apple-designed GPUs, and one on macOS for our latest Intel-based Macs, where the features and capabilities available may differ between the power-efficient integrated GPU and the high-performance discrete GPU.
Apple M1 supports the full set of Metal features from iOS and macOS. This means all the features found on a high-end discrete GPU are now available on our most portable Macs. So, if you are making apps targeting the A14 on iOS, you will find those same advanced features on Apple M1.
Likewise, if you are a Mac developer, all Metal features, including features previously found only on discrete GPUs, now work seamlessly and with great performance on Apple silicon Macs. In your codebase, GPU-family support queries will return true for both Apple 7 and Mac 2 on Apple M1. Let's dive into the features.
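In Swift, that check might look something like this, as a minimal sketch using the default system device:

    import Metal

    // Query GPU family support on the current device.
    // On Apple M1, both of these checks return true.
    let device = MTLCreateSystemDefaultDevice()!
    let supportsApple7 = device.supportsFamily(.apple7) // Apple-designed GPU features
    let supportsMac2 = device.supportsFamily(.mac2)     // Mac-class (discrete GPU) features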
It all starts with full support for the GPU-driven pipeline. This includes Tier 2 argument buffers, with support for nested argument buffers and up to half a million textures and a thousand samplers. It also includes support for indirect command buffers, for both render and compute commands, to reduce CPU overhead and allow command reuse. It even supports encoding indirect commands directly on the GPU to fully maximize CPU and GPU parallelism.
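To give a feel for the API, here is a minimal sketch of creating an indirect command buffer for indexed draws; `device` is assumed to exist, and the bind counts and command count are placeholder values:

    import Metal

    // A sketch: an indirect command buffer that can hold up to 1024
    // indexed draw calls, encodable from either the CPU or the GPU.
    let icbDescriptor = MTLIndirectCommandBufferDescriptor()
    icbDescriptor.commandTypes = [.drawIndexed]
    icbDescriptor.inheritPipelineState = true   // commands reuse the bound pipeline
    icbDescriptor.maxVertexBufferBindCount = 4
    icbDescriptor.maxFragmentBufferBindCount = 4

    let icb = device.makeIndirectCommandBuffer(descriptor: icbDescriptor,
                                               maxCommandCount: 1024,
                                               options: .storageModePrivate)!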
Apple M1 also supports barycentric coordinates and primitive IDs, which enable different rendering techniques such as visibility-buffer rendering. MSAA arrays and MSAA layered rendering for shadow maps and multi-viewport rendering are fully supported, including F32 MSAA resolve. As you would expect, Apple M1 supports raster order groups to precisely control memory accesses from fragment shaders for order-independent transparency, multi-layer G-buffers, and voxelization. It also supports the ray tracing and function pointer features that we introduced at WWDC 2020.
On the texturing side, Apple M1 supports sparse textures with access counters for the first time on a Mac. This is perfect for texture streaming on the GPU timeline. We'll cover this in more detail later. The BC1 to BC7 highly compressed texture formats are fully supported; your existing game asset pipeline can use them without any modification. Apple M1 supports 32-bit floating-point filtering, including bilinear and anisotropic. It also supports advanced texture clamp modes such as clamp-to-border with black, white, or transparent border colors.
On the compute side, Apple M1 supports all SIMD-group reduction and permutation functions. Additional SIMD matrix multiply functions are also available in the Metal Shading Language to further optimize your machine learning and image processing algorithms.
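As a small illustration, here is a sketch of a compute kernel that uses a SIMD-group reduction to produce one partial sum per SIMD group; the buffer layout is an assumption of the example:

    #include <metal_stdlib>
    using namespace metal;

    // Each SIMD group sums its own values with simd_sum, with no round
    // trip through threadgroup memory. One partial sum is written per
    // SIMD group; a second pass would combine the partial sums.
    kernel void partial_sums(device const float *input [[buffer(0)]],
                             device float *partials    [[buffer(1)]],
                             uint gid                  [[thread_position_in_grid]],
                             uint tg                   [[threadgroup_position_in_grid]],
                             uint simdsPerGroup        [[simdgroups_per_threadgroup]],
                             uint simdId               [[simdgroup_index_in_threadgroup]],
                             uint lane                 [[thread_index_in_simdgroup]])
    {
        float sum = simd_sum(input[gid]);   // reduction across the SIMD group
        if (lane == 0) {
            partials[tg * simdsPerGroup + simdId] = sum;
        }
    }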
Next, let's talk about GPU performance. With Apple M1, we have taken our expertise in designing best-in-class, power-efficient GPUs and applied it to the Mac. We are relying on our TBDR and unified memory architecture to provide great performance with minimum power, while providing an industry-leading level of UI and multi-app responsiveness. The Apple M1 GPU uses our new generation of Apple-designed shader cores and scales them further for the needs of pro applications and games running on macOS. For starters, Apple M1 has faster F32 floating-point operations, which are common in image processing, machine learning, ray tracing, and games. Apple M1 also provides better threadgroup memory performance in compute and tile shaders, and it increases ALU utilization with improved instruction scheduling across SIMD groups. Finally, to account for the more complex shaders found on the Mac, the shader core also speeds up function calls and stack accesses.
Performance on Metal features is also great. If your application already makes use of our GPU-driven graphics pipeline features, you will see significantly improved performance on Apple M1. We have also made FP16 texture filtering significantly faster for better support of HDR content. FP32 and INT32 texture image reads are also faster. And to keep memory bandwidth to a minimum without loss of precision, the Apple M1 GPU leverages our latest compression hardware to bring even more bandwidth savings to your applications.
Finally, the performance improvements are also visible in terms of application responsiveness. Apple M1 supports multiple concurrent compute workloads running in parallel with geometry and fragment processing. It also supports fast context switching for great UI responsiveness. And GPU scheduling has been tuned to handle multiple applications running concurrently, as well as applications with multiple Metal command queues. Putting this together, Apple M1 supports all Metal features to seamlessly run modern Mac applications with great performance and responsiveness. Now, let's see some of this in action.
We worked with our partners at Larian Studios to bring their upcoming title, Baldur's Gate 3, to Apple M1. And here it is. This scene is running at ultra settings in 1080p resolution. It's using a number of advanced rendering techniques including PBR, cascaded shadow maps, temporal anti-aliasing, and several post-processing effects such as SSAO and depth of field, like in this cutscene.
Before Apple M1, this had only been possible on Macs with discrete GPUs. Diving in further, Larian uses our unified memory architecture to speed up sharing buffers between the CPU and GPU. They have also been taking advantage of parallelization on the GPU for mesh surface generation and shadow passes. Performance is further improved using lossless compression and FP16 math in shaders.
So that's a quick look at Baldur's Gate 3, and we think you will love playing it on Apple silicon Macs. As we just saw, Apple silicon Macs with Apple M1 can seamlessly run your applications, supporting all of our most advanced Metal features with great performance. As we've been developing Apple M1, we've been looking at a wide range of Mac applications and have been considering how they could be optimized even further in terms of performance, memory bandwidth, and memory footprint.
We found a few key features that you may use when optimizing for Apple silicon Macs and would like to tell you more about them. These features are tile shading, memoryless render targets, programmable blending, and sparse textures with GPU access counters. First, let's explore how you may leverage tile shading to greatly reduce memory bandwidth.
But before we dive into tile shading, let's quickly recap how the TBDR, or tile-based deferred renderer, architecture of modern Apple GPUs works. This will help you understand what tile shading is and why you should use it. Our TBDR architecture has two main phases: vertex processing, where geometry is processed, and fragment processing, where pixels are processed. Let's focus on the fragment stage, which is where tile shading is introduced. Fragment processing operates on really fast on-chip memory.
On-chip depth and stencil memory is used for efficient hidden-surface removal, to shade only visible pixels. On-chip tile memory is used by fragment shaders for reading and writing color values. But it doesn't stop there. How great would it be if you could dispatch a compute-like shader that could access all of the pixels in your tile memory and that could be inserted right in the middle of your fragment processing? Well, this is exactly what a tile shader is. It lets you interleave fragment and compute work in the same render pass, and save a significant amount of memory bandwidth in the process.
Mixing compute and render passes has become more common in recent years. But in some cases, this can consume a large amount of memory bandwidth because the tile memory must be stored and reloaded from system memory between passes. Instead, you can use tile shading on Apple silicon GPUs to mix render and compute in a single pass, saving memory bandwidth.
Results between fragment draw calls and tile shading dispatches are fully consistent thanks to the implicit synchronization and barriers introduced by Metal. Note that in this case, there is no repeated load and store between system memory and tile memory. This saves a huge amount of memory bandwidth and can greatly improve the performance of your application. Let's see this in action on a real use case.
A typical example is tile-based light culling, a rendering technique where the render target is split into small tiles, and a light-culling compute shader builds a list of all the lights that affect each tile. The final lighting pass uses the per-tile light list to shade only the lights affecting a given tile. This reduces the shading cost dramatically, especially if you have a lot of spotlights in your scene. As you can see, both render and compute encoders are used. Let's see how tile shading can help render this in a single pass.
Using tile shading, you can generate per-tile light lists in the middle of a render command encoder. Note that there is no break in your render pass. All the computation is done in one pass, keeping all the necessary data in tile memory without going to system memory. And since most of the intermediate data stays in tile memory, you could use memoryless render targets to also reduce the memory footprint, but more on that later.
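On the host side, a sketch of that single-pass structure might look as follows; the pipeline objects, the "cull_lights" kernel name, and `device`, `library`, and `encoder` are all assumptions of the example:

    import Metal

    // Build a tile pipeline around a (hypothetical) light-culling kernel.
    let tileDescriptor = MTLTileRenderPipelineDescriptor()
    tileDescriptor.tileFunction = library.makeFunction(name: "cull_lights")!
    tileDescriptor.colorAttachments[0].pixelFormat = .bgra8Unorm
    tileDescriptor.threadgroupSizeMatchesTileSize = true
    let tilePipeline = try! device.makeRenderPipelineState(tileDescriptor: tileDescriptor,
                                                           options: [],
                                                           reflection: nil)

    // Inside a single render command encoder:
    encoder.setRenderPipelineState(gBufferPipeline)
    // ... G-buffer draw calls ...
    encoder.setRenderPipelineState(tilePipeline)    // tile shading dispatch, mid-pass
    encoder.dispatchThreadsPerTile(MTLSize(width: encoder.tileWidth,
                                           height: encoder.tileHeight,
                                           depth: 1))
    encoder.setRenderPipelineState(lightingPipeline)
    // ... lighting draw calls consuming the per-tile light lists ...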
To summarize, tile shading is a new shading stage, only possible on Apple TBDR GPUs. Tile shading has full access to tile memory and threadgroup memory, as well as device memory. Synchronization between fragment draw calls and tile shading dispatches is guaranteed by Metal, and results are fully consistent thanks to implicit barriers. We encourage you to check whether your renderer mixes compute and fragment processing. If it does, consider using tile shading instead of separate compute passes. This will save memory bandwidth and bring more performance to your app.
Now, let's take a look at memoryless render targets to understand how you can use them to reduce memory footprint. Memoryless render targets are render targets not backed by system memory; they are backed only by on-chip tile memory. They are created using the memoryless Metal storage mode. This is typically used for attachments that do not need to be stored at the end of a render pass. Let's see some use cases.
The first use case is with MSAA textures. Here, we have a large 4x MSAA texture as well as a resolve texture. At the end of the pass, both the MSAA and the resolved textures are stored in system memory. By using Metal's multisample-resolve store action, we can save all the bandwidth required to store the large MSAA texture.
This is because MSAA resolve always happens from tile memory, so there is no point storing the MSAA samples. This makes the unresolved multisample texture fully transient: we never load it or store it. So in that case, we can just make it memoryless and save the entire memory footprint of the MSAA texture.
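Putting that into code, a sketch of a memoryless MSAA color attachment might look like this; the sizes and formats are illustrative, and `device` and `resolveTexture` are assumed to exist:

    import Metal

    // Transient 4x MSAA color target: lives only in on-chip tile memory.
    let msaaDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm,
                                                                  width: 1920, height: 1080,
                                                                  mipmapped: false)
    msaaDescriptor.textureType = .type2DMultisample
    msaaDescriptor.sampleCount = 4
    msaaDescriptor.usage = .renderTarget
    msaaDescriptor.storageMode = .memoryless        // no system-memory backing
    let msaaTexture = device.makeTexture(descriptor: msaaDescriptor)!

    let passDescriptor = MTLRenderPassDescriptor()
    passDescriptor.colorAttachments[0].texture = msaaTexture
    passDescriptor.colorAttachments[0].resolveTexture = resolveTexture
    passDescriptor.colorAttachments[0].loadAction = .clear
    passDescriptor.colorAttachments[0].storeAction = .multisampleResolve // store resolved pixels only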
For the next use case, let's take a look at memoryless for depth and stencil textures. Depth attachments are required for depth testing, but in some cases, the depth texture does not need to be stored back to system memory. This means that you should be able to reduce both memory bandwidth and footprint together.
Using Metal's don't-care store action, we can save the memory bandwidth used to store the depth texture, while depth testing still happens correctly in on-chip depth memory. On top of that, since we don't need to read or write the depth texture in system memory, we can mark it as memoryless and save its memory footprint.
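A corresponding sketch for the depth attachment, continuing the render pass setup from the MSAA example above:

    // Transient depth buffer: depth testing runs on-chip, nothing is stored.
    let depthDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .depth32Float,
                                                                   width: 1920, height: 1080,
                                                                   mipmapped: false)
    depthDescriptor.usage = .renderTarget
    depthDescriptor.storageMode = .memoryless
    let depthTexture = device.makeTexture(descriptor: depthDescriptor)!

    passDescriptor.depthAttachment.texture = depthTexture
    passDescriptor.depthAttachment.loadAction = .clear
    passDescriptor.depthAttachment.clearDepth = 1.0
    passDescriptor.depthAttachment.storeAction = .dontCare // depth not needed after the pass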
Finally, let's go back to our earlier tiled deferred rendering algorithm with tile shading. We have three passes: G-buffer rendering, light culling, and light accumulation. Note that most of the intermediate attachments are only accessed in tile memory. This means that these attachments don't need to be stored back to system memory. Again, we can use the don't-care store action with the memoryless storage mode to save memory bandwidth as well as memory footprint.
In summary, memoryless render targets are a great way to reduce the memory footprint of your application. You can typically use them with MSAA attachments, depth and stencil attachments, and, more generally, with any temporary attachments fully produced and consumed in the same render pass. Let's explore another unique TBDR feature that we think can provide tremendous performance benefits and is really easy to use: programmable blending.
Programmable blending allows fragment shaders to access pixel data directly from tile memory. This means that a fragment shader can read the output of the previous fragment shader for the same pixel position. Programmable blending allows you to merge multiple render passes into a single pass, drastically reducing your memory bandwidth in the process. Let's look at a simple example.
Here, we have a classic example of deferred shading ported from Intel-based Macs. The G-buffer attachments are rendered and stored to system memory by the first render command encoder. The attachments are read back immediately afterwards in the per-pixel lighting pass, consuming significant memory bandwidth. If you only need to read the output of the previous fragment shader at the current pixel position, then you should really use programmable blending. It allows you to fold the two render passes into a single pass, avoiding the extra loads and stores to system memory.
And now that your render target attachments are intermediate, they can be marked as memoryless to save extra memory footprint on top of the memory bandwidth you already saved with programmable blending. We encourage you to inspect your code and check whether some fragment shaders read the pixel values written by previous fragment shaders. Programmable blending is very easy to use: once you know the fragment shader to modify, just add the color attachment to your fragment shader inputs and read back the data. This allows more complex effects in your app with massively reduced memory bandwidth.
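In the Metal Shading Language, that read-back is just a function argument. Here is a sketch; the G-buffer layout and the lighting math are placeholder assumptions:

    #include <metal_stdlib>
    using namespace metal;

    // Hypothetical G-buffer layout: attachments 0 and 1, read back from
    // tile memory via [[color(n)]] inputs (programmable blending).
    struct GBuffer {
        half4 albedo [[color(0)]];
        half4 normal [[color(1)]];
    };

    fragment half4 accumulate_lighting(GBuffer gBuffer)
    {
        // Read what the G-buffer pass wrote for this very pixel.
        half3 n = normalize(gBuffer.normal.xyz);
        half diffuse = max(dot(n, half3(0.0h, 1.0h, 0.0h)), 0.0h); // placeholder light
        return half4(gBuffer.albedo.rgb * diffuse, 1.0h);
    }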
We just covered a few techniques that leverage Apple silicon's TBDR GPU architecture to reduce memory bandwidth and footprint. Now, let's talk about sparse textures, another key feature that is available for the first time on macOS. Sparse textures with texture streaming can reduce the memory footprint of your application while maintaining the quality of your render.
Before diving into sparse textures, let's do a quick recap on what texture streaming is. Texture streaming lets you render incredibly large scenes with a fixed memory budget by streaming in only the texture mipmaps needed for the current view. The lowest levels of the mipmap pyramid are kept resident in case a higher-quality mipmap is requested but not yet streamed in.
Let's see how Metal's sparse textures improve this approach. First, sparse textures provide texture access counters that tell you how often each region of a sparse texture is accessed. This lets you prioritize texture loading, as regions that are accessed more often are generally more visible in the current view.
Residency can be managed at sparse-tile granularity instead of mipmap granularity. Taken together, sparse textures let you stream more visible detail within the same memory footprint, improving quality, or stream the same amount of detail with a smaller footprint. Let's see how to use sparse textures.
A renderer using sparse textures can be split into four stages. First, initialization, where the sparse heap and sparse textures are created. Then, render, where the scene is rendered by sampling the sparse textures. After rendering, you query the texture access counters, which tell you how often sparse tiles are being used in your scene. Finally, you stream your textures: you map and load new content into sparse tiles, and unmap tiles to save memory. Let's now dive into each stage.
First, you need to create a sparse heap of a fixed size and allocate one or more sparse textures from it. Note that you only need to bind the sparse heap. This heap defines your memory budget and is where sparse tiles are actually stored. Sampling texels from the sparse textures will automatically read the content from the sparse heap. As part of initialization, you also load the lowest mip levels of your texture as sample fallbacks.
Here is how to create a sparse heap with a given memory size using Metal. You first calculate the size of your heap as a multiple of the sparse tile size. You then generate the sparse heap descriptor. You set its type to sparse and specify its size. Then you create the sparse heap from the descriptor.
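As a sketch in Swift, with `memoryBudget` standing in for your application's budget in bytes and `device` assumed to exist:

    import Metal

    // Round the budget up to a whole number of sparse tiles.
    let tileSize = device.sparseTileSizeInBytes
    let heapSize = ((memoryBudget + tileSize - 1) / tileSize) * tileSize

    let heapDescriptor = MTLHeapDescriptor()
    heapDescriptor.type = .sparse                 // sparse heap backing the sparse tiles
    heapDescriptor.size = heapSize
    heapDescriptor.storageMode = .private
    let sparseHeap = device.makeHeap(descriptor: heapDescriptor)!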
Once the sparse heap is created, you just need to create a texture from it. First, create a texture descriptor as usual. Then, create the sparse texture using the sparse heap object and the texture descriptor. After creating the sparse texture, you would typically also map its lowest mip levels. We will see an example of texture mapping later.
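Continuing the sketch, with illustrative dimensions and format:

    // Create the sparse texture from the sparse heap rather than the device.
    let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm,
                                                                     width: 8192, height: 8192,
                                                                     mipmapped: true)
    textureDescriptor.usage = .shaderRead
    let sparseTexture = sparseHeap.makeTexture(descriptor: textureDescriptor)!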
After the initialization stage, you can start your rendering loop by rendering the scene. This render will sample your sparse textures. Here is a fragment shader that does this. You start by sampling your sparse texture using the sparse sample method, which returns a sparse color object. You can then call the resident method on the returned object to determine whether the texture region you are trying to sample is mapped in memory. If it is, you simply return the color value. Otherwise, you take the fallback path and sample the sparse texture with a traditional sample call and an LOD clamp.
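Here is a sketch of that fragment shader in the Metal Shading Language; `fallbackLOD` names the lowest, always-mapped mip level and is an assumption of the example:

    #include <metal_stdlib>
    using namespace metal;

    struct VertexOut {
        float4 position [[position]];
        float2 uv;
    };

    fragment float4 sample_sparse(VertexOut in                [[stage_in]],
                                  texture2d<float> tex        [[texture(0)]],
                                  sampler smp                 [[sampler(0)]],
                                  constant float &fallbackLOD [[buffer(0)]])
    {
        // Sparse path: also reports whether the sampled region is mapped.
        sparse_color<float4> color = tex.sparse_sample(smp, in.uv);
        if (color.resident()) {
            return color.value();
        }
        // Fallback path: clamp the LOD to an always-resident mip level.
        return tex.sample(smp, in.uv, min_lod_clamp(fallbackLOD));
    }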
After rendering, it is time to query the access counters to get a sense of which sparse tiles are being accessed the most in the current view. This is how you collect the texture access counters in Metal. You start by creating a buffer to store the access counters. Then you encode a blit command to copy the counters into your buffer, specifying the sparse texture, mip level, slice, and region you are interested in.
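A sketch of that blit, assuming `commandBuffer` and `sparseTexture` exist, a `region` already expressed in sparse-tile coordinates, and a matching `tileCount`:

    // One 32-bit counter per sparse tile in the region of interest (assumed layout).
    let countersBuffer = device.makeBuffer(length: tileCount * MemoryLayout<UInt32>.size,
                                           options: .storageModeShared)!

    let blitEncoder = commandBuffer.makeBlitCommandEncoder()!
    blitEncoder.getTextureAccessCounters(sparseTexture,
                                         region: region,
                                         mipLevel: 0,
                                         slice: 0,
                                         resetCounters: true,   // start fresh for the next frame
                                         countersBuffer: countersBuffer,
                                         countersBufferOffset: 0)
    blitEncoder.endEncoding()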
Once you know which sparse tiles are the most visible based on the access counter values, you can then stream in your texture content. To do this, you map the sparse tiles that are being used and load the texture content into these tiles. At this stage, you can also unmap the tiles that are not as visible to save memory in your heap. Here is how you can map and unmap sparse tiles using Metal.
First, you create a Metal resource state command encoder. Then, you can simply encode map and unmap operations. You specify the regions, slices, and mip levels of the texture to map or unmap. Once regions are mapped, you can then blit or copy your texture data into the mapped memory. Let's see this rendering loop in action on a simple example.
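A sketch of those operations, with `regionToMap` and `regionToUnmap` assumed to be tile-aligned regions chosen from the counter values, and the mip levels purely illustrative:

    // Map newly visible tiles and unmap ones no longer needed,
    // all on the GPU timeline.
    let rsEncoder = commandBuffer.makeResourceStateCommandEncoder()!
    rsEncoder.updateTextureMapping(sparseTexture,
                                   mode: .map,
                                   region: regionToMap,
                                   mipLevel: 2,
                                   slice: 0)
    rsEncoder.updateTextureMapping(sparseTexture,
                                   mode: .unmap,
                                   region: regionToUnmap,
                                   mipLevel: 4,
                                   slice: 0)
    rsEncoder.endEncoding()
    // Then blit the new texture content into the freshly mapped tiles.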
First, the sparse heap and the sparse texture are created. As part of the initialization, the sparse tiles corresponding to the two lowest mipmap levels of the texture are mapped into the sparse heap, and their content is loaded using blit commands. Both the mapping and blit commands happen on the GPU timeline. At this point, the memory budget is far from being used. Quality will be refined after each subsequent rendering loop iteration.
After the first render, the access counters are queried, and based on the counter values, additional tiles are mapped and loaded. As the scene changes in subsequent renders, the access counters are updated. Based on the new values, sparse tiles that are no longer visible can be unmapped, and the associated memory can be used to stream in new sparse tiles corresponding to the newly visible regions. This is how you can continuously add new detail to your scene under a fixed memory budget. Let's now look at a real example in action.
In this demo, you can see our modern rendering Bistro scene presented as a split screen. On the left, the scene is rendered using traditional texture streaming, while on the right, sparse texture streaming is being used. And at the bottom, we're tracking the texture memory footprint for the full-screen render. As you can see, texture streaming with sparse textures uses less than half of the memory footprint of traditional texture streaming, with at least the same quality. Take a look at the menu: the quality is even better.
With sparse textures, we only map and load the tiles that are accessed. And with the texture access counters, we have an accurate view of the texture regions that are visible. On top of that, sparse texture streaming is done on a GPU timeline with great performance. So that's sparse textures.
To conclude, with Apple M1, we are unifying our iOS and macOS Metal features and bringing this support to the Mac platform. Apple M1 brings all of the high-end graphics and compute features that you would expect from a discrete GPU to our most portable Macs, and it does it with a great level of performance. It also brings key features such as tile shading and sparse textures that you may now use in your Mac applications to take full advantage of the Apple silicon GPU.
As a next step, you can learn more about the Metal features we discussed here by looking at our previous Tech Talks and WWDC sessions, via the videos link on developer.apple.com/metal. We have talks covering optimization for Apple silicon and the A-series GPUs, deeper dives into Metal features such as ray tracing and the GPU-driven pipeline, and a whole lot more. Be sure to check them out. Thank you for watching.