
WWDC20 • Session 10631

Bring your Metal app to Apple silicon Macs

Graphics and Games • macOS • 33:51

Meet the Tile Based Deferred Rendering (TBDR) GPU architecture for Apple silicon Macs — the heart of your Metal app or game’s graphics performance. Learn how you can translate or port your graphics-intensive app over to Apple silicon, and how to take advantage of TBDR and Metal when building natively for the platform. We’ll look at how TBDR compares with the Immediate Mode Rendering pipeline of older Macs, go through common issues you may face when bringing an app or game over, and explore how to offer incredible performance when building with the native SDK. We’ve designed this session in tandem with “Optimize Metal Performance for Apple silicon Macs.” After you’ve watched this session be sure to check that out next.

Speakers: Gokhan Avkarogullari, Samuel Mokrani

Transcript

Hello and welcome to WWDC. I'm Gokhan Avkarogullari, and I lead the graphics acceleration driver teams for Apple Silicon. Today, my colleague Sam and I will tell you how to bring your Metal-based apps to Apple Silicon Macs, and then how to make them faster, more robust and ready for the future.

For many years, Apple has designed the graphics processors found in the amazing A series chips used in iPhone, iPad and Apple TV. For the first time, we're bringing our deep expertise in GPU design to the Mac. So, let's get started. The Apple Silicon Mac contains an Apple-designed GPU that is designed to handle the workloads found in desktop games and Pro Apps, all while delivering new levels of speed and power efficiency.

This is DiRT Rally from Feral Interactive. This was recorded directly from an Apple Silicon Mac running the unmodified x86-compiled binary, translated using Rosetta. You can see all the modern rendering techniques applied to high-quality game assets. Complex shading, reflections, particle effects, motion blur and lush vegetation make the game look great. And, of course, the Apple-designed GPU makes it run smoothly. How do we deliver this level of performance? We do it using the same powerful Tile Based Deferred Rendering architecture used on iPhone, iPad and Apple TV.

If you are familiar with these devices, then you know that there are several important advantages to this architecture that are exposed via the Metal graphics and compute API. Capabilities such as Programmable Blending, Tile Shaders and Local Image Blocks are now available to take your Mac app or game to a whole new level of performance.

In fact, the GPU in the Apple Silicon Mac supports a unified Metal feature set that combines what was previously available on both macOS and iOS. And we've done our homework to make this move as easy as possible. This enables you to get the benefits of the faster, more power-efficient GPU with more features without having to retool your app or game. However, it's really important for you to follow the guidance in today's talk to account for certain behavioral differences and really unlock the full potential of the platform. Before we get into the specifics, let's take a look at the path you will take.

On Intel-based Macs, your app runs natively. On Apple Silicon Macs, your app will run out of the box under Rosetta's highly optimized translation layer. Under translation, your app will be automatically opted into several Metal consistency features that ensure your apps look great too. Translation and consistency features do have some performance costs, so your next step will be to recompile your app for Apple Silicon's native architecture. You will then find and fix any Metal consistency issues.

And if you want optimal performance on Apple GPUs, check out the session by my colleagues Michael and Dom, "Optimize Metal Performance for Apple Silicon Macs." Let's look at today's agenda. The first section is about adapting your app for the Apple GPU. I will start by reviewing the GPU architecture of Intel-based Macs and the differences with the Apple GPUs, and I'll discuss the APIs supported on the Apple Silicon Mac and Metal features you can use in your apps.

Different GPU families will behave differently when it comes to undefined behaviors, programming errors and performance characteristics. So in the second section, we will talk about a few common issues we observed on some apps coming from Intel-based Macs. And finally, Sam will talk about what we recommend you do for consistent rendering using Metal on the Mac.

So let's start with adapting your app for the Apple GPU. Let me highlight some of the major changes from the Intel-based Mac to Apple Silicon Mac when it comes to the GPU. Apple Silicon Mac contains an Apple-designed GPU, whereas Intel-based Macs contain GPUs from Intel, AMD and Nvidia. Apple GPU is built on the Tile Based Deferred Renderer architecture, whereas Intel, AMD and Nvidia GPUs are known as immediate mode renderers. I will cover the differences in a minute.

Apple-designed GPUs combine the feature sets of iOS and macOS. Both the Mac 2 and Apple GPU family feature sets are supported. Intel-based Macs support only the Mac 2 family. Now, let me describe to you how a Tile Based Deferred Renderer differs from immediate mode renderers. Let us start with an immediate mode renderer, abbreviated here as IMR.

This is a logical view of the architecture. From left to right, the hardware fetches vertex attributes, executes the vertex shaders and transforms geometry into screen space with the help of fixed-function hardware. Then, the fragment stage rasterizes the triangles and performs some early depth testing for efficiency. An important note here is that triangles can span the entire screen, or two triangles from the same draw call might end up in widely different places on the screen, meaning that the hardware needs the entire depth buffer and only can cache a small subset of it. Early depth test is followed by shading. When early depth test cannot be done, the depth testing is done post-fragment shading.

There's also alpha test and blending, where access to color buffers are required. Just like as it is for the depth buffer, the color buffer is needed in its entirety, and only a small part of it can be cached. Now that we reviewed IMR architecture, I will move on to the Tile Based Deferred Renderer and highlight the differences step by step. The first change is tiling the geometry in screen space and creating a database for all geometry in memory that I'm going to refer to as Tiled Vertex Buffer.

Since we have all the geometry information for an entire tile, we can rasterize it up front. As a result, for each pixel in the tile, we know for which opaque fragment the shading will have to happen. We also know which opaque fragments will fail the depth test. As a result, the hidden surfaces are entirely removed. We don't shade any fragment that will later be occluded by another one. We can do this with just tile size depth buffer on the chip rather than the entire depth buffer.

It does not matter if a triangle is full-screen or in the same draw call triangles are widely distributed on the screen. We get perfect testing with just a tile size on-chip depth buffer thanks to the Tiled Vertex Buffer. Additionally, if we are not going to use that depth buffer later, we can use memoryless render target feature and not store the depth data and not use memory for the depth buffer. It's a win-win situation.
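As a minimal sketch of the memoryless render target feature mentioned here, a depth attachment that lives only in on-chip tile memory might be created like this. The helper name and dimensions are illustrative, not from the session:

```swift
import Metal

// Hypothetical helper: a depth attachment with no system-memory backing.
// On Apple GPUs, .memoryless keeps the depth buffer entirely in tile memory.
func makeMemorylessDepth(device: MTLDevice, width: Int, height: Int) -> MTLTexture? {
    let desc = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .depth32Float, width: width, height: height, mipmapped: false)
    desc.usage = .renderTarget
    desc.storageMode = .memoryless  // no allocation in system memory
    return device.makeTexture(descriptor: desc)
}
```

Pair such an attachment with a DontCare store action, since its contents cannot be flushed to memory after the pass.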

Not only that, we also can do blending and alpha testing without ever having to load from the full color buffer in memory. Because we process an entire tile at once, we only need a tile's worth of color buffer on the chip. Please note that the arrow for color buffer in memory is one-way. We never need to access it for alpha testing, write mask or blending or for anything else.

This architecture allows for alpha test and blending operation to be performed by the shader core instead of fixed-function hardware. It also allows you access to the contents of the color buffer in your shader, which you can use for programmable blending, custom MSAA resolve, and so on. Obviously, feedback such as depth writes between the depth buffer and shading is covered by this architecture.

To recap, Tile Based Deferred Renderers have two main phases: tiling, where all the geometry is processed, and rendering, where all the pixels are processed. Let us summarize. On a TBDR GPU, a well-written app uses a lot less memory bandwidth than on an IMR GPU. Since processing is done for a full tile, all blending happens in registers, so it's really fast and power-efficient. This is great for games, Pro Apps and UI performance as well.

No reason to re-fetch color, depth or stencil buffers over and over again. As a result, memory bandwidth is saved, performance increases, power consumption goes down. Now that we reviewed the architecture, we can move on to app's view of the GPU. With Apple Silicon Mac, you get what you had before and more. This new Mac supports everything the Intel-based Macs support.

All the APIs we and you love, such as Metal, MPS, Core Animation, and so on, are available. All the other APIs that are also used, such as OpenGL and OpenCL, are also available on Apple Silicon Mac. I will briefly talk about GL and CL first and then move on to Metal.

OpenGL and OpenCL are still deprecated, but they are not yet removed. They are present for backwards-compatibility purposes with the library of existing Mac apps that use these legacy technologies. OpenGL is supported up to version 4.1 and OpenCL to version 1.1. For features not in the core, you should always use extension or query checks before using them. Note that Apple-designed GPUs are engineered for Metal. If you are still using OpenGL and OpenCL in your app or game, you should prioritize moving to Metal as soon as possible. Otherwise, you will continue to leave performance and capability on the table.

Now, on to more exciting things. Let's talk about those new features that Apple-designed GPUs bring to the Mac. Apple Silicon Macs have the feature set defined by the Metal GPU Family Mac 2. Additionally, for the first time ever, the features that are specific to Apple GPUs are coming to the Mac. These features provide tremendous upside for your apps in terms of better performance, lower power consumption and lower memory usage. Through features such as Programmable Blending, Tile Shaders and Local Image Blocks, you can accelerate deferred, forward, and all types of tiled rendering.

With memoryless render targets and on-chip MSAA resolve, MSAA feature shines on this system. You can even use MSAA with deferred rendering because all the processing is done on the chip in this architecture without the cost of additional memory bandwidth and storage. There are many other features, such as ASTC textures, image block sample coverage control, and so on that open up opportunities for new techniques and optimizations. Let's recap. Your app should work fine, as all APIs are already supported.

Be aware of the underlying Tile Based Deferred Renderer architecture and use its best practices. And don't stop there. Take advantage of the new features and the new architecture. They will improve your app's performance even more. Now we are going to look into some issues that we observed on Metal API misuse on the Mac on some apps. We will review what the right usage is.

I will start with Metal feature detection. After that, Sam will take over and will talk about a few app-correctness issues that we observed and how we can fix those, such as load and store actions, position invariance and Threadgroup Memory synchronization and sampling previous depth while forming up new depth values.

Let us start with Metal feature detection. Previously, we talked about the unified feature set support. So how do you go about finding the availability of these features? Applications should query Metal GPU features directly. Metal organizes features into different GPU families that can be queried for a given device.

Some individual feature queries are also available. Make use of those for features that vary within the same GPU family. One important note. Querying the GPU name to know if a particular feature is supported is not the best practice. The code is not future-proof. Let's see an example. Here is some code basing feature usage on the OS, differentiating between macOS and iOS platforms.

Then, it uses the GPU name to decide some other information, such as SIMD group size and if the device is marked as low-power. And finally, it assumes that Apple GPU features are not available on macOS. This is not true anymore with Apple Silicon Mac. Instead, applications should gather this information using the Metal API. The supportsFamily API can be used to know if Apple GPU features are supported. The SIMD group size can be found using the method threadExecutionWidth on a compute pipeline object.

And to know if a GPU needs to be treated as integrated or discrete, use the isLowPower API. Note that for Apple GPUs, isLowPower returns "false," which means that you should treat these GPUs in a similar way as discrete GPUs. This is because the performance characteristics of Apple GPUs are in line with discrete ones, not the integrated ones.
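Putting these queries together, an API-driven sketch might look like the following. The particular GPU family level checked is an illustrative assumption:

```swift
import Metal

// Sketch of API-driven feature detection: query Metal directly instead of
// branching on the OS or the GPU name.
func logCapabilities(device: MTLDevice, pipeline: MTLComputePipelineState) {
    // Are Apple GPU family features (programmable blending, tile shaders,
    // local image blocks) available? The exact family level is illustrative.
    let hasAppleFeatures = device.supportsFamily(.apple5)

    // SIMD group size for this compute pipeline.
    let simdWidth = pipeline.threadExecutionWidth

    // Apple GPUs report isLowPower == false: treat them like discrete GPUs.
    let treatAsDiscrete = !device.isLowPower

    print("Apple features:", hasAppleFeatures,
          "SIMD width:", simdWidth,
          "treat as discrete:", treatAsDiscrete)
}
```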

Despite the property name, though, Apple GPUs are also way, way more power-efficient than both integrated and discrete GPUs. Now that we reviewed Metal feature detection, it's time to review how to correctly use load and store actions. Sam will tell you all about it and more. Sam? Thanks, Gokhan. Let's now talk about a common issue related to Metal load and store actions.

First, a quick recap. Load and store actions control the action performed on an attachment, either color, depth or stencil, at the start and end of a render pass. On Apple GPUs, they directly control the state of the on-chip tile memory. This memory is initialized based on load actions and flushed based on store actions.

An incorrect usage can cause performance issues if attachments are unnecessarily loaded or stored. Please refer to the video "Optimize Metal Performance for Apple Silicon Macs" to know how to best use load and store actions. But in this section, let's focus on correctness issues. Here is an example showing some incorrect usage on the left.

We had rendered the skybox in an earlier pass, and then drew the scene in the final pass. However, we used a DontCare load action for the final pass, causing the skybox to not be uploaded to tile memory, generating graphics artifacts. Setting the load action to Load avoids this issue.

Let's take a look at a diagram to understand what is happening. We first start with a previously rendered texture in system memory. And if a DontCare load action is chosen, Apple GPUs will not upload the texture from system memory to tile memory, leaving it uninitialized. The scene will then be drawn on top of this uninitialized data, and if the drawing does not cover the full frame buffer, uninitialized data will remain, creating correctness issues. However, if loadAction load is chosen, the tile memory is correctly initialized.

In this case, the scene is rendered on top of the previous content. Let's now look at store actions. If a DontCare store action is chosen, the tile memory is not being flushed to system memory, leaving it potentially uninitialized. This is not the desired behavior if this texture is consumed by a later render pass.

However, if a storeAction store is being used, the rendering done in tile memory is populated to system memory. Let's summarize how to correctly use load and store actions. For load actions, make sure to only load if the previous content needs to be preserved. This is the case when a draw does not cover the full frame buffer and accumulation is needed on top of the previous rendering. If you don't need accumulation, do not use a loadAction load, as it costs performance. Similarly, choose to store only if the content will be consumed in a later pass. Otherwise, make sure to not store to avoid extra memory traffic.
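These rules might be sketched in code as follows, for a final pass that accumulates on top of a previously rendered skybox. The texture bindings and function name are illustrative:

```swift
import Metal

// Sketch: load/store actions for a pass that draws on top of earlier content.
func makeFinalPassDescriptor(color: MTLTexture, depth: MTLTexture) -> MTLRenderPassDescriptor {
    let pass = MTLRenderPassDescriptor()

    // The skybox from the earlier pass must be preserved: Load, not DontCare.
    pass.colorAttachments[0].texture = color
    pass.colorAttachments[0].loadAction = .load
    // The result is consumed later (e.g. presented), so store it.
    pass.colorAttachments[0].storeAction = .store

    // Depth is not needed after this pass: clear on load, discard on store.
    pass.depthAttachment.texture = depth
    pass.depthAttachment.loadAction = .clear
    pass.depthAttachment.storeAction = .dontCare

    return pass
}
```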

Let's carry on with an issue related to position invariance. The results of the same vertex position calculation across different vertex shaders can be slightly different. This is because, by default, the Apple GPU compiler highly optimizes vertex shaders. As a consequence, different optimizations between vertex shaders might lead to position value differences. If an application relies on position invariance without enabling it, correctness issues might be observed.

So, in Metal, position invariance is not guaranteed by default. It has to be explicitly enabled. This is exactly what is happening in this image. Due to position invariance being disabled, position outputs are slightly different between two passes, causing many pixels to be discarded. Here is the same image with position invariance correctly enabled.

Let's illustrate this issue with a diagram. Let's assume that the first pass of a multipass algorithm has rendered some geometry and stored the depth texture in system memory for later usage. Then comes Pass2 that loads this depth texture. Also note that the tile memory that is holding color values is cleared at the beginning of the render pass.

Pass2 then starts rendering the same geometry as Pass1, but this time with the depth test set to EQUAL. And due to position invariance being disabled, Pass2 generates slightly different depth values than Pass1. This causes some pixels to fail the depth test and be discarded. For those pixels, the cleared color is populated to system memory, causing correctness issues. Let's now take a look at the code.

Here we have the two vertex shaders used in the previous diagram, vertexPass1 and vertexPass2. As you can see, the position computation is exactly the same between the two shaders. It calls computePosition in both cases. But there is no guarantee that it will actually produce the exact same values for the position.

This is due to the code around the position computation that might trigger differences in compiler optimizations, affecting the position itself. How can this be avoided? First, make sure to pass the preserveInvariance option to the Metal compiler when creating a Metal library. Note that there is a similar option if you use the off-line compiler.

And for all the vertex shaders within this library that require position invariance, just add the invariant keyword on the position output. Note that you need to set these two options to ensure position invariance. And since position invariance comes at a performance cost, carefully enable these options only for shaders that really require it.
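A sketch of the shader-side piece might look like this; the struct, shader name, and buffer layout are illustrative, not the code from the session:

```metal
#include <metal_stdlib>
using namespace metal;

// Mark the position output invariant so every pass of a multipass
// algorithm produces bit-identical positions (and therefore depth values).
struct VertexOut {
    invariant float4 position [[position]];
};

vertex VertexOut vertexPass2(uint vid [[vertex_id]],
                             constant float4 *positions [[buffer(0)]])
{
    VertexOut out;
    out.position = positions[vid];  // same computation as in vertexPass1
    return out;
}
```

On the host side, the matching flag is the preserveInvariance property on MTLCompileOptions (or the corresponding option of the offline compiler); both the flag and the keyword are needed.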

We encourage you to go through your application and identify vertex shaders that require position invariance. This will likely be needed where the depth-compare function is set to "equal," as it will require matching depth values between different passes. And to enable position invariance, make sure to use the preserveInvariance flag during compilation and the invariant keyword on the position output in your vertex shaders. Now let's take a look at Threadgroup Memory Synchronization. This is an area where you can get great performance if you have a good understanding of the underlying hardware architecture.

Let's first refresh the notion of threadgroups and SIMD groups. Compute passes launch a set of individual threads organized into threadgroups. The threads within a threadgroup execute together and can share a common block of threadgroup memory. Access to this memory needs to be properly synchronized for correct ordering. Threadgroups are further organized into SIMD groups that execute concurrently on the GPU. However, the SIMD group size is GPU-specific. It is 32 on Apple GPUs. Knowing the SIMD group size can be very useful for optimizing your compute shaders, particularly when there is only one SIMD group per threadgroup, in which case there is no need to synchronize between SIMD groups.

In order to ensure correct synchronization on any GPU, you should query the SIMD group size through the Metal API at runtime. If an application does not use the correct SIMD group size, synchronization might be missing, causing correctness issues. Let's see an example. The artifacts in this image are due to the fact that the application expects a threadgroup of 64 threads to contain only one SIMD group, omitting any kind of threadgroup memory synchronization.

Here is the correct image with proper synchronization. Let's dig into the compute shader used in this application. This kernel makes use of threadgroup memory. It has also been launched using a threadgroup size of 64. You can also see that the threadgroup memory is first initialized and read back later in the shader.

Note that there is cross-thread communication, as the buffer written from one thread will be consumed on another thread. However, nothing in the code guarantees correct ordering. Something is missing here. You probably already guessed what is missing: memory synchronization. Here is what the code looks like when adding proper synchronization. First, the application queries the SIMD group size using the built-in threads_per_simdgroup. This value can then be used to inject the correct synchronization.

If there is only one SIMD group per threadgroup (in this shader, that means simd_size is 64), then a simple SIMD group barrier is enough to guarantee memory ordering. However, if there are several SIMD groups, a full threadgroup barrier is required to synchronize all the SIMD groups. And even though extra control flow is generally not a good practice, in this particular case, as simd_size is a constant, the compiler is able to optimize out the code path that will not be taken. This code will ensure correct synchronization on any GPU.
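The pattern described here might be sketched as the following kernel; the threadgroup array, its access pattern, and the kernel name are illustrative:

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: pick the cheapest barrier that is still correct, based on the
// GPU's SIMD group size queried in the shader.
kernel void exchange(threadgroup float *shared   [[threadgroup(0)]],
                     uint tid       [[thread_index_in_threadgroup]],
                     uint tg_size   [[threads_per_threadgroup]],
                     uint simd_size [[threads_per_simdgroup]])
{
    shared[tid] = float(tid);  // each thread writes its own slot

    // simd_size is a compile-time constant, so the untaken branch is
    // optimized out by the compiler.
    if (tg_size <= simd_size) {
        // One SIMD group: a SIMD group barrier is enough.
        simdgroup_barrier(mem_flags::mem_threadgroup);
    } else {
        // Several SIMD groups: a full threadgroup barrier is required.
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    // Safe cross-thread read after the barrier.
    float neighbor = shared[(tid + 1) % tg_size];
    (void)neighbor;
}
```

The threadgroup allocation backing the first argument is sized from the host with setThreadgroupMemoryLength(_:index:) on the compute command encoder.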

For best performance, you might want to rewrite your shaders with a 32 SIMD group size in mind to avoid threadgroup barriers, as they are expensive. Applications could have different sets of shaders optimized for a given SIMD group size and choose at runtime which variant to use. With that in mind, go through your compute shaders and ensure correct threadgroup memory synchronization. Different GPUs have different SIMD group sizes, which require different synchronization. Make sure to query the SIMD group size either in your shader, using the built-in threads_per_simdgroup, or with the Metal API, using threadExecutionWidth.

And for better performance, write multiple versions of your shader optimized for a given SIMD group size. Let's now explore a final issue that we observed related to depth and stencil texture sampling. Applications should ensure correct texture and attachment bindings. More specifically, a texture used as an attachment cannot also be sampled in the same render pass.

This might create a concurrent read-write access to the same rendering texture and trigger correctness issues. This is exactly what is happening in this image. The current depth attachment is also being sampled in the same render pass. No artifacts are being observed if the application avoids this undefined behavior. A diagram will help us understand what is happening. We first start with a depth texture in system memory.

Then the first draw comes in and starts accumulating depth. Fragment shaders within this draw execute and sample the depth texture from system memory. Then comes a later draw. Same story here. It will accumulate depth and sample the depth texture. One important point in this diagram is the fact that depth is accumulated prior to fragment shader execution. This will be the case for opaque geometry, where fragment shaders don't modify the depth.

At this point in the render pass, all the depth information is finalized. Apple GPUs will then start flushing the on-chip depth and stencil memory back to system memory. This is where a concurrent read-write access to the same depth texture introduces correctness issues. And note that this race condition can happen on any draws in your render pass, not only the last one.
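One way to avoid this race is to snapshot the depth attachment with a blit pass before the render pass that samples it; a sketch, with all names illustrative:

```swift
import Metal

// Sketch: copy the depth texture before the pass, then sample the copy
// while rendering into the original attachment.
func snapshotDepth(queue: MTLCommandQueue, depth: MTLTexture) -> MTLTexture? {
    let desc = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: depth.pixelFormat,
        width: depth.width, height: depth.height, mipmapped: false)
    desc.usage = .shaderRead
    desc.storageMode = .private

    guard let copy = queue.device.makeTexture(descriptor: desc),
          let cmd = queue.makeCommandBuffer(),
          let blit = cmd.makeBlitCommandEncoder() else { return nil }

    blit.copy(from: depth, to: copy)  // GPU-side copy of the whole texture
    blit.endEncoding()
    cmd.commit()  // submitted before the render pass, so ordering is preserved
    return copy   // bind this as the sampled texture in the render pass
}
```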

In summary, ensure that your application does not sample the current depth and stencil attachments in the same render pass. This is undefined behavior on any GPU. Also, do not use texture or memory barriers to work around this issue. Those are very expensive, especially on Apple GPUs with their Tile Based Deferred Renderer architecture. Instead, if your application really requires sampling the current attachments, create a second copy for sampling. Let's now wrap up with what you need to do in order to get consistent rendering using Metal.

In our application testing, we observed three main issues affecting graphics correctness. That is why we developed three Metal work-arounds that are enabled for backward compatibility. These work-arounds are only applied to applications compiled with the macOS Catalina SDK or earlier. The first issue is applications using DontCare load actions where they should have been loads. In this case, Metal remaps all DontCare load actions to actual loads to avoid correctness issues.

The second issue is related to position invariance. Some applications rely on position invariance without enabling it. Metal then forces position invariance for all vertex shaders. And finally, if the same depth texture is both sampled and rendered in the same render pass, Metal will snapshot the texture prior to rendering. Again, these work-arounds are only enabled for applications compiled with the macOS Catalina SDK or earlier.

And as you probably suspect, they also come at a performance cost, so this is really not what you want. To help you identify these issues, we augmented the Metal API validation layer to catch a few of those, such as the misuse of DontCare load and store actions, as well as sampling the current depth and stencil attachments.

Here is a recap of the Metal best practices to get consistent rendering across different GPUs. Go through your application and make sure that this is already what you are doing. And if not, just follow these simple recommendations. Regarding Metal GPU feature detection, make sure that it is API-driven by querying the Metal GPU features directly.

Carefully choose your load and store actions to address both correctness and performance issues. Do not unnecessarily load or store attachments, as it will increase memory traffic. Also consider using memoryless attachments to lower the memory footprint of your application. These can typically be used for depth attachments. Enable position invariance where required.

Ensure correct threadgroup memory synchronization by querying the SIMD group size and using the appropriate synchronization primitives. And finally, avoid sampling the current depth and stencil attachments by snapshotting the attachments prior to rendering. We are very excited to bring Apple Silicon to the Mac. This means that your applications will be more power efficient and run faster out of the box.

We can't wait to see how much faster they will run once you start optimizing for Apple Silicon. This will also enable you to use brand-new features on the Mac that were not possible before, such as programmable blending, tile shaders, memoryless frame buffers, and more. And finally, this transition aligns the technology used on Mac platforms with existing Apple platforms. That means that you can easily share a common code base across all Apple products. And this makes it even easier to bring all your iPad and iPhone applications to the Mac.

Apple Silicon Macs bring new opportunities for you. At the beginning of this session, we showed the amazing graphics performance a game like DiRT Rally gets. On Apple GPUs, even the most demanding pro applications will get great performance. As an example, I would like to show you a video of Cinema 4D.

This is recorded live on an Apple Silicon Mac. Note that it is also running the original x86 binary under Rosetta translation. In this video, the application is rendering a high-polygon model in real time with realistic camera properties such as user-specified aperture, f-stop and focal point. Enjoy how smoothly it runs on an Apple Silicon Mac while rendering a graphics-demanding scene.

Where does this all fit in the transition process to Apple Silicon Macs? At this point, your application should look great without Metal work-arounds when running on the new macOS SDK. But your journey should not stop here. To take your application to the next level and greatly improve performance, make sure to check out the session "Optimize Metal Performance for Apple silicon Macs." Thank you.