Graphics and Games • iOS, macOS, tvOS • 36:16
Power up your shader pipeline with enhancements to the Metal shader compilation model — all leading to a dramatic reduction in Pipeline State Object (PSO) loading time, especially upon first launch. Learn about explicit PSO caching and sharing of GPU binaries using Metal binary archives and dynamic libraries. And we’ll detail the toolchain to create libraries and improve your shader compilation workflow.
Speakers: Kyle Piddington, Ravi Ramaseshan
Transcript
Hello and welcome to WWDC. My name is Kyle Piddington. I'm an engineer in Apple's GPU software technology group. In this session, my colleague Ravi and I are going to show you some of the new advancements to ship your new or existing precompiled GPU code. This talk will consist of four parts. First up, I'll provide an overview of Metal's current shader compilation model.
Next, I'm going to introduce Metal Binary Archives, a new way for you to take control over shader caching, and ship precompiled GPU executables to your users. I'm excited to share the new support for dynamic libraries in Metal. This feature will allow you to link your compute shaders against utility libraries dynamically. And finally, Ravi will present in detail the set of tools that you have in your toolchain to make the most out of these new features. Let's get started.
We'll begin with a review of the shader and pipeline compilation process on Apple platforms. As you know, the Metal Shading Language is our programming language for shaders. Metal compiles this code into Apple's Intermediate Representation, also known as AIR. This can be done off-line in Xcode, or at runtime on the target device itself.
Building off-line avoids the runtime cost of compiling source code to AIR. In both cases, however, when creating pipeline state objects, this intermediate representation is further compiled on device to generate machine-specific code needed for each particular GPU. This process occurs for every pipeline state. To accelerate recompilation and recreation of pipelines, we cache the Metal Function variants produced in this step for future pipeline creation.
This process is great, and has served us well for many years. But for games and apps that adopt the Metal best practice of building pipeline objects early on to provide a hitch-free experience, this process can potentially result in long loading screens. Additionally, under this model, apps are unable to reuse any previously-generated machine code subroutines across pipeline state objects.
We've gathered feedback coming in from our ecosystem of developers, and have been able to identify a concrete set of needs that will enable you to address these challenges. You might want a way to save the entire time of pipeline state compilation, from source, to AIR, to a GPU binary.
You might also want a mechanism that enables sharing common subroutines and utility functions without needing to compile the same code twice, or having it loaded in memory more than once. Having the ability to ship apps that already include the final compiled code for executables as well as libraries gives you the tools to provide a fantastic first-time launch experience. And allowing you to share these executables and libraries with other developers makes their development easier.
One of the ways we are addressing these needs is via the Metal Binary Archives. Since the beginnings of Metal, apps have benefited from a system-wide shader cache that accelerates creating pipeline objects that have been created from previous runs of the application. With Metal Binary Archives, explicit control over pipeline state caching is being provided to you.
This direct control over caching gives you the opportunity to manually collect compiled pipeline state objects, organize them into different archives based on usage or need, and even harvest them from a device and distribute them to other compatible devices. Binary Archives can be thought of as any other asset type. You have full control over the Binary Archive lifetimes, and these persist as long as desired. Binary Archives are a feature of the Metal GPU Families Apple3 and Mac1.
Creating a Binary Archive is simple. For this feature, we created a new descriptor type for Binary Archives. I use this descriptor to create a new Metal Binary Archive from the device. This descriptor contains a URL property. And this is used to determine if I want to create a new, empty archive, or if I want to load one from disk. When we request a new archive be loaded, this file will be memory-mapped in. And we can immediately start to use these loaded archives to accelerate our subsequent pipeline build requests.
The Binary Archive API allows me to directly add pipelines I'm interested in to the archive. I can add Render, Compute, and TileRender pipelines. Adding a pipeline to the Binary Archive causes a back-end compilation of the shader source, generating the machine code to be stored in the archive. Finally, once I'm done collecting all the pipeline objects I'm interested in, I call serializeToURL to save the archive to disk. Once I have my Binary Archives on disk, I can harvest them from device and deploy them on other compatible devices to accelerate their pipeline state builds.
The only requirement is that these other devices have the same GPU and are running the same operating system build. If there's a mismatch, the Metal framework will fall back on runtime compilation of the pipeline functions. Once I have my Binary Archive populated, reusing a cached pipeline is straightforward.
When creating a pipeline, I set the pipeline descriptor's binaryArchives property to an array of archives. The framework will then search the array linearly for the function binaries. If the pipeline is found in any of the Binary Archives on the list, it will be returned to you, avoiding the compilation process entirely and leaving the Metal Shader Cache untouched.
In the case that the pipeline is not found, the OS's MTLCompilerService will kick into gear and compile my AIR source to machine code, return the results, and cache the results in the Metal Shader Cache. This process takes time, but the pipeline will be cached in the Metal Shader Cache to accelerate any subsequent pipeline build requests.
Now that I've gone over the workflow, let's take a look at the API to accomplish it. First, I create the MTLBinaryArchiveDescriptor. This is used to determine whether I want to create a new, empty archive or to load an existing one. Creating a Binary Archive is always done from a descriptor.
In this case, I set the URL to nil. The device will create a new, empty archive. Finally, I call the function "makeBinaryArchive" to create it. Next, I'll populate a Binary Archive using pipeline descriptors. I can add Render, Compute, and TileRender descriptors to the Binary Archive. Reusing compiled functions from a Binary Archive allows me to skip the back-end function compilation. I can create my pipeline descriptors just as I always have and use the new binaryArchives property to indicate which archives should be searched. I want to do this before creating the pipeline.
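Put together, the creation and population steps just described might look like the following sketch. Configuration of the pipeline descriptor (functions, attachments, and so on) is elided, and all names here are illustrative.

```swift
import Metal

// Sketch: create an empty binary archive, collect a pipeline into it,
// and search the archive when building the pipeline state.
func buildArchivedPipeline(device: MTLDevice,
                           pipelineDescriptor: MTLRenderPipelineDescriptor) throws -> MTLRenderPipelineState {
    // A nil URL requests a new, empty archive.
    let archiveDescriptor = MTLBinaryArchiveDescriptor()
    archiveDescriptor.url = nil
    let archive = try device.makeBinaryArchive(descriptor: archiveDescriptor)

    // Adding the descriptor back-end-compiles its functions and
    // stores the resulting machine code in the archive.
    try archive.addRenderPipelineFunctions(descriptor: pipelineDescriptor)

    // Archives in this array are searched linearly, in order,
    // before falling back to compilation.
    pipelineDescriptor.binaryArchives = [archive]
    return try device.makeRenderPipelineState(descriptor: pipelineDescriptor)
}
```

The same pattern applies to compute and tile-render descriptors via their corresponding add methods.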
Once you have collected all the pipelines you are interested in, you can serialize the Binary Archive to a writable file location on disk, using the method "serialize." Here, I'm serializing the archive to my application's "documents" directory. On the next run of my app, I can now deserialize the archive to avoid recompiling the pipelines that were previously added and serialized.
I simply set the URL to point to the location of an existing, pre-populated cache on disk. Now, one final note about archive search. Depending on your use case, you may find it helpful to be able to short-circuit the fallback behavior of compiling a pipeline when it's not found in the archive. In this case, you can specify the pipeline compile option FailOnBinaryArchiveMiss.
If the pipeline is found in any of the archives, it is returned to you as usual. However, in the case that it is not found, the device will return nil. One use case I recommend is debugging: avoiding the compilation process will let you diagnose any problems in your app's logic or your archive's data.
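The serialize-and-reload flow, combined with the miss-detection option, might be sketched like this. The archive, descriptor, and file location are assumptions passed in by the caller; note that in Swift the creation call throws on a miss rather than returning nil.

```swift
import Metal

// Sketch: persist a populated archive, reload it on a later run, and
// use failOnBinaryArchiveMiss to catch pipelines missing from it.
func loadCachedPipeline(device: MTLDevice,
                        archive: MTLBinaryArchive,
                        archiveURL: URL,
                        pipelineDescriptor: MTLRenderPipelineDescriptor) throws -> MTLRenderPipelineState {
    // First run: persist the collected pipelines to disk.
    try archive.serialize(to: archiveURL)

    // Later run: point the descriptor's URL at the existing file,
    // which is memory-mapped in rather than fully loaded.
    let descriptor = MTLBinaryArchiveDescriptor()
    descriptor.url = archiveURL
    let loaded = try device.makeBinaryArchive(descriptor: descriptor)

    pipelineDescriptor.binaryArchives = [loaded]
    // With this option, an archive miss fails pipeline creation
    // instead of silently falling back to compilation.
    return try device.makeRenderPipelineState(descriptor: pipelineDescriptor,
                                              options: .failOnBinaryArchiveMiss,
                                              reflection: nil)
}
```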
Let's take a moment to discuss the memory considerations of using Binary Archives. As mentioned before, it is important to note that the Binary Archive file is memory-mapped in. This means that we need to reserve a range of virtual memory in order to access the archive's contents. This virtual memory range will be released when you release your cache, so it's important to close any Binary Archives that are no longer needed for optimal use of the virtual address space.
When collecting new pipelines, Binary Archives present a similar memory footprint to using the system's Metal Shader Cache. But unlike when using the Metal Shader Cache, we have the chance to free up this memory. Having explicit control over archive lifetime allows you to serialize and release a Metal Binary Archive when you are done collecting pipeline state objects.
In addition, when you reuse an existing archive, the pipelines in this archive do not count against your active app memory. You can serialize, and then reopen this archive and only use it for retrieving cached pipelines, effectively freeing the memory that was used in the collection process. This is not possible when relying on the system's Metal Shader Cache.
I'd like to wrap up this part of the session by discussing some of the best practices for working with Binary Archives. Although there is no size limit for Binary Archives, I recommend dividing your game assets into several different caches. Games are an excellent candidate for breaking up caches into frequently used pipelines and per-level pipelines. Dividing the cache gives you the opportunity to completely release no-longer-needed caches. This will free memory in case we've collected any new pipelines, as well as the virtual memory range in use.
When following this guidance, Binary Archives give you granular control and should be favored over prewarming the Metal Shader Cache. Modern apps often have too many unique permutations of shader variants that are generated based on user choices. With Metal Binary Archives, you can now capture them all at runtime.
Let's take a look at how this all comes together in practice. We've partnered with Epic Games to quantify exactly how reusing a pre-harvested Binary Archive can help improve pipeline state object creation times, as well as the developer workflow in the context of Unreal Engine. For this test, we used the pipeline state workload of a AAA title, Fortnite. Fortnite is a large game. It's got a big world and many character and item customization options. This makes for a large number of shader function variants and pipeline state objects, over 11,000, in fact.
Epic Games follows the Metal Best Practices and compiles the needed pipeline state objects at load time, which allows minimizing hitching at runtime and delivers the smoothest experience possible to users. But apps, as we mentioned, cannot benefit from the Metal Shader Cache before it's been populated. So the up-front compilation time adds up, potentially making the first time launch experience take longer than desired.
By pre-seeding a harvested Metal Binary Archive that had collected function variants from 1,700 pipeline state objects, we observed a massive speedup in the creation times when we compare against starting with an empty Metal Shader Cache. These results were measured on a six-core, three gigahertz Mac Mini with 32 gigabytes of RAM. When we focus on pipeline build times, we go from spending one minute and 26 seconds building pipelines to just three seconds. Overall, a speedup of over 28 times.
To summarize, Metal Binary Archives allow you to manually manage pipeline caches. These can be harvested from a device and deployed on other compatible devices to dramatically reduce the pipeline creation times the first time a game or app is installed, and after a device reboots in the case of iOS. AAA games and other apps bound by a very large number of pipeline state objects can benefit from this feature to obtain extraordinary gains in pipeline creation times under these conditions.
Building and shipping GPU executable binaries with your application allows you to accelerate your first time launch experience and cold-boot app experiences. I hope you take advantage of this new feature. And that's it for Metal Binary Archives. Next, I'm going to talk about another new feature we're bringing to the compilation model, dynamic libraries.
Metal Dynamic Libraries are a new feature that will allow you to build well-abstracted and reusable compute shader library code for your applications. I'll be discussing the concept, execution, and details of dynamic libraries. Today, developers may choose to create utility libraries of Metal functions to compile with their kernels. Off-line compilation can save time while generating these libraries, but there are still two costs incurred when using a utility library.
Every app pays the cost of generating machine code for the utility library at PSO generation. In addition, compiling multiple pipelines with the same utility library results in duplicated machine code for subroutines. This can result in longer pipeline load times due to back-end compilation and increased GPU memory usage.
And this year, we're introducing a solution to this problem, the Metal Dynamic Library. The Metal Dynamic Library enables you to dynamically link, load and share utility functions in the form of machine code. This code is reusable between multiple compute pipelines, eliminating duplicate compilation and the storing of shared subroutines.
In addition, much like the Metal Binary Archive, the Metal Dynamic Library is serializable and shippable as an asset in your application. Before we dive into the API, let's talk about what a Metal Dynamic Library is. A Metal Dynamic Library is a collection of exported functions that can be called from multiple compute pipelines. Later, we will discuss which functions in your dylib are exported and how to manage them. Unlike an executable Metal Library, dynamic libraries cannot be used to create MTLFunctions. However, standard Metal Libraries can import functions that are implemented in a dynamic library.
At pipeline creation time, the dynamic library is linked to resolve any imported functions, much like a dynamic library is used in a typical application. So why might you want to use dynamic libraries? If your application can be structured around, or relies on, a shared utility codebase, dynamic libraries are for you. Using dynamic libraries in your app prevents recompiling and duplicating machine code across pipeline states.
If you are interested in developing Metal middleware, dynamic libraries provide you the ability to ship a utility library to your users. Unlike before where you would have to ship sources to developers, or compile their code with a static Metal Library, a dynamic library can be provided and updated without requiring users to rebuild their own metallib files.
Finally, dynamic libraries give you the power to expose hooks for your users to create custom kernels. The Metal API exposes the ability to change which libraries are loaded at pipeline creation time, allowing you to inject user-defined behavior into your shaders without recreating the Metal Library and MTLFunctions containing your entry points.
To determine if Metal dynamic libraries are supported for your GPU, check the feature query "supportsDynamicLibraries." In the next few slides, we'll work through an example of how to create a dynamic library, how symbols are resolved, and some more advanced linking scenarios. A standard Metal Library is compiled to AIR either through a makeLibrary-with-source call at runtime, or by compiling your library with the Metal toolchain.
To create a Metal Dynamic Library, we begin with a similar workflow. We start by creating a Metal Library, but when doing so, we specify that we'd like this library to be used as a dynamic library. Next, we call the function makeDynamicLibrary, which will back-end-compile our Metal code to machine code. This is the only time you will need to compile the dynamic library.
We need one more bit of information, a unique install name. At pipeline creation time, these names are used by the linker to load the dynamic library. The linker supports two relative paths, @executable_path, which refers to the metallib containing an executable kernel, and @loader_path, which refers to the metallib containing a load command. An absolute path can also be used.
With an install name and a library type, I'm now ready to create a dynamic library. Once I've set the compile options, I create a Metal Library which will compile my library from source to AIR. Then, I call the API method makeDynamicLibrary on the Metal device, which will compile my dynamic library into machine code.
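These steps might be sketched as follows. The source string and install name are illustrative assumptions; in a real app you would fall back to a non-dylib path when the feature query fails rather than trapping.

```swift
import Metal

// Sketch: create a dynamic library from MSL source at runtime.
// `utilitySource` holds the utility functions' source code.
func makeUtilityDylib(device: MTLDevice,
                      utilitySource: String) throws -> MTLDynamicLibrary {
    precondition(device.supportsDynamicLibraries,
                 "this GPU does not support Metal dynamic libraries")

    let options = MTLCompileOptions()
    options.libraryType = .dynamic
    // The install name is how the linker finds this dylib at
    // pipeline creation time.
    options.installName = "@executable_path/libUtility.metallib"

    // Front-end compile: source to AIR.
    let airLibrary = try device.makeLibrary(source: utilitySource,
                                            options: options)
    // Back-end compile: AIR to machine code. This is the only time
    // the dynamic library itself needs compiling.
    return try device.makeDynamicLibrary(library: airLibrary)
}
```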
Now that we've covered how to create a dynamic library, let's take a look at how we can use it. In these next few examples, I'll be discussing how we can link dynamic libraries at runtime. These operations can also be achieved when compiling Metal libraries off-line. And we'll discuss our off-line workflow later in this session.
To link a dynamic library when compiling a metallib from source, add the dynamic library to the "libraries" property of the MTLCompileOptions before you compile your library. The specified library will be linked at pipeline creation time. However, symbol resolution will be checked at compile time to make sure at least one implementation of the function exists. To review these steps, when creating a metallib from source, source files should include headers that define functions available in your dynamic libraries.
At compile time, dynamic libraries included in the "libraries" option are searched for at least one matching function signature. If no signature is found, compilation will fail, explaining which symbols are missing. However, unlike when compiling with static libraries, or header-only libraries, this compilation does not bind the function call to the function implementation. At pipeline creation time, libraries are linked and loaded, and a function implementation is chosen. We'll go over the case where multiple dynamic libraries export the same function in just a moment.
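The compile-time linking step above might look like this sketch. The kernel source, the entry point name "postProcess", and the already-created dylib are assumptions; the kernel source is assumed to declare the utility functions it calls, for example via an included header.

```swift
import Metal

// Sketch: compile a kernel whose source calls functions implemented
// in a dynamic library, then create its pipeline state.
func makeLinkedPipeline(device: MTLDevice,
                        kernelSource: String,
                        dylib: MTLDynamicLibrary) throws -> MTLComputePipelineState {
    let options = MTLCompileOptions()
    // Function signatures are checked now; the implementations are
    // bound later, at pipeline creation time.
    options.libraries = [dylib]
    let library = try device.makeLibrary(source: kernelSource,
                                         options: options)
    guard let kernel = library.makeFunction(name: "postProcess") else {
        fatalError("kernel entry point not found")
    }
    return try device.makeComputePipelineState(function: kernel)
}
```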
In addition to executable Metal Libraries linking dynamic libraries, dynamic libraries can also reference other dynamic libraries. If all these libraries were created from source at runtime, linking Dylib2 and Dylib3 to Dylib1 is as simple as setting the Metal compiler option libraries property on creation. And, to reiterate one more time, dylibs are shared between kernels. And although multiple kernels link the same dylib, only one instance of the dylib exists in memory.
Because linking is deferred to pipeline creation time, we can replace functions, or even full libraries with new implementations by using the insertLibraries property of the ComputePipelineDescriptor. Setting this option is comparable to setting the DYLD_INSERT_LIBRARIES environment variable. At pipeline creation time, the linker will first search through inserted libraries to find imported symbols before looking through the kernel's linked libraries for any remaining imported symbols.
In this example, DylibA exports the function foo. And when we create the compute pipeline state, foo will be linked to the implementation in DylibA. When we use insertLibraries, both DylibA and DylibD export the function foo. When we create the pipeline state, we walk down the list of imported libraries to resolve the symbol. And instead of linking the implementation of foo in DylibA, we will instead link the implementation of foo from DylibD.
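The override just described might be sketched like this; the kernel function and the overriding dylib are assumptions passed in by the caller.

```swift
import Metal

// Sketch: `kernel` comes from a metallib linked against a dylib that
// exports foo; `overridingDylib` exports a competing foo. Inserted
// libraries are searched first, so its foo wins at link time.
func makeOverriddenPipeline(device: MTLDevice,
                            kernel: MTLFunction,
                            overridingDylib: MTLDynamicLibrary) throws -> MTLComputePipelineState {
    let descriptor = MTLComputePipelineDescriptor()
    descriptor.computeFunction = kernel
    // Comparable to the DYLD_INSERT_LIBRARIES environment variable.
    descriptor.insertLibraries = [overridingDylib]
    return try device.makeComputePipelineState(descriptor: descriptor,
                                               options: [],
                                               reflection: nil)
}
```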
Finally, let's discuss distributing your dynamic libraries. Much like Binary Archives, compiled dynamic libraries can be serialized out to URL. Both the precompiled binary and the generic AIR for the Metal Library are serialized. If you end up distributing the dynamic library as an asset in your project, the Metal framework will recompile the AIR slice into machine code if the target device cannot use the precompiled binary. This would occur when loading your dylib on a different architecture or OS.
This compilation is not added to the Metal Shader Cache, so make sure to serialize and load your library next time to save compilation time. To help you adopt this API and to work through a small example, we've provided a sample Xcode project available on developer.apple.com. This sample uses a compute shader to apply a full-screen postprocessing effect. The compute shader calls into a dynamic library to determine the pixel color. And at runtime, we demonstrate how insertLibraries can be used to change which function it's linked against. If you're interested in running this code yourself, head to developer.apple.com and download the Metal Dynamic Library sample.
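The serialization round trip might look like this sketch; the file location is an assumption.

```swift
import Metal

// Sketch: serialize the compiled dylib (both the precompiled binary
// and the AIR slice are written), then reload it on a later launch.
func persistAndReload(device: MTLDevice,
                      dylib: MTLDynamicLibrary,
                      url: URL) throws -> MTLDynamicLibrary {
    try dylib.serialize(to: url)                    // ship or cache this file
    return try device.makeDynamicLibrary(url: url)  // reuse on the next run
}
```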
We're really excited to bring you these features. Dynamic libraries allow you to write reusable library code without paying the cost in time or memory of recompiling your utilities. And, like Binary Archives, dynamic libraries are serializable and shippable. In the sample code, we've demonstrated how you can use dynamic libraries to allow users to write their own methods without requiring users to write their own kernel entry points.
This feature is supported this year in iOS and macOS. Check the feature query on your device to see if your GPU supports dynamic libraries. So far, we've talked about some of the ways we're updating our shader model in Metal this year. In the final part of our talk, we'll be discussing additional updates to our off-line toolchains. To help me cover this topic, I'm going to hand you over to my colleague, Ravi.
Thank you, Kyle, and hello everyone. In the previous sections, we heard about how to create and use Binary Archives and dynamic libraries using the Metal API. I'm Ravi Ramaseshan from the Metal Front-End Compiler Team, and in this section I'm going to talk to you about how to create and manipulate these objects using the Metal Developer Tools.
With a small code base, you can put all your code into a couple of Metal files and build a metallib using a command line like the one shown below. As your code base grows, you keep adding files to the same command line. But it becomes hard to track dependencies between all your shader sources.
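A single-step build like the one described might look like this, with illustrative file names:

```shell
# Compile a couple of Metal sources straight to an executable
# metallib in one command. File names are illustrative.
metal Shaders.metal Lighting.metal -o Shaders.metallib
```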
To address this, we are bringing libraries to Metal. Libraries come in three flavors. The kind of metallibs that you've been building up until now, the ones you use to create your Metal functions are what we call executable metallibs. For your non-entry point or utility code, you can now create static or dynamic libraries. Along with Metal libraries, we are also bringing more tools to Metal which mimic the CPU toolchain. All these tools together form the Metal Developer Tools which can be found in your Xcode toolchain. We'll see how to use these tools to improve your shader compilation workflows.
To get started, let's use the Metal compiler on our Metal sources to get the corresponding AIR files using the command line below. With that out of the way... let's use a new tool in the toolchain, metal-libtool, which like its CPU counterpart is used to build libraries. The static option archives all the AIR files together to build a static library.
We'll then run the linker through the Metal compiler to link the AIR files with the static library to create our executable metallibs. The lowercase "L" option followed by a name is how you get the linker to link against your library. You can also use the uppercase "L" option to get the linker to search directories in addition to the default system paths.
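Put together, the static-library workflow above might look like the following sketch; all file and library names are illustrative.

```shell
# 1. Front-end compile each source to AIR.
metal -c Utility.metal -o Utility.air
metal -c Shaders.metal -o Shaders.air

# 2. Archive the utility AIR into a static library with metal-libtool.
metal-libtool -static Utility.air -o libUtility.a

# 3. Link the shader AIR against the static library to produce an
#    executable metallib. -L adds a search directory; -l names the
#    library to link.
metal Shaders.air -L. -lUtility -o Shaders.metallib
```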
The way to think about static linking is that each of your executable metallibs has a copy of the static library. This has a few implications. On the bright side, your metallibs are self-contained and easy to deploy since they have no runtime dependencies. Also, the compiler and linker have access to the concrete implementations of your library, so they can perform link-time optimizations, resulting in potentially smaller and faster code.
On the downside, you may be duplicating the library into each of your metallibs, resulting in a larger app bundle. Fortunately, dynamic linking is a powerful mechanism to address this problem. To create dynamic libraries, we'll invoke the linker by using the dynamiclib option. The install_name option to the linker is the toolchain counterpart of the Metal API you saw earlier. The install name is recorded into the metallib for the loader to find the library at runtime.
Now, we'll link the utility library with our AIR files to get our executable metallibs. With dynamic linking, the utility library does not get copied into the executable metallibs, but instead has to be deployed on the target system separately. Let's see how the loader finds the dynamic library at runtime.
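The dynamic-library build and link steps might be sketched as follows, again with illustrative names:

```shell
# Build the dynamic library; the install name is recorded into the
# metallib so the loader can find it at runtime.
metal -dynamiclib Utility.air \
      -install_name @executable_path/libUtility.metallib \
      -o libUtility.metallib

# Link the executable metallib against the dylib. Only a load command
# referencing the install name is recorded, not a copy of the code.
metal Shaders.air -L. -lUtility -o Shaders.metallib
```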
It starts when you build the library. The linker uses the install name of the dylib to embed a load command into the resulting metallib. Think of it as a reference to the dylib that this metallib depends on. The load command is how the loader locates and loads the dylib at runtime.
You can have multiple of these load commands if you link with more than one library. Finally, let's revisit the install_name option we used when building our dylib and see how a couple of special names work. Let's assume that libUtility depends on another dylib and focus on how the loader resolves these special names in the load command for libUtility.
At runtime, the loader finds the library to be loaded using the install name, but replaces @loader_path with the path of the metallib containing the load command. Metal also supports @executable_path which the loader resolves to the path of the executable metallib containing the entry point function. You can probably see that for executable metallibs both these special names resolve to the same location.
Through load commands, each of your metallibs record only a reference to the dylib. Binding an implementation of the symbol to its reference is done at runtime by the loader. This, too, has some implications. On the positive side, using dynamic libraries solves the duplication problem we saw with static libraries. The downside is that at runtime, the dylib needs to exist for your executable metallibs to work and the loader must be able to find it.
Since libraries can be written by multiple authors, you run the risk of name collisions between the libraries. Like in this example, the two libraries unintentionally export the same symbol calculate. The expected behavior was for each library to use its own calculate function. In fact, with static linking, you would've gotten an error at build time. However, with dynamic linking, the loader just picks one definition and binds all references to it.
Because of this, you may only observe an incorrect result at runtime, which can be quite hard to track down. So, why does the calculate function participate in dynamic linking? That's because, just like with the CPU compiler and linker, by default, Metal exports all the symbols in your library.
You can quickly check the symbols in your library using metal-nm, which, like its CPU counterpart, lets you inspect the names of the symbols that are exported by your metallib. The question then is, how do we control which symbols are exported by our library? Just like on the CPU side, you can use the static keyword, anonymous namespaces, and the visibility attributes to control which symbols are exported by your library. It's also a good idea to use namespaces when defining your interfaces. And finally, we support the exported_symbols_list linker option. For more details, there's some great documentation on our developer website on dynamic libraries.
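A sketch of inspecting and restricting exports; the file names, the contents of exports.txt, and the exact spelling for passing the export list are assumptions.

```shell
# Inspect which symbols the library exports.
metal-nm libUtility.metallib

# Rebuild the dylib exporting only the symbols listed in exports.txt,
# instead of the default export-everything behavior.
metal -dynamiclib Utility.air \
      -exported_symbols_list exports.txt \
      -install_name @executable_path/libUtility.metallib \
      -o libUtility.metallib
```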
The other exciting concept that we introduced to you earlier was harvesting fully compiled binaries from the device. We have harvested such a metallib for libUtility on an A13 device using the Metal API that we saw in the previous section. To work with such objects, there's a new tool called metal-lipo.
Let's use it to peek into what's in the metallib that we just harvested using the info option. The tool reports that this metallib contains two architectures. The way to think about that is it is a fat binary that really contains two independent metallibs called "slices." The A13 slice, which is back-end compiled and the generic AIR slice.
When the harvested metallib is deployed on non-A13 devices, Metal will use the AIR slice, just as it does today. That means spinning up the Metal Compiler Service and invoking the back-end compiler to build your pipeline. This allows the metallib to be used on all iOS devices, not just A13-based ones. However, if the same metallib is downloaded on an A13 device, Metal will use the A13 slice, skip back-end compilation completely and potentially improve your app-loading performance.
Now, let's say you want to improve the experience of your app users that are also on A12 devices. Besides the A13 device, let's also harvest a metallib from an A12 device. To simplify your app deployment, you might want to bundle all the slices into a single metallib. Metal-lipo allows you to do just that and create an even fatter universal binary. This technique can even be used with Binary Archives. Obviously, the more slices that you pack into your binary, the larger your app bundle becomes. So keep that in mind when deciding which slices you want to pack into your universal binary.
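Working with slices might look like this sketch, mirroring the CPU lipo tool's syntax (an assumption), with illustrative file names:

```shell
# Report the architectures (slices) inside a harvested metallib.
metal-lipo -info libUtility_A13.metallib

# Combine per-GPU harvested metallibs into one universal binary that
# also retains the generic AIR slice for other devices.
metal-lipo libUtility_A13.metallib libUtility_A12.metallib \
           -create -output libUtility_universal.metallib
```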
So, let's do a quick recap of the different workflows we have seen today. We started by using the Metal compiler to turn our Metal sources into AIR files. We then used metal-libtool to create static libraries, a new workflow to replace your existing Metal AR-based one. We then built a new kind of metallib, a dynamic library. We also saw how to combine AIR, static and dynamic libraries to create executable metallibs.
Along the way, we also saw using metal-nm to inspect the symbols exported by a metallib. And finally, we used metal-lipo to work with slices in our harvested metallibs. The last thing we want to show you is a use case shared with us by some of our game developers. Here's a high-level view of their workflow. As you can see, they use a variety of tools, including the CPU and GPU toolchains, to build the app bundle.
In some cases, developers have pooled their machines together into a server farm that they use to build their assets. This workflow works great as long as these tools run on macOS. However, some of these developers have established game and graphic asset creation pipelines that are based on Microsoft Windows infrastructure. In order to support these developers, this year we are introducing the Metal Developer Tools on Windows. With this, you will now have the flexibility to build your metallibs targeting Apple platforms from macOS, Windows or even a hybrid setup.
The tools are available as a Windows Installer and can be downloaded from the Apple developer website today. All the workflows that are supported by the toolchain we release with Xcode are also supported by the Windows-hosted tools. That brings us to the end of this session. Let's recap what we covered here. We introduced you to Binary Archives, a mechanism which you can employ to avoid spending time on back-end compilation for some of your critical pipelines.
We then presented dynamic libraries in Metal as an efficient and flexible way to decouple your library code from your shaders. And finally, we went over some new and important compilation workflows by directly using the Metal Developer Tools. We hope this presentation gets you started with adopting the new compilation model for your new and existing workflows. Thanks for watching this session, and enjoy the rest of WWDC.