WWDC25 • Session 319

Capture cinematic video in your app

Audio & Video • iOS • 18:27

Discover how the Cinematic Video API enables your app to effortlessly capture cinema-style videos. We’ll cover how to configure a Cinematic capture session and introduce the fundamentals of building a video capture UI. We’ll also explore advanced Cinematic features such as applying a depth of field effect to achieve both tracking and rack focus.

Speaker: Roy Li

Transcript

Introduction

Hi, I'm Roy. I’m an engineer on the Camera Software team. Today, I’m excited to talk about how your apps can easily capture pro-level cinema-style videos with the Cinematic Video API. With iPhone 13 and 13 Pro, we introduced Cinematic mode. With its intuitive user interface and powerful algorithms, it transformed iPhone into a cinematography powerhouse. In this talk, we will have a look at what makes Cinematic video magical and walk through some code together to see how to build a great Cinematic capture experience.

Cinematic video

So, what is Cinematic video? At its heart are classic storytelling tools like rack focus and tracking focus. With a shallow depth of field, the director guides viewers’ attention to the key subjects in the scene, enhancing narrative impact. When subjects move, as they often do in films, tracking focus keeps them sharply in view.

Though powerful, in the real world, these focus techniques require a great deal of expertise, which is why on a movie set, there are focus pullers whose main responsibility is to carry out these powerful but challenging shots. Cinematic video drastically simplifies this by intelligently driving focus decisions. For example, when a subject enters the frame, the algorithm automatically racks the focus to them and starts tracking. When a subject looks away, the focus automatically transitions to another point, returning to the subject when appropriate.

This year, we're making these amazing capabilities available as the Cinematic Video API, so your apps can easily capture these amazing cinema-style videos. Let’s explore how we can build a great capture experience for Cinematic videos using the new API. Let’s start with a typical capture session for a video app.

Firstly, we select the device from which we want to capture movies. Then we add it to a device input. Depending on the use case, multiple outputs can be added to the session. Connection objects will be created when these outputs are added to the capture session. This is not a trivial setup, but enabling Cinematic video capture is really easy. In iOS 26, we're adding a new property, isCinematicVideoCaptureEnabled, on the AVCaptureDeviceInput class. By setting it to true, we configure the whole capture session to output Cinematic video, and each of the outputs will now receive the Cinematic treatment.
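As a minimal sketch of this step (device and format selection are covered next; assume `videoDevice` already supports Cinematic video capture):

```swift
import AVFoundation

// Minimal sketch: configure a capture session for Cinematic video capture.
// Assumes `videoDevice` is a camera whose active format supports Cinematic video.
func makeCinematicSession(with videoDevice: AVCaptureDevice) throws -> AVCaptureSession {
    let session = AVCaptureSession()
    session.beginConfiguration()

    let videoInput = try AVCaptureDeviceInput(device: videoDevice)
    if session.canAddInput(videoInput) {
        session.addInput(videoInput)
    }

    // New in iOS 26: this single property opts the whole session into Cinematic video,
    // so every output added later receives the Cinematic treatment.
    videoInput.isCinematicVideoCaptureEnabled = true

    session.commitConfiguration()
    return session
}
```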

The movie file produced by the movie file output will be Cinematic. It contains the disparity data, metadata, and the original video that enables non-destructive editing. To play it back with the bokeh rendered or edit the bokeh effect, you can use the Cinematic Framework we introduced in 2023. To learn more about this framework, please check out the WWDC23 session Support Cinematic mode videos in your app. The video data output will produce frames with a shallow depth of field effect baked in. This is useful when you need direct access to the frames, such as when sending them to a remote device.

Similarly, the preview layer will have the bokeh rendered into it in real time. It's an easy way to build a viewfinder. With this high-level architecture in mind, let’s walk through some code in the following areas.

Build a great cinematic capture experience

We will configure an AVCaptureSession with all its components required for Cinematic capture.

Then we build an interface for video capture using SwiftUI. We will walk through how to get metadata like face detections and how to draw them on the screen. With different ways to manually drive focus, we tap into the full power of Cinematic video. And we finish off with some advanced features to make our app more polished.

Let’s get started with the capture session. First, let’s find the video device from which we want to capture the movie. To find the device, we create an AVCaptureDevice.DiscoverySession object. Cinematic video is supported on both the Dual Wide camera in the back and the TrueDepth camera in the front. In this case, we specify .builtInDualWideCamera in the array of device types. Since we’re shooting a movie, we use .video as the mediaType. And we request the camera in the back of the device.

As we’re only requesting a single device type, we can just get the first element in the discovery session's devices array. In order to enable Cinematic video capture, a format that supports this feature must be used. To find such formats, we can iterate through all the device’s formats and use the one whose isCinematicVideoCaptureSupported property returns true. Here are all the supported formats.
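Here is a sketch of that discovery and format selection, using the isCinematicVideoCaptureSupported property named above:

```swift
import AVFoundation

// Find the back Dual Wide camera and a format that supports Cinematic video capture.
// isCinematicVideoCaptureSupported is the iOS 26 format property named above.
let discovery = AVCaptureDevice.DiscoverySession(
    deviceTypes: [.builtInDualWideCamera],
    mediaType: .video,
    position: .back
)

guard let camera = discovery.devices.first else {
    fatalError("No Dual Wide camera available on this device")
}

// Pick a format that supports Cinematic video capture (1080p or 4K at 30 fps).
if let cinematicFormat = camera.formats.first(where: { $0.isCinematicVideoCaptureSupported }) {
    do {
        try camera.lockForConfiguration()
        camera.activeFormat = cinematicFormat
        camera.unlockForConfiguration()
    } catch {
        print("Could not lock camera for configuration: \(error)")
    }
}
```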

For both the Dual Wide and TrueDepth cameras, both 1080p and 4K are supported at 30 frames per second. If you’re interested in recording SDR or EDR content, you can use either 420 video range or full range. If you prefer 10-bit HDR video content, use x420 instead. Since we’re not making a silent film, we want sound as well. We will use the same DiscoverySession API to find the microphone.

With our devices in hand, we create the inputs for each one of them. Then we add these inputs to the capture session. At this point, we can turn on Cinematic video capture on the video input. To enhance the Cinematic experience, we can capture spatial audio by simply setting first order ambisonics as the multichannelAudioMode. To learn more about spatial audio, please check out this year's session, “Enhance your app’s audio content creation capabilities.” Moving on to the outputs, we create an AVCaptureMovieFileOutput object and add it to the session.
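A sketch of the audio and movie-output side of that setup; the .microphone device type and the .firstOrderAmbisonics case spelling are assumptions about the exact names, and `session` and `videoInput` come from the earlier setup:

```swift
import AVFoundation

// Add the microphone, enable spatial audio, and add the movie file output.
let micDiscovery = AVCaptureDevice.DiscoverySession(
    deviceTypes: [.microphone],
    mediaType: .audio,
    position: .unspecified
)

if let microphone = micDiscovery.devices.first,
   let audioInput = try? AVCaptureDeviceInput(device: microphone),
   session.canAddInput(audioInput) {
    session.addInput(audioInput)

    // Spatial audio: first order ambisonics (assumed case name).
    audioInput.multichannelAudioMode = .firstOrderAmbisonics
}

let movieFileOutput = AVCaptureMovieFileOutput()
if session.canAddOutput(movieFileOutput) {
    session.addOutput(movieFileOutput)
}
```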

Our hands are never as steady as a tripod, so we recommend enabling video stabilization. To do so, we first find the video connection of the movieFileOutput and set its preferredVideoStabilizationMode. In this case, we use cinematicExtendedEnhanced. Lastly, we need to associate our preview layer with the capture session. We’re done with the capture session for now. Let's move on to the user interface.
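In code, the stabilization and preview association might look like this:

```swift
// Enable extended cinematic video stabilization on the movie output's video connection,
// then attach the preview layer to the session.
if let videoConnection = movieFileOutput.connection(with: .video),
   videoConnection.isVideoStabilizationSupported {
    videoConnection.preferredVideoStabilizationMode = .cinematicExtendedEnhanced
}

let previewLayer = AVCaptureVideoPreviewLayer()
previewLayer.session = session
```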

Since AVCaptureVideoPreviewLayer is a subclass of CALayer, which is not part of SwiftUI, to make them interoperate, we need to wrap the preview layer into a struct that conforms to the UIViewRepresentable protocol. Within this struct, we make a UIView subclass CameraPreviewUIView. We override its layerClass property to make the previewLayer the backing layer for the view. And we make a previewLayer property to make it easily accessible as an AVCaptureVideoPreviewLayer type.
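A sketch of that wrapper:

```swift
import SwiftUI
import UIKit
import AVFoundation

// UIView subclass whose backing layer is the preview layer.
final class CameraPreviewUIView: UIView {
    override class var layerClass: AnyClass { AVCaptureVideoPreviewLayer.self }

    // Convenience accessor typed as AVCaptureVideoPreviewLayer.
    var previewLayer: AVCaptureVideoPreviewLayer { layer as! AVCaptureVideoPreviewLayer }
}

// SwiftUI wrapper that exposes the preview as a regular View.
struct CameraPreviewView: UIViewRepresentable {
    let session: AVCaptureSession

    func makeUIView(context: Context) -> CameraPreviewUIView {
        let view = CameraPreviewUIView()
        view.previewLayer.session = session
        return view
    }

    func updateUIView(_ uiView: CameraPreviewUIView, context: Context) {}
}
```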

We can then put our preview view into a ZStack, where it can be easily composed with other UI elements like camera controls. As mentioned in the intro, shallow depth of field is an important tool for storytelling. By changing the simulatedAperture property on the device input, we can adjust the global strength of the bokeh effect. Displayed on the right, driving this property with a slider, we change the global strength of the blur.

This value is expressed in industry-standard f-stops, which is simply the ratio between the focal length and the aperture diameter. Rearranging the terms, the aperture diameter is the focal length divided by the f-number. Therefore, the smaller the f-number, the larger the aperture, and the stronger the bokeh will be. We can find the minimum, maximum, and default simulated aperture on the format. We use them to populate the appropriate UI elements, like a slider.
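A sketch of a slider driving simulatedAperture; the min/max/default property names on the format are assumptions based on the description above:

```swift
import SwiftUI
import AVFoundation

// Drive the global bokeh strength by setting simulatedAperture on the device input.
struct ApertureSlider: View {
    let videoInput: AVCaptureDeviceInput
    @State private var fNumber: Float

    init(videoInput: AVCaptureDeviceInput) {
        self.videoInput = videoInput
        // Assumed property name for the default simulated aperture on the format.
        _fNumber = State(initialValue: videoInput.device.activeFormat.defaultSimulatedAperture)
    }

    var body: some View {
        let format = videoInput.device.activeFormat
        // Assumed property names for the supported aperture range on the format.
        Slider(value: $fNumber, in: format.minSimulatedAperture...format.maxSimulatedAperture)
            .onChange(of: fNumber) { _, newValue in
                // A smaller f-number means a larger aperture and stronger bokeh.
                videoInput.simulatedAperture = newValue
            }
    }
}
```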

Now, let’s build some affordances that allow the user to manually interact with Cinematic video. For users to manually drive focus, we need to show visual indicators for focus candidates like faces. And to do that, we need some detection metadata. We will use an AVCaptureMetadataOutput to get these detections so we can draw their bounds on the screen for users to interact with. The Cinematic video algorithm requires certain metadataObjectTypes to work optimally.

And they are communicated with the new property requiredMetadataObjectTypesForCinematicVideoCapture. An exception is thrown if the metadataObjectTypes provided differ from this list when Cinematic video is enabled. Lastly, we need to provide a delegate to receive the metadata and a queue on which the delegate is called. We receive metadata objects in the metadata output delegate callback. To easily communicate this metadata to our view layer in SwiftUI, we use an observable class.
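A sketch of that metadata pipeline; `cameraModel` and `sessionQueue` are hypothetical names for the delegate object and its callback queue:

```swift
import AVFoundation
import Observation

// Observable model that republishes metadata detections to the SwiftUI view layer;
// observing views redraw automatically when metadataObjects changes.
@Observable
final class CameraModel: NSObject, AVCaptureMetadataOutputObjectsDelegate {
    var metadataObjects: [AVMetadataObject] = []

    func metadataOutput(_ output: AVCaptureMetadataOutput,
                        didOutput metadataObjects: [AVMetadataObject],
                        from connection: AVCaptureConnection) {
        Task { @MainActor in self.metadataObjects = metadataObjects }
    }
}

// Capture-session side. When Cinematic video is enabled, the metadata object types
// must match requiredMetadataObjectTypesForCinematicVideoCapture, or an exception is thrown.
let metadataOutput = AVCaptureMetadataOutput()
if session.canAddOutput(metadataOutput) {
    session.addOutput(metadataOutput)
}
metadataOutput.metadataObjectTypes =
    metadataOutput.requiredMetadataObjectTypesForCinematicVideoCapture
metadataOutput.setMetadataObjectsDelegate(cameraModel, queue: sessionQueue)
```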

When we update its property, the observing view will automatically refresh. In our view layer, whenever our observable object is updated, the view is automatically redrawn. And we draw a rectangle for each metadataObject. When creating these rectangles, it’s important that we transform the metadata’s bounds into the preview layer’s coordinate space using the layerRectConverted(fromMetadataOutputRect:) method.

Note that X and Y in the position method refer to the center of the view, instead of the upper left corner used by AVFoundation. So we need to adjust accordingly by using the midX and midY of the rect. With metadata rectangles drawn on the screen, we can use them to manually drive focus.
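A sketch of that conversion and drawing step:

```swift
import SwiftUI
import AVFoundation

// Convert a metadata object's bounds into the preview layer's coordinate space and
// position a SwiftUI rectangle over it. position(x:y:) places the view's center,
// so we use midX/midY rather than the AVFoundation upper-left origin.
func detectionRectangle(for object: AVMetadataObject,
                        in previewLayer: AVCaptureVideoPreviewLayer) -> some View {
    let rect = previewLayer.layerRectConverted(fromMetadataOutputRect: object.bounds)
    return Rectangle()
        .stroke(.yellow, style: StrokeStyle(lineWidth: 2, dash: [5]))
        .frame(width: rect.width, height: rect.height)
        .position(x: rect.midX, y: rect.midY)
}
```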

The Cinematic Video API offers three ways to manually focus. Let's now walk through them one by one. The setCinematicVideoTrackingFocus(detectedObjectID:focusMode:) method can be used to rack focus to a particular subject identified by the detectedObjectID, which is available on the AVMetadataObject instances that you get from the metadata output. The focusMode parameter configures Cinematic video’s tracking behavior. The CinematicVideoFocusMode enum has three cases: none, strong, and weak. Strong tells Cinematic video to keep tracking a subject even when there are focus candidates that would have otherwise been automatically selected.

In this case, although the cat became more prominent in the frame, the strong focus, as indicated by the solid yellow rectangle, stayed locked on the subject in the back. Weak focus, on the other hand, lets the algorithm retain focus control. It automatically racks the focus when it sees fit.

In this case, as the cat turned around, he was considered more important, and the weak focus shifted automatically to him, as indicated by the dashed rectangle. The none case is only useful when determining whether a metadata object currently has the focus, so it should not be used when setting the focus.

The second focus method takes a different first parameter. Instead of a detected object ID, it takes a point in a view. It tells Cinematic video to look for any interesting object at the specified point. When it finds one, it will create a new metadata object with the type salient object. So we can draw the rectangle around it on the screen.

The third focus method is setCinematicVideoFixedFocus that takes a point and the focus mode. It sets the focus at a fixed distance which is computed internally using signals such as depth. Paired with a strong focus mode, this method effectively locks the focus at a particular plane in the scene, ignoring other activities even in the foreground.
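Putting the three calls side by side; the method names come from above, while the exact parameter labels, the objectID property, and any required configuration locking are assumptions, and error handling is omitted:

```swift
import AVFoundation

// 1. Rack focus to a detected subject by its ID; .weak lets the algorithm keep
//    racking focus automatically when it sees fit.
func rackFocus(to object: AVMetadataObject, device: AVCaptureDevice) {
    device.setCinematicVideoTrackingFocus(detectedObjectID: object.objectID, focusMode: .weak)
}

// 2. Ask Cinematic video to look for a salient object at a point in the view
//    (a new salient-object metadata object is produced if one is found).
func focusOnSalientObject(at point: CGPoint, device: AVCaptureDevice) {
    device.setCinematicVideoTrackingFocus(at: point, focusMode: .weak)
}

// 3. Lock focus at a fixed distance computed for this point; paired with .strong,
//    it stays at that plane regardless of other activity in the scene.
func lockFocus(at point: CGPoint, device: AVCaptureDevice) {
    device.setCinematicVideoFixedFocus(at: point, focusMode: .strong)
}
```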

Any app can implement the focus logic that makes sense for its use case. In our app, we do the following: tapping on a detection rectangle that is not in focus, we rack the focus to it with a weak focus. With this, we can switch the focus back and forth between subjects in and out of focus.

Tapping on a metadata object already being weakly focused on turns it into a strong focus, indicated by the solid yellow rectangle. Tapping at a point where there are no existing detections, we want Cinematic video to try to find any salient object and weakly focus on that. With a long press, we set a strong fixed focus. Here is how we can implement this logic in code. Firstly, we need to make two gestures. The regular tap gesture can be easily done with a SpatialTapGesture, which provides the tap location that we need to set focus.

When tapped, we call the focusTap method on our camera model object, where we have access to the underlying AVCaptureDevice. Long press, on the other hand, is a bit more complicated, as the built-in longPressGesture doesn’t provide the tap location we need, so we simulate a long press with a DragGesture.

When pressed, we start a 0.3-second timer. When it fires, we call the focusLongPress method on the camera model. Then we create a rectangle view to receive the gestures. It’s inserted at the end of the ZStack, which puts it on top of all the detection rectangles so the user’s gesture input is not blocked.
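A sketch of that gesture overlay; the focusTap and focusLongPress methods on the camera model are sketched further below, and the (at:) labels are assumptions:

```swift
import SwiftUI

// Transparent rectangle placed at the end of the ZStack to receive gestures.
// SpatialTapGesture supplies the tap location; a DragGesture plus a 0.3 s timer
// approximates a long press that also knows its location.
struct FocusGestureOverlay: View {
    let camera: CameraModel
    @State private var longPressTimer: Timer?

    var body: some View {
        Rectangle()
            .fill(Color.clear)
            .contentShape(Rectangle())
            .gesture(
                SpatialTapGesture().onEnded { value in
                    camera.focusTap(at: value.location)
                }
            )
            .simultaneousGesture(
                DragGesture(minimumDistance: 0)
                    .onChanged { value in
                        // Start the timer once, when the press begins.
                        if longPressTimer == nil {
                            longPressTimer = Timer.scheduledTimer(withTimeInterval: 0.3,
                                                                  repeats: false) { _ in
                                camera.focusLongPress(at: value.startLocation)
                            }
                        }
                    }
                    .onEnded { _ in
                        // Lifting the finger before the timer fires cancels the long press.
                        longPressTimer?.invalidate()
                        longPressTimer = nil
                    }
            )
    }
}
```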

As we saw in the previous videos, it's important to visually differentiate the focus rectangles for weak focus, strong focus, and no focus to help the user take the right action. We do this by implementing a method that takes an AVMetadataObject and returns a focus rectangle view. Let’s not forget that we need to transform the bounds of the metadata from the metadata output’s coordinate space to that of the preview layer.

By setting different stroke styles and colors, we can easily create visually distinct rectangles for each focus mode. With the point passed from the view layer, we can determine which focus method to use. First, we need to figure out whether the user has tapped on a metadata rectangle. And we do this in the helper method, findTappedMetadataObject.

Here, we iterate through all the metadata that we cache for each frame and check whether the point specified falls into one of their bounds. Again, we make sure the point and the rect are in the same coordinate space. Coming back to the focusTap method, if a metadata object is found and is already in weak focus, then we turn it into a strong focus.

If it’s not already in focus, we focus on it weakly. If the user didn’t tap on a metadata rectangle, then we tell the framework to try to find a salient object at this point. With a long press, we simply set a strong fixed focus at the specified point.
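A sketch of that decision logic inside the camera model. It assumes the model also stores the selected `device` and the `previewLayer`; `metadataObjects` is the per-frame cache published to the view layer, and names not given in the session (objectID, cinematicVideoFocusMode, the at: labels) are assumptions:

```swift
import AVFoundation

extension CameraModel {
    private func findTappedMetadataObject(at point: CGPoint) -> AVMetadataObject? {
        // Convert each object's bounds into the preview layer's coordinate space
        // before hit-testing against the tapped point.
        metadataObjects.first { object in
            previewLayer.layerRectConverted(fromMetadataOutputRect: object.bounds).contains(point)
        }
    }

    func focusTap(at point: CGPoint) {
        if let object = findTappedMetadataObject(at: point) {
            if object.cinematicVideoFocusMode == .weak {
                // Already weakly focused: promote to a strong focus.
                device.setCinematicVideoTrackingFocus(detectedObjectID: object.objectID,
                                                      focusMode: .strong)
            } else {
                // Not in focus yet: rack focus to it weakly.
                device.setCinematicVideoTrackingFocus(detectedObjectID: object.objectID,
                                                      focusMode: .weak)
            }
        } else {
            // No detection under the tap: ask the framework for a salient object here.
            device.setCinematicVideoTrackingFocus(at: point, focusMode: .weak)
        }
    }

    func focusLongPress(at point: CGPoint) {
        // Long press: strong fixed focus at the specified point.
        device.setCinematicVideoFixedFocus(at: point, focusMode: .strong)
    }
}
```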

At this point, we have a fully functional app that can capture Cinematic video. Let’s polish it up with a few more details. Currently, our video capture graph looks like this. We have three outputs to capture the movie, receive metadata, and the preview. If we want to support still image capture during the recording, we can do so by simply adding an AVCapturePhotoOutput to the session.

Since our graph is already configured to be Cinematic, the photo output will get the Cinematic treatment automatically. The image returned by the photo output will have the bokeh effect burned in. Lastly, the Cinematic video algorithm requires a sufficient amount of light to function properly. So when the room is too dark or the camera is covered, we want to inform users of the problem in the UI. In order to be notified when this condition occurs, you can key-value observe the new property cinematicVideoCaptureSceneMonitoringStatuses on the AVCaptureDevice class. Currently, the only supported status for Cinematic video is "not enough light".
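A sketch of that observation; the `.notEnoughLight` status spelling is an assumption based on the description above:

```swift
import AVFoundation

// Observe the new scene-monitoring property named above and surface a low-light warning.
final class SceneStatusObserver {
    private var observation: NSKeyValueObservation?
    var onLowLightChange: ((Bool) -> Void)?

    func startObserving(_ device: AVCaptureDevice) {
        observation = device.observe(\.cinematicVideoCaptureSceneMonitoringStatuses,
                                     options: [.new]) { [weak self] _, change in
            let statuses = change.newValue ?? []
            // An empty set means everything is back to normal.
            self?.onLowLightChange?(statuses.contains(.notEnoughLight))
        }
    }
}
```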

In the KVO handler, we can update the UI properly when we see insufficient light. An empty set means that everything is back to normal. In today’s talk, we had a recap on how Cinematic video enables our users to capture gorgeous pro-level movies, even for everyday moments like hanging out with their pets. And we had a detailed walkthrough on how to build a great Cinematic capture experience with the Cinematic Video API. We can’t wait to see how your apps can tap into these capabilities to deliver richer, more cinematic content. Thank you for watching.