finn voorhees

January 1st, 2024

Reading and Writing Spatial Video with AVFoundation

[Image: A totally normal human wearing Apple Vision Pro on a plane.]

If you’ve worked with AVFoundation’s APIs, you’ll be familiar with CVPixelBuffer, an object that represents a single video frame. AVFoundation handles reading, writing, and playing video frames for you, but the job gets more complicated with spatial video (aka MV-HEVC), which contains video from two separate angles.

Loading a spatial video into an AVPlayer or AVAssetReader on iOS looks just like loading a standard video. By default, however, the frames you receive contain only one perspective (the “hero” eye), while the alternate angle, still stored in the MV-HEVC file, is never decoded.

With iOS 17.2 and macOS 14.2, Apple introduced new AVFoundation APIs for handling MV-HEVC files. They make it easy to get both angles of a spatial video, but documentation is sparse.¹ Here are a few tips for working with them:

Reading Spatial Video using AVAssetReader

AVAssetReader can read media data faster than real time, which makes it a good fit for reading a video file, applying some transform, and writing the result back out with AVAssetWriter (for example, converting a spatial video to a side-by-side video playable on a Meta Quest / XREAL Air; see the sketch after the steps below). Reading spatial video requires telling VideoToolbox that we want both angles decompressed, not just the “hero” eye.

  1. Create an AVAssetReader:
let asset = AVAsset(url: <path-to-spatial-video>)
let assetReader = try AVAssetReader(asset: asset)
  2. Create an AVAssetReaderTrackOutput, specifying that we want both MV-HEVC video layers decompressed:
let output = try await AVAssetReaderTrackOutput(
    track: asset.loadTracks(withMediaType: .video).first!,
    outputSettings: [
        AVVideoDecompressionPropertiesKey: [
            kVTDecompressionPropertyKey_RequestedMVHEVCVideoLayerIDs: [0, 1] as CFArray,
        ],
    ]
)
assetReader.add(output)
  3. Start copying sample buffers containing both angles:
assetReader.startReading()

while let nextSampleBuffer = output.copyNextSampleBuffer() {
    guard let taggedBuffers = nextSampleBuffer.taggedBuffers else { return }
    
    let leftEyeBuffer = taggedBuffers.first(where: {
        $0.tags.first(matchingCategory: .stereoView) == .stereoView(.leftEye)
    })?.buffer
    let rightEyeBuffer = taggedBuffers.first(where: {
        $0.tags.first(matchingCategory: .stereoView) == .stereoView(.rightEye)
    })?.buffer
    
    if let leftEyeBuffer,
       let rightEyeBuffer,
       case let .pixelBuffer(leftEyePixelBuffer) = leftEyeBuffer,
       case let .pixelBuffer(rightEyePixelBuffer) = rightEyeBuffer {
        // do something cool
    }
}
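
What “do something cool” looks like depends on your use case. As a rough sketch (not from the APIs above), here is one way you might composite the two eye buffers into a single side-by-side frame with Core Image; makeSideBySide is a hypothetical helper, and rendering the result back into a CVPixelBuffer for AVAssetWriter is left out:
import CoreImage
import CoreVideo

// Hypothetical helper: composites the left- and right-eye frames into one
// side-by-side CIImage. Assumes both buffers have the same dimensions.
func makeSideBySide(left: CVPixelBuffer, right: CVPixelBuffer) -> CIImage {
    let leftImage = CIImage(cvPixelBuffer: left)
    let rightImage = CIImage(cvPixelBuffer: right)
    // Shift the right eye so it sits immediately to the right of the left eye.
    let shiftedRight = rightImage.transformed(
        by: CGAffineTransform(translationX: leftImage.extent.width, y: 0)
    )
    return shiftedRight.composited(over: leftImage)
}

From there you could render the combined image into a fresh CVPixelBuffer with a CIContext and append it to a regular (non-spatial) AVAssetWriter input.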

Reading Spatial Video using AVPlayer

When dealing with real-time playback, you’ll often want to use AVPlayer, which manages the playback and timing of a video automatically. AVPlayerVideoOutput is the new API added for reading spatial video in real time, and is straightforward to set up.

  1. Create an AVPlayer:
let asset = AVAsset(url: <path-to-spatial-video>)
let player = AVPlayer(playerItem: AVPlayerItem(asset: asset))
  2. Create an AVPlayerVideoOutput for outputting stereoscopic video:
let outputSpecification = AVVideoOutputSpecification(
    tagCollections: [.stereoscopicForVideoOutput()]
)
let videoOutput = AVPlayerVideoOutput(specification: outputSpecification)
player.videoOutput = videoOutput
  3. Add a periodic time observer for reading frames at a specified interval:
player.addPeriodicTimeObserver(
    forInterval: CMTime(value: 1, timescale: 30),
    queue: .main
) { _ in
    guard let taggedBuffers = videoOutput.taggedBuffers(
        forHostTime: CMClockGetTime(.hostTimeClock)
    )?.taggedBufferGroup else { return }

    let leftEyeBuffer = taggedBuffers.first(where: {
        $0.tags.first(matchingCategory: .stereoView) == .stereoView(.leftEye)
    })?.buffer
    let rightEyeBuffer = taggedBuffers.first(where: {
        $0.tags.first(matchingCategory: .stereoView) == .stereoView(.rightEye)
    })?.buffer
    
    if let leftEyeBuffer,
       let rightEyeBuffer,
       case let .pixelBuffer(leftEyePixelBuffer) = leftEyeBuffer,
       case let .pixelBuffer(rightEyePixelBuffer) = rightEyeBuffer {
        // do something cool
    }
}
  4. Play!
player.play()
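
Once you have both pixel buffers in the time observer callback, you can hand them off to whatever renders your stereo view. As a hedged sketch (not part of the APIs above), here is one way to wrap an eye’s CVPixelBuffer in a Metal texture using a CVMetalTextureCache; it assumes you requested BGRA output in your decompression settings, since decoded HEVC is normally biplanar YCbCr and would need one texture per plane:
import CoreVideo
import Metal

// Sketch: wraps a decoded eye buffer in an MTLTexture for a custom stereo renderer.
// `cache` is assumed to have been created once with CVMetalTextureCacheCreate.
func makeTexture(from pixelBuffer: CVPixelBuffer, cache: CVMetalTextureCache) -> MTLTexture? {
    var cvTexture: CVMetalTexture?
    let status = CVMetalTextureCacheCreateTextureFromImage(
        kCFAllocatorDefault,
        cache,
        pixelBuffer,
        nil,
        .bgra8Unorm, // assumption: match this to your pixel buffer’s format
        CVPixelBufferGetWidth(pixelBuffer),
        CVPixelBufferGetHeight(pixelBuffer),
        0, // plane index; non-zero for planar formats
        &cvTexture
    )
    guard status == kCVReturnSuccess, let cvTexture else { return nil }
    return CVMetalTextureGetTexture(cvTexture)
}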

Writing Spatial Video using AVAssetWriter

Following the advice in Q&A: Building apps for visionOS, the steps for creating a spatial video from separate left-eye and right-eye videos are:

  1. Create an AVAssetWriter:
let assetWriter = try! AVAssetWriter(
    outputURL: <path-to-spatial-video>,
    fileType: .mov
)
  2. Create and add a video input for the spatial video. It is important to specify kVTCompressionPropertyKey_MVHEVCVideoLayerIDs, kCMFormatDescriptionExtension_HorizontalFieldOfView, and kVTCompressionPropertyKey_HorizontalDisparityAdjustment in the compression properties. Without these, your video will not be read as a spatial video on visionOS:
let input = AVAssetWriterInput(
    mediaType: .video,
    outputSettings: [
        AVVideoWidthKey: 1920,
        AVVideoHeightKey: 1080,
        AVVideoCompressionPropertiesKey: [
            kVTCompressionPropertyKey_MVHEVCVideoLayerIDs: [0, 1] as CFArray,
            kCMFormatDescriptionExtension_HorizontalFieldOfView: 90_000, // asset-specific, in thousandths of a degree
            kVTCompressionPropertyKey_HorizontalDisparityAdjustment: 200, // asset-specific
        ],
        AVVideoCodecKey: AVVideoCodecType.hevc,
    ]
)
assetWriter.add(input)
  3. Create an AVAssetWriterInputTaggedPixelBufferGroupAdaptor for the video input:
let adaptor = AVAssetWriterInputTaggedPixelBufferGroupAdaptor(assetWriterInput: input)
  4. Start writing:
assetWriter.startWriting()
assetWriter.startSession(atSourceTime: .zero)
  5. Start appending frames. Each frame consists of two CMTaggedBuffers (a fuller append loop is sketched after these steps):
let left = CMTaggedBuffer(tags: [.stereoView(.leftEye), .videoLayerID(0)], pixelBuffer: leftPixelBuffer)
let right = CMTaggedBuffer(tags: [.stereoView(.rightEye), .videoLayerID(1)], pixelBuffer: rightPixelBuffer)
adaptor.appendTaggedBuffers([left, right], withPresentationTime: <presentation-timestamp>)
  6. Finish writing:
input.markAsFinished()
assetWriter.endSession(atSourceTime: <end-time>)
assetWriter.finishWriting {
    // share assetWriter.outputURL
}
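
Step 5 shows a single append; in practice you’ll usually drive the writer with requestMediaDataWhenReady(on:using:) so frames are only appended while the input can accept them. A rough sketch, assuming a hypothetical nextStereoFrame() that vends left/right pixel buffers and a presentation time (for example, from the AVAssetReader loop earlier):
let queue = DispatchQueue(label: "spatial-writer")
input.requestMediaDataWhenReady(on: queue) {
    while input.isReadyForMoreMediaData {
        // nextStereoFrame() is a hypothetical frame source, e.g. the
        // AVAssetReader loop from the reading section above.
        guard let (leftPixelBuffer, rightPixelBuffer, time) = nextStereoFrame() else {
            input.markAsFinished()
            assetWriter.finishWriting {
                // share assetWriter.outputURL
            }
            return
        }
        let left = CMTaggedBuffer(
            tags: [.stereoView(.leftEye), .videoLayerID(0)],
            pixelBuffer: leftPixelBuffer
        )
        let right = CMTaggedBuffer(
            tags: [.stereoView(.rightEye), .videoLayerID(1)],
            pixelBuffer: rightPixelBuffer
        )
        adaptor.appendTaggedBuffers([left, right], withPresentationTime: time)
    }
}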

Footnotes

  1. Since publishing this article, Apple has added sample code for both reading multiview 3D video files and writing multiview HEVC.