Sonic Stage

Auto-Generating Interactive Spatial Soundscapes from Dialogue Videos for Blind Viewers

HKUST, Columbia University, University of Rochester
CHI 2026 Extended Abstract
Sonic Stage Teaser

Video Presentation

Abstract

We present Sonic Stage, a system that transforms dialogue videos into interactive spatial soundscapes, enabling blind and low-vision (BLV) audiences to intuitively understand characters' actions and movements through immersive auditory cues. Sonic Stage conveys essential visual information during dialogue through three auditory techniques: (1) spatialized dialogue to represent spatial layout, (2) diegetic sound to convey character actions, and (3) interactive descriptions to provide context-specific visual details.

The Accessibility Challenges in Dialogue Videos

Figure 2

In dialogue-heavy scenes, there is little room to insert audio descriptions (AD) between lines. Consequently, blind and low-vision (BLV) viewers often miss crucial visual information, such as characters’ actions, movements, and facial expressions.

Our Solution: Sonic Stage

Figure 3

Sonic Stage conveys essential visual information during dialogue using three auditory techniques: spatialized dialogue, diegetic sound, and interactive descriptions. These techniques enable BLV viewers to perceive on-screen actions within an immersive auditory experience.
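To give a concrete sense of the first technique, spatialized dialogue, here is a minimal sketch of constant-power stereo panning, a standard way to place a voice at an azimuth in the stereo field. This is an illustrative simplification, not the system's actual renderer (which operates in a full 3D soundscape); the function names are ours.

```python
import math

def constant_power_pan(azimuth_deg: float) -> tuple[float, float]:
    """Map a source azimuth (-90 = hard left, +90 = hard right) to
    left/right channel gains using the constant-power panning law,
    so perceived loudness stays roughly even as a character moves."""
    # Normalize the azimuth into a panning angle in [0, pi/2].
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)

def pan_mono_sample(sample: float, azimuth_deg: float) -> tuple[float, float]:
    """Render one mono dialogue sample at the given azimuth."""
    gl, gr = constant_power_pan(azimuth_deg)
    return sample * gl, sample * gr
```

With this law, a centered speaker gets equal gains of about 0.707 per channel, and the summed power (left² + right²) is constant at every azimuth, which avoids the loudness dip of naive linear panning.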

Sonic Stage's Audio Spatialization Pipeline

Figure 4

To create a coherent auditory experience across camera changes, Sonic Stage reconstructs a 3D scene representation from dialogue videos and renders all auditory cues within a shared spatial soundscape. Its pipeline consists of three stages: (A) frame sampling, (B) scene reconstruction, and (C) soundscape optimization.
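One core problem the pipeline must solve is keeping a character's apparent position stable across camera cuts: the same person may appear on the left of one shot and the right of the next. A minimal sketch of that idea, assuming a per-shot camera yaw is available from scene reconstruction (the names and the fixed field-of-view default are our illustrative assumptions, not the paper's implementation):

```python
import math

def screen_x_to_local_azimuth(x_norm: float, hfov_deg: float = 60.0) -> float:
    """Convert a normalized horizontal screen position (0 = left edge,
    1 = right edge) into an azimuth relative to the current camera,
    given that camera's horizontal field of view."""
    return (x_norm - 0.5) * hfov_deg

def to_world_azimuth(x_norm: float,
                     camera_yaw_deg: float,
                     hfov_deg: float = 60.0) -> float:
    """Place the character in a shared scene frame so the rendered
    soundscape stays coherent when the camera cuts to a new angle."""
    az = camera_yaw_deg + screen_x_to_local_azimuth(x_norm, hfov_deg)
    # Wrap into (-180, 180] so azimuths compare cleanly across shots.
    return (az + 180.0) % 360.0 - 180.0
```

Under this model, a character at the right edge of a camera facing 0° and the same character at the left edge of a camera facing 60° both resolve to a similar world azimuth, which is what lets the voice stay put across the cut.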

Technical Evaluation with Diverse Video Types

Figure 5

Sonic Stage’s pipeline achieved 91.9% overall accuracy in character trajectory reconstruction across a diverse video set. It performed well in scenes with distinct backgrounds, even under fast motion and sparse views. The remaining errors mainly arise from two issues: (1) too few full or medium shots for robust spatial reconstruction, and (2) insufficient feature points for multi-view alignment.

User Evaluation with BLV Viewers

Figure 6

In a user study with 12 BLV viewers, Sonic Stage significantly improved video comprehension, spatial presence, and narrative engagement compared to a baseline modeled after SPICA, the state-of-the-art method for accessible video exploration.

Opportunities Across Diverse Video Genres

Figure 7

Sonic Stage’s techniques could be extended to diverse video genres, including sketch comedy, opera, dance, and documentary. We hope this work inspires future research on immersive, interactive audio representations that improve video accessibility for blind audiences.

BibTeX

@inproceedings{SonicStage2026,
  title={Sonic Stage: Automatically Generating an Interactive Spatial Soundscape to Facilitate Dialogue Video Comprehension for Blind and Low Vision Viewers},
  author={Xu, Shuchang and Jin, Xiaofu and Jain, Gaurav and Zhang, Wenshuo and Qu, Huamin and Smith, Brian A. and Yan, Yukang},
  booktitle={Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems},
  year={2026},
  url={https://doi.org/10.1145/3772363.3798425}
}