Sonic Stage

Auto-Generating Interactive Spatial Soundscapes from Dialogue Videos for Blind Viewers

HKUST, Columbia University, University of Rochester
CHI 2026 Extended Abstract
Sonic Stage Teaser

Video Presentation

Abstract

We present Sonic Stage, a system that transforms dialogue videos into interactive spatial soundscapes, enabling blind and low-vision (BLV) audiences to intuitively understand characters' actions and movements through immersive auditory cues. Sonic Stage conveys essential visual information during dialogue through three auditory techniques: (1) spatialized dialogue to represent spatial layout, (2) diegetic sound to convey character actions, and (3) interactive descriptions to provide context-specific visual details.

The Accessibility Challenges in Dialogue Videos

Figure 2

In dialogue-heavy scenes, there is little room to insert audio descriptions (AD) between lines. Consequently, blind and low-vision (BLV) viewers often miss crucial visual information, such as characters’ actions, movements, and facial expressions.

Our Solution: Sonic Stage

Figure 3

Sonic Stage conveys essential visual information during dialogue using three auditory techniques: spatialized dialogue, diegetic sound, and interactive descriptions. These techniques enable BLV viewers to perceive on-screen actions within an immersive auditory experience.
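To give a concrete sense of the first technique, spatialized dialogue, here is a minimal sketch of constant-power stereo panning, a standard way to place a voice at an azimuth in the stereo field. This is an illustrative simplification, not the system's actual renderer (which operates in a full 3D soundscape); the function names are ours.

```python
import math

def constant_power_pan(azimuth_deg: float) -> tuple[float, float]:
    """Map a source azimuth (-90 = hard left, +90 = hard right) to
    left/right channel gains using the constant-power panning law,
    so perceived loudness stays roughly even as a character moves."""
    # Normalize the azimuth into a panning angle in [0, pi/2].
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)

def pan_mono_sample(sample: float, azimuth_deg: float) -> tuple[float, float]:
    """Render one mono dialogue sample at the given azimuth."""
    gl, gr = constant_power_pan(azimuth_deg)
    return sample * gl, sample * gr
```

With this law, a centered speaker gets equal gains of about 0.707 per channel, and the summed power (left² + right²) is constant at every azimuth, which avoids the loudness dip of naive linear panning.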

Sonic Stage's Audio Spatialization Pipeline

Figure 4

To create a coherent auditory experience across camera changes, Sonic Stage reconstructs a 3D scene representation from dialogue videos and renders all auditory cues within a shared spatial soundscape. Its pipeline consists of three stages: (A) frame sampling, (B) scene reconstruction, and (C) soundscape optimization.
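One core problem the pipeline must solve is keeping a character's apparent position stable across camera cuts: the same person may appear on the left of one shot and the right of the next. A minimal sketch of that idea, assuming a per-shot camera yaw is available from scene reconstruction (the names and the fixed field-of-view default are our illustrative assumptions, not the paper's implementation):

```python
import math

def screen_x_to_local_azimuth(x_norm: float, hfov_deg: float = 60.0) -> float:
    """Convert a normalized horizontal screen position (0 = left edge,
    1 = right edge) into an azimuth relative to the current camera,
    given that camera's horizontal field of view."""
    return (x_norm - 0.5) * hfov_deg

def to_world_azimuth(x_norm: float,
                     camera_yaw_deg: float,
                     hfov_deg: float = 60.0) -> float:
    """Place the character in a shared scene frame so the rendered
    soundscape stays coherent when the camera cuts to a new angle."""
    az = camera_yaw_deg + screen_x_to_local_azimuth(x_norm, hfov_deg)
    # Wrap into (-180, 180] so azimuths compare cleanly across shots.
    return (az + 180.0) % 360.0 - 180.0
```

Under this model, a character at the right edge of a camera facing 0° and the same character at the left edge of a camera facing 60° both resolve to a similar world azimuth, which is what lets the voice stay put across the cut.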

Technical Evaluation with Diverse Video Types

Figure 5

Sonic Stage’s pipeline achieved 91.9% overall accuracy in character trajectory reconstruction across a diverse video set. It performed well in scenes with distinct backgrounds, even under fast motion and sparse views. The remaining errors mainly arise from two issues: (1) too few full or medium shots for robust spatial reconstruction, and (2) insufficient feature points for multi-view alignment.

User Evaluation with BLV Viewers

Figure 6

In a user study with 12 BLV viewers, Sonic Stage significantly improved video comprehension, spatial presence, and narrative engagement compared to a baseline modeled after SPICA, the state-of-the-art method for accessible video exploration.

Opportunities Across Diverse Video Genres

Figure 7

Sonic Stage’s techniques could be extended to diverse video genres, including sketch comedy, opera, dance, and documentary. We hope this work inspires future research on immersive, interactive audio representations that improve video accessibility for blind audiences.

BibTeX

@inproceedings{SonicStage2026,
  title={Sonic Stage: Automatically Generating an Interactive Spatial Soundscape to Facilitate Dialogue Video Comprehension for Blind and Low Vision Viewers},
  author={Xu, Shuchang and Jin, Xiaofu and Jain, Gaurav and Zhang, Wenshuo and Qu, Huamin and Smith, Brian A. and Yan, Yukang},
  booktitle={Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems},
  year={2026},
  url={https://doi.org/10.1145/3772363.3798425}
}