Building AI Infrastructure for a Genre That Doesn’t Have Any

June 17, 2026

Hao-Wen Dong, Assistant Professor of Music, School of Music, Theatre & Dance

Hao-Wen Dong’s work began as a course final project and grew into something considerably more ambitious: an attempt to build, almost from nothing, the AI infrastructure that the a cappella research community needed but never had. There was no large dataset. There were no pre-trained models tuned for mixed vocal recordings. There was no established pipeline for separating soprano from alto from baritone in a live ensemble performance. Everything had to be constructed.

The dataset problem was both the central constraint and the first opportunity. Two members of the research team were professional a cappella singers, a coincidence that became the project’s most valuable resource. The team recorded 2.6 hours of studio-quality a cappella music with all stems separated: each vocal part (soprano, alto, tenor, baritone, bass, and vocal percussion) was recorded individually before being mixed. By the standards of most music research datasets this is extremely small; by the standards of a cappella research it is one of the largest ever assembled. The recordings span five languages (Mandarin, English, Hakka, Taiwanese, and Korean), a multilingual scope that anticipates the eventual need for models that generalize beyond any single tradition.

The team’s first attempt at source separation was to fine-tune an existing music separation model, one trained to separate piano, guitar, bass, and drums, on the a cappella dataset. Fine-tuning improved performance, but only partially: the model’s prior knowledge about instruments did not transfer cleanly to the task of separating six vocal tracks of similar timbre. The second attempt trained a model from scratch on the 2.6 hours of data. Surprisingly, this worked reasonably well, suggesting that a cappella source separation is more tractable than the small data volume implied. The breakthrough came from a third approach: using singing voice cloning as a data augmentation strategy. By generating AI-cloned versions of the human recordings that preserved vocal character while varying performance details, the team multiplied the effective size of the training data. The augmented dataset produced a measurably more generalizable model, and the demonstration was persuasive: the AI-cloned recordings were close enough in quality that Dong was comfortable playing them alongside the originals to show the technique working.
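The article does not say how the cloned recordings were combined with the originals, so the following Python sketch is only one plausible illustration of why per-part variants can multiply training data. It assumes each part has a human take plus cloned variants supplied as plain lists of samples; the names PARTS, mix, and training_pairs are hypothetical, and the voice cloning model itself is not shown.

```python
import itertools

# The six parts named in the article; the identifiers are illustrative.
PARTS = ["soprano", "alto", "tenor", "baritone", "bass", "vocal_percussion"]


def mix(stems):
    """Sum equal-length stems sample by sample to form the mixture."""
    return [sum(samples) for samples in zip(*stems.values())]


def training_pairs(versions):
    """Yield (mixture, stems) for every combination of per-part versions.

    `versions` maps each part name to a list of alternative waveforms:
    the human recording first, then any cloned variants (assumed inputs).
    """
    parts = list(versions)
    for combo in itertools.product(*(versions[p] for p in parts)):
        stems = dict(zip(parts, combo))
        # The mixture is the model input; the stems are the separation targets.
        yield mix(stems), stems
```

Under these assumptions, a single cloned variant per part turns one six-part recording into 2^6 = 64 distinct mixtures, each paired with known target stems. Whether the team mixed variants across parts in this way is not stated in the source.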

The rehearsal interface that grew alongside the source separation work went through its own evolution. A formative study with twelve singers, beginners and professionals alike, revealed that the singers’ needs did not map neatly onto what AI can do most easily. They wanted to feel the authentic context of group singing, not practice alone against a metronome. They wanted feedback at the right level of abstraction: real-time pitch deviation data, Dong found, can be counterproductive, pulling attention away from musical intention. They wanted a sense of the big picture: how their individual part fits the arrangement as a whole. And they wanted the tool to support their creativity rather than evaluate it. Those requirements pushed the interface design away from raw AI output (precise pitch-error detection displayed as waveforms) toward higher-level, more interpretable feedback generated with the help of language models. The tension between what the AI can measure precisely and what the musician can actually use remains an open design problem, as Dong candidly acknowledged.
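To make the difference between raw and higher-level feedback concrete, here is a minimal Python sketch that collapses per-note pitch deviations into a single phrase-level remark. It is not the project's interface: the function names and the 20-cent tolerance are assumptions chosen for illustration, and the simple threshold rule stands in for the language-model-generated feedback the article describes. The cents formula itself, 1200 times the base-2 logarithm of the frequency ratio, is standard.

```python
import math
from statistics import median


def cents_off(sung_hz, target_hz):
    """Pitch deviation in cents; positive means sharp, negative means flat."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def phrase_summary(sung_hz, target_hz, tolerance_cents=20.0):
    """Reduce note-by-note deviations to one coarse remark per phrase.

    The tolerance is a hypothetical value, not one taken from the study.
    """
    deviations = [cents_off(s, t) for s, t in zip(sung_hz, target_hz)]
    typical = median(deviations)
    if typical > tolerance_cents:
        return "tends sharp"
    if typical < -tolerance_cents:
        return "tends flat"
    return "in tune"
```

A singer who sees "tends flat" for a phrase gets a cue to act on after singing, rather than a stream of numbers competing for attention during the performance.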