MIRAGE: Building the Infrastructure Before the Science

David Sears, Associate Professor, Music Theory, School of Music, Theatre & Dance

David Sears’s AI journey began on a Sunday afternoon spinning a virtual globe. Radio Garden — a web interface that lets listeners tune into live radio stations anywhere on Earth — confronted him with more musical diversity in thirty minutes than he had seen in years of academic music theory literature: hybrid traditions blending indigenous and Western organizing principles, genres he had never encountered, sounds arriving from places his research community had never thought to study. And it came with an API.

What followed was not a single research project but the construction of research infrastructure — a global database of radio station metadata and musical features designed to make world-music research at scale possible for the first time. Sears identified three goals: build the database, build an accompanying online dashboard for researchers without programming backgrounds, and use the resulting infrastructure to conduct machine listening projects that placed musical diversity at the center rather than treating Western organizing principles as a default.

The data collection was deliberate and tiered. His team selected 10,000 radio stations from around the world and monitored them for three months, collecting metadata for 100 streaming events per station — one million events in total. A team of human annotators reviewed every station across 15 variables, correcting errors in location data, station descriptions, and stream content. The pipeline then attempted to match each event against open-access databases — Wikidata, MusicBrainz — and commercial sources like Spotify to enrich the metadata. The result was 131 metadata variables per event, spanning everything from GPS coordinates and broadcast time zones to artist demographics, instruments, and vocal types.

What to trust was a genuine research design question. Rather than cleaning the data to a single authoritative version, the team attached reliability scores to every field and published the data in reliability quartiles. Crucially, Sears described the decision to leave low-reliability records — particularly for artists from countries underrepresented in Western databases — as intentional: “These are our blind spots.” Researchers could come to that data specifically to study the music that existing databases know nothing about.
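As a rough illustration of what publishing data in reliability quartiles can look like, the sketch below bins records by a per-record reliability score and keeps every record, including the least reliable ones. The field names (`artist`, `reliability`), the 0-to-1 score scale, and the use of inclusive quartile cut points are assumptions made for the example; the source does not describe how the Mirage team computes or bins its scores.

```python
from statistics import quantiles


def assign_quartiles(scores):
    """Label each score with its quartile, 1 (least reliable) to 4 (most reliable).

    Cut points come from the scores themselves, so roughly a quarter of the
    records land in each bin. Nothing is discarded.
    """
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    labels = []
    for score in scores:
        if score <= q1:
            labels.append(1)
        elif score <= q2:
            labels.append(2)
        elif score <= q3:
            labels.append(3)
        else:
            labels.append(4)
    return labels


# Hypothetical records; names and scores are invented for illustration only.
records = [
    {"artist": "A", "reliability": 0.1},
    {"artist": "B", "reliability": 0.2},
    {"artist": "C", "reliability": 0.3},
    {"artist": "D", "reliability": 0.4},
    {"artist": "E", "reliability": 0.5},
    {"artist": "F", "reliability": 0.6},
    {"artist": "G", "reliability": 0.7},
    {"artist": "H", "reliability": 0.8},
]

labels = assign_quartiles([r["reliability"] for r in records])
for record, label in zip(records, labels):
    record["quartile"] = label

# The lowest quartile is kept as its own subset: these are the "blind spots".
blind_spots = [r for r in records if r["quartile"] == 1]
```

A researcher interested in well-documented music would filter to the top quartile, while one studying what existing databases miss would start from `blind_spots`.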

The infrastructure was built with multiple access points from the start. The Mirage Project dashboard offers a globe interface, streaming playback, filtering, and export tools for researchers with no coding background. A Python API client serves those who want to query the data programmatically. And GeoListen — a mobile game in the App Store and on Google Play — turns the listener’s task of guessing a song’s geographic origin into a large-scale behavioral experiment, collecting data from players worldwide about how human beings form geographic associations with unfamiliar music. Every track in the game comes from the Mirage database; every guess becomes research data. The game is science wearing the skin of entertainment, and it works.

Sears was candid about the limitations. Radio Garden overrepresents Europe and underrepresents China, which blocks its stations from streaming on the platform. The newest re-monitoring project, which records 300,000 short audio excerpts from the live streams, corrects the geographic sampling by weighting stations in proportion to population density. The science that Mirage was built to support is now beginning in earnest.
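One way to picture the population-density weighting is the minimal sketch below, which draws stations with probability proportional to a density value attached to each one. The station names, the `density` field, and the choice to weight individual stations (rather than regions) are assumptions for illustration; the source does not specify how the project computes or applies its weights.

```python
import random


def sampling_weights(stations):
    """Return one weight per station, proportional to density and summing to 1."""
    total = sum(s["density"] for s in stations)
    return [s["density"] / total for s in stations]


def draw_stations(stations, k, seed=None):
    """Draw k stations with replacement, favoring high-density locations."""
    rng = random.Random(seed)
    return rng.choices(stations, weights=sampling_weights(stations), k=k)


# Hypothetical stations; the density values are invented for the example.
stations = [
    {"name": "station_a", "density": 100},
    {"name": "station_b", "density": 300},
    {"name": "station_c", "density": 600},
]

weights = sampling_weights(stations)
sample = draw_stations(stations, k=10, seed=0)
```

Under this scheme a station in a location six times as dense is six times as likely to be recorded on any given draw, which pushes the sample away from however the platform happens to be distributed.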