VoucherVision: Scaling the Unscalable

William Weaver, Schmidt AI in Science Fellow, Michigan Institute for Data & AI in Society

William Weaver opened with a number that reframes the problem immediately: 400 million dried plant specimens stored in herbaria around the world. Only 48 million have been digitized. At current rates of manual transcription — 15 to 20 specimens per hour under favorable conditions, one every 15 minutes for difficult historical handwriting — working through the existing backlog could take 30 years.
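The scale of that backlog can be made concrete with a rough calculation. The sketch below uses only the specimen counts and transcription rates quoted above. The 2,000 working hours per person-year is an assumption added here, and the summary does not say what workforce the 30-year estimate presumes, so the output is an order-of-magnitude illustration rather than a reconstruction of Weaver's figure.

```python
# Back-of-envelope estimate of the manual transcription backlog.
TOTAL_SPECIMENS = 400_000_000   # from the talk
DIGITIZED = 48_000_000          # from the talk
BACKLOG = TOTAL_SPECIMENS - DIGITIZED  # 352,000,000 specimens remaining

# ASSUMPTION (not from the talk): one full-time transcriber works 2,000 hours a year.
HOURS_PER_PERSON_YEAR = 2_000

# Specimens per hour. The first two rates are the quoted favorable range;
# "one every 15 minutes" for difficult handwriting is 4 per hour.
RATES_PER_HOUR = {
    "favorable (fast)": 20,
    "favorable (slow)": 15,
    "difficult handwriting": 4,
}


def person_years(specimens: int, per_hour: float) -> float:
    """Person-years of manual transcription needed for `specimens` labels."""
    return specimens / per_hour / HOURS_PER_PERSON_YEAR


for label, rate in RATES_PER_HOUR.items():
    years = person_years(BACKLOG, rate)
    print(f"{label:>22}: {years:>9,.0f} person-years "
          f"({years / 30:,.0f} full-time transcribers to finish in 30 years)")
```

Under these assumptions, even the favorable rates imply thousands of person-years of work, which is why transcription rather than imaging is the limiting step.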

The bottleneck is not imaging. Photographing a herbarium sheet is fast. The bottleneck is transcription: getting the handwritten text from a label mounted on a pressed plant collected in 1886 into a structured, queryable database field. Every herbarium specimen is, in effect, two things: the plant itself, and the archival metadata that gives the plant scientific meaning — who collected it, where, when, in what conditions. Without the metadata, the plant is just a dried curiosity. With it, the specimen becomes a data point for tracking climate change, monitoring invasive species, understanding phenological shifts, or recovering genomic information from the deep past.

VoucherVision is Weaver’s answer to the transcription problem. It is a pipeline that takes an image of a herbarium sheet and uses vision language models to extract the label text and structure it into a database record — automatically, at cloud scale, accessible to any institution regardless of size or resources. The system is hosted on Google Cloud and exposes both a simple web interface for collections with one person managing everything, and a Python package for more sophisticated users. A library of prompt templates allows each institution to specify which fields they need and in what format, accommodating the variation in database schemas across more than 4,000 herbaria worldwide.
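The template-driven design can be sketched in a few lines of Python. This is a minimal illustration of the idea, not VoucherVision's actual package or API: `PromptTemplate`, `build_prompt`, `parse_record`, and `transcribe` are hypothetical names, the field names are illustrative, and the vision language model is represented by any callable that takes a prompt and image bytes and returns JSON text.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptTemplate:
    """Hypothetical per-institution template: field name -> formatting hint."""
    institution: str
    fields: dict[str, str]


def build_prompt(template: PromptTemplate) -> str:
    """Turn a template into instructions for the vision language model."""
    lines = [
        "Transcribe the specimen label in the attached image.",
        "Return a single JSON object with exactly these keys:",
    ]
    lines += [f'- "{name}": {hint}' for name, hint in template.fields.items()]
    lines.append("Use an empty string for any field not present on the label.")
    return "\n".join(lines)


def parse_record(raw: str, template: PromptTemplate) -> dict[str, str]:
    """Keep only the fields the institution asked for; fill gaps with ''."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("model output is not a JSON object")
    record = {}
    for name in template.fields:
        value = data.get(name)
        record[name] = "" if value is None else str(value)
    return record


def transcribe(image: bytes, template: PromptTemplate,
               model: Callable[[str, bytes], str]) -> dict[str, str]:
    """Image in, database-ready record out. `model` stands in for the VLM call."""
    return parse_record(model(build_prompt(template), image), template)


# Usage with a stand-in model, so the sketch runs without any cloud service.
def fake_model(prompt: str, image: bytes) -> str:
    return json.dumps({"recordedBy": "J. Doe", "eventDate": "1886-07-04",
                       "notes": "not requested, so dropped"})


template = PromptTemplate(
    institution="Example Herbarium",
    fields={
        "recordedBy": "collector name as written",
        "eventDate": "collection date in ISO 8601 format",
        "locality": "verbatim place description",
    },
)
print(transcribe(b"", template, fake_model))
```

Keeping the schema in the template rather than in the code is what would let each institution request its own fields and formats without changing the pipeline itself.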

The model choices have evolved rapidly through three years of active benchmarking on actual specimens. What began as an exploration of open-source and locally hosted models has converged on Gemini Vision, which has proven consistently the most accurate for the specific task of reading historical handwritten labels and inferring GPS coordinates from textual place descriptions written before satellites existed. Weaver showed the audience a label from 1886 — cramped, cursive, partly faded — and VoucherVision’s structured output alongside it: collector name, date, taxonomy, location, and a geocoordinate derived entirely from the description on the label. “This did a better job than I would have been able to do.”

The success of the transcription step revealed a new constraint: the editing and quality-control workflow cannot keep pace with the rate at which AI can produce transcriptions. A zero-trust review mode, where every AI-suggested change must be manually accepted, is only 10 to 15% faster than manual transcription from scratch. Weaver’s team has built additional editing interfaces (batch tools, spreadsheet views, automated quality checks) that together bring the productivity improvement to roughly 50%. But the deeper infrastructure problem remains: collection database management systems were not built to track which cells were AI-generated and which were human-verified, and the field is still developing standards for how to flag, trust, and eventually publish AI-assisted data. “The biggest backlog,” Weaver noted, “is actually just our databases can’t handle this sort of data yet.” In the meantime, even unreviewed transcriptions are valuable: institutions can now see what their collections hold, which enables grant applications and research partnerships that were impossible while nobody knew the specimens existed.
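One way to picture the missing database capability is cell-level provenance: every field carries a record of who produced it and whether a person has checked it. The sketch below is an assumption-laden illustration, not a description of any existing collection management system or of VoucherVision's editing tools. `Source`, `Cell`, and the helper functions are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    AI = "ai"
    HUMAN = "human"


@dataclass
class Cell:
    """One database field plus provenance (hypothetical structure)."""
    value: str
    source: Source
    verified: bool = False


def from_ai(record: dict[str, str]) -> dict[str, Cell]:
    """Wrap a raw AI transcription; nothing is trusted yet."""
    return {name: Cell(value, Source.AI) for name, value in record.items()}


def accept(cells: dict[str, Cell], name: str) -> None:
    """Reviewer confirms the AI value as written."""
    cells[name].verified = True


def correct(cells: dict[str, Cell], name: str, new_value: str) -> None:
    """Reviewer overwrites the AI value; the cell becomes human-sourced."""
    cells[name] = Cell(new_value, Source.HUMAN, verified=True)


def unverified(cells: dict[str, Cell]) -> list[str]:
    """Fields still waiting for review."""
    return [name for name, cell in cells.items() if not cell.verified]


def publishable_zero_trust(cells: dict[str, Cell]) -> bool:
    """Zero-trust rule: publish only when every field has been reviewed."""
    return not unverified(cells)


cells = from_ai({"recordedBy": "J. Doe", "eventDate": "1886-07-04"})
accept(cells, "recordedBy")
print(unverified(cells))              # ['eventDate']
correct(cells, "eventDate", "1886-07-14")
print(publishable_zero_trust(cells))  # True
```

Storing the `source` and `verified` flags alongside each value is what would let a database distinguish unreviewed AI output from human-checked data when deciding what to trust or publish.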