Jingyi Qiu, PhD Student, School of Information
Jingyi Qiu’s project began with a feeling familiar to anyone who reads machine learning papers: the abstract promised a breakthrough; the paper, on closer reading, delivered something considerably more modest. She wanted to know whether that gap — between what papers claim and what they show — could be measured systematically, at scale, across the whole of scientific literature. And she wanted to know whether it was getting worse.
The hardest part of the problem was not finding the hype. It was measuring it without introducing the very distortions she was trying to detect. Her first attempt was direct: ask an LLM to rate how much a paper’s abstract overstated its findings. It failed. The model lacked the domain expertise to judge whether specific claims were substantiated, and its ratings collapsed into a narrow band — mostly two or three on a ten-point scale, regardless of the paper. “Naively prompting an LLM is not enough,” she concluded.
Her second attempt was more elegant in theory: use LLMs as generators to produce a continuous spectrum of rhetorical styles — from ultra-conservative to highly promotional — for the same paper content, then locate the actual abstract somewhere on that spectrum. She called this chain-based sampling. It failed too. LLMs produce stable extremes — a very cautious abstract, a very promotional one — but when asked to produce intermediate levels between two reference examples, they anchor too strongly on those references. The middle of the spectrum collapsed into outputs indistinguishable from the endpoints.
Her third attempt worked, and it worked because it abandoned the attempt to instruct the LLM directly in favor of exploiting something LLMs do naturally: role-playing. Rather than asking the model to write at “promotional level five,” she asked it to write as a specific persona — a sleep-deprived PhD student, a famous AI researcher, a science communicator. The model had no concept of a numerical rhetorical scale, but it had a very clear concept of how those people write. Thirty personas, designed to span the full range from cautious to visionary, were calibrated using a Bradley-Terry pairwise ranking model — comparing each persona’s output against real paper abstracts to map where in the rhetorical distribution it fell — and then validated against human expert assessments. Only after both checks was the framework applied at scale.
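To make the calibration step concrete, here is a minimal sketch of how a Bradley-Terry model turns pairwise "which text is more promotional?" judgments into a single strength score per item. It is an illustration only, not Qiu's implementation: the function name `fit_bradley_terry`, the three item labels, and the comparison counts are all invented for the example, and the fitting procedure shown is the standard minorization-maximization update for this model.

```python
def fit_bradley_terry(wins, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is the number of comparisons in which item i was judged
    more promotional than item j. Returns one positive strength per item,
    normalized so the strengths average to 1. Under the model, item i
    beats item j with probability p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    p = [1.0] * n  # start with all items equally strong
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Rescale so the mean strength stays at 1 (the scale is arbitrary).
        scale = sum(new_p)
        new_p = [x * n / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical items and made-up counts, 10 comparisons per pair.
items = ["cautious persona", "real abstract", "visionary persona"]
wins = [
    [0, 2, 1],  # cautious persona judged more promotional 2/10 and 1/10 times
    [8, 0, 3],  # real abstract
    [9, 7, 0],  # visionary persona
]

scores = fit_bradley_terry(wins)
ranking = sorted(zip(items, scores), key=lambda pair: pair[1])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the real abstract lands between the two personas, which is the sense in which persona outputs can act as reference points on a rhetorical scale. The sketch assumes every item wins at least one comparison; an item with zero wins would be driven toward a strength of zero, and a real pipeline would need to handle that case.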
The results were striking. Rhetorical scores have risen sharply in the scientific literature since 2023, and the rise correlates with evidence of LLM-assisted writing. High rhetorical scores predict media attention and citation counts — but not peer-review ratings, consistent with the hypothesis that hype shapes how broad audiences engage with science without influencing expert evaluation of the underlying work. The paradox Qiu left the room with was pointed: “LLMs are both a tool and part of the problem in this project. They help us measure hype as instruments. But they are also amplifying the AI hype itself.”