Citation Fidelity: Measuring the Telephone Game Across Millions of Papers

June 17, 2026

Hong Chen, PhD Student, School of Information

Hong Chen opened with a scenario familiar to almost anyone who works in research: you follow a citation to its source and find that the original paper does not quite say what it was claimed to say. The hedging is gone. The scope has widened. The specific population from which a finding was drawn has disappeared into a universal claim. A carefully qualified observation has become a fact. The sociology of science has known about this phenomenon — citation distortion — for decades. But the evidence had always been limited to a few hundred citation pairs, checked by experts, in a single field. Enough to confirm that the problem existed; not enough to see its patterns. Chen’s question was whether AI could change that.

The pipeline he and his collaborators built was grounded in a foundational design insight: no single model could handle everything the measurement required. The problem had to be decomposed before it could be solved. The first step identifies which sentences in a citing paper are “reporting citations” — sentences that actually state a finding from another work, as opposed to background references, methodological comparisons, or acknowledgments. The second step extracts the corresponding claims from the cited paper — the specific results or conclusions being cited. The third step measures how faithfully the citing paper’s version conveys the original. Each step required its own annotated training data, its own fine-tuned model, and its own validation against human judgment before it was trusted. Built that way, the pipeline could process 30 million citation pairs — a scale at which questions that had been unanswerable for decades became tractable.
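The three-step decomposition can be sketched as a chain of interchangeable stages. This is a minimal illustration, not the team's implementation: every name here (`run_pipeline`, `is_reporting_citation`, `extract_cited_claim`, `score_fidelity`, `FidelityRecord`) is hypothetical, and the keyword and token-overlap heuristics are toy stand-ins for the separately trained and validated fine-tuned models the talk described.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class FidelityRecord:
    citing_sentence: str
    cited_claim: str
    fidelity: float


def _tokens(text: str) -> Set[str]:
    """Lowercased alphabetic tokens; a crude stand-in for a real encoder."""
    return set(re.findall(r"[a-z]+", text.lower()))


def _jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Step 1 (toy stand-in): does the sentence report a finding from another work?
REPORTING_CUES = ("found", "showed", "reported", "demonstrated", "observed")


def is_reporting_citation(sentence: str) -> bool:
    lowered = sentence.lower()
    return any(cue in lowered for cue in REPORTING_CUES)


# Step 2 (toy stand-in): pick the cited-paper sentence closest to the citation.
def extract_cited_claim(sentence: str, cited_sentences: List[str]) -> Optional[str]:
    if not cited_sentences:
        return None
    return max(cited_sentences, key=lambda s: _jaccard(sentence, s))


# Step 3 (toy stand-in): score how closely the citation matches the claim.
def score_fidelity(sentence: str, claim: str) -> float:
    return _jaccard(sentence, claim)


def run_pipeline(
    citing_sentences: List[str],
    cited_sentences: List[str],
    detect: Callable[[str], bool] = is_reporting_citation,
    extract: Callable[[str, List[str]], Optional[str]] = extract_cited_claim,
    score: Callable[[str, str], float] = score_fidelity,
) -> List[FidelityRecord]:
    """Run the three stages in order; each stage can be swapped independently."""
    records = []
    for sentence in citing_sentences:
        if not detect(sentence):
            continue  # background, methodological, or acknowledgment citation
        claim = extract(sentence, cited_sentences)
        if claim is None:
            continue
        records.append(FidelityRecord(sentence, claim, score(sentence, claim)))
    return records
```

Passing each stage in as a callable mirrors the design point in the talk: each step can be trained, validated against human judgment, and replaced on its own without touching the other two.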

The findings confirmed and extended what small-scale studies had suggested. Open-access papers are cited with higher fidelity than paywalled ones, consistent with the hypothesis that researchers who have not read the original paper closely are more likely to distort it. Within-field citations are more faithful than cross-field ones. Self-citations are most faithful of all. Fidelity declines as the time gap between the cited and citing paper grows. And there is an intriguing asymmetry in the role of seniority: more senior first authors produce less faithful citations, while last-author seniority shows no consistent effect. The most striking result came from the “telephone game” mechanism. Using a quasi-causal design that matched citations of the same original paper differing only in whether they also cited an intermediary source, the team found that routing through an unfaithful intermediary propagates distortion downstream. Faithful intermediaries do not add damage; unfaithful ones compound it. The classic opioid epidemic case, where a carefully scoped clinical letter about hospital patients became, through 600+ citations, a claim that pain patients faced no addiction risk at all, now has empirical company across millions of pairs.
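The matched comparison behind the telephone-game result can be illustrated with a small sketch. It assumes a hypothetical record layout (`Citation`), a hypothetical cutoff (`threshold`) for calling an intermediary unfaithful, and matching on the original paper alone; the actual study presumably matched on more than that, and none of the names or numbers below come from it.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List, Optional


@dataclass
class Citation:
    original_id: str                        # the paper being cited
    fidelity: float                         # citing sentence vs. original claim
    intermediary_fidelity: Optional[float]  # None if no intermediary was cited


def matched_fidelity_gap(
    citations: Iterable[Citation], threshold: float = 0.5
) -> Optional[float]:
    """Average within-paper gap: (via unfaithful intermediary) minus (direct).

    Citations are grouped by the original paper they cite, so each gap
    compares citations of the same source. A negative result means that
    citations routed through an unfaithful intermediary were less faithful.
    The threshold is an assumed, illustrative cutoff.
    """
    groups: Dict[str, Dict[str, List[float]]] = defaultdict(
        lambda: {"direct": [], "unfaithful": []}
    )
    for c in citations:
        if c.intermediary_fidelity is None:
            groups[c.original_id]["direct"].append(c.fidelity)
        elif c.intermediary_fidelity < threshold:
            groups[c.original_id]["unfaithful"].append(c.fidelity)
        # Citations via faithful intermediaries are left out of this contrast.

    gaps = [
        mean(g["unfaithful"]) - mean(g["direct"])
        for g in groups.values()
        if g["direct"] and g["unfaithful"]  # need both arms to compare
    ]
    return mean(gaps) if gaps else None
```

Original papers that lack either arm are dropped rather than imputed, which keeps every comparison within a single cited paper at the cost of discarding unmatched citations.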

Chen was candid about the timing of the work: it was completed largely before large language models became broadly available, using fine-tuned encoder models that are less capable in some respects but far more efficient and reproducible. Processing 30 million citation pairs with LLM inference would cost orders of magnitude more in both time and computation. The encoder pipeline, running on standard cluster hardware, remains the practical instrument. Looking ahead, he sees LLMs not as replacements for that pipeline but as a layer built on top of it: able to reason through longer document contexts, to explain why a specific citation is unfaithful rather than merely flagging it, and to generate the synthetic training data needed for next-generation classifiers that can distinguish hedging removal from scope inflation from outright factual error. Scale, Chen concluded, is what unlocks new questions, and a well-built measurement tool is what makes scale meaningful.