UMAP and the Problem You Couldn’t See Until You Could

Peter Bahr, Vice President and Managing Research Director, Strada Institute for the Future of Work

Peter Bahr’s AI journey is unusual in this collection because the central tool he adopted is not a large language model, not a deep neural network, and not generative in any sense. It is a dimensionality reduction algorithm. But the lesson of his story is about something that happens long before any algorithm is chosen: how do you know whether the method you’ve been using for decades is actually working?

The research question Bahr and his colleagues have spent years trying to answer is deceptively simple: how can community colleges better support a student population that is extraordinarily diverse — first-generation students, working adults, students with family care responsibilities, students with skill gaps, students enrolled episodically over many years? The standard approach, across decades of research, has been to cluster students into typologies — groups with similar backgrounds and behaviors that institutions can target with specific support. The standard method for doing that clustering has been K-means.

K-means, Bahr’s colleague Dr. Iran Shen demonstrated, is not up to the task. The problem is geometric. Community college student data lives in ten, twelve, or fourteen dimensions (behavioral variables, enrollment patterns, background characteristics), and humans cannot visualize more than three. Without the ability to see the data, there is no way to tell whether the groups K-means found are real or artifacts. “K-means got away with it,” Bahr said, “because we can’t see the data.” K-means returns as many groups as it is asked for, whether or not the data contains any cluster structure, so it can produce interpretable-looking groupings from noise. Decades of findings built on those groupings may be largely spurious.
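The point about groupings from noise can be illustrated with a minimal sketch. It assumes a plain Lloyd's-algorithm K-means written from scratch and a synthetic dataset of uniform random numbers in twelve dimensions. The function names, the seed, and the choice of four clusters are all illustrative assumptions, not the team's code or data.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    # Initialise centroids at k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical data: 300 "students" with 12 features drawn uniformly at
# random, so there is no cluster structure to find by construction.
data_rng = random.Random(42)
noise = [[data_rng.random() for _ in range(12)] for _ in range(300)]

labels = kmeans(noise, k=4)
sizes = [labels.count(c) for c in range(4)]
print("cluster sizes on pure noise:", sizes)
```

The algorithm hands back a partition of the noise without any warning, and nothing in the output distinguishes it from a partition of genuinely clustered data. That is the gap a visual check is meant to close.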

UMAP (Uniform Manifold Approximation and Projection) addresses the problem by projecting high-dimensional data into two or three dimensions while preserving local structure and, as far as a low-dimensional projection allows, the broader global layout. Students who are close neighbors in the original space stay close in the projection, and the relative arrangement of neighborhoods is intended to remain informative. The team uses UMAP in three distinct roles across the analysis workflow. Before clustering, it reveals whether natural groupings exist in the data at all. During clustering, it serves as a performance optimizer: the team varies configuration parameters and identifies where results are most stable. After clustering, it acts as a visual validator, checking whether the boundaries the algorithm found align with natural gaps in the data or are impositions on a continuous landscape.
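The "most stable" criterion in the second role can be made concrete with a small sketch. This is not UMAP itself and not necessarily the measure the team uses: it assumes stability is scored with the Rand index, the share of point pairs on which two clusterings agree about "same group" versus "different group". The labelings below are hypothetical stand-ins for clustering runs under different configurations.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree.

    A pair counts as agreement when both labelings put the two points
    in the same cluster, or both put them in different clusters.
    Cluster names do not matter, only the grouping.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def mean_pairwise_agreement(runs):
    """Average Rand index over every pair of runs; higher means more stable."""
    scores = [rand_index(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)


# Hypothetical labelings of six points from three runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same grouping as run_1, clusters renamed
run_3 = [0, 1, 0, 1, 0, 1]  # a very different grouping

print(rand_index(run_1, run_2))
print(rand_index(run_1, run_3))
print(mean_pairwise_agreement([run_1, run_2, run_3]))
```

In a workflow like the one described, each run would come from a different parameter setting, and the settings whose runs agree most with their neighbors would be the candidates for "most stable". A high score shows only that the method is consistent, not that the groups are real, which is why the visual check after clustering remains a separate step.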

What the UMAP-supported workflow revealed was striking: approximately 70% of community college students, at least along behavioral dimensions, cannot be cleanly separated from one another. They exist on a continuum. The prior typological literature — built on K-means — had partitioned them into four, five, or eight discrete groups that appeared meaningful but were, in large part, artifacts of the method. The implication for practice is significant: student success interventions designed for discrete segments may need to be fundamentally rethought for a population that is more continuous than categorical. Bahr closed with a note of cautious optimism: “We’re beginning to make some meaningful advancements — hopefully not spurious. Hopefully we don’t do two more decades of this and then discover we were wrong again.”