U-M Annual Data Science & AI Summit 2023

Research Talks

Monday, November 13

Research Talks, Session 1

Session Chair: Kai Zhu, Associate Professor of Environment and Sustainability, School for Environment and Sustainability and Associate Professor of Ecology and Evolutionary Biology, College of Literature, Science, and the Arts

Data for Urban Sustainability – The Good, The Bad, and The Ugly

Benjamin Goldstein, Joshua Newell, and Dimitrios Gounaridis, School for Environment and Sustainability

Measuring and mapping the environmental impacts of cities is critical to helping cities decarbonize and achieve other societal sustainability goals. However, data to support these efforts have historically been scarce. This undermines the ability of cities to become more sustainable, equitable, and just. Here, we highlight three data sources at the frontiers of urban sustainability assessment. First, the Good: Panjiva, a structured repository of international trade data that is useful for reconstructing the supply chains that are both indispensable to modern cities and a cause of environmental degradation in far-flung locations. Second, the Bad: conflicting national dietary surveys (NHANES, FoodAPS) that hamper efforts to accurately measure urban food consumption and its related environmental impacts. Third, the Ugly: CoreLogic, the back-end of Zillow, an immense database of uneven building and property attributes for every land parcel in the United States, useful for analyzing everything from green gentrification to flood risk to urban energy consumption. Through three case studies, we demonstrate the potential and pitfalls of using each data source. We then argue that although we live in an age of plentiful data, critical data gaps must be addressed to foster a sustainable urban future.

Proactive policing as reinforcement learning

Tian An Wong, Mathematics and Statistics, U-M Dearborn

Recent analyses of predictive policing have shown the inherent biases in such systems. We show that the models considered in those analyses in fact apply to proactive policing in general, which can also be viewed as a reinforcement learning system, and thus may also lead to over-policing. Time permitting, I will also discuss ongoing work assessing the efficacy and effects of the Detroit Police Department’s expansion of the use of ShotSpotter, a gunshot acoustic detection device, as an example of surveillance technology in current use.

Personalized Medicine Discovery through Machine Learning

Donglin Zeng, Biostatistics

Advances in technology are revolutionizing medical research by collecting large-scale data from each individual patient (clinical biomarkers, genomics, electronic health records), making it possible to meet the promise of individualized treatment and health care. The availability of these rich data sources provides new opportunities to deeply tailor treatment for each patient, while at the same time posing tremendous challenges for analyzing highly complex and noisy data in personalized medicine discovery. In this talk, I will present an overview of machine learning methods we have recently developed in this direction: learning methods for the discovery of optimal dynamic treatment regimens, personalized dose finding, benefit-risk analysis, and medical diagnostics. For each method, we establish its theoretical statistical properties, including consistency and learning rates. The comparative advantages over existing methods are demonstrated in simulation studies and applications to real-world studies.

Research Talks, Session 2

Session Chair: Elle O’Brien, Lecturer III in Information and Research Investigator, School of Information

Using Galaxy Shapes to Develop a Debiasing Framework for Deep Learning

Christopher J Miller, Astronomy and Department of Physics

Esteban Medina and Guillermo Cabrera, Computer Science, University of Concepcion, Chile

Training deep learning models usually requires a large amount of correctly annotated data. Depending on the data domain, correctly annotating data can prove difficult, as in many cases the ground truth is not obtainable. This is true for numerous problems in astronomy, one being the morphological classification of galaxies. As a result, astronomers are forced to rely on an estimate of the ground truth, often generated by human annotators. The problem is that human-generated labels have been shown to contain biases related to the quality of the data being labeled, such as image resolution. Even datasets annotated by experts can be affected by this type of bias. In this work, we show that deep learning models trained on biased data learn the bias contained in the data and transfer it to their predictions. We also propose a framework for training deep learning models that allows us to obtain unbiased models even when training on biased data. We test our framework by training a classification model on images of galaxies morphologically classified by humans and show that the AI system can mitigate the bias. We also examine the AI learning process to gain insight into how the AI performs this mitigation.

Differentiable Physics

Venkat Viswanathan, Aerospace Engineering

Differentiable physics provides a new approach for modeling and understanding physical systems by pairing the new technology of differentiable programming with classical numerical methods for physical simulation. I’ll discuss two avenues of learning residual physics via the differentiable physics approach: (i) the exchange-correlation problem for density functional theory and (ii) closure models for turbulent fluid flow.

Multimodal Learning from the Bottom Up

Andrew Owens, Electrical and Computer Engineering

Today’s machine perception systems rely heavily on supervision provided by humans, such as labels and natural language. I will talk about our efforts to make systems that, instead, learn from two ubiquitous sources of unlabeled data: visual motion and cross-modal associations. I will first discuss our work on creating unified motion analysis methods that can address both object tracking and optical flow tasks. I’ll then discuss how, perhaps surprisingly, these same techniques can be applied to localizing sound sources from stereo audio, and how sound localization can be jointly learned with visual rotation estimation. Finally, I’ll talk about our work on learning from tactile sensing data that has been collected “in the wild” by humans, and our work on capturing camera properties by learning the cross-modal correspondence between images and camera metadata.

Tuesday, November 14

Research Talks, Session 3

Session Chair: Alex Gorodetsky, Assistant Professor of Aerospace Engineering, College of Engineering

Barriers to data science adoption in scientific research

Elle O’Brien, School of Information

Jordan Mick, School of Public Health

Data science has been heralded as a transformative family of methods for scientific discovery, yet many researchers face substantial obstacles incorporating these techniques into their existing research. Here we report findings from a qualitative interview study of researchers at the University of Michigan, all scientists who currently work outside of data science (in fields such as astronomy, education, chemistry, and political science) and wish to adopt data science methods as part of their research program. These scientists quickly identified that they lacked the expertise to confidently implement and interpret new methods. For most, independent study was unsuccessful, owing to limited time, missing foundational skills, and difficulty navigating the marketplace of educational data science resources. Overwhelmingly, participants reported isolation in their endeavors and a desire for a greater community. This talk will focus on targets for academic data science communities, leaders, and professional societies to build supportive communities of practice and expertise in applied data science.

How is Surgical Training like a Computer Adaptive Test?

Andrew Krumm, Learning Health Sciences

This talk provides an overview of how my colleagues and I are combining machine learning and traditional psychometrics to overcome critical challenges in assessing the operative performance of surgical trainees. The ways in which we collect, analyze, and report on data have broad application for professions like teaching, social work, and health professions more generally. Using the Society for Improving Medical Professional Learning’s (SIMPL) operative performance app, SIMPL OR, I demonstrate how methods from computer adaptive testing can be applied to performance assessment tasks and produce overall scores that can accurately reflect trainees’ developing ability, reduce measurement burden, and generate tailored recommendations.

Michele Peruzzi, Biostatistics

Title and abstract to be added soon

Discovering Markers and Mechanisms of Mental Disorders with Natural Language Processing and Daily Thought Sampling

Chandra Sripada, Psychiatry and Philosophy

Major mental disorders, such as depression, anxiety, and ADHD, involve alterations in the internal spontaneous stream of thought (SST), the ongoing flow of ideas, impressions, and memories that unfold before the mind. SST has long been regarded as beyond the reach of systematic scientific study, but our group has pioneered the use of verbalized thought protocols to collect a sizable database of SST from subjects, including app-based daily SST sampling. In this talk, we present an overview of our research program leveraging natural language processing (NLP) and large language models (LLMs) to: 1) discover subtle predictive markers in SST linked to clinically relevant traits and conditions; and 2) apply model-based methods (e.g., Markov transition graphs) to uncover the underlying mechanisms that produce altered SST patterns in these clinical conditions.