U-M Annual Data Science & AI Summit 2022

Propelling Original Data Science Showcase

PODS Showcase

The MIDAS Propelling Original Data Science (PODS) grant strongly encourages works that transform research domains through data science and AI, works that improve the reproducibility of research, and works that promise major impact and potential for significant expansion.

AI-Based Author Entity Disambiguation for Promoting Fair Evaluation of Women in Science

Jinseok Kim, Research Assistant Professor, Survey Research Center, Institute for Social Research, Research Assistant Professor, Information and Adjunct Lecturer in Information, School of Information

Using bibliographic data, studies have reported that female scholars tend to produce fewer papers and attract fewer citations than male scholars, indicating that women in science underperform in terms of scholarly productivity and impact. I argue that such findings are likely based on flawed data in which female authors are not properly identified. Specifically, female scholars may have changed their last names after marriage and used the changed names in publications instead of the maiden names used in publications authored before marriage. As no existing bibliographic data service consolidates author entities with different last names, entities of female authors who change names are inevitably split into different entities: one with a maiden name and the other with a marital name. This means that publications and citations of female authors who have used different names are likely undercounted, possibly leading to under-evaluation of their scholarly productivity and impact. This issue can hinder fair evaluation of women in science, as female researchers are increasing in number while only a small fraction of women are found to retain their maiden names. To address the issue, this project will develop a machine learning method to consolidate female author entities in bibliographic data, thus promoting fair evaluation of women in science (> Responsible Research Pillar). Under the PODS grant, the PI will first create large-scale labeled data to train algorithmic models to merge female author entities that have been split under different names (> Data Pillar). Then, the PI will apply the models to author entities recorded in PubMed, which indexes research papers in biomedicine (> Data Pillar), and demonstrate how the correct identification of name-changed female authors can lead to a different understanding of the research productivity and citation-based impact of female scholars in a field where almost half of scientists are estimated to be female (> Analytics Pillar).
Based on this case study and the algorithmic method, the PI will apply for grants from funders such as the NSF to expand the PODS project into a large-scale, cross-field project (> Follow-on Expansion). The findings derived from this project will enable the science community and policy makers to correctly characterize the research productivity and impact of female scholars and to implement effective supports and policies to promote fairness and equity for women in science (> Future Impact). A tool that implements the newly developed method will be shared under the UM license for reuse, validation, and improvement with AI researchers (> Contribution to UM data science and AI research ecosystem).
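As a rough illustration of the consolidation step described above, the sketch below flags two author records with different last names as a possible name-change pair. It is not the PI's method: the record fields, the coauthor-overlap (Jaccard) feature, the 0.3 threshold, and the made-up names are all assumptions chosen for illustration, whereas the project proposes to train models on labeled data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AuthorRecord:
    """Hypothetical author entity; the fields are illustrative assumptions."""
    first_name: str
    last_name: str
    coauthors: frozenset = field(default_factory=frozenset)


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_merge_candidate(r1, r2, threshold=0.3):
    """Flag two records as a possible name-change pair.

    Heuristic stand-in for a trained model: same first name, different
    last name, and enough shared coauthors. The threshold is arbitrary.
    """
    same_first = r1.first_name.lower() == r2.first_name.lower()
    different_last = r1.last_name.lower() != r2.last_name.lower()
    overlap = jaccard(r1.coauthors, r2.coauthors)
    return same_first and different_last and overlap >= threshold


# Invented example records, not real authors.
before = AuthorRecord("Jane", "Smith", frozenset({"A. Lee", "B. Park", "C. Diaz"}))
after = AuthorRecord("Jane", "Miller", frozenset({"A. Lee", "B. Park", "D. Chen"}))
unrelated = AuthorRecord("Jane", "Olsen", frozenset({"E. Wong"}))

print(is_merge_candidate(before, after))      # shares 2 of 4 coauthors
print(is_merge_candidate(before, unrelated))  # shares none
```

A learned model would replace the fixed threshold with a score trained on labeled pairs, and would likely draw on more evidence than coauthor overlap alone.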

Unlocking the Vault: Machine Learning Methods for the Mobilization of Data from Millions of Plant Images

Stephen Smith, Associate Chair, Department of Ecology and Evolutionary Biology, Professor of Ecology and Evolutionary Biology and Associate Curator, Ecology and Evolutionary Biology, College of Literature, Science, and the Arts
William Weaver, College of Literature, Science, and the Arts

While genomic data has revolutionized the biological sciences, data that record physical attributes of organisms remain limited due to the challenges of morphological data gathering techniques. Herbaria contain a staggering wealth of historical biodiversity in the form of specimens and their associated information. However, lack of access constrains the type and scale of research that can leverage these specimens. Recently, most herbaria have undertaken the immense task of digitizing their collections, allowing for images of the specimens to be searchable and easily accessible. However, collecting trait, morphometric, and phenotypic data from digitized specimens remains laborious and limiting. Large-scale genomic analyses are increasingly common, but large-scale morphometric studies are rare due to the time-intensive nature of data collection. We have demonstrated with our software package ‘LeafMachine’ that recent advances in machine learning models and computer vision methods are capable of rapidly extracting useful data from digitized specimens. Nevertheless, major challenges remain. The proposed project will create a deployable open-source software package enabling researchers to efficiently process herbarium specimen images. This plays a crucial role in the development of the “virtual herbarium” where the goal is to have rich data accompany each specimen, extending the usefulness of collections in large-scale research projects. We will also leverage the world-class herbarium maintained by the University of Michigan to test and validate our software at scale while also significantly increasing the research impact of its specimens, in line with the Emerging pillar. This connection to the UM herbarium will strengthen relationships among diverse data-science resources and capabilities on the UM campus.
The developments made here will also facilitate future expansion beyond herbarium images to those collected by citizen scientists and stored in public databases like iNaturalist, and even on Twitter and Instagram. This proposal directly addresses both the Emerging and Methodological Foundations pillars.

Developing Language-based Tools For Real-Time Counseling Feedback

Veronica Perez-Rosas, Assistant Research Scientist, Electrical Engineering and Computer Science, College of Engineering
Kenneth Resnicow, Irwin M Rosenstock Collegiate Professor of Public Health, Professor of Health Behavior and Health Education, School of Public Health and Professor of Pediatrics, Medical School
Rada Mihalcea, Janice M Jenkins Collegiate Professor of Computer Science and Engineering and Professor of Electrical Engineering and Computer Science, College of Engineering

In 2019, 24% of American adults with mental health issues reported unmet treatment needs. Among several other reasons, this can be largely attributed to the current shortage of mental health workers. The situation is also exacerbated by recent issues such as the COVID-19 pandemic and mental health provider burnout. While there is an increasing need for mental health treatment, there are also important barriers to the rapid and effective training of mental health practitioners, such as the need for extensive clinical supervision and the laborious process this entails. AI technology holds the promise to address such challenges by providing low-resource and cost-effective opportunities for training counselors to practice and receive real-time evaluative feedback. Current strategies for counselor training rely on monitoring and recording live video interactions, which are then manually evaluated to provide constructive feedback. However, this feedback is usually not immediate as it requires an expert instructor to watch and evaluate each recording. In this project, we seek to build language-based evaluative tools to provide timely feedback to counselors in training while they learn to formulate responses to clients’ statements. We will focus on reflective listening, i.e., the ability to understand and reflect on what the patient is saying. We plan to use Natural Language Processing (NLP) to build a system able to (1) measure the quality of a reflection formulated by a counseling student in response to a patient statement by providing a reflection accuracy score; and (2) suggest rewritings when responses do not adhere to proper counseling style. Our project aligns with the MIDAS Analytics pillar, as we will use AI methods to build language tools to enhance current learning strategies used in the training of future counseling professionals, which in turn will have a positive impact on the current surge in demand for mental health services.

Building a Genomic Literature Knowledgebase

Jie Liu, Assistant Professor of Computational Medicine and Bioinformatics, Medical School and Assistant Professor of Electrical Engineering and Computer Science, College of Engineering

Our knowledge regarding the human genome has been exponentially increasing, driven by the ever-evolving biotechnologies that characterize the human genome from different perspectives. A major source of our knowledge about the human genome comes from direct measurements and annotations of different genomic elements, exemplified by a number of ground-breaking consortia including the ENCODE project, the Roadmap Epigenomics project, the GTEx project, the 4D Nucleome project, and the HuBMAP project. While each of these consortia has a dedicated Data Coordination Center and a data portal, these consortium datasets are usually tabular-structured, heterogeneous, and sparse, and as a result, the knowledge accumulated from individual consortia is isolated. Another source of our knowledge regarding the human genome comes from individual research labs, whose work is usually hypothesis-driven and captured in the biological literature. However, the ever-growing biological literature is stored as unstructured text, and we do not have an infrastructure to extract the knowledge buried in it. Consolidating the two knowledge sources is even more challenging. To tackle these challenges, we aim to develop an open knowledge network for navigating and embedding our ever-growing knowledge regarding the human genome. We will adopt domain knowledge and use cutting-edge machine learning approaches to improve entity and relation extraction from genomic literature, and consolidate the extracted knowledge with our existing GenomicKB knowledge graph. We will also improve genomic literature search and navigation in the light of our knowledge network.

Combating and Predicting Drug Resistance using a Hybrid Mechanistic Machine Learning Model

Margaret Reuter, Research Fellow, Biomedical Engineering, College of Engineering and Medical School
Rudy Richardson, Dow Professor Emeritus of Toxicology, Professor Emeritus of Environmental Health Sciences, School of Public Health and Associate Professor Emeritus of Neurology, Medical School
Sriram Chandrasekaran, Assistant Professor of Biomedical Engineering, Medical School

Pathogens are becoming progressively drug resistant, yet drug discovery methods have failed to produce new classes of antimicrobials for decades. As increasingly pathogenic strains of diseases emerge, there is an urgent need to identify effective therapies from existing U.S. Food and Drug Administration approved drugs. Multi-drug regimens are already being used to fight antibiotic resistance, but they are often chosen empirically, leading to suboptimal treatment outcomes and the spread of resistance. Using a unique combination of structural molecular docking, chemogenomic studies, and machine learning algorithms, we will create a tool for developing effective drug combination therapies and investigate the biochemical principles that govern drug interactions and mechanisms of action. First, due to the flexible, multiscale, and hybrid nature of our model, we will be able to examine many combinations that are infeasible to interrogate via physical experiments due to cost and time. Second, the model will enable us to explore more deeply the underlying biological and chemical factors that influence synergy, for better design of drug combination therapy.

Improving Cardiovascular Disease Detection with a Novel Multi-label Classifier for Electrocardiograms: Capturing Label Uncertainty and Complex Hierarchical Relationships between Output Classes

Negar Farzaneh, Research Investigator, Emergency Medicine, Medical School
Hamid Ghanbari, Assistant Professor of Internal Medicine, Medical School
Kevin Ward, Medical School
Sardar Ansari, Research Assistant Professor, Emergency Medicine, Medical School

The objective of this project is to develop a multi-label classifier that captures the dependency between different output labels as well as the uncertainty about the ground truth labels in the context of electrocardiogram (ECG) classification. ECG is the primary test for cardiovascular diagnosis, and while automated ECG analysis models are used clinically, they have several flaws, often resulting in inaccurate output. First, the models do not account for hierarchical dependencies among cardiac diseases. For example, both “ectopic atrial tachycardia” and “multifocal atrial tachycardia” share the same “atrial tachycardia” parent disease, but this relationship is not accounted for when using a conventional multi-label classifier, which assumes all classes are equally distinct and independent. Second, the ground truth labels from diagnostic statements often reflect clinician doubts, but current deep learning models ignore these doubts and treat uncertain labels as being a definitive “presence” or “absence” of a disorder. Consequently, we propose to overcome these obstacles by developing novel data-driven diagnosis models, leveraging a unique cohort of >2.15 million ECGs collected at Michigan Medicine. Specifically, we will develop a novel deep learning classifier that takes the hierarchical relationships into account. We will also use a soft (vs. hard) labeling approach to leverage information regarding uncertainty in the model. This research is aligned with the MIDAS “Analytics” and “Emerging” pillars by developing a decision analytic that can precisely determine multiple cardiac diseases, which will improve cardiovascular disease diagnosis and prevent clinical mismanagement resulting from ECG model inaccuracies. This PODS award will lay the necessary groundwork for us to submit a future NIH grant proposal to develop a comprehensive, fully automated cardiac decision support system.
Moreover, we will disseminate our findings to other researchers at UM and beyond via presentations and publications.
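The two ideas in this abstract, soft labels and label hierarchy, can be illustrated with a minimal sketch. It is not the authors' classifier: the soft-label value of 0.7, the rule that a parent is at least as probable as its most probable child, and all function names are assumptions made for illustration. Only the disease names come from the example in the abstract.

```python
import math

# Two-level hierarchy built from the example in the abstract.
PARENT = {
    "ectopic atrial tachycardia": "atrial tachycardia",
    "multifocal atrial tachycardia": "atrial tachycardia",
}


def soft_bce(pred, target, eps=1e-12):
    """Binary cross-entropy that accepts a soft target in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def enforce_hierarchy(probs):
    """Raise each parent's probability to at least its largest child's."""
    out = dict(probs)
    for child, parent in PARENT.items():
        out[parent] = max(out.get(parent, 0.0), out.get(child, 0.0))
    return out


# An uncertain diagnostic statement might map to a soft label of 0.7
# instead of a hard 1; the value 0.7 is an illustrative assumption.
soft_target = 0.7
loss_matched = soft_bce(0.70, soft_target)        # prediction mirrors the doubt
loss_overconfident = soft_bce(0.99, soft_target)  # prediction ignores the doubt
print(loss_matched < loss_overconfident)

raw = {
    "atrial tachycardia": 0.40,
    "ectopic atrial tachycardia": 0.65,
    "multifocal atrial tachycardia": 0.10,
}
consistent = enforce_hierarchy(raw)
print(consistent["atrial tachycardia"])
```

With a soft target, the loss is smallest when the predicted probability matches the stated degree of doubt, so an overconfident prediction is penalized. The post-hoc hierarchy step is only one simple way to keep parent and child outputs consistent; a model could instead encode the hierarchy in its architecture or loss.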

A Machine-Learning Approach to Reduce Uncertainty in Climate Forcing by Aerosols

Joyce Penner, Ralph J Cicerone Distinguished University Professor of Atmospheric Science and Professor of Climate and Space Sciences and Engineering, College of Engineering
Xianglei Huang, Professor of Climate and Space Sciences and Engineering, College of Engineering
Yang Chen, Assistant Professor of Statistics, College of Literature, Science, and the Arts

The largest uncertainties in climate forcing are associated with the forcing by atmospheric aerosols. Uncertainties in aerosol climate forcing are estimated from the spread in forcing estimates in global models and/or observations. The uncertainties in these estimates have remained large primarily because there are no direct estimates of forcing based solely on observations and the many model processes required to estimate forcing are treated differently in different models. As a result, a given model may fit some types of data, while other models fit other data. So far, it has been impossible to understand the causes of the differences in models and thereby to decrease the spread in forcing estimates. This project seeks to develop a data-driven method that will help ascertain why models differ, and ultimately, what may be needed to correct models and deliver estimates of the climate forcing by aerosols that agree more widely. In this work we propose to build a feedforward neural-network emulator that uses inputs from the aerosol/climate model from Penner’s group that are adjusted to better fit the available observations. This will allow us to determine which aspects of the aerosol/climate model need to be improved. We will also work with collaborators at ETHZ to build a similar emulator and compare which aspects of their model need to be adjusted to better fit the observations. The hope is that this will allow the improvement of processes in the two models, and, consequently, allow the two model predictions of climate forcing to converge. A follow-on project will enlist other aerosol modeling groups. This will ultimately allow improvements to all models and thus lead to more cohesive estimates of climate forcing, thereby reducing its uncertainty. The proposed study fits the Responsible Research Pillar and the Emerging Pillar for the 2022 PODS grant solicitation.

Developing a Large-Scale Dataset to Track Romantic Relationship Formation and Maintenance

Amie Gordon, Assistant Professor of Psychology, College of Literature, Science, and the Arts and Faculty Associate, Research Center for Group Dynamics, Institute for Social Research
Elizabeth Eve Bruch, Associate Professor of Sociology, Associate Professor of Complex Systems, College of Literature, Science, and the Arts and Research Associate Professor, Population Studies Center, Institute for Social Research

Supportive relationships are one of the most robust predictors of well-being and longevity and, thus, a key area for research and intervention. However, we know little about the processes through which people enter into committed relationships and how partner choices are associated with the challenges people encounter in maintaining their relationships over time. To address this gap in knowledge, we need longitudinal data from large groups of people as they enter and then sustain or dissolve relationships. Dating apps have detailed information on how people enter into relationships; however, researchers do not typically have access to the data collected by these apps. In addition, existing dating apps only track relationship formation, not relationship maintenance. Therefore, we propose creating a research-based dating app that will track individuals’ dating decisions and relationship behaviors over time. By launching this data collection tool in the University of Michigan (U-M) population, we will create a valuable resource that can help answer questions regarding preferences and choice, partner selection, relationship formation, and relationship maintenance. This project aligns most closely with the Data Pillar, as it will result in a novel and important data source for understanding human behavior. We also anticipate working with computer science collaborators to develop a privacy-preserving synthetic dataset as a model of how to do open-source, transparent, privacy-preserving work using detailed observational data from apps, which aligns with both the Data Pillar and the Responsible Research Pillar. In addition, this project aligns with the Analytics Pillar because it provides opportunities to use cutting-edge methods in choice modeling and natural language processing to better predict human behavior and important health-related outcomes (e.g., relationship status, relationship quality).
This project has potential for broad impact and the improvement of society, as insights about relationships and data privacy can benefit all populations.

Sustainability Outcomes of Restrictions on Human Actions: COVID-19 Mobility Changes, Forest Fires and Air Pollution across Land Regimes

Arun Agrawal, Samuel Trask Dana Professor, Professor of Environment and Sustainability, School for Environment and Sustainability, Faculty Associate, Center for Political Studies, Institute for Social Research, Professor of Political Science, College of Literature, Science, and the Arts and Professor of Public Policy, Gerald R Ford School of Public Policy
Ines Ibanez, Professor of Environment and Sustainability, School for Environment and Sustainability, Professor of Ecology and Evolutionary Biology, College of Literature, Science, and the Arts, Professor of Environment, Program in the Environment, School for Environment and Sustainability and College of Literature, Science, and the Arts and Adjunct Professor of Biological Station, College of Literature, Science, and the Arts
Yang Chen, Assistant Professor of Statistics, College of Literature, Science, and the Arts

Changing human behavior has substantial potential to slow and even alter degradation trends. However, little is known about the potential and the effectiveness of changing human behavior in improving sustainability outcomes. Mandatory lockdowns and calls for voluntary restraint were key interventions to reduce disease exposure risks in the early stage of COVID-19. The resulting global reductions in human mobility constituted an unprecedented and massive natural experiment in how changes in human behaviors affect sustainability. We have analyzed the effects of mobility restrictions on forest fires across the Amazon basin using large-scale remote sensing datasets on mobility and fires. We found that mobility restrictions initially led to a decline in fires, but fires rebounded in Amazon forests within 30 days to levels exceeding pre-COVID-19 lockdown levels. The resulting research is now under review at a prominent journal. We seek support from MIDAS to expand our current work to a global scale. Spatially, we will examine how mobility restrictions affect the incidence of forest fires in other regions and forest types, and whether patterns in the Amazon are generalizable globally. Thematically, we will analyze effects of mobility restrictions on air pollution in the immediate to longer term and across rural to urban gradients. Bayesian mixed effects models with spatiotemporal autocorrelations in conjunction with large-scale datasets will help generate a deeper understanding of the durability of sustainability outcomes associated with human behavioral changes. The proposed study aligns with all five research pillars of MIDAS, but particularly so with the Data, Analytics, and Emerging pillars.
We will be able to differentiate the effects of COVID-19 lockdowns on forest fires and air pollution across sectors; develop a full proposal for NSF’s DISES program on persistent effects of human behavioral changes; and contribute to UM’s research ecosystem by generating usable, publicly available COVID-Fires and COVID-Air Pollution datasets.

Machine Learning Guided Co-design for Reconstructive Spectroscopy

Qing Qu, Assistant Professor of Electrical Engineering and Computer Science, College of Engineering
Pei-Cheng Ku, Associate Chair, Department of Electrical and Computer Engineering and Professor of Electrical Engineering and Computer Science, College of Engineering

Spectroscopy is one of the most important and widely utilized techniques in science and technology, with broad applications in chemistry, life science, microbiology, the food industry, biomedical sensing (lab-on-a-chip), environmental monitoring, pharmaceutical research, the cosmetic industry, and quality control. For example, fluorescence spectroscopy is a crucial resource for viral detection and vaccine research, two needs of great societal importance during the pandemic. This research aims to develop machine learning methods for co-designing an on-chip spectrometer that can enable a highly miniaturized and portable sensing platform for UV-VIS, fluorescence, and chemi-/electro-luminescence spectroscopy applications. The major challenge lies in reconstructing the spectrum with a limited number of encoders/photodetectors, which results in challenging machine learning problems on which existing methods perform poorly. The collaboration between Qu (with expertise in machine learning) and Ku (with expertise in semiconductor devices) will resolve the challenge by developing new machine learning methods, which learn more precise models of the data acquisition process and provide more efficient reconstruction algorithms, leading to faster and more accurate spectrum recovery. In return, the developed learning methods will provide guidance for the better design of the sensing platform.
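To make the reconstruction problem concrete, the sketch below recovers a spectrum from fewer detector readings than spectral bins. It is a generic baseline rather than the method proposed in the project: it assumes a linear measurement model y = A x, where A is a hypothetical detector response matrix, and solves a ridge-regularized least-squares problem by gradient descent. The matrix entries, the test spectrum, and the parameters `lam`, `step`, and `iters` are all invented for illustration.

```python
def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def reconstruct(A, y, lam=1e-3, step=0.02, iters=5000):
    """Minimize ||A x - y||^2 + lam * ||x||^2 by plain gradient descent."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        r = [p - q for p, q in zip(matvec(A, x), y)]  # residual A x - y
        grad = [
            2.0 * sum(A[i][j] * r[i] for i in range(m)) + 2.0 * lam * x[j]
            for j in range(n)
        ]
        x = [xj - step * g for xj, g in zip(x, grad)]
    return x


# Hypothetical response of 3 detectors over 6 spectral bins.
A = [
    [1.0, 0.5, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.3, 1.0, 0.5, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.5, 1.0, 0.6],
]
x_true = [0.1, 0.8, 0.3, 0.0, 0.5, 0.2]  # invented spectrum
y = matvec(A, x_true)                    # simulated detector readings

x_hat = reconstruct(A, y)
print([round(v, 3) for v in x_hat])
```

With three readings and six unknowns, the problem is underdetermined: the estimate reproduces the detector readings closely but need not match the true spectrum. That gap is why stronger priors or learned models of the acquisition process, as the abstract proposes, are needed for accurate recovery.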