Schedule

November 10

9:00 - Opening Remarks

H.V. Jagadish

Director, MIDAS | Professor of Electrical Engineering and Computer Science

9:05 - Keynote: Data Feminism

As data are increasingly mobilized in the service of governments and corporations, their unequal conditions of production, their asymmetrical methods of application, and their unequal effects on both individuals and groups have become increasingly difficult for data scientists–and others who rely on data in their work–to ignore. But it is precisely this power that makes it worth asking: “Data science by whom? Data science for whom? Data science with whose interests in mind?” These are some of the questions that emerge from what we call data feminism, a way of thinking about data science and its communication that is informed by the past several decades of intersectional feminist activism and critical thought. Illustrating data feminism in action, this talk will show how challenges to the male/female binary can help to challenge other hierarchical (and empirically wrong) classification systems; it will explain how an understanding of emotion can expand our ideas about effective data visualization, how the concept of invisible labor can expose the significant human efforts required by our automated systems, and why the data never, ever “speak for themselves.” The goal of this talk, as with the project of data feminism, is to model how scholarship can be transformed into action: how feminist thinking can be operationalized in order to imagine more ethical and equitable data practices.

Catherine D’Ignazio – Assistant Professor, Urban Science & Planning, MIT; Director, Data + Feminism Lab, MIT

Catherine D’Ignazio is a scholar, artist/designer and hacker mama who focuses on feminist technology, data literacy and civic engagement. She has run reproductive justice hackathons, designed global news recommendation systems, created talking and tweeting water quality sculptures, and led walking data visualizations to envision the future of sea level rise. With Rahul Bhargava, she built the platform Databasic.io, a suite of tools and activities to introduce newcomers to data science. Her 2020 book from MIT Press, Data Feminism, co-authored with Lauren Klein, charts a course for more ethical and empowering data science practices. Her research at the intersection of technology, design & social justice has been published in the Journal of Peer Production, the Journal of Community Informatics, and the proceedings of Human Factors in Computing Systems (ACM SIGCHI). Her art and design projects have won awards from the Tanne Foundation, Turbulence.org and the Knight Foundation and exhibited at the Venice Biennial and the ICA Boston. D’Ignazio is an Assistant Professor of Urban Science and Planning in the Department of Urban Studies and Planning at MIT. She is also Director of the Data + Feminism Lab which uses data and computational methods to work towards gender and racial equity, particularly in relation to space and place.

 

Lauren Klein – Associate Professor, English, Quantitative Theory & Methods, Emory University

Lauren Klein is an associate professor in the departments of English and Quantitative Theory & Methods at Emory University, where she also directs the Digital Humanities Lab. Before moving to Emory, she taught in the School of Literature, Media, and Communication at Georgia Tech. Klein works at the intersection of digital humanities, data science, and early American literature, with a research focus on issues of gender and race. She has designed platforms for exploring the contents of historical newspapers, recreated forgotten visualization schemes with fabric and addressable LEDs, and, with her students, cooked meals from early American recipes and then visualized the results. In 2017, she was named one of the “rising stars in digital humanities” by Inside Higher Ed. She is the author of An Archive of Taste: Race and Eating in the Early United States (University of Minnesota Press, 2020) and, with Catherine D’Ignazio, Data Feminism (MIT Press, 2020). With Matthew K. Gold, she edits Debates in the Digital Humanities, a hybrid print-digital publication stream that explores debates in the field as they emerge. Her current project, Data by Design: An Interactive History of Data Visualization, 1786-1900, was recently funded by an NEH-Mellon Fellowship for Digital Publication.

10:10 - Research Talks Session 1

Bhramar Mukherjee – Professor and Chair, Biostatistics
The Testing Paradox for COVID-19

Reported case counts for coronavirus are riddled with data errors, namely misclassification of test results and selection bias in who got tested. The number of covert or unascertained infections is large across the world. How can one determine optimal testing strategies with such imperfect data? In this talk, we propose an optimization algorithm for allocating diagnostic and surveillance tests when the objective is to estimate the true population prevalence or to detect an outbreak. Infectious disease models and survey sampling techniques are used jointly to derive these strategies.
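
The allocation algorithm itself is the subject of the talk; as a simple illustration of the misclassification problem it addresses, the classic Rogan-Gladen estimator below corrects an apparent prevalence for imperfect test sensitivity and specificity (the numbers are hypothetical, and this is not the speaker's method):

```python
def corrected_prevalence(apparent_prev, sensitivity, specificity):
    """Rogan-Gladen estimator: adjust the fraction of positive tests
    for known test sensitivity and specificity."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("test must be better than random")
    est = (apparent_prev + specificity - 1.0) / denom
    # Clamp to [0, 1]: sampling noise can push the raw estimate outside.
    return min(max(est, 0.0), 1.0)

# Hypothetical numbers: 8% of tests positive, 85% sensitivity, 98% specificity.
print(corrected_prevalence(0.08, 0.85, 0.98))  # ~0.072, below the raw positive rate
```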


Quan Nguyen – Research Fellow, School of Information
Students’ mobility patterns on campus and the implications for the recovery of campus activities post-pandemic

This research project uses location data gathered from WiFi access points on campus to model the mobility patterns of students in order to inform the planning of educational activities that can minimize the transmission risk.

The first aim is to understand the general mobility patterns of students on campus and to identify physical spaces associated with a high risk of transmission. For example, we can extract insights from WiFi data about which locations are busiest at which times of day, how much time is typically spent at each location, and how these mobility patterns change over time. The second aim is to understand how students share physical spaces on campus (e.g., attending a lecture, meeting in the same room, sharing the same dorm). Students are presumably in close proximity when they are connected to the same WiFi access point. We model a student-to-student network from their co-location activities and use its network centrality measures as proxies for transmission risk (i.e., students in the center of a network would have a higher chance of being exposed to COVID-19 than those in the periphery). We then correlate network centrality measures with academic information (e.g., class schedule, course enrollment, study major, year of study, gender, ethnicity) to determine whether certain features of the academic record are related to transmission risk. For example, we can identify which groups of students are more vulnerable to potential infection because of their high network centrality. Insights from this research project will inform the University of Michigan’s strategies for the recovery of educational activities post-pandemic with empirical evidence of students’ mobility patterns on campus as well as factors associated with a high risk of transmission.
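
As an illustration of the network construction described above, here is a minimal sketch using networkx; the event tuples, time resolution, and co-location rule are hypothetical stand-ins for the actual WiFi logs:

```python
import itertools
from collections import defaultdict
import networkx as nx

# Hypothetical WiFi log: (student_id, access_point, hour) tuples.
events = [("s1", "ap_lib", 9), ("s2", "ap_lib", 9), ("s3", "ap_lib", 9),
          ("s2", "ap_dorm", 21), ("s4", "ap_dorm", 21)]

# Group students seen at the same access point in the same hour.
colocated = defaultdict(set)
for student, ap, hour in events:
    colocated[(ap, hour)].add(student)

# Build the student-to-student network; edge weight counts co-location events.
G = nx.Graph()
for students in colocated.values():
    for a, b in itertools.combinations(sorted(students), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

# Centrality as a rough proxy for exposure risk.
print(nx.degree_centrality(G))
```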


Qiushi Yu – Ph.D. student, Political Science
Modeling the Perceived Truthfulness of Public Statements on COVID-19: A New Model for Pairwise Comparisons of Objects with Multidimensional Latent Attributes

What is more important for how individuals perceive the truthfulness of statements about COVID-19: a) the objective truthfulness of the statements, or b) the partisanship of the individual and the partisanship of the people making the statements? To answer this question, we develop a novel model for pairwise comparison data that allows for a richer structure of both the latent attributes of the objects being compared and rater-specific perceptual differences than standard models do. We use the model to analyze survey data that we collected in the summer of 2020. This survey asked respondents to compare the truthfulness of pairs of statements about COVID-19. These statements were taken from the fact-checked statements on https://www.politifact.com. We thus have an independent measure of the truthfulness of each statement. We find that the actual truthfulness of a statement explains very little of the variability in individuals’ perceptions of truthfulness. Instead, we find that the partisanship of the speaker and the partisanship of the rater account for the majority of the variation in perceived truthfulness, with statements made by co-partisans being viewed as more truthful.
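
For readers unfamiliar with pairwise comparison models, the sketch below fits a standard Bradley-Terry model, the classical baseline that the talk's model generalizes; it has a single latent score per statement and no rater-specific terms, and the data are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical pairwise data: (i, j, y) where y = 1 if statement i was
# rated more truthful than statement j.
pairs = [(0, 1, 1), (0, 2, 1), (1, 2, 0), (2, 0, 0), (1, 0, 0)]
n_items = 3

def neg_log_lik(theta):
    # Standard Bradley-Terry: P(i beats j) = sigmoid(theta_i - theta_j).
    ll = 0.0
    for i, j, y in pairs:
        p = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
        ll += y * np.log(p) + (1 - y) * np.log(1 - p)
    return -ll

res = minimize(neg_log_lik, np.zeros(n_items))
print(res.x)  # latent "perceived truthfulness" scores, identified up to a shift
```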


Ivo Dinov – Professor, HBBS/SoN, DCMB/SoM, MIDAS
Computational Neuroscience, Time Complexity, and Spacetime Analytics

The proliferation of digital information in all human experiences presents difficult challenges and offers unique opportunities for managing, modeling, analyzing, interpreting, and visualizing heterogeneous data. There is a substantial need to develop, validate, productize, and support novel mathematical techniques, advanced statistical computing algorithms, transdisciplinary tools, and effective artificial intelligence apps.

Spacekime analytics is a new technique for modeling high-dimensional longitudinal data, such as functional magnetic resonance imaging (fMRI). This approach relies on extending the notions of time, events, particles, and wavefunctions to complex-time (kime), complex-events (kevents), data, and inference-functions, respectively. This talk will illustrate how the kime-magnitude (longitudinal time order) and kime-direction (phase) affect the subsequent predictive analytics and the induced scientific inference. The mathematical foundation of spacekime calculus reveals various statistical implications, including inferential uncertainty and a Bayesian formulation of spacekime analytics. Complexifying time allows the lifting of all commonly observed processes from the classical 4D Minkowski spacetime to a 5D spacetime manifold, where a number of interesting mathematical problems arise.

Spacekime analytics transforms time-varying data, such as time-series observations, into higher-dimensional manifolds representing complex-valued and kime-indexed surfaces (kime-surfaces). This process uncovers some of the intricate structure in high-dimensional data that may be intractable in the classical space-time representation of the data. In addition, the spacekime representation facilitates the development of innovative data science analytical methods for model-based and model-free scientific inference, derived computed phenotyping, and statistical forecasting. Direct neuroscience applications of spacekime analytics will be demonstrated using simulated data and clinical observations (e.g., UK Biobank).


Ziyou Wu – PhD student, Electrical and Computer Engineering, Bio-inspired Robotics Dynamical Systems Lab
Challenges in dynamic mode decomposition

Dynamic Mode Decomposition (DMD) is a powerful tool in extracting spatio-temporal patterns from multi-dimensional time series. DMD takes in time series data and computes eigenvalues and eigenvectors of a finite-dimensional linear model that approximates the infinite-dimensional Koopman operator which encodes the dynamics. DMD is used successfully in many fields: fluid mechanics, robotics, neuroscience, and more. Two of the main challenges remaining in DMD research are noise sensitivity and issues related to Krylov space closure when modeling nonlinear systems. In our work, we encountered great difficulty in reconstructing time series from multilegged robot data. These are oscillatory systems with slow transients, which decay only slightly faster than a period.
Here we present an investigation of possible sources of difficulty by studying a class of systems with linear latent dynamics which are observed via multinomial observables. We explore the influences of dataset metrics, the spectrum of the latent dynamics, the normality of the system matrix, and the geometry of the dynamics. Our numerical models include system and measurement noise. Our results show that even for these very mildly nonlinear conditions, DMD methods often fail to recover the spectrum and can have poor predictive ability. We show that for a system with a well-conditioned system matrix, having a dataset with more initial conditions and shorter trajectories can significantly improve the prediction. With a slightly ill-conditioned system matrix, a moderate trajectory length improves the spectrum recovery. Our work provides a self-contained framework for analyzing noise and nonlinearity, and gives generalizable insights into dataset properties for DMD analysis.
Work was funded by ARO MURI W911NF-17-1-0306 and the Kahn Foundation.
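
For orientation, here is a minimal sketch of exact DMD on a toy linear system; the SVD-based algorithm shown is the textbook formulation, not the authors' specific experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a slowly decaying oscillator observed with measurement noise.
A_true = np.array([[0.98, -0.10], [0.10, 0.98]])
X = np.empty((2, 200))
X[:, 0] = [1.0, 0.0]
for k in range(199):
    X[:, k + 1] = A_true @ X[:, k]
X += 0.01 * rng.standard_normal(X.shape)  # measurement noise

# Exact DMD: fit a linear map between time-shifted snapshot matrices.
X1, X2 = X[:, :-1], X[:, 1:]
U, s, Vh = np.linalg.svd(X1, full_matrices=False)
A_tilde = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / s)
eigvals, eigvecs = np.linalg.eig(A_tilde)
modes = X2 @ Vh.conj().T @ np.diag(1.0 / s) @ eigvecs  # DMD modes

print(np.abs(eigvals))  # compare to |eig(A_true)| ~ 0.985
```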

12:45 - Poster Session

The 2020 Symposium’s poster sessions will be hosted via the College of Engineering’s CareerFair+ tool. A direct link to this platform will be made available in the coming weeks.

View list of poster titles and topics

14:45 - Mini-Workshops

Mini-workshop topics:

  • Agent-based modeling and systemic racism
  • Data Science and Natural Language Processing to find rare classes of entities from text
  • Introduction to Python for community members and K-12 teachers and students
  • Scrubbing and cleaning of sensitive data
  • Stitching Together the Fabric of 21st Century Social Science
  • The state of the art in Automated and Semi-Automated Video Coding

Agent-based modeling and systemic racism

Lead Presenter: Holly Hartman, PhD candidate, Biostatistics, University of Michigan

In this workshop, participants will gain a better understanding of systemic bias and how algorithms may continue to promote inequity. Participants will learn about agent-based methods, a tool that can be used to examine algorithmic fairness. There will be opportunities to brainstorm ideas for new research projects within the participants’ fields.
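
As a flavor of what such methods look like, here is a hypothetical toy agent-based model (not the workshop's materials) in which two groups with identical underlying ability face a scoring rule with a small systematic penalty that compounds over repeated allocation rounds:

```python
import random

random.seed(1)

# Two groups, equal underlying ability, but a small scoring penalty for group B.
agents = [{"group": g, "score": 50.0} for g in ("A", "B") for _ in range(500)]

for _ in range(20):
    for a in agents:
        merit = random.gauss(0, 1)
        bias = -0.2 if a["group"] == "B" else 0.0  # small systemic penalty
        # Agents above a cutoff get a resource that eases future rounds.
        if a["score"] / 50.0 + merit + bias > 1.0:
            a["score"] += 1.0

for g in ("A", "B"):
    mean = sum(a["score"] for a in agents if a["group"] == g) / 500
    print(g, round(mean, 1))  # a gap emerges and compounds despite equal ability
```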

Data Science and Natural Language Processing to find rare classes of entities from text

Lead Presenter: VG Vinod Vydiswaran, Assistant Professor, Learning Health Sciences and School of Information, University of Michigan

Natural language processing (NLP) and data science methods, including recently popular deep learning-based approaches, can unlock information from narrative text and have received great attention in the medical domain. Many NLP methods have been developed and have shown promising results in various information extraction tasks, especially for rare classes of named entities. These methods have also been successfully applied to facilitate clinical research. In this workshop, we will highlight some methods and technologies to identify rare concepts and entities in text in the medical domain as well as other “open” domains.
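
As a baseline illustration only, the snippet below runs off-the-shelf named entity recognition with spaCy; rare medical entity classes typically require domain-specific models or fine-tuning, which this sketch omits, and the example sentence is invented:

```python
import spacy

# Off-the-shelf NER as a starting point; assumes the small English model
# has already been downloaded (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Patient was started on 40 mg of atorvastatin at Michigan Medicine "
          "on March 3, 2020.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. QUANTITY, ORG, DATE
```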

Introduction to Python for community members and K-12 teachers and students

Lead Presenter: Fred Feng, Assistant Professor, Industrial and Manufacturing Systems Engineering, University of Michigan-Dearborn

This hands-on workshop is tailored to audiences who do not have prior programming experience. The first half of the workshop covers Python programming basics, and the second half covers performing data analysis and visualization in Python with real-world data. Attendees are encouraged to follow along with the examples on their own computers. We will use an online browser-based environment (Google Colab), so no software installation is required. Attendees will need a Google account and will sign in to their browser in order to use this cloud-based tool during the workshop.
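
A taste of the second half of the workshop, as a minimal sketch; the CSV file name and columns are hypothetical placeholders for whatever real-world dataset is used:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset: daily bicycle counts (any small CSV works the same way).
df = pd.read_csv("bike_counts.csv", parse_dates=["date"])

print(df.head())      # first rows
print(df.describe())  # summary statistics

# A simple time-series plot.
df.plot(x="date", y="count", figsize=(8, 3))
plt.title("Daily bicycle counts")
plt.tight_layout()
plt.show()
```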

Scrubbing and cleaning of sensitive data

Lead Presenter: Jonathan Reader, Programmer/Data Analyst, Neurology, University of Michigan
Co-Presenters:
Nicolas May, Data Systems Manager, Neurology, University of Michigan
Kelly Bakulski, Research Assistant Professor, School of Public Health, University of Michigan

Before analysis, data must be retrieved, scrubbed of identifiable information, cleaned (e.g., missing data addressed, data reshaped appropriately), and delivered. Using biomedical and transportation datasets as examples of how this generalizable process works, this workshop will walk attendees through a real-world pipeline used to process and deliver datasets. Documentation and code will be made available through GitLab to allow for coding along with the demonstration. As a result of this workshop, attendees will leave with a practical template for implementing their own data science pipeline.
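
A hedged sketch of what the core steps of such a pipeline might look like in pandas; the file and column names are hypothetical, and the workshop's actual GitLab code may differ:

```python
import pandas as pd

# Hypothetical raw extract with direct identifiers and messy fields.
df = pd.read_csv("raw_visits.csv")

# 1. Scrub: replace direct identifiers with a stable study ID, then drop them.
df["study_id"] = pd.factorize(df["mrn"])[0]
df = df.drop(columns=["name", "mrn", "street_address"])

# 2. Clean: address missing data and coerce types.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df = df.dropna(subset=["age"])  # or impute, depending on the analysis
df["visit_date"] = pd.to_datetime(df["visit_date"])

# 3. Reshape: one row per study_id per visit, ordered for analysis.
tidy = df.sort_values(["study_id", "visit_date"])

# 4. Deliver.
tidy.to_csv("deliverable.csv", index=False)
```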

Stitching Together the Fabric of 21st Century Social Science

Presentations:
Mike Mueller-Smith, Assistant Professor, Department of Economics, University of Michigan: “The Criminal Justice Administrative Records System: Assessing the Footprint of the U.S. Criminal Justice System”
David Johnson, Director and Research Professor, Panel Study of Income Dynamics and Survey Research Center, University of Michigan: “Building America’s Family Tree: The Panel Study of Income Dynamics”
Trent Alexander, Associate Director and Research Professor, ICPSR, University of Michigan: “Creating a New Census-based Longitudinal Infrastructure”
Joelle Abramowitz, Assistant Research Scientist, Survey Research Center, University of Michigan: “The Census-Enhanced Health and Retirement Study: Optimal Probabilistic Record Linkage for Linking Employers in Survey and Administrative Data”

Today’s pressing questions of social science and public policy demand an unprecedented degree of data scope and integration as we recognize the cross-cutting dynamics of economics, political science, sociology, demography, and psychology. This panel features four UM researchers who are pushing the frontier of data construction and linkage in coordination with partners at the U.S. Census Bureau.

The state of the art in Automated and Semi-Automated Video Coding 

Lead Presenter: Jason Corso, Professor, Electrical Engineering and Computer Science, University of Michigan
Co-Presenters:
Maggie Levenstein, Director and Research Professor, ICPSR and School of Information, University of Michigan
Susan Jekielek, Assistant Research Scientist, ICPSR, University of Michigan
Donald Likosky, Professor, Department of Cardiac Surgery, University of Michigan

Video is being acquired at an alarming rate across domains, including social research, healthcare, entertainment, sporting and more.  The ability to code this video—to attribute certain properties, labels, and other annotations—in support of analytical, domain-relevant questions is critical; without automated methods, human coding is required.  Human coding, however, is laborious, expensive, not repeatable, and, worse, often error prone.  Video coding, an area within artificial intelligence and computer vision, seeks automated and semi-automated methods to support more effective and robust coding.  This workshop will review the state of the art in video coding from a capabilities, limitations and tooling perspective and present real-world use cases.

November 11

9:00 - Research Talks Session 2

Yajuan Si – Research Assistant Professor, Survey Research Center, Institute for Social Research
Novel Tools to Increase the Reliability and Reproducibility of Population Genetics Research

Advances in population genetic research have the potential to drive numerous important developments in the science of population dynamics. The interplay of micro-level biology and macro-level social sciences documents gene–environment–phenotype interactions and allows us to examine how genetics relates to child health and wellbeing. However, traditional genetics research is based on nonrepresentative samples that deviate from the target population, such as convenience and volunteer samples. This lack of representativeness may distort association studies. Recent findings have provoked concern about misinterpretation, irreproducibility and lack of generalizability, exemplifying the need to leverage survey research with genetics for population-based research. This project is motivated by the research team’s collaborative work on the Fragile Families and Child Wellbeing Study and the Adolescent Brain Cognitive Development Study, both of which present these common problems in population genetics studies, and aims to advance the integration of genetic science into population dynamics research. The project will evaluate sample selection effects, identify population heterogeneity in polygenic score analysis, and develop strategies to adjust for selection bias in the association studies of educational attainment, cognition status and substance use for child health and wellbeing. This interdisciplinary project will strengthen the validity and generalizability of population genetics research, deepen new understandings of human behavior and facilitate advances in population science.


Christopher Gillies – Assistant Research Scientist, Emergency Medicine
An end-to-end deep learning system for rapid analysis of the breath metabolome with applications in critical care illness and beyond

The metabolome is the set of low-molecular-weight metabolites, and its quantification represents a summary of the physiological state of an organism. Metabolite concentration levels in biospecimens are important for many critical care illnesses like sepsis and acute respiratory distress syndrome (ARDS). Sepsis is responsible for 35% of in-hospital deaths, and ARDS has a mortality rate of 40%. Missing data is a common challenge in metabolomics datasets. Many metabolomics investigators impute fixed values for missing metabolite concentrations, and this imputation approach leads to lower statistical power, biased parameter estimates, and reduced prediction accuracy. Certain applications of metabolomics data, like breath analysis by gas chromatography for the prediction or detection of ARDS, can be done without the quantification of individual metabolites. This would circumvent the quantification step, eliminating the missing data problem. Our team has developed a rapid gas chromatography breath analyzer, whose use has been hampered by missing data, a time-consuming process of breath signature alignment, and the subsequent quantification of metabolites across patients. Analyzing the breath signal directly could eliminate these challenges. End-to-end deep learning systems are neural networks that operate directly on a raw data source and make a prediction directly for the target application. These systems have been successful in diverse fields from speech recognition to medicine. We envision an end-to-end deep learning system that leverages transfer learning, from the collection of many healthy samples, to rapidly multiply the applications of our breath analyzer. The end-to-end deep learning system will enhance our breath analyzer so it can be used more efficiently in settings ranging from the intensive care unit to the battlefield, to identify patients or soldiers with critical illnesses like sepsis and ARDS and to monitor longitudinal changes in breath metabolites.
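
One plausible shape for such an end-to-end model, sketched in PyTorch; the architecture, signal length, and output are illustrative assumptions, not the team's actual system:

```python
import torch
import torch.nn as nn

# Sketch of an end-to-end model that maps a raw chromatography signal
# (one channel, hypothetical length 4096) straight to a diagnosis logit,
# skipping per-metabolite quantification entirely.
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, stride=2), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=9, stride=2), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=9, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),  # global pooling -> length-independent features
    nn.Flatten(),
    nn.Linear(64, 1),         # e.g. P(ARDS) after a sigmoid
)

x = torch.randn(8, 1, 4096)   # batch of 8 raw breath signals
print(model(x).shape)         # torch.Size([8, 1])
```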


Alauddin Ahmed – Assistant Research Scientist, Mechanical Engineering
Machine learning-guided equations for the on-demand prediction of natural gas storage capacities of materials for vehicular applications

Transportation is responsible for nearly one-third of the world’s carbon dioxide (CO2) emissions from burning fossil fuels. While we dream of zero-carbon vehicles, future projections suggest little decline in fossil fuel consumption by the transportation sector until 2050. Therefore, ‘bending the curve’ of CO2 emissions prompts the adoption of low-cost, reduced-emission alternative fuels. Natural gas (NG), the most abundant fossil fuel on earth, is such an alternative, with a nearly 25% lower carbon footprint and a lower price compared to its gasoline counterpart. However, the widespread adoption of natural gas as a vehicular fuel is hindered by the scarcity of high-capacity, lightweight, low-cost, and safe storage systems. Recently, materials-based natural gas storage for vehicular applications has become one of the most viable options. In particular, nanoporous materials (NPMs) are in the spotlight of the U.S. Department of Energy (DOE) because of their exceptional energy storage capacities. However, the number of such NPMs is nearly infinite, and it is unknown, a priori, which materials would have the expected natural gas storage capacity. Searching for a high-performing material is therefore like ‘finding a needle in a haystack’, which slows the pace of materials discovery relative to growing technological demand. Here we present a novel approach of developing machine learning-guided equations for the on-demand prediction of energy storage capacities of NPMs using a few physically meaningful structural properties. These equations give users the ability to calculate the energy storage capacity of an arbitrary NPM rapidly using only paper and pencil. We show the utility of these equations by predicting the NG storage of over 500,000 covalent-organic frameworks (COFs), a class of NPMs. We discovered a COF with record-setting NG storage capacity, surpassing the unmet target set by the DOE. In principle, the data-driven approach presented here may be relevant to other disciplines, including science, engineering, and health care.
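
To illustrate how a fitted model becomes a paper-and-pencil equation, the sketch below fits a linear model on invented structural descriptors; the descriptors, coefficients, and data are hypothetical, and the actual work may use a different functional form:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical descriptors for 1,000 nanoporous materials:
# surface area (m^2/g), pore volume (cm^3/g), crystal density (g/cm^3).
X = np.column_stack([rng.uniform(500, 7000, 1000),
                     rng.uniform(0.2, 3.0, 1000),
                     rng.uniform(0.2, 1.2, 1000)])
# Hypothetical storage capacities (the real work uses simulated/measured data).
y = 0.01 * X[:, 0] + 40 * X[:, 1] - 30 * X[:, 2] + rng.normal(0, 5, 1000)

reg = LinearRegression().fit(X, y)

# The fitted model *is* the on-demand equation:
# capacity ~ b0 + b1*area + b2*pore_vol + b3*density
print(reg.intercept_, reg.coef_)
```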


David Fouhey – Assistant Professor, UM EECS
Fusing Computer Vision And Space Weather Modeling

Space weather has impacts on Earth ranging from rare, immensely disruptive events (e.g., electrical blackouts caused by solar flares and coronal mass ejections) to more frequent impacts (e.g., satellite GPS interference from fluctuations in the Earth’s ionosphere caused by rapid variations in the solar extreme UV emission). Earth-impacting events are driven by changes in the Sun’s magnetic field; we now have myriad instruments capturing petabytes’ worth of images of the Sun at a variety of wavelengths, resolutions, and vantage points. These data present opportunities for learning-based computer vision, since the massive, well-calibrated image archive is often accompanied by physical models. This talk will describe some of the work that we have been doing to start integrating computer vision and space physics by learning mappings from one image or representation of the Sun to another. I will center the talk on a new system we have developed that emulates parts of the data processing pipeline of the Solar Dynamics Observatory’s Helioseismic and Magnetic Imager (SDO/HMI). This pipeline produces data products that help study, and serve as boundary conditions for, solar models of the energetic events alluded to above. Our deep-learning-based system emulates a key component hundreds of times faster than the current method, potentially opening doors to new applications in near-real-time space weather modeling. In keeping with the goals of the symposium, however, I will focus on some of the benefits close collaboration has enabled in terms of understanding how to frame the problem, measure the success of the model, and even set up the deep network.
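
In the spirit of learning mappings from one solar observable to another, here is a toy convolutional image-to-image network; it is a minimal sketch, not the SDO/HMI emulator, and all shapes and the input/output pairing are hypothetical:

```python
import torch
import torch.nn as nn

# Toy image-to-image network: map one representation of the Sun to another
# (the actual HMI pipeline emulator is a far larger, purpose-built model).
net = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=3, padding=1),  # e.g. intensity -> magnetogram
)

x = torch.randn(4, 1, 256, 256)  # batch of input solar images
print(net(x).shape)              # torch.Size([4, 1, 256, 256])
```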


Oleg Gnedin – Professor, Department of Astronomy, LSA
Decoding the Environment of the Most Energetic Sources in the Universe

Astrophysics has always been at the forefront of data analysis. It has led to advancements in image processing and numerical simulations. The coming decade is bringing qualitatively new and larger datasets than ever before. The next generation of observational facilities will produce an explosion in the quantity and quality of data for the most distant sources, such as the first galaxies and first quasars. Quasars are the most energetic objects in the universe, reaching luminosities up to 10^14 times that of the Sun. Their emission is powered by giant black holes that convert matter into energy according to Einstein’s famous equation E = mc^2. The largest progress will occur in quasar spectroscopy. Detailed measurement of the spectrum of quasar light, as it is emitted near the central black hole and partially absorbed by clouds of gas on the way to the observer on Earth, allows for a particularly powerful probe of the quasar environment. Because the spectra of different chemical elements are unique, spectroscopy allows us to study not only the overall properties of matter, such as density and temperature, but also the detailed chemical composition of the intervening matter. However, the interpretation of these spectra is made very challenging by the many sources contributing to the absorption of light. In order to take full advantage of this new window into the nature of supermassive black holes, we need a detailed theoretical understanding of the origin of quasar spectral features. In a MIDAS PODS project, we are applying machine learning to model and extract such features. We are training the models using data from state-of-the-art numerical simulations of the early universe. This approach is fundamentally different from traditional astronomical data analysis. We have only started learning what information can be extracted and are still looking for a new framework to interpret these data.

10:40 - Poster awards

Somangshu Mukherji

Assistant Professor, Music Theory, School of Music, Theatre, and Dance

Trisha Fountain

Education Program Manager, MIDAS

11:00 - Fireside Chat: Data Science as both a Science and a Force for Social Change

Eric Horvitz – Chief Scientific Officer, Microsoft

Moderator: H.V. “Jag” Jagadish –  Director, MIDAS

12:00 - Closing Remarks, networking rooms open for additional discussion

H.V. Jagadish

Director, MIDAS | Professor of Electrical Engineering and Computer Science

Keynote Speakers

Catherine D’Ignazio

Assistant Professor, Urban Science & Planning
Director, Data + Feminism Lab
Department of Urban Studies & Planning, MIT

Lauren Klein

Associate Professor, English, Quantitative Theory and Methods
Emory University

Eric Horvitz

Technical Fellow and Chief Scientific Officer
Microsoft

Program Committee

Libby Hemphill
School of Information

Justin Johnson
Computer Science and Engineering

Danai Koutra
Computer Science and Engineering

Jing Liu
MIDAS

Christopher Miller
Astronomy

Sam Mukherji
Music Theory

Arvind Rao
Computational Medicine and Bioinformatics, and Radiation Oncology

Zhenke Wu
Biostatistics

Symposium Sponsors

External Partners Supporting the Symposium