SYMPOSIUM PROGRAM
NOV. 13, WEDNESDAY
Deep Learning Workshops by Amazon and Google
More workshops planned in the spring of 2020.
Google Deep Learning Workshop
8:30 am — 1:30 pm
Weiser Hall 10th floor
Amazon Deep Learning Workshop (Only for University of Michigan faculty, staff and students), two identical sessions.
Session 1: 8:00 am — noon
Session 2: 1 — 5 pm
Rackham Building 4th floor East Conference Room
NOV. 14, THURSDAY
8:45 am — Keynote 1: Rayid Ghani, Carnegie Mellon University
10:00 am — Panel Discussion: Big Data and Political Science
Ceren Budak, Assistant Professor, School of Information, University of Michigan
Lisa Singh, Professor of Computer Science, Georgetown University
Stuart Soroka, Professor, Communication and Media, and Political Science, University of Michigan
Michael Traugott, Professor Emeritus, Political Science, Communication, Center for Political Studies, University of Michigan
Moderator: Rayid Ghani, Carnegie Mellon University
11:00 am — Research Talks, Session 1
11:00am – A Flexible Generative Framework for Graph-based Semi-supervised Learning (Weijing Tang, Statistics, University of Michigan)
Abstract: We consider a family of problems concerned with making predictions for the majority of unlabeled, graph-structured data samples based on a small proportion of labeled examples. Relational information among the data samples, often encoded in the graph or network structure, is shown to be helpful for these semi-supervised learning tasks. Conventional graph-based regularization methods and recent graph neural networks do not fully leverage the interrelations between the features, the graph, and the labels. We propose a flexible generative framework for graph-based semi-supervised learning, which models the joint distribution of the node features, labels, and the graph structure. Borrowing insights from random graph models in the network science literature, this joint distribution can be instantiated using various distribution families. For the inference of missing labels, we exploit recent advances in scalable variational inference techniques to approximate the Bayesian posterior. We conduct thorough experiments on benchmark datasets for graph-based semi-supervised learning. Results show that the proposed methods outperform state-of-the-art models under most settings.
11:15am – Why Scientists Cite What They Cite (Misha Teplitskiy, School of Information, University of Michigan)
Abstract: Although citations and related metrics like the H-index are widely used in academia to evaluate research and allocate resources, the referencing decisions on which they are based are poorly understood. In particular, it is unclear whether authors reference works that influenced them most (the “normative” view) or those they believe the readers will value most (the “social constructivist” view). We present preliminary results from a pilot survey of authors of scientific articles in which we asked them about specific references they have made. We find that authors (1) know the content of the papers they cite less well when the references are to famous (highly cited) papers and (2) are influenced (per-capita) equally by highly and sparsely cited works. In an experiment in which authors were asked about references with and without signals of the references’ ‘status’ (e.g., how highly cited the reference is), we find that positive correlations between citations and perceptions of the quality of a paper, like its validity or significance, are explained by status signals. These findings are inconsistent with the normative view and support the social constructivist view, requiring a radical reassessment of the role of citation in scientific practice. Note: we are finishing analysis of the full-scale survey and experiment, and will be ready to report results from this larger effort, involving 8000 scientists, by the time of presentation.
11:30am – Constructing Expressive Relational Queries with Dual-Specification Synthesis (Christopher Baik, Computer Science and Engineering, University of Michigan)
Abstract: Querying a relational database is difficult because it requires the user to have a grasp of the relational model, the SQL language, and the schema at hand. While natural language interfaces (NLIs) and query-by-example (QBE) are promising alternatives, they suffer from various challenges. Natural language queries (NLQs) are often ambiguous, even for human interpreters, and current QBE approaches require either low-complexity queries, user schema knowledge, exact example tuples from the user, or a closed-world assumption to be tractable. Consequently, we propose dual-specification query synthesis, which consumes both an NLQ and an optional QBE-like table sketch query that enables users to express varied levels of knowledge. We introduce the Duoquest system, which leverages guided partial query enumeration to efficiently explore the space of possible queries. We demonstrate in experiments on the prominent Spider benchmark that Duoquest substantially outperforms state-of-the-art NLI and QBE approaches.
11:45am – Extracting medication information and adverse drug events from clinical narratives (V. G. Vinod Vydiswaran, School of Information, University of Michigan)
Abstract: An accurate record of patient medication information is a critical component of clinical care. Changes to medications, including reasons for prescribing and adverse drug events, are often recorded in plain text narratives. While data science approaches are widely studied in medical informatics, identifying medication concepts and adverse drug events from clinical narratives remains a challenging natural language processing task.
12:00pm – Data science challenges of sparse integer counts: our experience with single-cell sequencing data (Jun Z. Li, Department of Human Genetics, University of Michigan)
Abstract: High-capacity DNA sequencing has replaced many traditional assays used in biomedical research, because sequencing is often more efficient, accurate, and unbiased. We now routinely apply transcript counting to profile mRNA, or use protein occupancy of DNA to quantify gene regulation. While a typical data matrix still involves p features in n samples, sequencing-based assays produce counts data: positive integers and zeros. Classic statistical learning methods deal with real-valued data, thus the current literature on distance metrics, matrix decomposition, normalization, or imputation is only suitable for data from microarrays or deep RNA sequencing, which yields large integers. The recent arrival of single-cell sequencing technologies, however, brings low integers due to shallow sampling of thousands of cells, causing n to be on par with (or greater than) p, sometimes with inflated zeros beyond what is expected by down-sampling. The notion of sparsity takes on a dual meaning: low-rank approximation of data structure, and sparse sampling with unusual amounts of missing data. We and colleagues in the Michigan Center for Single-Cell Genomic Data Analytics have attempted to develop new metrics, and revise existing algorithms, to address data science challenges of sparse counts data. We explore clustering; pseudotime alignment; optimal feature selection in multi-class systems (k>2); batch-effect correction; sample assignment in n=1 situations; imputation; and lineage construction with high levels of missing values. We perform benchmarking at scale by using large ensembles of simulated data with controlled structure and difficulties. We also examine data science behavior: the diversity of decision-making styles among analysts.
Sparse counts data are encountered in studies of consumer behavior (purchasing history or online ratings); political science (voting or polling data); patient health records (particularly sparse for healthy individuals); communication signal compression and recovery; citation indices; traffic patterns; and social networks. Our experiences with single-cell data are generalizable to other data science domains.
12:15 pm — Poster Session 1 and lunch in the Assembly Hall, and East and West Conference Rooms
Featuring U-M data science research, and posters from students and postdocs from 20+ leading universities. See full list.
Lunch provided.
2:00 pm — Industry discussion panel: Data Science for the Next Ten Years in the Industry
Dana Budzyn, Co-founder and CEO, UBDI
Richard Lindberg, Quicken Loans
Tony Qin, AI Lead, DiDi Chuxing
Kyle Schmitt, Managing Director, Global Insurance Practice, J. D. Power
Moderator: Ella Atkins, Professor of Aerospace Engineering, University of Michigan
3:15 pm — Keynote 2: Tina Eliassi-Rad, Northeastern University
Title: Just Machine Learning
4:15 pm — Research Talks, Session 2
4:15 pm – Quantitative Assessment of People Mobility and Social Behavior in Public Open Spaces through Deep Learning (Jerome Lynch, Department of Civil and Environmental Engineering, University of Michigan)
Abstract: Public open spaces such as parks and greenways are among the most important features of a dense urban environment. Public open spaces encourage social interaction, promote healthy lifestyles, and drive economic growth. Especially for cities seeking to overcome economic downturns, initial investment in public open spaces is often a necessary step to spur further revitalization. However, park utilization and its benefits are often impossible to quantify precisely. Assessment of park space is often done through visual observation to manually count and map patrons. Computer vision methods applied to camera images collected in public open spaces offer an opportunity to develop an automated approach to observing patron utilization of public open spaces. This study presents the development of a scalable deep learning framework that uses security cameras in parks to detect people and to classify their behavior (e.g., walking, biking, scootering, sitting, talking). A heavily annotated image library termed OPOS (Objects in Public Open Spaces) has been developed consisting of 8K images with nearly 20K instances of park patrons and their various activities. OPOS is used to train a Mask-RCNN detector to automate the detection and classification of park patrons using eleven classes. Mean average precision (mAP) of the framework is well above 75% for all patron classes, and many classes exceed 95%. The cameras used for patron detection are also calibrated through pin-hole camera models to spatially map identified patrons and extract patron trajectories that can be mapped to GIS layers. The approach is deployed in the Detroit Riverfront to automate the processing of security camera feeds to detect and map park users. Trajectories of patrons in the park are used to count the number of patrons using specific amenities and to develop heat maps of where park patrons spend time engaged in social activity.
4:30 pm – PAC Reinforcement Learning without Real-World Feedback (Yuren Zhong, Department of Electrical Engineering and Computer Science, University of Michigan)
Abstract: While reinforcement learning has achieved success in many applications, many state-of-the-art methods require a large number of training samples to find a good policy. In some tasks (e.g., self-driving cars, robotic control), samples are sufficiently costly so as to render existing algorithms infeasible or impractical. One approach to overcoming this challenge, referred to as Sim-to-Real, is to first train an agent in one or more simulated environments before deploying it in the real world. Of course, this solution presents its own challenges, owing to the fact that simulators invariably do not coincide with the real world. Previous works on Sim-to-Real can be classified according to whether the agent does or does not receive feedback in the real world. From a theoretical perspective, the goal of Sim-to-Real is to learn a real-world policy that has a smaller sample complexity than if an agent were trained only in the real world. Intuitively, this gain comes from training on the simulators (perhaps with a much larger sample complexity), together with some modeling assumptions that link the simulators and the real world. Existing theoretical studies have focused on the setting where feedback is received in the real world, and where the dynamics are governed by a conventional Markov decision process (MDP). Our contribution is to develop a theoretical framework, algorithm, and analysis for Sim-to-Real without real-world feedback. Furthermore, we study a type of contextual decision process known as a rich observation MDP (ROMDP), which generalizes an MDP by allowing for policies to be based on a (possibly continuous valued) observation associated with an unseen state variable. We establish a real-world sample-complexity guarantee that is smaller than existing guarantees for learning a single ROMDP. Our modeling assumptions and approach leverage ideas from domain generalization, which allow us to identify environments based on the marginal distribution of observations.
4:45 pm – Irreproducibility in Large-Scale Drug Sensitivity Data (Zoe Rehnberg, Department of Statistics, University of Michigan)
Abstract: Following the release of several large-scale pharmacogenomic studies, the consistency of high-throughput drug sensitivity data across experiments has been widely discussed. While gene expression data are well replicated, only varying levels of moderate to poor concordance have been found for drug sensitivity measures (half-maximal inhibitory concentration [IC50] and area under the dose-response curve [AUC]) in multiple large databases. In this work, we take advantage of detailed raw data to identify factors ranging from data collection to data analysis contributing to the lack of reproducibility in drug sensitivity studies. We find that many different forms of measurement error and the diversity of biological relationships between cell lines and compounds cause difficulties in reliably summarizing drug efficacy. Additionally, we develop a new method of normalizing raw drug response data that accounts for the presence of measurement error and improves agreement between replicates.
5:00 pm – Cadence and Formal Function in Mozart’s Music (Nathan John Martin, School of Music, Theatre, and Dance, University of Michigan)
Abstract: This talk presents preliminary results from the author’s ongoing empirical study of cadential articulations in Mozart’s music. Building on previous research, a corpus of 180 musical works was selected randomly from the whole of Mozart’s œuvre (627 works), and all half-cadential progressions in these works were identified, classified by type, and tagged for their formal location, yielding a total sample of 1002 half cadences. As expected, the study corpus exhibited significant correlations between half-cadence type and formal location. In particular, so-called “expanding 6-8” and “doppia” half cadences tended to appear at major formal divisions; whereas “simple” half cadences, as anticipated, appeared within major formal spans. An initial chi-square test on the 4 x 2 table assembled from the study data (with cadence type as the predicting variable) yielded a significant result (p << 0.01), as did subsequent subgroup analysis (p << 0.01). In examining the most formulaic of classical formal functions, the study provides a first step towards a more fine-grained and empirically validated account of intrinsic formal function. The results are of interest to music theorists both in establishing that intrinsic formal function operates at the smallest (intrathematic) units of musical form and for the methodological novelty of using empirical methods to address music-theoretical questions.
NOV. 15, FRIDAY
8:30 am — Keynote 3: Tanya Berger-Wolf, University of Illinois at Chicago
9:40 am — Announcing the Awardees of the 2019 Propelling Original Data Science Grants
9:45 am — Research Talks, Session 3
9:45 am – Learning Attribute Patterns in High-Dimensional Structured Latent Attribute Models (Yuqi Gu, Department of Statistics, University of Michigan)
Abstract: Structured Latent Attribute Models (SLAMs) are a family of discrete latent variable models widely used in social and biological sciences, including cognitive diagnosis in educational and psychological assessments, epidemiological study of disease etiology, and medical classification and diagnosis. Based on individuals’ observed multivariate responses, the framework of SLAMs enables one to achieve fine-grained inference on individuals’ discrete latent attributes, and also to obtain the latent subgroups of a population, based on the inferred attribute patterns. One challenge in modern applications of SLAMs is that the number of discrete latent attributes could be large, leading to a high-dimensional space for all the possible configurations of the attributes, i.e., high-dimensional latent attribute patterns. In many applications, the number of potential patterns is much larger than the sample size. In such high-dimensional scenarios, existing estimation methods tend to over-select the number of latent patterns, and they may not scale. Moreover, theoretical questions remain open on whether and when the “sparse” latent attribute patterns are identifiable. This paper considers the problem of learning significant attribute patterns from a SLAM with potentially high-dimensional configurations of the latent attributes. Our contributions are threefold. First, we establish mild identifiability conditions that guarantee a SLAM with an arbitrary set of true attribute patterns can be reliably learned from data. Second, we propose a statistically consistent method to perform attribute pattern selection. We establish the theoretical guarantee for selection consistency in the setting of high-dimensional patterns. Third, we propose a fast screening strategy for SLAMs as a preprocessing step, which can scale to a huge number of potential latent attribute patterns, and we establish its sure screening property.
We apply the methodology to two educational datasets and obtain meaningful clusters and knowledge structures of the student populations.
10:00 am – Improving Mild Cognitive Impairment Prediction via Reinforcement Learning (Jiayu Zhou, Computer Science and Engineering, Michigan State University)
Abstract: The search for early biomarkers of mild cognitive impairment (MCI) has been central to Alzheimer’s Disease (AD) and the dementia research community in recent years. While there exist in-vivo biomarkers (e.g., beta-amyloid and tau) that can serve as indicators of pathological progression toward AD, biomarker screenings are prohibitively expensive to scale if widely used among pre-symptomatic individuals in the outpatient setting. Behavioral and social markers such as language, speech, and conversational behaviors reflect cognitive changes that may precede physical changes and offer a much more cost-effective option for preclinical MCI detection, especially if they can be extracted from a non-clinical setting. We developed a prototype AI conversational agent that conducts screening conversations with participants. Specifically, this AI agent must learn to ask the right sequence of questions to distinguish the conversational characteristics of the participants with MCI from those with normal cognition. Using transcribed data obtained from recorded conversational interactions between participants and trained interviewers, which were generated in a recently completed clinical trial, and applying supervised learning models to these data, we developed a novel reinforcement learning (RL) pipeline and a dialogue simulation environment to train an efficient dialogue agent to explore a range of semi-structured questions. Specifically, the agent is trained to sketch a disease-specific lexical probability distribution, and thus to converse in a way that maximizes the diagnosis accuracy and minimizes the number of conversation turns. We evaluate the performance of the proposed RL framework on the MCI diagnosis. The results show that while using only a few turns of conversation, our framework can significantly outperform state-of-the-art supervised learning approaches used in a past study.
Our work presents a step toward using AI to extend clinical care beyond the classical hospital and clinical settings.
10:15 am – Integrating Experimental and Observational Data through Machine Learning (Edward Wu, Statistics, University of Michigan)
Abstract: Two main virtues of randomized experiments are that they (1) do not suffer from confounding and (2) allow for design-based inference, meaning that the physical act of randomization largely justifies the statistical assumptions made. However, sample sizes are often small. Conversely, observational studies typically offer much larger sample sizes at lower costs, but may suffer from confounding. Recent work has sought to integrate “big observational data” with “small but high-quality experimental data” to get the best of both worlds. For example, how can one exploit a large database of electronic health records to improve the accuracy of a small clinical trial? Or, how can one use administrative data on hundreds of thousands of students to improve a small experiment testing the effectiveness of a new educational technique? Similar questions arise across many disciplines. This talk discusses a flexible framework that allows researchers to employ machine learning algorithms to learn from the observational data, and use the resulting models to improve precision in randomized experiments. Importantly, there is no requirement that the machine learning models are “correct” in any sense. The final experimental results rely only on randomization as the basis for inference and are guaranteed to be exactly unbiased. Thus, there is no danger of confounding biases in the observational data leaking over into the experiment. The framework is applied to A/B tests of educational software, using an observational administrative database of previous student achievement.
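Illustrative sketch (not the speaker's framework): the idea that external predictions can sharpen an experimental estimate without biasing it can be shown with a simple residual-adjusted difference in means. Everything below is an assumption made for illustration: the simulated units, the constant treatment effect, the `predictions` list standing in for a model fit on observational data, and the function names. Because the predictions are fixed before treatment is randomized, subtracting them from the outcomes leaves the difference in means unbiased even when the model is poor.

```python
import random
import statistics


def adjusted_difference_in_means(outcomes, treated, predictions):
    """Difference in means computed on residuals (outcome minus prediction)."""
    residuals = [y - p for y, p in zip(outcomes, predictions)]
    treated_res = [r for r, t in zip(residuals, treated) if t]
    control_res = [r for r, t in zip(residuals, treated) if not t]
    return statistics.mean(treated_res) - statistics.mean(control_res)


def simulate(n_units=40, true_effect=2.0, n_draws=2000, seed=0):
    """Re-randomize one hypothetical experiment many times.

    Returns two lists of estimates: the plain difference in means and the
    residual-adjusted one. All data here are simulated, not real.
    """
    rng = random.Random(seed)
    # Hypothetical baseline scores and potential outcomes (constant effect).
    baseline = [rng.gauss(50, 10) for _ in range(n_units)]
    y_control = [b + rng.gauss(0, 1) for b in baseline]
    y_treated = [y + true_effect for y in y_control]
    # Stand-in for an external model: deliberately imperfect predictions
    # that depend only on pre-treatment information.
    predictions = [0.8 * b + 5 for b in baseline]
    no_adjustment = [0.0] * n_units

    plain, adjusted = [], []
    for _ in range(n_draws):
        # Complete randomization: exactly half of the units are treated.
        treated_ids = set(rng.sample(range(n_units), n_units // 2))
        treated = [i in treated_ids for i in range(n_units)]
        observed = [y_treated[i] if treated[i] else y_control[i]
                    for i in range(n_units)]
        plain.append(
            adjusted_difference_in_means(observed, treated, no_adjustment))
        adjusted.append(
            adjusted_difference_in_means(observed, treated, predictions))
    return plain, adjusted
```

Calling `simulate()` returns the two lists of estimates across repeated randomizations. In this setup both lists should average close to the assumed effect, while the adjusted estimates should vary less because the predictions absorb much of the baseline variation.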
10:30 am – Synthetic data method to incorporate external information into a current study (Tian Gu, Department of Biostatistics, University of Michigan)
Abstract: In the big data era, incorporating external summary-level information into a current study has attracted significant interest as a way to improve estimation efficiency. Most existing methods require a specific form for the external information, e.g., estimated coefficients from a correctly specified mean model. We propose a synthetic data approach, which consists of using the auxiliary summary information to create synthetic data observations, and then appending them to the internal data. This combined data is then analyzed to give an estimate of the target parameters. Our method relaxes the requirement on the information that is available externally. A theoretical justification of the method is provided, and it is evaluated in simulation studies. The method is applied to improve models for the risk of prostate cancer. The method’s broad applicability makes it appealing for use across diverse scenarios.
10:45 am – Data-Driven Discovery of Materials for Energy Storage (Alauddin Ahmed, Mechanical Engineering Department, University of Michigan)
Abstract: Energy is behind every technological revolution witnessed by human society, and materials are at the heart of every energy technology. Finding sustainable means for meeting society’s growing energy needs in an environmentally benign way is the most pressing challenge facing mankind. With the projected increase in consumption of gaseous fuel (e.g., hydrogen, natural gas), renewable energy (e.g., solar, wind) and electrochemical energy (e.g., rechargeable batteries), the demand for low-cost, high-capacity, reversible, and efficient energy storage systems has been rising steeply. New technologies based on novel materials will undoubtedly play a crucial role in the emerging areas that crosscut energy and the environment. However, the gap between the increasing demand for low-cost, high-performance materials for energy-related applications and the pace of the traditional trial-and-error approach to discovering such materials is widening day by day. To address this challenge, we developed a data-driven approach to accelerate the discovery and design of novel materials for sustainable energy applications, which involves machine learning, data mining, feature engineering and high-throughput density functional theory and grand canonical Monte Carlo calculations. As a result, we computationally discovered several record-setting materials for hydrogen, methane, and thermal energy storage. Importantly, four record-setting metal-organic frameworks (IRMOF-20, SNU-70, UMCM-9, and PCN-610/NU-100) computationally predicted for hydrogen storage were experimentally confirmed by our Chemistry Department collaborators. Also, we developed design guidelines for targeted synthesis of materials with desirable energy storage capacities.
A major outreach of this endeavor is the creation of the world’s largest structure-property-performance database for hydrogen storage capacities of half a million MOFs, which is currently publicly accessible via the Hydrogen Materials—Advanced Research Consortium (HyMARC) datahub.
11:00 am — Poster Session 2. Student Poster Awards in the Assembly Hall, and East and West Conference Rooms
Featuring U-M data science research, and posters from students and postdocs from 20+ leading universities. See full list.
Student poster award winners (with cash awards) in multiple categories will be announced.
1:00 pm — Data Challenge Award Ceremony
The MIDAS-MDST Data Challenge winners were announced.
1:30 – 4:30 pm — Data Science for Music mini-symposium
Featuring four projects funded by MIDAS.
1:30 pm “Understanding and Mining Patterns of Audience Engagement and Creative Collaboration in Largescale Crowdsourced Music Performances”
Danai Koutra, Walter Lasecki, Computer Science and Engineering; Sang Won Lee, Computer Science, Virginia Tech
2:00 pm “Understanding How the Brain Processes Music through the Bach Trio Sonatas”
Daniel Forger, Mathematics; James Kibbie, Organ
2:40 pm “The Sound of Text”
Rada Mihalcea, Computer Science and Engineering; Anıl Çamcı, Performing Arts Technology; Sile O’Modhrain, Performing Arts Technology; Jonathan Kummerfeld, Computer Science and Engineering
3:10 pm “A Computational Study of Patterned Melodic Structures across Musical Cultures”
Somangshu Mukherji, Áine Heneghan, Nathan Martin, René Rusch, Music Theory; Long Nguyen, Statistics; Steven Abney, Linguistics
3:45 pm Panel Discussion: Data Science and the Future of Arts Research
Panelists:
Daniel Forger (Professor of Mathematics);
Allie Lahnala (graduate student, Computer Science and Engineering);
Sam Mukherji (Assistant Professor, Music Theory);
Gregory Wakefield (Director of ArtsEngine, Professor of Electrical Engineering and Computer Science).
Moderator: Marvin Parnes, former Executive Director of Arts Alliance for Research Universities
Please check out this page to see the lists of winners in the 2019 symposium poster session and data challenge competitions.
We are the first in the country to organize a Consortium for Data Scientists in Training. 38 data science students and postdocs from 29 leading universities around the country will attend our symposium, including Columbia, Duke, Harvard, MIT, Clark Atlanta University, Purdue, Rice, Stanford, University of California (Berkeley), University of Washington, Wayne State.
Featured Speakers
Tanya Berger-Wolf: Computational Ecology and AI for Conservation
Founding member of Wildbook.org; Board of Directors Member, Wild Me; Professor of Computer Science at University of Illinois at Chicago working at the intersection of computer science, wildlife biology, and social sciences
Tina Eliassi-Rad: Just Machine Learning
Core Faculty at the Network Science Institute, Associate Professor at the Khoury College of Computer Sciences, Northeastern University working in the areas of data mining, ethics of artificial intelligence, machine learning, network science, and computational social science
Rayid Ghani: Machine Learning for Social Good: Examples, Opportunities, and Challenges
Chief Scientist of the 2012 Obama Campaign, Distinguished Career Professor at Carnegie Mellon with a joint appointment to the Heinz College of Information Systems and Public Policy and the School of Computer Science
Program Committee
- Ceren Budak, School of Information
- Yang Chen, Statistics
- Danai Koutra, Computer Science and Engineering
- Jing Liu, MIDAS
- Sam Mukherji, Music Theory
- Arvind Rao, Computational Medicine and Bioinformatics, and Radiation Oncology
- Zhenke Wu, Biostatistics