Friday, December 1, 2017 at the Michigan League
Al Hero and Brian Athey, MIDAS Co-Directors
Brian Athey is the Michael A. Savageau Collegiate Professor and Chair of the Department of Computational Medicine and Bioinformatics, and Professor of Psychiatry and Internal Medicine.
Al Hero is the John H. Holland Distinguished University Professor of Electrical Engineering and Computer Science, R. Jamison and Betty Williams Professor of Engineering, Professor of Biomedical Engineering, and Professor of Statistics.
Yanxin Pan, PhD Student, Design Science, College of Engineering
Title: Deep Design: Product Aesthetics for Heterogeneous Markets
Abstract: Aesthetic appeal is a primary driver of customer consideration over product designs such as automobiles. Product designers must accordingly convey design attributes (e.g., ‘Sportiness’) that the customer will prefer, a challenging proposition given subjective perceptions of customers belonging to heterogeneous market segments. We introduce a scalable deep learning approach that aims to predict how customers across market segments perceive aesthetic designs, as well as visually interpret “why” the customer perceives them as such. An experiment is conducted to test this approach, using a Siamese neural network architecture containing a pair of conditional generative adversarial networks, trained using large-scale product design and crowdsourced customer data. Our results show that we are able to predict how aesthetic design attributes are perceived by customers in heterogeneous market segments, as well as visually interpret these aesthetic perceptions. This provides evidence that the proposed deep learning approach may provide an additional means of understanding customer aesthetic perceptions complementary to existing methods used in product design.
Rohail Syed, PhD Candidate, School of Information
Title: Toward Search Engines Optimized for Human Learning
Abstract: While search technology is widely used for learning-oriented information needs, the results provided by popular services such as Web search engines are optimized primarily for generic relevance, not effective learning outcomes. As a result, the typical information trail that a user must follow while searching to achieve a learning goal may be an inefficient one involving unnecessarily easy or difficult content, or material that is irrelevant to actual learning progress relative to a user’s existing knowledge. We address this problem by introducing a novel theoretical framework, algorithms, and empirical analysis of an information retrieval model that is optimized for learning outcomes instead of generic relevance. We do this by formulating an optimization problem that incorporates a cognitive learning model into a retrieval objective, and then give an algorithm for an efficient approximate solution to find the search results that represent the best ‘training set’ for a human learner. Our model can personalize results for an individual user’s learning goals, as well as account for the effort required to achieve those goals for a given set of retrieval results. We investigate the effectiveness and efficiency of our retrieval framework relative to a commercial search engine baseline (‘Google’) through a crowdsourced user study involving a vocabulary learning task, and demonstrate the effectiveness of personalized results from our model on word learning outcomes.
Yang Chen, PhD, Assistant Professor of Statistics, College of Literature, Science, and the Arts
Title: Calibration Concordance by Multiplicative Shrinkage with Applications to Astronomical Instruments
Abstract: Calibration data are often obtained by observing several well-understood objects simultaneously with multiple instruments, such as satellites for measuring astronomical sources. Analyzing such data and obtaining proper concordance among the instruments is challenging when the physical source models are not well understood, when there are uncertainties in “known” physical quantities, or when data quality varies in ways that cannot be fully quantified. Furthermore, the number of model parameters increases with both the number of instruments and the number of sources. Thus, concordance of the instruments requires careful modeling of the mean signals, the intrinsic source differences, and measurement errors. We propose a log-Normal hierarchical model and a more general log-t model that respect the multiplicative nature of the mean signals via a half-variance adjustment, yet they permit imperfections in the mean modeling to be absorbed by residual variances. We present analytical solutions in the form of power shrinkage in special cases and develop reliable Markov chain Monte Carlo (MCMC) algorithms for general cases. We apply our method to several data sets including a combination of observations of active galactic nuclei (AGN) and spectral line emission from the supernova remnant E0102, obtained with a variety of X-ray telescopes such as Chandra, XMM-Newton, Suzaku, and Swift and compiled by the International Astronomical Consortium for High Energy Calibration (IACHEC). We demonstrate that our method provides helpful and practical guidance for astrophysicists when adjusting for disagreements among instruments.
Walter Mebane, PhD, Professor of Political Science, Professor of Statistics, College of Literature, Science, and the Arts
Title: Using Twitter to Observe Election Incidents in the United States
Abstract: Individuals’ observations about election administration can be valuable to improve election performance, to help assess how well election forensics methods work, to address interesting behavioral questions and possibly to help establish the legitimacy of an election. In the United States such observations cannot be gathered through official channels. We use Twitter to extract observations of election incidents by individuals all across the United States throughout the 2016 election, including primaries, caucuses and the general election. To classify Tweets for relevance and by type of election incident, we use automated machine classification methods in an active learning framework. We demonstrate that for primary election day in one state (California), the distribution of types of incidents revealed by data developed from Twitter roughly matches the distribution of complaints called in to a hotline run on that day by the state. For the general election we develop hundreds of thousands of incident observations that occur at varying rates in different states, that vary over time and by type and that depend on state election and demographic conditions. Thousands of observations concern long lines, but even more celebrate successful performance of the election process—testimonies that “I voted!” proliferate. We show how different types of Twitter users report distinct types of incidents.
Matthew Shapiro, PhD, Lawrence R. Klein Collegiate Professor of Economics, College of Literature, Science, and the Arts
Title: Computational Approaches for the Construction of Novel Macroeconomic Data
Abstract: Nowcasting, the description of current events and events in the immediate future and immediate past, holds great promise for insight into social and economic phenomena based on tracking and analyzing online data. Online data sources, such as social media text messages and images, capture a wide range of economic and social behaviors at high frequency and low cost, especially relative to traditional survey and administrative sources. Yet macroeconomic measurements derived from social media data have not yet become mainstream, for three main reasons. First, the software systems and deployment are too difficult for most users. Second, new data sources such as image streams require novel computational approaches before they can be useful to domain experts. Third, projects to date have been either entirely manual (and thus too burdensome for most) or entirely automatic (and thus unable to exploit economists’ expertise). This project aims to build a software system that helps overcome these burdens. The research team will develop a data ingestion and archiving service that constantly records, processes, and archives text and image data from online sources, such as Twitter and government-sponsored traffic cameras. They will also develop a nowcasting dataset construction tool for economists and other domain experts to transform the ingested and archived data streams into high-quality topic-specific nowcasts. For example, an economist might use the tool to build an unemployment predictor, while a stock analyst might use it to predict the opening weekend box office for an important studio’s releases. Building the tool entails research efforts in computer vision, machine learning, and data management systems. The team will also build an economics datapedia that collects and publishes a range of nowcast-driven datasets built using the dataset construction tool, similar to YouTube or other social platforms.
This system allows economists and other domain experts to discuss, combine, and criticize datasets. Together, these components should make social media a powerful tool for the construction of economic measures by practicing economists. This project will produce novel research, and by also building tools, services, and datasets, will make that research substantially more impactful, long-lasting, and useful to other researchers.
Director, U-M Venture Center
Office of Technology Transfer
School of Information
Computer Science and Engineering
College of Engineering
Data Science Infrastructure & Services: (30 minutes)
Brock Palen (ARC) and Kerby Shedden (CSCAR) will present about advanced computing infrastructure and consulting support for Data Science. Following the presentation, there will be a Q&A session. Brock and Kerby are especially interested in learning about specific needs in this area that ARC and CSCAR can work to accommodate.
Brock Palen
Director, Advanced Research Computing
Kerby Shedden, PhD
Professor of Statistics, Biostatistics
Director, Consulting for Statistics, Computing & Analytics Research
Vyas Ramasubramani, PhD Student, Chemical Engineering, College of Engineering
Title: Simple Data and Workflow Management with the signac Framework
Abstract: Researchers in the fields of materials science, chemistry, and computational physics regularly face the challenge of managing large and heterogeneous data spaces. The amount of data increases in lockstep with computational efficiency multiplied by the amount of available computational resources, which shifts the bottleneck within the scientific process from data acquisition to data post-processing and analysis. We present a framework designed to aid in the integration of various specialized formats, tools and workflows. The signac framework provides all basic components required to create a well-defined and thus collectively accessible data space, simplifying data access and modification through a homogeneous data interface, largely agnostic of the data source, i.e., computation or experiment. The framework’s data model is designed not to require absolute commitment to the presented implementation, simplifying adoption into existing data sets and workflows. This approach not only increases the efficiency for the production of scientific results, but also significantly lowers barriers for collaborations requiring shared data access.
Emily Hector, PhD Candidate, Biostatistics, School of Public Health
Title: A distributed and integrated method of moments for high-dimensional correlated data analysis
Abstract: We are motivated by a regression analysis of electroencephalography (EEG) neuroimaging data with high-dimensional correlated responses with multi-level nested correlations. We develop a divide-and-conquer procedure implemented in a fully distributed and parallelized computational scheme for statistical estimation and inference of regression parameters. Despite significant efforts in the literature, the computational bottleneck associated with high-dimensional likelihoods prevents the scalability of existing methods. The proposed method addresses this challenge by dividing responses into subvectors to be analyzed separately and in parallel on a distributed platform using pairwise composite likelihood. Theoretical challenges related to combining results from dependent data are overcome in a statistically efficient way using a meta-estimator derived from Hansen’s generalized method of moments. We provide a rigorous theoretical framework for efficient estimation, inference, and goodness-of-fit tests. We develop an R package for ease of implementation. We illustrate our method’s performance with simulations and the analysis of the EEG data, and find that iron deficiency is significantly associated with two event related electrical potentials related to auditory recognition memory in the left parietal-occipital region of the brain.
Selin Merdan, PhD Candidate, Industrial and Operations Engineering, College of Engineering
Title: Data Analytics for Optimal Detection of Metastatic Prostate Cancer
Abstract: Prostate cancer staging involves the determination of the spread of cancer (metastasis) via bone scan (BS) and/or CT scan. Standard clinical guidelines indicate the need for BS and CT scan only in patients with certain unfavorable characteristics; however, there is no consensus about the optimal use of staging BS and CT scan for men with newly-diagnosed prostate cancer. To develop state-wide, evidence-based imaging criteria, we collaborated with the Michigan Urological Surgery Improvement Collaborative (MUSIC), a quality-improvement collaborative comprising 90% of the urologists in the state. The goal of this project was to determine which patients should receive imaging and which patients can safely avoid imaging. Because not all men with newly-diagnosed cancer received imaging, we used an established method to correct for verification bias to evaluate the accuracy of published imaging guidelines. In addition to the published guidelines, we implemented advanced classification modeling techniques to develop accurate classification rules identifying which patients should receive imaging on the basis of individual risk factors. We proposed a new algorithm for a classification model considering the extraction of data of patients with nonverified disease and the high cost of misclassifying a metastatic patient in its learning framework. We employed a bi-criteria based approach to determine the Pareto optimal guidelines with respect to expected number of positive outcomes missed and expected number of negative studies. MUSIC implemented these guidelines, which were predicted to reduce unnecessary imaging by more than 40% and limit the percentage of patients with missed metastatic disease to less than 1%.
Michael Traugott, PhD, Professor, Communication Studies, Political Science, College of Literature, Science, and the Arts
Title: A Social Science Collaboration for Research on Communication and Learning based upon Big Data
Abstract: One of the challenges facing social scientists is that our understanding of how social and political processes operate and what their consequences are has lost some of its predictive power, such as our failure to predict election outcomes. This phenomenon raises questions of whether theories and models developed in the past – among a different generation living in a different cultural and technological setting – apply in the current environment. Concurrently, the abundance of online, social media data provides social scientists with great opportunities to understand today’s social and political phenomena. To use such opportunities, however, important issues about how to process and use social media need to be addressed. Such issues include whether social media users are representative of the population at large and whether they are honest and open, as well as whether the collection and processing of data are unbiased and accurate to allow the construction of inferences about populations. The research team will carry out a few parallel projects with the unifying theme of integrating geospatial, social media and spatial data to address research and methodological questions. One project is about communication patterns and their effects on political choices and behavior in the 2016 presidential election. The second project investigates online and Twitter communication about parenting information and misinformation. A third project will investigate a variety of methodological issues associated with inferences drawn from probability-based and nonprobability-based social surveys and from social media. The three projects will employ methods of cross-validation of survey data, social media, and administrative records and investigate the social network dynamics of elites and the general public.
The research team will develop procedures for extracting meaning from large collections of text to connect with public attitudes about important political and policy issues of the day. They will also develop visualization techniques for dimensionality reduction, while expanding upon existing systems for data mining and statistical inference. The project is a collaboration between researchers from multiple units at the University of Michigan and at Georgetown University, and the team will also engage researchers at Gallup. This set of projects will become the locus for multidisciplinary efforts between social scientists, computer scientists, and statisticians at both institutions, and each university will become the locus for future extended work of this kind. The data science tools developed through this set of projects will also have wide application to other research questions in social science.
Michael Elliott, PhD, Professor, Biostatistics, School of Public Health
Title: Calibrating Big Data for population inference: applying quasi-randomization approaches to naturalistic driving data
Abstract: Summary: With the rapid penetration of Big Data into science and technology, concerns are raised about population inference based on such data. In the words of Xiao-Li Meng: “the bigger the data, the more certain we will miss our target”. Although probability sampling has been the “gold standard” of population inference for decades, its rising cost and complexity, along with downward trends in response rates, have led to a growing interest in non-probability samples. This interest has heightened with growing access to Big Data, which typically is not collected using probability sampling. We aim to develop methods that use probability samples of relevant populations to improve the representativeness of non-probability samples. Methods: This study aims to calibrate naturalistic driving data using quasi-randomization approaches. This method assumes the non-probability sample actually has a probability sampling mechanism, albeit unknown. The goal is to estimate pseudo-inclusion probabilities using a reference survey that has a set of covariates in common with Big Data. Results: We consider the development of the quasi-randomization weights to improve the representativeness of the University of Michigan Transportation Research Institute Safety Pilot Study. Safety Pilot consists of over 3,000 vehicles that were instrumented and followed for an average of one year; it is a convenience sample of drivers in the southeast Michigan region. We use the National Household Travel Survey as our probability sample of drivers; it consists of information from over 300,000 drivers, but only about one driving day, obtained only from self-report (no instrumentation). We also consider a simulation study to evaluate the performance of weighted estimates. Impact: “Big data” is a valuable resource, but can lead to “big mistakes” if issues of representativeness are not considered carefully.
We are developing practical methods that can be implemented using existing software to deal with this issue.
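For readers unfamiliar with quasi-randomization, the idea in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' method: it assumes the shared covariates have been collapsed into discrete cells, and it estimates each cell's pseudo-inclusion probability as the cell's count in the non-probability sample divided by the population size estimated from the weighted reference survey. The function names (`pseudo_weights`, `weighted_mean`) and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def pseudo_weights(big_data_cells, reference_cells, reference_weights):
    """Return one pseudo-weight per covariate cell (hypothetical helper).

    big_data_cells: cell label for each unit in the non-probability sample.
    reference_cells: cell label for each respondent in the probability sample.
    reference_weights: survey weight for each reference respondent.
    """
    # Estimated population size per cell from the weighted reference survey.
    population = defaultdict(float)
    for cell, weight in zip(reference_cells, reference_weights):
        population[cell] += weight

    # Observed count per cell in the non-probability sample.
    counts = Counter(big_data_cells)

    # Pseudo-inclusion probability is count / population size;
    # the pseudo-weight is its inverse.
    return {cell: population[cell] / counts[cell] for cell in counts}


def weighted_mean(values, cells, weights):
    """Mean of values, weighting each unit by its cell's pseudo-weight."""
    numerator = sum(weights[c] * v for v, c in zip(values, cells))
    denominator = sum(weights[c] for c in cells)
    return numerator / denominator


# Toy data (invented): younger drivers are over-represented in the
# convenience sample, while the reference survey says the two groups
# are equally common in the population.
cells = ["young", "young", "young", "old"]
values = [10, 10, 10, 20]
weights = pseudo_weights(cells, ["young", "old"], [500.0, 500.0])

print(sum(values) / len(values))              # unweighted mean
print(weighted_mean(values, cells, weights))  # pseudo-weighted mean
```

In this toy example the pseudo-weighted mean moves away from the unweighted mean toward the value expected if both groups were represented equally. A realistic analysis would typically replace the cell counts with a model-based estimate of the inclusion probabilities.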
Computational Social Science Workshop (CSSW)
Jeff Lockhart, PhD Student, Sociology, College of Literature, Science, and the Arts
Michigan Student Artificial Intelligence Lab (MSAIL)
Sam Tenka, BS Student, Mathematics, College of Literature, Science, and the Arts
Michigan Data Science Team (MDST)
Jonathan Stroud, PhD Student, Computer Science and Engineering, College of Engineering
Statistics in the Community at Michigan (STATCOM)
Evan Reynolds, PhD Student, Biostatistics, School of Public Health
Christopher J. Rozell, PhD, Associate Professor, Electrical and Computer Engineering, Georgia Institute of Technology
Title: Closing the Loop Between Mind and Machine: Building Algorithms to Interface with Brains at Multiple Scales
Abstract: New technologies are rapidly developing for interfacing with the brain across multiple scales including cells, circuits, networks and systems. While there has been much discussion about the emerging “big data” problems that will arise from high-resolution measurement technologies, there are a number of other new data science challenges also emerging. In particular, now more than ever, we have the ability and desire to build closed-loop systems that are a combination of biology and technology working together in real time for scientific discovery and clinical therapies. Innovations in interfacing technology require parallel advances in algorithmic technology to determine what to do with these new tools to maximize their effectiveness. These systems require data science approaches that operate online, are designed for closed-loop processing in real time, are robust to the reality of “small data” in many applications, and are fully informed about both the biology and modern neurotechnology being used at the interface. In this talk we will survey recent examples of data science problems we are working on as we build closed-loop interfacing systems for the brain in health and disease. These problems will span scales from single-cell electrophysiology up to novel brain-machine interfaces for controlling complex systems.
MIDAS gratefully acknowledges Northrop Grumman Corporation for its generous support of the MIDAS Seminar Series.
The reception will take place in the Michigan League Ballroom on the 2nd floor.