
Study on bias in learning analytics earns Brooks Best Full Research Paper Award at LAK conference

By | General Interest, Happenings, News, Research

A paper co-authored by University of Michigan School of Information research assistant professor Christopher Brooks received the Best Full Research Paper Award at the International Conference on Learning Analytics & Knowledge (LAK) in Tempe, Arizona. The award was announced on the final day of the conference, March 7, 2019.

The paper, “Evaluating the Fairness of Predictive Student Models Through Slicing Analysis,” describes a tool designed to test the bias in algorithms used to predict student success.

The goal of the paper, Brooks says, was to evaluate whether the algorithms used to predict whether students would succeed in massive open online courses (MOOCs) were skewed by the gender makeup of the classes.

“We were able to find that some have more bias than others do,” says Brooks. “First we were able to show that different MOOCs tend to have different bias in gender representation inside of the MOOCs.”
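As a rough illustration of what slicing a model's performance by group can look like, the sketch below computes accuracy separately for each group and reports the largest gap between groups. It is a simplified stand-in, not the tool or the metric from the paper: the function names, the toy labels, and the group codes "a" and "b" are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_slice(y_true, y_pred, groups):
    """Return {group: accuracy} for predictions sliced by a group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def max_gap(per_slice):
    """Largest difference in accuracy between any two slices."""
    values = list(per_slice.values())
    return max(values) - min(values)


# Hypothetical toy data: 1 = student succeeded, 0 = did not.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_slice = accuracy_by_slice(y_true, y_pred, groups)
print(per_slice)           # accuracy for each group
print(max_gap(per_slice))  # gap between best- and worst-served group
```

A large gap would suggest the model serves one group better than another; a real analysis would likely use a threshold-independent measure and far more data.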

Read more…

HDDA: DataSifter: statistical obfuscation of electronic health records and other sensitive datasets

By | Research

Title

HDDA: DataSifter: statistical obfuscation of electronic health records and other sensitive datasets

Publication
Journal of Statistical Computation and Simulation

Date

11 Nov. 2018

DOI
https://doi.org/10.1080/00949655.2018.1545228

Authors
Simeone Marino, Nina Zhou, Yi Zhao, Lu Wang, Qiucheng Wu & Ivo D. Dinov (2019)

Abstract
There are no practical and effective mechanisms to share high-dimensional data including sensitive information in various fields like health, financial intelligence, or socioeconomics without either compromising the utility of the data or exposing private personal or secure organizational information. Excessive scrambling or encoding of the information makes it less useful for modelling or analytical processing. Insufficient preprocessing may compromise sensitive information and introduce a substantial risk for re-identification of individuals by various stratification techniques. To address this problem, we developed a novel statistical obfuscation method (DataSifter) for on-the-fly de-identification of structured and unstructured sensitive high-dimensional data such as clinical data from electronic health records (EHR). DataSifter provides complete administrative control over the balance between risk of data re-identification and preservation of the data information. Simulation results suggest that DataSifter can provide privacy protection while maintaining data utility for different types of outcomes of interest. The application of DataSifter on a large autism dataset provides a realistic demonstration of its promise for practical applications.

Balzano wins NSF CAREER award for research on machine learning and big data involving physical, biological and social phenomena

By | General Interest, Happenings, News, Research

Prof. Laura Balzano received an NSF CAREER award to support research that aims to improve the use of machine learning in big data problems involving elaborate physical, biological, and social phenomena. The project, called “Robust, Interpretable, and Efficient Unsupervised Learning with K-set Clustering,” is expected to have broad applicability in data science.

Modern machine learning techniques aim to design models and algorithms that allow computers to learn efficiently from vast amounts of previously unexplored data, says Balzano. Typically the data is broken down in one of two ways. Dimensionality reduction uses an algorithm to break down high-dimensional data into the low-dimensional structure that is most relevant to the problem being solved. Clustering, on the other hand, attempts to group pieces of data into meaningful clusters of information.
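To make the two approaches concrete, here is a minimal sketch that runs them one after the other on synthetic data: a PCA-style projection (via SVD) for dimensionality reduction, followed by plain k-means for clustering. These are textbook stand-ins chosen for illustration, not the K-set clustering method the project proposes; the function names, the farthest-first initialization, and the two synthetic blobs are assumptions.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal directions (via SVD)."""
    X_centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T


def kmeans(X, k, n_iter=10):
    """Plain Lloyd's algorithm with farthest-first initialization."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from all centers chosen so far.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


# Hypothetical data: two well-separated groups of points in 10 dimensions.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.5, size=(50, 10))
blob_b = rng.normal(5.0, 0.5, size=(50, 10))
X = np.vstack([blob_a, blob_b])

Z = pca_project(X, 2)   # dimensionality reduction: 10-D -> 2-D
labels = kmeans(Z, 2)   # clustering in the reduced space
print(Z.shape, labels)
```

Running the two steps separately works here because the data is clean and the groups are far apart; it does not address the messier, combined setting the project targets.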

However, explains Balzano, “as increasingly higher-dimensional data are collected about progressively more elaborate physical, biological, and social phenomena, algorithms that aim at both dimensionality reduction and clustering are often highly applicable, yet hard to find.”

Balzano plans to develop techniques that combine the two key approaches used in machine learning to decipher data, while being applicable to data that is considered “messy.” Messy data is data that has missing elements, may be somewhat corrupted, or is filled with heterogeneous information – in other words, it describes most data sets in today’s world.

Balzano is an affiliated faculty member of both the Michigan Institute for Data Science (MIDAS) and the Michigan Institute for Computational Discovery and Engineering (MICDE). She is part of a MIDAS-supported research team working on single-cell genomic data analysis.

Read more about the NSF CAREER award…

Who’s Tweeting About the President? What Big Survey Data Can Tell Us About Digital Traces

By | Research

Title
Who’s Tweeting About the President? What Big Survey Data Can Tell Us About Digital Traces

Published in
Social Science Computer Review, January 21, 2019

DOI
10.1177/0894439318822007

Authors
Josh Pasek, Colleen A. McClain, Frank Newport, Stephanie Marken

Abstract
Researchers hoping to make inferences about social phenomena using social media data need to answer two critical questions: What is it that a given social media metric tells us? And who does it tell us about? Drawing from prior work on these questions, we examine whether Twitter sentiment about Barack Obama tells us about Americans’ attitudes toward the president, the attitudes of particular subsets of individuals, or something else entirely. Specifically, using large-scale survey data, this study assesses how patterns of approval among population subgroups compare to tweets about the president. The findings paint a complex picture of the utility of digital traces. Although attention to subgroups improves the extent to which survey and Twitter data can yield similar conclusions, the results also indicate that sentiment surrounding tweets about the president is no proxy for presidential approval. Instead, after adjusting for demographics, these two metrics tell similar macroscale, long-term stories about presidential approval but very different stories at a more granular level and over shorter time periods.

3D Shape Modeling for Cell Nuclear Morphological Analysis and Classification

By | Research

Title
3D Shape Modeling for Cell Nuclear Morphological Analysis and Classification

Published in
Scientific Reports 8, October 2018

DOI
10.1038/s41598-018-33574-w

Authors
Alexandr A. Kalinin, Ari Allyn-Feuer, Alex Ade, Gordon-Victor Fon, Walter Meixner, David Dilworth, Syed S. Husain, Jeffrey R. de Wet, Gerald A. Higgins, Gen Zheng, Amy Creekmore, John W. Wiley, James E. Verdone, Robert W. Veltri, Kenneth J. Pienta, Donald S. Coffey, Brian D. Athey & Ivo D. Dinov

Abstract
Quantitative analysis of morphological changes in a cell nucleus is important for the understanding of nuclear architecture and its relationship with pathological conditions such as cancer. However, dimensionality of imaging data, together with a great variability of nuclear shapes, presents challenges for 3D morphological analysis. Thus, there is a compelling need for robust 3D nuclear morphometric techniques to carry out population-wide analysis. We propose a new approach that combines modeling, analysis, and interpretation of morphometric characteristics of cell nuclei and nucleoli in 3D. We used robust surface reconstruction that allows accurate approximation of 3D object boundary. Then, we computed geometric morphological measures characterizing the form of cell nuclei and nucleoli. Using these features, we compared over 450 nuclei with about 1,000 nucleoli of epithelial and mesenchymal prostate cancer cells, as well as 1,000 nuclei with over 2,000 nucleoli from serum-starved and proliferating fibroblast cells. Classification of sets of 9 and 15 cells achieved accuracy of 95.4% and 98%, respectively, for prostate cancer cells, and 95% and 98% for fibroblast cells. To our knowledge, this is the first attempt to combine these methods for 3D nuclear shape modeling and morphometry into a highly parallel pipeline workflow for morphometric analysis of thousands of nuclei and nucleoli in 3D.

The effectiveness of parking policies to reduce parking demand pressure and car use

By | Research

This study is a part of the “Reinventing Transportation and Urban Mobility” project, funded by the Michigan Institute for Data Science.

Title
The effectiveness of parking policies to reduce parking demand pressure and car use

Published in
Transport Policy, January 2019

DOI
10.1016/j.tranpol.2018.10.009

Authors
Xiang Yan, Jonathan Levine, Robert Marans

Abstract
Evaluating the effectiveness of parking policies to relieve parking demand pressure in central areas and to reduce car use requires an investigation of traveler responses to different parking attributes, including the money and time costs associated with parking. Existing parking studies on this topic are inadequate in two ways. First, few studies have modeled parking choice and mode choice simultaneously, thus ignoring the interaction between these two choice realms. Second, existing studies of travel choice behavior have largely focused on the money cost of parking while giving less attention to non-price-related variables such as parking search time and egress time from parking lot to destination. To address these issues, this paper calibrates a joint model of travel mode and parking location choice, using revealed-preference survey data on commuters to the University of Michigan, Ann Arbor, a large university campus. Key policy variables examined include parking cost, parking search time, and egress time. A comparison of elasticity estimates suggested that travelers were very sensitive to changes in egress time, even more so than parking cost, but they were less sensitive to changes in search time. Travelers responded to parking policies primarily by shifting parking locations rather than switching travel mode. Finally, our policy simulation results imply some synergistic effects between policy measures; that is, when pricing and policy measures that reduce search and egress time are combined, they shape parking demand more than the sum of their individual effects if implemented in isolation.

VIPER: variability-preserving imputation for accurate gene expression recovery in single-cell RNA sequencing studies

By | Research

This research was supported by funding from the Michigan Center for Single-Cell Genomic Data Analytics—a part of the Michigan Institute for Data Science.

Title
VIPER: variability-preserving imputation for accurate gene expression recovery in single-cell RNA sequencing studies

Published in
Genome Biology, November 12, 2018

DOI
10.1186/s13059-018-1575-1

Authors
Mengjie Chen and Xiang Zhou

Abstract
We develop a method, VIPER, to impute the zero values in single-cell RNA sequencing studies to facilitate accurate transcriptome quantification at the single-cell level. VIPER is based on nonnegative sparse regression models and is capable of progressively inferring a sparse set of local neighborhood cells that are most predictive of the expression levels of the cell of interest for imputation. A key feature of our method is its ability to preserve gene expression variability across cells after imputation. We illustrate the advantages of our method through several well-designed real data-based analytical experiments.
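The general idea of imputing a cell's zero values from a nonnegative combination of other cells can be sketched as follows. This is not the VIPER implementation: it swaps in ordinary nonnegative least squares (`scipy.optimize.nnls`) for the paper's sparse regression models, skips neighbor selection entirely, and uses a hypothetical toy matrix and a hypothetical function name, `impute_zeros`.

```python
import numpy as np
from scipy.optimize import nnls


def impute_zeros(expr, target):
    """Fill zero entries of one cell from a nonnegative mix of the other cells.

    expr:   genes x cells expression matrix
    target: column index of the cell to impute
    """
    y = expr[:, target]
    others = np.delete(expr, target, axis=1)
    observed = y > 0

    # Fit nonnegative weights using only genes observed in the target cell.
    weights, _ = nnls(others[observed], y[observed])

    # Predict the zero entries from the weighted combination of other cells.
    filled = y.copy()
    filled[~observed] = others[~observed] @ weights
    return filled, weights


# Hypothetical toy matrix: 5 genes x 3 cells. The last cell is the average of
# the first two, except for gene 3, which was recorded as zero.
expr = np.array([
    [2.0, 4.0, 3.0],
    [6.0, 2.0, 4.0],
    [1.0, 3.0, 2.0],
    [4.0, 8.0, 0.0],
    [5.0, 1.0, 3.0],
])

filled, weights = impute_zeros(expr, target=2)
print(weights)  # contribution of each other cell
print(filled)   # target cell with its zero entry filled in
```

Because every zero is treated as missing here, the sketch cannot distinguish technical dropouts from genes that are truly unexpressed, which is part of what a dedicated method has to handle.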

TAIJI: Approaching Experimental Replicates-Level Accuracy for Drug Synergy Prediction

By | Research

MIDAS-affiliated researchers recently published a paper on accurate and fast computational tools for predicting drug synergistic effects.

Title
TAIJI: Approaching Experimental Replicates-Level Accuracy for Drug Synergy Prediction

Published in
Bioinformatics, November 21, 2018

DOI
10.1093/bioinformatics/bty955

Authors
Hongyang Li, Shuai Hu, Nouri Neamati, Yuanfang Guan

Abstract

Motivation

Combination therapy is widely used in cancer treatment to overcome drug resistance. High-throughput drug screening is the standard approach to study the drug combination effects, yet it becomes impractical when the number of drugs under consideration is large. Therefore, accurate and fast computational tools for predicting drug synergistic effects are needed to guide experimental design for developing candidate drug pairs.

Results

Here, we present TAIJI, high-performance software for fast and accurate prediction of drug synergism. It is based on the winning algorithm in the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge, which is a unique platform to unbiasedly evaluate the performance of current state-of-the-art methods, and includes 160 team-based submission methods. When tested across a broad spectrum of 85 different cancer cell lines and 1089 drug combinations, TAIJI achieved a high prediction correlation (0.53), approaching the accuracy level of experimental replicates (0.56). The runtime is at the scale of minutes to achieve this state-of-the-field performance.

MIDAS announces winners of 2018 poster competition

By | Educational, General Interest, Happenings, Research

The Michigan Institute for Data Science (MIDAS) is pleased to announce the winners of its 2018 poster competition, which is held in conjunction with the MIDAS annual symposium.

The symposium was held on Oct. 9-10, 2018, and the student poster competition had more than 60 entries. The winners, judged by a panel of faculty members, received cash prizes.

Best Overall

Arthur Endsley, “Comparing and timing business cycles and land development trends in U.S. metropolitan housing markets”

Most likely health impact

  • Yehu Chen, Yingsi Jian, Qiucheng Wu, Yichen Yang, “Compressive Big Data Analytics – CBDA: Applications to Biomedical and Health Studies”
  • Jinghui Liu, “An Information Retrieval System with an Iterative Pattern for TREC Precision Medicine”

Most likely transformative science impact

  • Prashant Rajaram, “Bingeability and Ad Tolerance: New Metrics for the Streaming Media Age”
  • Mike Ion, “Learning About the Norms of Teaching Practice: How Can Machine Learning Help Analyze Teachers’ Reactions to Scenarios?”

Most interesting methodological advancement

  • Nina Zhou and Qiucheng Wu, “DataSifter: Statistical Obfuscation of Electronic Health Records and Other Sensitive Datasets”
  • Aniket Deshmukh, “Simple Regret Minimization for Contextual Bandits”

Most likely societal impact

  • Ece Sanci, “Optimization of Food Pantry Locations to Address Food Scarcity in Toledo, OH”
  • Rohail Syed, “Human Perception of Surprise: A User Study”

Most innovative use of data

  • Lan Luo, “Renewable Estimation and Incremental Inference in Generalized Linear Models with Streaming Datasets”
  • Danaja Maldeniya, “Psychological Response of Communities affected by Natural Disasters in Social Media”

MDST group wins KDD best paper award

By | General Interest, Happenings, MDSTPosts, Research

A paper by members and faculty leaders of the Michigan Data Science Team (co-authors: Jacob Abernethy, Alex Chojnacki, Arya Farahi, Eric Schwartz, and Jared Webb) won the Best Student Paper award in the Applied Data Science track at the KDD 2018 conference in August in London.

The paper, “ActiveRemediation: The Search for Lead Pipes in Flint, Michigan,” details the group’s ongoing work in Flint to detect pipes made of lead and other hazardous materials.

For more on the team’s work, see this recent U-M press release.