U-M, MIDAS researchers supported by Chan Zuckerberg Initiative

By | General Interest, Happenings, News, Research

Several University of Michigan researchers, including faculty affiliated with MIDAS, recently received support from the Chan Zuckerberg Initiative under its Human Cell Atlas project.

The project seeks to create a shared, open reference atlas of all cells in the healthy human body as a resource for studies of health and disease. The project is funding a variety of software tools and analytic methods. The U-M projects are listed below:

Identifying genetic markers: dimension reduction and feature selection for sparse data
Investigator: Anna Gilbert, Department of Mathematics, MIDAS Core Faculty Member
Description: One of the modalities that scientists participating in the Human Cell Atlas will use to gather data is single cell RNA sequencing (scRNA-seq). The analysis of scRNA-seq data, however, poses novel biological and algorithmic challenges. The data are high dimensional and not necessarily in distinct clusters (indeed, some cell types exist along a continuum or developmental trajectory). In addition, data values are missing. To analyze these data, we must adjust our dimension reduction algorithms accordingly and either fill in the values or determine quantitatively the impact of the missing values. Furthermore, none of these steps is performed in isolation; they are part of a principled data analysis pipeline. This work will leverage over a decade of modern, sparsity-based machine learning methods and apply them to dimension reduction, marker selection, and data imputation for scRNA-seq data. In one of our two feature selection methods, we adapt a 1-bit compressed sensing algorithm (1CS) introduced by Genzel and Conrad. In order to select markers, the algorithm finds optimal hyperplanes that separate the given clusters of cells and that depend only on a small number of genes. The second method is based on the mutual information (MI) framework developed in. This algorithm greedily builds a set of markers out of a set of statistically significant genes that maximizes information about the target clusters and minimizes redundancy between markers. The imputation algorithms use sparse data models to impute missing values and are tailored to integer counts.
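As a rough illustration of the greedy, information-based marker selection described above, the Python sketch below picks genes one at a time, favoring genes that are informative about the cluster labels and penalizing genes that duplicate markers already chosen. It is not the investigators' algorithm: the binarized expression values, the score (relevance minus mean redundancy, in the style of mRMR), and all function and gene names are assumptions made for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def greedy_markers(expr, labels, k):
    """Greedily choose up to k marker genes.

    expr   : dict mapping gene name -> list of discrete values, one per cell
             (here assumed binarized: 1 = expressed, 0 = not expressed)
    labels : list of cluster labels, one per cell
    """
    relevance = {g: mutual_information(v, labels) for g, v in expr.items()}
    candidates = set(expr)
    selected = []

    def score(gene):
        # Relevance to the clusters minus mean redundancy with chosen markers.
        if not selected:
            return relevance[gene]
        redundancy = sum(
            mutual_information(expr[gene], expr[s]) for s in selected
        ) / len(selected)
        return relevance[gene] - redundancy

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy data: 9 cells in 3 clusters, 4 genes.
labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
expr = {
    "g1":     [1, 1, 1, 0, 0, 0, 0, 0, 0],  # marks cluster A
    "g1_dup": [1, 1, 1, 0, 0, 0, 0, 0, 0],  # redundant copy of g1
    "g2":     [0, 0, 0, 1, 1, 1, 0, 0, 0],  # marks cluster B
    "noise":  [1, 0, 1, 0, 1, 0, 1, 0, 1],  # nearly uninformative
}
print(greedy_markers(expr, labels, 2))
```

In this toy setting the redundancy penalty is what keeps the duplicate gene out of the marker set: once one copy is chosen, the other adds almost no new information about the clusters.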

Computational tools for integrating single-cell RNA sequencing studies with genome-wide association studies
Investigator: Xiang Zhou, Biostatistics
Description: Single cell RNA sequencing (scRNAseq) has emerged as a powerful tool in genomics. Unlike previous bulk RNAseq that measures average expression levels across many cells, scRNAseq can measure gene expression at the single cell level. The high resolution of scRNAseq has thus far transformed genomics: scRNAseq has been applied to classify novel cell-subpopulations and states, quantify progressive gene expression, perform spatial mapping, identify differentially expressed genes, and investigate the genetic basis of expression variation. While many computational tools have been developed for analyzing scRNAseq data, tools for effective integrative analysis of scRNAseq with other existing genetic/genomic data types are underdeveloped. Here, we propose to extend our previous integrative methods and develop novel computational tools for integrating scRNAseq data with genome-wide association studies (GWASs). Our proposed tools will identify cell-subpopulations relevant to GWAS diseases or traits, facilitate the interpretation of association results, catalyze more powerful future association studies, and help understand disease etiology and the genetic basis of phenotypic variation. The proposed tools will be applied to integrate summary statistics from various GWASs with fine-scale cell-subpopulations identified from the Human Cell Atlas (HCA) project, to maximize the impact of HCA and facilitate our understanding of the genetic architecture of various human traits and diseases — a question of central importance to human health.

Joint analysis of single cell and bulk RNA data via matrix factorization
Investigator: Clayton Scott, Electrical Engineering and Computer Science, MIDAS Affiliated Faculty
Description: Single cell RNA sequencing (scRNAseq) is a recently developed platform that enables the measurement of thousands of gene expression levels across individual cells in a tissue sample of interest. The ability to quantify gene expression at the cell level has great potential for advancing our understanding of the cellular processes that characterize a broad range of biological phenomena. However, compared with older bulk RNA technology, which measures expression levels of large numbers of cells in aggregate, scRNAseq data has higher levels of measurement noise, which complicates its analysis. Furthermore, the problem of inferring cell type from scRNAseq data is an unsupervised machine learning problem, an already difficult problem even without high measurement noise. To address these issues, we propose a mathematical and algorithmic framework to infer cellular characteristics by analyzing single cell and bulk RNA data simultaneously, via an approach grounded in matrix factorization. The developed algorithms will be evaluated on real data gathered by researchers at the University of Michigan who study breast cancer and spermatogenesis.

Integrating single cell profiles across modalities using manifold alignment
Investigator: Joshua Welch, Computational Medicine and Bioinformatics
Description: Integrating the variation underlying different types of single cell measurements is a critical step toward a comprehensive catalog of human cell types. The ideal approach to construct a cell type atlas would use high-throughput single cell multi-omic profiling to simultaneously measure all cellular modalities of interest within each cell. Although this approach is currently out of reach, it is possible to separately perform high-throughput transcriptomic, epigenomic, and proteomic measurements at the single cell level. Computationally integrating multiple data modalities measured on different individual cells can circumvent the experimental challenges of multi-omic profiling. If different types of single cell measurements are performed on distinct single cells from a common population, each modality will sample a similar set of cells. Matching up similar cells to infer multimodal profiles enables some analyses for which multi-omic profiling is desirable, including multimodal cell type definition and studying covariance among different data types. Manifold alignment is a powerful computational technique for integrating multiple sources of data that describe the same set of events by discovering the common manifold (general geometric shape) that underlies them. Previously, we showed that transcriptomic and epigenomic measurements performed on distinct single cells share underlying sources of variation. We developed a computational method, MATCHER, which uses manifold alignment to integrate cell trajectories constructed from these measurements and infer single cell multi-omic profiles. Here, we will extend this approach to match multimodal single cell profiles sampled from an entire tissue.
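The idea of matching similar cells across modalities can be illustrated with a much simpler stand-in than manifold alignment. The Python sketch below assumes each cell has already been assigned a one-dimensional pseudotime value within its own modality (a hypothetical input), converts those values to rank-based quantiles so that the two modalities share a common scale, and pairs each cell with the nearest-quantile cell from the other modality. It is not MATCHER, and the function names are illustrative only.

```python
from bisect import bisect_left


def quantiles(values):
    """Map each value to its rank-based quantile in (0, 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    q = [0.0] * len(values)
    for rank, i in enumerate(order):
        q[i] = (rank + 0.5) / len(values)
    return q


def match_cells(time_a, time_b):
    """For each cell in modality A, return the index of the modality-B cell
    whose pseudotime quantile is closest."""
    qa, qb = quantiles(time_a), quantiles(time_b)
    b_sorted = sorted(range(len(qb)), key=lambda j: qb[j])
    b_q = [qb[j] for j in b_sorted]
    matches = []
    for q in qa:
        pos = bisect_left(b_q, q)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = [p for p in (pos - 1, pos) if 0 <= p < len(b_q)]
        best = min(candidates, key=lambda p: abs(b_q[p] - q))
        matches.append(b_sorted[best])
    return matches


# Hypothetical pseudotime values on different scales for two modalities.
rna_time = [0.1, 0.9, 0.5]
epi_time = [30.0, 10.0, 20.0]
print(match_cells(rna_time, epi_time))
```

Working with quantiles rather than raw values reflects the assumption stated in the description: if both modalities sample a similar population of cells, the same quantile should correspond to a similar cell state even when the raw scales differ.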

Computational methods to enable robust and cost-effective multiplexing of single cell RNA-seq experiments at population scale
Investigator: Hyun Min Kang, Biostatistics
Description: With the advent of single-cell genomic technologies, the Human Cell Atlas (HCA) seeks to create a reference map of each individual cell type and to understand how they develop and maintain their functions, how they interact with each other, and which environmental and/or genetic changes trigger molecular dysfunction that leads to disease. To achieve these goals, it becomes increasingly important to creatively integrate single-cell genomic technologies with novel computational methods to maximize the potential of the new technological advances. Recently, our group has developed a computational tool, demuxlet, that enables population-scale multiplexing of droplet-based single-cell RNA-seq (dscRNA-seq) experiments. Our approach harnesses natural genetic variation carried within dscRNA-seq reads to multiplex cells from many samples in a single library prep, and statistically deconvolute the sample identity of each barcoded droplet while filtering out multiplets (droplets that contain two or more cells). In this proposal, we aim to further extend our method to increase the accuracy by harnessing cell-specific expression levels, and to eliminate the constraint requiring external genotype data. We will enable application of these methods through production, distribution, and support of efficient, well-documented, open-source software; and test these tools through analysis of simulated data and of real dscRNA-seq data.
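The genotype-based deconvolution described above can be sketched in a few lines of Python. The example below is a simplified illustration, not the demuxlet implementation: it assumes known genotypes for every sample, a fixed sequencing error rate, independent reads, and an equal mixture of two samples for doublets, and it omits any prior on the doublet rate. All names and values are hypothetical.

```python
from itertools import combinations
from math import log

ERR = 0.01  # assumed per-read sequencing error rate


def p_alt(alt_copies):
    """Probability that a read shows the alternate allele, given 0, 1 or 2
    copies of that allele in the genotype."""
    frac = alt_copies / 2
    return frac * (1 - ERR) + (1 - frac) * ERR


def log_lik(reads, p_alt_by_snp):
    """Log-likelihood of a droplet's reads.

    reads        : list of (snp_id, is_alt) pairs
    p_alt_by_snp : dict snp_id -> probability that a read is the alt allele
    """
    return sum(
        log(p_alt_by_snp[snp] if is_alt else 1 - p_alt_by_snp[snp])
        for snp, is_alt in reads
    )


def assign_droplet(reads, genotypes):
    """Return ('singlet', sample) or ('doublet', (sample1, sample2)).

    genotypes : dict sample -> {snp_id: alt allele copies}; every sample is
                assumed to be genotyped at the same SNPs.
    """
    singlets = {
        s: log_lik(reads, {snp: p_alt(g) for snp, g in geno.items()})
        for s, geno in genotypes.items()
    }
    doublets = {}
    for a, b in combinations(sorted(genotypes), 2):
        # A doublet is modeled as a 50/50 mixture of reads from two samples.
        mix = {
            snp: (p_alt(genotypes[a][snp]) + p_alt(genotypes[b][snp])) / 2
            for snp in genotypes[a]
        }
        doublets[(a, b)] = log_lik(reads, mix)

    best_singlet = max(singlets, key=singlets.get)
    if doublets:
        best_doublet = max(doublets, key=doublets.get)
        if doublets[best_doublet] > singlets[best_singlet]:
            return "doublet", best_doublet
    return "singlet", best_singlet


# Hypothetical genotypes for two pooled samples at three SNPs.
genotypes = {
    "S1": {"snp1": 0, "snp2": 2, "snp3": 0},
    "S2": {"snp1": 2, "snp2": 0, "snp3": 0},
}
clean = [("snp1", False), ("snp2", True), ("snp2", True), ("snp3", False)]
mixed = [("snp1", True), ("snp1", False), ("snp2", True), ("snp2", False)]
print(assign_droplet(clean, genotypes))
print(assign_droplet(mixed, genotypes))
```

A droplet whose reads agree with one sample's genotype scores best as a singlet, while a droplet showing both alleles at sites where the samples are opposite homozygotes is better explained by the mixture model and is flagged as a doublet.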


MIDAS Health Sciences Challenge Symposium


Data-intensive health science is one of the research focus areas that MIDAS supports with its Challenge Awards.  Our long-term goal is to support this research area more broadly, using the Challenge Award projects as the starting point to build a critical mass.  This symposium offers a platform for all participants to explore collaboration opportunities and aims to attract more researchers to our hub.  It will feature in-depth presentations from three Challenge Award teams, and all participants are encouraged to submit posters on data-intensive health science research.


9 am to 12:50 pm: Welcome and Challenge Award Presentations

12:50 pm to 2:15 pm: Lunch, Poster Session and Networking [poster dimensions: up to 6 ft wide x 4 ft high]

2:15 pm: Panel Discussion: The Future of Data-intensive Health Sciences at U-M.

  • Panelists: Brian Athey (Moderator), Marisa Eisenberg, Jun Li, Brahmajee Nallamothu, Srijan Sen, Kevin Ward

Please register online.  Please submit poster abstracts (< 300 words).  Submission Deadline: April 28.

For questions: midas-research@umich.edu.

MIDAS Trustworthy Data Science Working Group


The Michigan Institute for Data Science (MIDAS) is convening a research working group on Trustworthy Data Science.  We had a working group meeting last summer in response to an NSF funding announcement on secure and trustworthy cyberspace, and would like to expand to cover a wider range of research topics under the broad theme of “Trustworthy Data Science”.  This will include research and its application on data security, privacy, data fairness, validity, and sensible applications to policy.  Such topics are essential in data science methodology and tools development, and in many research areas including healthcare, education, business and finance, sustainability, and social sciences.  Our working group welcomes methodologists as well as researchers in any research area who take these issues into consideration.  We hope to create an interdisciplinary forum that will foster innovative ideas and new collaboration.


  1. Introduction
    1. Each participant has 2-3 minutes (based on the number of participants) to describe: a) their research focus, and b) their interest in any aspect of data security, privacy, fairness and validity.
  2. Presentation
    1. Dr. H. V. Jagadish (EECS) will give an overview of these research areas and major issues.
  3. Open discussion on ideas, collaboration and interesting funding opportunities.

Future Plan: Based on the interest of participants, MIDAS will hold regular meetings on Trustworthy Data Science (in the form of chalk talks, discussion of funding announcements, etc.), to foster innovative ideas and collaboration.

Please sign up using the online form.  

For questions, please contact Jing Liu, MIDAS Senior Scientist and Industry Partnership Leader (ljing@umich.edu; 734-764-2750).  Please share this announcement with your colleagues who might be interested.

Data for Public Good Symposium


+ Are you interested in working alongside community partners around data and evaluation?

+ Do you want to learn how to use your data skills for justice?

+ Do you want to connect with students and student organizations who are using data for social good?

Join us for a symposium on April 13th bringing together graduate students, faculty, and staff from across the university discussing effective methods for justice-oriented approaches to community-facing data projects.

Activities Include:
+ Asset Mapping
+ Building muscles for collaboration
+ Skills-sharing
RSVP by April 6th: https://goo.gl/5Bdqke


Interdisciplinary Committee on Organizational Studies (ICOS) Big Data Summer Camp, May 14-18

By | Data, Educational, General Interest, Happenings, News
Social and organizational life are increasingly conducted online through electronic media, from emails to Twitter feeds to dating sites to GPS phone tracking. The traces these activities leave behind have acquired the (misleading) title of “big data.” Within a few years, a standard part of graduate training in the social sciences will include a hefty dose of “use of big data,” and we will all be utilizing terms like API and Python.
This year ICOS, MIDAS, and ARC are again offering a one-week “big data summer camp” for doctoral students interested in organizational research, with a combination of detailed examples from researchers; hands-on instruction in Python, SQL, and APIs; and group work to apply these ideas to organizational questions.  Enrollment is free, but students must commit to attending all day for each day of camp, and be willing to work in interdisciplinary groups.

The dates of the camp are all day May 14th-18th.

IOE 899 Seminar Series: Stanley Hamstra, PhD, Milestones Research & Evaluation, Accreditation Council for Graduate Medical Education


Stanley J. Hamstra, PhD

VP, Milestones Research and Evaluation, Accreditation Council for Graduate Medical Education


“Learning Analytics in Graduate Medical Education: Realizing the Promise of CBME with Milestones Achievement Data”

Abstract: In 2012, the Accreditation Council for Graduate Medical Education (ACGME) introduced the Next Accreditation System (NAS) for improving postgraduate medical education. An important component of the NAS is a shift towards competency-based medical education (CBME), involving milestones as markers of achievement during training. Since 2015, the ACGME has been collecting milestones achievement data (competency ratings) on all resident and fellow physicians in accredited training programs in the USA (n > 110,000 residents and fellows per year). A critical assumption in CBME is that assessment data regarding any learner (in any form) contains some degree of uncertainty. At the same time, program directors must make finite/binary decisions about learners at the time of graduation, and indeed throughout training. The availability of milestones data, in the context of national trends, gives the program director an additional tool for making the best decisions regarding learner progression (and ultimately graduation). I will briefly review tools we have developed to help program directors make use of milestones data to enhance the quality of their decisions regarding resident progression and graduation. In addition, I will outline an approach to using the data for enhancing national curricula within a specialty.

Bio: Dr. Hamstra is responsible for oversight and leadership regarding research in Milestones and assessment systems that inform decisions around resident physician progression and board eligibility. Dr. Hamstra works with medical subspecialty societies, program director organizations, the American Board of Medical Specialties, and specialty certification boards. His research addresses medical education broadly, including competency assessment for residency training programs, and developing administrative support for educational scholarship within academic health settings. Prior to joining the ACGME, Dr. Hamstra was at the University of Michigan, the University of Ottawa, and the University of Toronto Department of Surgery. He has also worked closely with the Royal College of Physicians and Surgeons of Canada on developing policies regarding competency-based medical education for graduate medical education. Dr. Hamstra received his PhD in sensory neuroscience from York University in Toronto in 1994.

Women in Data Science: Stanford University, March 5, 2018


Women in Data Science (WiDS) Conference and Datathon

Registration for Livestream


The Global Women in Data Science (WiDS) Conference aims to inspire and educate data scientists worldwide, regardless of gender, and support women in the field. This annual one-day technical conference provides an opportunity to hear about the latest data science related research and applications in a broad set of domains. All genders are invited to participate in the conference, which features exclusively female speakers.

Next WiDS Conference: March 5, 2018 at Stanford University & 100+ locations worldwide
WiDS will be held at Stanford University and at 100+ regional events hosted by WiDS Ambassadors, and will be available via livestream. The 2018 program will feature fantastic speakers on a broad array of topics ranging from cybersecurity to astrophysics to computational finance, and more. Register now for an event near you.

New for 2018: WiDS Datathon
This year, we’ll be conducting the first-ever WiDS Datathon, a joint effort between Stanford, Kaggle (a Google company), Intuit, InterMedia (a recipient of the Bill & Melinda Gates Foundation), and West Big Data Innovation Hub. The datathon runs from February 1-28, 2018, and winners will be announced at our March 5, 2018, conference at Stanford.

2017 Conference Highlights

  • 75,000+ participants from 75 countries via live stream and Facebook Live, at regional events or online
  • 80+ regional events worldwide from 30 countries, simultaneous or delayed broadcast, many with regional speakers.
  • #WiDS2017 hashtag trended on Twitter all day long
  • WiDS Stanford: 400 attendees from 31 universities and 114 companies and other organizations, with 1/3 students and 2/3 academics and industry professionals
  • 33 distinguished female speakers, moderators, and panelists

MIDAS Working Group: Teaching Data Science



The Michigan Institute for Data Science (MIDAS) continues to convene a working group on teaching data science. As we incorporate data science into almost every level of teaching, many issues need to be thoroughly thought out: How do we teach data science to students with various levels of preparation, from those with little quantitative training to STEM students? How do we build data science modules to incorporate into existing domain science courses? How do we raise awareness of ethics and social responsibility in data science teaching? How do we teach data science to independent researchers, including faculty, who want to build data science into their research? What teaching resources are available at U-M? Our working group welcomes anyone interested in these topics. We are developing an interdisciplinary team to foster new ideas and collaborations in the development of data science teaching methods and materials.

Please RSVP.  

The agenda for the meeting includes:

  • Introduction
  • Short presentations
    • Kerby Shedden (Professor, Statistics, and CSCAR director) will share insight from his experience teaching “capstone” style courses for undergraduate and MS students, based around case studies and focus on methods, formulating good questions, and writing.
    • Heather Mayes (Assistant Professor, Chemical Engineering) will talk about the design of a Data Science ramp-up course for engineering students and how to integrate it with existing course offerings.
    • Aaron Keys (data scientist, Airbnb) will give the industry perspective on the various training paths that students can take for a career in data science.
  • Open discussion of ideas and collaboration, and sharing resources

For questions, please contact Jing Liu, MIDAS Senior Scientist and Industry Partnership Leader (ljing@umich.edu; 734-764-2750).

MIDAS Working Group: Data Integration



Data integration is an essential component of data science research in almost all research areas that use heterogeneous data varying in format, dimensionality, quality and granularity.  The examples are endless: multi-omics data integration is increasingly critical in biological research; clinical research benefits greatly from the integration of patient longitudinal data, lab data, sensor data and other types of diagnosis and self-report; environmental monitoring often needs the integration of statistical data, image data and geospatial data; social science research, including education, political science and economics, increasingly integrates social media and other web-based data with traditional survey data…  All the applications encounter similar data science challenges, including idiosyncratic integration methods, missing data, bias and coverage, consistency and quality control issues.  Our working group welcomes researchers with interest in data integration methodology and its application in any scientific domain.  The Michigan Institute for Data Science (MIDAS) continues to convene a research working group on data integration to create a forum that will foster new ideas and collaborations.

Please RSVP.


  • Introduction
  • Chalk talks
    • Yang Chen (Assistant Professor, Dept. Statistics) will talk about her experience with data integration and some statistical methodology, and gauge interest in collaboration.
    • Jamie Estill (staff scientist, HITS) will describe at a high level the capabilities and strength of data virtualization for data integration, using medical research examples, and discuss with the group how data virtualization can facilitate their research.
  • Open discussion on ideas and collaboration.

For questions, please contact Jing Liu, MIDAS Senior Scientist and Industry Partnership Leader (ljing@umich.edu; 734-764-2750).  Please share this announcement with your colleagues who might be interested.

CHEPS Seminar: Sung Won Choi, MD, MS, University of Michigan


Sung Won Choi, MD, MS

Associate Professor, Pediatrics

Inaugural Edith S. Briskin / Shirley K Schlafer Research Professor of Pediatrics

Michigan Medicine

The University of Michigan


“Multi-dimensional, Highly Time-resolved Big Data Approach for Disease Prevention”

Abstract: Individualized prediction of disease (and disease‐related events) is a major unmet challenge, yet is essential for realizing the full potential of personalized medicine. Underlying the prediction problem is the fact that disease processes, and the human hosts in which they occur, represent complex dynamical systems comprised of large numbers of components that interact in non‐linear ways over time. A key insight from complexity science is that accurate long‐term prediction in such systems is usually not feasible, but short‐term predictions can be successful if multi‐parameter, highly time‐resolved data can be collected and integrated using computational methods. Complexity science indicates that prediction of disease needs to be done on an ongoing basis, in near “real‐time”, because complex dynamical processes tend to proceed non‐linearly. There are “windows of opportunity” when signal begins to exceed background noise and the disease process is early enough for intervention to be successful. Please join Dr. Choi as she discusses how she and her collaborators, including Dr. Wiens (Computer Science/Machine Learning), Dr. Tewari (Medical Oncology), Dr. Kurabayashi (Mechanical Engineering), and Dr. Li (Computational Biology), are using the blood and marrow transplantation setting as an ideal model system to prototype such an approach for disease prediction that is consistent with the highly complex nature of human disease.

Bio: Sung Won Choi, MD, MS trained as a pediatric resident at New York University and later as a fellow in pediatric hematology‐oncology at the University of Michigan. Through an NIH K23 award, Sung received additional training in Clinical Research Design and Statistical Analysis through the University of Michigan School of Public Health. She is currently an Associate Professor in the Department of Pediatrics, and in 2017, she was named the inaugural Edith S. Briskin / Shirley K Schlafer Research Professor of Pediatrics. Sung specializes in the field of blood and marrow transplantation (BMT) and is recognized for her work in translating the use of histone deacetylase inhibition in BMT patients for prevention of a devastating complication known as graft‐versus‐host disease (GVHD). She enjoys translational research initiatives that include the use of novel, non‐steroidal therapeutics both in the prevention and treatment of GVHD. Her research efforts focus on: 1) providing an improved understanding of clinical BMT through translation of experimental studies; 2) exploring clinical outcomes in BMT patients alongside laboratory correlates; and 3) leveraging novel tools, such as information technology, to support patient‐ and caregiver‐centered care in her clinical and translational research efforts in BMT.

The seminar series “Providing Better Healthcare through Systems Engineering” is presented by the U‐M Center for Healthcare Engineering and Patient Safety (CHEPS): Our mission is to improve the safety and quality of healthcare delivery through a multi‐disciplinary, systems‐engineering approach.