U-M, MIDAS researchers supported by Chan Zuckerberg Initiative

Several University of Michigan researchers, including faculty affiliated with MIDAS, recently received support from the Chan Zuckerberg Initiative under its Human Cell Atlas project.

The project seeks to create a shared, open reference atlas of all cells in the healthy human body as a resource for studies of health and disease. The project is funding a variety of software tools and analytic methods. The U-M projects are listed below:

Identifying genetic markers: dimension reduction and feature selection for sparse data
Investigator: Anna Gilbert, Department of Mathematics, MIDAS Core Faculty Member
Description: One of the modalities that scientists participating in the Human Cell Atlas will use to gather data is single cell RNA sequencing (scRNA-seq). The analysis of scRNA-seq data, however, poses novel biological and algorithmic challenges. The data are high dimensional and do not necessarily fall into distinct clusters (indeed, some cell types exist along a continuum or developmental trajectory). In addition, data values are missing. To analyze these data, we must adjust our dimension reduction algorithms accordingly and either fill in the missing values or determine their impact quantitatively. Furthermore, none of these steps is performed in isolation; they are part of a principled data analysis pipeline. This work will leverage over a decade of modern, sparsity-based machine learning methods and apply them to dimension reduction, marker selection, and data imputation for scRNA-seq data. In one of our two feature selection methods, we adapt a 1-bit compressed sensing algorithm (1CS) introduced by Genzel and Conrad. In order to select markers, the algorithm finds optimal hyperplanes that separate the given clusters of cells and that depend only on a small number of genes. The second method is based on the mutual information (MI) framework developed in. This algorithm greedily builds a set of markers out of a set of statistically significant genes that maximizes information about the target clusters and minimizes redundancy between markers. The imputation algorithms use sparse data models to impute missing values and are tailored to integer counts.
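The greedy, information-based marker selection described above can be illustrated with a minimal sketch. It is not the investigators' algorithm: it assumes expression has already been discretized, scores each candidate gene as its mutual information with the cluster labels minus its average mutual information with the markers already chosen, and uses hypothetical gene names and toy data.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def greedy_markers(expr, labels, k):
    """Pick k genes greedily: high MI with the labels, low mean MI with genes already picked.

    expr maps a gene name to its discretized expression value in each cell.
    """
    relevance = {g: mutual_information(v, labels) for g, v in expr.items()}
    chosen = []
    while len(chosen) < k and len(chosen) < len(expr):
        def score(g):
            if not chosen:
                return relevance[g]
            redundancy = sum(
                mutual_information(expr[g], expr[c]) for c in chosen
            ) / len(chosen)
            return relevance[g] - redundancy

        best = max((g for g in expr if g not in chosen), key=score)
        chosen.append(best)
    return chosen


# Toy data (assumed): 8 cells in 4 clusters, expression binarized to 0 = off, 1 = on.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
expr = {
    "geneA": [0, 0, 0, 0, 1, 1, 1, 1],       # separates clusters {0, 1} from {2, 3}
    "geneA_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # carries the same information as geneA
    "geneB": [0, 0, 1, 1, 0, 0, 1, 1],       # separates clusters {0, 2} from {1, 3}
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],       # independent of the clusters
}
markers = greedy_markers(expr, labels, 2)
print(markers)
```

On this toy input the second pick skips the redundant copy of the first marker, which is the behavior the redundancy penalty is meant to produce.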

Computational tools for integrating single-cell RNA sequencing studies with genome-wide association studies
Investigator: Xiang Zhou, Biostatistics
Description: Single cell RNA sequencing (scRNAseq) has emerged as a powerful tool in genomics. Unlike previous bulk RNAseq that measures average expression levels across many cells, scRNAseq can measure gene expression at the single cell level. The high resolution of scRNAseq has thus far transformed genomics: scRNAseq has been applied to classify novel cell-subpopulations and states, quantify progressive gene expression, perform spatial mapping, identify differentially expressed genes, and investigate the genetic basis of expression variation. While many computational tools have been developed for analyzing scRNAseq data, tools for effective integrative analysis of scRNAseq with other existing genetic/genomic data types are underdeveloped. Here, we propose to extend our previous integrative methods and develop novel computational tools for integrating scRNAseq data with genome-wide association studies (GWASs). Our proposed tools will identify cell-subpopulations relevant to GWAS diseases or traits, facilitate the interpretation of association results, catalyze more powerful future association studies, and help understand disease etiology and the genetic basis of phenotypic variation. The proposed tools will be applied to integrate summary statistics from various GWASs with fine-scale cell-subpopulations identified from the Human Cell Atlas (HCA) project, to maximize the impact of HCA and facilitate our understanding of the genetic architecture of various human traits and diseases — a question of central importance to human health.

Joint analysis of single cell and bulk RNA data via matrix factorization
Investigator: Clayton Scott, Electrical Engineering and Computer Science, MIDAS Affiliated Faculty
Description: Single cell RNA sequencing (scRNAseq) is a recently developed platform that enables the measurement of thousands of gene expression levels in individual cells of a tissue sample of interest. The ability to quantify gene expression at the cell level has great potential for advancing our understanding of the cellular processes that characterize a broad range of biological phenomena. However, compared with older bulk RNA technology, which measures expression levels of large numbers of cells in aggregate, scRNAseq data have higher levels of measurement noise, which complicates their analysis. Furthermore, inferring cell type from scRNAseq data is an unsupervised machine learning problem, which is difficult even without high measurement noise. To address these issues, we propose a mathematical and algorithmic framework to infer cellular characteristics by analyzing single cell and bulk RNA data simultaneously, via an approach grounded in matrix factorization. The developed algorithms will be evaluated on real data gathered by researchers at the University of Michigan who study breast cancer and spermatogenesis.

Integrating single cell profiles across modalities using manifold alignment
Investigator: Joshua Welch, Computational Medicine and Bioinformatics
Description: Integrating the variation underlying different types of single cell measurements is a critical step toward a comprehensive catalog of human cell types. The ideal approach to construct a cell type atlas would use high-throughput single cell multi-omic profiling to simultaneously measure all cellular modalities of interest within each cell. Although this approach is currently out of reach, it is possible to separately perform high-throughput transcriptomic, epigenomic, and proteomic measurements at the single cell level. Computationally integrating multiple data modalities measured on different individual cells can circumvent the experimental challenges of multi-omic profiling. If different types of single cell measurements are performed on distinct single cells from a common population, each modality will sample a similar set of cells. Matching up similar cells to infer multimodal profiles enables some analyses for which multi-omic profiling is desirable, including multimodal cell type definition and studying covariance among different data types. Manifold alignment is a powerful computational technique for integrating multiple sources of data that describe the same set of events by discovering the common manifold (general geometric shape) that underlies them. Previously, we showed that transcriptomic and epigenomic measurements performed on distinct single cells share underlying sources of variation. We developed a computational method, MATCHER, which uses manifold alignment to integrate cell trajectories constructed from these measurements and infer single cell multi-omic profiles. Here, we will extend this approach to match multimodal single cell profiles sampled from an entire tissue.

Computational methods to enable robust and cost-effective multiplexing of single cell RNA-seq experiments at population scale
Investigator: Hyun Min Kang, Biostatistics
Description: With the advent of single-cell genomic technologies, the Human Cell Atlas (HCA) seeks to create reference maps of each individual cell type and to understand how they develop and maintain their functions, how they interact with each other, and which environmental and/or genetic changes trigger molecular dysfunction that leads to disease. To achieve these goals, it becomes increasingly important to creatively integrate single-cell genomic technologies with novel computational methods to maximize the potential of the new technological advances. Recently, our group has developed a computational tool, demuxlet, that enables population-scale multiplexing of droplet-based single-cell RNA-seq (dscRNA-seq) experiments. Our approach harnesses natural genetic variation carried within dscRNA-seq reads to multiplex cells from many samples in a single library prep, and statistically deconvolutes the sample identity of each barcoded droplet while filtering out multiplets (droplets that contain two or more cells). In this proposal, we aim to extend our method to increase its accuracy by harnessing cell-specific expression levels, and to eliminate the constraint requiring external genotype data. We will enable application of these methods through production, distribution, and support of efficient, well-documented, open-source software, and test these tools through analysis of simulated data and of real dscRNA-seq data.
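The idea of using genetic variation in reads to recover a droplet's sample identity can be sketched as a likelihood comparison. This is a simplified illustration, not demuxlet's actual model or interface: it assumes known genotypes for each candidate sample, a fixed sequencing error rate, and singlet droplets only (no multiplet detection). All names and values are hypothetical.

```python
import math


def log_likelihood(reads, genotypes, error=0.01):
    """Log-likelihood that a droplet's reads came from a sample with the given genotypes.

    reads: list of (snp_index, allele), where allele 1 = alternate and 0 = reference.
    genotypes: alternate-allele count (0, 1, or 2) at each SNP for one sample.
    """
    total = 0.0
    for snp, allele in reads:
        p_alt = genotypes[snp] / 2  # chance a read carries the alternate allele
        p_alt = p_alt * (1 - error) + (1 - p_alt) * error  # allow for sequencing error
        total += math.log(p_alt if allele == 1 else 1 - p_alt)
    return total


def assign_sample(reads, sample_genotypes):
    """Return the sample whose genotypes best explain the droplet's reads."""
    return max(sample_genotypes, key=lambda s: log_likelihood(reads, sample_genotypes[s]))


# Toy data (assumed): two donors genotyped at four SNPs, one droplet with four reads.
sample_genotypes = {
    "donor1": [0, 2, 1, 0],
    "donor2": [2, 0, 1, 2],
}
droplet_reads = [(0, 0), (1, 1), (3, 0), (2, 1)]
best_sample = assign_sample(droplet_reads, sample_genotypes)
print(best_sample)
```

Reads at SNPs where the donors share a genotype (SNP 2 here) contribute equally to both likelihoods, so only the differing SNPs drive the assignment.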


Biostatistics Seminar: Jeffrey Morris, UT MD Anderson Cancer Center

Jeffrey Morris, Ph.D.

Professor, Deputy Chair Ad Interim, Department of Biostatistics

The University of Texas MD Anderson Cancer Center


Bayesian Quantile Functional Regression for Biomedical Imaging Data

Abstract: In many areas of science, technological advances have led to devices that produce an enormous number of measurements per subject. Frequently, researchers deal with these data by extracting summary statistics (e.g., mean or variance) and then modeling those, but this approach can miss key insights when the summaries do not capture all of the relevant information in the raw data. One of the key challenges in modern statistics is to devise methods that can extract information from these big data while avoiding reductionist assumptions. In this talk, we will discuss methods for modeling the entire distribution of the measurements observed for each subject and relating properties of the distribution to covariates. Our approach is to represent the observed data as an empirical quantile function for each subject, and then regress these quantile functions on a set of scalar predictors, an approach we call quantile functional regression. We introduce custom basis functions called “quantlets” to represent the quantile functions; they are orthogonal and empirically defined, and so adapt to the features of the given data set. After fitting the quantile functional regression, we are able to perform global tests for which covariates have an effect on any aspect of the distribution, and then follow up with local tests to characterize these differences, identifying at which quantiles the differences lie and/or assessing whether the covariate affects certain major aspects of the distribution, including location, scale, skewness, or Gaussian-ness, while accounting for multiple testing. If the differences lie in these commonly used summaries, our method can still detect them, but it will also not miss effects on aspects of the distribution outside of these summaries.
We illustrate this method on biomedical imaging data for which we relate the distribution of pixel intensities to various demographic and clinical characteristics, but the method has wide-ranging application to many areas including climate modeling, genomics, electronic medical records, and wearable computing devices. Time permitting, I will also illustrate the method in some of these areas.
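The first two steps in the abstract, building an empirical quantile function per subject and regressing it on a scalar predictor, can be sketched as follows. This is only a rough illustration under simplifying assumptions: it fits a separate ordinary least-squares slope at each quantile level, with no quantlet basis, no Bayesian modeling, and no multiple-testing adjustment. The function names and toy data are hypothetical.

```python
from statistics import mean, quantiles


def empirical_quantile_function(values, n_levels=19):
    """Empirical quantiles of one subject's measurements on an evenly spaced grid of levels."""
    # quantiles(..., n=n_levels + 1) returns n_levels cut points (0.05, 0.10, ..., 0.95 by default)
    return quantiles(values, n=n_levels + 1, method="inclusive")


def slope_per_quantile(covariate, quantile_functions):
    """Ordinary least-squares slope of each quantile level on a scalar covariate."""
    x_bar = mean(covariate)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    slopes = []
    for level_values in zip(*quantile_functions):  # one tuple per quantile level
        y_bar = mean(level_values)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, level_values))
        slopes.append(sxy / sxx)
    return slopes


# Toy data (assumed): the covariate changes the spread of each subject's measurements,
# not their center, so a mean-only summary would show no effect.
base = [v / 10 for v in range(-50, 51)]
covariate = [0, 1, 2, 3, 4, 5]
subjects = [[(1 + 0.5 * x) * v for v in base] for x in covariate]

quantile_functions = [empirical_quantile_function(s) for s in subjects]
slopes = slope_per_quantile(covariate, quantile_functions)
print(slopes[0], slopes[9], slopes[-1])  # slopes at the 0.05, 0.50, and 0.95 levels
```

In this toy setting the slope is negative in the lower tail, zero at the median, and positive in the upper tail, which is the kind of scale effect that a regression on subject means alone would not reveal.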

Light refreshments for seminar guests will be served at 3:00 p.m. in 3755.

UM Biostatistics Seminar: Veronika Rockova, PhD, University of Chicago


Veronika Rockova, Ph.D.

Assistant Professor in Econometrics and Statistics

The University of Chicago Booth


‘Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity’

Abstract: Rotational post hoc transformations have traditionally played a key role in enhancing the interpretability of factor analysis. Regularization methods also serve to achieve this goal by prioritizing sparse loading matrices. In this work, we bridge these two paradigms with a unifying Bayesian framework. Our approach deploys intermediate factor rotations throughout the learning process, greatly enhancing the effectiveness of sparsity inducing priors. These automatic rotations to sparsity are embedded within a PXL-EM algorithm, a Bayesian variant of parameter-expanded EM for posterior mode detection. By iterating between soft-thresholding of small factor loadings and transformations of the factor basis, we obtain (a) dramatic accelerations, (b) robustness against poor initializations, and (c) better oriented sparse solutions. To avoid the prespecification of the factor cardinality, we extend the loading matrix to have infinitely many columns with the Indian buffet process (IBP) prior. The factor dimensionality is learned from the posterior, which is shown to concentrate on sparse matrices. Our deployment of PXL-EM performs a dynamic posterior exploration, outputting a solution path indexed by a sequence of spike-and-slab priors. For accurate recovery of the factor loadings, we deploy the spike-and-slab LASSO prior, a two-component refinement of the Laplace prior. A companion criterion, motivated as an integral lower bound, is provided to effectively select the best recovery. The potential of the proposed procedure is demonstrated on both simulated and real high-dimensional data, which would render posterior simulation impractical. Supplementary materials for this article are available online.
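The soft-thresholding step mentioned in the abstract can be illustrated in isolation. The sketch below shows only the generic soft-thresholding operator applied to a small loading matrix; it does not implement PXL-EM, the factor rotations, or the spike-and-slab LASSO prior, and the threshold value and toy loadings are arbitrary assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


# Toy loading matrix (assumed): rows are observed variables, columns are factors.
loadings = [
    [0.90, 0.05],
    [-0.02, 0.70],
    [0.10, -0.40],
]
sparse_loadings = [[soft_threshold(v, 0.1) for v in row] for row in loadings]
for row in sparse_loadings:
    print(row)
```

Small loadings are set exactly to zero while large ones are shrunk by the threshold, which is how thresholding produces a sparse loading matrix.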

Bio: Veronika Rockova is Assistant Professor in Econometrics and Statistics at the University of Chicago Booth School of Business. Her work brings together statistical methodology, theory and computation to develop high-performance tools for analyzing large datasets. Her research interests reside at the intersection of Bayesian and frequentist statistics, and focus on: data mining, variable selection, optimization, non-parametric methods, factor models, high-dimensional decision theory and inference. She has authored a variety of published works in top statistics journals. In her applied work, she has contributed to the development of risk stratification and prediction models for public reporting in healthcare analytics.

Prior to joining Booth, Rockova held a Postdoctoral Research Associate position at the Department of Statistics of the Wharton School at the University of Pennsylvania. Rockova holds a PhD in biostatistics from Erasmus University (The Netherlands), an MSc in biostatistics from Universiteit Hasselt (Belgium) and both an MSc in mathematical statistics and a BSc in general mathematics from Charles University (Czech Republic).

Besides enjoying statistics, she is a keen piano player.


Light refreshments for seminar guests will be served at 3:00 p.m. in 3755.

Biostatistics Seminar: Jonathan Terhorst, PhD Candidate, University of California, Berkeley


Jonathan Terhorst, PhD Candidate

Statistics, University of California at Berkeley


“Robust and Scalable Inference of Population History and Selection from Hundreds of Whole Genomes”

Abstract: Demographic inference refers to the problem of inferring past population events (migrations, admixture, expansions, etc.) from patterns of mutations in sampled DNA. Apart from the intrinsic appeal of understanding the origins of our species, this type of analysis is useful for forming a null model of human evolution, departures from which signal the presence of natural selection, population structure, and other interesting phenomena.

In this talk I will discuss recent statistical and computational innovations which enable us to infer demography using modern data sets consisting of hundreds of whole-genome sequences obtained from populations all over the world. These include momi, a new software package for stable and rapid computation of the expected sample frequency spectrum (SFS) under complex demographic scenarios involving numerous diverged populations, as well as SMC++, a new probabilistic framework which couples the genealogical process for a given individual with allele frequency information from a large related panel. Using these tools, I will demonstrate how we can learn about human expansion in the last 12,000 years, understand the mysterious origins of ancient DNA samples, and estimate when Europeans acquired lighter skin and the ability to digest lactose. Finally, I will discuss some of the statistical aspects of these estimators, in particular an information-theoretic lower bound on the error rate of any SFS-based demographic inference procedure.

All relevant theory will be introduced during the talk; no prior knowledge of population genetics is assumed. Portions of this work are joint with Jack Kamm, Pier Palamara, and Yun Song.
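For readers unfamiliar with the sample frequency spectrum (SFS) named in the abstract, the sketch below computes an observed SFS from a small matrix of haplotypes. It is an illustration only and is unrelated to how momi computes the expected SFS under a demographic model. The input encoding (0 = ancestral allele, 1 = derived allele) and the toy data are assumptions.

```python
from collections import Counter


def sample_frequency_spectrum(haplotypes):
    """Count sites by derived-allele count; entry i - 1 is the number of sites with i derived copies.

    haplotypes: list of equal-length 0/1 sequences (rows = sampled chromosomes, columns = sites).
    """
    n = len(haplotypes)
    counts = Counter(sum(column) for column in zip(*haplotypes))
    # Keep segregating sites only: derived-allele counts 1 .. n - 1
    return [counts.get(i, 0) for i in range(1, n)]


# Toy data (assumed): 4 sampled chromosomes, 6 sites.
haplotypes = [
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
sfs = sample_frequency_spectrum(haplotypes)
print(sfs)
```

Sites where no sample or every sample carries the derived allele are not segregating and are left out of the spectrum.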

Bio: I am a PhD student in the statistics department at UC Berkeley. I’m interested in statistical / population genetics, machine learning, and generally developing mathematical models and software to help fellow scientists understand their data.

Light refreshments will be served at 3:10 p.m. in room 1690.

Biostatistics Seminar: Kenneth Lange, Professor of Biomathematics, Human Genetics and Statistics, UCLA: “Next Generation Statistical Genetics”

This talk will discuss how modern data mining techniques can be imported into statistical genetics. Most relevant models now invoke high-dimensional optimization. Penalization and set projection give sparsity. Separation of variables gives parallelization. Time permitting, these ideas will be illustrated by several examples: estimation of ethnic ancestry, genotype imputation via matrix completion, conversion of imputed genotypes into haplotypes, matrix completion discriminant analysis, estimation in the linear mixed model, iterative hard thresholding in GWAS, and sparse principal components analysis.
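Of the examples listed, iterative hard thresholding is the easiest to sketch: each iteration takes a gradient step on the least-squares loss and then projects onto the set of vectors with at most k nonzero entries. The code below is a generic illustration rather than the speaker's GWAS implementation. The toy design matrix has orthonormal columns, so a step size of 1 is safe here; in general the step size has to be chosen to suit the design.

```python
def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and set the rest to zero."""
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    keep = set(order[:k])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]


def iterative_hard_thresholding(X, y, k, step, n_iter=50):
    """Gradient steps on 0.5 * ||y - X beta||^2, each followed by projection onto k-sparse vectors."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        residual = [yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(X, y)]
        gradient = [-sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        beta = hard_threshold([b - step * g for b, g in zip(beta, gradient)], k)
    return beta


# Toy data (assumed): a scaled 4 x 4 Hadamard matrix, so the columns are orthonormal.
X = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]
# Response generated from coefficients (2, 0, -3, 0) plus a small perturbation.
y = [-0.4, -0.6, 2.5, 2.6]
beta_hat = iterative_hard_thresholding(X, y, k=2, step=1.0)
print(beta_hat)
```

The projection step is what enforces sparsity: however the gradient step moves the coefficients, only the k largest survive each iteration.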