Dr. Zhu’s group conducts research on a range of topics in data science, from foundational methodologies to challenging applications. In particular, the group has been investigating the fundamental issues and techniques for supporting various types of queries (including range queries, box queries, k-NN queries, and hybrid queries) on large datasets in a non-ordered discrete data space (NDDS). A number of novel indexing and searching techniques that exploit the unique characteristics of an NDDS have been developed. The group has also been studying the issues and techniques for storing and searching large-scale k-mer datasets for various genome sequence analysis applications in bioinformatics. A virtual approximate store approach to supporting repetitive big data in genome sequence analyses and several new sequence analysis techniques have been proposed. In addition, the group has been researching the challenges and methods for processing and optimizing a new type of query, the so-called progressive query, which a user formulates on the fly in multiple steps. Such queries are widely used in many application domains, including e-commerce, social media, business intelligence, and decision support. Other research topics studied by the group include streaming data processing, self-managing databases, spatio-temporal data indexing, data privacy, Web information management, and vehicle drive-through wireless services.
The Smith lab group is primarily interested in examining evolutionary processes using new data sources and analysis techniques. We develop new methods to address questions about the rates and modes of evolution using the large data sources that have become more common in the biological disciplines over the last ten years. In particular, we use DNA sequence data to construct phylogenetic trees and conduct additional analyses about processes of evolution on these trees. In addition to this research program, we also address how new data sources can facilitate new research in evolutionary biology. To this end, we sequence transcriptomes, primarily in plants, with the goal of better understanding where, within the genome and within the phylogeny, processes like gene duplication and loss, horizontal gene transfer, and increased rates of molecular evolution occur.
Prof. Vershynin’s main area of expertise is high-dimensional probability and its applications. He is interested in random geometric structures that appear in various data science problems. The following is a sample of his recent projects:
1. High-dimensional inference from nonlinear data. Sometimes we are given certain observations of an unknown vector that encodes useful but hidden information, and we want to compute that vector. Examples include compressed sensing, linear and nonlinear regression, as well as binary (yes-no) observations. We are developing methods that can estimate the hidden vector without even knowing the nature of the nonlinearity of the observations. Areas of application include survey methodologies, signal processing, and various high-dimensional classification problems.
2. Structure mining in networks. Complex data sets such as networks often have latent structures, for example clusters or communities. We are interested in developing efficient methods to discover such latent structures.
Prof. Vershynin’s methods come from various areas of mathematics and data science, including random matrix theory, geometric functional analysis, convex and discrete geometry, geometric combinatorics, high-dimensional statistics, information theory, learning theory, signal processing, theoretical computer science, and numerical analysis.
I am broadly interested in statistical inference, which is informally defined as the process of turning data into prediction and understanding. I like to work with richly structured data, such as data extracted from texts, images, and other spatiotemporal signals. In recent years I have gravitated toward a field in statistics known as Bayesian nonparametrics, which provides a fertile and powerful mathematical framework for the development of many computational and statistical modeling ideas. My motivation for all this came originally from an early interest in machine learning, which continues to be a major source of research interest. A primary focus of my group’s research in machine learning is to develop more effective inference algorithms using stochastic, variational, and geometric viewpoints.
My current research explores the possibilities and limits of Markov chain Monte Carlo (MCMC) methods in dealing with posterior or quasi-posterior distributions that arise from high-dimensional Bayesian (or quasi-Bayesian) inference in regression and graphical models. I also have some interests in optimization, which revolve around whether (and how) stochastic methods can help tackle large-scale optimization problems of interest in statistics. In addition, I am interested in the use of remote sensing data to study social and environmental issues in Africa.