Dr. Kang’s research focuses on the development of statistical methods motivated by biomedical applications, with an emphasis on neuroimaging. His recent key contributions fall into the following three areas:
Bayesian regression for complex biomedical applications
Dr. Kang and his group developed a series of Bayesian regression methods for analyzing the association between clinical outcomes of interest (disease diagnosis, survival time, psychiatric scores) and potential biomarkers in biomedical applications such as neuroimaging and genomics. In particular, they developed a new class of threshold priors as compelling alternatives to the classic continuous shrinkage priors in the Bayesian literature and the widely used penalization methods in the frequentist literature. Dr. Kang’s methods can substantially increase the power to detect weak but highly dependent signals by incorporating useful structural information about the predictors, such as spatial proximity within brain anatomical regions in neuroimaging [Zhao et al 2018; Kang et al 2018; Xue et al 2019] and gene networks in genomics [Cai et al 2017; Cai et al 2019]. Dr. Kang’s methods can simultaneously select variables, quantify the uncertainty of the variable selection, and make inference on the effect sizes of the selected variables. His work provides a set of new tools that let biomedical researchers identify important biomarkers using different types of biological knowledge, with statistical guarantees. In addition, Dr. Kang’s work is among the first to establish rigorous theoretical justification for Bayesian spatial variable selection in imaging data analysis [Kang et al 2018] and Bayesian network marker selection in genomics [Cai et al 2019]. These theoretical contributions not only offer a deep understanding of the soft-thresholding operator on smooth functions, but also provide insight into which types of biological knowledge may be useful for improving biomarker detection accuracy.
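As a rough illustration of the thresholding idea mentioned above, the soft-thresholding operator sets small values exactly to zero and shrinks the remaining values toward zero, so a smooth latent function becomes a sparse, piecewise-smooth effect. The sketch below is only a minimal NumPy example; the function name `soft_threshold`, the threshold `lam`, and the toy values are assumptions made for illustration and are not taken from the cited papers.

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft-thresholding: sign(x) * max(|x| - lam, 0)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Toy latent values (made up): entries with |x| <= lam become exactly zero,
# the others are pulled toward zero by lam.
latent = np.array([-2.0, -0.5, 0.0, 0.3, 1.5])
effects = soft_threshold(latent, lam=1.0)
```

In this toy example only the first and last entries remain nonzero, which is how a thresholded prior can perform variable selection and effect-size estimation at the same time.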
Prior knowledge guided variable screening for ultrahigh-dimensional data
Dr. Kang and his colleagues developed a series of variable screening methods for ultrahigh-dimensional data analysis that incorporate useful prior knowledge in biomedical applications, including imaging [Kang et al 2017; He et al 2019], survival analysis [Hong et al 2018] and genomics [He et al 2019]. As a preprocessing step for variable selection, variable screening is a computationally fast approach to dimension reduction. Traditional variable screening methods overlook useful prior knowledge, and their practical performance is therefore unsatisfactory in many biomedical applications. To fill this gap, Dr. Kang developed a partition-based ultrahigh-dimensional variable screening method under generalized linear models, which can naturally incorporate the grouping and structural information available in biomedical applications. For settings where prior knowledge is unavailable or unreliable, Dr. Kang proposed a data-driven partition screening framework based on covariate grouping and investigated its theoretical properties. Two special cases proposed by Dr. Kang, correlation-guided partitioning and spatial-location-guided partitioning, are especially useful in practice for neuroimaging data analysis and genome-wide association analysis. When multiple types of grouping information are available, Dr. Kang proposed a novel, theoretically justified strategy for combining screening statistics from different partitioning methods. This provides a very flexible framework for incorporating different types of prior knowledge.
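To make the idea of partition-based screening more concrete, the following is a simplified sketch rather than the published method: covariates are split into groups, a joint least-squares model is fit within each group, and covariates are ranked by the magnitude of their fitted coefficients. It assumes a linear model in place of a general generalized linear model, and the function name `partition_screen`, the simulated data, and the partition into 10 groups of 5 are all hypothetical.

```python
import numpy as np

def partition_screen(X, y, groups, keep):
    """Rank covariates by |coefficient| from a joint least-squares fit
    within each group of a given partition; return the top `keep` indices."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    # Standardize columns so coefficient magnitudes are comparable.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    stats = np.zeros(X.shape[1])
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        coef, *_ = np.linalg.lstsq(Xs[:, idx], yc, rcond=None)
        stats[idx] = np.abs(coef)
    selected = np.argsort(stats)[::-1][:keep]
    return np.sort(selected), stats

# Simulated data (made up): only covariates 3 and 17 carry signal.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 3] - 1.5 * X[:, 17] + 0.5 * rng.standard_normal(n)
groups = np.repeat(np.arange(10), 5)  # hypothetical partition: 10 groups of 5
selected, stats = partition_screen(X, y, groups, keep=5)
```

Fitting jointly within a group, rather than one covariate at a time, is what lets grouping information (for example, spatial neighborhoods or correlated blocks) influence the screening statistics.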
Brain network modeling and inferences
Dr. Kang and his colleagues developed several new statistical methods for brain network modeling and inference using resting-state fMRI data [Kang et al 2016; Xie and Kang 2017; Chen et al 2018]. Because fMRI data are high-dimensional (over 100,000 voxels in a standard brain template) while sample sizes are small (hundreds of participants in a typical study), it is extremely challenging to model the brain functional connectivity network at the voxel level. Some existing methods model networks at the level of brain anatomical regions, using region-level summary statistics computed from voxel-level data. Those methods may suffer from low power to detect signals and from an inflated false positive rate, since the summary statistics may not adequately capture the heterogeneity within the predefined brain regions. To address these limitations, Dr. Kang proposed a novel method based on multi-attribute canonical correlation graphs [Kang et al 2016] to construct region-level brain networks from voxel-level data. His method can capture different types of nonlinear dependence between any two brain regions consisting of hundreds or thousands of voxels. He also developed permutation tests for assessing the significance of the estimated network. These methods can substantially increase the power to detect signals in small-sample problems. In addition, Dr. Kang and his colleague developed theoretically justified high-dimensional tests [Xie and Kang 2017] for constructing region-level brain networks from voxel-level data under the multivariate normal assumption. Their theoretical results provide useful guidance for the future development of statistical methods and theory for brain network analysis.
This image illustrates neuroimaging meta-analysis data (Kang et al 2014). Neuroimaging meta-analysis is an important tool for finding effects that are consistent across studies. We developed a Bayesian nonparametric model and performed a meta-analysis of five emotions from 219 studies. In addition, our model can make reverse inference, predicting the emotion type of a newly presented study. Our method outperforms other methods, with an average accuracy of 80%.
1. Cai Q, Kang J, Yu T (2020) Bayesian variable selection over large scale networks via the thresholded graph Laplacian Gaussian prior with application to genomics. Bayesian Analysis, In Press (Earlier version won a student paper award from Biometrics Section of the ASA in JSM 2017)
2. He K, Kang J, Hong G, Zhu J, Li Y, Lin H, Xu H, Li Y (2019) Covariance-insured screening. Computational Statistics and Data Analysis: 132, 100–114.
3. He K, Xu H, Kang J† (2019) A selective overview of feature screening methods with applications to neuroimaging data, WIREs Computational Statistics, 11(2) e1454
4. Chen S, Xing Y, Kang J, Kochunov P, Hong LE (2018). Bayesian modeling of dependence in brain connectivity, Biostatistics, In Press.
5. Kang J, Reich BJ, Staicu AM (2018) Scalar-on-image regression via the soft thresholded Gaussian process. Biometrika: 105(1) 165–184.
6. Xue W, Bowman D, Kang J (2018) A Bayesian spatial model to predict disease status using imaging data from various modalities. Frontiers in Neuroscience. 12:184. doi:10.3389/fnins.2018.00184
7. Jin Z*, Kang J†, Yu T (2018) Missing value imputation for LC-MS metabolomics data by incorporating metabolic network and adduct ion relations. Bioinformatics, 34(9):1555–1561.
8. He K, Kang J† (2018) Comments on “Computationally efficient multivariate spatio-temporal models for high-dimensional count-valued data”. Bayesian Analysis, 13(1) 289-291.
9. Hong GH, Kang J†, Li Y (2018) Conditional screening for ultra-high dimensional covariates with survival outcomes. Lifetime Data Analysis: 24(1) 45-71.
10. Zhao Y*, Kang J†, Long Q (2018) Bayesian multiresolution variable selection for ultra-high dimensional neuroimaging data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 15(2):537-550. (Earlier version won student paper award from ASA section on statistical learning and data mining in JSM 2014; It was also ranked as one of the top two papers in the student paper award competition in ASA section on statistics in imaging in JSM 2014)
11. Kang J, Hong GH, Li Y (2017) Partition-based ultrahigh dimensional variable screening, Biometrika, 104(4): 785-800.
12. Xie J#, Kang J# (2017) High dimensional tests for functional networks of brain anatomic regions. Journal of Multivariate Analysis, 156:70-88.
13. Cai Q*, Alvarez JA, Kang J†, Yu T (2017) Network marker selection for untargeted LC/MS metabolomics data, Journal of Proteome Research, 16(3):1261-1269
14. Kang J, Bowman FD, Mayberg H, Liu H (2016) A depression network of functionally connected regions discovered via multi-attribute canonical correlation graphs. NeuroImage, 41:431-441.
My main interest is theoretical statistics as applied to complex models, from semiparametric to ultrahigh-dimensional regression analysis. In particular, I am interested in the negative aspects of Bayesian and causal analysis as implemented in modern statistics.
An analysis of the position of SCOTUS judges.
I conduct research on the use of consumer-facing technologies for chronic disease self-management. My work predominantly centers on the use of mobile applications that collect and manage patient-generated health data over time.
Dr. Hemphill studies conversations in social media and aims to promote just access to social media spaces and their data. She uses computational approaches to modeling political topics, predicting and addressing toxicity in online discussions, and tracing linguistic adaptations among extremists. She also studies digital data curation and is especially interested in ways to measure and model data reuse so that we can make informed decisions about how to allocate data resources.
Transportation is the backbone of the urban mobility system and one of the greatest sources of environmental emissions and pollution. Making urban transportation efficient, equitable and sustainable is the main focus of my research. My students and I analyze small-scale survey data as well as large-scale spatiotemporal data to identify travel behavior trends and patterns at a disaggregate level using econometric methods, which we then scale up to the population level through predictive and statistical modeling. We also design our own data collection methods and instruments, be it a network of smart devices or stated preference experiments. Our expertise lies in identifying latent constructs that influence decisions and choices, which in turn dictate demands on the systems and subsystems. We use this expertise to design incentives and policy suggestions that can help promote sustainable and equitable multimodal transportation systems. Our team also uses data analytics, particularly classification and pattern recognition algorithms, to analyze crash context data and develop safety-critical scenarios for connected and automated vehicle (CAV) deployment. We have developed an online game based on such scenarios to promote safe shared mobility among teenagers and young adults, and we plan to expand research in that area. We are also currently expanding our research to explore the use of neural networks (NN) in context information synthesis.
This is a project where we used classification and Bayesian models to identify scenarios that are risky for pedestrians and bicyclists. We then developed an online game based on those scenarios for middle schoolers so that they are better prepared for shared road conflicts.
My areas of interest are control, estimation, and optimization, with applications to energy systems in transportation, automotive, and marine domains. My group develops model-based and data-driven tools to explore underlying system dynamics and understand the operational environments. We develop computational frameworks and numerical algorithms to achieve real-time optimization and explore connectivity and data analytics to reduce uncertainties and improve performance through predictive control and planning.
My core research focuses on the politics and measurement of human rights, discrimination, violence, and repression. I use computational methods to understand why governments around the world torture, maim, and kill individuals within their jurisdiction and the processes monitors use to observe and document these abuses. Other projects cover a broad array of themes but share a focus on computationally intensive methods and research design. These methodological tools, essential for analyzing data at massive scale, open up new insights into the micro-foundations of state repression and the politics of measurement.
People rely more on strong ties for job help in countries with greater inequality. Coefficients from 55 regressions of job transmission on tie strength are compared to measures of inequality (Gini coefficient), mean income per capita, and population, all measured in 2013. Gray lines indicate 95% confidence regions from 1000 simulated regressions that incorporate uncertainty in the country-level regressions (see below for more details). In each simulated regression we draw each country point from the distribution of regression coefficients implied by the estimate and standard error for that country and measure of tie strength. P values indicate the simulated probability that there is no relationship between tie strength and the other variable. Laura K. Gee, Jason J. Jones, Christopher J. Fariss, Moira Burke, and James H. Fowler. “The Paradox of Weak Ties in 55 Countries” Journal of Economic Behavior & Organization 133:362-372 (January 2017) DOI:10.1016/j.jebo.2016.12.004
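The simulation procedure described in this caption can be sketched roughly as follows: each country's coefficient is redrawn from a normal distribution centered at its estimate with its standard error as the spread, a cross-country regression is refit on each draw, and the 2.5th and 97.5th percentiles of the fitted lines give the confidence region. This is only a sketch under stated assumptions, not the authors' code. The function name `simulate_regressions`, the normal sampling distribution, the data (which are made up), and the reading of the simulated p value as the share of non-positive slopes are all assumptions.

```python
import numpy as np

def simulate_regressions(x, est, se, n_sim=1000, seed=0):
    """Propagate first-stage uncertainty into a country-level regression.

    x   : country-level covariate (e.g. a Gini coefficient)
    est : per-country regression coefficient estimates
    se  : per-country standard errors of those estimates
    """
    rng = np.random.default_rng(seed)
    x, est, se = (np.asarray(a, dtype=float) for a in (x, est, se))
    grid = np.linspace(x.min(), x.max(), 50)
    slopes = np.empty(n_sim)
    lines = np.empty((n_sim, grid.size))
    for s in range(n_sim):
        draw = rng.normal(est, se)                 # redraw each country point
        slope, intercept = np.polyfit(x, draw, 1)  # refit the regression line
        slopes[s] = slope
        lines[s] = intercept + slope * grid
    lower, upper = np.percentile(lines, [2.5, 97.5], axis=0)
    return grid, lower, upper, slopes

# Made-up inputs for 55 hypothetical countries.
rng = np.random.default_rng(1)
gini = rng.uniform(0.25, 0.60, size=55)
est = 0.1 + 0.5 * gini + rng.normal(0.0, 0.03, size=55)
se = np.full(55, 0.05)
grid, lower, upper, slopes = simulate_regressions(gini, est, se)
p_sim = np.mean(slopes <= 0)  # assumed reading of the simulated p value
```

Redrawing the country points, instead of treating the first-stage coefficients as known, keeps the width of the confidence region honest about how noisy each country's estimate is.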
My research involves developing novel data collection strategies and image reconstruction techniques for Magnetic Resonance Imaging. In order to accelerate data collection, we take advantage of features of MRI data, including sparsity, spatiotemporal correlations, and adherence to underlying physics; each of these properties can be leveraged to reduce the amount of data required to generate an image and thus speed up imaging time. We also seek to understand what image information is essential for radiologists in order to optimize MRI data collection and personalize the imaging protocol for each patient. We deploy machine learning algorithms and optimization techniques in each of these projects. In some of our work, we can generate the data that we need to train and test our algorithms using numerical simulations. In other portions, we seek to utilize clinical images, prospectively collected MRI data, or MRI protocol information in order to refine our techniques.
We seek to develop technologies like cardiac Magnetic Resonance Fingerprinting (cMRF), which can be used to efficiently collect multiple forms of information to distinguish healthy and diseased tissue using MRI. With rapid methods like cMRF, quantitative data describing disease processes can be gathered quickly, enabling more and sicker patients to be assessed via MRI. These data, collected from many patients over time, can also be used to further refine MRI technologies for the assessment of specific diseases in a tailored, patient-specific manner.
Alzheimer’s disease (AD) afflicts more than 5 million people in the United States and is gaining widespread attention. Over 400 clinical trials were run between 2002 and 2012, but only one trial has resulted in a marketable product. One of the most common explanations for these failures is that Alzheimer’s has been treated as a homogeneous disease. In many cases, individuals within the same group respond to a drug in different ways. Given the highly complex nature of AD, the likelihood of identifying a single drug that provides meaningful benefits to every patient is minimal. There is a pressing and unmet need to develop personalized treatment plans based on each patient’s omics profile.
To address this problem, my research focuses on developing a data-driven computational approach to predict drug responses for individuals with AD. This approach is based on patients’ metabolomics and transcriptomics profiles and on publicly available drug databases. Transcriptomics and metabolomics are increasingly being used to corroborate our interpretation of the pathophysiological pathways underlying AD, and integrating the two will guide the development of precision medicine for AD. In particular, I use the metabolome and transcriptome profiles of Alzheimer’s patients from the ADNI database. For each patient, I identify the dysregulated pathways from the patient’s metabolome profile and the patient-specific gene regulatory network from the transcriptome profile. My preliminary data suggest that each patient with Alzheimer’s has distinct dysregulated pathways and a distinct gene regulatory network. Drug selection based on a patient’s specific metabolome and transcriptome profiles offers a tremendous opportunity for more targeted and effective treatment, and it represents a critical innovation towards personalized medicine for AD. My long-term goal is to become an independent investigator in computational biology with a focus on translating omics data into bedside applications. The overall objective of my research is to combine metabolomics and gene expression data with drug data, using advanced machine learning algorithms, to personalize medicine for AD.