Jinseok Kim, Ph.D., is Research Assistant Professor in the Institute for Social Research at the University of Michigan, Ann Arbor. Prof. Kim works on resolving named entity ambiguity in large-scale scholarly data (publication, patent, and funding records) in digital libraries. In particular, his current research focuses on developing methods for disambiguating author and affiliation names at digital library scale using supervised machine learning approaches trained on automatically labeled data. Disambiguated data from multiple sources will be integrated and analyzed for insights into research production, scientific collaboration, funding evaluation, and research policy at a national level.
Dr. Mitchell’s research focuses on the causes and consequences of family formation behavior. He examines how social context, such as neighborhood resources and values, influences family processes and how those processes interplay with an individual’s genetic and epigenetic makeup to influence behavior, wellbeing, and health. His research also includes the development of new methods for integrating the collection and analysis of biological and social data.
Zhenke Wu is an Assistant Professor of Biostatistics and a core faculty member in the Michigan Institute for Data Science (MIDAS). He received his Ph.D. in Biostatistics from the Johns Hopkins University in 2014 and then stayed at Hopkins for his postdoctoral training before joining the University of Michigan. Dr. Wu’s research focuses on the design and application of statistical methods that inform health decisions made by individuals, or precision medicine. The original methods and software developed by Dr. Wu are now used by investigators from research institutes such as the CDC and Johns Hopkins, as well as site investigators from developing countries, e.g., Kenya, South Africa, Gambia, Mali, Zambia, Thailand and Bangladesh.
Profile: At a “sweet spot” of data science
By Dan Meisler
Communications Manager, ARC
If you had to name two of the more exciting, emerging fields of data science, electronic health records (EHR) and mobile health might be near the top of the list.
Zhenke Wu, one of the newest MIDAS core faculty members, has one foot firmly in each field.
“These two fields share the common goal of learning from the experience of the population in the past to advance health and clinical decisions for those to follow. I am looking forward to more work that will bring the two fields closer to continuously generate insights about human health,” Wu said. “I’m in a sweet spot.”
Wu joined U-M in Fall 2016, after earning a PhD in Biostatistics from Johns Hopkins University, and a bachelor’s in Mathematics from Fudan University. He said the multitude of large-scale studies going on at U-M and access to EHR databases were factors in his coming to Michigan.
“The University of Michigan is an exciting place that has a diversity of large-scale databases and supportive research groups in the fields I’m interested in,” he said.
Wu is collaborating with the Michigan Genomics Initiative, which is a biorepository effort at Michigan Medicine to integrate genome-wide information with EHR from approximately 40,000 patients undergoing anesthesia prior to surgery or diagnostic procedures. He’s also collaborating with Dr. Srijan Sen, Associate Professor, Department of Psychiatry and Molecular and Behavioral Neuroscience Institute, on the MIDAS-supported project “Identifying Real-Time Data Predictors of Stress and Depression Using Mobile Technology,” the preliminary results of which recently matured into an NIH-funded R01 project “Mobile Technology to Identify Mechanisms Linking Genetic Variation and Depression” that will draw broad expertise from a multi-disciplinary team of medical and data science researchers.
“One of my goals is to use an integrated and rigorous approach to predict how a person’s health status will be in the near future,” Wu said.
Wu applies hierarchical Bayesian models to these problems, which he hopes will shed light on phenomena he describes as latent constructs that are “well-known, but less quantitatively understood, e.g., intelligence quotient (IQ) in psychology.”
As another example, he cites the current challenge in active surveillance of prostate cancer patients for aggressive tumors requiring removal and/or radiation, or indolent tumors permitting continued surveillance.
“The underlying status of aggressive versus indolent cancer is not observed, which needs to be learned from the results of biopsy and other clinical measurements,” he said. “The decisions and experience of urologists and their patients will greatly benefit from more accurate understanding of the tumor status… There are lots of scientific problems in clinical, biomedical, behavioral and social sciences where you have well-known but less quantitatively understood latent constructs. These are problems that Bayesian latent variable methods can formulate and address.”
Just as Wu has a hand in two hot-button big data areas, he also sees himself as straddling the line between application and methodology.
He says the large number of data sources — sensors, mobile apps, test results, and questionnaires, to name just a few — results in richness as well as some “messiness” that needs new methodologies to adjust, integrate and translate to new scientific insights. At the same time, a valid new methodology for dealing with, for example, electronic health data, will likely find numerous different applications.
Wu says his approach was heavily influenced by his work in the Pneumonia Etiology Research for Child Health (PERCH) funded by the Gates Foundation while he was at Johns Hopkins. Pneumonia is a clinical syndrome due to lung infection that can be caused by more than 30 different species of pathogens, including bacteria, viruses and fungi. The goal of the seven-country study that enrolled more than 5,000 cases and 5,000 controls from Africa and Southeast Asia is to estimate the frequency with which each pathogen caused pneumonia in the population and the probability of each individual being infected by the list of pathogens in the lung.
“In most settings, it is extremely difficult to identify the pathogen by directly sampling from the site of infection – the child’s lung. PERCH therefore looked for other sources of evidence by standardizing and comprehensively testing biofluids collected from sites peripheral to the lung. Using hierarchical Bayesian models to infer disease etiology by integrating such a large trove of data was extremely fun and exciting,” he said.
Wu’s initial interest in math, leading to biostatistics and now data science, stems from what he called a “greedy” desire to learn the guiding principles of how the world works by rigorous data science.
“If you have new problems, you can wait for other people to ask a clean math question, or you can go work with these messy problems and figure out interesting questions and their answers,” he said.
For more on Dr. Wu, see his profile on Michigan Experts.
Wu, Z., Deloria-Knoll, M. & Zeger, S. L. (2017). Nested partially latent class models for dependent binary data; Estimating disease etiology. Biostatistics, 18(2), 200-213.
Knoll, M. D., Fu, W., Shi, Q., Prosperi, C., Wu, Z., Hammitt, L. L., Feikin, D. R., Baggett, H. C., Howie, S. R. C., Scott, J. A. G., Murdoch, D. R., Madhi, S. A., Thea, D. M., Brooks, W. A., Kotloff, K. L., Li, M., Park, D. E., Lin, W., Levine, O. S., O'Brien, K. L. & Zeger, S. L. (2017). Bayesian estimation of pneumonia etiology: Epidemiologic considerations and applications to the pneumonia etiology research for child health study. Clinical Infectious Diseases, 64, S213-S227.
Wu, Z., Deloria-Knoll, M., Hammitt, L. L. & Zeger, S. L. (2016). Partially latent class models for case-control studies of childhood pneumonia aetiology. Journal of the Royal Statistical Society. Series C: Applied Statistics, 65(1), 97-114.
Jun Li, PhD, is Professor and Chair for Research in the department of Computational Medicine and Bioinformatics and Professor of Human Genetics in the Medical School at the University of Michigan, Ann Arbor.
Ding Zhao, PhD, is Assistant Research Scientist in the department of Mechanical Engineering, College of Engineering with a secondary appointment in the Robotics Institute at the University of Michigan, Ann Arbor.
Dr. Zhao’s research interests include autonomous vehicles, intelligent/connected transportation, traffic safety, human-machine interaction, rare events analysis, dynamics and control, machine learning, and big data analysis.
V.G. Vinod Vydiswaran, PhD, is Assistant Professor in the Department of Learning Health Sciences with a secondary appointment in the School of Information at the University of Michigan, Ann Arbor.
Dr. Vydiswaran’s research focuses on developing and applying text mining, natural language processing, and machine learning methodologies for extracting relevant information from health-related text corpora. This includes medically relevant information from clinical notes and biomedical literature, and studying the information quality and credibility of online health communication (via health forums and tweets). His previous work includes developing novel information retrieval models to assist clinical decision making, modeling information trustworthiness, and addressing the vocabulary gap between health professionals and laypersons.
Sriram Chandrasekaran, PhD, is Assistant Professor of Biomedical Engineering in the College of Engineering at the University of Michigan, Ann Arbor.
Dr. Chandrasekaran’s Systems Biology lab develops computer models of biological processes to understand them holistically. Sriram is interested in deciphering how thousands of proteins work together at the microscopic level to orchestrate complex processes like embryonic development or cognition, and how this complex network breaks down in diseases like cancer. Systems biology software and algorithms developed by his lab are highlighted below and are available at http://www.sriramlab.org/software/.
– INDIGO (INferring Drug Interactions using chemoGenomics and Orthology) predicts how antibiotics prescribed in combination will inhibit bacterial growth. INDIGO leverages genomics and drug-interaction data in the model organism E. coli to facilitate the discovery of effective combination therapies in less-studied pathogens, such as M. tuberculosis. (Ref: Chandrasekaran et al. Molecular Systems Biology 2016)
– GEMINI (Gene Expression and Metabolism Integrated for Network Inference) is a network curation tool. It allows rapid assessment of regulatory interactions predicted by high-throughput approaches by integrating them with a metabolic network. (Ref: Chandrasekaran and Price, PLoS Computational Biology 2013)
– ASTRIX (Analyzing Subsets of Transcriptional Regulators Influencing eXpression) uses gene expression data to identify regulatory interactions between transcription factors and their target genes. (Ref: Chandrasekaran et al. PNAS 2011)
– PROM (Probabilistic Regulation of Metabolism) enables the quantitative integration of regulatory and metabolic networks to build genome-scale integrated metabolic–regulatory models. (Ref: Chandrasekaran and Price, PNAS 2010)
Gilbert Omenn, MD, PhD, is Professor of Computational Medicine & Bioinformatics with appointments in Human Genetics, Molecular Medicine & Genetics in the Medical School and Professor of Public Health in the School of Public Health and the Harold T. Shapiro Distinguished University Professor at the University of Michigan, Ann Arbor.
Dr. Omenn’s current research interests are focused on cancer proteomics, splice isoforms as potential biomarkers and therapeutic targets, and isoform-level and single-cell functional networks of transcripts and proteins. He chairs the global Human Proteome Project of the Human Proteome Organization.
The GEMS (Graph Exploration and Mining at Scale) Lab develops new, fast and principled methods for mining and making sense of large-scale data. Within data mining, we focus particularly on interconnected or graph data, which are ubiquitous. Some examples include social networks, brain graphs or connectomes, traffic networks, computer networks, phone call and email communication networks, and more. We leverage ideas from a diverse set of fields, including matrix algebra, graph theory, information theory, machine learning, optimization, statistics, databases, and social science.
At a high level, we enable single-source and multi-source data analysis by providing scalable methods for fusing data sources, relating and comparing them, and summarizing patterns in them. Our work has applications to exploration of scientific data (e.g., connectomics or brain graph analysis), anomaly detection, re-identification, and more. Some of our current research directions include:
*Scalable Network Discovery from non-Network Data*: Although graphs are ubiquitous, they are not always directly observed. Discovering and analyzing networks from non-network data is a task with applications in fields as diverse as neuroscience, genomics, energy, economics, and more. However, traditional network discovery approaches are computationally expensive. We are currently investigating network discovery methods (especially from time series) that are both fast and accurate.
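As a rough illustration of what network discovery from time series involves, the sketch below builds a graph by linking every pair of series whose absolute Pearson correlation exceeds a threshold. This is a generic textbook baseline, not the lab's method; the function names, the toy data, and the 0.8 threshold are all assumptions made for the example. Because it compares all pairs, its cost grows quadratically with the number of series, which is the kind of expense the paragraph above refers to.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_network(series, threshold=0.8):
    """Return an edge (u, v) for every pair with |correlation| >= threshold.

    `series` maps a node name to its time series. All pairs are compared,
    so the running time is quadratic in the number of series.
    """
    edges = []
    for u, v in combinations(sorted(series), 2):
        if abs(pearson(series[u], series[v])) >= threshold:
            edges.append((u, v))
    return edges


# Hypothetical toy data: "b" tracks "a" closely, "c" does not.
toy_series = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 1.0, 5.0],
}
edges = correlation_network(toy_series)
```

On this toy input only the pair ("a", "b") is linked; the threshold controls how dense the discovered network becomes.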
*Graph Similarity and Alignment with Representation Learning*: Graph similarity and alignment (or fusion) are core operations in various data mining tasks, such as anomaly detection, classification, clustering, transfer learning, sense-making, de-identification, and more. We are exploring representation learning methods that can generalize across networks and can be used in such multi-source network settings.
*Scalable Graph Summarization and Interactive Analytics*: Recent advances in computing resources have made processing enormous amounts of data possible, but the human ability to quickly identify patterns in such data has not scaled accordingly. Thus, computational methods for condensing and simplifying data are becoming an important part of the data-driven decision making process. We are investigating ways of summarizing data in a domain-specific way, as well as leveraging such methods to support interactive visual analytics.
*Distributed Graph Methods*: Many mining tasks for large-scale graphs involve solving iterative equations efficiently. For example, classifying entities in a network setting with limited supervision, finding similar nodes, and evaluating the importance of a node in a graph can all be expressed as linear systems that are solved iteratively. Growing data volumes have increased the need for faster methods in all these applications, and many more. Our focus is on speeding up such methods for large-scale graphs in both sequential and distributed environments.
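One familiar instance of "evaluating the importance of a node" through an iteratively solved linear system is PageRank. The sketch below is a minimal sequential power iteration, offered only as an example of the problem class and not as the lab's algorithm; the function name, the damping factor of 0.85, the tolerance, and the toy edge list are assumptions chosen for illustration.

```python
def pagerank(edges, num_nodes, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank on a directed graph.

    `edges` is a list of (src, dst) pairs over nodes 0..num_nodes-1.
    """
    out_neighbors = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        out_neighbors[src].append(dst)

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[u] for u in range(num_nodes) if not out_neighbors[u])
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        # Each node passes an equal share of its damped rank to its out-neighbors.
        for u in range(num_nodes):
            if out_neighbors[u]:
                share = damping * rank[u] / len(out_neighbors[u])
                for v in out_neighbors[u]:
                    new_rank[v] += share
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:  # stop once the update is negligible
            break
    return rank


# Hypothetical toy graph: a 3-cycle (0 -> 1 -> 2 -> 0) plus node 3 pointing at node 2.
toy_edges = [(0, 1), (1, 2), (2, 0), (3, 2)]
scores = pagerank(toy_edges, num_nodes=4)
```

Each pass touches every edge once, so the per-iteration cost is linear in the graph size. The number of passes and the way edges are partitioned across machines are where faster sequential and distributed variants typically seek savings.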
*User Modeling*: The large amounts of online user information (e.g., in social networks, online market places, streaming music and video services) have made possible the analysis of user behavior over time at a very large scale. Analyzing the user behavior can lead to better understanding of the user needs, better recommendations by service providers that lead to customer retention and user satisfaction, as well as detection of outlying behaviors and events (e.g., malicious actions or significant life events). Our current focus is on understanding career changes and predicting job transitions.