Consortium for Data Scientists in Training Abstracts

October 29, 2020, 12:00 PM – October 30, 2020, 2:00 PM

MIDAS has organized the first Consortium for Data Scientists in Training in the country. MIDAS aims to build a network for these students and postdoctoral fellows, help them receive feedback on their research, and nurture them to become the next generation of academic leaders in data science.

2020 Consortium for Data Scientists in Training Presentation Abstracts

The 2020 cohort comes from 28 universities and will meet virtually on Oct. 29–30 to participate in research talks, networking sessions, and mentoring opportunities.

The research talk presentations will take place from 12:00pm – 2:00pm EST each day and are open to the public.

Oct. 29, Data Science and the Physical World

12:10 – 12:25 pm Karianne Bergen, postdoc, Data Science Initiative & Computer Science, Harvard University 

Shaking up Earthquake Science in the Age of Big Data

Earth scientists often rely on passive sensors to detect and study events or processes of interest. As these sensor data sets grow in volume and complexity, scalable algorithms for extracting information from massive data sets are increasingly essential for geoscience research. One such task, fundamental in earthquake seismology, is earthquake detection: the extraction of weak earthquake signals from continuous waveform data recorded by sensors in a seismic network. In this talk, I will describe the data science challenges associated with earthquake detection in long-duration seismic data sets. I will discuss how new algorithmic advances in “big data” and machine learning are helping to advance the state of the art in earthquake monitoring. As a case study, I will present Fingerprint and Similarity Thresholding (FAST), a novel method for large-scale earthquake detection inspired by audio recognition technology (Yoon et al., 2015). I will conclude the talk with a brief discussion of opportunities for collaboration between the geoscience and data science communities that can advance the state of the art in both fields.
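
For readers who want a concrete picture of the fingerprint-and-search idea behind FAST, the sketch below binarizes spectrogram windows into sparse fingerprints and flags unusually similar window pairs. It is a toy illustration only: the published method (Yoon et al., 2015) uses wavelet-based fingerprints with locality-sensitive hashing rather than the brute-force comparison shown here, and all parameter values are arbitrary.

```python
import numpy as np
from scipy.signal import spectrogram

# Toy fingerprint-and-similarity detector: each fingerprint is the set of
# indices of the loudest spectrogram cells in a window; near-duplicate
# windows (repeating waveforms) have high Jaccard similarity. The real FAST
# replaces this O(n^2) scan with MinHash/LSH to scale to years of data.
def fingerprints(x, fs, win_bins=32, top_k=50):
    _, _, S = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
    prints = []
    for start in range(0, S.shape[1] - win_bins, win_bins // 2):
        patch = S[:, start:start + win_bins].ravel()
        prints.append(frozenset(np.argsort(patch)[-top_k:].tolist()))
    return prints

def similar_pairs(prints, thresh=0.5):
    jaccard = lambda a, b: len(a & b) / len(a | b)
    return [(i, j) for i in range(len(prints))
            for j in range(i + 1, len(prints))
            if jaccard(prints[i], prints[j]) >= thresh]

rng = np.random.default_rng(0)
x = rng.normal(size=600 * 100)                 # ten minutes of noise at 100 Hz
print(len(similar_pairs(fingerprints(x, fs=100))), "candidate event pairs")
```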

12:25 – 12:40 pm Roshan Kulkarni, graduate student, Agronomy, Iowa State University; Karin Dorman, Statistics, Iowa State University; Yudi Zhang, Statistics, Iowa State University; Steven Cannon, Corn Insects and Crop Genetics Research Unit, USDA-ARS

Modeling for ambiguous SNP calls in allotetraploids

Cultivated peanut is an allotetraploid crop with highly similar A and B sub-genomes and a relatively large genome size of around 2.7 Gbp. Accurate genotyping of allotetraploid peanut is challenging due to alignment ambiguities caused by homoeology, which lead to an excess of heterozygous calls. In this study we propose an allotetraploid-specific method that carefully assesses the strength of A and B alignments to estimate the genotype of a sequenced individual at a single locus in a homoeologous region. In contrast to previously developed methods, the proposed method does not require evidence of haplotypes. The method was validated on WGS re-sequencing data and simulated amplicon sequences. In providing this tool, we hope to benefit plant breeding programs by genotyping allotetraploids with greater accuracy and thereby better revealing the true variation among genotypes.
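
To illustrate why pooled homoeologous reads inflate heterozygous calls, here is a generic binomial genotype-likelihood calculation; this is the standard textbook model, not the authors' allotetraploid-specific method, and the error rate, priors, and counts are made up.

```python
import numpy as np
from scipy.stats import binom

def genotype_posterior(n_alt, n_total, error=0.01, prior=(0.45, 0.10, 0.45)):
    """Posterior over diploid genotypes {ref/ref, ref/alt, alt/alt}
    from alt-allele read counts, under a simple binomial model."""
    p_alt = np.array([error, 0.5, 1.0 - error])   # expected alt fraction per genotype
    lik = binom.pmf(n_alt, n_total, p_alt)
    post = lik * np.array(prior)
    return post / post.sum()

# A homozygous A-subgenome locus (true alt fraction ~0) contaminated by
# mis-mapped B-subgenome reads carrying the alternate base: the pooled
# counts mimic a heterozygote.
print(genotype_posterior(n_alt=1, n_total=30))    # clearly homozygous ref
print(genotype_posterior(n_alt=14, n_total=30))   # pooled A+B reads look "het"
```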

12:40 – 12:55 pm Christopher Powell, graduate student, Biological Sciences, Oakland University; Ashley A. Superson, Biological Sciences, Oakland University; Fabia Ursula Battistuzzi, Center for Data Science and Big Data Analytics, Oakland University

PATS: a taxon re-sampling pipeline to test phylogenetic stability in large datasets

Hypothesis-driven evaluations of large phylogenies are time-consuming due to the exponentially large number of parameter combinations required for a (nearly) exhaustive assessment. Additionally, they often require multiple steps that are not fully automated, forcing researchers to commit time to repetitive tasks, which limits the number of testable hypotheses. Here we present PATS (Phylogenetic Assessment of Taxon Sampling), an open-source computational pipeline that allows users to explore the effects of taxon sampling and species choice on phylogenetic tree reconstructions. Through the implementation of different permutation strategies, PATS streamlines the process of testing multiple scenarios in which different sets of species are used to estimate a phylogenetic history. Users can provide a list of species and choose to iteratively remove a single species from it (RemoveOne), remove all species except one (KeepOne), or remove groups of species (RemoveGroup), as sketched below. These three options allow users to explore the effect not only of taxon sampling but also of species choice on phylogenetic stability. Additionally, users can choose to apply any of these permutations to rerun a complete phylogenetic analysis, from ortholog determination to tree reconstruction, or just the tree reconstruction step. We apply the pipeline to an example dataset of 103 species and a KeepOne scenario with 9 iterations. We describe the different types of analyses that can be run and the detailed outputs produced, which enable the user to maintain full control over each step of the pipeline for additional downstream analyses. PATS is the first automated computational pipeline that allows users to test the effect of the number and choice of species on phylogenetic reconstructions. It enables large-scale testing in large datasets with minimal setup and run times, which will encourage researchers to fully investigate the robustness of any phylogenetically based scientific discovery.
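
The three permutation strategies are simple to express in code. The sketch below is a minimal reading of the abstract; the function names and the interpretation of KeepOne (keep one member of a focal group per iteration) are mine, not from the PATS source.

```python
from itertools import combinations

def remove_one(species):
    """RemoveOne: drop each species in turn."""
    return [[s for s in species if s != drop] for drop in species]

def keep_one(species, focal_group):
    """KeepOne: retain one member of the focal group per iteration."""
    background = [s for s in species if s not in focal_group]
    return [background + [keep] for keep in focal_group]

def remove_group(species, k):
    """RemoveGroup: drop every size-k subset of species."""
    return [[s for s in species if s not in grp]
            for grp in combinations(species, k)]

taxa = ["sp_A", "sp_B", "sp_C", "sp_D"]
for subset in remove_one(taxa):
    print(subset)   # each subset feeds one ortholog-search/tree-building run
```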

12:55-1:10 pm Sameer Sameer, graduate student, Astronomy and Astrophysics, Penn State; Jane Charlton, Astronomy & Astrophysics, Penn State

Unveiling the nature of the Circumgalactic medium

A major shortcoming in the ionization analysis of QSO absorption line systems is that absorbers are often characterized by a single value of metallicity and density obtained by grouping all the components together. We have developed a new method that rectifies this problem by performing component-by-component, multiphase ionization modeling of absorption systems. In this talk, I will present our method, which performs both photoionization and collisional ionization modeling of QSO absorption lines with CLOUDY and extracts their physical properties by Bayesian inference. This method makes use of “optimizing” ions to trace the different phases present in an absorption system. The temperature derived from the CLOUDY modeling of “optimized” ions is used to constrain the Doppler broadening parameter for hydrogen and other observed transitions seen in the spectra. I will illustrate the method for a couple of weak, low-ionization absorbers, compare the results with those of traditional methods, and discuss the potential and limitations of the method.

1:10 – 1:25 pm Yan Li, graduate student, Department of Computer Science and Engineering, University of Minnesota; Pratik Kotwal, Department of Computer Science and Eng., University of Minnesota; Pengyue Wang, Department of Mechanical Eng., University of Minnesota; Shashi Shekhar, Department of Computer Science and Eng., University of Minnesota; William Northrop, Department of Mechanical Eng., University of Minnesota

Physics-guided Energy-efficient Path Selection

Given a spatial graph, an origin and a destination, and on-board diagnostics (OBD) data, the energy-efficient path selection problem aims to find the path with the least expected energy consumption (EEC). The challenges of the problem include the dependence of EEC on the physical parameters of vehicles, the autocorrelation of EEC across segments of paths, the high computational cost of EEC estimation, and potentially negative EEC. However, current cost estimation models for the path selection problem do not consider vehicles’ physical parameters. Moreover, current path selection algorithms follow the “path + edge” pattern when exploring candidate paths, resulting in redundant computation. We introduce a physics-guided energy consumption model and a maximal-frequented-path-graph shortest-path algorithm that uses the model. We analyze the proposed algorithms theoretically and evaluate them via experiments with real-world and synthetic data. We also conduct two case studies using real-world data and a road test to validate that the proposed method suggests paths that are more energy-efficient than those suggested by commonly used routing methods.
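
As a rough illustration of the physics-guided idea, the sketch below derives edge costs from a simplified longitudinal vehicle model and then routes with Bellman-Ford, since descents can make edge energies negative and rule out plain Dijkstra. The energy formula, parameter values, and graph are illustrative stand-ins, not the model from the paper.

```python
import networkx as nx

# Illustrative physics-guided edge cost: rolling resistance plus potential
# energy, divided by drivetrain efficiency. A real model would also handle
# inertia, speed profiles, and partial energy recuperation on descents.
def edge_energy_j(dist_m, elev_gain_m, mass_kg=1500.0, crr=0.01, g=9.81, eta=0.9):
    rolling = crr * mass_kg * g * dist_m
    climb = mass_kg * g * elev_gain_m          # negative on net descents
    return (rolling + climb) / eta

G = nx.DiGraph()
G.add_edge("A", "B", w=edge_energy_j(1000, 5))
G.add_edge("B", "C", w=edge_energy_j(800, -12))    # downhill: negative cost
G.add_edge("A", "C", w=edge_energy_j(2500, 0))

# Negative edge costs rule out Dijkstra; Bellman-Ford is safe as long as the
# network has no negative cycle (physically, a loop yielding net free energy).
print(nx.bellman_ford_path(G, "A", "C", weight="w"))
print(nx.bellman_ford_path_length(G, "A", "C", weight="w"), "joules")
```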

1:40 – 1:55pm Elham Taghizadeh, Graduate student, Industrial Engineering, Wayne State University; Ratna Babu Chinnam, Industrial Engineering, Wayne State University; Saravanan Venkatachalam, Industrial Engineering, Wayne State University

Framework for Effective Resilience Assessment of Deep-Tier Automotive Supply Networks

Enhancing resilience is a critical strategy for deep-tier supply chain networks, which face the challenges of a complex and global supply base, economic volatility, rapidly changing technologies, and environmental and political shocks. The literature reports that most supply chain disruptions stem from tier-2 and tier-3 suppliers. However, supply chains still suffer from a lack of transparency in the upstream deep tiers of the supply network. This study provides a multi-dimensional metric to assess the resilience level of an automotive supply chain network, using secondary data sources to map deep-tier suppliers while considering regional risks in a discrete event simulation model. We develop a framework for resilience assessment of deep-tier supply networks by considering operating policies while exploiting historical data as well as other data sources on risks. We characterize the value of visibility into the deeper tiers through scenarios and provide insights regarding the resilience of supply networks. Our research also reveals that a lack of visibility into the deep tiers of a network can significantly distort the accuracy of supply network resilience assessment. The computational experiments show that information visibility can be critical for resilience assessment of global networks, and that the need for visibility can vary by commodity.
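
A small Monte Carlo makes the visibility claim concrete: assessing availability from tier-1 uptime alone overstates resilience whenever hidden tier-2 suppliers can also fail. All rates and network sizes below are invented for illustration; this is not the study's simulation model.

```python
import numpy as np

# A buyer estimates supply availability from tier-1 uptime alone, while true
# availability also requires the hidden tier-2 suppliers to be up.
rng = np.random.default_rng(0)
n_sims, n_tier1, n_tier2_per = 100_000, 5, 3
p_up_t1, p_up_t2 = 0.98, 0.97

t1 = rng.random((n_sims, n_tier1)) < p_up_t1
t2 = rng.random((n_sims, n_tier1, n_tier2_per)) < p_up_t2

visible_ok = t1.all(axis=1)                       # tier-1-only assessment
true_ok = (t1 & t2.all(axis=2)).all(axis=1)       # deep-tier reality

print("assessed availability:", visible_ok.mean())
print("true availability:    ", true_ok.mean())   # visibly lower
```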

Oct. 29, Data Science and Human Society

12:10 – 12:25 pm Aviv Landau, Postdoc, The Data Science Institute, Columbia University; Max Topaz, Associate Professor, School of Nursing, Columbia University; Desmond Patton, Associate Professor, School of Social Work, Columbia University; Ashley Blanchard, Assistant Professor, New York Presbyterian Morgan Stanley Children’s Hospital, Columbia University

Artificial Intelligence-Assisted Identification of Child Abuse and Neglect in Hospital Settings with Implications for Bias Reduction and Future Interventions

Child abuse and neglect, once a social problem, is now an epidemic. The broad adoption of electronic health records (EHRs) in clinical settings offers a new avenue for addressing this epidemic. We propose developing an innovative artificial intelligence system to detect and assess risk for child abuse and neglect within hospital settings, prioritizing the prevention and reduction of bias against Black and Latinx communities. The possibility of racial bias in AI systems itself underscores the challenge of addressing racism through an AI system.

To reduce racial bias and achieve objectivity, the system’s design will involve domain experts: Black and Latinx caregivers served by the hospital, who will provide insights about abuse and neglect and about engagement with healthcare providers. In addition, hospital clinician domain experts will produce a taxonomy of risk factors thought to correlate with child abuse and neglect in EHR data, for use in building the support algorithm.

12:25 – 12:40 pm Thibaut Horel, postdoc, Laboratory for Information & Decision Systems, Massachusetts Institute of Technology; Raj Agrawal, Electrical Engineering and Computer Science, Massachusetts Institute of Technology; Trevor Campbell, Statistics, University of British Columbia; Lorenzo Masoero, Electrical Engineering and Computer Science, Massachusetts Institute of Technology; Daria Roithmayr, Law, University of Southern California

The Contagiousness of Police Violence

Explanations for unlawful police violence focus on individual “bad apple” officers or deviant top-down departmental culture. Recent research suggests that violence may also diffuse through social networks, much like a disease spreads on networks of interaction. We investigate whether police violence is contagious—whether an officer’s exposure to earlier police shootings by network neighbors increases the probability that the officer will engage in future police shootings. Drawing on data from the Chicago Police Department, we construct and analyze dynamic patterns of diffusion of shootings across police professional networks. We follow both a non-parametric approach based on permutation tests and a model-driven approach using Hawkes processes. This is made particularly challenging by the presence of homophily, which is typically confounded with contagion in this kind of observational study. Preliminary results suggest strong structural and dynamic evidence consistent with a dynamic of contagion in police-involved shootings, even after controlling for homophily.
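
The permutation-test side of such an analysis can be sketched in a few lines: hold the network fixed, shuffle event times, and ask how often the shuffled data show as much neighbor-preceded clustering as the observed data. This is a generic version of the test; the statistic, window, and data layout are my illustrative choices, and it does not implement the homophily controls the talk describes.

```python
import numpy as np

rng = np.random.default_rng(1)

def neighbor_exposure_stat(times, neighbors, window=180.0):
    """Number of events preceded within `window` days by a network
    neighbor's event -- a crude measure of clustering on the network."""
    hits = 0
    for i, t in enumerate(times):
        if any(0 < t - times[j] <= window for j in neighbors[i]):
            hits += 1
    return hits

def permutation_pvalue(times, neighbors, n_perm=2000):
    observed = neighbor_exposure_stat(times, neighbors)
    null = [neighbor_exposure_stat(rng.permutation(times), neighbors)
            for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)

# Officers 0/1 and 2/3 are network neighbors; events cluster within pairs.
times = np.array([5.0, 30.0, 200.0, 210.0, 900.0, 2000.0])
neighbors = {0: [1], 1: [0], 2: [3], 3: [2], 4: [], 5: []}
print(permutation_pvalue(times, neighbors))
```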

12:40 – 12:55 pm Christopher Ick, Graduate student, Center for Data Science, New York University; Brian McFee, Center for Data Science, New York University

Robust Sound Event Detection in Urban Environments

Foreground and background separation is a critical step in acoustic signal detection, particularly in the case of sound event detection (SED) in field recordings. Popular approaches to this task mimic the methods used in computer vision, using convolutional operators on an image to extract useful features. When audio is the input to these methods, the images are typically time-frequency representations (spectrograms). Traditionally, log-scaling is the primary approach for foreground/background separation and noise reduction, and it is the standard approach when spectrograms are the feature of interest. For single-source audio in clean acoustic environments, this is sufficient for SED. However, SED in urban field recordings is a more challenging task. Urban soundscapes are often recorded under varying acoustic conditions, which introduce audio deformations such as reverb. Furthermore, urban soundscapes rarely have distinct and separate sound events, often containing multiple overlapping sound events of multiple classes.

Recent literature has demonstrated that an adaptive preprocessing method, per-channel energy normalization (PCEN), yields significant performance improvements over traditional log-scaled mel-frequency spectrograms for noise removal in SED. However, PCEN is sensitive to the parameter configuration in relation to the acoustic properties of the target sound; a configuration well suited for one class of sound events may be poorly suited for another class in the same soundscape. To ameliorate this, we propose a novel method, multi-rate PCEN, in which we vary the parameters of our PCEN pre-processing. This generates a multi-layer spectrogram, an approach not unlike color image processing in machine vision. By adapting this multi-layer image approach from machine vision, we simultaneously improve robustness to audio degradation and cross-class performance, making this approach well suited for SED in urban environments.
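
A minimal sketch of the multi-rate idea using librosa's built-in PCEN implementation: compute PCEN at several time constants and stack the results as channels, like an RGB image. The time-constant grid and STFT settings are illustrative, not the configuration from the talk.

```python
import numpy as np
import librosa

# Multi-rate PCEN sketch: each layer uses a different temporal smoother, so
# fast transients and slow textures are normalized differently.
y, sr = librosa.load(librosa.ex("trumpet"))
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512)) ** 2

layers = [
    librosa.pcen(S * (2 ** 31), sr=sr, hop_length=512, time_constant=tc)
    for tc in (0.06, 0.4, 2.0)   # fast, default, slow smoothers (arbitrary)
]
multi_rate = np.stack(layers, axis=0)   # shape: (3, freq, time), image-like
print(multi_rate.shape)
```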

12:55 – 1:10 pm Renhao Cui, Graduate student, Computer Science & Engineering, The Ohio State University; Rajiv Ramnath, Computer Science & Engineering, The Ohio State University; Gagan Agrawal, Department of Computer & Cyber Sciences, Augusta University

Restricted Paraphrase Generation Model for Commercial Tweets

We propose a restricted paraphrase generation model that focuses on generating paraphrases for commercial tweets. Paraphrasing commercial tweets requires certain elements to be preserved in the output, such as the product name or promotion amount. We utilize a GPT-2 model and fine-tune it with a structure-enriched dataset, which helps the model not only generate paraphrases but also satisfy the specific requirements of commercial tweets. The model can identify and follow the restrictions automatically. In addition, very few paraphrase datasets exist, especially in the domain of social media. Therefore, we apply domain transfer and rely on a general parallel machine-translated dataset to train the model. Our model is shown to outperform general paraphrase generation models as well as the CopyNet model in terms of paraphrase similarity, diversity, and the ability to follow the restrictions.

1:10 – 1:25 pm Rezvaneh (Shadi) Rezapour, Graduate Student, School of Information Sciences, University of Illinois at Urbana-Champaign

Text Mining for Social Good: Context-aware Measurement of Social Impact and Effects Using Natural Language Processing

Exposure to information sources of different types and modalities, such as social media, movies, scholarly reports, and interactions with other communities and groups can change a person’s values as well as their knowledge and attitude towards various social phenomena. My doctoral research aims to analyze the effect of these stimuli on people and groups by applying mixed-method approaches that include techniques from natural language processing, close readings, and machine learning. This research leverages different types of user-generated texts (i.e., social media and customer reviews), and professionally-generated texts (i.e., scholarly publications and organizational documents) to study (1) the impact of information that was produced with the aim of advancing social good for individuals and society, and (2) the impact of social and individual biases and values on people’s language use. This work contributes to advancing knowledge, theory, and computational solutions in the field of computational social science. The approaches and insights discussed can provide a better understanding of people’s attitudes and judgment towards issues and events of general interest, which is necessary for developing solutions for minimizing biases, filter bubbles, and polarization while also improving the effectiveness of interpersonal and societal discourse.

1:25 – 1:40 pm Juandalyn Burke, PhD Candidate, Biomedical Informatics and Medical Education, University of Washington

Using an Ecological Inference Software Tool to Detect Vote Dilution 

The most basic characteristic of a democratic system is the right to vote. The Voting Rights Act (VRA) of 1965 was established to ensure that fair voting practices were enacted and that elected officials were representative of the communities they served. The VRA prohibits unfair and discriminatory voting practices, including racially polarized voting and vote dilution, based on race or an individual’s association with minority language groups. However, in the United States, violations of the VRA are difficult to prove because information on race and ethnicity is not collected in the voting process. By definition, racially polarized voting occurs when distinct racial or ethnic groups vote divergently to elect their separate candidates of choice. Vote dilution occurs when the racial majority group votes to block the minority group from electing their preferred candidate. The eiCompare software package detects both racially polarized voting and vote dilution by inferring the race or ethnicity of the voters in a population using several methods of ecological inference. We improved and added features to the eiCompare package, including geocoding, more accurate procedures for detecting the race of voters, better visualization of ecological inference outcomes, parallel processing, and analysis of historical voting data. We think these new features will allow for better detection of racially polarized voting and vote dilution and will help to support evidence presented in voting rights litigation.

1:40 – 1:55 pm Michael D. Jackson, PhD Student, Statistics, Rice University

WaveL_2E + NARX: A Data-Centric W-NN Approach to Forecasting Dynamic Financial Time Series

Statisticians have been attempting to find the “best” model to predict market price movements for decades. Recent successes in both artificial neural networks (ANN) and wavelet (W) analysis have placed these two methods in the spotlight for quantitative traders looking for the next best tool. The W-ANN, which combines wavelet denoising and an ANN, has been proposed as a way to combine the two strategies to predict financial price series. We explore how a more data-driven denoising approach, with a dynamic wavelet thresholding technique designed for the signal-rich time series often found in finance, improves on naive implementations of this model. The WaveL_2E employs a dynamically optimized continuous wavelet transform that results in an adaptive thresholding model shown to improve results compared to traditional, static thresholding techniques on financial time series. We explore how this technique improves the predictability of this established quantitative trading model.
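
For readers unfamiliar with wavelet denoising, here is a conventional static (universal-threshold) baseline using PyWavelets; the adaptive WaveL_2E procedure described in the talk would replace this fixed threshold with a dynamically optimized one.

```python
import numpy as np
import pywt

# Static wavelet-thresholding denoiser: decompose, soft-threshold the detail
# coefficients at a universal threshold, reconstruct.
def wavelet_denoise(x, wavelet="db4", level=4):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745   # noise scale, finest level
    thresh = sigma * np.sqrt(2 * np.log(len(x)))     # universal threshold
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(x)]

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0, 1, 1024)) + 100     # synthetic price path
smooth = wavelet_denoise(prices)   # feed `smooth` to a NARX-style forecaster
```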

Oct. 30, Data Science Theory and Methodology

12:05 – 12:20 pm Paidamoyo Chapfuwa, Graduate student, Electrical and Computer Engineering, Duke University; C. Tao, Duke University; M. Pencina, Duke University; L. Carin, Duke University; R. Henao, Duke University; C. Li, Microsoft Research, Redmond

Bringing modern machine learning to survival analysis

Models for predicting the time of a future event are crucial for risk assessment, across a diverse range of applications. Existing time-to-event (survival) models have focused primarily on preserving the pairwise ordering of estimated event times (i.e., relative risk). We propose neural time-to-event models that account for calibration and uncertainty while predicting accurate absolute event times. Specifically, an adversarial nonparametric model is introduced for estimating matched time-to-event distributions for probabilistically concentrated and accurate predictions. We also consider replacing the discriminator of the adversarial nonparametric model with a survival-function matching estimator that accounts for model calibration. The proposed estimator can be used as a means of estimating and comparing conditional survival distributions while accounting for the predictive uncertainty of probabilistic models. Finally, we present two general extensions that facilitate any time-to-event model to account for model interpretability and counterfactual inference. For interpretability, we formulate an interpretable time-to-event driven clustering method of patients via a Bayesian nonparametric stick-breaking representation of the Dirichlet Process. For counterfactual inference, we introduce a model-free nonparametric hazard ratio estimator and a unified representational learning framework for individualized treatment effect estimation of survival outcomes from observational data. We present extensive results on challenging datasets with an open-source library, making our work easily available for the community to build upon.
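
One simple form of the calibration checks described, sketched with lifelines: compare a model's predicted survival curve against the nonparametric Kaplan-Meier estimate. The exponential "model" and the synthetic data below are placeholders, not the adversarial estimators from the talk.

```python
import numpy as np
from lifelines import KaplanMeierFitter

# Distribution-calibration check: a well-calibrated model's survival curve
# should track the Kaplan-Meier estimate on held-out data.
rng = np.random.default_rng(0)
T = rng.exponential(10.0, size=500)             # event times
E = rng.random(500) > 0.2                       # True = observed, False = censored
T = np.where(E, T, T * rng.random(500))         # censored subjects exit early

kmf = KaplanMeierFitter().fit(T, event_observed=E)
grid = np.linspace(0.5, 30, 20)
km_curve = kmf.survival_function_at_times(grid).to_numpy()

predicted_survival = lambda t: np.exp(-t / 10.0)      # hypothetical model output
gap = np.abs(predicted_survival(grid) - km_curve)
print("max calibration gap:", gap.max())
```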

12:20 – 12:40 pm Haekyu Park, Graduate Student, Computer Science and Engineering, Georgia Institute of Technology; Nilaksh Das, Georgia Tech, Georgia Tech Research Institute; Zijie J. Wang, Georgia Tech, Georgia Tech Research Institute; Fred Hohman, Georgia Tech, Georgia Tech Research Institute; Robert Firstman, Georgia Tech, Georgia Tech Research Institute; Emily Rogers, Georgia Tech, Georgia Tech Research Institute; Duen Horng Chau, Georgia Tech, Georgia Tech Research Institute

Bluff: Interactive Interpretation of Adversarial Attacks on Deep Learning

Deep neural networks (DNNs) are now commonly used in many domains. However, they are vulnerable to adversarial attacks: carefully crafted perturbations on data inputs that can fool a model into making incorrect predictions. Despite significant research on developing DNN attack and defense techniques, people still lack an understanding of how such attacks penetrate a model’s internals. We present Bluff, an interactive system for visualizing, characterizing, and deciphering adversarial attacks on vision-based neural networks. Bluff allows people to flexibly visualize and compare the activation pathways for benign and attacked images, revealing mechanisms that adversarial attacks employ to inflict harm on a model. Bluff is open-sourced and runs in modern web browsers.

12:40 – 12:55 pm Tanima Chatterjee, Graduate student, Computer Science, University of Illinois at Chicago; Bhaskar DasGupta, Computer Science, UIC; Nasim Mobasheri, Computer Science, UIC; Ismael G. Yero, Mathematics, Universidad de Cádiz

On the Computational Complexities of Three Privacy Measures for Large Networks Under Active Attack

With the arrival of the modern internet era, large public networks of various types have come into existence to benefit society as a whole, and several research areas, such as sociology, economics, and geography, in particular. However, the societal and research benefits of these networks have also given rise to potentially significant privacy issues, in the sense that malicious entities may violate the privacy of the users of such a network by analyzing the network and deliberately using such privacy violations for deleterious purposes. Such considerations have given rise to a new active research area that deals with the quantification of privacy of users in large networks and the corresponding investigation of the computational complexity of computing such quantified privacy measures. We formalize three natural problems related to such a privacy measure for large networks and provide non-trivial theoretical computational complexity results for solving these problems. Our results show that the first two problems can be solved efficiently, whereas the third problem is provably hard to solve within a logarithmic approximation factor. Furthermore, we also provide computational complexity results for the case when the privacy requirement of the network is severely restricted, including an efficient logarithmic approximation.

12:55 – 1:10 pm Behnaz Moradi, Research Associate at the University of Virginia

Introducing the renewal non-backtracking random walk (RNBRW) as a tool to uncover the mesoscopic structure of open-source software (OSS) networks

One way to identify communities is through how units interact with each other: units within a community may be more likely to interact with each other than with units in other communities, from which it follows that cycles should be more prevalent within communities than across them. Thus, the detection of these communities can be aided by measures of the local “richness” of cyclic structures. In this talk, we develop the renewal non-backtracking random walk (RNBRW), a variant of a random walk in which the walk is prohibited from returning to a node in exactly two steps, and terminates and restarts once it completes a loop, as a way of quantifying this cyclic structure. Specifically, we propose the retracing probability of an edge (the likelihood that the edge completes a cycle in an RNBRW) as a way of quantifying cyclic structure. Intuitively, edges with larger retracing probabilities should be more important to the formation of cycles and, hence, to the detection of communities. We show that retracing probabilities can be estimated efficiently through repeated iterations of RNBRW. Additionally, since RNBRW runs can be performed in parallel, accurate estimates can be obtained even when the network contains millions of nodes. We illustrate the application of this methodology to inferring the structure of OSS networks by pre-weighting edges through RNBRW as a warm-up step for state-of-the-art community detection algorithms. We also develop a goodness-of-fit test to help determine whether communities exist within a network. We test the null hypothesis that the network is a realization of an Erdős-Rényi graph, a random graph in which each edge is equally likely to be formed and which therefore contains no inherent community structure. Rejecting this null implies that the network comes from a distribution with an inherent community structure, e.g., a planted partition model.
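
A compact sketch of RNBRW edge weighting as the abstract describes it: launch many non-backtracking walks from random edges, credit the edge that closes each cycle, and normalize. Details such as the handling of dead ends are my guesses, and real use would parallelize the walks.

```python
import random
from collections import Counter
import networkx as nx

def rnbrw_weights(G, n_walks=10_000, seed=0):
    """Estimate edge retracing probabilities by repeated RNBRW runs."""
    rng = random.Random(seed)
    edges = list(G.edges)
    counts = Counter()
    for _ in range(n_walks):
        u, v = rng.choice(edges)
        if rng.random() < 0.5:
            u, v = v, u                               # random direction
        visited, prev, cur = {u, v}, u, v
        while True:
            nbrs = [w for w in G[cur] if w != prev]   # non-backtracking step
            if not nbrs:
                break                                 # dead end: restart walk
            nxt = rng.choice(nbrs)
            if nxt in visited:                        # walk closed a cycle
                counts[frozenset((cur, nxt))] += 1
                break
            visited.add(nxt)
            prev, cur = cur, nxt
    total = sum(counts.values()) or 1
    return {e: counts[frozenset(e)] / total for e in G.edges}

G = nx.karate_club_graph()
w = rnbrw_weights(G)
nx.set_edge_attributes(G, w, "rnbrw")   # pre-weighting for community detection
```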

1:10 – 1:25pm Sanjeev Kaushik, PhD Student, School of Computing and Information Sciences, Florida International University

Securing the IoT Communication

In this talk, we will discuss attempts to resolve the fundamental security problems in Internet-of-Things (IoT) communication. Most of the solutions revolve around a data-centric approach, in which the information itself is given priority and handled directly, yielding a secure system that can work autonomously without hassle. The solutions described involve a mostly client-driven environment, reinforced with best practices from existing host-based approaches for IoT networks. With these solutions, existing security and privacy concerns will be addressed, and data-centric security approaches will be introduced. This presentation will showcase the preliminary work and exploration done in this context and the future directions we intend to pursue.

1:25 – 1:40pm Arya Farahi, Data Science Fellow, Michigan Institute for Data and AI in Society, University of Michigan

Towards Trustworthy and Fair Classifiers

Applications of machine learning and likelihood-based classifiers are increasingly infiltrating and shaping the fabric of our society. Hence, it is paramount to evaluate their performance not only through the lens of predictive power but also from a fairness point of view. In this talk, I argue that there is a link between the uncertainty calibration of a classifier and fairness in many practical settings. Furthermore, I argue that a fairly calibrated model must be group-wise calibrated; however, the majority of proposed metrics fail to identify such miscalibration. I propose a novel metric and a hypothesis-testing framework that can answer the following questions: (i) whether the outputs of a classifier are group-wise calibrated, and (ii) given two miscalibrated models, which one’s predictions are less unfair. I will then propose a remedy strategy to correct for group-wise biases and illustrate its performance in a number of simulated and real-world settings.
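
As background for the group-wise notion, the sketch below computes a standard expected calibration error separately per group; this is the conventional binned diagnostic, not the novel metric or hypothesis test proposed in the talk.

```python
import numpy as np

# Expected calibration error (ECE): bin predicted probabilities and compare
# mean confidence to empirical accuracy within each bin; then compute it
# separately per group to expose group-wise miscalibration.
def ece(probs, labels, n_bins=10):
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return err

def groupwise_ece(probs, labels, groups):
    return {g: ece(probs[groups == g], labels[groups == g])
            for g in np.unique(groups)}

rng = np.random.default_rng(0)
p = rng.random(2000)
g = rng.integers(0, 2, 2000)
y = (rng.random(2000) < np.where(g == 0, p, p ** 2)).astype(int)  # group 1 miscalibrated
print(groupwise_ece(p, y, g))
```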

1:40 – 1:55pm Qi Zhao, University of California, San Diego

Persistence Enhanced Graph Neural Network

Local structural information can increase the adaptability of graph convolutional networks to large graphs with heterogeneous topology. Existing methods only use relatively simple topological information, such as node degrees. We present a novel approach leveraging advanced topological information, i.e., persistent homology, which measures the information flow efficiency at different parts of the graph. To fully exploit such structural information in real-world graphs, we propose a new network architecture which learns to use persistent homology information to reweight messages passed between graph nodes during convolution. For node classification tasks, our network outperforms existing ones on a broad spectrum of graph benchmarks.

Oct. 30, Data Science and Human Health

12:05 – 12:20 pm Sarah Ben Maamar, Postdoc, Chemical and Biological Engineering, Northwestern University; Sophia Liu, Chemical and Biological Engineering, Northwestern University; Reese Richardson, Chemical and Biological Engineering, Northwestern University; Zhiheng Bai, Chemical and Biological Engineering, Northwestern University; Luis Nunes A. Amaral, Chemical and Biological Engineering, Northwestern University

Comprehensive analysis of the reproducibility of RNA-seq computational pipelines

Next-generation sequencing technologies have revolutionized biomedical research and become indispensable due to their low cost, the high volume of data they generate, and the wide variety of their applications. In particular, RNA sequencing (RNA-seq) has become widely used in biological and biomedical fields, as this technique allows the evaluation of gene expression levels in model organisms under different contexts. These contexts include the comparison of sick versus healthy cells; the effect of specific drugs on cells’ gene expression; the monitoring of changes in gene expression over time; or the discovery of the potential role of an unknown gene when comparing different tissues.

As the output of RNA-seq is complex and large, processing and analyzing such data requires complex computational pipelines involving multiple steps and software tools to make the data comprehensible. RNA-seq computational pipelines vary according to the application and can have up to six steps, with up to ten different tools available for each step. Each tool also offers multiple parameters to better tune the analysis for each application and dataset.

Despite the endless choices, there is currently no standardized pipeline agreed upon in the broad biomedical field. Thus, unless a computational pipeline used to process a dataset is thoroughly documented, it is almost impossible to reproduce the results obtained from a dataset after processing.

In this work, we analyze the documentation and replicability associated with each step of RNA-seq computational pipelines used to study differential gene expression in the model bacterium Escherichia coli. We particularly assess the intrinsic bias introduced by the use of each software tool at each step, as well as the bias associated with each parameter choice. Interestingly, we found that two to three steps of the RNA-seq computational pipeline particularly undermine the comparability of results between studies.

12:20 – 12:40 pm Valeri Vasquez, Graduate Student, Energy and Resources Group, University of California Berkeley

Optimizing Genetic-Based Public Health Interventions

I develop a nonlinear mathematical program, parameterized with empirical data, to obtain optimal strategies for the control of disease-carrying mosquitoes using a genetic technology called gene drive. The model incorporates resource and biological constraints as well as laboratory-informed genetic inheritance patterns. In permitting simultaneous constraints on both states and controls, the mathematical programming approach furnishes methodological improvements over classical optimal control and existing simulation-based work, while the inclusion of ecological details enables the scientifically sound design of future field trials.
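
A toy version of such a program, using scipy: choose release quantities that minimize total releases subject to suppressing the simulated population below a target, with per-period bounds standing in for resource constraints. The dynamics and every parameter below are illustrative stand-ins for the empirical, laboratory-informed model described.

```python
import numpy as np
from scipy.optimize import minimize

# Toy gene-drive suppression program: u[t] = weekly releases (controls),
# x = wild population (state). All numbers are invented for illustration.
T, r, K, eff = 20, 0.3, 10_000.0, 0.6
x0, target, u_max = 8_000.0, 500.0, 3_000.0

def simulate(u):
    x = x0
    for t in range(T):
        growth = x * (1 + r * (1 - x / K))               # logistic growth
        x = growth * (1 - eff * u[t] / (u[t] + x + 1e-9))  # drive suppression
    return x

res = minimize(
    fun=lambda u: u.sum(),                          # minimize total releases
    x0=np.full(T, 500.0),
    method="SLSQP",
    bounds=[(0.0, u_max)] * T,                      # per-week resource bound
    constraints=[{"type": "ineq", "fun": lambda u: target - simulate(u)}],
)
print(res.success, res.x.round(0))
print("final population:", simulate(res.x))
```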

12:40 – 12:55 pm Abby Stevens, Graduate student, Statistics, University of Chicago; Anna Hotton, Department of Medicine, University of Chicago; Chaitanya Kaligotla, Decision and Infrastructure Sciences Division, Argonne National Laboratory and Consortium for Advanced Science and Engineering, University of Chicago; Jonathan Ozik, Decision and Infrastructure Sciences Division, Argonne National Laboratory and Consortium for Advanced Science and Engineering, University of Chicago; Charles M. Macal, Decision and Infrastructure Sciences Division, Argonne National Laboratory and Consortium for Advanced Science and Engineering, University of Chicago

Modeling the Impact of Social Determinants of Health on COVID-19 Transmission and Mortality to Understand Health Inequities

The COVID-19 pandemic has highlighted drastic health inequities, particularly in cities such as Chicago, Detroit, New Orleans, and New York City. Reducing COVID-19 morbidity and mortality will likely require an increased focus on social determinants of health, given their disproportionate impact on populations most heavily affected by COVID-19. A better understanding of how factors such as household income, housing location, health care access, and incarceration contribute to COVID-19 transmission and mortality is needed to inform policies around social distancing and testing and vaccination scale-up.

This work builds upon an existing agent-based model of COVID-19 transmission in Chicago, CityCOVID. CityCOVID consists of a synthetic population that is statistically representative of Chicago’s population (2.7 million persons), along with their associated places (1.4 million places) and behaviors (13,000 activity schedules). During a simulated day, agents move from place-to-place, hour-by-hour, engaging in social activities and interactions with other colocated agents, resulting in an endogenous colocation or contact network. COVID-19 transmission is determined via a simulated epidemiological model based on this generated contact network by tuning (fitting) model parameters that result in simulation output that matches observed COVID-19 death and hospitalization data from the City of Chicago.

Using the CityCOVID infrastructure, we quantify the impact of social determinants of health on COVID-19 transmission dynamics, applying statistical techniques to empirical data to study the relationship between social determinants of health and COVID-19 outcomes.

12:55 – 1:10 pm Sean Kent, Graduate Student, Statistics, University of Wisconsin – Madison; Menggang Yu, Department of Biostatistics and Medical Informatics, University of Wisconsin – Madison; Yifei Liu, Google (Previously at University of Wisconsin – Madison)

Multiple Instance Learning from Distributional Instances

The Multiple Instance Learning (MIL) setting—a form of semi-supervised learning where instances are grouped into bags and each bag has a label determined by the unseen instance-level labels in that bag—is well studied in the literature, especially in the classification regime. A recent data set examining collagen fiber imaging features in ductal carcinoma in situ (DCIS) patients falls outside the scope of previously studied MIL problems. In this data, slides are labeled as tumor tissue or normal tissue, and each slide has multiple spots that are passed through imaging software to pull out features on the thousands of fibers within that spot. We think of this data set as an extension of MIL where the instances (spots) are distributions instead of feature vectors, and each distribution has a large number of samples (fibers) drawn from it in the data. The novel setting, which we call Multiple Instance Learning from Distributional Instances (MILD), has no off-the-shelf methods that can be directly applied. In this presentation, we examine several approaches to predict the bag label under this novel framework in both motivating toy examples and the data set of interest. We find that our proposed method outperforms other indirect methods that ignore this unique data structure.
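
One naive baseline for the MILD setting, useful as a contrast with the proposed method: summarize each distributional instance by per-feature quantiles of its samples, mean-pool instance summaries into a bag vector, and fit an ordinary classifier. Everything below, including the synthetic data, is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

Q = (0.1, 0.25, 0.5, 0.75, 0.9)

def embed_instance(fibers):                  # fibers: (n_samples, n_features)
    """Summarize one distributional instance (a spot) by its quantiles."""
    return np.quantile(fibers, Q, axis=0).ravel()

def embed_bag(spots):                        # spots: list of fiber arrays
    inst = np.stack([embed_instance(s) for s in spots])
    return inst.mean(axis=0)                 # mean-pool instances into a bag

rng = np.random.default_rng(0)
def make_bag(label):                         # synthetic: tumor shifts features
    return [rng.normal(label * 0.5, 1.0, size=(1000, 4)) for _ in range(5)]

bags = [make_bag(y) for y in (0, 1) * 30]
y = np.array((0, 1) * 30)
X = np.stack([embed_bag(b) for b in bags])
print(LogisticRegression(max_iter=1000).fit(X, y).score(X, y))
```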

1:10 – 1:25 pm LaKeithia Glover, graduate student, Social Work, Clark Atlanta University

Suicide Rates Among African American Youth as They Relate to the Influence of Different Types of Communication

Suicide rates for African American youth and young adults have seen a steady increase in recent years. For generations, African Americans were discouraged from, and steadily removed from, communication tools that promote open interaction. This resulted in increased trauma, mental health needs, and disproportionate access to resources. With a continual lack of access to supportive communication tools over multiple generations, increased mental health concerns and a practice of verbally minimizing issues became a multi-generational norm. Presently, however, as our society transforms through the aid of technology, the ways that individuals communicate have taken on a new trajectory. Communication, or the perceived lack thereof, through media, family, and spirituality all constitute major influences on suicide rates among African American youth and young adults.

1:10 – 1:25 pm Matt Satusky, graduate student, University of North Carolina at Chapel Hill

BioData Catalyst and Deep Learning: Feature extraction on a full-feature platform

NHLBI BioData Catalyst is a cloud-based analytics ecosystem that hosts close to 3 petabytes of health data including whole genome sequencing, RNA-seq, clinical variables, and medical imaging. Through harmonization of variables across studies, BioData Catalyst allows for cross-study cohort creation and provides a collaborative space for researchers to build and share analytical methods within a secure environment.

With the ongoing COVID-19 pandemic, BioData Catalyst has been chosen to host clinical data from collaborating institutions, giving researchers a single access point for developing COVID-19 diagnostic applications. To expedite these new research goals, we have begun incorporating deep learning tools for chest CT image analysis and natural language processing of medical records.
