2022 Poster Award Recipients
Underline denotes the presenting author
Summary: Antibiotic resistance is becoming a significant public health concern worldwide, with few novel treatments being discovered. Drug combination therapy is a promising solution against antibiotic resistance, but the search for effective drug combinations within a vast combinatorial space is time- and resource-intensive. We developed an ML-based algorithm that (1) predicts effective multi-drug therapies, (2) utilizes multi-omics datasets to uncover complex drug-drug interactions, (3) overcomes the need for high-cost experimental datasets, (4) provides lucid interpretations of model predictions, and (5) accounts for drug toxicity profiles to design safer treatments. Current approaches cannot address more than one of these concerns simultaneously.
Methods: The approach involves a three-step process: (1) generating feature profiles for individual drug treatments using multi-omics data such as metabolomics, proteomics, and structural profiles; (2) preprocessing to compute joint profiles for a combination, accounting for similarity and uniqueness among the drugs in the treatment; and (3) feeding this information to the ML algorithm for model development and performance evaluation.
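A minimal sketch of step (2), assuming each drug has already been summarized as a numeric multi-omics feature vector; the mean/shared/unique pooling here is illustrative, not our model's exact encoding:

```python
# Illustrative sketch of step (2): building a joint profile for a drug
# combination from per-drug multi-omics feature vectors. The exact
# similarity/uniqueness encoding used by the model may differ.
import numpy as np

def joint_profile(drug_features: list[np.ndarray]) -> np.ndarray:
    """Combine per-drug profiles into one combination-level feature vector."""
    X = np.vstack(drug_features)
    shared = X.min(axis=0)           # signal common to all drugs in the combo
    unique = X.max(axis=0) - shared  # signal contributed by only some drugs
    return np.concatenate([X.mean(axis=0), shared, unique])

# Example: a three-drug combination with 8 multi-omics features per drug
rng = np.random.default_rng(0)
profile = joint_profile([rng.random(8) for _ in range(3)])
print(profile.shape)  # (24,)
```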
Results: Performance was evaluated on several drug combination datasets: two-way interactions (R=0.6015***), two-way interactions in glycerol (R=0.5178***), three-way interactions (R=0.4515***), and sequential interactions (R=0.4337***) [*** indicates p-value < 1e-3]. The trained model was interpreted using the Testing with Concept Activation Vectors (TCAV) approach, which indicates that subsystems such as pyruvate metabolism, the TCA cycle, and oxidative phosphorylation play an important role in the model's predictions. Additionally, the trained model was fine-tuned to predict the toxicity of combination therapies to ensure treatment safety.
Impact: Because novel treatments are not readily discovered, it is crucial to design treatments using FDA-approved drugs. Our algorithm aims to provide an innovative and unique perspective on using machine learning to guide the development of multi-drug treatments from FDA-approved drugs.
While there are sources for Spanish-English bilingual speech data, they are not sufficiently accessible for analysis. We propose an open-access data repository of Spanish-English bilingual speech called ES COCO (English-Spanish COde-switching COrpus). ES COCO will contain tagged speech from podcasts and existing corpora, along with metadata such as speaker and demographic information. While some multilingual corpora exist (e.g., BilingBank (MacWhinney, 2019)), this will be the largest Spanish-English corpus for which researchers do not need to aggregate data themselves. Most of the ES COCO data are untranscribed audio recordings.
We use the XLS-R neural network, fine-tuned on the Spanish and English components of the CommonVoice dataset, for speech-to-text conversion (Babu et al., 2021; Ardila et al., 2020). We improve transcription accuracy by boosting the decoder with an n-gram language model trained on Spanish-English datasets from the LinCE Benchmark (Aguilar et al., 2020). Once converted to text, we apply an automated tagging process for part-of-speech and language to annotate linguistic features of the data. These processes rely on a transformer language model, XLM-RoBERTa (Conneau et al., 2019), fine-tuned using Spanish-English datasets from LinCE (Aguilar et al., 2020).
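A hedged sketch of this transcription pipeline using the transformers and pyctcdecode libraries; the checkpoint name and n-gram file are placeholders, not the actual ES COCO artifacts:

```python
# Hedged sketch: an XLS-R CTC model boosted with a KenLM n-gram decoder.
# Checkpoint and LM paths are placeholders.
import torch, librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from pyctcdecode import build_ctcdecoder

CKPT = "xlsr-es-en-commonvoice"  # hypothetical fine-tuned checkpoint
processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT).eval()

# Order tokens by vocabulary id so decoder labels align with logit columns.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
decoder = build_ctcdecoder(labels, kenlm_model_path="lince_5gram.arpa")  # n-gram LM

audio, _ = librosa.load("podcast_clip.wav", sr=16_000)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0].cpu().numpy()
print(decoder.decode(logits))  # LM-boosted beam-search transcript
```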
In addition to providing the corpus in a machine-readable format, we enable data exploration with a user-friendly interface, which can be run locally on the user’s machine or accessed via web browser. Users can search and filter the corpus by linguistic feature and view results in context, allowing them to quickly answer questions about bilingual language practices.
By creating this corpus of Spanish-English speech data, we remove the largest barriers in language research: the time and financial cost of collecting, transcribing, and tagging data. ES COCO is particularly beneficial for researchers who are not at R1 institutions and have limited access to funding, personnel, time, and the language communities required for language research.
Autonomous vehicles represent one of the most active technologies currently being developed, with research areas addressing, among others, the modeling of the states and behavioral elements of the occupants. This paper contributes to this line of research by studying the circadian rhythm of individuals using a novel multimodal dataset of 36 subjects consisting of five information channels. These channels include visual, thermal, physiological, linguistic, and background data.
Moreover, we propose a framework to explore whether the circadian rhythm can be modeled without continuous monitoring, and we investigate the hypothesis that multimodal features yield improved performance when using data points from specific times of the day. Our analysis shows that multimodal fusion can reach an accuracy of up to 77% in identifying energized and enervated states of the participants. Our findings support this hypothesis and present a novel approach for future research.
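A stylized late-fusion baseline of the kind this analysis suggests, with placeholder features for the five channels (the actual fusion architecture may differ):

```python
# Stylized late fusion: concatenate per-channel feature vectors and classify
# energized vs. enervated states. Features and labels are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 360  # illustrative number of labeled time points
channels = {"visual": 32, "thermal": 8, "physio": 16, "linguistic": 24, "background": 4}
X = np.hstack([rng.random((n, d)) for d in channels.values()])  # fused features
y = rng.integers(0, 2, n)                                       # energized / enervated

print(cross_val_score(RandomForestClassifier(300, random_state=0), X, y, cv=5).mean())
```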
Optical multi-layer thin films are widely used in optical and energy applications requiring photonic designs. Engineers often design such structures based on their physical intuition. However, solely relying on human experts can be time-consuming and may lead to sub-optimal designs, especially when the design space is large.
In this work, we frame the multi-layer optical design task as a sequence generation problem. Based on reinforcement learning, a deep sequence generation network is proposed for efficiently generating optical layer sequences. We train the deep sequence generation network with proximal policy optimization to generate multi-layer structures with desired properties. The proposed method is applied to two energy applications.
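A minimal sketch of this formulation, assuming a discretized token set of (material, thickness) choices and a placeholder merit function standing in for the optical simulator; it uses stable-baselines3 PPO on a gymnasium environment:

```python
# Sketch: each action appends one layer token to the stack; the episode reward
# scores the finished structure. The merit function is a stand-in for a real
# optical simulation (e.g., transfer-matrix evaluation).
import numpy as np
import gymnasium as gym
from stable_baselines3 import PPO

N_TOKENS, MAX_LAYERS = 12, 10  # discretized material/thickness choices

class MultilayerEnv(gym.Env):
    observation_space = gym.spaces.MultiDiscrete([N_TOKENS + 1] * MAX_LAYERS)
    action_space = gym.spaces.Discrete(N_TOKENS)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.seq = []
        return self._obs(), {}

    def _obs(self):
        padded = self.seq + [N_TOKENS] * (MAX_LAYERS - len(self.seq))
        return np.array(padded, dtype=np.int64)

    def step(self, action):
        self.seq.append(int(action))
        done = len(self.seq) == MAX_LAYERS
        reward = self._merit() if done else 0.0
        return self._obs(), reward, done, False, {}

    def _merit(self):
        # Placeholder figure of merit; replace with an optics simulator.
        return -abs(sum(self.seq) - 2 * N_TOKENS) / N_TOKENS

model = PPO("MlpPolicy", MultilayerEnv(), verbose=0)
model.learn(total_timesteps=20_000)
```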
Our algorithm successfully discovered high-performance designs, outperforming structures designed by human experts and state-of-the-art algorithms. We believe our reinforcement learning-based algorithm can extend to many other multi-layer design tasks and achieve high performance.
We propose a stochastic method for solving equality constrained optimization problems that utilizes predictive variance reduction. Specifically, we develop a method based on the sequential quadratic programming paradigm that employs variance reduction in the gradient approximations. Under reasonable assumptions, we prove that a measure of first-order stationarity evaluated at the iterates generated by our proposed algorithm converges to zero in expectation from arbitrary starting points, for both constant and adaptive step size strategies. Finally, we demonstrate the practical performance of our proposed algorithm on constrained binary classification problems that arise in machine learning.
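One standard way to instantiate a variance-reduced (SVRG-style) gradient inside an SQP step, shown in our own notation as a sketch consistent with, though not necessarily identical to, the proposed method; here $f$ is the finite-sum objective, $c$ the equality constraints with Jacobian $J$, $H_k$ a positive-definite model Hessian, and $\tilde{x}$ a periodically refreshed snapshot point:

```latex
% SQP subproblem with an SVRG-style gradient estimate (illustrative)
\begin{aligned}
& g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla f(\tilde{x}),
  \qquad i_k \sim \mathrm{Unif}\{1,\dots,N\},\\
& d_k = \arg\min_{d}\; g_k^\top d + \tfrac{1}{2}\, d^\top H_k d
  \quad \text{s.t.}\quad c(x_k) + J(x_k)\, d = 0,\\
& x_{k+1} = x_k + \alpha_k d_k .
\end{aligned}
```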
To highlight difficulties in learning-based optimal control in nonlinear stochastic dynamic systems, we study admission control for a classical Erlang-B blocking system with unknown service rate. At every job arrival, a dispatcher decides to assign the job to an available server or to block it. Every served job yields a fixed reward for the dispatcher, but it also results in a cost per unit time of service.
Our goal is to design a dispatching policy that maximizes the long-term average reward for the dispatcher based on observing the arrival times and the state of the system at each arrival. Critically, the dispatcher observes neither service times nor departure times, so reinforcement learning-based approaches do not apply. Hence, we develop our learning-based dispatch scheme as a parametric learning problem à la self-tuning adaptive control. In our problem, certainty equivalent control switches between an always-admit policy (always explore) and a never-admit policy (immediately terminate learning), which is distinct from the adaptive control literature. Therefore, our learning scheme judiciously uses the always-admit policy so that learning does not stall.
We prove that for all service rates, the proposed policy asymptotically learns to take the optimal action, and we also present finite-time regret guarantees. The extreme contrast between the certainty equivalent optimal control policies leads to learning difficulties that show up in our regret bounds for different parameter regimes. We explore this aspect in our simulations and also follow up on sampling-related questions for our continuous-time system.
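In this setup, with reward $R$ per served job, cost $c$ per unit service time, and mean service time $1/\mu$, the expected net reward of admitting a job is $R - c/\mu$; the certainty-equivalent rule under an estimate $\hat{\mu}$ is therefore the threshold policy below (our notation; the paper's symbols may differ):

```latex
\pi_{\mathrm{CE}}(\hat{\mu}) =
\begin{cases}
\text{always admit (when a server is free)}, & \hat{\mu} > c/R,\\[2pt]
\text{never admit}, & \hat{\mu} \le c/R,
\end{cases}
```

which makes the exploration dilemma concrete: a pessimistic estimate $\hat{\mu} \le c/R$ would stop all admissions and hence all learning.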
We present FedScale, a federated learning (FL) benchmarking suite with realistic datasets and a scalable runtime to enable reproducible FL research. FedScale datasets encompass a wide range of critical FL tasks, ranging from image classification and object detection to language modeling and speech recognition. Each dataset comes with a unified evaluation protocol using real-world data splits and evaluation metrics.
To reproduce realistic FL behavior, FedScale contains a scalable and extensible runtime. It provides high-level APIs to implement FL algorithms, deploy them at scale across diverse hardware and software backends, and evaluate them at scale, all with minimal developer efforts. We combine the two to perform systematic benchmarking experiments and highlight potential opportunities for heterogeneity-aware co-optimizations in FL.
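For illustration only (this is not FedScale's API), the following sketch shows the kind of heterogeneity-aware client selection that such a runtime makes easy to benchmark, trading statistical utility against per-client system speed:

```python
# Illustrative heterogeneity-aware client selection for one FL round.
# All profiles are random placeholders; NOT FedScale's actual interface.
import random

clients = [
    {"id": i, "utility": random.random(), "speed_s": random.uniform(5, 60)}
    for i in range(100)
]

def select(clients, k=10, alpha=0.5):
    """Score = data utility penalized by expected round time (alpha balances)."""
    scored = sorted(clients,
                    key=lambda c: c["utility"] - alpha * c["speed_s"] / 60,
                    reverse=True)
    return scored[:k]

round_cohort = select(clients)
print([c["id"] for c in round_cohort])
```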
FedScale is open-source and actively maintained by contributors from different institutions at fedscale.ai. We welcome feedback and contributions from the community.
Although machine learning (ML) models are used in many fields to make predictions or decisions, we often do not know why these models make the predictions or decisions they do. Such ML models, commonly known as "black boxes," cannot answer why they are certain about a prediction, what accounted for the uncertainty, or how much perpetuated bias exists. This lack of accountability impedes trustworthy communication between humans and models.
The current paradigm in ML research includes model-based interpretability (e.g., linear models) and post hoc interpretability. While model-based interpretability suffers from poor accuracy, post hoc interpretability of black-box models lacks adequate descriptive accuracy.
Here we present a method of developing high-accuracy interpretable machine learning models in the context of materials discovery and design. We also attempt to establish causal relationships between input features and target outputs. In principle, this data-driven approach can be used in other disciplines, including science, the arts, engineering, and health care.
Humans acquire language through sensorimotor experience with the world. The ability to connect language to its referents in the physical world (referred to as grounding) plays an important role in language understanding and language learning. This ability, although effortless for humans, is notoriously difficult for AI agents.
To address this limitation, we introduce a new task formulation and new metrics to emphasize grounding in word learning. Specifically, we introduce Open-Vocabulary Referential Cloze (RefCloze) to challenge vision-language systems to perform visually grounded and object-centric language modeling. We propose the Masked Language DEtection TRansformer (MaskDETR), a novel and simple visually grounded language model pre-trained on image-text pairs with fine-grained word-object alignment.
Through extensive experiments, we demonstrate that MaskDETR is a more coherent grounded word learner and that learning the referential grounding between words and objects is crucial to grounded word learning and processing. We further present a comprehensive inquiry into the cognitive plausibility of such a vision-language transformer as a human-like word learner. The RefCloze task formulation, the new evaluation metrics, and our empirical findings will provide insight for future work on grounded language acquisition.
Two of the most fundamental challenges in Natural Language Understanding (NLU) at present are: (a) how to establish whether deep learning-based models score highly on NLU benchmarks for the 'right' reasons; and (b) what those reasons would even be. We investigate the behavior of reading comprehension models with respect to two linguistic 'skills': coreference resolution and comparison.
We propose a definition for the reasoning steps expected from a system that is 'reading slowly', and compare that with the behavior of five models of the BERT family of various sizes, observed through saliency scores and counterfactual explanations. We find that for comparison (but not coreference) the systems based on larger encoders are more likely to rely on the 'right' information, but even they struggle with generalization, suggesting that they still learn specific lexical patterns rather than general principles of comparison.
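An illustrative gradient-times-embedding saliency computation for a BERT-family classifier; the checkpoint is a stand-in and the paper's exact attribution method may differ:

```python
# Gradient x embedding saliency per input token (illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"  # stand-in; any BERT-family checkpoint works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

enc = tok("Tom is taller than Jim. Who is taller?", return_tensors="pt")
emb = model.get_input_embeddings()(enc["input_ids"])  # (1, T, H)
emb.retain_grad()
out = model(inputs_embeds=emb, attention_mask=enc["attention_mask"])
out.logits[0, out.logits.argmax()].backward()          # grad of top class score

saliency = (emb.grad * emb).sum(-1).abs()[0]           # per-token scores
for t, s in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), saliency):
    print(f"{t:>12s} {s.item():.4f}")
```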
The full paper has been accepted to COLING 2022 and is available here.
We consider the stochastic linear contextual bandit problem with high-dimensional features. We analyze the Thompson sampling algorithm, using special classes of sparsity-inducing priors (e.g., spike-and-slab) to model the unknown parameter, and provide a nearly optimal upper bound on the expected cumulative regret. To the best of our knowledge, this is the first work to provide theoretical guarantees for Thompson sampling in high-dimensional and sparse contextual bandits.
For faster computation, we use a spike-and-slab prior to model the unknown parameter and variational inference instead of MCMC to approximate the posterior distribution. Extensive simulations demonstrate improved performance of our proposed algorithm over existing ones. This encourages the use of the Thompson sampling algorithm in high-dimensional bandit problems arising in many modern areas such as recommendation systems, personalized healthcare, and experimental design.
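A skeleton of linear Thompson sampling for intuition; it uses the conjugate Gaussian posterior for clarity, whereas our algorithm instead maintains a variational spike-and-slab approximation to induce sparsity:

```python
# Linear Thompson sampling with a conjugate Gaussian posterior (skeleton).
import numpy as np

d, T, sigma2 = 50, 1000, 0.25
rng = np.random.default_rng(1)
theta_true = np.zeros(d); theta_true[:5] = 1.0       # sparse ground truth

A = np.eye(d)                                        # posterior precision (prior N(0, I))
b = np.zeros(d)                                      # precision-weighted moment
for t in range(T):
    contexts = rng.normal(size=(20, d))              # 20 candidate arms
    cov = np.linalg.inv(A)
    theta_s = rng.multivariate_normal(cov @ b, cov)  # posterior sample
    x = contexts[np.argmax(contexts @ theta_s)]      # act greedily w.r.t. the sample
    r = x @ theta_true + rng.normal(scale=np.sqrt(sigma2))
    A += np.outer(x, x) / sigma2                     # Bayesian update
    b += x * r / sigma2
```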
We can benefit from researchers in application domains where bandit methodology is useful. We can also benefit from collaborations with researchers working on bandit problems in fields such as computer science, operations research, and electrical engineering.
With the rapid growth of online news aggregators, the debate on whether news aggregators should pay news publishers for redistributing their content has become very salient. However, there is little understanding of the impact of carrying news on news aggregators, especially for their non-news content.
Our research fills this gap by examining the impact of news on non-news user engagement and content generation on Facebook. We leverage a natural experiment, Facebook’s Australian news shutdown, to estimate this using both an event study and a difference-in-differences analysis. We find that both user engagement and content generation of non-news content on Facebook decreased after the news shutdown. We also find that these effects were more pronounced for more influential, socially active, experienced, and verified accounts.
These results suggest positive spillover effects of news content on non-news content. A simple quantification exercise shows that the impact of carrying news is economically significant for a platform like Facebook. Our results provide timely and relevant implications for regulators and social media platforms.
This is a multi-faceted approach that uses multiple tools:
– OpenShift & Jenkins for a daily cron job to retrieve course info and peer review results
– Google Cloud Services – BigQuery for flexible data storage
– Vertex AI to build and deploy machine learning solutions
– PyTorch and the RoBERTa transformer architecture to perform powerful natural language processing and inference (see the sketch after this list)
– Tableau to build interactive and intuitive dashboards that deliver focused insights for instructors and students
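As referenced above, a sketch of the NLP step: scoring peer-review comments with a RoBERTa classifier via the transformers pipeline. The checkpoint is a placeholder, not our deployed model:

```python
# Placeholder RoBERTa classifier over peer-review comments (illustrative).
from transformers import pipeline

clf = pipeline("text-classification", model="roberta-base")  # placeholder checkpoint
comments = [
    "The analysis section was thorough and well organized.",
    "Key results are missing and the citations are incomplete.",
]
for c, pred in zip(comments, clf(comments)):
    print(f"{pred['label']:>8s} ({pred['score']:.2f})  {c}")
```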
Prediction markets offer an alternative to polls and surveys for the elicitation and combination of private beliefs about uncertain events. The advantages of prediction markets include time-continuous aggregation and score-based incentives for truthful belief revelation. Traditional prediction markets aggregate point estimates of forecast variables. However, exponential family prediction markets (Abernethy et al., 2014) provide a framework for eliciting and combining entire belief distributions of forecast variables.
We study a member of this family, Gaussian markets, which combine the private Gaussian belief distributions of traders about the future realized value of some real random variable. Specifically, we implement a multi-agent simulation environment with a central Gaussian market maker and a population of Bayesian traders. Our trader population is heterogeneous, separated along two variables: informativeness, or how much information a trader privately possesses about the random variable, and budget.
We draw inspiration from previous work (Martin et al., 2021), which studied another member of the exponential family in simulation. We generalize their notion of informativeness and provide a characterization of the corresponding budget-constrained optimization process. Within our market ecosystem, we analyze the impact of trader budget and informativeness, as well as the arrival order of traders, on the market's convergence. We also study financial properties of the market such as trader compensation and market maker loss.
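A stylized version of this simulation, abstracting away the cost-function mechanics of the exponential-family market maker: the market state is a Gaussian summarized by a precision-weighted mean, each Bayesian trader combines it with a private signal, and budget caps how far the trader moves the market:

```python
# Stylized Gaussian-market simulation with heterogeneous Bayesian traders.
import numpy as np

rng = np.random.default_rng(7)
true_value = 2.0
mkt_mu, mkt_prec = 0.0, 0.1                       # diffuse initial market state

traders = [{"prec": rng.uniform(0.5, 5.0),        # informativeness
            "budget": rng.uniform(0.1, 1.0)} for _ in range(200)]

for tr in traders:
    signal = true_value + rng.normal(scale=tr["prec"] ** -0.5)
    # Trader's posterior: precision-weighted combination of market and signal.
    post_prec = mkt_prec + tr["prec"]
    post_mu = (mkt_prec * mkt_mu + tr["prec"] * signal) / post_prec
    # Budget caps how far the trader can move the market toward their posterior.
    step = min(1.0, tr["budget"])
    mkt_mu += step * (post_mu - mkt_mu)
    mkt_prec += step * tr["prec"]

print(f"market mean {mkt_mu:.3f} (true value {true_value})")
```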
The board game industry has grown tremendously in the past few decades. Its growth has continued partly due to the Covid-19 pandemic, as people spent more time at home, and the industry is projected to reach revenues of around $30 billion by 2026. As another piece of evidence, the global board games market is projected to grow by $3.02 billion during 2021-2026, at a compounded annual growth rate (CAGR) of approximately 13%, and is expected to reach $13 billion by 2026 (Businesswire 2021).
On the other hand, supply chain disruptions, such as cargo shortages resulting from the Covid-19 pandemic, brought difficulties to some game publishers such as Tasty Minstrel Games, which was reportedly in "virtual bankruptcy" despite its games being widely appreciated by customers. Still others were merged with larger publishers.
In this study, we focus on the Asmodee Group, one of the major players in the board game industry, which "quietly built a board-game empire with Catan, Pandemic, and Ticket to Ride" (Tullis, 2021) through multiple mergers and acquisitions of smaller publishers and game distributors. In August 2021, we obtained data from www.boardgamegeek.com (BGG), the largest online community of board-game users and designers, via its Application Programming Interface (API) using Python.
After data cleaning, we investigate the relationships between designer teams and game performance. We measure game performance in three ways: the popularity of a game, ratings from customers, and attention received from the market. Our findings provide insights into what constitutes a good design team for making a good game.
Pathogens are becoming increasingly drug resistant, yet drug discovery methods have failed to produce new classes of antimicrobials for decades; thus, there is an urgent need to identify effective therapies among existing FDA-approved drugs. Multi-drug regimens are currently used to fight antibiotic resistance, but they are often chosen empirically, leading to suboptimal treatment outcomes and the spread of resistance.
To create new multi-drug treatment plans, computational tools are needed to narrow the vast space of FDA-approved drug combinations. Current computational methods rely on costly experimental data and black-box algorithms. Our model replaces omics data with drug-protein binding affinity calculations to predict effective drug combinations for Escherichia coli. Initial performance assessments show our model performs as well as models that require omics inputs. Molecular docking and neural networks were used to calculate binding affinities between 59 drugs and 1499 proteins in E. coli. These drug-protein interactions were then used as features in the ML model.
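A sketch of this featurization, with random placeholders for the docking-derived affinities and interaction labels; a combination is encoded by element-wise pooling of its drugs' 1499-dimensional affinity vectors:

```python
# Combination features from a drug x protein binding-affinity matrix
# (placeholder data; pooling choices are illustrative).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
affinity = rng.random((59, 1499))                 # 59 drugs x 1499 E. coli proteins

def combo_features(drug_idx):
    sub = affinity[list(drug_idx)]
    return np.concatenate([sub.mean(0), sub.min(0), sub.max(0)])

pairs = [(i, j) for i in range(59) for j in range(i + 1, 59)]
X = np.array([combo_features(p) for p in pairs])
y = rng.normal(size=len(pairs))                   # placeholder interaction scores

model = RandomForestRegressor(n_estimators=200, n_jobs=-1).fit(X, y)
```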
During model construction, the biochemical principles behind drug mechanisms of action were investigated by examining the extensive set of drug-protein interaction calculations as well as three omics studies covering chemogenomics, transcriptomics, and metabolomics. We have optimized a complex system combining molecular-scale drug-protein interactions with macro-scale drug-drug interaction data to quickly predict drug therapies that could be used to treat deadly drug-resistant pathogens.
Due to the flexible, multiscale, and hybrid nature of our model, many combinations that are infeasible to interrogate via physical experiments, owing to cost and time, can be examined. The model also enables exploration of the underlying biological and chemical factors that influence drug mechanisms of action for better design of drug combination therapies. Our predictive model combines machine learning, deep learning, and physics-based molecular docking, and could impact and inspire future hybrid methodologies in AI and data science.
A change in a single amino acid of a protein can be either disease-causing or benign, due in large part to effects on protein stability and on binding with other proteins, nucleic acids, or small-molecule ligands. We developed a database, the Annotated Database of Disease RElated Structures and Sequences (ADDRESS), mapping human genetic mutations to protein structures in the Protein Data Bank (PDB).
We found that mutations that shift the equilibrium toward the unfolded (non-native) state are, on average, more often disease-causing than those that approximately retain stability. Interestingly, the threshold at which mutations become pathogenic is substantially less than the average stability of proteins in general, perhaps indicating the importance of cellular kinetics in a system where proteins are constantly degraded and misfolded.
We built decision trees inclusive of various topology relations and found that the cross relation was especially indicative of whether the mutation causes disease, in the case of non-essential proteins with low stability change. We also found that, in the case of treatability of a set of lysosomal storage disorders, stability change, binding to ligand, and an aspect of topology likely related to kinetics of the system were important in indicating whether a drug was effective. Incorporation of binding and aggregation propensity will build upon the current database.
Chromatin architecture, a key regulator of gene expression, can be inferred using chromatin contact data from chromosome conformation capture or Hi-C technology. However, classical Hi-C does not preserve multi-way contacts. Here we use long sequencing reads to map genome-wide multi-way contacts and investigate higher order chromatin organization in the human genome. Multi-way chromatin contact data captured with Pore-C technology contains structural information beyond the pairwise data captured with traditional Hi-C. This allows for more precise characterization of chromatin architecture and lends itself to efficient representation by hypergraphs, which capture this higher order network structure.
We use hypergraph theory for data representation and analysis, and quantify higher order structures in neonatal fibroblasts, biopsied adult fibroblasts, and B lymphocytes. Hypergraphs and tensors are natural representations of the contact structure in the genome.
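A minimal example of the incidence-matrix representation, with toy reads standing in for Pore-C contacts; note how projecting to a pairwise adjacency matrix discards the multi-way structure:

```python
# Hypergraph incidence matrix: rows are genomic loci (nodes), columns are
# Pore-C reads (hyperedges). Contacts here are toy values.
import numpy as np

n_loci = 6
reads = [[0, 1, 2], [1, 3], [2, 3, 4, 5]]        # each read = one multi-way contact

H = np.zeros((n_loci, len(reads)), dtype=int)     # incidence matrix
for e, loci in enumerate(reads):
    H[loci, e] = 1

node_degree = H.sum(axis=1)                       # contacts per locus
edge_order = H.sum(axis=0)                        # loci per contact (order > 2 = multi-way)
adjacency = H @ H.T - np.diag(node_degree)        # pairwise projection (loses information)
print(node_degree, edge_order)
```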
Furthermore, we investigated the relationship between the multiway and pairwise data captured with Pore-C and Hi-C technology. By integrating multi-way contacts with chromatin accessibility, gene expression, and transcription factor binding, we introduce a data-driven method to identify cell type-specific transcription clusters. We provide transcription factor-mediated functional building blocks for cell identity that serve as a global signature for cell types.
Gabrielle Dotson, Can Chen, Stephen Lindsly, Anthony Cicalo, Sam Dilworth, Charles Ryan, Sivakumar Jeyarajan, Walter Meixner, Cooper Stansbury, Joshua Pickard, Nicholas Beckloff, Amit Surana, Max Wicha, Lindsey Muir, and Indika Rajapakse. "Deciphering Multi-way Interactions in the Human Genome." Nature Communications, in press (2022).
Joshua Pickard, Rahmey Salhm, Can Chen, Amit Surana, and Indika Rajapakse. "Hypergraph Analysis Toolbox for Long Read Sequencing." Manuscript in preparation.
Managing file-based workflows is a cross-disciplinary headache. We highlight how tools originally developed to manage simulation data can be applied to simplify certain tasks in machine learning and data science, like generating data, selecting models, and streamlining the hyperparameter optimization of neural networks. The signac framework consists of three Python packages to help organize file-based projects, define reproducible computational workflows, and explore the data. It provides a command line and Python interface to access and manage project data as well as submit cluster jobs to high performance computing schedulers.
Signac implements a file-based database with no need to explicitly define a data schema. Signac organizes collections of parameter values as signac jobs and stores them in a flat directory structure. Using the command line or Python query interface, you can access data stored in the job directory, get job-specific file paths, and generate human-readable directory structures for sharing. This frees you from thinking about the minutiae of file organization and lets the data schema evolve with the project.
Signac-flow lets you define a computational workflow composed of operations with pre- and post-conditions. Using the command line interface, operations can be run locally or submitted to high-performance computing systems, with built-in support for the Great Lakes cluster. The signac-dashboard package helps you inspect and filter jobs in a signac project. It runs a local web server and can interactively display files, videos, and images such as learning curves.
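A hedged sketch of a signac + signac-flow hyperparameter sweep, following the patterns in the signac documentation (exact calls may vary across versions):

```python
# init_sweep.py -- one signac job per hyperparameter state point
import signac

project = signac.init_project()          # creates the workspace and project config
for lr in (1e-2, 1e-3, 1e-4):
    for width in (64, 128):
        job = project.open_job({"lr": lr, "width": width})
        job.init()                       # one directory per state point

# project.py -- a workflow with a post-condition
from flow import FlowProject

class Sweep(FlowProject):
    pass

@Sweep.post.isfile("model.pt")           # operation is complete once model.pt exists
@Sweep.operation
def train(job):
    # job.sp holds the state point; job.fn() gives job-local file paths
    print(f"training lr={job.sp.lr} width={job.sp.width}")
    open(job.fn("model.pt"), "w").close()

if __name__ == "__main__":
    Sweep().main()                       # run via: python project.py run
```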
Developers and users are active in the Slack channel and happy to welcome new users. Check out signac.io for more.
Moiré patterns in van der Waals heterostructures have lately received significant attention in 2D materials research. They are open to external control through the twist angle between the layers while having a significant impact on the band structure and properties of the heterostructure, as in "magic-angle" superconducting graphene. While existing nanoscale measurement techniques such as transmission electron microscopy and near-field tip-enhanced microscopy can directly measure Moiré patterns, these techniques are typically slow, costly, and often require sample preparation incompatible with other measurements and experiments.
We attempt to overcome these limitations by applying machine learning to the far-field data scattered through a metalens placed in the near-field region of the sample. The metalens, which consists of a collection of dipole resonators placed in the near field of the sample, is able to scatter the evanescent high-spatial-frequency near-field information to detectors in the far field. Using a U-Net convolutional neural network trained according to the metalens scatterer arrangement, we are able to reconstruct the near-field pattern from the scattered far-field data.
We model the problem using a simulation of the metalens that models the interaction of the resonant dipoles with the near-field as well as the dipole-dipole interactions within the metalens. This allows us to quickly generate a training dataset of tens of thousands of near-field configurations and far-field output. Using this, we investigate the effect of different metalens designs on the near-field reconstruction.
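A toy version of this training setup, with a random linear operator standing in for the dipole scattering simulation and a small fully connected network in place of the U-Net:

```python
# Toy inverse problem: far-field intensities generated from near-field patterns
# through a fixed random "metalens" operator; a small network learns the inverse.
import torch, torch.nn as nn

n_field, n_det = 32 * 32, 256
A = torch.randn(n_det, n_field) / n_field ** 0.5    # surrogate scattering operator

near = torch.rand(4096, n_field)                    # synthetic near-field patterns
far = near @ A.T + 0.01 * torch.randn(4096, n_det)  # noisy far-field measurements

net = nn.Sequential(nn.Linear(n_det, 1024), nn.ReLU(),
                    nn.Linear(1024, n_field))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for epoch in range(10):
    pred = net(far)
    loss = nn.functional.mse_loss(pred, near)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"reconstruction MSE: {loss.item():.4f}")
```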
These results pave the way for future physical implementations to allow direct single-shot measurement of Moiré lattices in heterostructures in the far-field. They may find further application in nanofabrication metrology, enabling single-shot optical measurement of subwavelength features beyond current scanning optical techniques or electron microscopy.
Metal-organic frameworks (MOFs) are pioneering candidates for solving some of the grand challenges of our society, including clean energy, carbon dioxide capture, and water purification. The crystalline nanoporous structure of these materials is advantageous for such applications compared to other solid-state materials.
However, the stability (i.e., structural integrity) of many MOFs is compromised under different operating conditions (e.g., temperature, pressure, chemical environment). Often, stability information of MOFs under these conditions is unavailable. Determining the stability of MOFs at different physicochemical conditions is a tedious experimental exercise involving multiple characterization methods. Experimentally examining the stability of MOFs under various operating conditions is impractical for the over 100,000 already-synthesized MOFs, and a standardized computational approach to determine MOFs’ stability is not available.
Here we report a comprehensive data-driven approach to predict the thermal, chemical, and mechanical stabilities of MOFs. We combine cheminformatics and materials informatics feature engineering approaches for training machine learning (ML) models. We develop four optimized ML models for the prediction of thermal, mechanical, and solvent removal stability of MOFs. The predictive performance of our ML models for thermal and solvent removal stability is better than that reported elsewhere. In principle, our models can be used to predict the stability of an arbitrary MOF under different operating conditions.
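A schematic of how one such stability model might be trained, with random placeholders for the curated descriptors and labels; gradient-boosted trees are used here purely for illustration:

```python
# Gradient-boosted trees on concatenated cheminformatics + materials-informatics
# descriptors (all placeholder data).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X_chem = rng.random((2000, 40))     # e.g., linker/metal chemistry descriptors
X_mat = rng.random((2000, 20))      # e.g., pore geometry, density, topology
X = np.hstack([X_chem, X_mat])
y = rng.integers(0, 2, 2000)        # stable vs. unstable on solvent removal

clf = GradientBoostingClassifier()
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```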
Designing optical structures for generating structural colors is challenging due to the complex relationship between the optical structures and the color perceived by human eyes. Machine learning-based approaches have been developed to expedite this design process. However, existing methods solely focus on structural parameters of the optical design, which could lead to sub-optimal color generation due to the inability to optimize the selection of materials.
To address this issue, we propose an approach called Neural Particle Swarm Optimization. The proposed method combines mixture density networks with optimization and achieves high design accuracy and efficiency on two structural color design tasks: the first is designing environmentally friendly alternatives to chrome coatings, and the second concerns reconstructing pictures with multilayer optical thin films. Several designs that could replace chrome coatings have been discovered; pictures with more than 200,000 pixels and thousands of unique colors can be accurately reconstructed in a few hours.
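A bare-bones particle swarm step of the kind such a hybrid couples with a network's proposals; the color-matching loss is a toy quadratic stand-in and the thickness bounds are illustrative:

```python
# Minimal particle swarm optimization over layer thicknesses (toy objective).
import numpy as np

rng = np.random.default_rng(0)
target = np.array([120.0, 60.0, 200.0])           # toy 3-layer thickness target (nm)

def loss(x):                                      # stand-in for a color-difference metric
    return ((x - target) ** 2).sum(axis=-1)

pos = rng.uniform(20, 250, (64, 3))               # swarm of candidate designs
vel = np.zeros_like(pos)
best_p, best_g = pos.copy(), pos[loss(pos).argmin()].copy()

for _ in range(200):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (best_p - pos) + 1.5 * r2 * (best_g - pos)
    pos = np.clip(pos + vel, 20, 250)
    improved = loss(pos) < loss(best_p)
    best_p[improved] = pos[improved]
    best_g = best_p[loss(best_p).argmin()]

print(best_g, loss(best_g))
```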
Cities shape people. But can people shape cities? Urban planners and designers experience a dearth of insightful scientific tools that assist them in urban research. Moreover, predominant data analysis in urban scholarship has used data to create city systems and structures that have shaped people’s access to the city and its resources – albeit inequitably.
This historical pattern of using data, mapping, and systematic planning to suppress the voice of the vulnerable mandates a shift in perspective. Contrary to being the "new" oil, this research argues that data has always been a resource of the powerful. Most often, data has been the key to drawing the larger picture and connecting relationships between apparently disparate entities, resourcefully sectioned for the benefit of a few privileged groups before the less powerful have had a chance to get a sense of, or offer reflections on, the macro picture. Understandably, the first known traces of data analysis and mapping were created to fight and win wars; they have been strategic attempts to achieve goals that were not always about overarching good causes.
Metricle is a counter-data tool in the making that allows researchers to analyze the built landscape, retracing the systemic neglect of neighborhoods by traditional top-down efforts, and to offer a counter-lens through which to view city design and planning: a perspective that correlates people's needs with the existing physical landscape, as opposed to performing analysis merely to establish dominion. It does this through a systematic investigation of spatial imagery in conjunction with available census and external data about the place. It uses a novel technique to correlate and associate various related and interdependent metrics to find broad trends or anomalies in the data. These captured trends or outliers are then studied in relation to deductions made from critical observation of spatial imagery to draw causation and speculate on recommendations.
Bicycling is a promising transport mode for making our communities more sustainable, healthier, and more equitable. Motor vehicle traffic volumes, often measured by the Annual Average Daily Traffic (AADT), have been widely used in making engineering decisions. However, little data on bicycle traffic volumes has been collected and used in most U.S. cities.
In this project, we collected bicycle traffic data using a commercial automated bike counter at two locations: a multi-use path in Dearborn and a protected bike lane in Ann Arbor. Validation studies were conducted to examine the counter accuracy using video cameras. A total of nine weeks of data collection has been conducted, with more than 13,000 people on bikes counted to date (the data collection in Ann Arbor is still ongoing). Data analysis was conducted to examine bicycling traffic patterns by time of day, day of the week, and weather conditions. The primary trip purposes (i.e., commuting, recreation) at each location can be inferred from these patterns.
In addition, an open-source, public interactive dashboard (linked here) was developed that allows other researchers, traffic engineers, city planners, and the general public to freely explore the bike traffic data. The dashboard supports selecting date ranges, traffic directions, and data resolution (e.g., daily, hourly, 15-minute). The outcomes of this work can be used to gain insights into bicycle infrastructure usage and to support data-driven decision-making by city planners and community engagement.
The Michigan Institute for Data Science (MIDAS) Student Leadership Board is made up of 10 students representing multiple schools, programs, majors, and degree levels at the University of Michigan. The group is responsible for organizing and carrying out community service events and advising MIDAS leaders on various data science activities that benefit the student community.
In an increasingly data-driven world, data science is ubiquitous in big business and academic research. Local community organizations also stand to benefit from statistical insight; however, these groups often lack the time, resources, or skills to collect and analyze data. Statistics in the Community (STATCOM) is an outreach program that offers the expertise of graduate students, free of charge, to non-profit community and governmental organizations.
University-community partnerships such as STATCOM offer many benefits for both students and stakeholders alike. Community partners gain a deeper understanding of their operational processes and benefit from assessing program efficacy, optimizing resource allocation, and evaluating further areas of unmet need. Beyond the fulfillment of positively impacting their community, student volunteers gain hands-on experience working with data, answering complex questions, and effectively communicating statistical concepts and results to others — crucial skills of benefit throughout their careers. This poster will exemplify the unique collaboration between STATCOM at the University of Michigan and its community.
MDST (Michigan Data Science Team) is the leading practical data science and machine learning club at the University of Michigan, with over a hundred UM students working together on a range of projects each semester. We are dedicated to educating about the applications of data science and ML, while providing opportunities for members’ professional, academic, and career development. This means we work on projects, hold workshops, host corporate tech talks, and also have a couple of social events throughout the semester.
At this summit, we aim to spread awareness of our club throughout the UM data science and AI community. Specifically, we hope to share details of the interesting projects we work on and the range of opportunities we offer to UM students and corporate partners alike.
Michigan Eco Data exists to foster a community of individuals at the University of Michigan who solve environmental and biological problems through the innovative use of technology and data analysis. The organization hosts environmental-tech talks, facilitates group ecological projects, provides support for individual students' projects, hosts group environmental trips, and plans social gatherings. Our end goal is to build a supportive network of talented students with interests at the intersection of environmentalism and data-driven discovery.
Innovation Partnerships experts champion the creation of corporate research alliances and collaborations to accelerate the development of promising research. We enable the translation, commercial development and licensing of groundbreaking research discoveries and technologies. We help create new ventures to usher in change.