The 2018 MIDAS Annual Symposium, titled “Serving Society through Data Science,” took place October 8-9 at the Rackham Building. The event brought several preeminent data scientists to Ann Arbor as speakers and also featured high-impact U-M research from investigators across campus, as well as student achievements.

Read about winners of the poster competition at https://midas.umich.edu/happenings/2018-poster-winners/

Featured speakers

  • “Big Data in Manufacturing Systems with Internet-of-Things Connectivity”
    Dawn Tilbury, Professor, Mechanical Engineering and Electrical Engineering and Computer Science, University of Michigan.
  • “Big (Network) Data: Challenges and Opportunities for Data Science”
    Patrick Wolfe, Frederick L. Hovde Dean of Science, Purdue University.
  • “The Data Science Expert in the Room”
    Katherine Ensor, Director, Center for Computational Finance and Economic Systems (CoFES), Rice University.
  • “The Elements of Translational Data Science”
    Raghu Machiraju, Interim Director, Translational Data Analytics Institute, The Ohio State University.

U-M research talks

  • A Network Analysis Approach to Regional Input Output Multipliers
    Tayo Fabusuyi, Post-Doc, U-M Transportation Research Institute
  • A Minimalist Approach to Computation in Music
    Somangshu Mukherji, Assistant Professor, School of Music, Theater and Dance
  • Survey Equivalence: An Information-theoretic Measure of Classifier Accuracy When the Ground Truth is Subjective
    Paul Resnick, Professor and Associate Dean for Research and Faculty Affairs, School of Information
  • Fundamental Limits of Exact Support Recovery in High Dimensions
    Zheng Gao, Graduate Student, Statistics
  • Active Remediation: The Search for Lead Pipes in Flint, Michigan
    Eric Schwartz, Assistant Professor of Marketing, Ross School of Business
  • State Innovation Model: Towards a Learning Health System to Reduce Emergency Department Visits in Livingston and Washtenaw County
    Elliott Brannon, MD/PhD Student, Health Infrastructures and Learning Systems, Medical School
  • Network Structure, Efficiency, and Performance in WikiProjects
    Edward Platt, Graduate Student, School of Information
  • Mining Students’ In- and Out-of-Class behaviors to Create Earlier Warning System
    Sungjin Nam, Graduate Student, School of Information and College of Engineering
  • Multiclass Meta-learning
    Salimeh Yasaei Sekeh, Post-Doc, College of Engineering
  • What is Bitcoin? Exploration, Exploitation, and the Emergence of the Cryptocurrency Category
    Lynette Shaw, Assistant Professor, Complex Systems
  • Predicting Bicyclist Destination and Route by Link Using Large Scale GPS Based Naturalistic Bicycling Data
    Yuting Wu, Undergraduate, U-M Transportation Research Institute and School of Information
  • Systems-level Analysis of a Cytokine-induced Cell Cycle using Dynamic Metabolic Network Modeling
    Ho-Joon Lee, Research Investigator, Molecular and Integrative Physiology

Other Events

  • A poster session and student poster competition consisting of approximately 90 posters (poster size maximum is 4 feet high X 6 feet wide)
  • Industry perspectives on data science and social good, a panel discussion with Quicken Loans, Ford, Wacker Chemical, and other MIDAS corporate partners

The poster session includes 87 submissions from the following categories:

  • Biomedical Sciences
  • Business and Marketing
  • Climate Research & Natural Disasters
  • Computer Science
  • Data Science Education
  • Data Science Methodology
  • Data Security
  • Energy Research
  • Engineering Research
  • Healthcare Research
  • Learning Analytics
  • Music
  • Science and Society
  • Social Science and Economics
  • Transportation Research

Schedule

Monday, October 8, 2018
RACKHAM BUILDING, 915 E. WASHINGTON ST., ANN ARBOR


8 a.m. - Check-in and Coffee

Fourth floor, Rackham Building

8:30 a.m. - Welcome

Al Hero and Brian Athey, MIDAS Co-Directors

Brian Athey is the Michael A. Savageau Collegiate Professor and Chair of the Department of Computational Medicine and Bioinformatics, and Professor of Psychiatry and Internal Medicine. Al Hero is the John H. Holland Distinguished University Professor of Electrical Engineering and Computer Science, R. Jamison and Betty Williams Professor of Engineering, Professor of Biomedical Engineering, and Professor of Statistics.

8:45 a.m. - Patrick Wolfe, Dean, College of Science, Purdue University

Title: Big (Network) Data: Challenges and Opportunities for Data Science

Abstract: How do we draw sound and defensible conclusions from big data? This question lies at the heart of data science. In this talk I will first describe some of the challenges and opportunities inherent in this rapidly emerging field, and then discuss the current state of the art in one area of particular interest: big network data.  Progress in this area includes the development of new large-sample theory that helps us to view and interpret networks as statistical data objects, along with the transformation of this theory into new statistical methods to model and draw inferences from network data in the real world. The insights that result from connecting theory to practice also feed back into pure mathematics and theoretical computer science, prompting new questions at the interface of combinatorics, analysis, probability, and algorithms.

Bio: Patrick Wolfe is a 1998 graduate of the University of Illinois in electrical engineering and music. He earned his doctorate from Cambridge University in 2003, where he held a National Science Foundation graduate research fellowship. After teaching at Cambridge and Harvard, he joined the faculty of University College London (UCL) in 2012 as a professor of statistics and computer science. He is the founding executive director of UCL’s Big Data Institute, and a trustee of the Alan Turing Institute, the United Kingdom’s national institute for data science, where he played a key role in establishing the institute and shaping its programs. While at Harvard, Prof. Wolfe received the Presidential Early Career Award for Scientists and Engineers from the White House, and he has provided expert advice on applications of data science to social, commercial and policy challenges. He has also received awards for his research from the Royal Society, the Acoustical Society of America, and IEEE. He is active in the global mathematics, statistics, and physical sciences communities, and most recently was an organizer and Simons Foundation fellow at the Isaac Newton Institute for Mathematical Sciences 2016 semester research program on Theoretical Foundations for Statistical Network Analysis. Currently, Prof. Wolfe serves as the Dean of the College of Science at Purdue University.

10 a.m. - Research Talks, Session 1

  • Survey Equivalence: An Information-theoretic Measure of Classifier Accuracy When the Ground Truth is Subjective (Paul Resnick, School of Information, and Grant Schoenebeck, Electrical Engineering and Computer Science)
    Abstract: Many classification tasks have no objective ground truth. Examples include: which content or explanation is “better” according to some community? is this comment toxic? what is the political leaning of this news article? The traditional modeling approach assumes each item has an objective true state that is perceived by humans with some random error. It fails to account for the fact that people have greater agreement on some items than others. I will describe an alternative model where the true state is a distribution over labels that raters from a specified population would assign to an item. This leads to information gain (mutual information) as a theoretically justified and computationally tractable measure of a classifier’s quality, and an intuitive interpretation of information gain in terms of the sample size for a survey that would yield the same expected error rate.
  • Systems-level Analysis of a Cytokine-induced Cell Cycle using Dynamic Metabolic Network Modeling (Ho-Joon Lee, Molecular and Integrative Physiology)
    Abstract: The cell cycle is a fundamental process in biology for cell growth and proliferation. Its dysregulation is at the heart of many diseases including cancer. We previously studied the cell cycle in a model system of murine pro-B cells upon activation of a quiescent state by a cytokine, IL-3, using time-course quantitative proteomics and metabolomics. The data consist of 6 time points covering the IL-3-induced first cell cycle and the initial quiescent state. Here we build a metabolic network model for the IL-3-induced cell cycle system from time-course metabolomics data using a dynamic genome-scale metabolic network modeling framework we recently developed. We model the whole cell cycle by a sequence of 4 linearized cell-cycle phases. Our model correctly identified enzymes whose differential expression best explains the changes in metabolic flux between the phases of the cell cycle. Our results are consistent with a previous finding that the IL-3-induced cell cycle exhibits cancer-like metabolic reprogramming. Furthermore, we discover significantly altered biochemical reactions in methionine metabolism in the G1/S transition. This result provides additional insights into mechanisms directly related to our previous finding that the uptake rate of methionine is highest among all essential amino acids in the early G1 phase. Our model also reveals phase-specific active reactions. In particular, we find that lysine metabolism, nucleotide synthesis, fatty acid elongation, extracellular/mitochondrial/peroxisomal transport systems, and heme biosynthesis are major active processes in the G0/G1 transition, which suggests a global mechanism for the transition into cell growth and proliferation. Our systems-level analysis of heterogeneous high-dimensional time-course data using a mechanistic metabolic network model is expected to serve as a general framework for diverse dynamic systems such as cell cycle, cellular differentiation, development, and immune activation.
  • A Network Analysis Approach to Regional Input Output Multipliers (Tayo Fabusuyi, U-M Transportation Research Institute)
    Abstract: We address a practical problem often faced by economic development organizations using data from the Greater Pittsburgh area. Local economic development organizations are often tasked with promoting the health and vitality of the regional economy. However, the unique composition of each geographical area calls for a distinct approach that reflects the peculiarities of the local economy. The study presents an approach by which the information obtained from input-output analysis and conventional metrics of economic development are enriched by the concepts and metrics of network analysis. We illustrate this approach by visualizing the economy of the Greater Pittsburgh area and by computing a set of network metrics which identify the interrelationships among industries within the economy. The insight provided on the structural makeup of the regional economy not only reveals latent opportunities within the economy that are not evident using conventional methodologies but also provides information on the optimal resource allocation strategy.
  • A Minimalist Approach to Computation in Music (Somangshu Mukherji, School of Music, Theater and Dance)
    Abstract: Music is one of the central components of human society—so it is not surprising that studying music has been central to the computational sciences too, since the beginnings of the computer revolution. Within this long history, however, there has always been a tendency to assume that music is just made up of sequences of sounds, and furthermore that understanding music just requires understanding the statistics inherent in such sequences—for example, the probabilities through which certain sounds precede or follow others in a sequence. My talk challenges this paradigm, by considering some of the complex, abstract aspects of musical structure, which underlie musical sequences, and which have been much discussed in the field of music theory—yet often ignored in the computational study of music. I will illustrate how these abstract musical properties explain how music is created by the human mind, and also how they lead to the variety of musical styles and idioms seen across the world—thus forcing us to reconsider the old cliché about music being “the universal language.” I will also discuss how this view of music is shared with certain theories about how the mind creates and processes language, as described especially within the framework known as the Minimalist Program, in contemporary linguistics. This Minimalist ‘musicolinguistic’ perspective not only describes, therefore, some hitherto-ignored, shared, properties of musical and linguistic structure, it also provides a fascinating new window into how the mind works, especially in the ways that it creates music and language—phenomena that are ex hypothesi unique in nature. So, as I conclude, this perspective encourages us to rethink what we know of the human mind and its information-processing abilities—and perhaps even rethink what it means to study music and computation, in the face of prevailing trends in the field.
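
As a rough illustration of the information-gain measure mentioned in the Survey Equivalence abstract above, the sketch below estimates mutual information between a classifier's outputs and individual rater labels from paired samples. The function name, the toy labels, and the simple plug-in (frequency-count) estimator are assumptions made for illustration; they are not taken from the talk and may differ from the authors' estimator.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) pairs.

    Here x is a classifier output and y is a label from one rater
    (an illustrative setup, not the authors' exact procedure).
    """
    n = len(pairs)
    joint = Counter(pairs)                 # counts of each (x, y) pair
    px = Counter(x for x, _ in pairs)      # marginal counts of x
    py = Counter(y for _, y in pairs)      # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical toy data: classifier output paired with a rater's label.
informative = [("toxic", "toxic"), ("ok", "ok")] * 5
uninformative = [("toxic", "toxic"), ("toxic", "ok"),
                 ("ok", "toxic"), ("ok", "ok")] * 5

print(mutual_information(informative))
print(mutual_information(uninformative))
```

A classifier whose output fully determines the rater label scores one bit on this balanced binary example, while one whose output is independent of the label scores zero.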

11 a.m. - Dawn Tilbury, Professor, Mechanical Engineering and Electrical Engineering and Computer Science, U-M

Title: Big Data in Manufacturing Systems with Internet-of-Things Connectivity

Abstract: As we move into an era of more connected, smarter manufacturing systems, a number of opportunities and challenges arise. The connecting of plant floor sensors and actuators via Internet of Things (IoT) technology has the ability to transform manufacturing systems operations.  More data available in real-time, combined with high-fidelity simulation data and cloud computing, enables the supply chain to be directly connected with operations on the factory floor, and the current status of parts and machines. Thus far, most of the work on IoT-enabled manufacturing systems has focused on the integration of the large volumes of data gathered with IoT-devices and their transformation into useful information, through advanced analytics. To become smarter, manufacturing systems need to close the loop and transform IoT data into manufacturing knowledge and useful actions. Closing the loop will allow manufacturing systems to become more responsive to market changes and customer desires, and will improve production quality, asset utilization, and profitability. To realize these goals, the future of IoT-enabled manufacturing requires closer collaboration between experts in control, manufacturing, and information systems.  This talk will discuss the current trends in data collection, including the types of data and the uses for analytics and predictions, and the opportunities for closed-loop control.  A case study on a small manufacturing systems testbed is used to validate the approaches.

Bio: Dawn Tilbury holds a Ph.D. in Electrical Engineering and Computer Science and an MS in Electrical Engineering, both from the University of California-Berkeley, and a BS in Electrical Engineering from the University of Minnesota. Her research interests include control theory and applications; logic control for manufacturing systems including diagnostics, fault handling, and recovery; modular control systems; networked control systems; performance management of computing systems; and web-based tutorials for controls education. Her many honors and awards include a Distinguished Engineering Educator Award from the Society of Women Engineers; an American Society of Mechanical Engineers Fellow; a Service Excellence Award from the U-M College of Engineering; and an Institute of Electrical and Electronics Engineers Fellow. She currently serves in a leadership position at a federal funding agency.

Noon - Lunch & Poster Session @ Michigan League

Box lunches are available for those who made a selection during the registration process.

The poster session includes data science research presentations from students, faculty and staff of the University of Michigan.

Students participating in the poster session will compete for eleven (11) awards, which carry cash prizes. Awards will be announced at 5:30 p.m. in Rackham Auditorium. Winning posters will be displayed at Weiser Hall during the Open House and Reception (6 p.m.).

Awards will be made for:

  • Most Innovative Use of Data
  • Most Likely Societal Impact
  • Most Interesting Methodological Advances
  • Most Likely Transformative Scientific Impact
  • Most Likely Health Impact
  • Best Overall

2 p.m. - Katherine Ensor, Director, Center for Computational Finance and Economic Systems (CoFES), Rice University

Title: The Data Science Expert in the Room

Abstract: In today’s data-driven world, data scientists and statisticians are in high demand. We have a wide range of methods and tools available to answer key questions across the spectrum of human inquiry. Through our methodological training we are also able to expand upon our core expertise to widen this set of tools when necessary. In this talk, I will tell the story of the importance of stepping up and serving as the data science expert in the room, and finding the right tools for the questions asked. The background of the story is the devastation that Hurricane Harvey caused to the greater Houston area. At the same time that we were helping our neighbors, the scientists in the Houston area quickly moved to action to bring their expertise to the challenges the disaster brought. As a data scientist and a leading scholar in Houston, I have a seat at the table to help with Houston’s recovery and reconstruction as the region moves forward. From a data collection perspective, the Kinder Urban Data Platform served as an expeditious way to integrate the real-time data processes. To fully understand the long-term human health impact, we established the Harvey Registry. Our wide range of data science tools and expertise is indispensable as the community transitions data to knowledge and action. For example, understanding the environmental and housing impact requires integration of spatially referenced data through advanced spatial statistics. Finally, for the bigger issues of changing flood patterns, methods in spatial-temporal extremes are necessary to address the key questions put forward by the hydrologists and city planners with whom I work. In each case, it was critical to the timely success of the project that I stepped up to serve as the data science expert and to execute the research on a time scale that met the needs of the team. I offer that our methodological training supports us all in serving the public good and humanitarian causes.

Bio: Katherine Bennett Ensor is the Noah G. Harding Professor of Statistics at Rice University, where she serves as director of the Center for Computational Finance and Economic Systems (CoFES) and also director of the Kinder Institute’s Urban Data Platform. Ensor served as chair of the Department of Statistics from 1999 through 2013. She has shaped data science at Rice as a member of the campus-wide hiring committee and currently serves on the campus-wide program committee. Ensor’s research focuses on development of statistical and data science methods for practical problems. Her expertise is on dependent data covering time, space and dimension with applied interests in finance, energy, environment and risk management. New work in urban analytics addresses the environmental impact on public health and includes her leadership of the development of the Rice Kinder Institute Urban Data Platform. She is a fellow of the American Statistical Association and the American Association for the Advancement of Science, and has been recognized for her leadership, scholarship and mentoring. Ensor is senior Vice President of the American Statistical Association and a member of the National Academies Committee on Applied and Theoretical Statistics. She holds a BSE and MS in Mathematics from Arkansas State University and a PhD in Statistics from Texas A&M University.

3 p.m. - Panel: Data Science for Social Good, An Industry Perspective

Moderator: Karen Fireman

Bio: Karen Schreiber Fireman has over 25 years of experience in finance with an emphasis on analysis, investments and risk management. She is a chartered financial analyst (CFA) with a degree in computer engineering, a master’s in math, and an MBA. She has worked in the oil and gas industry and for major financial institutions, including hedge funds, fund of funds, and insurance companies. Karen has held several C-suite positions including Chief Compliance Officer of Columbia Partners, Chief Technology Officer of Mantech Cyber, and Chief Financial Officer of Columbus Properties. She sits on the MIDAS External Advisory Board.

Panelists:

  • James Carson, Data Science Team Leader, Quicken Loans
  • David Corliss, American Statistical Association columnist, Founder of Peace-Work
  • Raj Dhanasri, Senior Manager, Deloitte Digital
  • Paul McCarthy, Analytics Manager, Ford Motor Company
  • Michael Schneiderhan, Director of Information Technology – Analytic Services, Wacker Chemie

4:15 p.m. - Research Talks, Session 2

  • ActiveRemediation: The Search for Lead Pipes in Flint, Michigan (Eric Schwartz, Ross School of Business)
    Abstract: We detail our ongoing work in Flint, Michigan to detect pipes made of lead and other hazardous metals. After elevated levels of lead were detected in residents’ drinking water, followed by an increase in blood lead levels in area children, the state and federal governments directed over $125 million to replace water service lines, the pipes connecting each home to the water system. In the absence of accurate records, and with the high cost of determining buried pipe materials, we put forth a number of predictive and procedural tools to aid in the search and removal of lead infrastructure. Alongside these statistical and machine learning approaches, we describe our interactions with government officials in recommending homes for both inspection and replacement, with a focus on the statistical model that adapts to incoming information. Finally, in light of discussions about increased spending on infrastructure development by the federal government, we explore how our approach generalizes beyond Flint to other municipalities nationwide.
  • Fundamental Limits of Exact Support Recovery in High Dimensions (Zheng Gao, Statistics)
    Abstract: We study the signal support recovery problem in high dimensions, that is, the estimation of the locations of the non-zero entries of sparse signals when observed with additive errors. The problem is of fundamental importance in the era of big data. One would like to understand when and to what extent a high-dimensional sparse signal can be accurately estimated, and conversely, when such endeavours are hopeless.
    The problem of sparse signal detection was first studied by Ingster (1998) and Donoho (2004), where a phase-transition phenomenon was demonstrated. A similar phase-transition result holds for the problem of approximate support recovery. We show in this paper a new phase-transition phenomenon in the exact support recovery problem. Under general distributional assumptions, we characterize the required signal sizes as a function of the sparsity level, in order for the support recovery to be asymptotically exact. If the signal sizes are below this new strong classification boundary, no thresholding procedure can achieve asymptotically exact support recovery. We show that thresholding procedures are optimal for a large class of asymptotically log-concave error distributions. In these cases, therefore, the strong classification boundary is universal.
    We also show that this strong classification boundary holds under very general, but not arbitrary, dependence assumptions. The concept of relative stability of maxima plays a key role in describing the dependence structures. A complete characterization is given in the Gaussian case. In particular, we establish a necessary and sufficient condition for the uniform relative stability of Gaussian triangular arrays, which may be of independent interest.
    Finally, we demonstrate, perhaps surprisingly, that thresholding procedures are not always optimal in support recovery problems, especially in the regime of heavy-tailed super-exponential error distributions.
  • State Innovation Model: Towards a Learning Health System to Reduce Emergency Department Visits in Livingston and Washtenaw County (Elliott Brannon, Medical School)
    Abstract: High utilizers of the Emergency Department (ED) often have complex needs that require coordination of care between multiple organizations. We describe a Learning Health Systems (LHS) approach to reducing ED visits, in which an intervention is delivered to a cohort of high utilizers identified using population-level data and predictive modeling. We use a random forest model that draws on electronic health record data from three health systems in Livingston and Washtenaw Counties in Michigan to predict the number of ED visits each resident will incur in the next six months. Using 5-fold cross-validation, the model achieves a root-mean-squared error of 0.51 visits and a mean absolute error of 0.24 visits. Using time-based validation, the model achieves a root-mean-squared error of 0.74 visits and a mean absolute error of 0.29 visits. Patients projected to have high ED utilization are being enrolled in a community-wide care coordination intervention using twelve sites across two counties. We believe that the repeated cycles of modeling and intervention demonstrate an LHS in action.
  • Network Structure, Efficiency, and Performance in WikiProjects (Edward Platt, School of Information)
    Abstract: The internet has enabled collaborations at a scale never before possible, but the best practices for organizing such large collaborations are still not clear. Wikipedia is a visible and successful example of such a collaboration which might offer insight into what makes large-scale, decentralized collaborations successful. We analyze the relationship between the structural properties of WikiProject coeditor networks and the performance and efficiency of those projects. We confirm the existence of an overall performance-efficiency trade-off, while observing that some projects are higher than others in both performance and efficiency, suggesting the existence of factors correlating positively with both. Namely, we find an association between low-degree coeditor networks and both high performance and high efficiency. We also confirm results seen in previous numerical and small-scale lab studies: higher performance with less skewed node distributions, and higher performance with shorter path lengths. We use agent-based models to explore possible mechanisms for degree-dependent performance and efficiency. We present a novel local-majority learning strategy designed to satisfy properties of real-world collaborations. The local-majority strategy as well as a localized conformity-based strategy both show degree-dependent performance and efficiency, but in opposite directions, suggesting that these factors depend on both network structure and learning strategy. Our results suggest possible benefits to decentralized collaborations made of smaller, more tightly-knit teams, and that these benefits may be modulated by the particular learning strategies in use.
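
For readers unfamiliar with the error figures quoted in the State Innovation Model abstract above, the following sketch shows how a root-mean-squared error and a mean absolute error can be computed under k-fold cross-validation. It is only an illustration under stated assumptions: a trivial mean predictor stands in for the random forest, the folds are contiguous rather than shuffled, errors are pooled over all held-out points, and the function names and visit counts are hypothetical.

```python
from math import sqrt


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_errors(y, k=5):
    """Return (RMSE, MAE) pooled over all held-out points.

    The "model" is the mean of the training targets, a stand-in
    (assumption) for a real learner such as a random forest.
    """
    n = len(y)
    sq_err, abs_err = 0.0, 0.0
    for test_idx in kfold_indices(n, k):
        held_out = set(test_idx)
        train = [y[i] for i in range(n) if i not in held_out]
        pred = sum(train) / len(train)     # fit on the other k-1 folds
        for i in test_idx:                 # score on the held-out fold
            sq_err += (y[i] - pred) ** 2
            abs_err += abs(y[i] - pred)
    return sqrt(sq_err / n), abs_err / n


# Hypothetical six-month ED visit counts for ten residents.
visits = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
rmse, mae = cross_validated_errors(visits, k=5)
print(rmse, mae)
```

Because squaring weights large misses more heavily, the RMSE is never smaller than the MAE on the same predictions, which matches the ordering of the figures reported in the abstract.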

Tuesday, October 9, 2018


8 a.m. - Check-in and Coffee

Fourth floor, Rackham Building

8:30 a.m. - Welcome

Al Hero and Brian Athey, MIDAS Co-Directors

Brian Athey is the Michael A. Savageau Collegiate Professor and Chair of the Department of Computational Medicine and Bioinformatics, and Professor of Psychiatry and Internal Medicine. Al Hero is the John H. Holland Distinguished University Professor of Electrical Engineering and Computer Science, R. Jamison and Betty Williams Professor of Engineering, Professor of Biomedical Engineering, and Professor of Statistics.

8:35 a.m. - Research Talks, Session 3

  • What is Bitcoin? Exploration, Exploitation, and the Emergence of the Cryptocurrency Category (Lynette Shaw, Complex Systems)
    Abstract: In under a decade, cryptocurrency has gone from the radical monetary experiment of an online group of political activists to the basis of a multimillion-dollar financial technology industry. This rise has been accompanied by the question of what cryptocurrency is. This work applies established models of market categorization processes to explain the emergence of the cryptocurrency category out of the decentralized production context from which it arose. Using an original set of sources documenting the history of cryptocurrency’s development, automated content analysis of over 7,500 media reports from 2011 through early 2016, and consideration of quantitative metrics reflecting online searches, market activity, and venture capital funding, this analysis clarifies how a multivocal identity was an essential feature of cryptocurrency’s widespread adoption and development across a diverse coalition of groups during its broad “exploratory” (March 1991) period of development, but also a factor which left it open to being preferentially defined per the interests of late-arriving, powerful audiences as it matured into a later “exploitation” (March 1991) phase of its development. In so doing, this article offers an empirical contribution to research on new digital monies and a consideration of how established models of categorization apply in decentralized development contexts.
  • Multiclass Meta-learning (Salimeh Yasaei Sekeh, College of Engineering)
    Abstract: One of the open problems in data science is determining the quality of data for training a learning algorithm, e.g., a classifier. This is a meta-learning problem: to learn the intrinsic quality of data directly from a sample of the data. Meta-learning is important since empirical performance prediction is crucial to optimizing the data life cycle. Examples where meta-learning is applied include sequential design of experiments, deep learning and sensor management in the fields of statistics, machine learning and systems engineering, respectively. In this work, we introduce a geometric meta-learning framework for multiclass classification based on the global minimal spanning tree that spans all labeled features over feature space. This framework provides tight bounds on the Bayes error rate with low computational complexity.
  • Predicting Bicyclist Destination and Route by Link Using Large Scale GPS Based Naturalistic Bicycling Data (Yuting Wu, UMTRI and SI)
    Abstract: With introduction of automated connected vehicle and infrastructure technologies, bicyclists and vehicles are facing a new era of road sharing practices that depend largely on the ability of an automated vehicle to detect the bicyclist and predict her trajectory. In this project, our goal is to develop a model based on individuals’ bicycling trip history to predict one’s possible route, link by link, and intended destination – given a trip origin, our model predicts the most probable links to be traversed in a sequence to reach the predicted destination, using a Markov model that follows first-order Markov chain process. For each new trip, the starting origin is clustered into either one of the previously determined clusters or a new cluster based on its proximity to other trip origins. As the trip proceeds, at each time step where a new segment is observed, the next probable link and the intended destination is predicted and subsequently updated. After the trip is finished, the trip is added to the training data and previous knowledge is updated, making the model flexible to adopt sequentially increasing data. The model is implemented on a naturalistic cycling data collected via GPS enabled smartphone application and is evaluated based on destination accuracy rate and route accuracy rate, measured by the ratio of number of correct segments to number of predicted segments. For each individual, 80% of the previous trips are used as training data, while remaining 20% are used for model validation with users segmented based on their number of trips recorded. While our algorithm predicts destinations with ~98% accuracy, the link prediction accuracy is ~70%. Research is underway to examine whether additional trip and user attribute data can improve prediction accuracy.
  • Mining Students’ In- and Out-of-Class behaviors to Create Earlier Warning System (SungJin Nam, SI and COE)
    Abstract: In this study, we analyzed the relationship between students’ success in an entry-level STEM course and their incoming profiles and course-taking behaviors. We focused on finer-grained behavioral signals collected while students attend lectures and combined them with students’ incoming profiles. We posed four research questions. For the first, we identified which in- and out-of-class behavioral variables significantly describe students’ success. We found that variables such as the number of correct answers to class activity questions (in- and out-of-class), the number of slides marked as confusing (in-class), and the frequency of viewing the lecture video (in-class) are consistently significant. For the second and third questions, we incorporated student profile factors as random intercepts and extended the model with random slopes for selected behavioral variables to examine how patterns differ across student profile factors. We found that significant gender gaps exist in mid-level GPA groups, and we identified some possible advising scenarios for student behaviors based on incoming profiles. For the last question, we developed a weekly forecasting model for student success, achieving 72% accuracy by using both behavioral signals and student profiles in a mixed-effects modeling setting. We believe these findings provide insight into how to integrate fine-grained behavioral signals with incoming profile factors. Further research could seek more generalizable results with data collected from different STEM and non-STEM courses and compare against other machine learning techniques to improve prediction performance.
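
The first-order Markov approach described in the bicyclist route prediction abstract above can be illustrated with a small sketch. This is not the authors' implementation: the class name, method names, and link IDs are hypothetical, and the origin clustering step is simplified by treating the first link of each trip as the origin.

```python
from collections import Counter, defaultdict


class FirstOrderLinkPredictor:
    """Hypothetical sketch: predicts the next link from the current link only."""

    def __init__(self):
        # transitions[a][b] = number of times link b directly followed link a
        self.transitions = defaultdict(Counter)
        # destinations[o][d] = number of trips starting at link o that ended at link d
        self.destinations = defaultdict(Counter)

    def add_trip(self, links):
        """Update the counts with one finished trip (an ordered list of link IDs)."""
        if not links:
            return
        for current, nxt in zip(links, links[1:]):
            self.transitions[current][nxt] += 1
        # Assumption: the origin is the first link; the abstract clusters origins instead.
        self.destinations[links[0]][links[-1]] += 1

    def predict_next(self, current):
        """Return the most frequently observed successor of `current`, or None."""
        followers = self.transitions.get(current)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

    def predict_destination(self, origin):
        """Return the most frequent final link for trips from `origin`, or None."""
        endings = self.destinations.get(origin)
        if not endings:
            return None
        return endings.most_common(1)[0][0]


# Made-up trip history for a single rider.
model = FirstOrderLinkPredictor()
for trip in [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]:
    model.add_trip(trip)

print(model.predict_next("b"))
print(model.predict_destination("a"))
```

Calling `add_trip` after each completed trip mirrors the sequential updating described in the abstract: the counts grow with new data, so no separate retraining step is needed.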

9:45 a.m. - Raghu Machiraju, Interim Director, Translational Data Analytics Institute, The Ohio State University

Title: The Elements of Translational Data Science

Abstract: In this talk, I will describe the large-scale investment in data science made by The Ohio State University (OSU). The Translational Data Analytics Institute (TDAI) represents the aspirations of the campus community to create a pan-university nerve center dedicated to research, academic programming and outreach in data science and analytics. I will discuss how research communities are being grown from within TDAI and the OSU campus. The ongoing activities of one particular community dedicated to Smart and Connected Communities and Distributed Sensing will be described in detail. A recent success has been the creation of an NSF BIGDATA Spoke dedicated to the development of open source tools to address the opioid epidemic in the rural Midwest. I will describe, at length, various aspects of the project and emphasize the need for appropriate data collections, the need to collect data in an agile fashion for effective use, and the importance of adequate data infrastructure. Further, I will discuss the important role of “user and community engagement” in collecting data and in validating the developed tools. I will end this talk by describing examples from my own research that hinge on informed data collection, and reiterate the need for proper data and community elements to enable impactful translation of the methods of data science for the community at large.

Bio: Raghu Machiraju is a Professor of Biomedical Informatics, Computer Science and Engineering, and Pathology at The Ohio State University. He is currently serving as the Interim Faculty Lead and Executive Director of the Translational Data Analytics Institute. Raghu is also a co-founder of a biotech startup dedicated to the automation of the wet laboratory. He has always been interested in the application of computing to problems in the physical, biological, and life sciences. Raghu’s current research has led to the development of machine learning methods that include multiple modalities from the clinic and the laboratory (e.g., histology images, clinical and transcript data) and the deployment of machine learning methods to automate protocols in wet laboratories. As the Faculty Lead of TDAI, he has engaged in building salient research communities of practice on campus and academic programs in data science and analytics.

10:50 a.m. - Student Team Presentations

  • Michigan Data Science Team
  • Statistics in the Community
  • Computational Social Science Workshop
  • Michigan Student Artificial Intelligence Lab

11:35 a.m. - Concluding Remarks

Al Hero and Brian Athey, MIDAS Co-Directors

Brian Athey is the Michael A. Savageau Collegiate Professor and Chair of the Department of Computational Medicine and Bioinformatics, and Professor of Psychiatry and Internal Medicine.

Al Hero is the John H. Holland Distinguished University Professor of Electrical Engineering and Computer Science, R. Jamison and Betty Williams Professor of Engineering, Professor of Biomedical Engineering, and Professor of Statistics.

Sponsors

Platinum Sponsor and Affiliate Member

Affiliate Member

Gold Sponsor