This page features resources submitted by U-M data science researchers. Ensuring reproducible data science is no small task: computational environments may vary drastically and can change over time; specialized workflows might require specialized infrastructure that is not easily available; sensitive projects might involve restricted data; the robustness of algorithmic decisions and parameter selections varies widely; and crucial steps where choices are made (e.g., wrangling, cleaning, mitigating missing-data issues, preprocessing) might not be well documented. Our resource collection will help researchers tackle some of these challenges. If you would like to submit tools, publications, and other resources to be included on this page, please email midas-research@umich.edu.

Theory and definition

What can and should be reproduced, and to what extent a result can be reproduced.

  • Assessing the reproducibility of high-throughput experiments with a Bayesian hierarchical model
    A Bayesian hierarchical model framework and a set of computational toolkits to evaluate the overall reproducibility of high-throughput biological experiments, and identify irreproducible and reproducible signals via...
  • Replicating predictive models at scale for research on MOOCs
    A program of reproducibility research in the domain of learning analytics with a specific focus on predictive models of student success. It includes an open-source software infrastructure,...

Reproducible study designs

Rigorous and cutting-edge statistical and data science methods that improve reproducibility.

  • A tutorial on propensity score based methods for the analysis of medical insurance claims data
    A tutorial that offers practical guidance on the full analytic pipeline for causal inference using propensity score methods. These methods are especially useful for population-based studies using medical insurance... A minimal propensity-score weighting sketch, for illustration only, appears after this list.
  • Example: Complete documentation and sharing of data and analysis with the example of a micro-randomized trial
    An example of pre-registration of study protocols and open source documents and code to clearly describe key assumptions and decisions made for data curation and analysis of...
  • Example: Unifying initial conditions of galaxy formation simulation for research replication
    This project demonstrates the importance of controlling the initial conditions of a numerical simulation of galaxy formation to allow the replication of research findings. Different groups use different initial...
  • Large-Scale, Reproducible Implementation and Evaluation of Heuristics for Optimization Problems
    Research developing new heuristics for optimization problems is often not reproducible; for instance, only 4% of papers on two famous optimization problems published their source code. This limits the...
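
The propensity score tutorial listed above covers the full analytic pipeline; purely as an illustration of the core idea (this is not the tutorial's own code or data), the sketch below simulates a small claims-style dataset, estimates propensity scores with a logistic regression, and computes an inverse-probability-weighted treatment effect. All variable names, coefficients, and the simulated data are hypothetical.

    # Minimal, illustrative sketch of inverse probability of treatment weighting (IPTW).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Simulated claims-style data: two confounders, a binary treatment,
    # and a continuous outcome that depends on both.
    n = 5000
    age = rng.normal(60, 10, n)
    comorbidity = rng.poisson(2, n)
    X = np.column_stack([age, comorbidity])
    p_treat = 1 / (1 + np.exp(-(-6 + 0.08 * age + 0.3 * comorbidity)))
    treated = rng.binomial(1, p_treat)
    outcome = 2.0 * treated + 0.05 * age + 0.5 * comorbidity + rng.normal(0, 1, n)

    # Step 1: estimate propensity scores with a logistic regression.
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

    # Step 2: form inverse-probability-of-treatment weights.
    weights = np.where(treated == 1, 1 / ps, 1 / (1 - ps))

    # Step 3: the weighted difference in mean outcomes estimates the average
    # treatment effect (the true effect is 2.0 by construction here).
    ate = (np.average(outcome[treated == 1], weights=weights[treated == 1])
           - np.average(outcome[treated == 0], weights=weights[treated == 0]))
    print(f"IPTW estimate of the treatment effect: {ate:.2f}")

Documenting the propensity model, the weighting choices, and the balance checks alongside code like this is exactly the kind of decision record that makes such analyses reproducible.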

Fully reproducible projects

Guidelines and tools for recording and sharing data, code and documentation to reproduce the findings of a project, even with variations in data, computational hardware and software, and statistical and algorithmic decisions.

  • American Economic Association (AEA) Data & Code Repository at openICPSR
    The American Economic Association (AEA) shares replication packages (data and code) through a newly established AEA Data and Code Repository at the Inter-university Consortium for Political and Social Research...
  • Automatic capture of data transformations to improve metadata
    The C2Metadata Project has created software that automatically captures data transformations performed in common statistical packages and records them in a simple yet expressive representation, regardless of the original language. It encourages data...
  • Codifying tacit knowledge in functions using R
    Data collection efforts often come with documentation of varying degrees of completeness. Large amounts of documentation, or incomplete documentation, can make it hard for analysts...
  • Complete reproduction of a study through the use of GitHub, Docker and R package
    A multi-pronged approach to make code and data easy to access, make the entire analysis available, ensure the computational environment is archived, and make the code useful...
  • Example: Complete documentation and sharing of data and analysis with the example of a micro-randomized trial
    An example of pre-registration of study protocols and open source documents and code to clearly describe key assumptions and decisions made for data curation and analysis of...
  • Example: Effective communication for reproducible research
    This example highlights the importance of open communication and teamwork for reproducible research, both for making one’s own work reproducible by others and for reproducing other people’s work. ...
  • Multi-informatic Cellular Visualization
    MiCV is a Multi-informatic Cellular Visualization tool that provides a uniform web interface to a set of essential analytical tools for high-dimensional datasets. Biologists looking to scRNA-seq as a high-throughput exploratory research...
  • Principles and tools for developing standardized and interoperable ontologies
    In the informatics field, a formal ontology is a human- and computer-interpretable set of terms and relations that represent entities in a specific domain and how they relate to...
  • Replicating predictive models at scale for research on MOOCs
    A program of reproducibility research in the domain of learning analytics with a specific focus on predictive models of student success. It includes an open-source software infrastructure,...
  • Rigorous code review for better code release
    A systematic approach to code review and code release. For code review, this team conducted a “blind” experiment, in which a data analyst had to re-create the...
  • Transparent, reproducible and extensible data generation and analysis for materials simulations
    Our approach for community software development and an introduction to data and workflow management tools. Reproducible workflows are achieved with the open-source signac software framework for managing large and... A minimal signac usage sketch, for illustration only, appears after this list.
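
To give a flavor of what a managed, reproducible parameter study can look like, here is a minimal sketch using the open-source signac framework mentioned in the last entry above. It assumes the signac 1.x API, and the project name, state-point key, and document field are made up for illustration; it is not the materials-simulation group's own workflow.

    import signac

    # Create (or reopen) a signac project in the current directory.
    # Note: this is the signac 1.x call signature; signac 2.x drops the name argument.
    project = signac.init_project("toy-parameter-study")

    # Each set of input parameters becomes a "state point" with its own workspace
    # directory, so every result stays attached to the exact inputs that produced it.
    for temperature in (280, 300, 320):
        job = project.open_job({"temperature": temperature})
        job.init()
        # The job document holds lightweight, searchable metadata alongside the data.
        job.doc["status"] = "initialized"

    # Later (or on another machine), the same project can be reopened and queried.
    for job in project:
        print(job.id, job.sp.temperature, job.doc.get("status"))

Keeping parameters, data, and metadata bound together in this way is one concrete route to the "fully reproducible project" goal described at the top of this section.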

Generalizable Tools

Guidelines and tools for documentation, coding, and running analyses that standardize methods for reproducible results across studies.

  • Automatic capture of data transformations to improve metadata
    The C2Metadata Project has created software that automatically captures data transformations performed in common statistical packages and records them in a simple yet expressive representation, regardless of the original language. It encourages data... A small sketch illustrating the general idea (not the C2Metadata software or its actual representation format) appears after this list.
  • BioContainers: an open-source and community-driven framework for software standardization
    The BioContainers initiative is a free, open-source, community-driven project dedicated to helping life science researchers and data analysts improve software standardization and reproducibility. It facilitates the...
  • Codifying tacit knowledge in functions using R
    Data collection efforts often come with documentation of varying degrees of completeness. Large amounts of documentation, or incomplete documentation, can make it hard for analysts...
  • Complete reproduction of a study through the use of GitHub, Docker and R package
    A multi-pronged approach to make code and data easy to access, make the entire analysis available, ensure the computational environment is archived, and make the code useful...
  • Large-Scale, Reproducible Implementation and Evaluation of Heuristics for Optimization Problems
    Research developing new heuristics for optimization problems is often not reproducible; for instance, only 4% of papers on two famous optimization problems published their source code. This limits the...
  • Multi-informatic Cellular Visualization
    MiCV is a Multi-informatic Cellular Visualization tool that provides a uniform web interface to a set of essential analytical tools for high-dimensional datasets. Biologists looking to scRNA-seq as a high-throughput exploratory research...
  • Principles and tools for developing standardized and interoperable ontologies
    In the informatics field, a formal ontology is a human- and computer-interpretable set of terms and relations that represent entities in a specific domain and how they relate to...
  • Replicating predictive models at scale for research on MOOCs
    A program of reproducibility research in the domain of learning analytics with a specific focus on predictive models of student success. It includes an open-source software infrastructure,...
  • Rigorous code review for better code release
    A systematic approach to code review and code release. For code review, this team conducted a “blind” experiment, in which a data analyst had to re-create the...
  • Transparent, reproducible and extensible data generation and analysis for materials simulations
    Our approach for community software development and an introduction to data and workflow management tools. Reproducible workflows are achieved with the open-source signac software framework for managing large and...
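
The C2Metadata entry above concerns automatic capture of data transformations. Purely to illustrate the general idea in a few lines (this is not the C2Metadata software, and the log format and column names are invented for the example), the sketch below records each pandas transformation step as a small machine-readable provenance log that travels with the data.

    import json
    import pandas as pd

    provenance = []  # ordered, machine-readable record of every transformation

    def log_step(description, **params):
        """Append one transformation step to the provenance record."""
        provenance.append({"step": len(provenance) + 1,
                           "description": description,
                           "parameters": params})

    # A tiny made-up dataset standing in for a survey extract.
    df = pd.DataFrame({"age": [34, 29, None, 51],
                       "income": [42000, 38000, 51000, None]})

    # Each transformation is applied and logged in one place, so the script and
    # its metadata cannot drift apart.
    df = df.dropna(subset=["age"])
    log_step("drop rows with missing age", column="age")

    df["income"] = df["income"].fillna(df["income"].median())
    log_step("impute missing income with the median", column="income", method="median")

    df["high_income"] = (df["income"] > 45000).astype(int)
    log_step("derive binary high_income indicator", source="income", threshold=45000)

    # The provenance is exported as plain JSON alongside the cleaned data.
    print(json.dumps(provenance, indent=2))

The appeal of tools like C2Metadata is that they capture this kind of record automatically from existing statistical scripts, rather than relying on analysts to log each step by hand.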

Assessments of Reproducibility

Methods to test the consistency of results across studies.

  • Assessing the reproducibility of high-throughput experiments with a Bayesian hierarchical model
    A Bayesian hierarchical model framework and a set of computational toolkits to evaluate the overall reproducibility of high-throughput biological experiments, and identify irreproducible and reproducible signals via... A naive rank-based consistency check, included only for contrast with such model-based approaches, appears after this list.
  • Large-Scale, Reproducible Implementation and Evaluation of Heuristics for Optimization Problems
    Research developing new heuristics for optimization problems is often not reproducible; for instance, only 4% of papers on two famous optimization problems published their source code. This limits the...
  • Replicating predictive models at scale for research on MOOCs
    A program of reproducibility research in the domain of learning analytics with a specific focus on predictive models of student success. It includes an open-source software infrastructure,...
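
The Bayesian hierarchical framework featured above models reproducibility formally. As a much simpler point of comparison only (a naive check, not the featured method, with simulated data and thresholds chosen arbitrarily), the sketch below measures how consistently two replicate experiments rank the same signals, using a Spearman rank correlation and the overlap of their top calls.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(1)

    # Simulate two replicates of the same high-throughput experiment:
    # a shared true signal per feature plus independent measurement noise.
    n_features = 2000
    true_signal = rng.normal(0, 1, n_features)
    replicate_a = true_signal + rng.normal(0, 0.5, n_features)
    replicate_b = true_signal + rng.normal(0, 0.5, n_features)

    # Naive check 1: rank agreement across all features.
    rho, _ = spearmanr(replicate_a, replicate_b)

    # Naive check 2: overlap of the top 100 features called in each replicate.
    top_a = set(np.argsort(replicate_a)[-100:])
    top_b = set(np.argsort(replicate_b)[-100:])
    overlap = len(top_a & top_b) / 100

    print(f"Spearman rank correlation between replicates: {rho:.2f}")
    print(f"Overlap of top-100 calls: {overlap:.0%}")

Simple summaries like these say nothing about which individual signals are reproducible; that is precisely the gap that model-based assessments such as the Bayesian hierarchical framework are designed to fill.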