This page features resources submitted by U-M data science researchers. Ensuring reproducible data science is no small task: computational environments may vary drastically and can change over time; specialized workflows might require specialized infrastructure that is not easily available; sensitive projects might involve restricted data; the robustness of algorithmic decisions and parameter selections varies widely; and crucial steps where choices are made (e.g., wrangling, cleaning, mitigating missing-data issues, preprocessing) might not be well documented. Our resource collection will help researchers tackle some of these challenges. If you would like to submit tools, publications, and other resources to be included on this page, please email midas-research@umich.edu.

Theory and definition

What can and should be reproduced, and to what extent a result can be reproduced.

  • Assessing the reproducibility of high-throughput experiments with a Bayesian hierarchical model
    A Bayesian hierarchical model framework and a set of computational toolkits to evaluate the overall reproducibility of high-throughput biological experiments, and identify irreproducible and reproducible signals via...
  • Replicating predictive models at scale for research on MOOCs
    A program of reproducibility research in the domain of learning analytics with a specific focus on predictive models of student success. It includes an open-source software infrastructure,...

Reproducible study designs

Rigorous and cutting-edge statistical and data science methods that improve reproducibility.

  • A tutorial on propensity score based methods for the analysis of medical insurance claims data
    A tutorial that offers practical guidance on the full analytic pipeline for causal inference using propensity score methods. These methods are especially useful for population-based studies using medical insurance... A minimal propensity-score weighting sketch, for illustration only, appears after this list.
  • Example: Complete documentation and sharing of data and analysis with the example of a micro-randomized trial
    An example of pre-registration of study protocols and open source documents and code to clearly describe key assumptions and decisions made for data curation and analysis of...
  • Example: Unifying initial conditions of galaxy formation simulation for research replication
    This project demonstrates the importance of controlling the initial conditions of a numerical simulation of galaxy formation to allow the replication of research findings. Different groups use different initial...
  • Large-Scale, Reproducible Implementation and Evaluation of Heuristics for Optimization Problems
    Research developing new heuristics for optimization problems is often not reproducible; for instance, only 4% of papers on two famous optimization problems published their source code. This limits the...
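
The propensity score tutorial listed above covers the full analytic pipeline; purely as an illustration of the core idea (this is not the tutorial's own code or data), the sketch below simulates a small claims-style dataset, estimates propensity scores with a logistic regression, and computes an inverse-probability-weighted treatment effect. All variable names, coefficients, and the simulated data are hypothetical.

    # Minimal, illustrative sketch of inverse probability of treatment weighting (IPTW).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Simulated claims-style data: two confounders, a binary treatment,
    # and a continuous outcome that depends on both.
    n = 5000
    age = rng.normal(60, 10, n)
    comorbidity = rng.poisson(2, n)
    X = np.column_stack([age, comorbidity])
    p_treat = 1 / (1 + np.exp(-(-6 + 0.08 * age + 0.3 * comorbidity)))
    treated = rng.binomial(1, p_treat)
    outcome = 2.0 * treated + 0.05 * age + 0.5 * comorbidity + rng.normal(0, 1, n)

    # Step 1: estimate propensity scores with a logistic regression.
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

    # Step 2: form inverse-probability-of-treatment weights.
    weights = np.where(treated == 1, 1 / ps, 1 / (1 - ps))

    # Step 3: the weighted difference in mean outcomes estimates the average
    # treatment effect (the true effect is 2.0 by construction here).
    ate = (np.average(outcome[treated == 1], weights=weights[treated == 1])
           - np.average(outcome[treated == 0], weights=weights[treated == 0]))
    print(f"IPTW estimate of the treatment effect: {ate:.2f}")

Documenting the propensity model, the weighting choices, and the balance checks alongside code like this is exactly the kind of decision record that makes such analyses reproducible.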

Fully reproducible projects

Guidelines and tools for recording and sharing data, code and documentation to reproduce the findings of a project, even with variations in data, computational hardware and software, and statistical and algorithmic decisions.

  • American Economic Association (AEA) Data & Code Repository at openICPSR
    The American Economic Association (AEA) shares replication packages (data and code) through a newly established AEA Data and Code Repository at the Inter-university Consortium for Political and Social Research...
  • Automatic capture of data transformations to improve metadata
    The C2Metadata Project has created software that automatically captures data transformations performed in common statistical packages and records them in a simple yet expressive representation, regardless of the original language. It encourages data...
  • Codifying tacit knowledge in functions using R
    Data collection efforts often come with documentation of varying degrees of completeness. Large amounts of documentation, or incomplete documentation, can make it hard for analysts...
  • Complete reproduction of a study through the use of GitHub, Docker and R package
    A multi-pronged approach to make code and data easy to access, make the entire analysis available, ensure the computational environment is archived, and make the code useful...
  • Example: Complete documentation and sharing of data and analysis with the example of a micro-randomized trial
    An example of pre-registration of study protocols and open source documents and code to clearly describe key assumptions and decisions made for data curation and analysis of...
  • Example: Effective communication for reproducible research
    This example highlights the importance of open communication and teamwork for reproducible research, both for making one’s own work reproducible by others and for reproducing other people’s work. ...
  • Multi-informatic Cellular Visualization
    MiCV is a Multi-informatic Cellular Visualization tool that provides a uniform web interface to a set of essential analytical tools for high-dimensional datasets. Biologists looking to scRNA-seq as a high-throughput exploratory research...
  • Principles and tools for developing standardized and interoperable ontologies
    In the informatics field, a formal ontology is a human- and computer-interpretable set of terms and relations that represent entities in a specific domain and how they relate to...
  • Replicating predictive models at scale for research on MOOCs
    A program of reproducibility research in the domain of learning analytics with a specific focus on predictive models of student success. It includes an open-source software infrastructure,...
  • Rigorous code review for better code release
    A systematic approach to code review and code release. For code review, this team conducted a “blind” experiment, in which a data analyst had to re-create the...
  • Transparent, reproducible and extensible data generation and analysis for materials simulations
    Our approach for community software development and an introduction to data and workflow management tools. Reproducible workflows are achieved with the open-source signac software framework for managing large and... A minimal signac usage sketch, for illustration only, appears after this list.
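
To give a flavor of what a managed, reproducible parameter study can look like, here is a minimal sketch using the open-source signac framework mentioned in the last entry above. It assumes the signac 1.x API, and the project name, state-point key, and document field are made up for illustration; it is not the materials-simulation group's own workflow.

    import signac

    # Create (or reopen) a signac project in the current directory.
    # Note: this is the signac 1.x call signature; signac 2.x drops the name argument.
    project = signac.init_project("toy-parameter-study")

    # Each set of input parameters becomes a "state point" with its own workspace
    # directory, so every result stays attached to the exact inputs that produced it.
    for temperature in (280, 300, 320):
        job = project.open_job({"temperature": temperature})
        job.init()
        # The job document holds lightweight, searchable metadata alongside the data.
        job.doc["status"] = "initialized"

    # Later (or on another machine), the same project can be reopened and queried.
    for job in project:
        print(job.id, job.sp.temperature, job.doc.get("status"))

Keeping parameters, data, and metadata bound together in this way is one concrete route to the "fully reproducible project" goal described at the top of this section.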

Generalizable Tools

Guidelines and tools for documentation, coding, and running analyses that standardize methods for reproducible results across studies.

  • Automatic capture of data transformations to improve metadata
    The C2Metadata Project has created software that automatically captures data transformations performed in common statistical packages and records them in a simple yet expressive representation, regardless of the original language. It encourages data... A small sketch illustrating the general idea (not the C2Metadata software or its actual representation format) appears after this list.
  • BioContainers: an open-source and community-driven framework for software standardization
    The BioContainers initiative is a free, open-source, community-driven project dedicated to helping life science researchers and data analysts improve software standardization and reproducibility. It facilitates the...
  • Codifying tacit knowledge in functions using R
    Data collection efforts often come with documentation of varying degrees of completeness. Large amounts of documentation, or incomplete documentation, can make it hard for analysts...
  • Complete reproduction of a study through the use of GitHub, Docker and R package
    A multi-pronged approach to make code and data easy to access, make the entire analysis available, ensure the computational environment is archived, and make the code useful...
  • Large-Scale, Reproducible Implementation and Evaluation of Heuristics for Optimization Problems
    Research developing new heuristics for optimization problems is often not reproducible; for instance, only 4% of papers on two famous optimization problems published their source code. This limits the...
  • Multi-informatic Cellular Visualization
    MiCV is a Multi-informatic Cellular Visualization tool that provides a uniform web interface to a set of essential analytical tools for high-dimensional datasets. Biologists looking to scRNA-seq as a high-throughput exploratory research...
  • Principles and tools for developing standardized and interoperable ontologies
    In the informatics field, a formal ontology is a human- and computer-interpretable set of terms and relations that represent entities in a specific domain and how they relate to...
  • Replicating predictive models at scale for research on MOOCs
    A program of reproducibility research in the domain of learning analytics with a specific focus on predictive models of student success. It includes an open-source software infrastructure,...
  • Rigorous code review for better code release
    A systematic approach to code review and code release. For code review, this team conducted a “blind” experiment, in which a data analyst had to re-create the...
  • Transparent, reproducible and extensible data generation and analysis for materials simulations
    Our approach for community software development and an introduction to data and workflow management tools. Reproducible workflows are achieved with the open-source signac software framework for managing large and...
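
The C2Metadata entry above concerns automatic capture of data transformations. Purely to illustrate the general idea in a few lines (this is not the C2Metadata software, and the log format and column names are invented for the example), the sketch below records each pandas transformation step as a small machine-readable provenance log that travels with the data.

    import json
    import pandas as pd

    provenance = []  # ordered, machine-readable record of every transformation

    def log_step(description, **params):
        """Append one transformation step to the provenance record."""
        provenance.append({"step": len(provenance) + 1,
                           "description": description,
                           "parameters": params})

    # A tiny made-up dataset standing in for a survey extract.
    df = pd.DataFrame({"age": [34, 29, None, 51],
                       "income": [42000, 38000, 51000, None]})

    # Each transformation is applied and logged in one place, so the script and
    # its metadata cannot drift apart.
    df = df.dropna(subset=["age"])
    log_step("drop rows with missing age", column="age")

    df["income"] = df["income"].fillna(df["income"].median())
    log_step("impute missing income with the median", column="income", method="median")

    df["high_income"] = (df["income"] > 45000).astype(int)
    log_step("derive binary high_income indicator", source="income", threshold=45000)

    # The provenance is exported as plain JSON alongside the cleaned data.
    print(json.dumps(provenance, indent=2))

The appeal of tools like C2Metadata is that they capture this kind of record automatically from existing statistical scripts, rather than relying on analysts to log each step by hand.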

Assessments of Reproducibility

Methods to test the consistency of results across studies.

  • Assessing the reproducibility of high-throughput experiments with a Bayesian hierarchical model
    A Bayesian hierarchical model framework and a set of computational toolkits to evaluate the overall reproducibility of high-throughput biological experiments, and identify irreproducible and reproducible signals via... A naive rank-based consistency check, included only for contrast with such model-based approaches, appears after this list.
  • Large-Scale, Reproducible Implementation and Evaluation of Heuristics for Optimization Problems
    Research developing new heuristics for optimization problems is often not reproducible; for instance, only 4% of papers on two famous optimization problems published their source code. This limits the...
  • Replicating predictive models at scale for research on MOOCs
    A program of reproducibility research in the domain of learning analytics with a specific focus on predictive models of student success. It includes an open-source software infrastructure,...
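
The Bayesian hierarchical framework featured above models reproducibility formally. As a much simpler point of comparison only (a naive check, not the featured method, with simulated data and thresholds chosen arbitrarily), the sketch below measures how consistently two replicate experiments rank the same signals, using a Spearman rank correlation and the overlap of their top calls.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(1)

    # Simulate two replicates of the same high-throughput experiment:
    # a shared true signal per feature plus independent measurement noise.
    n_features = 2000
    true_signal = rng.normal(0, 1, n_features)
    replicate_a = true_signal + rng.normal(0, 0.5, n_features)
    replicate_b = true_signal + rng.normal(0, 0.5, n_features)

    # Naive check 1: rank agreement across all features.
    rho, _ = spearmanr(replicate_a, replicate_b)

    # Naive check 2: overlap of the top 100 features called in each replicate.
    top_a = set(np.argsort(replicate_a)[-100:])
    top_b = set(np.argsort(replicate_b)[-100:])
    overlap = len(top_a & top_b) / 100

    print(f"Spearman rank correlation between replicates: {rho:.2f}")
    print(f"Overlap of top-100 calls: {overlap:.0%}")

Simple summaries like these say nothing about which individual signals are reproducible; that is precisely the gap that model-based assessments such as the Bayesian hierarchical framework are designed to fill.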