If you would like to submit tools, publications and other resources to be included in this page, please email midas-research@umich.edu.
This page features resources submitted by U-M data science researchers. Ensuring reproducible data science is no small task: computational environments may vary drastically and can change over time; specialized workflows might require specialized infrastructure not easily available; sensitive projects might involve restricted data; the robustness of algorithmic decisions and parameter selections varies widely; crucial steps (e.g. wrangling, cleaning, mitigating missing data issues, preprocessing) where choices are made might not be well-documented. Our resource collection will help researchers tackle some of these challenges.
A tutorial on propensity score based methods for the analysis of medical insurance claims data
This tutorial offers practical guidance on the full analytic pipeline for causal inference using propensity score methods. These methods are especially useful for population-based studies using medical insurance claims datasets, which require thoughtful sample selection and analytic strategies to counter both measured and unmeasured confounding. The methods are demonstrated for several common types of …
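The core of a propensity score analysis can be sketched in a few lines. This is a generic illustration, not the tutorial's own pipeline: the simulated covariates, treatment indicator, and use of logistic regression with stabilized inverse-probability weights are assumptions for the example.

```python
# Generic sketch: estimate propensity scores with logistic regression and
# compute stabilized inverse-probability-of-treatment weights (IPTW).
# All data here are simulated; column roles are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))                      # measured confounders
treated = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

model = LogisticRegression().fit(X, treated)
ps = model.predict_proba(X)[:, 1]                # estimated propensity scores

# Stabilized weights: P(T = t) / P(T = t | X)
p_treat = treated.mean()
weights = np.where(treated == 1, p_treat / ps, (1 - p_treat) / (1 - ps))
```

Weighting is only one of several propensity score strategies (matching and stratification are common alternatives); the choice depends on the study design.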
American Economic Association (AEA) Data & Code Repository at openICPSR
The American Economic Association (AEA) shares replication packages (data and code) through a newly established AEA Data and Code Repository at the Inter-university Consortium for Political and Social Research (ICPSR). In 2019, the AEA adopted a revised Data and Code Availability Policy “to improve the reproducibility and transparency of materials supporting research published in the …
Assessing the reproducibility of high-throughput experiments with a Bayesian hierarchical model
A Bayesian hierarchical model framework and a set of computational toolkits to evaluate the overall reproducibility of high-throughput biological experiments, and to identify reproducible and irreproducible signals via rigorous false discovery rate control procedures.
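For readers unfamiliar with false discovery rate control, the classic Benjamini-Hochberg procedure conveys the basic idea; it is shown here only as background and is not the Bayesian method this toolkit implements.

```python
# Benjamini-Hochberg step-up procedure: reject the k smallest p-values,
# where k is the largest index i with p_(i) <= alpha * i / m.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below))            # largest index meeting its threshold
        rejected[order[: k + 1]] = True
    return rejected

mask = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
```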
Automatic capture of data transformations to improve metadata
The C2Metadata Project has created software that automatically captures data transformations in common statistical packages in a simple yet expressive representation, regardless of the original language. It encourages data sharing and reuse by reducing the cost of documenting data management and preparation programs. The system first translates statistical transformation scripts (in SPSS, Stata, SAS, R, …
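To make the idea concrete, here is a minimal, hand-rolled log of data transformations as machine-readable metadata, in the spirit of what C2Metadata automates. The JSON structure below is invented for this sketch and is not the project's actual representation.

```python
# Illustrative only: record each transformation step as structured metadata
# alongside the data it changes, so provenance survives with the dataset.
import json

log = []

def record(step, **details):
    log.append({"step": step, **details})

data = [{"age": 17}, {"age": 34}, {"age": -1}]

record("filter", condition="age >= 0", reason="drop invalid ages")
data = [row for row in data if row["age"] >= 0]

record("derive", new_variable="adult", rule="age >= 18")
for row in data:
    row["adult"] = row["age"] >= 18

metadata = json.dumps(log, indent=2)             # machine-readable provenance
```

C2Metadata's contribution is generating this kind of record automatically from existing SPSS/Stata/SAS/R scripts, with no manual logging required.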
BioContainers: an open-source and community-driven framework for software standardization
The BioContainers initiative is a free, open-source, community-driven project dedicated to helping life science researchers and data analysts improve software standardization and reproducibility. It facilitates requests for and maintenance of bioinformatics containers, as well as interaction between users and the community. The project is based on lightweight container technology such as Docker, and …
Codifying tacit knowledge in functions using R
Data collection efforts often come with documentation of varying completeness, and both voluminous and incomplete documentation can make it hard for analysts to use the data correctly. In this presentation, Dr. Fisher describes an approach that data distributors can use to make it easier for analysts to use …
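The general idea is to fold undocumented conventions into a single documented function so every analyst applies them the same way. The sketch below uses Python rather than R for brevity, and the sentinel codes are hypothetical, not taken from the presentation.

```python
# Codifying tacit knowledge: instead of expecting every analyst to know the
# data provider's conventions, encode them once in a documented function.

MISSING_CODES = {-9, -8}   # hypothetical sentinel values used by the provider

def clean_income(value):
    """Convert a raw income field to a float, applying the distributor's
    (hypothetical) conventions: sentinel codes become None; values are
    already in dollars, so no unit conversion is needed."""
    if value in MISSING_CODES:
        return None
    return float(value)

cleaned = [clean_income(v) for v in [52000, -9, 31000, -8]]
```

Shipping such functions with the data turns documentation that analysts might miss into code they cannot misapply.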
Complete reproduction of a study through the use of GitHub, Docker, and an R package
A multi-pronged approach to make code and data easy to access, make the entire analysis available, ensure the computational environment is archived, and make the code useful to a wide audience. The tools include making all code available on GitHub; creating a fully documented R package on CRAN to allow the primary algorithms from the …
Example: Complete documentation and sharing of data and analysis with the example of a micro-randomized trial
An example of pre-registration of study protocols and open source documents and code to clearly describe key assumptions and decisions made for data curation and analysis of a micro-randomized trial. The documentation also includes sensitivity analyses showing how the results change under alternative decisions. The workflow provides a template for other scholars to use.
Example: Effective communication for reproducible research
This example highlights the importance of open communication and teamwork for reproducible research, both for making one’s work reproducible by others and for reproducing others’ work. A few aspects of code sharing are emphasized in this example: whether the code is accessible, whether it is thoroughly and clearly documented, and whether it is generalizable.
Example: Embedding the computational pipeline in the publication
One approach for embedding the complete computational pipeline (on GitHub) in a publication.
Example: Unifying initial conditions of galaxy formation simulation for research replication
This project demonstrates the importance of controlling the initial condition of a numerical simulation of galaxy formation to allow the replication of research findings. Different groups use different initial conditions as the starting point for their numerical modeling, complicating the comparison of results between groups. Are discrepant predictions for galaxy properties due to choices in …
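The underlying point generalizes beyond galaxy simulations: randomly generated initial conditions are exactly reproducible only if the seed (and generator) is fixed and shared. The toy sketch below is generic NumPy usage, not the project's actual simulation code.

```python
# Toy illustration: two runs with the same seed produce identical initial
# conditions, so downstream results can be compared across groups.
import numpy as np

def initial_conditions(seed, n_particles=100):
    """Generate (hypothetical) particle positions and velocities from a seed."""
    rng = np.random.default_rng(seed)
    positions = rng.uniform(-1.0, 1.0, size=(n_particles, 3))
    velocities = rng.normal(0.0, 0.1, size=(n_particles, 3))
    return positions, velocities

p1, v1 = initial_conditions(seed=42)
p2, v2 = initial_conditions(seed=42)   # bit-for-bit identical to the first call
```

In practice, groups also share the generated initial-condition files themselves, since generator implementations can differ across library versions and platforms.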
Large-Scale, Reproducible Implementation and Evaluation of Heuristics for Optimization Problems
Research developing new heuristics for optimization problems is often not reproducible; for instance, only 4% of papers on two famous optimization problems published their source code. This limits the impact of the research, both within the heuristics community and more broadly among practitioners. In this work, the authors built a large-scale, open-source codebase of heuristics. …