2020 REPRODUCIBILITY CHALLENGE
A significant challenge across scientific fields is the reproducibility of research results, and third-party assessment of such reproducibility. Ensuring that results can be reliably reproduced is no small task: computational environments may vary drastically and can change over time, rendering code unable to run; specialized workflows might require specialized infrastructure not easily available; sensitive projects might involve data that cannot be directly shared; the robustness of algorithmic decisions and parameter selections varies widely; data collection methods may include crucial steps (e.g. wrangling, cleaning, missingness mitigation strategies, preprocessing) where choices are made but not well-documented. Yet a cornerstone of science remains the ability to verify and validate research findings, so it is important to find ways to overcome these challenges.
The Michigan Institute for Data Science (MIDAS) is pleased to announce the 2020 Reproducibility Challenge. Our goal is to highlight high-quality, reproducible work at the University of Michigan by collecting examples of best practices across diverse fields. Besides incentivizing reproducible workflows and enabling a deeper understanding of issues of reproducibility, we hope the results of the challenge will provide templates that others can follow if they wish to adopt more reproducible approaches to disseminating their work.
The MIDAS Reproducibility Challenge is open to researchers from any field who make use of data, broadly construed. We seek projects and corresponding artifacts and/or publications that contribute to reproducibility in a noteworthy manner. Some examples could include:
- An illustration of a definition of reproducibility for at least one application of data science;
- Metadata with sufficient transparency to allow full understanding of how the data collection, processing and computational workflows or code resulted in a study’s findings;
- An analysis workflow that can be reproduced by others, even with different hardware or software;
- A thorough description of key assumptions and of parameter and algorithmic choices in the experimental or computational methods, so that others can test the robustness and generalizability of such choices;
- Procedures or tools that other researchers can adopt to improve data transparency and analysis workflows, and to test the sensitivity of research findings to variations in data and in human decisions.
Prizes: A prize pool of up to $15,000 in cash will be awarded to the winning teams. Depending on the submissions, the entire amount may be awarded to a single winning team. Alternatively, we may award prizes to winners in multiple categories, as described above.
In addition, teams with effective approaches for reproducibility (as reflected in the submissions) will get preferential consideration for the next round of Propelling Original Data Science (PODS) grants in the fall of 2020.
All selected projects will be collected on a public webpage highlighting reproducible work at Michigan and providing best practice examples for other researchers.
How to Enter: Research teams in any research field that make use of data, broadly construed, are welcome to enter. One of the PIs/co-PIs on the submitted project should be a U-M investigator. If submitting a paper/manuscript, either the senior author or the corresponding author (or both) should be a U-M investigator.
If you would like to nominate a colleague, simply email email@example.com. We will then request the materials from the nominee.
To enter this competition, submit either option 1 or option 2 below, not both. Option 3 is optional, but likely to be very helpful to the judges. In particular, if option 2 is selected, option 3 can ensure that the judges do not overlook important contributions.
- Option 1: A reproducibility report containing the following components (as many as are applicable to your project):
  - Required: A brief summary of a research project as an example to illustrate your effort to ensure reproducibility. This need not be a full paper. In this summary, briefly describe the research question, methods, main findings and whether this work has been reproduced.
  - A description of the dataset, and access to the data along with associated metadata.
  - A detailed description of your data treatment and analysis workflow. Data treatment could include data cleaning, dimension reduction, treatment of missing data and other steps taken.
  - Clear specification of the hardware and software used in your analysis.
  - If any step in the data treatment and analysis involves human judgment, provide sufficient provenance information, including any specific assumptions, parameters and thresholds that you used, along with your rationale.
  - Access to computational tools for data analysis, such as executable code and analysis notebooks. This could be a link to a public-facing executable analysis notebook (such as a Jupyter or RMarkdown notebook), or information on how to access an executable version of the software used in your analysis.
- Option 2: If you have already published a paper or submitted a manuscript to a journal and believe that the paper/manuscript is fully reproducible, you can submit the paper/manuscript in its entirety or as a link to this Challenge. If supplementary data or software packages are necessary to achieve reproducibility, please submit links to these as well. One example of a paper that highlights reproducibility is: Replication Study: Transcriptional amplification in tumor cells with elevated c-Myc, DOI: 10.7554/eLife.30274. Similarly, if you have already published a paper or submitted a manuscript to a journal that provides best practices to ensure reproducibility, you can submit the paper/manuscript in its entirety or as a link to this Challenge. One example of such work is: Reproducibility in density functional theory calculations of solids, DOI: 10.1126/science.aad3000. In unusual circumstances, there may be a stream of work building upon one central idea. In such a case, multiple papers may be included in the submission along with a narrative (option 3 below) identifying the common, central idea being claimed as the core contribution that should be evaluated by the judges.
- Option 3 (optional, but strongly recommended, particularly if the primary submission uses option 2): You may also submit a narrative of your general approach to reproducibility. This could include, but is not limited to:
  - A description of any challenges you faced in making the work reproducible;
  - The procedures/checklists and tools that your team adopts to ensure reproducibility;
  - How your team encouraged the adoption of best practices and how widely adopted these best practices are in your team;
  - The broader impact of your work;
  - Any work you did to reproduce someone else’s results;
  - Any work you did to help your field increase reproducibility;
  - Any additional elements of rubrics for reproducibility assessment that you might have observed from within your community or elsewhere that could serve as a reasonable guide.
Timeline:
- Submissions are due by 11:59 pm, March 15, 2020.
- Winners will be announced on April 15, 2020.
- Award ceremony, winners’ presentations and reception: 9 am – 1 pm, April 21, 2020, Michigan League.
Judging Criteria: Submissions will be judged based on the following factors:
- The clarity and thoroughness of the report;
- Its potential as an example for others to follow;
- The ease and accuracy with which the results described in the report could be reproduced;
- The broader impact of the work towards addressing reproducibility challenges.
Work that attempts to overcome significant barriers to reproducibility (such as proprietary data or “black box” analytics) will be recognized, even if it represents an imperfect solution.
- Jake Carlson: Manager, Deep Blue Repositories and Research Data Services, U-M Libraries
- H.V. Jagadish: Director, MIDAS, and Professor, Computer Science and Engineering, CoE
- Matthew Kay: Assistant Professor, School of Information
- Jing Liu: Managing Director, MIDAS
- Josh Pasek: Assistant Professor, Communication and Media, LSA
- Brian Puchala: Assistant Research Scientist, Materials Science and Engineering, CoE
- Arvind Rao: Associate Professor, Computational Medicine and Bioinformatics, and Radiation Oncology, Med. School
Resources: The following are samples of tools and resources that may be of assistance:
- The Open Science Framework and the Center for Open Science
- The Dataverse
- Google Colab
- Archiving Code with Zenodo on GitHub
- The CRAN Time Machine for R Reproducibility
- RStudio Cloud
- Pipeline workflow environment
All questions should be sent to: firstname.lastname@example.org.