The Michigan Institute for Data Science (MIDAS) is pleased to announce the 2021 Reproducibility Challenge.
A significant challenge across scientific fields is the reproducibility of research results, in both the narrow sense of repeating the calculations using the same data, and in the broader sense of deriving generalized insight from one or more investigations that bear on a common question. The aim of reproducible research is to make the production of scientific knowledge transparent, traceable, and trustworthy.
Data science research faces unique challenges in ensuring that results can be reliably reproduced: statistical assumptions must be surfaced; there may be complex relationships among potentially confounding variables; data collection and processing may include crucial steps (e.g., wrangling, cleaning, missing value mitigation strategies) where choices are made but not well documented or justified; computational environments may vary drastically and can change over time; workflows might require specialized infrastructure that is not easily available; sensitive projects might involve data that cannot be directly shared; and the strengths and limitations of algorithms and methodologies for data analysis vary widely.
MIDAS organized the 2020 Reproducibility Challenge to highlight high-quality, reproducible work through examples of best practices across diverse fields. The many entries that we received highlight important conceptual issues of reproducible data science in multiple dimensions and the creative practical approaches U-M researchers have used to address these challenges. Building on the 2020 Challenge, we now turn our focus to actionable solutions that can be shared with other researchers to improve reproducibility. We seek ways to validate research findings, allowing for expected variations in data, code, statistical assumptions, and computing environment.
The MIDAS Reproducibility Challenge II is open to researchers from any field that makes use of data, broadly construed. We seek entries in four categories:
- Guides, processes and tools that can be readily adopted by researchers in one or multiple research areas. Other researchers, even if only within a narrowly defined research field, should be able to follow such guides, processes and tools to make their work reproducible.
- A template for a reproducible project. The goal is to allow others to emulate and develop similar approaches and lower the burden of making their projects reproducible. We especially welcome submissions that seek to make analyses not only reproducible, but transparent. This might include especially well organized or well documented code (for example, using notebooks); saving intermediate datasets / results so that they can be easily inspected; code that draws attention to and allows easy modification of analytical choices such as statistical tuning parameters, data cleaning choices, etc.
- An analysis of reproducibility in a research field, including meta-analysis or reproducing published works, to identify good practices for and challenges to replication and confirmation of published works, and proposals for new practices and tools to overcome barriers. A research field needs to be well defined, but can be broad or narrow, e.g. “single-cell RNAseq”, or “child development research with video data”.
- Training materials for reproducible research. These should be well developed materials for courses or workshops, with a clear theme that can be easily adopted by other instructors.
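As an illustration of the transparency practices mentioned for the template category, the sketch below shows one possible way to keep analytical choices in a single visible place and to save intermediate results for inspection. It is a minimal example using only the Python standard library, not a required format. The names `AnalysisChoices`, `outlier_cutoff`, `use_median`, and `run`, as well as the output file names, are hypothetical and chosen only for this example.

```python
import csv
import json
import statistics
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class AnalysisChoices:
    """All analytical choices in one place, so readers can see and change them."""
    outlier_cutoff: float = 100.0  # hypothetical threshold: larger values are dropped
    use_median: bool = False       # hypothetical choice: median instead of mean


def run(raw, choices, out_dir):
    """Clean the data, summarize it, and save every intermediate product."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Record the choices next to the results they produced.
    (out_dir / "choices.json").write_text(json.dumps(asdict(choices), indent=2))

    # Step 1: cleaning. Missing values (None) and outliers are dropped.
    cleaned = [v for v in raw if v is not None and v <= choices.outlier_cutoff]
    with open(out_dir / "cleaned.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        writer.writerows([v] for v in cleaned)

    # Step 2: summary statistic, selected by an explicit choice.
    summarize = statistics.median if choices.use_median else statistics.mean
    result = {"n": len(cleaned), "center": summarize(cleaned)}
    (out_dir / "result.json").write_text(json.dumps(result, indent=2))
    return result
```

Calling `run([1.0, 2.0, None, 3.0, 500.0], AnalysisChoices(), "output")` writes `choices.json`, `cleaned.csv`, and `result.json` to the `output` folder. Rerunning with, say, `AnalysisChoices(outlier_cutoff=1000.0)` makes the effect of that single choice easy to compare.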
There will be a prize pool of up to $15,000 for the winning teams. Depending on the submissions, the entire amount may be awarded to a single winning project. Alternatively, we may award prizes to winners in multiple categories, as described above.
The award announcement and a Reproducibility Day will take place in January 2022. Winning teams will share their work at the event.
All selected projects will be collected on a public webpage highlighting reproducible work at Michigan and providing best practice examples for other researchers.
Research teams in any research field that make use of data, broadly construed, are welcome to enter. At least one member of the team should be a U-M investigator. If you would like to nominate a colleague, simply email email@example.com. We will then request the materials from the nominee.
The submission should include four components. All components may eventually be included in the MIDAS online resource collection for reproducibility. Please prepare your materials so that they are suitable for the broad U-M data science and AI research community.
Please use a font size of 11 points or larger, single-spaced.
- The Summary (limit of 2 pages), with the following components:
  - The category it fits in and the rationale;
  - Its key merits and innovation, if applicable;
  - What type of research (research field, type of data, etc.) may benefit from your submission;
  - What skills researchers need in order to adopt your solution.
- The Product (no page limit). This may be one of the following:
  - A detailed instruction manual or protocol for researchers in one or more research fields to improve the reproducibility of their work.
  - A detailed template that researchers can follow to make an entire project reproducible.
  - A list of technical recommendations that can be implemented for a research field to improve research reproducibility.
  - A representative sample of your training material, which may be a combination of the syllabus, video recordings, interactive tools, class notes and others.
- The Narrative (under 10 pages, including figures and diagrams, but not references). This should provide the context of the Product and any clarification that is not included in the Summary or the Product. The following are examples of elements that can be included in the Narrative; each entry may include only a subset of them, as well as elements that are not listed below.
  - A detailed report of how your Product was developed. You may include a link to a published paper.
  - If you describe a research project as an example to illustrate how your Product can be used to ensure reproducibility, it need not be a full paper, but it should include essential information such as:
    - A description of the dataset, and access to the data and associated metadata.
    - A detailed description of your data treatment and analysis workflow. Data treatment could include data cleaning, dimension reduction, treating missing data and other steps taken.
    - Clear specification of the hardware and software used in your analysis.
    - If any step in the data treatment and analysis involves human judgment, sufficient provenance information for that step, including any specific assumptions, parameters and thresholds that you used, along with your rationale.
    - Access to computational tools for data analysis, such as executable code and analysis notebooks. This could be a link to a public-facing executable analysis notebook (such as a Jupyter or RMarkdown notebook), or information on how to access an executable version of the software used in your analysis.
  - A description of any challenges you faced and lessons learned.
  - How your team encouraged the adoption of best practices and how widely adopted these best practices are in your team.
  - The broader impact of your work.
- References (no page limit).
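For the Narrative item on specifying the hardware and software used in an analysis, one possible approach is to generate that specification from the running environment rather than writing it by hand. The following is a minimal sketch using only the Python standard library. The function name `environment_snapshot` and the package names passed to it are hypothetical, and the fields it records are an assumption about what is useful, not a required format.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages=()):
    """Collect basic hardware and software facts about the current environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Hypothetical package list; replace with the packages your analysis imports.
    print(json.dumps(environment_snapshot(["numpy", "pandas"]), indent=2))
```

Saving this output alongside the results gives readers a record of the environment in which the analysis was actually run. It does not capture everything (for example, GPU details or system libraries), so it complements rather than replaces a written description.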
Submissions will be judged based on the following factors:
- The benefits of your Product to the research community.
- The ease with which your Product can be adopted by others.
- The clarity of the components of the submission.
- The innovation of the work in lowering the burden of research reproducibility.
- Work that attempts to overcome significant barriers to reproducibility (such as proprietary data or “black box” analytics) will be recognized, even if it represents an imperfect solution.
The following are samples of templates and resources that may be of assistance:
- Information about the 2020 MIDAS Reproducibility Challenge and resulting resources
- The Open Science Framework and the Center for Open Science
- The Dataverse
- Google Colab
- Archiving Code with Zenodo on GitHub
- The CRAN Time Machine for R Reproducibility
- RStudio Cloud
- Pipeline workflow environment
- Pegasus workflow management system
Research Professor, ICPSR
Professor of History
Director of Deep Blue Repository and Research Data Services, U-M
Johann Gagnon-Bartsch
Assistant Professor, Statistics
Assistant Professor, Chemical Engineering
Professor, Electrical Engineering and Computer Science
Managing Director, MIDAS
Assistant Professor, U-M Department of Internal Medicine