Data and AI Intensive Research with Rigor and Reproducibility (DAIR³)
MIDAS leads the Data and AI Intensive Research with Rigor and Reproducibility (DAIR³) program, which includes weeklong summer bootcamps covering ethical issues in biomedical data science; data management, representation, and sharing; rigorous analytical design; the design and reporting of AI models; generative AI; reproducible workflows; and the assessment of findings across studies. The bootcamps also include grant-writing sessions and research collaboration discussions.
The rigor of scientific research and the reproducibility of research results are essential for the validity of research findings and the trustworthiness of science. However, rigor and reproducibility remain a significant challenge across scientific fields, especially for research that involves complex data types from heterogeneous sources and long data manipulation pipelines. The challenge is all the more pressing as data science and artificial intelligence (AI) methods emerge at lightning speed and researchers scramble to seize the opportunities these new methods bring.
While researchers recognize the importance of rigor and reproducibility, they often lack the resources and the technical know-how to achieve them consistently in practice. With funding from the National Institutes of Health, a multi-university team offers a nationwide program to equip faculty and technical staff in the biomedical sciences with the skills needed to improve the rigor and reproducibility of their research and to help them transfer those skills to their trainees.
Trainees will then be guided over a one-year period to incorporate the newly acquired mindset, skills, and tools into their research, and to develop training for their own institutions.
The DAIR³ team and instructors include faculty and staff research leaders from the University of Michigan, the College of William and Mary, Jackson State University, and the University of Texas at San Antonio. This highly diverse team will model the culture of diversity that we promote, and will support trainees who are demographically, professionally, and scientifically diverse and who come from a wide range of institutions, including those with limited resources.
The second round of bootcamps will be offered in the summer of 2025, with full scholarships to support trainees from Minority-Serving Institutions, underrepresented demographic groups, and resource-constrained institutions.
Clifton Addison
Associate Professor of Biostatistics, Jackson State University
Yalanda Barner
Assistant Professor of Health Policy and Management, Jackson State University
Johann Gagnon-Bartsch
Associate Professor of Statistics, College of Literature, Science, and the Arts, University of Michigan
Juan Gutiérrez
co-Principal Investigator
Professor, Chair of Mathematics, University of Texas at San Antonio
Gregory Hunt
Assistant Professor of Mathematics, College of William and Mary
H. V. Jagadish
Edgar F. Codd Distinguished University Professor and Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science; MIDAS Director, University of Michigan
Brenda Jenkins
Director of Training and Education, Jackson State University
Jing Liu
Principal Investigator
MIDAS Executive Director, University of Michigan
Kelly Psilidis
Faculty Training Program Manager, University of Michigan
Arvind Rao
Associate Professor of Computational Medicine and Bioinformatics; Associate Professor of Radiation Oncology, Medical School; Associate Professor of Biostatistics, School of Public Health, University of Michigan
Michele Randolph
Evaluation Specialist, Marsal School of Education, University of Michigan
Kerby Shedden
Professor of Statistics, College of Literature, Science, and the Arts; Professor of Biostatistics, School of Public Health; Center Director, Statistical Consultation and Research, University of Michigan
Curriculum
0.1. Responsible conduct of research (RCR) in the context of biomedical data science.
The instructor and trainees will discuss case studies about how the key RCR concepts apply to
data-intensive and AI-enabled research, including:
- Research misconduct involving data and AI
- Human subject protection reflected in the use of data and AI
- Responsible authorship, peer review, data, code and algorithm sharing
- Conflicts of interest
0.2. An overview of rigor and reproducibility considerations in biomedical research that employs data science and AI methods.
- Key issues along the data-intensive and AI-enabled research pipeline;
- The role of the PI, technical personnel and trainees.
0.3. The complexity of biomedical data and the need for insight integration.
Biomedical problems span a wide range of scales, from the planetary biosphere to population health conditions to the tiniest molecular interactions within cells. Data across these scales differ in purpose, precision, and complexity, and thus require diverse approaches to their life cycles and infrastructure needs.
- Common data types and the scientific inquiries that they enable.
- Nuanced considerations for different types of data and different types of research questions.
- Considerations of using AI tools with biomedical data.
1.1. What are ethics?
Ethics are a set of principles that guide our actions, telling us what we ought to do and, perhaps
more importantly, what we ought not to do. In this section, we will introduce the historical
foundations laid by the Belmont Report and discuss why a researcher advancing scientific goals
should worry about ethics.
1.2. Informed Consent.
- The Tuskegee experiments, the Belmont Report, and the history of regulating human subjects research.
- Informed Consent as a central principle for interventional studies.
- The limitations of informed consent for retrospective studies and data analyses.
- Possible alternative frameworks to determine appropriateness.
1.3. Privacy.
- Individual harms that can result from the lack of privacy.
- The significant privacy risks that remain even with de-identified data.
- Additional risks with AI tools applied in research.
1.4. Fairness.
- The many ways in which results of studies could be biased or unfair.
- The many ways in which this bias could creep into the analysis.
- The many definitions of fairness.
- A framework to think about issues of fairness and equity in the data analysis pipeline, all the way from data collection to result presentation.
2.1. Data management.
Data management is the process by which organizations/researchers gather, store, access and
secure data. Sound and efficient data management processes and protocols facilitate not only
more efficient access to data analytics but more open and reproducible research. Effective data
management is essential for any rigorous and reproducible research.
- The role leadership plays in the overall success of a research project.
- Understanding the tools of data management.
- Developing a data management platform.
- Developing a Manual of Operations that outlines data management policies and protocol.
- Standardizing data to promote competence and effectiveness and to allow widespread usage.
- Preparing data to generate new research insights and support better decision making, enabling users to consolidate data into useful reports.
- Data primitives and data standards.
- Data workflows.
- Versioning data, source code, and digital objects (a minimal checksum-and-manifest sketch follows this list).
- Data storage and security: balancing data storage, data security, and data availability to researchers and decision-makers in a structured, contextualized way.
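To make the versioning practices above concrete, here is a minimal sketch that records a SHA-256 checksum and basic provenance for a data file in a JSON manifest. It does not assume any particular DAIR³ tooling; file names such as cohort_labs.csv and data_manifest.json are hypothetical placeholders.

```python
import hashlib
import json
from datetime import date
from pathlib import Path

def sha256_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_version(data_file: Path, manifest_file: Path, source: str, note: str) -> dict:
    """Append a provenance entry for data_file to a JSON manifest."""
    entry = {
        "file": data_file.name,
        "sha256": sha256_checksum(data_file),
        "source": source,
        "recorded_on": date.today().isoformat(),
        "note": note,
    }
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else []
    manifest.append(entry)
    manifest_file.write_text(json.dumps(manifest, indent=2))
    return entry

# Hypothetical usage (file names are placeholders):
# record_version(Path("cohort_labs.csv"), Path("data_manifest.json"),
#                source="EHR export 2024-11", note="raw extract before cleaning")
```

In practice, a manifest like this would live under version control alongside the analysis code, so every analysis run can be traced back to a specific, checksummed data snapshot.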
2.2. Data representation.
There are many choices to make in how to represent data: as continuous-valued variables or as bucketized categories; with the full color and nuance of a text note or as a record with a set of tags; and so on. These choices matter for the types of analyses they permit or facilitate, and for the biases they may engender. They also shape how a user understands and uses the presented data. A short discretization sketch follows the list below.
- The costs and benefits of linking data.
- How to make choices with quantization and discretization.
- How to choose good tags.
- The concept of cherry-picking, and how to frame results fairly.
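As a small illustration of the quantization and discretization choices above, the sketch below bins a continuous measurement into categories with explicitly recorded cut points using pandas. The blood-pressure values, bins, and labels are illustrative assumptions, not part of the curriculum.

```python
import pandas as pd

# Hypothetical systolic blood pressure readings (mmHg).
bp = pd.Series([112, 128, 135, 151, 99, 142], name="sbp")

# Explicit, pre-specified cut points; recording them (rather than relying on
# software defaults) keeps the discretization reproducible and auditable.
bins = [0, 120, 130, 140, float("inf")]
labels = ["normal", "elevated", "stage 1 hypertension", "stage 2 hypertension"]

bp_category = pd.cut(bp, bins=bins, labels=labels, right=False)
print(pd.concat([bp, bp_category.rename("sbp_category")], axis=1))
```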
2.3. Metadata.
- Different types of metadata and their purposes.
- Introducing metadata for different types of biomedical data, such as those in the Common Data Elements repository.
- Examples of good and bad metadata for reproducibility.
- Effective approaches to position metadata as an integral component in the research workflow.
2.4. Data sharing.
Code and data are interdependent components of a transparent and reproducible research
project. Without access to the data underlying research, critique and replication of findings are
impossible. Merely providing technical access is not enough. Researchers must produce and
curate data to be FAIR: findable, accessible, interoperable, and reusable. However, research data must often be restricted in ways that directly conflict with the goal of open access. Data
accessibility is thus a continuum with varying constraints, incentives, and disincentives.
- NIH Data Management and Sharing Plan
- The FAIR Guiding Principles: Findable, Accessible, Interoperable, and Reusable.
- Findable data is easily located by humans and machines.
- Accessible data can be retrieved and are accompanied by persistent metadata.
- Interoperable data use standard models and specifications, potentially integrating with other data.
- Reusable data are well described, include provenance, and meet community standards.
- Privacy and confidentiality.
- Continuum of data accessibility.
- Five Safes model for sensitive data.
- Sharing linkages and linked data.
This unit will cover modern data analysis for biomedical studies with a particular focus on study design. Topics will include randomization, blocking, replication, positive and negative controls, missing data, selection biases, confounding, dropout, and pre-specification of the analysis protocol. A strong emphasis will be placed on the interplay of study design and statistical analysis, in particular on how design determines which statistical analyses can be performed credibly and how design can be used to improve transparency.
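As a minimal illustration of how randomization, blocking, and pre-specification interact with reproducibility, the sketch below assigns subjects to arms within blocks using a fixed random seed. The subject IDs, site labels, and seed value are hypothetical; this is a sketch of the general idea, not a prescribed protocol.

```python
import numpy as np
import pandas as pd

def blocked_randomization(subject_ids, blocks, arms=("treatment", "control"), seed=20250512):
    """Randomly assign subjects to arms, balanced within each block."""
    rng = np.random.default_rng(seed)  # fixed seed: the assignment is reproducible
    frame = pd.DataFrame({"subject": subject_ids, "block": blocks})
    assignments = []
    for _, grp in frame.groupby("block", sort=True):
        # Repeat the arm labels to cover the block, then shuffle within the block.
        labels = np.resize(arms, len(grp))
        rng.shuffle(labels)
        assignments.append(grp.assign(arm=labels))
    return pd.concat(assignments).sort_index()

# Hypothetical subjects blocked by study site.
ids = [f"S{i:02d}" for i in range(1, 9)]
sites = ["site_A"] * 4 + ["site_B"] * 4
print(blocked_randomization(ids, sites))
```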
3.1. An introduction to fundamental concepts.
- Review of classical and modern aspects of experimental design, including randomization and replication.
- Statistical power and multiple comparisons.
- Observational data, confounding, and missing data.
- The role of design in identifying errors, mitigating errors, and quantifying errors.
3.2. Case studies representative of modern biomedical studies. They will demonstrate how challenges arising from the complex biases and error structures prevalent in biomedical experiments can be overcome through careful design, and how different aspects of study design interact with one another and with subsequent statistical analysis.
- An animal model study.
- Genomic assay and batch effects.
- Longitudinal epidemiological study of population health.
3.3. Summary of the fundamental concepts and their implementation in diverse settings.
Predictive modeling is one of the most common data science methods employed by biomedical and clinical researchers, from identifying patients who are at risk for various diseases to predicting the progression and outcome of an individual patient. The validity and reproducibility of modeling outcomes depend on many factors, including the selection of task-appropriate training data with corresponding quality assessments, rigorous model development (data preprocessing, model and feature selection, and so on), and principled performance reporting and inference. In this unit, we will review the basics of predictive modeling and approaches for building accurate and reproducible models, introduce best practices in reporting that allow others to appropriately interpret and reproduce the results, and discuss guiding principles for reproducing others’ results.
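The sketch below illustrates one common way to keep such a workflow rigorous: hold out a test set before any modeling decisions, keep preprocessing inside a cross-validated pipeline to avoid leakage, and fix random seeds. It uses scikit-learn on synthetic data; the dataset and parameter choices are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, class-imbalanced stand-in for a biomedical dataset (illustration only).
X, y = make_classification(n_samples=500, n_features=20, weights=[0.8, 0.2], random_state=0)

# Hold out a test set before any model or feature selection decisions are made.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing lives inside the pipeline so it is refit within each CV fold,
# preventing information from the validation folds leaking into the scaler.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"CV ROC AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")

# Final, single evaluation on the untouched test set.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out test ROC AUC: {test_auc:.3f}")
```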
4.1. A review of predictive modeling. The modeling workflow, including data quality assessment, preprocessing, and the selection of training and testing data.
4.2. Data preparation. Data cleaning, distributional checks, dimension reduction and their underlying assumptions, and consequences for downstream inference.
4.3. Modeling tools.
- Commonly used models (linear and simple non-linear models: regression, support vector machines and decision tree ensembles) and their fit for different data and research questions.
- Common methods to evaluate the performance accuracy of a model.
- Responsible model interpretation and reporting, including goodness-of-fit statistics, accuracy, precision, recall, specificity, and sensitivity, with particular attention to the cost-sensitive and class-imbalanced aspects of clinical prediction models (a short metrics sketch follows this list).
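The short metrics sketch below shows why accuracy alone can mislead on class-imbalanced clinical data, computing sensitivity, specificity, and precision from a toy confusion matrix; the numbers are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy, class-imbalanced example: 90 negatives, 10 positives (illustration only).
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 88 + [1] * 2 + [0] * 6 + [1] * 4)  # misses most positives

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = recall_score(y_true, y_pred)    # tp / (tp + fn)
specificity = tn / (tn + fp)
precision = precision_score(y_true, y_pred)   # tp / (tp + fp)

print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}, precision={precision:.2f}")
# Accuracy is 0.92 even though 6 of the 10 positive cases are missed: report the
# full set of metrics, not accuracy alone, for class-imbalanced prediction tasks.
```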
4.4. Assessment of bias and fairness within predictive models.
- Factors that cause bias in predictive modeling and approaches to prevent it.
- Tools to measure bias, such as IBM AIF360 and Aequitas (a minimal, library-free sketch of one fairness metric follows this list).
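As a minimal, library-free illustration of one fairness metric (this is not the AIF360 or Aequitas API), the sketch below computes a demographic parity difference, i.e., the gap in positive prediction rates between groups, on invented toy data.

```python
import pandas as pd

# Toy predictions with a hypothetical binary protected attribute (illustration only).
df = pd.DataFrame({
    "group":     ["A"] * 6 + ["B"] * 6,
    "predicted": [1, 1, 1, 0, 1, 0,   1, 0, 0, 0, 1, 0],
})

# Demographic (statistical) parity: compare the rate of positive predictions by group.
rates = df.groupby("group")["predicted"].mean()
print(rates)
print(f"demographic parity difference: {rates.max() - rates.min():.2f}")
```

Demographic parity is only one of many fairness definitions; which metric is appropriate depends on the clinical context and the harms at stake.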
4.5. How to report research with predictive models.
- Datasheets for Datasets and the Dataset Nutrition Label.
- Model Scorecards.
- The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis), SPIRIT-AI, and CONSORT-AI statements.
- CLAIM: Checklist for Artificial Intelligence in Medical Imaging.
4.6. How to read an ML paper. Using example biomedical research papers, we will discuss how to identify the key information needed to compare predictive models and model designs across studies and to validate a specific research study.
This unit will help researchers develop complete and customized workflows for creating reproducible analyses. We will articulate a clear set of goals for researchers to strive for when developing reproducible analyses, present a set of software tools that help researchers achieve these goals, and demonstrate how these tools can be used in conjunction with one another to create flexible workflows.
5.1. Goals of Reproducible Analyses: reproducible, user-friendly, transparent, reusable, version-controlled, permanently archived.
5.2. Reproducibility via Code Notebooks.
- Code notebooks.
- Markdown.
- Interactivity.
- Jupyter.
- Display and code formats.
- Interoperability.
5.3. Best practices for Reproducible Programming.
- Immortalizing code choices.
- Makefiles.
- DRY principle and refactoring.
- Avoiding magic numbers.
- Caching intermediate results.
- Seeding random number generators (a short sketch combining seeding and caching follows this list).
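The sketch below combines several of these practices: a named constant instead of a magic number, a fixed random seed, and a simple cache of an intermediate result. The file name, seed value, and bootstrap example are illustrative assumptions.

```python
import json
from pathlib import Path

import numpy as np

RANDOM_SEED = 20250616                      # named constant instead of a magic number
CACHE_FILE = Path("bootstrap_cache.json")   # hypothetical cache location

def bootstrap_ci(values, n_resamples=10_000, seed=RANDOM_SEED):
    """95% bootstrap confidence interval for the mean, with a fixed seed."""
    rng = np.random.default_rng(seed)
    samples = rng.choice(values, size=(n_resamples, len(values)), replace=True)
    means = samples.mean(axis=1)
    return float(np.percentile(means, 2.5)), float(np.percentile(means, 97.5))

def cached_bootstrap_ci(values):
    """Reuse the cached interval if present; otherwise compute and cache it."""
    if CACHE_FILE.exists():
        return tuple(json.loads(CACHE_FILE.read_text()))
    ci = bootstrap_ci(np.asarray(values))
    CACHE_FILE.write_text(json.dumps(ci))
    return ci

print(cached_bootstrap_ci([4.1, 5.3, 4.8, 6.0, 5.1, 4.9, 5.7, 5.2]))
```

A real workflow would also invalidate the cache when its inputs change, for example by keying the cache file on a checksum of the input data.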
5.4. Version Control.
- What version control is and when to use it.
- Git.
- Jupytext.
5.5. Containers.
- Dependency issues.
- Solutions: venv, renv, containers.
- Docker, Singularity, Podman.
- Building/running containers.
- Interactivity and containers.
5.6. Putting Everything Together.
- Examples of analysis organization.
- Examples of makefiles and caches.
- Dos and Don’ts.
- Archiving packages and analysis code.
Meta-analysis, or research synthesis, combines evidence from multiple studies to obtain a more rigorous and reproducible result than any single study can provide. Approaches to research synthesis include systematic reviews, meta-analyses, and replication studies; we focus mainly on meta-analysis in this unit. Two key principles underlie modern meta-analysis: (i) the quantity of interest (e.g., an association, risk measure, or treatment effect) may vary among subpopulations of interest and among research methodologies; and (ii) the stated uncertainty in reported research findings may be miscalibrated. The goals of a meta-analysis are to estimate a consensus value where warranted, to quantify the potential for and sources of heterogeneity, and to evaluate and adjust for miscalibration.
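As a worked illustration of these ideas, the sketch below computes a fixed-effect (inverse-variance) pooled estimate, Cochran's Q, and the I² heterogeneity statistic for five hypothetical studies; the effect sizes and standard errors are invented for illustration.

```python
import numpy as np

# Hypothetical log odds ratios and standard errors from five studies (illustration only).
effects = np.array([0.30, 0.45, 0.10, 0.60, 0.25])
se = np.array([0.12, 0.20, 0.15, 0.25, 0.10])

# Fixed-effect (inverse-variance) pooled estimate.
w = 1.0 / se**2
pooled = np.sum(w * effects) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))

# Cochran's Q and the I^2 heterogeneity statistic.
q = np.sum(w * (effects - pooled) ** 2)
df = len(effects) - 1
i_squared = max(0.0, (q - df) / q) * 100

print(f"pooled estimate = {pooled:.3f} (SE {pooled_se:.3f})")
print(f"Q = {q:.2f} on {df} df, I^2 = {i_squared:.1f}%")
```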
6.1. Key concepts in research synthesis.
- Main approaches to research synthesis.
- Study selection.
- Ascertaining study characteristics.
- Publication bias.
6.2. Basic adjustment for heterogeneity and miscalibration.
- Statistical adjustment for known sources of inter-study heterogeneity.
- Assessment of calibration.
- Network meta-analysis.
6.3. Assessment of study results heterogeneity.
- I² statistic for heterogeneity.
- Forest plots to assess heterogeneity.
- Model-based approaches for understanding heterogeneity.
6.4. Multiple testing and causality.
- Need for multiplicity adjustment.
- Limitations of interpreting observational associations.
Transformer-based generative AI models, such as ChatGPT, have opened up many opportunities to accelerate research while posing new challenges for research rigor and reproducibility. In this unit, trainees will build skills to effectively leverage transformer-based AI systems, specifically ChatGPT, in biomedical research. More importantly, trainees will gain both the theoretical knowledge and the practical skills needed to assess the impact of these AI tools on the rigor and reproducibility of research, decide when such tools should or should not be deployed, and take appropriate precautions.
7.1. Theoretical Foundations.
- Overview of transformers, attention mechanisms, and their revolutionary role in Natural Language Processing.
- Historical perspective on the evolution from RNNs and CNNs to transformers.
- Understanding the mathematical underpinnings of transformer architecture.
7.2. Transformers in Biomedical Research.
- Case studies of transformers’ impact on biomedical data analysis.
- Exploration of BERT, GPT, and other transformer models in biomedical applications.
- Discussion on ethical considerations and the importance of bias evaluation in model training and deployment.
7.3. Data Management for Transformers.
- Data curation and preprocessing for transformer models.
- Implementing data versioning and provenance tracking for reproducibility.
- Techniques for dataset splitting and augmentation specific to biomedical data.
7.4. Setting Up the Environment.
- Installation and configuration of the necessary software and libraries.
- Introduction to OpenAI’s API and the specific features of ChatGPT-4.
- Ensuring computational reproducibility: environments, seed settings, and model checkpoints (a minimal seeding sketch follows this list).
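A minimal seeding sketch, assuming a PyTorch-based workflow (other frameworks have analogous settings): it fixes the Python, NumPy, and PyTorch random number generators and requests deterministic GPU kernels. Even so, exact reproducibility can depend on library versions, drivers, and hardware, which is why recording the full environment and model checkpoints also matters.

```python
import os
import random

import numpy as np
import torch  # assumes a PyTorch-based workflow; adapt for other frameworks

def set_global_seeds(seed: int = 42) -> None:
    """Fix the seeds most commonly involved in a PyTorch training run."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for determinism on GPU; results may still vary across
    # library versions, drivers, and hardware.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_global_seeds(42)
```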
7.5. Model Training and Fine-Tuning.
- Principles of training transformers on domain-specific biomedical datasets.
- Strategies for fine-tuning ChatGPT-4 on specialized corpora for enhanced biomedical relevance.
- Metrics and evaluation techniques for assessing model performance.
7.6. Data Sharing with Transformers.
- Best practices for sharing models and datasets in compliance with biomedical data regulations.
- Utilizing platforms for model sharing while ensuring data privacy and security.
- Annotating and documenting models for reproducibility and transparency.
7.7. Data Representation and Result Interpretation.
- Approaches for input representation in biomedical datasets for transformers.
- Techniques for interpreting transformer model outputs in biomedical contexts.
- Understanding and mitigating biases in model predictions.
Accepting Applications!
Priority Application Deadline is Jan 29
Applications will be reviewed until all spots are filled.
Develop the intellectual framework and technical skills to ensure the rigor and reproducibility of biomedical and healthcare research with cutting-edge data science and artificial intelligence (AI) methods.
Open to university faculty and research scientists.
Participation in the training program is free of charge. Scholarships are available.
Session #1
Monday, May 12 – Saturday, May 17, 2025
Jackson State University – Jackson, MS
Session #2
Monday, June 16 – Saturday, June 21, 2025
University of Texas at San Antonio – San Antonio, TX