MIDAS Reproducibility Challenge Showcase: Oliver He & Jie Song

June 23, 2020 2:00 PM - 3:00 PM

Dr. Yongqun “Oliver” He & Jie Song

Associate Professor – Microbiology and Immunology, University of Michigan 

Graduate Student, Computer Science and Engineering, University of Michigan

View Recording (Full)

View Recording (Oliver He)

View Recording (Jie Song)

Oliver He: XOD: The eXtensible ontology development (XOD) principles and tool implementation to support ontology interoperability and data reproducibility

The major challenge in data reproducibility is the lack of a standardized, interoperable representation of heterogeneous data and of the semantic relations among the data. In AI and data science, an ontology is a structured vocabulary composed of human- and computer-interpretable terms and relations that represent entities and relationships in a specific domain. Ontologies have become critical to data and metadata standardization, integration, sharing, and computer-assisted reasoning and analysis. With hundreds of ontologies developed, ontology interoperability has become a major issue. We propose a set of eXtensible ontology development (XOD) principles and tools for developing standardized and interoperable ontologies, leading to a fundamental solution to the data reproducibility challenge. The XOD principles and tools can be applied to data reproducibility in biomedicine, pharmacy, transportation, and various other fields.
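As a rough illustration of what "computer-interpretable terms and relations" can mean, the sketch below stores a tiny "is a" hierarchy and lets a program infer relationships that are never stated directly. The terms and helper functions are hypothetical and chosen only for illustration; they are not taken from any XOD tool or from an existing ontology.

```python
# Hypothetical toy vocabulary: each term points to its parent via "is a".
# None of these terms or names come from the XOD tools themselves.
IS_A = {
    "influenza vaccine": "vaccine",
    "vaccine": "material entity",
    "material entity": "entity",
}


def ancestors(term, is_a=IS_A):
    """Follow "is a" links upward and return every broader term, nearest first."""
    result = []
    while term in is_a:
        term = is_a[term]
        result.append(term)
    return result


def is_subclass_of(term, candidate, is_a=IS_A):
    """True if `candidate` can be reached from `term` through "is a" links."""
    return candidate in ancestors(term, is_a)


if __name__ == "__main__":
    print(ancestors("influenza vaccine"))
    print(is_subclass_of("influenza vaccine", "entity"))
```

Two datasets that annotate records with terms from the same hierarchy can be compared at any shared level of it, which is one way a common vocabulary supports integration.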

Jie Song: C2Metadata: Continuous Capture of Metadata for Statistical Data

To reduce the cost and increase the completeness of metadata, we aim to work with common statistical packages to automate the capture of metadata at the granularity of individual data transformations, in a simple yet expressive representation that is independent of the original language used. C2Metadata is a system that implements this idea, creating efficiencies and reducing the costs of data collection, preparation, and reuse. The system first reads statistical transformation scripts (in SPSS, Stata, SAS, or R) and the original metadata in either of two internationally accepted XML-based standards: the Data Documentation Initiative (DDI) and Ecological Metadata Language (EML). It then walks over the individual data transformations and uses a software-independent data transformation description, the Standard Data Transformation Language (SDTL), to update the original metadata, which permits the tracking of dataset changes at different levels. We define SDTL based on a small set of transformation operators that comprise the Standard Data Transformation Algebra (SDTA) and cover the majority of data transformations used in statistical software. SDTL is a declarative language that describes the purpose of transformations in an informative way. So that readers can follow the process without learning SDTL or SDTA, we also translate SDTL into natural language as part of the updated metadata. We currently target two research communities (social and behavioral sciences and earth observation sciences) that have strong metadata standards and rely heavily on statistical analysis software. Our work is generalizable to other domains, such as biomedical research. More details of the project are available at http://c2metadata.org.
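The general idea of walking over transformation steps and updating variable-level metadata can be sketched as follows. This is a minimal illustration under stated assumptions, not the C2Metadata implementation: the step dictionaries, the operation names ("compute", "rename", "drop"), and the metadata layout are all hypothetical and do not follow the actual SDTL, DDI, or EML schemas.

```python
from copy import deepcopy


def apply_step(metadata, step):
    """Apply one hypothetical transformation step to variable-level metadata.

    Returns the updated metadata and a plain-English description of the step.
    """
    updated = deepcopy(metadata)  # leave the caller's metadata untouched
    op = step["op"]
    if op == "compute":
        # Record a new variable together with its provenance.
        updated[step["target"]] = {
            "label": step.get("label", ""),
            "derived_from": list(step["sources"]),
            "expression": step["expression"],
        }
        text = f"Compute {step['target']} as {step['expression']}."
    elif op == "rename":
        updated[step["new"]] = updated.pop(step["old"])
        text = f"Rename {step['old']} to {step['new']}."
    elif op == "drop":
        for name in step["variables"]:
            updated.pop(name)
        text = "Drop " + ", ".join(step["variables"]) + "."
    else:
        raise ValueError(f"unsupported operation: {op}")
    return updated, text


def update_metadata(metadata, steps):
    """Walk over the steps in order, collecting a readable history."""
    history = []
    current = metadata
    for step in steps:
        current, text = apply_step(current, step)
        history.append(text)
    return current, history


# Hypothetical input: original metadata plus three language-neutral steps.
original = {
    "height_cm": {"label": "Height in centimeters"},
    "weight_kg": {"label": "Weight in kilograms"},
}
steps = [
    {
        "op": "compute",
        "target": "bmi",
        "sources": ["height_cm", "weight_kg"],
        "expression": "weight_kg / (height_cm / 100) ** 2",
        "label": "Body mass index",
    },
    {"op": "rename", "old": "bmi", "new": "body_mass_index"},
    {"op": "drop", "variables": ["height_cm"]},
]

final_metadata, history = update_metadata(original, steps)

if __name__ == "__main__":
    for line in history:
        print(line)
    print(sorted(final_metadata))
```

Because each step is plain data rather than SPSS, Stata, SAS, or R syntax, the same update logic and the same readable history would apply whichever package produced the original script.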

————————————————————————-

The Reproducibility Showcase features a series of online presentations and tutorials from May to August 2020. Presenters are selected from the MIDAS Reproducibility Challenge 2020.

A significant challenge across scientific fields is the reproducibility of research results, and third-party assessment of such reproducibility. The goal of the MIDAS Reproducibility Challenge is to highlight high-quality, reproducible work at the University of Michigan by collecting examples of best practices across diverse fields.  We received a large number of entries that illustrate wonderful work in the following areas: 

  1. Theory – A definition of reproducibility and what aspects of reproducibility are critical in a particular domain or in general.
  2. Reproducing a Particular Study – A comprehensive record of parameters and code that allows others to reproduce the results of a particular project.
  3. Generalizable Tools – A general platform for coding or running analyses that standardizes the methods for reproducible results across studies.
  4. Robustness – Metadata, tools and processes to improve the robustness of results to variations in data, computational hardware and software, and human decisions.
  5. Assessments of Reproducibility – Methods to test the consistency of results from multiple projects, such as meta-analysis or the provision of parameters that can be compared across studies.
  6. Reproducibility under Constraints – Sharing code and/or data to reproduce results without violating privacy or other restrictions.

On Sept. 14, 2020, MIDAS will also host a Reproducibility Day, which is a workshop on concepts and best practices of research reproducibility.  Please save the date on your calendar.