Pan-Canadian-Genome-Library/data-dictionary

PCGL Clinical Data and Sequencing Metadata Schemas

About

This repository contains the schemas used for data ingestion into the PCGL. The PCGL schemas build on existing ontologies and standards, providing an extensible, interoperable framework that makes data sharing easier.

The PCGL data model adopts a three-tiered schema structure: Base, Extension, and Custom. The Base schema is the core PCGL data model required for all data submission. If studies have additional well-curated data fields that they want to submit, they can create an Extension schema with the additional fields and combine the Base + Extension to form a Custom schema for their study.

The canonical version of the core data model is the LinkML document in the base directory. All other data model artifacts and documentation are derived from the base.yaml LinkML document.

Schema Framework

The schema framework is divided into three parts, defined as follows:

Term       Definition
Base       Contains common data fields shared across all domains (e.g., patient demographics, vital status, laboratory results).
           Defined and maintained by PCGL.
           Drives the Research Portal for data exploration, so that all users interact with a consistent set of data elements, which improves data interoperability.
Extension  Includes domain-specific data fields unique to each study or disease area.
           Developed collaboratively by each program or study, following the guidelines and templates provided by PCGL.
           Extends the base schema to meet the precise needs of each study without affecting the base schema.
Custom     The result of merging the base schema and the extension schema.
           Represents the complete schema used by a program or study.
           Registered in Lectern (the schema registry).
           Stored, managed, and versioned by Lectern.
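
The Base + Extension = Custom relationship can be sketched in a few lines of Python. This is only an illustration of the idea, not the repository's merge script: the dictionary layout (entity name mapped to field specifications) and the field names `submitter_participant_id` and `smoking_status` are assumptions made for the example.

```python
from copy import deepcopy


def merge_schemas(base: dict, extension: dict) -> dict:
    """Return a custom schema: every base entity plus the extension's fields.

    Assumption: each schema maps entity name -> {field name -> field spec}.
    The base is copied, never modified, and an extension may add fields
    but may not redefine a field that the base already defines.
    """
    custom = deepcopy(base)
    for entity, fields in extension.items():
        target = custom.setdefault(entity, {})
        for name, spec in fields.items():
            if name in base.get(entity, {}):
                raise ValueError(f"{entity}.{name} is already defined in the base schema")
            target[name] = deepcopy(spec)
    return custom


# Hypothetical field names, for illustration only.
base = {"participant": {"submitter_participant_id": {"required": True}}}
extension = {"participant": {"smoking_status": {"required": False}}}

custom = merge_schemas(base, extension)
print(sorted(custom["participant"]))
```

Rejecting redefinitions is one way to honour the rule that an extension must not affect the base schema; a real implementation might instead allow an extension to tighten a base field.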

Base schema diagram

diagram of base schema

Schema Overview

Each of the Base, Extension, and Custom schemas is made up of entities. An entity represents an object within the schema and serves as the basis for information collection. Examples of entities include participant, sample, and treatment.

Each entity contains fields, and each field collects a specific type of information, for example a status, a metric, a measurement, or an ID.
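
As a rough mental model, an entity can be pictured as a named list of typed fields. The sketch below is not how the PCGL schemas are stored (they are LinkML documents); the `Entity` and `Field` classes and the field names are hypothetical and exist only to show how fields constrain the records collected for an entity.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    value_type: type = str
    required: bool = False


@dataclass
class Entity:
    name: str
    fields: list = field(default_factory=list)

    def check(self, record: dict) -> list:
        """Return a list of human-readable problems found in one record."""
        errors = []
        for f in self.fields:
            value = record.get(f.name)
            if value is None:
                if f.required:
                    errors.append(f"{self.name}.{f.name}: missing required value")
            elif not isinstance(value, f.value_type):
                errors.append(f"{self.name}.{f.name}: expected {f.value_type.__name__}")
        return errors


# Hypothetical entity and field names, for illustration only.
participant = Entity(
    "participant",
    [Field("submitter_participant_id", str, required=True),
     Field("age_at_enrollment", int)],
)

good = {"submitter_participant_id": "P001", "age_at_enrollment": 42}
bad = {"age_at_enrollment": "42"}
print(participant.check(good))
print(participant.check(bad))
```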

** INSERT ER DIAGRAM HERE **

LinkML

The schemas are written in LinkML. We chose LinkML because:

  • Schemas can be used with DataHarmonizer, a browser-based spreadsheet editor that runs locally and offline
  • Data can be validated with command-line tools, locally and offline
  • LinkML supports object-like inheritance
  • LinkML supports mappings to established ontologies
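
To illustrate what object-like inheritance means here, the sketch below resolves the slots of a class by following LinkML's `is_a` links. It works on plain dictionaries shaped like the `classes` section of a LinkML schema and does not use the LinkML tooling itself; the class and slot names are hypothetical.

```python
def induced_slots(classes: dict, name: str) -> list:
    """Collect the slots of a class, inherited ones first, by following is_a."""
    chain = []
    seen = set()
    while name is not None:
        if name in seen:
            raise ValueError(f"inheritance cycle at {name}")
        seen.add(name)
        chain.append(name)
        name = classes[name].get("is_a")

    slots = []
    for cls in reversed(chain):  # walk from the root ancestor down to the child
        for slot in classes[cls].get("slots", []):
            if slot not in slots:
                slots.append(slot)
    return slots


# Hypothetical classes, for illustration only.
classes = {
    "Entity": {"slots": ["id"]},
    "Participant": {"is_a": "Entity", "slots": ["vital_status"]},
}

print(induced_slots(classes, "Participant"))  # ['id', 'vital_status']
```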

Lectern and Lyric support

PCGL data submission uses multiple components of the Overture platform. Lectern manages schemas while Lyric manages data ingestion and validation.

Lectern uses its own JSON-based syntax, so schemas must be converted from LinkML into the format Lectern accepts. We keep the schemas in LinkML because of the strengths listed above. For more details on the downsides, see restrictions/README.md.
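
The kind of conversion involved can be sketched as follows. This is a simplified illustration, not the repository's conversion script: the output layout (`schemas`, `fields`, `valueType`, `restrictions`, `codeList`) is an assumption about what a Lectern-style dictionary looks like, the type mapping is a guess at a sensible default, and the enum and field names are hypothetical.

```python
import json

# Assumed mapping from LinkML ranges to Lectern-style value types.
TYPE_MAP = {
    "string": "string",
    "integer": "integer",
    "float": "number",
    "double": "number",
    "decimal": "number",
    "boolean": "boolean",
}


def to_lectern_like(name: str, version: str, linkml: dict) -> dict:
    """Convert a LinkML-shaped dict into a simplified Lectern-style dictionary."""
    enums = linkml.get("enums", {})
    schemas = []
    for cls_name, cls in linkml.get("classes", {}).items():
        fields = []
        for slot_name, slot in cls.get("attributes", {}).items():
            rng = slot.get("range", "string")
            restrictions = {}
            if slot.get("required"):
                restrictions["required"] = True
            if rng in enums:
                # An enum range becomes a list of allowed values.
                restrictions["codeList"] = list(enums[rng]["permissible_values"])
            entry = {"name": slot_name, "valueType": TYPE_MAP.get(rng, "string")}
            if restrictions:
                entry["restrictions"] = restrictions
            fields.append(entry)
        schemas.append({"name": cls_name, "fields": fields})
    return {"name": name, "version": version, "schemas": schemas}


# Hypothetical schema content, for illustration only.
linkml = {
    "enums": {"VitalStatusEnum": {"permissible_values": {"Alive": {}, "Deceased": {}}}},
    "classes": {
        "participant": {
            "attributes": {
                "submitter_participant_id": {"range": "string", "required": True},
                "vital_status": {"range": "VitalStatusEnum"},
                "age_at_enrollment": {"range": "integer"},
            }
        }
    },
}

dictionary = to_lectern_like("example", "1.0", linkml)
print(json.dumps(dictionary, indent=2))
```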

Repository Layout

Folder             Purpose
Base               Contains LinkML files for the core data model entities.
Extension          Sub-divided per project; contains YAML files that extend base entities.
Custom             Sub-divided per project; contains 3 YAML files.
Scripts            Scripts for aggregating schemas and exporting them to various formats. See the README.md within the folder for more details.
Lectern            Sub-divided per project; contains JSON schema files that aggregate entities into a single schema.
Restrictions       Sub-divided per project; contains JSON schema files with specialized restrictions for entities.
Test_data          Sub-divided per project; contains examples of good and bad data for testing.
CSV                Sub-divided per project; contains the flattened CSV version of the custom YAML.
DataHarmonizer     Sub-divided per project; contains the zip-packaged DataHarmonizer for local, offline validation.
Typescript_export  Sub-divided per project; contains the TypeScript export used for DataHarmonizer.

Data Coordination Center Admin Happy Path

An update to either of the following schemas requires a full regeneration of resources:

  • Base schema (e.g. base/participant.yaml)
  • Extension schema (e.g. extension/example/participant.yaml)

After either change:

  1. Update the custom schema (e.g. extension/custom/participant.yaml)
  2. Use scripts/generateCustomLinkmlFromReference.py to generate extension/custom/example_dh.yaml and extension/custom/example_full.yaml

Lectern resources

  1. Use scripts/generateFlatCsvFromFullLinkml.py and extension/custom/example_full.yaml to generate csv/example/example.yaml
  2. Use scripts/generateLecternJsonFromCustomLinkml and extension/custom/example_full.yaml to generate lectern/example/example.json
  3. Register lectern/example/example.json in Lectern, per project
  4. Register the Lectern-provided IDs in Lyric
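
For readers unfamiliar with what flattening a schema involves (step 1 above), here is a minimal sketch that turns nested entity and field definitions into one CSV row per field. It is not the repository's script, and the column set (`entity`, `field`, `range`, `required`) and the example fields are assumptions chosen for illustration.

```python
import csv
import io


def flatten_to_csv(classes: dict) -> str:
    """Write one CSV row per (entity, field) pair and return the CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["entity", "field", "range", "required"])
    for entity, cls in classes.items():
        for field_name, spec in cls.get("attributes", {}).items():
            writer.writerow([
                entity,
                field_name,
                spec.get("range", "string"),
                bool(spec.get("required", False)),
            ])
    return buf.getvalue()


# Hypothetical schema content, for illustration only.
classes = {
    "participant": {
        "attributes": {
            "submitter_participant_id": {"range": "string", "required": True},
            "age_at_enrollment": {"range": "integer"},
        }
    }
}

flat = flatten_to_csv(classes)
print(flat)
```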

DataHarmonizer resources

  1. Pull the latest version of https://github.com/cidgoh/DataHarmonizer locally
  2. Run scripts/dh-validate.py from the DataHarmonizer folder on extension/custom/example_dh.yaml to generate web/templates/examples/schema.json
  3. Copy typescript_export/example/export.js to web/templates/examples
  4. Compress the folder and copy it to dataHarmonizer/example/example.tar.gz
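
The final packaging step can be done with any archive tool. As one option, the sketch below uses Python's standard `tarfile` module to write a gzip-compressed archive. The helper name `package_template` is hypothetical, and which folder to compress and how the archive is laid out internally are assumptions, since the steps above do not specify them.

```python
import tarfile
from pathlib import Path


def package_template(folder: Path, archive: Path) -> Path:
    """Compress `folder` into a .tar.gz at `archive`, creating parent folders."""
    archive.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the folder name rather than the full local path.
        tar.add(folder, arcname=folder.name)
    return archive
```

For example, `package_template(Path("web/templates/examples"), Path("dataHarmonizer/example/example.tar.gz"))` would package the template folder from step 3, assuming that is the folder meant to be compressed.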
