This repository contains the schemas used for data ingestion into the PCGL. The PCGL schemas use existing ontologies and standards, providing an extensible, interoperable framework for ease of data sharing.
The PCGL data model adopts a three-tiered schema structure: Base, Extension, and Custom. The Base schema is the core PCGL data model required for all data submission. If studies have additional well-curated data fields that they want to submit, they can create an Extension schema with the additional fields and combine the Base + Extension to form a Custom schema for their study.
The canonical version of the core data model is the LinkML document in the base directory. All other data model artifacts and documentation are derived from the base.yaml LinkML document.
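As a sketch of this layering (the entity and field names below are illustrative, not the actual PCGL model), an Extension class can build on a Base class using LinkML's `is_a` inheritance, and the merged document becomes the Custom schema:

```yaml
# Hypothetical Base entity (Base schema)
classes:
  Participant:
    description: Core participant entity shared by all studies
    attributes:
      participant_id:
        identifier: true
        range: string
      vital_status:
        range: string

  # Hypothetical Extension entity (Extension schema): inherits
  # every Base field and adds one study-specific field
  StudyParticipant:
    is_a: Participant
    attributes:
      tumour_stage:
        range: string
```

Merging the Base and Extension documents yields the Custom schema that is registered downstream.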
The Schema framework is divided into three parts and defined as follows:
| Term | Definition |
|---|---|
| Base | Contains common data fields shared across all domains (e.g., patient demographics, vital status, laboratory results).<br>Defined and maintained by PCGL.<br>Drives the Research portal for data exploration, ensuring all users interact with a consistent set of data elements and enhancing data interoperability. |
| Extension | Includes domain-specific data fields unique to each study or disease area.<br>Collaboratively developed by each individual Program/Study according to guidelines and templates provided by PCGL.<br>Extends the base schema to meet the precise needs of each study without affecting the base schema. |
| Custom | The result of merging the base schema and the extension schema.<br>Represents the complete schema used by a program or study.<br>Registered to Lectern (Schema Registry).<br>Stored, managed, and versioned by Lectern. |
Base schema diagram
Within the Base, Extension and Custom schemas are entities that represent objects within the schema and serve as the basis for information collection. Types of entities include: participant, sample, treatment, etc.
Entities contain fields, each of which collects a specific type of information, for example a status, a metric, a measurement, or an ID.
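For example (the entity and field names here are illustrative), an entity's fields are declared in LinkML as attributes, each capturing one kind of information such as an ID, a status, or a measurement:

```yaml
classes:
  Sample:
    description: Hypothetical sample entity
    attributes:
      sample_id:       # an ID field
        identifier: true
        range: string
      sample_status:   # a status field
        range: string
      volume_ml:       # a measurement field
        range: float
```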
** INSERT ER DIAGRAM HERE **
The schemas are coded in LinkML format. We have chosen LinkML because:
- Schemas can be used with DataHarmonizer, a browser-based spreadsheet editor, locally and offline
- Data can be validated through command-line tools locally and offline
- LinkML supports object-like inheritance
- LinkML supports mappings to established ontologies
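As a sketch of the ontology-mapping strength (the field is illustrative, and the NCIT codes shown should be verified against the ontology before reuse), LinkML enum values can carry a `meaning` that points at an established ontology term:

```yaml
enums:
  VitalStatusEnum:
    permissible_values:
      Alive:
        meaning: NCIT:C37987   # NCI Thesaurus term for "Alive"
      Deceased:
        meaning: NCIT:C28554   # NCI Thesaurus term for "Dead"

slots:
  vital_status:
    range: VitalStatusEnum
```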
PCGL data submission uses multiple components of the Overture platform. Lectern manages schemas while Lyric manages data ingestion and validation.
Lectern uses a custom JSON-formatted syntax, so schemas must be converted from LinkML format into the format Lectern accepts. We keep schemas in LinkML format due to the previously mentioned strengths. For more details on the downsides, see restrictions/README.md.
| Folder | Purpose |
|---|---|
| Base | Contains linkML files for core data model entities |
| Extension | Sub-divided per project, contains YAML files that extend base entities |
| Custom | Sub-divided per project, contains 3 YAMLs |
| Scripts | Scripts for aggregating schemas and exporting into various types. See README.md within folder for more details |
| Lectern | Sub-divided per project, JSON schema files containing entities aggregated into a single schema |
| Restrictions | Sub-divided per project, JSON schema files containing specialized restrictions for entities. |
| Test_data | Sub-divided per project, contains examples of good and bad data for testing. |
| CSV | Sub-divided per project, contains the flattened CSV version of custom YAML |
| DataHarmonizer | Sub-divided per project, contains the zip-packaged DataHarmonizer for local offline validation. |
| Typescript_export | Sub-divided per project, contains the exported TypeScript used for DataHarmonizer. |
An update to any of the following schemas will require a full regeneration of resources:
- Base schema (e.g. `base/participant.yaml`)
- Extension schema (e.g. `extension/example/participant.yaml`)

To regenerate:
- Update the Custom schema (e.g. `extension/custom/participant.yaml`)
- Use `scripts/generateCustomLinkmlFromReference.py` to generate `extension/custom/example_dh.yaml` and `extension/custom/example_full.yaml`
- Use `scripts/generateFlatCsvFromFullLinkml.py` and `extension/custom/example_full.yaml` to generate `csv/example/example.yaml`
- Use `scripts/generateLecternJsonFromCustomLinkml` and `extension/custom/example_full.yaml` to generate `lectern/example/example.json`
- Register `lectern/example/example.json` in Lectern per project
- Register the Lectern-provided IDs in Lyric
- Pull the latest version of https://github.com/cidgoh/DataHarmonizer locally
- Run `scripts/dh-validate.py` from the `DataHarmonizer` folder on `extension/custom/example_dh.yaml` to generate `web/templates/examples/schema.json`
- Copy `typescript_export/example/export.js` to `web/templates/examples`
- Compress the folder and copy it over to `dataHarmonizer/example/example.tar.gz`
