-
Notifications
You must be signed in to change notification settings - Fork 19
02. Clinical Data
Please note that tMDataLoader is Study ID case-insensitive. Study IDs ABCEFD, ABcDeFD, abcdefd, etc. are treated as the same Study ID.
tMDataLoader will load multiple categorical values for the same variable for the same subject ID (e.g. Subject1 - Diagnosis - Osteoarthritis; Subject1 - Diagnosis - COPD). Multiple numerical values for the same variable for the same subject are not allowed and will generate error.
Before uploading, the studies need to be placed in the corresponding directory on the server, that you specified in the tMDataLoader configuration file. It should have the structure following the example below. Nested folders are supported.
The first level of the directory defines the top category in the Dataset Explorer tree. The last level (before Clinical/Expression folders) defines the study name that will be used in the tree. All folders in between define intermediate folders in the Dataset Explorer tree.
To load Clinical Data you should have the following folders and files in your study directory:
<STUDY_NAME>_<STUDY_ID>
...
\-ClinicalData (or ClinicalDataToUpload)
\-<STUDY_NAME>_<STUDY_ID>_Mapping_File.txt
\-<CLINICAL_DATA_FILE1>.txt
...
\-<CLINICAL_DATA_FILEN>.txt
...
Clinical Data Layout (sample)
Mapping file (sample)
The mapping file should contain 5 required columns (filename, category_cd, col_nbr, data_label, data_label_source) and may contain special optional "baseline" column for "Timestamp" data loading and variable_type column for Timepoint and Timestamp data loading. variable_type and additional validation_rules columns are also used for data validation.
- column 1: filename - references to data file
- column 2: category_cd - multi-level variable category, separated by '+' symbol
- column 3: col_nbr - column number in the data file that contains values for the variable being loaded
- column 4: data_label - name of the variable, optional, could be empty
NOTE: There are 3 controlled terminology data_labels required by the tranSMART Summary Statistics functionality
- Age or AGE or Age (AGE)
- Sex or SEX or Sex (SEX)
- Race or RACE or Race (RACE)
If different terminology used or these labels are included into category_cd instead of data_label column, Summary Statistics for age, race and sex for the selected subsets will not be generated automatically.
If there is more than one "Age" Data_Label anywhere in the study, higher age value will be used for Summary Statistics.
- column 5: data_label_source - in conjunction with special value (
\) in data_label may be used to form dynamic variable name - column 6: baseline (special column used for Timestamp data load)
- column 7: variable_type - type of variable (Numerical, Text, Date, Categorical, Timepoint, Timestamp) (see "Summary Statistic" and Serial data loading section bellow)
- column 8: validation_rules - list of variable specifications separated by
;(see "Summary Statistic" bellow)
Usually it contains just a name of variable, but it also can hold some special values:
- STUDY_ID - the marker for column with study identifier, you should use same study ID for all clinical data files. If the STUDY_ID is missing from the mapping file, the data from the affected data file is not loaded and no error is generated.
- SUBJ_ID - the marker for column with subject identifier
- VISIT_NAME - the marker for column with visit name. Optional. If not specified, then visit name is ignored. If VISIT_NAME is used, all SUBJ_ID must have a value in this column (e.g. Screening, Visit 1, Follow Up, etc.)
- OMIT - may occurs multiple times, used to mark a column as omitted (data is not loaded from this column)
NOTE: You can also omit columns from loading by not including them in the mapping file. tMDataLoader allows to skip columns, list column in any order and load data from the same column multiple times using different category_cd. If data from the same data column is loaded more then once, tMDataLoader will give a warning "... column X is already mapped...".
Category_cd contains path to variable separated by +. Target variable path will be combined from the study path, category_cd, data_label, data_value (for non-numeric variables) and visit_name. Empty values will be omitted from path.
Effectively result path is concatenation of the followings: STUDY_PATH+CATEGORY_CD+DATALABEL+VISITNAME+DATAVALUE.
I.e., if study loaded to /Test Studies/Test Study_GSE0/, data_label = 'Sex', data_value = 'Female', visit_name = 'v1', category_cd = 'Clinical_Data+Demographics', then result path will be '/Test Studies/Test Study_GSE0/Clinical Data/Demographics/Sex/Female/v1/'.
You can change rules for category_cd building.
If you enable visitNameFirst option either in config or with --visit-name-first command line switcher, then VISITNAME and DATAVALUE will be reordered: STUDY_PATH+CATEGORY_CD+VISITNAME+DATALABEL+DATAVALUE. For the sample above you will get '/Test Studies/Test Study_GSE0/Clinical Data/Demographics/Sex/v1/Female/'.
If you specify VISITNAME, DATALABEL or DATAVALUE in category_cd, then they will not be added to the end of path. For example, it may be useful if you want to place VISITNAME before DATALABEL, then you can just use the following value for category_cd: 'Clinical Data+Demographics+VISITNAME+DATALABEL+DATAVALUE' or just: 'Clinical Data+Demographics+VISITNAME'.
Both variants will produce the same variable path: '/Test Studies/Test Study_GSE0/Clinical Data/Demographics/v1/Sex/Female/'.
If you want to use + in your category_cd, then read '+ special handling' section.
For advanced path building using multiple variables read 'Using tags' section.
To prevent auto-expanding category_cd with DATALABEL, DATAVALUE and VISITNAME read 'Using terminator symbol' section.
NOTE: category_cd only affects the variable's path, not the variable's value. If you change DATAVALUE location in path it will be still associated with the the same variable value.
By default, if subject has only one visit, then it will be removed from variable paths to avoid unnecessary nesting, but in some cases it is an undesirable behavior. In this case you can use either always-set-visit-name command line switcher or alwaysSetVisitName config option. It will prevent visit name removing even if it is just one.
Some of symbols will be pre-processed while loading:
-
|in data value replaced with-if an occurrence in the middle, removed otherwise. -
|in data label replaced with, - standalone
(,)or empty '()' in data value removed -
%in data label, data value and variable path (category_cd) replaced with Pct -
&,+in both data label and data value replaced with' and ' -
_in variable path (category_cd) and data_label replaced with' ' -
(plus)in data label, data value and variable path replaced with+
Because + in category_cd used as separator for categories and pre-processed by default in data values and data labels it has special placeholder (plus) to keep it as plus sign.
I.e. for category_cd=Clinical Data+A(plus)B, data_label=D+E(plus)F, data_value=A+B(plus)C following variable path will be produced: /Clinical Data/A+B/D and E+F/A and B+C/
In some cases may be useful to use other variables for variable path. A typical sample of such usage is the visit name variable. It comes from other column, but participate in variable path. This behavior may be extended to as much variables as needed using tags. Tag is a reference to other variable with a syntax $$other_variable_data_label.
For example, if you want to group Sex variable by Race, then you can use following category_cd for Sex variable:
Clinical Data+$$Race.
For Male Asian subject we will get the following variable path: /Clinical Data/Asian/Sex/Male/.
If you want to fix category_cd and disable auto-appending DATALABEL, DATAVALUE and VISITNAME to end of the variable path you have option to use terminator symbol. It looks like +$ at the end of category_cd.
For example, to prevent adding VISITNAME you can write concept_cd as Clinical Data+DATALABEL+DATAVALUE+$ and it will produce variable path with just DATALABEL and DATAVALUE. +$ will be removed from the variable path.
If we have visitname=v1, data_label=Sex, 'data_value'=Female, then you will get: /Clinical Data/Sex/Female/.
Another example, if we want only keep DATAVALUE in path, then for concept_cd=Clinical Data+DATAVALUE+$ and the same data as in previous sample you will get: /Clinical Data/Female/.
If you upload Clinical Data for same study again, then, by default, old previously loaded variables will be removed and new variables will be loaded instead. But you can set merge mode in mapping file to change this behavior.
It can be done in head lines comments (see samples here):
#MERGE_MODE: <REPLACE|UPDATE|UPDATE_VARIABLE|APPEND>
The following modes are available:
- REPLACE - default behavior, no head line comment needed. If an existing study was not deleted first, it will be deleted by the ETL and replaced with the new copy of the same study being loaded (if same study ID, name, path). An attempt to load a new copy of the study with the same ID into a different path will generate an error.
- UPDATE - all clinical data for patients from the new version of study that exist in the previously loaded version will be removed and new data for these patients will be loaded. If a variable is not included in the UPDATE data set, all values for this variable for patients included in the UPDATE data set will be removed. Data for subjects not included in the UPDATE data set will not be affected.
Note: most of the time you would want to use UPDATE_VARIABLE. Just UPDATE mode helpful when you want to replace all or a lot of data for a subset of subjects.
- UPDATE_VARIABLE - old values for variables for patients in the "UPDATE_VARIABLE" data set are removed and replaced with the new ones. Other variables for these patients are not affected.
- APPEND - New variables (not present in the existing study) will be appended to the tree according to the APPEND data set mapping file. Categorical values for variables in the APPEND data set that are already present in the existing study will not be affected (e.g Existing study Subject 1 - Sex - Female; APPEND Subject 1 - Sex - Male. Both, female and male will be associated with Subject 1). Old numerical values for existing variables that are included in the APPEND data set will be rewritten.
If variable_type is defined as either Timepoint or Timestamp, tMDataLoader will create the required record that will allow to plot loaded data on the right scale using Advanced Workflow Line Graph. Timepoint can have formats such as "1 hour 30 min", "2 h", "Day 1", "Day 01". Timestamp is used in a standard format "5/16/13 19:10". Timestamp functionality requires two columns in the data file: actual timestamp for the measurement/observation and baseline timestamp to calculate timepoint (time interval). Links to examples are below.
Test Study Link for Serial LDD data Timepoint
Test Study Link for Serial LDD data Timestamp
For each clinical data upload the ETL generates SummaryStatistic.txt report in ClinicalData directory. It contains variable metrics and requested validations results. Output depends on variable_type and validation_rules specified in mapping file. By default all variables has 'Text' type and SUBJ_ID has an special variable type: 'ID'.
The SummaryStatistic.txt has following columns:
- File - clinical data file
- Variable - name of a variable
- Variable Type - a variable type (as defined in the mapping file)
- N - non empty values count
- null - empty values count
- Mean - mean value. Only for
Numericalvariables - Median - median value. Only for
Numericalvariables - IQR - inter quartile range. Only for
Numericalvariables - Min - minimal value. Only for
Numericalvariables - Max - maximal value. Only for
Numericalvariables - SD - standard deviation. Only for
Numericalvariables - Count - different values counts (i.e. Male: 20, Female: 100). Only for
Categoricalvariables - Required - flags if the variable marked as required
- Validation rule - list of range constraints on variable. Only for
Numericalvariables - QC missing data - lists subject ids (
IDvariable's values) where value is missing for required variables - QC data range - lists subject ids grouped by violated constraints (if any specified in Validation Rule)
- ID - special variable type for SUBJ_ID. Used to identify records with validation errors
- Text - default variable type. No special behavior defined
- Date - semantic variable type. No special behavior defined. In future may be used in validation rules
- Numerical - variable type for numeric variables. Different metrics (like Mean, SD, IQR) calculated for these variables. Range validation rules may be applied.
- Categorical - variable type for categorical variables.
Countcolumn contains information about values distribution
List of validation rules should contains in validation_rules column in the mapping file separated by ;. Validation rule should be from following list (X, Y should be substituted with numbers). All rules only applies to Numerical variables and rules violations reported in QC data range if no otherwise specified:
- Required - may be applied to any type of variable. Validates if variable value present. Subjects with missing values for the variable will be reported in 'QC missing data' column in the SummaryStatistic.txt.
-
X, Greater thanX- variable's value should be greater thanX -
=
X, Greater than or equal toX- variable's value should be greater than or equal toX - <
X, Lesser thanX- variable's value should be lesser thanX - <=
X, Lesser than or equal toX- variable's value should be lesser than or equal toX -
X-Y, BetweenXtoY- variable's value should be betweenXandY(inclusive)
Example: >10; <=60; Between 50 to 100
Any rule can be specialized by additional condition on related variable using when .. is keyword. Validation rule should be followed by ', when "related_variable" is condition_on_related_variable'. The rule should be applied only to variable's values when condition on related variable for same subject satisfied.
condition_on_related_variable should be from following list (S should be substituted with a string):
- blank - related variable value should be blank
- present - related variable value should present
- equal to "
S" - related variable value should be equal toS - not equal to "
S" - related variable value should not be equal toS - any range condition from previous list
Example: >10, when "Age">=90; <=50, when "Sex" is equal to "Female"; 10-20, when "Sex" is present; 20-30, when "Sex" is blank
For Postgres (in wrk_clinical_data.sql)
- study_id character varying(25)
- site_id character varying(50)
- subject_id character varying(100)
- visit_name character varying(100)
- data_label character varying(500)
- data_value character varying(500)
- category_cd character varying(250)
- category_path character varying(1000)
For Oracle (in wrk_clinical_data.sql)
- STUDY_ID VARCHAR2(25 BYTE)
- SITE_ID VARCHAR2(50 BYTE) SUBJECT_ID VARCHAR2(100 BYTE)
- VISIT_NAME VARCHAR2(100 BYTE)
- DATA_LABEL VARCHAR2(500 BYTE)
- DATA_VALUE VARCHAR2(500 BYTE)
- CATEGORY_CD VARCHAR2(250 BYTE)
- ETL_JOB_ID NUMBER(220)
- ETL_DATE DATE
- USUBJID VARCHAR2(200 BYTE)
- CATEGORY_PATH VARCHAR2(1000 BYTE)