Skip to content
Stephen Wicks edited this page Apr 17, 2020 · 27 revisions

Clinical Data Loading Instructions

Please note that tMDataLoader is Study ID case-insensitive. Study IDs ABCEFD, ABcDeFD, abcdefd, etc. are treated as the same Study ID.

tMDataLoader will load multiple categorical values for the same variable for the same subject ID (e.g. Subject1 - Diagnosis - Osteoarthritis; Subject1 - Diagnosis - COPD). Multiple numerical values for the same variable for the same subject are not allowed and will generate error.

Preparing data for upload

Before uploading, the studies need to be placed in the corresponding directory on the server, that you specified in the tMDataLoader configuration file. It should have the structure following the example below. Nested folders are supported.

The first level of the directory defines the top category in the Dataset Explorer tree. The last level (before Clinical/Expression folders) defines the study name that will be used in the tree. All folders in between define intermediate folders in the Dataset Explorer tree.

To load Clinical Data you should have the following folders and files in your study directory:

<STUDY_NAME>_<STUDY_ID>
...
\-ClinicalData (or ClinicalDataToUpload)
  \-<STUDY_NAME>_<STUDY_ID>_Mapping_File.txt
  \-<CLINICAL_DATA_FILE1>.txt
  ...
  \-<CLINICAL_DATA_FILEN>.txt
...

Clinical Data Layout (sample)

Mapping file (sample)

Columns

The mapping file should contain 5 required columns (filename, category_cd, col_nbr, data_label, data_label_source) and may contain special optional "baseline" column for "Timestamp" data loading and variable_type column for Timepoint and Timestamp data loading. variable_type and additional validation_rules columns are also used for data validation.

  • column 1: filename - references to data file
  • column 2: category_cd - multi-level variable category, separated by '+' symbol
  • column 3: col_nbr - column number in the data file that contains values for the variable being loaded
  • column 4: data_label - name of the variable, optional, could be empty
NOTE: There are 3 controlled terminology data_labels required by the tranSMART Summary Statistics functionality
  1. Age or AGE or Age (AGE)
  2. Sex or SEX or Sex (SEX)
  3. Race or RACE or Race (RACE)

If different terminology used or these labels are included into category_cd instead of data_label column, Summary Statistics for age, race and sex for the selected subsets will not be generated automatically.
If there is more than one "Age" Data_Label anywhere in the study, higher age value will be used for Summary Statistics.

  • column 5: data_label_source - in conjunction with special value (\) in data_label may be used to form dynamic variable name
  • column 6: baseline (special column used for Timestamp data load)
  • column 7: variable_type - type of variable (Numerical, Text, Date, Categorical, Timepoint, Timestamp) (see "Summary Statistic" and Serial data loading section bellow)
  • column 8: validation_rules - list of variable specifications separated by ; (see "Summary Statistic" bellow)

data_label in details

Usually it contains just a name of variable, but it also can hold some special values:

  • STUDY_ID - the marker for column with study identifier, you should use same study ID for all clinical data files. If the STUDY_ID is missing from the mapping file, the data from the affected data file is not loaded and no error is generated.
  • SUBJ_ID - the marker for column with subject identifier
  • VISIT_NAME - the marker for column with visit name. Optional. If not specified, then visit name is ignored. If VISIT_NAME is used, all SUBJ_ID must have a value in this column (e.g. Screening, Visit 1, Follow Up, etc.)
  • OMIT - may occurs multiple times, used to mark a column as omitted (data is not loaded from this column)
NOTE: You can also omit columns from loading by not including them in the mapping file. tMDataLoader allows to skip columns, list column in any order and load data from the same column multiple times using different category_cd. If data from the same data column is loaded more then once, tMDataLoader will give a warning "... column X is already mapped...".

category_cd in details

Category_cd contains path to variable separated by +. Target variable path will be combined from the study path, category_cd, data_label, data_value (for non-numeric variables) and visit_name. Empty values will be omitted from path.

Effectively result path is concatenation of the followings: STUDY_PATH+CATEGORY_CD+DATALABEL+VISITNAME+DATAVALUE.
I.e., if study loaded to /Test Studies/Test Study_GSE0/, data_label = 'Sex', data_value = 'Female', visit_name = 'v1', category_cd = 'Clinical_Data+Demographics', then result path will be '/Test Studies/Test Study_GSE0/Clinical Data/Demographics/Sex/Female/v1/'.

You can change rules for category_cd building.

If you enable visitNameFirst option either in config or with --visit-name-first command line switcher, then VISITNAME and DATAVALUE will be reordered: STUDY_PATH+CATEGORY_CD+VISITNAME+DATALABEL+DATAVALUE. For the sample above you will get '/Test Studies/Test Study_GSE0/Clinical Data/Demographics/Sex/v1/Female/'.

If you specify VISITNAME, DATALABEL or DATAVALUE in category_cd, then they will not be added to the end of path. For example, it may be useful if you want to place VISITNAME before DATALABEL, then you can just use the following value for category_cd: 'Clinical Data+Demographics+VISITNAME+DATALABEL+DATAVALUE' or just: 'Clinical Data+Demographics+VISITNAME'.
Both variants will produce the same variable path: '/Test Studies/Test Study_GSE0/Clinical Data/Demographics/v1/Sex/Female/'.

If you want to use + in your category_cd, then read '+ special handling' section.

For advanced path building using multiple variables read 'Using tags' section.

To prevent auto-expanding category_cd with DATALABEL, DATAVALUE and VISITNAME read 'Using terminator symbol' section.

NOTE: category_cd only affects the variable's path, not the variable's value. If you change DATAVALUE location in path it will be still associated with the the same variable value.

Single visit handling

By default, if subject has only one visit, then it will be removed from variable paths to avoid unnecessary nesting, but in some cases it is an undesirable behavior. In this case you can use either always-set-visit-name command line switcher or alwaysSetVisitName config option. It will prevent visit name removing even if it is just one.

Special symbols handling (+, _, ), (, |)

Some of symbols will be pre-processed while loading:

  • | in data value replaced with - if an occurrence in the middle, removed otherwise.
  • | in data label replaced with ,
  • standalone (, ) or empty '()' in data value removed
  • % in data label, data value and variable path (category_cd) replaced with Pct
  • &, + in both data label and data value replaced with ' and '
  • _ in variable path (category_cd) and data_label replaced with ' '
  • (plus) in data label, data value and variable path replaced with +

+ special handling

Because + in category_cd used as separator for categories and pre-processed by default in data values and data labels it has special placeholder (plus) to keep it as plus sign.
I.e. for category_cd=Clinical Data+A(plus)B, data_label=D+E(plus)F, data_value=A+B(plus)C following variable path will be produced: /Clinical Data/A+B/D and E+F/A and B+C/

Using tags

In some cases may be useful to use other variables for variable path. A typical sample of such usage is the visit name variable. It comes from other column, but participate in variable path. This behavior may be extended to as much variables as needed using tags. Tag is a reference to other variable with a syntax $$other_variable_data_label.

For example, if you want to group Sex variable by Race, then you can use following category_cd for Sex variable: Clinical Data+$$Race.
For Male Asian subject we will get the following variable path: /Clinical Data/Asian/Sex/Male/.

Using terminator symbol

If you want to fix category_cd and disable auto-appending DATALABEL, DATAVALUE and VISITNAME to end of the variable path you have option to use terminator symbol. It looks like +$ at the end of category_cd.

For example, to prevent adding VISITNAME you can write concept_cd as Clinical Data+DATALABEL+DATAVALUE+$ and it will produce variable path with just DATALABEL and DATAVALUE. +$ will be removed from the variable path.
If we have visitname=v1, data_label=Sex, 'data_value'=Female, then you will get: /Clinical Data/Sex/Female/.

Another example, if we want only keep DATAVALUE in path, then for concept_cd=Clinical Data+DATAVALUE+$ and the same data as in previous sample you will get: /Clinical Data/Female/.

Merge modes

If you upload Clinical Data for same study again, then, by default, old previously loaded variables will be removed and new variables will be loaded instead. But you can set merge mode in mapping file to change this behavior.

It can be done in head lines comments (see samples here):

#MERGE_MODE: <REPLACE|UPDATE|UPDATE_VARIABLE|APPEND>

The following modes are available:

  • REPLACE - default behavior, no head line comment needed. If an existing study was not deleted first, it will be deleted by the ETL and replaced with the new copy of the same study being loaded (if same study ID, name, path). An attempt to load a new copy of the study with the same ID into a different path will generate an error.
  • UPDATE - all clinical data for patients from the new version of study that exist in the previously loaded version will be removed and new data for these patients will be loaded. If a variable is not included in the UPDATE data set, all values for this variable for patients included in the UPDATE data set will be removed. Data for subjects not included in the UPDATE data set will not be affected.
Note: most of the time you would want to use UPDATE_VARIABLE. Just UPDATE mode helpful when you want to replace all or a lot of data for a subset of subjects.
  • UPDATE_VARIABLE - old values for variables for patients in the "UPDATE_VARIABLE" data set are removed and replaced with the new ones. Other variables for these patients are not affected.
  • APPEND - New variables (not present in the existing study) will be appended to the tree according to the APPEND data set mapping file. Categorical values for variables in the APPEND data set that are already present in the existing study will not be affected (e.g Existing study Subject 1 - Sex - Female; APPEND Subject 1 - Sex - Male. Both, female and male will be associated with Subject 1). Old numerical values for existing variables that are included in the APPEND data set will be rewritten.

Serial data loading

If variable_type is defined as either Timepoint or Timestamp, tMDataLoader will create the required record that will allow to plot loaded data on the right scale using Advanced Workflow Line Graph. Timepoint can have formats such as "1 hour 30 min", "2 h", "Day 1", "Day 01". Timestamp is used in a standard format "5/16/13 19:10". Timestamp functionality requires two columns in the data file: actual timestamp for the measurement/observation and baseline timestamp to calculate timepoint (time interval). Links to examples are below.

Test Study Link for Serial LDD data Timepoint

Test Study Link for Serial LDD data Timestamp

Summary Statistic

For each clinical data upload the ETL generates SummaryStatistic.txt report in ClinicalData directory. It contains variable metrics and requested validations results. Output depends on variable_type and validation_rules specified in mapping file. By default all variables has 'Text' type and SUBJ_ID has an special variable type: 'ID'.

Columns

The SummaryStatistic.txt has following columns:

  • File - clinical data file
  • Variable - name of a variable
  • Variable Type - a variable type (as defined in the mapping file)
  • N - non empty values count
  • null - empty values count
  • Mean - mean value. Only for Numerical variables
  • Median - median value. Only for Numerical variables
  • IQR - inter quartile range. Only for Numerical variables
  • Min - minimal value. Only for Numerical variables
  • Max - maximal value. Only for Numerical variables
  • SD - standard deviation. Only for Numerical variables
  • Count - different values counts (i.e. Male: 20, Female: 100). Only for Categorical variables
  • Required - flags if the variable marked as required
  • Validation rule - list of range constraints on variable. Only for Numerical variables
  • QC missing data - lists subject ids (ID variable's values) where value is missing for required variables
  • QC data range - lists subject ids grouped by violated constraints (if any specified in Validation Rule)

Variable Types

  • ID - special variable type for SUBJ_ID. Used to identify records with validation errors
  • Text - default variable type. No special behavior defined
  • Date - semantic variable type. No special behavior defined. In future may be used in validation rules
  • Numerical - variable type for numeric variables. Different metrics (like Mean, SD, IQR) calculated for these variables. Range validation rules may be applied.
  • Categorical - variable type for categorical variables. Count column contains information about values distribution

Validation Rules

List of validation rules should contains in validation_rules column in the mapping file separated by ;. Validation rule should be from following list (X, Y should be substituted with numbers). All rules only applies to Numerical variables and rules violations reported in QC data range if no otherwise specified:

  • Required - may be applied to any type of variable. Validates if variable value present. Subjects with missing values for the variable will be reported in 'QC missing data' column in the SummaryStatistic.txt.
  • X, Greater than X - variable's value should be greater than X

  • =X, Greater than or equal to X - variable's value should be greater than or equal to X

  • <X, Lesser than X - variable's value should be lesser than X
  • <=X, Lesser than or equal to X - variable's value should be lesser than or equal to X
  • X-Y, Between X to Y - variable's value should be between X and Y (inclusive)

Example: >10; <=60; Between 50 to 100

Any rule can be specialized by additional condition on related variable using when .. is keyword. Validation rule should be followed by ', when "related_variable" is condition_on_related_variable'. The rule should be applied only to variable's values when condition on related variable for same subject satisfied.

condition_on_related_variable should be from following list (S should be substituted with a string):

  • blank - related variable value should be blank
  • present - related variable value should present
  • equal to "S" - related variable value should be equal to S
  • not equal to "S" - related variable value should not be equal to S
  • any range condition from previous list

Example: >10, when "Age">=90; <=50, when "Sex" is equal to "Female"; 10-20, when "Sex" is present; 20-30, when "Sex" is blank

Number of characters


  • study_id character varying(25)
  • site_id character varying(50)
  • subject_id character varying(100)
  • visit_name character varying(100)
  • data_label character varying(500)
  • data_value character varying(500)
  • category_cd character varying(250)
  • category_path character varying(1000)

  • STUDY_ID VARCHAR2(25 BYTE)
  • SITE_ID VARCHAR2(50 BYTE) SUBJECT_ID VARCHAR2(100 BYTE)
  • VISIT_NAME VARCHAR2(100 BYTE)
  • DATA_LABEL VARCHAR2(500 BYTE)
  • DATA_VALUE VARCHAR2(500 BYTE)
  • CATEGORY_CD VARCHAR2(250 BYTE)
  • ETL_JOB_ID NUMBER(220)
  • ETL_DATE DATE
  • USUBJID VARCHAR2(200 BYTE)
  • CATEGORY_PATH VARCHAR2(1000 BYTE)

Clone this wiki locally