To write SQL join between two tables you need to select tables that you want to join. Access to left and right tables is provided through variables $table1 и $table2. Use it to get access to the table and select concrete fields of these tables.
Simple example of JOIN
$table1.id = $table2.patient
Advanced example of JOIN
$table1.id = $table2.patient AND $table2.type='Patient'
By default Unison platform uses only INNER JOIN.
You should select SQL Join as a linked table. Then you can use $table variable to get access to the fields of chosen joined tables.
Simple example of Data Extraction Rule
$table.id
Advanced example of Data Extraction Rule
(
SELECT CASE
WHEN $table.gender = ‘M’ THEN ‘Male’
WHEN $table.gender = ‘F’ THEN ‘Female’
else $table.gender
END
)
PostgreSQL only
Constant extraction example
If you want to use string constant, write it in single quotes, like ‘ETH’. For number constant or null, write number without quotes or null
'ETH'
Table data
Data type transformation example
$table.birthdate::timestamp
PostgreSQL only
By writing CTE you can use full functionality of SQL to select data and create intermediate tables for any transformation goals. Be careful if you use this CTE with JOINS and then with Data Extraction Rules: Unison does not store this table at the database. It is just a virtual table that will be created every time you try to query data.
Simple example of CTE
select
row_number() over() as id_int,
tt.patient,
min(tt.started_at::date) as observation_period_start_date,
max(tt.started_at::date) as observation_period_end_date
from allergies.encounters tt
group by patient
For example, let's review the situation with conditions. User may write UQL (Unison query language) query to select participants with 2 diagnoses: first and second
Unison platform will generate 2 JOINs of tables with diagnoses in this case. And Unison will generate 2 uniq aliases in each join and replace $table with alias in each case.
The same situation with $table1 and $table2 and other entities. User may query patients with 2 different measurements, so they should have unique aliases
Users may write any complex UQL expression using any column of the table ($table) in SQL syntax used in Biobank’s database engine.
In examples below examples for PostgreSQL are shown:
- Concatenation: $table.name || $table2.description
- Type casting: $table1.date::date
- Year/Month/day extraction: (EXTRACT (‘month’ FROM $table.date_field::date))::int
You can use the default translator from string to integer at the Unison platform. Translator is available at the Data Extraction Rule definitions section. This transformation works with unique strings and converts them to unique numeric values.
Manually this task may be solved by using sorting all string IDs and enumerating them. To do this we should create a lookup table to translate each string ID to int ID. After this we should join the lookup table and translate string to int.
Example: let's assume we have table persons with a string ID field named UUID varchar(64). To create lookup table we should write SQL (example with PostgreSQL)
SELECT row_number() over (order by uuid) as id_int, uuid from persons
This query will return a 2-columns table: uuid and id_int. So it’s actually lookup table
To join in we should create JOIN with these settings:
- $table1 =
persons - $table2 =
(SELECT row_number() over (order by uuid) as id_int, uuid from persons) - ON expression:
$table1.uuid = $table2.uuid
In data extraction rule we should choose created join in “table joined” field and use SQL expression $table.id_int
❗This example based on tutorial Allergy dataset. You can add to your account and look at all the code directly at the Unison Platform.
location_id field from 2 tables: allergies.patients and allergies.organizations with linked table location.id
This example looks like the previous one with one exception: we should enumerate the results of 2 source tables: organisation and patients.
To make end-to-end enumeration, we should UNION 2 sources tables and generate enumeration for the result. After this we will use result with enumeration as lookup table to translate patient.id or organization.id to location.id
As patient.id and organization.id may intersect, let's add additional field to distinguish them:
(
select row_number() over (order by id, type) as id_int, type, id from (
select id, ‘persons’ as type from allergies.patients
union select id, ‘org’ from allergies.organizations
) as tmp
)
Result of this query: 3-columns table:
- end-to-end enumeration
- type (persons or organisations)
- ID (persons or organisations)
After we should use this query in 3 joins:
- To translate patient.id to location_id
- To translate organization.id to location_id
- To generate location.id by data from patients
- To generate location.id by data from organisations
Sometimes in databases diagnoses, analyses or other entities may be presented not as rows, but as columns. But as OMOP requires storing them as rows, we should convert the table. In the example above we should convert one row with 20 diagnoses to 20 rows. To do this we should create another temporary table with 20 rows and join it. So we will multiply each row in the source table to 20 rows. After this we should create another column with content from first, second, third and so on column in first, second, third and so on rows accordingly. Lets do it!
Let's assume a table with 20 diagnose columns is called hes_apc.
- Create a join. In PostgreSQL we may use the function generate_series(1, 20) to generate a table with 1 column and 20 rows. So:
Name = diagnoses_in_rows
$table1 =hes_apc
$table2 =(generate_series(1, 20) as diag_num)
On =true - Create extraction rule for condition_occurrence.condition_concept_id:
Table joined=diagnoses_in_rows
SQL expression =(ARRAY[$table1.diag1, $table1.diag2, $table1.diag3, …, $table1.diag20])[$table2.diag_num]
In this case we create an array with all 20 diagnoses, and then take N’s element from the array. So:
- Each source rows will be multiplied by 20 times, as we joined in with a table with 20 rows without any condition
- We created an expression with the array of 20 elements and took N’s element from the array in N’s row. So the first row will contain diag1, second - diag2, …