cv_terms currently stores key-value pairs where the key is clearly a string, but the value could be a string, boolean, double, etc. Right now everything is written as string:string, which means that doubles and ints have to be written to the parquet file as strings.
I have read a bit about which representation is best in terms of performance and compression, and it could be something like:
message Schema {
  required group my_field (LIST) {
    repeated group list {
      required group element {
        required binary key (UTF8);
        required group value (UNION) {
          optional binary string_value (UTF8);
          optional double double_value;
          optional int64 int_value;
        }
      }
    }
  }
}
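This means that in the schema the value can be a union of null, string, double, int, etc. (Parquet has no native union type, so UNION here stands for a group of optional fields of which only one is set per entry.) @zprobot, can we evaluate whether that is better? @lazear, do you have an opinion on this?
Here is how it could look in pyarrow: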
import pyarrow as pa
import pyarrow.parquet as pq

# Define the value type. pa.union() needs an explicit mode and Arrow unions
# cannot be written to Parquet, so a struct with one nullable field per type
# is used here instead; only one field is set for each entry.
value_type = pa.struct([
    ('string_value', pa.string()),
    ('double_value', pa.float64()),
    ('int_value', pa.int64())
])

# Define the list of key/value entries
field_schema = pa.list_(pa.struct([
    ('key', pa.string()),
    ('value', value_type)
]))

# Create the full schema
schema = pa.schema([
    ('my_field', field_schema)
])

# Example data: fields that are left out of 'value' are stored as null
data = [
    {'my_field': [
        {'key': 'centroid', 'value': {'string_value': 'yes'}},
        {'key': 'ibaq_value', 'value': {'double_value': 49.0}},
        {'key': 'consensus_support', 'value': {'int_value': 4}},
        {'key': 'software', 'value': {'string_value': 'maxquant'}}
    ]}
]

# Create a Table
table = pa.Table.from_pylist(data, schema=schema)

# Write to Parquet
pq.write_table(table, 'example.parquet')

# Read from Parquet
read_table = pq.read_table('example.parquet')
print(read_table.schema)
print(read_table.to_pylist())
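Since everything is written as string:string today, moving to typed slots needs some way to decide which slot each value goes into. Below is a minimal sketch of that step, not a proposal for the final API. The helpers `to_typed_value` and `parse_legacy_value` are hypothetical names (they do not exist in the project), the three slots are the ones from the schema above, and storing booleans as text is an assumption made only because the schema has no boolean slot.

```python
from typing import Optional, TypedDict, Union


class TypedValue(TypedDict):
    """One entry's value: the three slots from the proposed schema."""
    string_value: Optional[str]
    double_value: Optional[float]
    int_value: Optional[int]


def to_typed_value(value: Union[str, bool, int, float]) -> TypedValue:
    """Hypothetical helper: put a Python value into exactly one slot."""
    slots: TypedValue = {
        "string_value": None,
        "double_value": None,
        "int_value": None,
    }
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so it is checked first.
        # Assumption: with no boolean slot available, store it as text.
        slots["string_value"] = "true" if value else "false"
    elif isinstance(value, int):
        slots["int_value"] = value
    elif isinstance(value, float):
        slots["double_value"] = value
    else:
        slots["string_value"] = str(value)
    return slots


def parse_legacy_value(text: str) -> TypedValue:
    """Hypothetical helper: best-effort typing of an old string value.

    Tries int first, then float, and falls back to string. Note that
    float() also accepts text such as "nan" or "inf", so those would
    end up in the double slot.
    """
    for caster in (int, float):
        try:
            return to_typed_value(caster(text))
        except ValueError:
            continue
    return to_typed_value(text)
```

With these helpers, an existing entry such as `{'key': 'ibaq_value', 'value': '49.0'}` could be rewritten as `{'key': 'ibaq_value', 'value': parse_legacy_value('49.0')}` before the table is built. Whether guessing types from strings is acceptable, or whether the type should come from the CV term definition instead, is a design question this sketch does not answer.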
I think another suggestion could be:
message Schema {
  required group my_field (LIST) {
    repeated group list {
      required group element {
        required binary key (UTF8);
        required group value (UNION) {
          optional binary string_value (UTF8);
          optional double double_value;
          optional int64 int_value;
        }
      }
    }
  }
}