Standardisation of cv_terms in parquet.  #79

@ypriverol

cv_terms currently stores key-value pairs where the key is clearly a string, but the value could be a string, boolean, double, etc. Right now everything is written as string:string, which means that doubles and ints have to be written to the Parquet file as strings.

I have read a bit about which representation is best in terms of performance, speed and compression, and came across something like:

message Schema {
  required group my_field (LIST) {
    repeated group list {
      required group element {
        required binary key (UTF8);
        required group value (UNION) {
          optional binary string_value (UTF8);
          optional double double_value;
          optional int64 int_value;
        }
      }
    }
  }
}

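This means that in the schema the value can be a UNION of string, double, int, etc., with the members that are not used left null. Note that Parquet itself has no UNION annotation, so on disk this would be a plain group of optional fields of which only one is set per entry. @zprobot, can we evaluate whether that is better? @lazear, do you have an opinion on this?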

Here is how it could look in pyarrow:

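import pyarrow as pa
import pyarrow.parquet as pq

# pa.union() needs an explicit mode, and pyarrow cannot write Arrow union
# types to Parquet, so the UNION is modelled as a struct of nullable fields;
# exactly one field is set per entry.
value_schema = pa.struct([
    pa.field('string_value', pa.string()),
    pa.field('double_value', pa.float64()),
    pa.field('int_value', pa.int64())
])

# Define the schema
field_schema = pa.list_(pa.struct([
    pa.field('key', pa.string()),
    pa.field('value', value_schema)
]))

# Create the full schema
schema = pa.schema([
    ('my_field', field_schema)
])

# Example data: each value fills only the slot that matches its type
data = [
    {'my_field': [
        {'key': 'centroid',
         'value': {'string_value': 'yes', 'double_value': None, 'int_value': None}},
        {'key': 'ibaq_value',
         'value': {'string_value': None, 'double_value': 49.0, 'int_value': None}},
        {'key': 'consensus_support',
         'value': {'string_value': None, 'double_value': None, 'int_value': 4}},
        {'key': 'software',
         'value': {'string_value': 'maxquant', 'double_value': None, 'int_value': None}}
    ]}
]

# Create a Table
table = pa.Table.from_pylist(data, schema=schema)

# Write to Parquet
pq.write_table(table, 'example.parquet')

# Read from Parquet
read_table = pq.read_table('example.parquet')
print(read_table.schema)
print(read_table.to_pylist())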

I think another suggestion could be:

message Schema {
  required group my_field (LIST) {
    repeated group list {
      required group element {
        required binary key (UTF8);
        required group value (UNION) {
          optional binary string_value (UTF8);
          optional double double_value;
          optional int64 int_value;
        }
      }
    }
  }
}
