Standardisation of cv_terms in parquet. 

cv_terms currently stores key-value pairs where the key is clearly a string, but the value could be a string, boolean, double, etc. Right now everything is written as string:string, which means that doubles and ints have to be written to the Parquet file as strings.
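To illustrate the cost of the string:string layout, here is a minimal sketch of what a reader has to do to get typed values back. The helper `parse_value` is hypothetical and not part of any existing code; the keys are taken from the example data in this issue.

```python
# Values as they are stored today: everything is a string.
cv_terms = [
    ("centroid", "yes"),
    ("ibaq_value", "49.0"),
    ("consensus_support", "4"),
]


def parse_value(raw: str):
    """Best-effort guess of the original type (hypothetical helper)."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw  # nothing numeric matched, keep the string


parsed = {key: parse_value(value) for key, value in cv_terms}
print(parsed)
```

The guess is ambiguous: a value that was meant to be the string `"4"` comes back as an integer, which is one reason to store the type explicitly.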

I have read a bit about the best representation in terms of performance and compression, and it could be something like:

```
message Schema {
  required group my_field (LIST) {
    repeated group list {
      required group element {
        required binary key (UTF8);
        required group value (UNION) {
          optional binary string_value (UTF8);
          optional double double_value;
          optional int64 int_value;
        }
      }
    }
  }
}
```

This means that in the schema the value can be a `UNION of null, float, int`, etc., where only the member matching the value's type is set and the others stay null. @zprobot, can we evaluate whether that is better? @lazear, do you have an opinion on this?

Here is how it could look in pyarrow:

```
import pyarrow as pa
import pyarrow.parquet as pq

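# Define the schema.
# Note: pa.union() requires a mode argument, and to my knowledge pyarrow
# cannot write Arrow union types to Parquet. The value is therefore modelled
# as a struct of nullable typed members, of which one is set per entry.
value_type = pa.struct([
    ('string_value', pa.string()),
    ('double_value', pa.float64()),
    ('int_value', pa.int64())
])
field_schema = pa.list_(pa.struct([
    ('key', pa.string()),
    ('value', value_type)
]))

# Create the full schema
schema = pa.schema([
    ('my_field', field_schema)
])

# Example data: each value sets only the member matching its type;
# the members that are left out are stored as null.
data = [
    {'my_field': [
        {'key': 'centroid', 'value': {'string_value': 'yes'}},
        {'key': 'ibaq_value', 'value': {'double_value': 49.0}},
        {'key': 'consensus_support', 'value': {'int_value': 4}},
        {'key': 'software', 'value': {'string_value': 'maxquant'}}
    ]}
]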

# Create a Table
table = pa.Table.from_pylist(data, schema=schema)

# Write to Parquet
pq.write_table(table, 'example.parquet')

# Read from Parquet
read_table = pq.read_table('example.parquet')
print(read_table.schema)
print(read_table.to_pylist())
```
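If the value ends up stored as separate typed members (`string_value`, `double_value`, `int_value`), writers and readers need a small mapping between plain Python values and that layout. A minimal sketch follows; `encode_value` and `decode_value` are hypothetical names, and rejecting booleans is an assumption made here because the proposed schema has no boolean member.

```python
MEMBERS = ("string_value", "double_value", "int_value")


def encode_value(value):
    """Map a plain Python value to the typed-member layout."""
    encoded = dict.fromkeys(MEMBERS)  # every member starts as None (null)
    # bool is a subclass of int in Python, so it has to be checked first.
    if isinstance(value, bool):
        raise TypeError("the proposed schema has no boolean member")
    if isinstance(value, str):
        encoded["string_value"] = value
    elif isinstance(value, int):
        encoded["int_value"] = value
    elif isinstance(value, float):
        encoded["double_value"] = value
    else:
        raise TypeError(f"unsupported cv_terms value type: {type(value).__name__}")
    return encoded


def decode_value(encoded):
    """Return the single member that is set, or None if all are null."""
    for name in MEMBERS:
        if encoded.get(name) is not None:
            return encoded[name]
    return None


terms = {"centroid": "yes", "ibaq_value": 49.0, "consensus_support": 4}
entries = [{"key": k, "value": encode_value(v)} for k, v in terms.items()]
restored = {e["key"]: decode_value(e["value"]) for e in entries}
print(restored)
```

With this layout the original types survive the round trip without guessing from strings.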

I think another suggestion could be: 

```
message Schema {
  required group my_field (LIST) {
    repeated group list {
      required group element {
        required binary key (UTF8);
        required group value (UNION) {
          optional binary string_value (UTF8);
          optional double double_value;
          optional int64 int_value;
        }
      }
    }
  }
}
```

Standardisation of cv_terms in parquet. #79
