From a2c8fff42c5027f22d0015aecc88e721ab36ce2b Mon Sep 17 00:00:00 2001 From: Rok Mihevc Date: Wed, 15 May 2024 19:41:59 +0200 Subject: [PATCH 1/7] Add FIXED_SIZE_LIST --- LogicalTypes.md | 12 ++++++++++++ src/main/thrift/parquet.thrift | 20 +++++++++++--------- 2 files changed, 23 insertions(+), 9 deletions(-) diff --git a/LogicalTypes.md b/LogicalTypes.md index e7a0ce046..dfe1dc59e 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -256,6 +256,18 @@ The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`. The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`. +### FIXED_SIZE_LIST + +The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements +of a fixed-width data type. It must annotate an N-byte fixed length binary +where N is the number of elements in the list times bit width of the element +data type. + +The `fixed_len_byte_array` data is interpreted as a sequence of elements of +the same fixed-width data type. + +The sort order used for `FIXED_SIZE_LIST` is undefined. + ## Temporal Types ### DATE diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index a9e62cc0d..277a75aca 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -319,6 +319,7 @@ struct ListType {} // see LogicalTypes.md struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8 struct DateType {} // allowed for INT32 struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md) +struct FixedSizeListType {} // see LogicalTypes.md /** * Logical type to annotate a column that is always null. 
@@ -485,15 +486,16 @@ union LogicalType { 8: TimestampType TIMESTAMP // 9: reserved for INTERVAL - 10: IntType INTEGER // use ConvertedType INT_* or UINT_* - 11: NullType UNKNOWN // no compatible ConvertedType - 12: JsonType JSON // use ConvertedType JSON - 13: BsonType BSON // use ConvertedType BSON - 14: UUIDType UUID // no compatible ConvertedType - 15: Float16Type FLOAT16 // no compatible ConvertedType - 16: VariantType VARIANT // no compatible ConvertedType - 17: GeometryType GEOMETRY // no compatible ConvertedType - 18: GeographyType GEOGRAPHY // no compatible ConvertedType + 10: IntType INTEGER // use ConvertedType INT_* or UINT_* + 11: NullType UNKNOWN // no compatible ConvertedType + 12: JsonType JSON // use ConvertedType JSON + 13: BsonType BSON // use ConvertedType BSON + 14: UUIDType UUID // no compatible ConvertedType + 15: Float16Type FLOAT16 // no compatible ConvertedType + 16: VariantType VARIANT // no compatible ConvertedType + 17: GeometryType GEOMETRY // no compatible ConvertedType + 18: GeographyType GEOGRAPHY // no compatible ConvertedType + 19: FixedSizeListType FIXED_SIZE_LIST // no compatible ConvertedType } /** From 0df224a5695095a7d3091a2d83a6ba00ca7f481d Mon Sep 17 00:00:00 2001 From: Rok Mihevc Date: Wed, 15 May 2024 23:52:57 +0200 Subject: [PATCH 2/7] Review feedback --- LogicalTypes.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/LogicalTypes.md b/LogicalTypes.md index dfe1dc59e..8e799ce04 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -259,12 +259,10 @@ The sort order for `FLOAT16` is signed (with special handling of NANs and signed ### FIXED_SIZE_LIST The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements -of a fixed-width data type. It must annotate an N-byte fixed length binary -where N is the number of elements in the list times bit width of the element -data type. +of a primitive data type. It must annotate a `binary` primitive type. 
The `fixed_len_byte_array` data is interpreted as a sequence of elements of -the same fixed-width data type. +the same primitive data type. The sort order used for `FIXED_SIZE_LIST` is undefined. From ee67f4846c58a2968c26d00797335efdb433806f Mon Sep 17 00:00:00 2001 From: Rok Mihevc Date: Tue, 4 Jun 2024 19:47:33 +0200 Subject: [PATCH 3/7] Update LogicalTypes.md Co-authored-by: Ed Seidl --- LogicalTypes.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/LogicalTypes.md b/LogicalTypes.md index 8e799ce04..3ded17c2a 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -261,7 +261,7 @@ The sort order for `FLOAT16` is signed (with special handling of NANs and signed The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements of a primitive data type. It must annotate a `binary` primitive type. -The `fixed_len_byte_array` data is interpreted as a sequence of elements of +The `binary` data is interpreted as a sequence of elements of the same primitive data type. The sort order used for `FIXED_SIZE_LIST` is undefined. From 2739536ab481f5d64d400af4013f889958628205 Mon Sep 17 00:00:00 2001 From: Rok Mihevc Date: Wed, 5 Jun 2024 03:48:29 +0200 Subject: [PATCH 4/7] Review feedback, split into FixedSizeListType and VariableSizeListType --- LogicalTypes.md | 14 +++++++++++--- src/main/thrift/parquet.thrift | 29 ++++++++++++++++++----------- 2 files changed, 29 insertions(+), 14 deletions(-) diff --git a/LogicalTypes.md b/LogicalTypes.md index 3ded17c2a..b63419b75 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -259,13 +259,21 @@ The sort order for `FLOAT16` is signed (with special handling of NANs and signed ### FIXED_SIZE_LIST The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements -of a primitive data type. It must annotate a `binary` primitive type. +of a primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type. 
-The `binary` data is interpreted as a sequence of elements of -the same primitive data type. +The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of +elements of the same primitive data type. The sort order used for `FIXED_SIZE_LIST` is undefined. +### VARIABLE_SIZE_LIST + +The `VARIABLE_SIZE_LIST` annotation represents a variable-size list of elements +of a primitive data type. It must annotate a `BYTE_ARRAY` primitive type. + +The `BYTE_ARRAY` data is interpreted as a variable size sequence of elements of +the same primitive data type. + ## Temporal Types ### DATE diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 277a75aca..af2915dd2 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -319,7 +319,13 @@ struct ListType {} // see LogicalTypes.md struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8 struct DateType {} // allowed for INT32 struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md) -struct FixedSizeListType {} // see LogicalTypes.md +struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY[num_values * width of type], + 1: required Type type; // see LogicalTypes.md + 2: required i32 num_values; +} +struct VariableSizeListType { // allowed for BYTE_ARRAY, see LogicalTypes.md + 1: required Type type; +} /** * Logical type to annotate a column that is always null. 
@@ -486,16 +492,17 @@ union LogicalType { 8: TimestampType TIMESTAMP // 9: reserved for INTERVAL - 10: IntType INTEGER // use ConvertedType INT_* or UINT_* - 11: NullType UNKNOWN // no compatible ConvertedType - 12: JsonType JSON // use ConvertedType JSON - 13: BsonType BSON // use ConvertedType BSON - 14: UUIDType UUID // no compatible ConvertedType - 15: Float16Type FLOAT16 // no compatible ConvertedType - 16: VariantType VARIANT // no compatible ConvertedType - 17: GeometryType GEOMETRY // no compatible ConvertedType - 18: GeographyType GEOGRAPHY // no compatible ConvertedType - 19: FixedSizeListType FIXED_SIZE_LIST // no compatible ConvertedType + 10: IntType INTEGER // use ConvertedType INT_* or UINT_* + 11: NullType UNKNOWN // no compatible ConvertedType + 12: JsonType JSON // use ConvertedType JSON + 13: BsonType BSON // use ConvertedType BSON + 14: UUIDType UUID // no compatible ConvertedType + 15: Float16Type FLOAT16 // no compatible ConvertedType + 16: VariantType VARIANT // no compatible ConvertedType + 17: GeometryType GEOMETRY // no compatible ConvertedType + 18: GeographyType GEOGRAPHY // no compatible ConvertedType + 19: FixedSizeListType FIXED_SIZE_LIST // no compatible ConvertedType + 20: VariableSizeListType VARIABLE_SIZE_LIST // no compatible ConvertedType } /** From f71350600dec31ad55e32b7137dc7b47b52b4ff6 Mon Sep 17 00:00:00 2001 From: Rok Mihevc Date: Mon, 24 Jun 2024 11:07:01 +0200 Subject: [PATCH 5/7] Removing VariableSizeListType --- LogicalTypes.md | 8 -------- src/main/thrift/parquet.thrift | 4 ---- 2 files changed, 12 deletions(-) diff --git a/LogicalTypes.md b/LogicalTypes.md index b63419b75..85d7c46a1 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -266,14 +266,6 @@ elements of the same primitive data type. The sort order used for `FIXED_SIZE_LIST` is undefined. -### VARIABLE_SIZE_LIST - -The `VARIABLE_SIZE_LIST` annotation represents a variable-size list of elements -of a primitive data type. 
It must annotate a `BYTE_ARRAY` primitive type. - -The `BYTE_ARRAY` data is interpreted as a variable size sequence of elements of -the same primitive data type. - ## Temporal Types ### DATE diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index af2915dd2..15df4a2f1 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -323,9 +323,6 @@ struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY[num_values 1: required Type type; // see LogicalTypes.md 2: required i32 num_values; } -struct VariableSizeListType { // allowed for BYTE_ARRAY, see LogicalTypes.md - 1: required Type type; -} /** * Logical type to annotate a column that is always null. @@ -502,7 +499,6 @@ union LogicalType { 17: GeometryType GEOMETRY // no compatible ConvertedType 18: GeographyType GEOGRAPHY // no compatible ConvertedType 19: FixedSizeListType FIXED_SIZE_LIST // no compatible ConvertedType - 20: VariableSizeListType VARIABLE_SIZE_LIST // no compatible ConvertedType } /** From d0d3567f6bd8740e9f67ad13c219015db9229a0a Mon Sep 17 00:00:00 2001 From: Rok Mihevc Date: Mon, 24 Jun 2024 19:16:47 +0200 Subject: [PATCH 6/7] Review feedback --- LogicalTypes.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/LogicalTypes.md b/LogicalTypes.md index 85d7c46a1..2bab369d8 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -259,10 +259,10 @@ The sort order for `FLOAT16` is signed (with special handling of NANs and signed ### FIXED_SIZE_LIST The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements -of a primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type. +of a non-array primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type. The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of -elements of the same primitive data type. +elements of the same primitive data type encoded with plain encoding. 
The sort order used for `FIXED_SIZE_LIST` is undefined. From bc3df18b190e4a792522e91b4c6504c190aabdc6 Mon Sep 17 00:00:00 2001 From: Rok Mihevc Date: Tue, 3 Mar 2026 00:48:00 +0100 Subject: [PATCH 7/7] make more verbose --- LogicalTypes.md | 48 ++++++++++++++++++++++++++++++---- src/main/thrift/parquet.thrift | 7 ++--- 2 files changed, 47 insertions(+), 8 deletions(-) diff --git a/LogicalTypes.md b/LogicalTypes.md index 2bab369d8..c55701a36 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -258,11 +258,49 @@ The sort order for `FLOAT16` is signed (with special handling of NANs and signed ### FIXED_SIZE_LIST -The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements -of a non-array primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type. - -The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of -elements of the same primitive data type encoded with plain encoding. +The `FIXED_SIZE_LIST` annotation represents a fixed-size sequence of elements +of the same non-array primitive physical type. It must annotate a +`FIXED_LEN_BYTE_ARRAY` primitive type and uses the `FixedSizeListType` +parameters: +* `type`: the primitive physical type of each element +* `num_values`: the number of elements in each list value + +`num_values` must be a positive integer. `type` must be a fixed-width primitive +physical type and must not be `BOOLEAN`, `INT96`, or `BYTE_ARRAY`. +Writers must not emit `FIXED_SIZE_LIST` metadata that violates these +constraints. Readers must treat violating metadata as invalid. 
+ +The annotated field's `type_length` must equal the encoded size, in bytes, of +`num_values` elements using the PLAIN representation of `type`: +* `INT32` or `FLOAT`: `type_length` = `num_values` * 4 +* `INT64` or `DOUBLE`: `type_length` = `num_values` * 8 +* `FIXED_LEN_BYTE_ARRAY`: `type_length` must be divisible by `num_values`; + each element is `type_length` / `num_values` bytes +Writers must not emit `FIXED_SIZE_LIST` metadata where `type_length` does not +match these rules. Readers must treat violating metadata as invalid. + +For example, a column of 128-element float vectors: + + optional fixed_len_byte_array(512) embeddings (FIXED_SIZE_LIST(type=FLOAT, num_values=128)); + +Each `FIXED_LEN_BYTE_ARRAY` value stores one fixed-size list value. +`FIXED_SIZE_LIST` is intentionally represented as a primitive leaf and does not +use the 3-level `LIST` structure. + +This annotation defines only the intra-value element layout. The surrounding +column encoding is unchanged: any encoding that supports +`FIXED_LEN_BYTE_ARRAY` may be used. Note that `BYTE_STREAM_SPLIT` operates at +the full `type_length` width, creating `type_length` byte-streams; this +naturally groups corresponding element bytes across rows. + +If the annotated field is `optional`, list values may be null. Individual +elements are always non-null and are not represented with their own definition +or repetition levels. Nested element types are not supported by +`FIXED_SIZE_LIST`; use `LIST` for nested or element-nullable data. + +Writers must validate that `type` is not `BOOLEAN`, `INT96`, or `BYTE_ARRAY` +and that `type_length` matches the expected size for the given `type` and +`num_values`. Readers should reject files that violate these constraints. The sort order used for `FIXED_SIZE_LIST` is undefined. 
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 15df4a2f1..a30844cee 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -319,9 +319,10 @@ struct ListType {} // see LogicalTypes.md struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8 struct DateType {} // allowed for INT32 struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md) -struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY[num_values * width of type], - 1: required Type type; // see LogicalTypes.md - 2: required i32 num_values; +struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY; see LogicalTypes.md + 1: required Type type; // element type (fixed-width primitive; must not be BOOLEAN, INT96, or BYTE_ARRAY) + 2: required i32 num_values; // number of elements in each value; must be > 0 + // Writers must not emit violating values. Readers must treat violating metadata as invalid. } /**
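Review note: the `type_length` rules added to `LogicalTypes.md` in patch 7/7 can be sketched as a small validation routine. This is only an illustration of the stated constraints, not code from parquet-format or any reader implementation. The names `PLAIN_WIDTHS`, `FORBIDDEN`, `element_width`, and `decode_floats` are hypothetical, and physical types are passed as plain strings rather than Thrift `Type` enum values. The decode helper assumes the little-endian IEEE 754 layout that PLAIN encoding uses for `FLOAT`.

```python
import struct

# PLAIN-encoded widths, in bytes, for the fixed-width physical types
# listed in the patch text (hypothetical lookup table for this sketch).
PLAIN_WIDTHS = {"INT32": 4, "FLOAT": 4, "INT64": 8, "DOUBLE": 8}

# Element types that the patch text says FIXED_SIZE_LIST must not use.
FORBIDDEN = {"BOOLEAN", "INT96", "BYTE_ARRAY"}


def element_width(element_type: str, num_values: int, type_length: int) -> int:
    """Return the per-element width in bytes, or raise ValueError when the
    metadata violates the constraints described in the patch."""
    if element_type in FORBIDDEN:
        raise ValueError(f"{element_type} is not allowed as an element type")
    if num_values <= 0:
        raise ValueError("num_values must be a positive integer")
    if element_type == "FIXED_LEN_BYTE_ARRAY":
        # type_length must be divisible by num_values.
        if type_length % num_values != 0:
            raise ValueError("type_length is not divisible by num_values")
        return type_length // num_values
    width = PLAIN_WIDTHS.get(element_type)
    if width is None:
        raise ValueError(f"unknown element type: {element_type}")
    # type_length must equal num_values * PLAIN width of the element type.
    if type_length != num_values * width:
        raise ValueError(
            f"type_length {type_length} != {num_values} * {width}"
        )
    return width


def decode_floats(value: bytes, num_values: int) -> tuple:
    """Split one FIXED_LEN_BYTE_ARRAY value into num_values FLOAT elements."""
    element_width("FLOAT", num_values, len(value))  # validate first
    return struct.unpack(f"<{num_values}f", value)
```

For the 128-element float vector example in the patch, `element_width("FLOAT", 128, 512)` yields a 4-byte element width, while a `type_length` of 500 would be rejected. Raising on invalid metadata mirrors the "readers must treat violating metadata as invalid" wording; a real reader might instead fall back to exposing the raw `FIXED_LEN_BYTE_ARRAY` bytes.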