diff --git a/LogicalTypes.md b/LogicalTypes.md
index e7a0ce04..c55701a3 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -256,6 +256,54 @@ The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size sequence of elements
+of the same non-array primitive physical type. It must annotate a
+`FIXED_LEN_BYTE_ARRAY` primitive type and uses the `FixedSizeListType`
+parameters:
+* `type`: the primitive physical type of each element
+* `num_values`: the number of elements in each list value
+
+`num_values` must be a positive integer. `type` must be a fixed-width primitive
+physical type and must not be `BOOLEAN`, `INT96`, or `BYTE_ARRAY`.
+Writers must not emit `FIXED_SIZE_LIST` metadata that violates these
+constraints. Readers must treat violating metadata as invalid.
+
+The annotated field's `type_length` must equal the encoded size, in bytes, of
+`num_values` elements using the PLAIN representation of `type`:
+* `INT32` or `FLOAT`: `type_length` = `num_values` * 4
+* `INT64` or `DOUBLE`: `type_length` = `num_values` * 8
+* `FIXED_LEN_BYTE_ARRAY`: `type_length` must be divisible by `num_values`;
+  each element is `type_length` / `num_values` bytes
+Writers must not emit `FIXED_SIZE_LIST` metadata where `type_length` does not
+match these rules. Readers must treat violating metadata as invalid.
+
+For example, a column of 128-element float vectors:
+
+    optional fixed_len_byte_array(512) embeddings (FIXED_SIZE_LIST(type=FLOAT, num_values=128));
+
+Each `FIXED_LEN_BYTE_ARRAY` value stores one fixed-size list value.
+`FIXED_SIZE_LIST` is intentionally represented as a primitive leaf and does not
+use the 3-level `LIST` structure.
+
+This annotation defines only the intra-value element layout. The surrounding
+column encoding is unchanged: any encoding that supports
+`FIXED_LEN_BYTE_ARRAY` may be used. Note that `BYTE_STREAM_SPLIT` operates at
+the full `type_length` width, creating `type_length` byte-streams; this
+naturally groups corresponding element bytes across rows.
+
+If the annotated field is `optional`, list values may be null. Individual
+elements are always non-null and are not represented with their own definition
+or repetition levels. Nested element types are not supported by
+`FIXED_SIZE_LIST`; use `LIST` for nested or element-nullable data.
+
+Writers must validate that `type` is not `BOOLEAN`, `INT96`, or `BYTE_ARRAY`
+and that `type_length` matches the expected size for the given `type` and
+`num_values`. Readers should reject files that violate these constraints.
+
+The sort order used for `FIXED_SIZE_LIST` is undefined.
+
 ## Temporal Types
 
 ### DATE
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index a9e62cc0..a30844ce 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -319,6 +319,11 @@ struct ListType {} // see LogicalTypes.md
 struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct DateType {} // allowed for INT32
 struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md)
+struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY; see LogicalTypes.md
+  1: required Type type; // element type (fixed-width primitive; must not be BOOLEAN, INT96, or BYTE_ARRAY)
+  2: required i32 num_values; // number of elements in each value; must be > 0
+  // Writers must not emit violating values. Readers must treat violating metadata as invalid.
+}
 
 /**
  * Logical type to annotate a column that is always null.
@@ -485,15 +490,16 @@ union LogicalType {
   8: TimestampType TIMESTAMP
 
   // 9: reserved for INTERVAL
-  10: IntType INTEGER         // use ConvertedType INT_* or UINT_*
-  11: NullType UNKNOWN        // no compatible ConvertedType
-  12: JsonType JSON           // use ConvertedType JSON
-  13: BsonType BSON           // use ConvertedType BSON
-  14: UUIDType UUID           // no compatible ConvertedType
-  15: Float16Type FLOAT16     // no compatible ConvertedType
-  16: VariantType VARIANT     // no compatible ConvertedType
-  17: GeometryType GEOMETRY   // no compatible ConvertedType
-  18: GeographyType GEOGRAPHY // no compatible ConvertedType
+  10: IntType INTEGER                   // use ConvertedType INT_* or UINT_*
+  11: NullType UNKNOWN                  // no compatible ConvertedType
+  12: JsonType JSON                     // use ConvertedType JSON
+  13: BsonType BSON                     // use ConvertedType BSON
+  14: UUIDType UUID                     // no compatible ConvertedType
+  15: Float16Type FLOAT16               // no compatible ConvertedType
+  16: VariantType VARIANT               // no compatible ConvertedType
+  17: GeometryType GEOMETRY             // no compatible ConvertedType
+  18: GeographyType GEOGRAPHY           // no compatible ConvertedType
+  19: FixedSizeListType FIXED_SIZE_LIST // no compatible ConvertedType
 }
 
 /**
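
The `type_length` rules and the `FLOAT` example in the `FIXED_SIZE_LIST` section above can be checked with a small sketch. This is an illustration under stated assumptions, not part of the patch or of any Parquet library: `element_width`, `pack_floats`, and `unpack_floats` are hypothetical helper names, and element types are passed as plain strings rather than Thrift `Type` values. It relies on PLAIN encoding `FLOAT` as 4-byte little-endian IEEE 754 values.

```python
import struct

# PLAIN-encoded width, in bytes, of each fixed-width element type named in the patch.
PLAIN_WIDTH = {"INT32": 4, "FLOAT": 4, "INT64": 8, "DOUBLE": 8}
# Element types the patch explicitly forbids.
REJECTED = {"BOOLEAN", "INT96", "BYTE_ARRAY"}


def element_width(elem_type, num_values, type_length):
    """Return the per-element byte width, or raise ValueError for invalid metadata.

    Hypothetical helper: mirrors the constraints stated in the patch text.
    """
    if num_values <= 0:
        raise ValueError("num_values must be a positive integer")
    if elem_type in REJECTED:
        raise ValueError(f"{elem_type} is not a permitted element type")
    if elem_type == "FIXED_LEN_BYTE_ARRAY":
        # The element width is derived from type_length, which must divide evenly.
        if type_length % num_values != 0:
            raise ValueError("type_length must be divisible by num_values")
        return type_length // num_values
    if elem_type not in PLAIN_WIDTH:
        raise ValueError(f"unknown element type {elem_type}")
    width = PLAIN_WIDTH[elem_type]
    if type_length != num_values * width:
        raise ValueError(
            f"type_length must be {num_values * width}, got {type_length}"
        )
    return width


def pack_floats(values):
    """Encode a list of floats as one FIXED_LEN_BYTE_ARRAY value (little-endian float32)."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_floats(raw, num_values):
    """Decode one FIXED_LEN_BYTE_ARRAY value back into num_values floats."""
    return list(struct.unpack(f"<{num_values}f", raw))


if __name__ == "__main__":
    # The example column from the patch: 128 FLOAT elements in 512 bytes.
    print(element_width("FLOAT", 128, 512))
    print(len(pack_floats([0.5] * 128)))
```

For the `embeddings` column in the patch, `element_width("FLOAT", 128, 512)` accepts the metadata, while a `type_length` of 500 with the same parameters would be rejected.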