48 changes: 48 additions & 0 deletions LogicalTypes.md
@@ -256,6 +256,54 @@ The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.

The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.

### FIXED_SIZE_LIST
Contributor

@JFinis JFinis May 16, 2024

Interesting choice to annotate a binary primitive field instead of a repeated group field. I see pros and cons with this design:

PROs:

  • Guarantees zero-copy, as the layout is defined to be just bytes. In contrast, if this annotated a group, a writer could decide to use a fancy per-value encoding (e.g., dictionary) and thus create a list that first has to be "decoded" before it can be used.
  • Guarantees that a list is always contained on one page instead of being split over multiple pages. Again, this helps in keeping decoders easy and guaranteeing zero copy.
  • This solves the problem of redundant R-Levels. Since it's just a primitive column, no r-level considerations have to be taken into account.

CONs:

  • Cannot create fixed size lists of nested types (e.g., list of structs). I see that this isn't necessary for tensors or embedding vectors, but shouldn't the feature be extensible for other scenarios as well? This limits the composability of the feature. I can now create a struct of fixed size lists, but not a fixed size list of structs.
  • Cannot have null elements in fixed size lists. This might not be desired for all lists, but there can be use cases where having null values in them is preferable.
  • Parquet has a concept for (non-fixed size) lists. It is conceptually weird that fixed size lists are totally different from (non-fixed size) lists.

I think the PROs outweigh the CONs here, so this is fine with me. I just want everyone to be aware of the ramifications.

Contributor

cc @tustvold, as you also brought up this point. I agree that having a new property of a repeated group would be more flexible, but it also comes at some cost, as outlined above. Also, it couldn't be just a logical type in this case, as a logical type cannot change the handling of R-Levels.

Member

I'm now feeling that maybe wrapping a Vector[PrimitiveType, Size] is also OK, but representing this is currently a bit weird in the model. May I ask whether a Vector could hold data like the following?

1. [1, 1, 1], [null, 1, 1] <-- data with null
2. null, [1, 1, 1] <-- null vector

And could a vector contain a "nested" vector?

Member Author

> • This solves the problem of redundant R-Levels. Since it's just a primitive column, no r-level considerations have to be taken into account.

This is the main reason I'd like to propose this type, see apache/arrow#34510.

> • Cannot create fixed size lists of nested types (e.g., list of structs). I see that this isn't necessary for tensors or embedding vectors, but shouldn't the feature be extensible for other scenarios as well? This limits the composability of the feature. I can now create a struct of fixed size lists, but not a fixed size list of structs.

Lack of composability is a downside, but I think it's still worth the compromise. I've not seen a need for fixed_size_list(struct) in tensor computing, but that's probably just because it's not available.

> • Cannot have null elements in fixed size lists. This might not be desired for all lists, but there can be use cases where having null values in them is preferable.

In tensor computation this is usually addressed with bitmasks, which can be stored as a fixed_size_list(binary, num_values).

> • Parquet has a concept for (non-fixed size) lists. It is conceptually weird that fixed size lists are totally different from (non-fixed size) lists.

Perhaps we should call this type FixedSizeArray to disambiguate?

> I'm now feeling that maybe wrapping a Vector[PrimitiveType, Size] is also OK, but representing this is currently a bit weird in the model. May I ask whether a Vector could hold data like the following?
>
> 1. [1, 1, 1], [null, 1, 1] <-- data with null
> 2. null, [1, 1, 1] <-- null vector
>
> And could a vector contain a "nested" vector?

I think case 2. is ok, but case 1. should be expressed with a separate null bitmask that's not part of the type.

Contributor

I am not sure what a "fixed size list of structs" even means. Would it mean that each struct has a known size (so that each element is fixed size)? 🤔 How would it work to have a fixed size list of structs where one of the structs was a (non-fixed-size) list? 🤔

In other words, I am not sure the composability of fixed size lists into different element types makes a lot of sense.

Contributor

> In other words, I am not sure the composability of fixed size lists into different element types makes a lot of sense.

+1 to this. I think this mostly comes up as a theoretical compatibility concern with Arrow, which places no such limitations.

Member Author

+1


The `FIXED_SIZE_LIST` annotation represents a fixed-size sequence of elements
of the same fixed-width primitive physical type. It must annotate a
`FIXED_LEN_BYTE_ARRAY` primitive type and uses the `FixedSizeListType`
parameters:
* `type`: the primitive physical type of each element
* `num_values`: the number of elements in each list value

`num_values` must be a positive integer. `type` must be a fixed-width primitive
physical type and must not be `BOOLEAN`, `INT96`, or `BYTE_ARRAY`.
Writers must not emit `FIXED_SIZE_LIST` metadata that violates these
constraints. Readers must treat violating metadata as invalid.

The annotated field's `type_length` must equal the encoded size, in bytes, of
`num_values` elements using the PLAIN representation of `type`:
* `INT32` or `FLOAT`: `type_length` = `num_values` * 4
* `INT64` or `DOUBLE`: `type_length` = `num_values` * 8
* `FIXED_LEN_BYTE_ARRAY`: `type_length` must be divisible by `num_values`;
  each element is `type_length` / `num_values` bytes

Writers must not emit `FIXED_SIZE_LIST` metadata where `type_length` does not
match these rules. Readers must treat violating metadata as invalid.
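
The size rules above can be sketched as a small validation routine. This is only an illustration in Python: the function name `element_width`, the string type names, and the lookup tables are hypothetical and do not correspond to any existing Parquet library API.

```python
# Hypothetical helper, not part of any Parquet implementation.
# Byte widths of the PLAIN representation for the fixed-width numeric types.
PLAIN_WIDTHS = {"INT32": 4, "FLOAT": 4, "INT64": 8, "DOUBLE": 8}
# Element types the annotation explicitly forbids.
DISALLOWED = {"BOOLEAN", "INT96", "BYTE_ARRAY"}


def element_width(element_type: str, num_values: int, type_length: int) -> int:
    """Return the per-element byte width, or raise ValueError for invalid metadata."""
    if num_values <= 0:
        raise ValueError("num_values must be a positive integer")
    if element_type in DISALLOWED:
        raise ValueError(f"{element_type} is not allowed as an element type")
    if element_type == "FIXED_LEN_BYTE_ARRAY":
        # Element width is derived from type_length, so it must divide evenly.
        if type_length <= 0 or type_length % num_values != 0:
            raise ValueError("type_length must be a positive multiple of num_values")
        return type_length // num_values
    width = PLAIN_WIDTHS.get(element_type)
    if width is None:
        raise ValueError(f"unknown element type {element_type}")
    if type_length != num_values * width:
        raise ValueError("type_length does not match num_values * element width")
    return width
```

A reader following this sketch would call `element_width("FLOAT", 128, 512)` once per column and treat any `ValueError` as invalid metadata.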

For example, a column of 128-element float vectors:

optional fixed_len_byte_array(512) embeddings (FIXED_SIZE_LIST(type=FLOAT, num_values=128));

Each `FIXED_LEN_BYTE_ARRAY` value stores one fixed-size list value.
`FIXED_SIZE_LIST` is intentionally represented as a primitive leaf and does not
use the 3-level `LIST` structure.
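
As a rough illustration of the layout, the following Python sketch packs and unpacks one such value with the standard `struct` module. It relies on the PLAIN representation of `FLOAT` being 4-byte IEEE 754 little-endian; the helper names `encode_float_list` and `decode_float_list` are hypothetical.

```python
import struct


def encode_float_list(values) -> bytes:
    """Pack floats into one FIXED_LEN_BYTE_ARRAY value (little-endian float32)."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_float_list(value: bytes, num_values: int) -> tuple:
    """Unpack one FIXED_LEN_BYTE_ARRAY value into num_values float elements."""
    if len(value) != num_values * 4:
        raise ValueError("value length does not match num_values * 4")
    return struct.unpack(f"<{num_values}f", value)


# One 128-element embedding occupies exactly 512 bytes.
embedding = [0.5 * i for i in range(128)]
raw = encode_float_list(embedding)
decoded = decode_float_list(raw, 128)
```

Because the value is a plain run of bytes, a reader could also reinterpret the buffer in place instead of copying it, which is the zero-copy property discussed in the review thread.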

This annotation defines only the intra-value element layout. The surrounding
column encoding is unchanged: any encoding that supports
`FIXED_LEN_BYTE_ARRAY` may be used. Note that `BYTE_STREAM_SPLIT` operates at
the full `type_length` width, creating `type_length` byte-streams; this
naturally groups corresponding element bytes across rows.
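
The effect of `BYTE_STREAM_SPLIT` on such a column can be sketched as follows. This is a simplified Python model of the transform only (stream `i` collects byte `i` of every value); the function name is hypothetical and page framing is ignored.

```python
import struct


def byte_stream_split(values: list, type_length: int) -> list:
    """Split fixed-length values into type_length streams, one per byte position."""
    for v in values:
        if len(v) != type_length:
            raise ValueError("every value must be exactly type_length bytes")
    return [bytes(v[i] for v in values) for i in range(type_length)]


# Two rows, each a list of two INT32 elements (type_length = 8).
rows = [struct.pack("<2i", 1, 2), struct.pack("<2i", 3, 4)]
streams = byte_stream_split(rows, 8)
# streams[0] holds the lowest byte of element 0 from every row,
# streams[4] holds the lowest byte of element 1 from every row.
```

In this model, bytes at the same position of the same element index end up adjacent across rows, which is the grouping the paragraph above describes.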

If the annotated field is `optional`, list values may be null. Individual
elements are always non-null and are not represented with their own definition
or repetition levels. Nested element types are not supported by
`FIXED_SIZE_LIST`; use `LIST` for nested or element-nullable data.
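
The review discussion suggests that element-level nulls, where needed, can be carried in a separate bitmask rather than in the type itself. The Python sketch below shows one way a writer might build such a mask. The helper names, the fill value, and the least-significant-bit-first bit order are assumptions for illustration and are not defined by this annotation.

```python
def pack_validity(values) -> bytes:
    """Build a bitmask with bit i set when element i is non-null (LSB-first, assumed)."""
    mask = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)


def fill_nulls(values, fill=0.0) -> list:
    """Replace null elements with a placeholder so the list stays fixed-size."""
    return [fill if v is None else v for v in values]


vector = [None, 1.0, 1.0]
mask = pack_validity(vector)   # stored in a companion column
dense = fill_nulls(vector)     # stored in the FIXED_SIZE_LIST column
```

The mask and the dense values would live in two separate columns, so the `FIXED_SIZE_LIST` column itself never needs per-element definition levels.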

Writers must validate that `type` is not `BOOLEAN`, `INT96`, or `BYTE_ARRAY`
and that `type_length` matches the expected size for the given `type` and
`num_values`. Readers must treat files that violate these constraints as invalid.

The sort order used for `FIXED_SIZE_LIST` is undefined.

## Temporal Types

### DATE
24 changes: 15 additions & 9 deletions src/main/thrift/parquet.thrift
@@ -319,6 +319,11 @@ struct ListType {} // see LogicalTypes.md
struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
struct DateType {} // allowed for INT32
struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md)
struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY; see LogicalTypes.md
1: required Type type; // element type (fixed-width primitive; must not be BOOLEAN, INT96, or BYTE_ARRAY)

It might make sense to introduce a new enum for the list element types. The Type enum does not distinguish smaller integer types, signed/unsigned types or the float16 type.

Member Author

Good point, decimal is another type we'd lose annotation for. To avoid a new enum, how about an optional LogicalType:

struct FixedSizeListType {
    1: required Type type;        // element type (fixed-width primitive)
    2: required i32 num_values;
    3: optional LogicalType element_logical_type; // optional semantic annotation of elements
}

Adding a logical type could work, and it would then even support nested lists or matrices. It's not immediately obvious, but Type could not support that since the length of FIXED_LEN_BYTE_ARRAY is stored in SchemaElement.

What I don't like is that here the logical type is used to influence the physical layout, whereas elsewhere a PLAIN encoded INT32 with logical type INT_8 would still be stored using 4 bytes.

Hm, thinking out loud a bit, the physical width is already defined by type_length of FIXED_LEN_BYTE_ARRAY / num_values. The logical type should then be enough to interpret these bytes, without the Type field. The only blocker for that is that there is no logical type annotation to indicate FLOAT or DOUBLE.

Member Author

> The only blocker for that is that there is no logical type annotation to indicate FLOAT or DOUBLE.

Yes, I think we need either Type and Enum as you originally suggested or Type and optional LogicalType. I slightly prefer LogicalType because we already define it. Shall I update the language to sketch the LogicalType path?

2: required i32 num_values; // number of elements in each value; must be > 0
// Writers must not emit violating values. Readers must treat violating metadata as invalid.
}

/**
* Logical type to annotate a column that is always null.
@@ -485,15 +490,16 @@ union LogicalType {
8: TimestampType TIMESTAMP

// 9: reserved for INTERVAL
- 10: IntType INTEGER // use ConvertedType INT_* or UINT_*
- 11: NullType UNKNOWN // no compatible ConvertedType
- 12: JsonType JSON // use ConvertedType JSON
- 13: BsonType BSON // use ConvertedType BSON
- 14: UUIDType UUID // no compatible ConvertedType
- 15: Float16Type FLOAT16 // no compatible ConvertedType
- 16: VariantType VARIANT // no compatible ConvertedType
- 17: GeometryType GEOMETRY // no compatible ConvertedType
- 18: GeographyType GEOGRAPHY // no compatible ConvertedType
+ 10: IntType INTEGER // use ConvertedType INT_* or UINT_*
+ 11: NullType UNKNOWN // no compatible ConvertedType
+ 12: JsonType JSON // use ConvertedType JSON
+ 13: BsonType BSON // use ConvertedType BSON
+ 14: UUIDType UUID // no compatible ConvertedType
+ 15: Float16Type FLOAT16 // no compatible ConvertedType
+ 16: VariantType VARIANT // no compatible ConvertedType
+ 17: GeometryType GEOMETRY // no compatible ConvertedType
+ 18: GeographyType GEOGRAPHY // no compatible ConvertedType
+ 19: FixedSizeListType FIXED_SIZE_LIST // no compatible ConvertedType
}

/**