-
Notifications
You must be signed in to change notification settings - Fork 473
PARQUET-2474: Add FIXED_SIZE_LIST logical type #241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
a2c8fff
0df224a
ee67f48
2739536
f713506
d0d3567
bc3df18
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -319,6 +319,11 @@ struct ListType {} // see LogicalTypes.md | |
| struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8 | ||
| struct DateType {} // allowed for INT32 | ||
| struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md) | ||
| struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY; see LogicalTypes.md | ||
| 1: required Type type; // element type (fixed-width primitive; must not be BOOLEAN, INT96, or BYTE_ARRAY) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It might make sense to introduce a new enum for the list element types. The
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point, decimal is another type we'd lose annotation for. To avoid a new enum, how about struct FixedSizeListType {
1: required Type type; // element type (fixed-width primitive)
2: required i32 num_values;
3: optional LogicalType element_logical_type; // optional semantic annotation of elements,
}There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Adding a logical type could work, and it would then even support nested lists or matrices. It's not immediately obvious, but What I don't like is that here, the logical type is used to influence the physical layout, where as elsewhere, a PLAIN encoded INT32 with logical type INT_8 would still be stored using 4 bytes. Hm, thinking out loud a bit, the physical width is already defined by
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, I think we need either |
||
| 2: required i32 num_values; // number of elements in each value; must be > 0 | ||
| // Writers must not emit violating values. Readers must treat violating metadata as invalid. | ||
| } | ||
|
|
||
| /** | ||
| * Logical type to annotate a column that is always null. | ||
|
|
@@ -485,15 +490,16 @@ union LogicalType { | |
| 8: TimestampType TIMESTAMP | ||
|
|
||
| // 9: reserved for INTERVAL | ||
| 10: IntType INTEGER // use ConvertedType INT_* or UINT_* | ||
| 11: NullType UNKNOWN // no compatible ConvertedType | ||
| 12: JsonType JSON // use ConvertedType JSON | ||
| 13: BsonType BSON // use ConvertedType BSON | ||
| 14: UUIDType UUID // no compatible ConvertedType | ||
| 15: Float16Type FLOAT16 // no compatible ConvertedType | ||
| 16: VariantType VARIANT // no compatible ConvertedType | ||
| 17: GeometryType GEOMETRY // no compatible ConvertedType | ||
| 18: GeographyType GEOGRAPHY // no compatible ConvertedType | ||
| 10: IntType INTEGER // use ConvertedType INT_* or UINT_* | ||
| 11: NullType UNKNOWN // no compatible ConvertedType | ||
| 12: JsonType JSON // use ConvertedType JSON | ||
| 13: BsonType BSON // use ConvertedType BSON | ||
| 14: UUIDType UUID // no compatible ConvertedType | ||
| 15: Float16Type FLOAT16 // no compatible ConvertedType | ||
| 16: VariantType VARIANT // no compatible ConvertedType | ||
| 17: GeometryType GEOMETRY // no compatible ConvertedType | ||
| 18: GeographyType GEOGRAPHY // no compatible ConvertedType | ||
| 19: FixedSizeListType FIXED_SIZE_LIST // no compatible ConvertedType | ||
| } | ||
|
|
||
| /** | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting choice to annotate a binary primitive field instead of a repeated group field. I see pros and cons with this design:
PROs:
CONs:
I think the PROs outweigh the CONs here, so I think this is fine with me. I just want everyone to be aware about the ramifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @tustvold, as you also brought up this point. I agree that having a new property of a repeated group would be more flexible, but it also comes at some cost, as outlined above. Also, it couldn't be just a logical type in this case, as a logical type cannot change the handling of R-Levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now feeling that maybe wrapping a
Vector[PrimitiveType, Size]is also ok, but currently representing this is a bitweird in the model. May I ask would aVectorhaving data below?And would vector contains a "nested" vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main reason I'd like to propose this type, see apache/arrow#34510.
Lack of composability is a downside, but I think it's still worth the compromise. I've not seen need for fixed_size_list(struct) in tensor computing, but that's probably just because it's not available.
In tensor computation this is usually addressed with bitmasks, which can be stored as a
fixed_size_list(binary, num_values).Perhaps we should call this type
FixedSizeArrayto disambiguate?I think case 2. is ok, but case 1. should be expressed with a separate null bitmask that's not part of the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not even sure what a "fixed sized list of structs" even means. Would it mean that each struct has a known size (so that each element is fixed size 🤔 ). How would that work to have a fixed size list of structs where one of the structs was a (non fixed size) list 🤔
In other words, I am not sure the composeability of fixed size list into different element types makes a lot of sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to this. I think this comes up as a theoretical compatibility with arrow things, where Arrow places no such limitations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1