Conversation
…nd row range

Problem: Parquet reads were all-or-nothing. Users could not subset columns at read time, control timestamp-to-day conversion, or subset rows while loading. This issue also required preserving current behavior for existing callers.

Solution: introduce ParquetReadOptions (selectedColumns, timestampPolicy, rowRange) plus defaultParquetReadOptions. Add readParquetWithOpts/readParquetFilesWithOpts and keep readParquet/readParquetFiles as default-option wrappers. Wire selectedColumns into decode-time filtering with a fail-fast ColumnNotFoundException for missing requested columns. Wire timestampPolicy with PreserveTimestampPrecision and CoerceTimestampToDay behaviors, including fallback coercion for already-decoded UTCTime columns. Wire rowRange through the reader and apply global rowRange semantics for readParquetFilesWithOpts after concatenation.

Tradeoffs and rationale: chose an options record instead of multiple specialized APIs to keep extension points coherent and avoid API sprawl. Kept legacy conversion wrappers/helpers (applyLogicalType and UTC helpers) to reduce compatibility risk for existing/internal call paths. Read-time projection improves performance by skipping decode of unselected chunks; rowRange currently uses post-read slicing semantics (start inclusive, end exclusive) for correctness and consistency with existing range behavior.

Verification: added focused Parquet tests for selectedColumns, rowRange, timestampPolicy coercion, and missing-selected-column errors; ran the full suite successfully via cabal test (all passing).
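The options record and its defaults described above can be sketched in a self-contained form. Field names follow the summary, but this is an assumption about the shape, not the merged API, and String stands in for T.Text to keep the sketch dependency-free:

```haskell
-- Hypothetical shape of the read-options record from the summary.
data ParquetTimestampPolicy
  = PreserveTimestampPrecision
  | CoerceTimestampToDay
  deriving (Eq, Show)

data ParquetReadOptions = ParquetReadOptions
  { selectedColumns :: Maybe [String]   -- Nothing = decode all columns
  , timestampPolicy :: ParquetTimestampPolicy
  , rowRange        :: Maybe (Int, Int) -- start inclusive, end exclusive
  }

-- Defaults preserve the legacy behavior of readParquet/readParquetFiles.
defaultParquetReadOptions :: ParquetReadOptions
defaultParquetReadOptions =
  ParquetReadOptions Nothing PreserveTimestampPrecision Nothing

-- Post-read slicing semantics illustrated over a plain list of rows.
sliceRows :: Maybe (Int, Int) -> [a] -> [a]
sliceRows Nothing       rows = rows
sliceRows (Just (s, e)) rows = take (e - s) (drop s rows)
```

With these defaults, a wrapper like readParquet can simply delegate to the WithOpts variant and keep existing callers unchanged.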
Apply formatter-driven layout updates in Parquet read-options code and related tests. No behavior change; this commit is formatting-only after lint/format checks.
I tried to implement read options with backward compatibility as a core constraint, so existing functions and the default behavior should remain unchanged while adding the new feature. Let me know if the test cases make sense or if I need more.
data ParquetTimestampPolicy
  = PreserveTimestampPrecision
  | CoerceTimestampToDay
This doesn't seem as fundamental. Let's hold off on it.
  deriving (Eq, Show)

data ParquetReadOptions = ParquetReadOptions
  { selectedColumns :: Maybe [T.Text]
Parquet is useful for predicates as read options, so as you're reading a file (or a series of files) you can do some filtering:

let x = F.col @(Maybe Text) "x"
let opts = defaultParquetReadOpts
      { predicate = x ./= Nothing .&& (x .<= F.lit (Just 100)) }
D.readParquetWithOpts

This will be extremely useful for reading globs.
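A self-contained sketch of that predicate idea, using a plain row function in place of the expression DSL. Row, ReadOpts, and readWithOpts are all illustrative names, not the library's API:

```haskell
-- One nullable column, standing in for a decoded Parquet row.
data Row = Row { x :: Maybe Int } deriving (Eq, Show)

-- A read-options record carrying a row predicate.
newtype ReadOpts = ReadOpts { predicate :: Row -> Bool }

defaultReadOpts :: ReadOpts
defaultReadOpts = ReadOpts (const True)  -- keep every row by default

-- Apply the predicate as rows are produced, mirroring
--   predicate = x ./= Nothing .&& (x .<= F.lit (Just 100))
readWithOpts :: ReadOpts -> [Row] -> [Row]
readWithOpts opts = filter (predicate opts)
```

In the real reader the filter would run per row group or per file, which is what makes it pay off when reading globs.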
Nothing -> True
Just selected ->
  let fullPath = T.intercalate "." (map T.pack colPath)
   in colName `S.member` selected || fullPath `S.member` selected
Let's not worry about nested fields for now. The reader doesn't even have a good way to support them.
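Under that simplification, the check could reduce to a leaf-name lookup. This is a sketch: keepColumn is an illustrative name, and String stands in for T.Text to keep it dependency-free:

```haskell
import qualified Data.Set as S

-- Top-level-only version of the selection check, per the review:
-- keep the leaf-name match and drop the dotted fullPath branch.
keepColumn :: Maybe (S.Set String) -> String -> Bool
keepColumn Nothing         _       = True  -- no projection: decode everything
keepColumn (Just selected) colName = colName `S.member` selected
```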
pure (applyRowRange opts (mconcat dfs))
|
|
applyRowRange :: ParquetReadOptions -> DataFrame -> DataFrame
applyRowRange opts df = case rowRange opts of
nit:
fmap (DS.range df) (rowRange opts)
Or use <$>.
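Spelled out over a list stand-in for DataFrame. The range helper here is an assumption with the same start-inclusive, end-exclusive semantics; note the fmap result is a Maybe, so the full replacement also needs fromMaybe:

```haskell
import Data.Maybe (fromMaybe)

-- Assumed slice with (start inclusive, end exclusive) semantics.
range :: (Int, Int) -> [a] -> [a]
range (s, e) = take (e - s) . drop s

-- The case expression from the diff ...
applyRowRangeCase :: Maybe (Int, Int) -> [a] -> [a]
applyRowRangeCase mr df = case mr of
  Nothing -> df
  Just r  -> range r df

-- ... collapses to fmap (here <$>) plus fromMaybe, per the nit:
applyRowRange :: Maybe (Int, Int) -> [a] -> [a]
applyRowRange mr df = fromMaybe df ((`range` df) <$> mr)
```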
TestCase
  ( assertEqual
      "rowRangeWithOpts"
      (D.range (2, 5) allTypes)
This is circular: if the range function is broken and produces the wrong result, this will still pass. Should just test against the expected dimensions.
  )
)
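A dimension-based variant of that assertion, sketched with plain lists standing in for the frame (rowsAfterRange is an illustrative helper, not library API):

```haskell
-- A (2, 5) rowRange must yield exactly 3 rows whatever their
-- contents, which avoids comparing against D.range's own output.
rowsAfterRange :: (Int, Int) -> [a] -> Int
rowsAfterRange (s, e) = length . take (e - s) . drop s
```

The real test would assert the expected row count (and column count) of the frame returned by readParquetWithOpts directly.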
|
|
missingSelectedColumnWithOpts :: Test
Thanks. This was a good implementation.