Skip to content

feat(arrow-avro): HeaderInfo to expose OCF header#9548

Merged
alamb merged 5 commits intoapache:mainfrom
mzabaluev:avro-reader-builder-with-header-info
Mar 18, 2026
Merged

feat(arrow-avro): HeaderInfo to expose OCF header#9548
alamb merged 5 commits intoapache:mainfrom
mzabaluev:avro-reader-builder-with-header-info

Conversation

@mzabaluev
Copy link
Contributor

@mzabaluev mzabaluev commented Mar 13, 2026

Which issue does this PR close?

Rationale for this change

Rework of #9462 along the lines proposed in #9462 (comment).

What changes are included in this PR?

Add HeaderInfo as a cheaply cloneable value to expose header information parsed from an Avro OCF file.

Add read_header_info function to the reader module, and its async counterpart to the reader::async_reader module, to read the header from the file reader and return HeaderInfo.

Add build_with_header method to async reader builder to enable reuse of the header with multiple readers.

Are these changes tested?

Added a test for the async reader.

Are there any user-facing changes?

New API in arrow-avro:

  • reader::HeaderInfo
  • reader::read_header_info and reader::async_reader::read_header_info
  • build_with_header method of AvroAsyncFileReader's builder.

Add HeaderInfo to expose OCF header information such as
the writer schema and sync marker.

Add read_header_info function to the reader module, and
its async counterpart to the reader::async_reader module,
to read the header from the file reader and return HeaderInfo.

Add build_with_header method to async reader builder to enable
reuse of the header with multiple readers.
@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Mar 13, 2026
@mzabaluev mzabaluev changed the title feat(arrow-avro): HeaderInfo to expose OCF header feat(arrow-avro): HeaderInfo to expose OCF header Mar 13, 2026
@mzabaluev mzabaluev marked this pull request as ready for review March 14, 2026 10:43
@mzabaluev
Copy link
Contributor Author

I have not added a method to the sync reader builder yet, because I have some doubts about the API. The implementation uses the sequential BufRead, which means that the header length information is not useful: the only supported case is to read the header from the same BufRead object first, making sure that the read position is immediately after the header.

When some seeking capabilities are added to the sync reader to work with ranges, this may need to be revisited.

@alamb
Copy link
Contributor

alamb commented Mar 17, 2026

@jecsand838 can you help review this PR?

Copy link
Contributor

@jecsand838 jecsand838 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mzabaluev LMGTM!

Just left a few comments regarding the doc comments and some duplicated code.

Comment on lines 186 to 209
let (header, header_len) =
read_header(&mut self.reader, self.file_size, self.header_size_hint).await?;
self.build_internal(&header, header_len)
}

/// Build the asynchronous Avro reader with the provided header.
///
/// This allows initializing the reader with pre-parsed header information.
/// Note that this method is not async because it does not need to perform any I/O operations.
pub fn build_with_header(
self,
header_info: HeaderInfo,
) -> Result<AsyncAvroFileReader<R>, AvroError> {
self.build_internal(header_info.header(), header_info.header_len())
}

fn build_internal(
self,
header: &Header,
header_len: u64,
) -> Result<AsyncAvroFileReader<R>, AvroError> {
let writer_schema = {
let raw = header.get(SCHEMA_METADATA_KEY).ok_or_else(|| {
AvroError::ParseError("No Avro schema present in file header".to_string())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You maybe able to get rid of the duplicated writer_schema code by doing something like this:

Suggested change
let (header, header_len) =
read_header(&mut self.reader, self.file_size, self.header_size_hint).await?;
self.build_internal(&header, header_len)
}
/// Build the asynchronous Avro reader with the provided header.
///
/// This allows initializing the reader with pre-parsed header information.
/// Note that this method is not async because it does not need to perform any I/O operations.
pub fn build_with_header(
self,
header_info: HeaderInfo,
) -> Result<AsyncAvroFileReader<R>, AvroError> {
self.build_internal(header_info.header(), header_info.header_len())
}
fn build_internal(
self,
header: &Header,
header_len: u64,
) -> Result<AsyncAvroFileReader<R>, AvroError> {
let writer_schema = {
let raw = header.get(SCHEMA_METADATA_KEY).ok_or_else(|| {
AvroError::ParseError("No Avro schema present in file header".to_string())
let header_info =
read_header_info(&mut self.reader, self.file_size, self.header_size_hint).await?;
self.build_internal(&header_info)
}
/// Build the asynchronous Avro reader with the provided header.
///
/// This allows initializing the reader with pre-parsed header information.
/// Note that this method is not async because it does not need to perform any I/O operations.
pub fn build_with_header(
self,
header_info: HeaderInfo,
) -> Result<AsyncAvroFileReader<R>, AvroError> {
self.build_internal(&header_info)
}
fn build_internal(
self,
header: &HeaderInfo,
) -> Result<AsyncAvroFileReader<R>, AvroError> {
let writer_schema = header.writer_schema()?;

I know it adds the Arc allocation in HeaderInfo::new when the caller doesn't need to share the header, but that seems like a negligible cost for a once-per-file operation. Otherwise you could abstract the writer_schema() logic onto Header for re-use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this will simplify things for sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 4e57460

Comment on lines +198 to +199
) -> Result<AsyncAvroFileReader<R>, AvroError> {
self.build_internal(header_info.header(), header_info.header_len())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do you think we should add the empty file check here as well?

Suggested change
) -> Result<AsyncAvroFileReader<R>, AvroError> {
self.build_internal(header_info.header(), header_info.header_len())
) -> Result<AsyncAvroFileReader<R>, AvroError> {
if self.file_size == 0 {
return Err(AvroError::InvalidArgument("File size cannot be 0".into()));
}
self.build_internal(header_info.header(), header_info.header_len())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it guards a degenerate case in try_build where we'd attempt to read the header even though the file size is 0.

If we receive the header, we trust the length anyway and if the actual length is 0, the reading will fail elsewhere.

@@ -1273,7 +1275,7 @@ impl ReaderBuilder {
/// the discovered writer (and optional reader) schema, and prepares to iterate blocks,
/// decompressing if necessary.
pub fn build<R: BufRead>(self, mut reader: R) -> Result<Reader<R>, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think there's value in adding the HeaderInfo to the sync Reader, but that can be added in the future.

mzabaluev and others added 2 commits March 18, 2026 12:38
Co-authored-by: Connor Sanders <170039284+jecsand838@users.noreply.github.com>
Specifically, use the read_header_info function and reuse
the writer_schema method in the async reader's builder,
sacrificing some negligible performance loss on allocating and
constructing a temporary Arc in the HeaderInfo.
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mzabaluev and @jecsand838

@alamb alamb merged commit c4b43bb into apache:main Mar 18, 2026
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-avro arrow-avro crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose Avro writer schema when building the reader

4 participants