
feat: add ORC format reader/writer utilities #122

Open
lucasfang wants to merge 1 commit into apache:main from lucasfang:migrate_8

Conversation

@lucasfang

Contributor

Purpose

Linked issue: No linked issue

This change adds ORC format reader and writer utilities for batch processing Paimon data files.

Included changes:

  • ORC File Batch Reader:

    • Adds OrcFileBatchReader header and implementation for batch reading ORC files
    • Adds comprehensive test coverage in orc_file_batch_reader_test.cpp
  • ORC Format Writer:

    • Adds OrcFormatWriter header and implementation for writing ORC format files
    • Adds test coverage in orc_format_writer_test.cpp
  • ORC Reader Wrapper:

    • Adds OrcReaderWrapper header and implementation as a wrapper utility for ORC readers
    • Adds test coverage in orc_reader_wrapper_test.cpp

Tests

Not run. This PR description does not include results from a local compile, CMake configuration, or gtest run.

Test coverage included in this change:

  • OrcFileBatchReaderTest
  • OrcFormatWriterTest
  • OrcReaderWrapperTest

API and Format

No public API, storage format, or protocol changes.

All added files are under src/paimon/format/orc/ (internal implementation). No headers added under include/.

Documentation

No documentation changes required.

Generative AI tooling

Migrate-by: Aone Copilot (Qwen3.7-Max)

Copilot AI review requested due to automatic review settings June 25, 2026 06:29

Copilot AI left a comment


Pull request overview

This PR introduces internal ORC format support under src/paimon/format/orc/, adding an ORC batch reader, ORC writer, and a reader wrapper, along with unit tests intended to validate reading/writing and prefetch-related behaviors.

Changes:

  • Add OrcFileBatchReader for batch reading ORC files (prefetch-capable reader interface).
  • Add OrcFormatWriter for writing ORC files with configurable options/metrics.
  • Add OrcReaderWrapper helper around ORC reader APIs, plus new gtest suites for all components.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/paimon/format/orc/orc_reader_wrapper.h | Declares a wrapper around ORC reader/row-reader to support batch reads + prefetch helpers. |
| src/paimon/format/orc/orc_reader_wrapper.cpp | Implements wrapper operations (seek, schema setup, next batch read). |
| src/paimon/format/orc/orc_reader_wrapper_test.cpp | Adds unit test for wrapper row-position tracking. |
| src/paimon/format/orc/orc_format_writer.h | Declares ORC FormatWriter implementation and option helpers. |
| src/paimon/format/orc/orc_format_writer.cpp | Implements ORC writer creation, batch conversion, flushing/finishing, and option handling. |
| src/paimon/format/orc/orc_format_writer_test.cpp | Adds unit tests for writing ORC and validating writer options/compression mapping. |
| src/paimon/format/orc/orc_file_batch_reader.h | Declares ORC PrefetchFileBatchReader implementation and schema/predicate setup. |
| src/paimon/format/orc/orc_file_batch_reader.cpp | Implements ORC reader creation, schema validation, predicate wiring, and metrics. |
| src/paimon/format/orc/orc_file_batch_reader_test.cpp | Adds extensive unit tests for schema selection, dictionaries, complex types, timestamps, etc. |


Comment on lines +83 to +86

        OrcAdapter::AppendBatch(target_type_, orc_batch.get(), arrow_pool_.get()));
    PAIMON_RETURN_NOT_OK_FROM_ARROW(arrow::ExportArray(*array, c_array.get(), c_schema.get()));
    next_row_ = GetRowNumber() + orc_batch->numElements;
    guard.Release();
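
The excerpt above advances `next_row_` by the size of the batch that was just read. The bookkeeping can be sketched in isolation as follows. `RowPositionTracker`, its method names, and the `kNoBatch` sentinel are hypothetical and not part of this PR, and the end-of-file behaviour (positions left unchanged) is an assumption of the sketch rather than a statement about `OrcReaderWrapper`.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the wrapper's row bookkeeping; not the PR's class.
class RowPositionTracker {
 public:
  // Sentinel meaning "no batch has been read yet" (an assumption of this sketch).
  static constexpr uint64_t kNoBatch = std::numeric_limits<uint64_t>::max();

  explicit RowPositionTracker(uint64_t total_rows) : total_rows_(total_rows) {}

  // Simulates reading up to batch_size rows and returns how many were read.
  uint64_t ReadBatch(uint64_t batch_size) {
    const uint64_t remaining = total_rows_ - next_row_;
    const uint64_t rows_read = std::min(batch_size, remaining);
    if (rows_read == 0) {
      return 0;  // End of file: positions stay unchanged in this sketch.
    }
    previous_batch_first_row_ = next_row_;
    // Same shape as the excerpt: first row of the batch plus its element count.
    next_row_ = previous_batch_first_row_ + rows_read;
    return rows_read;
  }

  uint64_t PreviousBatchFirstRow() const { return previous_batch_first_row_; }
  uint64_t NextRowToRead() const { return next_row_; }

 private:
  uint64_t total_rows_;
  uint64_t next_row_ = 0;
  uint64_t previous_batch_first_row_ = kNoBatch;
};
```

With 8 rows and a batch size of 3, the tracker reports first rows 0, 3, and 6 for the three batches, and `NextRowToRead()` ends at 8.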

Comment on lines +157 to +160

    EXPECT_OK_AND_ASSIGN(
        auto orc_batch_reader,
        OrcFileBatchReader::Create(std::move(in_stream), pool_, options, batch_size))
    EXPECT_TRUE(orc_batch_reader);

Comment on lines +457 to +459

    auto f2 = arrow::field(
        "f2", arrow::struct_({field("sub1", arrow::int64()), field("sub2", arrow::binary()),
                              field("sub3", arrow::utf8())}));

Comment on lines +79 to +87

    Result<uint64_t> GetEstimateLength() const;
    Status ExpandBatch(uint64_t expect_size);

    static Result<::orc::WriterOptions> PrepareWriterOptions(
        const std::map<std::string, std::string>& options, const std::string& file_compression,
        const std::shared_ptr<arrow::DataType>& data_type);
    static Result<::orc::CompressionKind> ToOrcCompressionKind(const std::string& file_compression);

 private:
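
`ToOrcCompressionKind` in the excerpt above maps a compression name to an ORC compression kind. A minimal, self-contained sketch of that kind of mapping is shown below. The local `CompressionKind` enum, the function name `ToCompressionKind`, the accepted names, and the case-insensitive matching are all assumptions made for illustration; the PR's actual function returns `Result<::orc::CompressionKind>` and may accept a different set of names.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical local enum standing in for ::orc::CompressionKind.
enum class CompressionKind { kNone, kZlib, kSnappy, kLz4, kZstd };

// Maps a compression name to a kind, or std::nullopt if the name is unknown.
// Matching is case-insensitive, which is an assumption of this sketch.
std::optional<CompressionKind> ToCompressionKind(std::string name) {
  std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  if (name == "none") return CompressionKind::kNone;
  if (name == "zlib") return CompressionKind::kZlib;
  if (name == "snappy") return CompressionKind::kSnappy;
  if (name == "lz4") return CompressionKind::kLz4;
  if (name == "zstd") return CompressionKind::kZstd;
  return std::nullopt;  // Caller decides how to report an unsupported name.
}
```

Returning an empty optional for unknown names keeps the failure explicit, which plays the same role as the error branch of a `Result` type.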

Comment on lines +65 to +77

    Result<ReadBatch> NextBatch() override;

    Result<uint64_t> GetPreviousBatchFirstRowNumber() const override {
        return reader_->GetRowNumber();
    }

    Result<uint64_t> GetNumberOfRows() const override {
        return reader_->GetNumberOfRows();
    }

    uint64_t GetNextRowToRead() const override {
        return reader_->GetNextRowToRead();
    }

Comment on lines +359 to +365

    auto orc_batch_reader =
        PrepareOrcFileBatchReader(file_name, &read_schema, batch_size, natural_read_size);
    ASSERT_EQ(orc_batch_reader->GetPreviousBatchFirstRowNumber().value(), -1);
    ASSERT_OK_AND_ASSIGN(auto result_array, paimon::test::ReadResultCollector::CollectResult(
                                                orc_batch_reader.get()));
    ASSERT_EQ(orc_batch_reader->GetPreviousBatchFirstRowNumber().value(), 8);
    orc_batch_reader->Close();
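
The first assertion above compares a value that the header excerpt declares as `uint64_t` with the literal `-1`. In a mixed comparison like this the `int` is converted to `uint64_t`, so the check passes only when the value is the largest `uint64_t`, and compilers commonly report a sign-compare warning for it. One way to make the intent explicit is a named sentinel, sketched below. `kInvalidRow` and `IsInvalidRow` are hypothetical names, and treating the maximum value as "no batch read yet" is an assumption inferred from the test, not something the PR states.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical sentinel for "no batch has been read yet".
constexpr uint64_t kInvalidRow = std::numeric_limits<uint64_t>::max();

// Converting -1 to an unsigned 64-bit integer wraps around to the maximum value,
// which is why comparing against a plain -1 can still succeed.
static_assert(static_cast<uint64_t>(-1) == kInvalidRow,
              "-1 converts to the largest uint64_t");

// Compares against the sentinel without mixing signed and unsigned operands.
constexpr bool IsInvalidRow(uint64_t row) { return row == kInvalidRow; }
```

In a test this would read as `ASSERT_EQ(value, kInvalidRow)` or `ASSERT_TRUE(IsInvalidRow(value))`, keeping both operands unsigned.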


2 participants