feat: add ORC format reader/writer utilities #122
Open
lucasfang wants to merge 1 commit into
Pull request overview
This PR introduces internal ORC format support under src/paimon/format/orc/, adding an ORC batch reader, ORC writer, and a reader wrapper, along with unit tests intended to validate reading/writing and prefetch-related behaviors.
Changes:
- Add `OrcFileBatchReader` for batch reading ORC files (prefetch-capable reader interface).
- Add `OrcFormatWriter` for writing ORC files with configurable options/metrics.
- Add `OrcReaderWrapper` helper around ORC reader APIs, plus new gtest suites for all components.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/paimon/format/orc/orc_reader_wrapper.h | Declares a wrapper around ORC reader/row-reader to support batch reads + prefetch helpers. |
| src/paimon/format/orc/orc_reader_wrapper.cpp | Implements wrapper operations (seek, schema setup, next batch read). |
| src/paimon/format/orc/orc_reader_wrapper_test.cpp | Adds unit test for wrapper row-position tracking. |
| src/paimon/format/orc/orc_format_writer.h | Declares ORC FormatWriter implementation and option helpers. |
| src/paimon/format/orc/orc_format_writer.cpp | Implements ORC writer creation, batch conversion, flushing/finishing, and option handling. |
| src/paimon/format/orc/orc_format_writer_test.cpp | Adds unit tests for writing ORC and validating writer options/compression mapping. |
| src/paimon/format/orc/orc_file_batch_reader.h | Declares ORC PrefetchFileBatchReader implementation and schema/predicate setup. |
| src/paimon/format/orc/orc_file_batch_reader.cpp | Implements ORC reader creation, schema validation, predicate wiring, and metrics. |
| src/paimon/format/orc/orc_file_batch_reader_test.cpp | Adds extensive unit tests for schema selection, dictionaries, complex types, timestamps, etc. |
Comment on lines +83 to +86

        OrcAdapter::AppendBatch(target_type_, orc_batch.get(), arrow_pool_.get()));
    PAIMON_RETURN_NOT_OK_FROM_ARROW(arrow::ExportArray(*array, c_array.get(), c_schema.get()));
    next_row_ = GetRowNumber() + orc_batch->numElements;
    guard.Release();

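The `next_row_` update in this excerpt sets the next row to read to the first row of the batch just read plus that batch's element count. Here is a minimal standalone sketch of that bookkeeping. `SketchRowReader` and `SketchReaderWrapper` are hypothetical stand-ins invented for illustration, not the classes in this PR, and the sketch assumes `GetRowNumber()` reports the first row of the most recently read batch.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for a row reader; only what the bookkeeping needs.
class SketchRowReader {
 public:
    explicit SketchRowReader(uint64_t total_rows) : total_rows_(total_rows) {}

    // Assumption: reports the first row of the most recently read batch.
    uint64_t GetRowNumber() const { return last_batch_first_row_; }

    // Reads up to batch_size rows; returns how many were read (0 at end of data).
    uint64_t Next(uint64_t batch_size) {
        last_batch_first_row_ = cursor_;
        const uint64_t n = std::min(batch_size, total_rows_ - cursor_);
        cursor_ += n;
        return n;
    }

 private:
    uint64_t total_rows_;
    uint64_t cursor_ = 0;
    uint64_t last_batch_first_row_ = 0;
};

// Hypothetical wrapper mirroring the update in the excerpt:
// next row to read = first row of the batch + number of elements in the batch.
class SketchReaderWrapper {
 public:
    explicit SketchReaderWrapper(uint64_t total_rows) : reader_(total_rows) {}

    uint64_t NextBatch(uint64_t batch_size) {
        const uint64_t num_elements = reader_.Next(batch_size);
        next_row_ = reader_.GetRowNumber() + num_elements;
        return num_elements;
    }

    uint64_t GetNextRowToRead() const { return next_row_; }

 private:
    SketchRowReader reader_;
    uint64_t next_row_ = 0;
};
```

With 10 rows and a batch size of 4, successive `NextBatch(4)` calls in this sketch return 4, 4, 2, and 0, and `GetNextRowToRead()` reports 4, 8, 10, and 10.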
Comment on lines +157 to +160

    EXPECT_OK_AND_ASSIGN(
        auto orc_batch_reader,
        OrcFileBatchReader::Create(std::move(in_stream), pool_, options, batch_size))
    EXPECT_TRUE(orc_batch_reader);

Comment on lines +457 to +459

    auto f2 = arrow::field(
        "f2", arrow::struct_({field("sub1", arrow::int64()), field("sub2", arrow::binary()),
                              field("sub3", arrow::utf8())}));

Comment on lines +79 to +87

    Result<uint64_t> GetEstimateLength() const;
    Status ExpandBatch(uint64_t expect_size);

    static Result<::orc::WriterOptions> PrepareWriterOptions(
        const std::map<std::string, std::string>& options, const std::string& file_compression,
        const std::shared_ptr<arrow::DataType>& data_type);
    static Result<::orc::CompressionKind> ToOrcCompressionKind(const std::string& file_compression);

     private:

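`ToOrcCompressionKind` is declared here as taking a compression name and returning a `Result<::orc::CompressionKind>`. Its body is not part of this excerpt. As a rough sketch of what a string-to-enum mapping of this shape might look like, the block below uses a hypothetical `SketchCompressionKind` enum and `std::optional` in place of the real ORC enum and the project's `Result` type. The accepted codec names and the case-insensitive matching are assumptions, not behavior confirmed by the PR.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical stand-in enum; NOT the real ::orc::CompressionKind.
enum class SketchCompressionKind { kNone, kZlib, kSnappy, kLz4, kZstd };

// Maps a codec name to the stand-in enum, ignoring case (an assumption).
// Returns std::nullopt for names this sketch does not recognize, where the
// real helper would presumably return an error status instead.
std::optional<SketchCompressionKind> ToSketchCompressionKind(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (name == "none") return SketchCompressionKind::kNone;
    if (name == "zlib") return SketchCompressionKind::kZlib;
    if (name == "snappy") return SketchCompressionKind::kSnappy;
    if (name == "lz4") return SketchCompressionKind::kLz4;
    if (name == "zstd") return SketchCompressionKind::kZstd;
    return std::nullopt;
}
```

Returning an explicit "not recognized" value rather than silently falling back to a default codec keeps a misspelled option from producing files with unexpected compression.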
Comment on lines +65 to +77

    Result<ReadBatch> NextBatch() override;

    Result<uint64_t> GetPreviousBatchFirstRowNumber() const override {
        return reader_->GetRowNumber();
    }

    Result<uint64_t> GetNumberOfRows() const override {
        return reader_->GetNumberOfRows();
    }

    uint64_t GetNextRowToRead() const override {
        return reader_->GetNextRowToRead();
    }

Comment on lines +359 to +365

    auto orc_batch_reader =
        PrepareOrcFileBatchReader(file_name, &read_schema, batch_size, natural_read_size);
    ASSERT_EQ(orc_batch_reader->GetPreviousBatchFirstRowNumber().value(), -1);
    ASSERT_OK_AND_ASSIGN(auto result_array, paimon::test::ReadResultCollector::CollectResult(
                                                orc_batch_reader.get()));
    ASSERT_EQ(orc_batch_reader->GetPreviousBatchFirstRowNumber().value(), 8);
    orc_batch_reader->Close();

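The header excerpt earlier on this page declares `GetPreviousBatchFirstRowNumber()` as returning `Result<uint64_t>`, so the `-1` in the first assertion is compared against an unsigned value. In C++, converting `-1` to `uint64_t` is well defined and yields the largest representable value. The sketch below spells that out with a named sentinel. `kSketchNoPreviousBatch` and `IsNoPreviousBatch` are hypothetical names, and the reading that this value means "no batch read yet" is inferred from the assertion rather than from the implementation.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical sentinel name for "no batch has been read yet".
constexpr uint64_t kSketchNoPreviousBatch = std::numeric_limits<uint64_t>::max();

// Returns true when a row number equals the sentinel.
constexpr bool IsNoPreviousBatch(uint64_t row_number) {
    return row_number == kSketchNoPreviousBatch;
}

// Conversion of -1 to uint64_t wraps modulo 2^64, giving the maximum value.
static_assert(static_cast<uint64_t>(-1) == kSketchNoPreviousBatch,
              "-1 converts to the maximum uint64_t value");
```

Comparing against a named unsigned constant states the intent directly and avoids the signed/unsigned comparison that a bare `-1` introduces.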
Purpose
Linked issue: No linked issue
This change adds ORC format reader and writer utilities for batch processing Paimon data files.
Included changes:
ORC File Batch Reader:
- `OrcFileBatchReader` header and implementation for batch reading ORC files
- `orc_file_batch_reader_test.cpp`
ORC Format Writer:
- `OrcFormatWriter` header and implementation for writing ORC format files
- `orc_format_writer_test.cpp`
ORC Reader Wrapper:
- `OrcReaderWrapper` header and implementation as a wrapper utility for ORC readers
- `orc_reader_wrapper_test.cpp`
Tests
Not run. Local compile, CMake, and gtest environment checks are not part of this PR description.
Test coverage included in this change:
- `OrcFileBatchReaderTest`
- `OrcFormatWriterTest`
- `OrcReaderWrapperTest`
API and Format
No public API, storage format, or protocol changes.
All added files are under `src/paimon/format/orc/` (internal implementation). No headers added under `include/`.
Documentation
No documentation changes required.
Generative AI tooling
Migrate-by: Aone Copilot (Qwen3.7-Max)