feat(format): add ORC format adapter, stream implementations and builder utilities #121
Open
lucasfang wants to merge 1 commit into
Pull request overview
This pull request adds ORC format support to the Paimon C++ format layer by introducing an Arrow↔ORC adapter, ORC-specific Input/OutputStream implementations, reader/writer builder utilities, and file format + factory wiring.
Changes:
- Introduces `OrcAdapter` to convert between ORC column batches/types and Arrow arrays/schemas.
- Adds ORC stream wrappers (`OrcInputStreamImpl`, `OrcOutputStreamImpl`) and corresponding reader/writer builders.
- Registers an ORC `FileFormat` + `FileFormatFactory`, and adds ORC-specific constants, metrics keys, and a memory-pool bridge.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| src/paimon/format/orc/orc_writer_builder.h | ORC writer builder wiring schema/options into an ORC format writer. |
| src/paimon/format/orc/orc_reader_builder.h | ORC reader builder creating ORC batch readers from InputStream. |
| src/paimon/format/orc/orc_output_stream_impl.h | ORC OutputStream wrapper interface over Paimon OutputStream. |
| src/paimon/format/orc/orc_output_stream_impl.cpp | ORC output wrapper implementation (length/write/close). |
| src/paimon/format/orc/orc_input_stream_impl.h | ORC InputStream wrapper interface over Paimon InputStream (+ metrics hook). |
| src/paimon/format/orc/orc_input_stream_impl.cpp | ORC input wrapper implementation (sync + async read paths). |
| src/paimon/format/orc/orc_input_output_stream_test.cpp | Stream-level unit tests for ORC input/output wrappers. |
| src/paimon/format/orc/orc_metrics.h | ORC metrics key constants. |
| src/paimon/format/orc/orc_memory_pool.h | Bridges ORC MemoryPool to Paimon MemoryPool. |
| src/paimon/format/orc/orc_format_defs.h | ORC option keys + defaults for read/write behavior. |
| src/paimon/format/orc/orc_file_format.h | Implements ORC FileFormat (reader/writer/stats extractor creation). |
| src/paimon/format/orc/orc_file_format_factory.h | Declares ORC FileFormatFactory. |
| src/paimon/format/orc/orc_file_format_factory.cpp | Implements and registers the ORC format factory. |
| src/paimon/format/orc/orc_adapter.h | Declares Arrow↔ORC adapter surface. |
| src/paimon/format/orc/orc_adapter.cpp | Implements Arrow↔ORC conversions (types + batch read/write). |
```cpp
std::string data = "hello";
out_stream->write(data.data(), data.length());
// noted that OrcOutputStreamImpl::close() api do nothing
ASSERT_OK(out_stream->output_stream_->Close());
```
Comment on lines +21 to +25
```cpp
#include <memory>

#include "orc/MemoryPool.hh"
#include "paimon/common/utils/concurrent_hash_map.h"
#include "paimon/memory/memory_pool.h"
```

```cpp
#pragma once

#include <map>
```
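The includes quoted above pair `orc/MemoryPool.hh` with a concurrent hash map and the Paimon memory pool. One plausible reason, which is an assumption and not confirmed by this PR, is that an ORC-style pool frees by pointer only, while the host pool needs the allocation size on free, so the bridge has to remember sizes. A minimal sketch of that idea follows; `HostPool`, `SizelessPool`, and `BridgePool` are hypothetical stand-ins, not the Paimon or ORC classes, and a mutex-guarded `std::unordered_map` stands in for the concurrent map.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <unordered_map>

// Hypothetical host pool: Free() needs the size to keep its accounting right.
class HostPool {
 public:
    void* Malloc(uint64_t size) {
        void* p = std::malloc(size);
        if (p != nullptr) used_ += size;
        return p;
    }
    void Free(void* p, uint64_t size) {
        used_ -= size;
        std::free(p);
    }
    uint64_t used() const { return used_.load(); }

 private:
    std::atomic<uint64_t> used_{0};
};

// Hypothetical pool interface that frees by pointer only.
class SizelessPool {
 public:
    virtual ~SizelessPool() = default;
    virtual char* malloc(uint64_t size) = 0;
    virtual void free(char* p) = 0;
};

// Bridge: records each allocation's size so it can be handed back on free.
class BridgePool : public SizelessPool {
 public:
    explicit BridgePool(HostPool* host) : host_(host) {}

    char* malloc(uint64_t size) override {
        char* p = static_cast<char*>(host_->Malloc(size));
        if (p != nullptr) {
            std::lock_guard<std::mutex> lock(mu_);
            sizes_[p] = size;
        }
        return p;
    }

    void free(char* p) override {
        if (p == nullptr) return;
        uint64_t size = 0;
        {
            std::lock_guard<std::mutex> lock(mu_);
            auto it = sizes_.find(p);
            if (it == sizes_.end()) return;  // pointer not allocated here
            size = it->second;
            sizes_.erase(it);
        }
        host_->Free(p, size);
    }

 private:
    HostPool* host_;
    std::mutex mu_;
    std::unordered_map<char*, uint64_t> sizes_;
};
```

The size map is the only state the bridge adds; whether the real bridge tracks sizes this way would need to be checked against `orc_memory_pool.h`.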
Comment on lines +52 to +54
```cpp
void OrcOutputStreamImpl::write(const void* buf, size_t length) {
    Result<int32_t> write_len = output_stream_->Write(static_cast<const char*>(buf), length);
    if (!write_len.ok()) {
```
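Because `write` returns `void`, a failed `Result` from the underlying stream cannot be returned to the caller and has to surface some other way. The sketch below shows one common shape for that: throw on an error result and on a short write, and only then advance the tracked length. `WriteResult`, `FakeSink`, and `ThrowingWriter` are hypothetical illustrations, not the types in this PR, and the excerpt above does not show what the real implementation does after the `ok()` check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a Result<int32_t>-style return value.
struct WriteResult {
    bool ok;
    int32_t value;      // bytes written when ok
    std::string error;  // message when !ok
};

// Hypothetical sink that accepts at most `capacity` bytes in total.
class FakeSink {
 public:
    explicit FakeSink(size_t capacity) : capacity_(capacity) {}
    WriteResult Write(const char* data, size_t length) {
        if (buffer_.size() + length > capacity_) {
            return {false, 0, "sink is full"};
        }
        buffer_.append(data, length);
        return {true, static_cast<int32_t>(length), ""};
    }
    const std::string& buffer() const { return buffer_; }

 private:
    size_t capacity_;
    std::string buffer_;
};

// Wrapper with a void write(): errors become exceptions.
class ThrowingWriter {
 public:
    explicit ThrowingWriter(FakeSink* sink) : sink_(sink) {}

    void write(const void* buf, size_t length) {
        WriteResult r = sink_->Write(static_cast<const char*>(buf), length);
        if (!r.ok) {
            throw std::runtime_error("write failed: " + r.error);
        }
        if (static_cast<size_t>(r.value) != length) {
            throw std::runtime_error("short write");
        }
        length_ += length;  // only count bytes that were fully written
    }

    uint64_t getLength() const { return length_; }

 private:
    FakeSink* sink_;
    uint64_t length_ = 0;
};
```

Updating the length after the checks keeps `getLength()` consistent with what actually reached the sink when a write fails.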
Comment on lines +61 to +66
```cpp
void OrcInputStreamImpl::read(void* buf, uint64_t length, uint64_t offset) {
    if (metrics_) {
        metrics_->IOCount.fetch_add(1);
    }

    Result<int32_t> read_bytes = input_stream_->Read(static_cast<char*>(buf), length, offset);
```
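The read path above bumps an optional I/O counter and then performs a positional read. A minimal sketch of that pattern, with the extra step of rejecting short reads, might look like the following. `IoMetrics`, `FakeSource`, and `CountingReader` are hypothetical names for illustration; the short-read check is an assumption about what a caller of a `void read` would want, not something the excerpt shows.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical metrics holder; only the atomic counter matters here.
struct IoMetrics {
    std::atomic<uint64_t> io_count{0};
};

// Hypothetical positional source: returns bytes copied, or -1 on a bad offset.
class FakeSource {
 public:
    explicit FakeSource(std::string data) : data_(std::move(data)) {}
    int64_t Read(char* buf, uint64_t length, uint64_t offset) const {
        if (offset > data_.size()) return -1;
        uint64_t n = std::min<uint64_t>(length, data_.size() - offset);
        std::memcpy(buf, data_.data() + offset, n);
        return static_cast<int64_t>(n);
    }

 private:
    std::string data_;
};

class CountingReader {
 public:
    // `metrics` may be null, in which case nothing is counted.
    CountingReader(const FakeSource* source, IoMetrics* metrics)
        : source_(source), metrics_(metrics) {}

    void read(void* buf, uint64_t length, uint64_t offset) {
        if (metrics_ != nullptr) {
            metrics_->io_count.fetch_add(1);  // counts attempts, not successes
        }
        int64_t n = source_->Read(static_cast<char*>(buf), length, offset);
        if (n < 0) {
            throw std::runtime_error("read failed");
        }
        if (static_cast<uint64_t>(n) != length) {
            throw std::runtime_error("short read");
        }
    }

 private:
    const FakeSource* source_;
    IoMetrics* metrics_;
};
```

Note that the counter is incremented before the read, so failed reads are counted as I/O operations too, which matches the ordering in the excerpt.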
```cpp
        arrow::internal::checked_cast<const arrow::StructType&>(type).fields();
for (auto& arrow_field : arrow_fields) {
    std::string field_name = arrow_field->name();
    ARROW_ASSIGN_OR_RAISE(auto orc_subtype, GetOrcType(*arrow_field->type()));
```

```cpp
        arrow::internal::checked_cast<const arrow::MapType&>(type).key_field();
const auto& item_field =
        arrow::internal::checked_cast<const arrow::MapType&>(type).item_field();
ARROW_ASSIGN_OR_RAISE(auto key_orc_type, GetOrcType(*key_field->type()));
```

```cpp
const auto& item_field =
        arrow::internal::checked_cast<const arrow::MapType&>(type).item_field();
ARROW_ASSIGN_OR_RAISE(auto key_orc_type, GetOrcType(*key_field->type()));
ARROW_ASSIGN_OR_RAISE(auto item_orc_type, GetOrcType(*item_field->type()));
```
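The struct and map excerpts above convert a nested type by recursing into each child and propagating any failure upward. The sketch below illustrates that recursion on a toy type tree, rendering ORC-style schema strings such as `map<string,bigint>`. `Kind`, `Node`, `Make`, and `ToOrcTypeString` are hypothetical and deliberately avoid the Arrow and ORC APIs; a `bool` return plays the role that `ARROW_ASSIGN_OR_RAISE` plays in the real code.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class Kind { kInt64, kString, kList, kMap, kStruct, kUnsupported };

struct Node;
// (field name, child type); names are only rendered for struct fields.
using Children = std::vector<std::pair<std::string, std::shared_ptr<Node>>>;

struct Node {
    Kind kind = Kind::kUnsupported;
    Children children;
};

std::shared_ptr<Node> Make(Kind kind, Children children = {}) {
    auto node = std::make_shared<Node>();
    node->kind = kind;
    node->children = std::move(children);
    return node;
}

// Returns false if any type in the tree cannot be converted.
bool ToOrcTypeString(const Node& type, std::string* out) {
    switch (type.kind) {
        case Kind::kInt64:
            *out = "bigint";
            return true;
        case Kind::kString:
            *out = "string";
            return true;
        case Kind::kList: {
            if (type.children.size() != 1) return false;
            std::string elem;
            if (!ToOrcTypeString(*type.children[0].second, &elem)) return false;
            *out = "array<" + elem + ">";
            return true;
        }
        case Kind::kMap: {
            if (type.children.size() != 2) return false;
            std::string key, item;
            if (!ToOrcTypeString(*type.children[0].second, &key)) return false;
            if (!ToOrcTypeString(*type.children[1].second, &item)) return false;
            *out = "map<" + key + "," + item + ">";
            return true;
        }
        case Kind::kStruct: {
            std::string fields;
            for (const auto& child : type.children) {
                std::string sub;
                if (!ToOrcTypeString(*child.second, &sub)) return false;
                if (!fields.empty()) fields += ",";
                fields += child.first + ":" + sub;
            }
            *out = "struct<" + fields + ">";
            return true;
        }
        default:
            return false;  // unsupported leaf: the error bubbles up
    }
}
```

An unsupported type anywhere in the tree makes the whole conversion fail, which is the same early-exit behavior the macro gives the real `GetOrcType`.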
Comment on lines +1547 to +1549
```cpp
PAIMON_ASSIGN_OR_RAISE_FROM_ARROW(std::unique_ptr<::orc::Type> orc_subtype,
                                  paimon::orc::GetOrcType(*field->type()));
SetAttributes(field, orc_subtype.get());
```
Comment on lines +914 to +918
```cpp
Result<std::shared_ptr<arrow::Array>> OrcAdapter::AppendBatch(
        const std::shared_ptr<arrow::DataType>& type, ::orc::ColumnVectorBatch* batch,
        arrow::MemoryPool* pool) {
    PAIMON_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ArrayBuilder> builder,
                           MakeArrowBuilder(type, batch, pool));
```
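`AppendBatch` above turns an ORC column batch into an Arrow array through a builder. As a rough illustration of that shape only, the sketch below copies a batch of 64-bit values with a not-null mask into a column with a validity vector. `LongBatch`, `Int64Column`, and `AppendLongBatch` are hypothetical simplifications and do not reproduce the ORC batch layout or the Arrow builder API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical column batch: values plus a per-row not-null mask.
struct LongBatch {
    std::vector<int64_t> data;
    std::vector<char> not_null;  // 1 = value present; ignored if !has_nulls
    bool has_nulls = false;
};

// Hypothetical output column: values plus a validity flag per row.
struct Int64Column {
    std::vector<int64_t> values;
    std::vector<bool> valid;
};

Int64Column AppendLongBatch(const LongBatch& batch) {
    if (batch.has_nulls && batch.not_null.size() != batch.data.size()) {
        throw std::invalid_argument("mask length does not match data length");
    }
    Int64Column out;
    out.values.reserve(batch.data.size());
    out.valid.reserve(batch.data.size());
    for (size_t i = 0; i < batch.data.size(); ++i) {
        // Without nulls the mask is skipped entirely.
        bool present = !batch.has_nulls || batch.not_null[i] != 0;
        out.values.push_back(present ? batch.data[i] : 0);  // null slots get 0
        out.valid.push_back(present);
    }
    return out;
}
```

Skipping the mask when the batch reports no nulls avoids reading a buffer that may not be populated in that case.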
Purpose
Linked issue: No linked issue
This change adds ORC format support including adapter, stream implementations, builder utilities and format factory.
Included changes:
- `orc_adapter.h` and `orc_adapter.cpp` providing core ORC read/write adaptation logic
- `orc_input_stream_impl.h/.cpp` for ORC input stream handling
- `orc_output_stream_impl.h/.cpp` for ORC output stream handling
- `orc_reader_builder.h` for constructing ORC readers
- `orc_writer_builder.h` for constructing ORC writers
- `orc_file_format.h` defining ORC file format interface
- `orc_file_format_factory.h/.cpp` for ORC format instantiation
- `orc_format_defs.h` with ORC format constants and definitions
- `orc_memory_pool.h` for ORC memory management
- `orc_metrics.h` for ORC operation metrics
- `orc_input_output_stream_test.cpp` covering input/output stream implementations
Tests
Not run. Local compile, CMake, and gtest environment checks are not part of this PR description.
Test coverage included in this change:
- `orc_input_output_stream_test.cpp`
API and Format
No public API, storage format, or protocol changes.
Documentation
No documentation changes required.
Generative AI tooling
Migrate-by: Aone Copilot (Qwen3.7-Max)