feat(format): add ORC format adapter, stream implementations and builder utilities#121

Open
lucasfang wants to merge 1 commit into apache:main from lucasfang:migrate_7
@lucasfang

Contributor

Purpose

Linked issue: No linked issue

This change adds ORC format support including adapter, stream implementations, builder utilities and format factory.

Included changes:

  • ORC Adapter:
    • Adds orc_adapter.h and orc_adapter.cpp providing core ORC read/write adaptation logic
  • Stream Implementations:
    • Adds orc_input_stream_impl.h/.cpp for ORC input stream handling
    • Adds orc_output_stream_impl.h/.cpp for ORC output stream handling
  • Builder Utilities:
    • Adds orc_reader_builder.h for constructing ORC readers
    • Adds orc_writer_builder.h for constructing ORC writers
  • Format Infrastructure:
    • Adds orc_file_format.h defining ORC file format interface
    • Adds orc_file_format_factory.h/.cpp for ORC format instantiation
    • Adds orc_format_defs.h with ORC format constants and definitions
    • Adds orc_memory_pool.h for ORC memory management
    • Adds orc_metrics.h for ORC operation metrics
  • Test Coverage:
    • Adds orc_input_output_stream_test.cpp covering input/output stream implementations
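
As a rough illustration of what an output stream wrapper of this kind has to do, the sketch below adapts a backing stream that reports errors through return values to a `write()` that returns `void` and signals failure by throwing, which is the shape of ORC's C++ output stream interface. `WriteResult`, `BackingOutputStream`, `StringBackedStream`, `FailingStream`, and `ThrowingOutputAdapter` are hypothetical names invented for this example; they are not the classes added by this PR, and the short-write check is an assumption about desirable behavior rather than something the PR is known to do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a status-style result; not Paimon's Result<T>.
struct WriteResult {
    bool ok;
    int32_t bytes_written;
    std::string error;
};

// Hypothetical backing stream that reports failures through return values.
class BackingOutputStream {
public:
    virtual ~BackingOutputStream() = default;
    virtual WriteResult Write(const char* data, size_t length) = 0;
};

// In-memory sink used only for the demonstration.
class StringBackedStream : public BackingOutputStream {
public:
    WriteResult Write(const char* data, size_t length) override {
        buffer_.append(data, length);
        return {true, static_cast<int32_t>(length), ""};
    }
    const std::string& buffer() const { return buffer_; }

private:
    std::string buffer_;
};

// Sink that always fails, to exercise the error path.
class FailingStream : public BackingOutputStream {
public:
    WriteResult Write(const char*, size_t) override { return {false, 0, "disk full"}; }
};

// Adapter exposing an exception-based write() on top of the result-based stream.
class ThrowingOutputAdapter {
public:
    explicit ThrowingOutputAdapter(BackingOutputStream* out) : out_(out) {}

    void write(const void* buf, size_t length) {
        WriteResult r = out_->Write(static_cast<const char*>(buf), length);
        if (!r.ok) {
            throw std::runtime_error("write failed: " + r.error);
        }
        // Assumption: a partial write is treated as an error instead of being retried.
        if (r.bytes_written < 0 || static_cast<size_t>(r.bytes_written) != length) {
            throw std::runtime_error("short write");
        }
        length_ += length;
    }

    // Total number of bytes accepted so far.
    uint64_t getLength() const { return length_; }

private:
    BackingOutputStream* out_;  // not owned
    uint64_t length_ = 0;
};
```

Tracking the length inside the adapter keeps `getLength()` independent of the backing stream, at the cost of assuming that nothing else writes to the same stream.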

Tests

Not run. This PR description does not include results from local compile, CMake, or gtest environment checks.

Test coverage included in this change:

  • orc_input_output_stream_test.cpp

API and Format

No public API, storage format, or protocol changes.

Documentation

No documentation changes required.

Generative AI tooling

Migrate-by: Aone Copilot (Qwen3.7-Max)

Copilot AI review requested due to automatic review settings June 25, 2026 06:25

Copilot AI left a comment

Pull request overview

This pull request adds ORC format support to the Paimon C++ format layer by introducing an Arrow↔ORC adapter, ORC-specific Input/OutputStream implementations, reader/writer builder utilities, and file format + factory wiring.

Changes:

  • Introduces OrcAdapter to convert between ORC column batches/types and Arrow arrays/schemas.
  • Adds ORC stream wrappers (OrcInputStreamImpl, OrcOutputStreamImpl) and corresponding reader/writer builders.
  • Registers an ORC FileFormat + FileFormatFactory, and adds ORC-specific constants, metrics keys, and a memory-pool bridge.
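
To make the type-conversion half of such an adapter more concrete, here is a minimal sketch of a recursive mapping from a nested type tree to an ORC type string. It is not the PR's `GetOrcType`: `Kind`, `TypeNode`, `MakeType`, and `ToOrcTypeString` are hypothetical names, the type tree is a simplified stand-in for Arrow's `DataType`, and only a handful of kinds are covered.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily reduced set of type kinds.
enum class Kind { kInt32, kInt64, kString, kList, kMap, kStruct };

struct TypeNode;
using Field = std::pair<std::string, std::shared_ptr<TypeNode>>;

// Simplified type tree: children carry a field name and a child type.
struct TypeNode {
    Kind kind;
    std::vector<Field> children;
};

// Convenience constructor for the demonstration.
std::shared_ptr<TypeNode> MakeType(Kind kind, std::vector<Field> children = {}) {
    auto node = std::make_shared<TypeNode>();
    node->kind = kind;
    node->children = std::move(children);
    return node;
}

// Recursively renders the tree in ORC's type-string syntax,
// e.g. struct<id:bigint,tags:map<string,int>>.
std::string ToOrcTypeString(const TypeNode& t) {
    switch (t.kind) {
        case Kind::kInt32:
            return "int";
        case Kind::kInt64:
            return "bigint";
        case Kind::kString:
            return "string";
        case Kind::kList:
            // A list has exactly one child: the element type.
            return "array<" + ToOrcTypeString(*t.children.at(0).second) + ">";
        case Kind::kMap:
            // A map has two children: key type, then value type.
            return "map<" + ToOrcTypeString(*t.children.at(0).second) + "," +
                   ToOrcTypeString(*t.children.at(1).second) + ">";
        case Kind::kStruct: {
            std::string out = "struct<";
            for (size_t i = 0; i < t.children.size(); ++i) {
                if (i > 0) out += ",";
                out += t.children[i].first + ":" + ToOrcTypeString(*t.children[i].second);
            }
            return out + ">";
        }
    }
    throw std::invalid_argument("unsupported type kind");
}
```

A real adapter would build ORC type objects instead of strings and would also need to carry field attributes and nullability, which this sketch leaves out.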

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| src/paimon/format/orc/orc_writer_builder.h | ORC writer builder wiring schema/options into an ORC format writer. |
| src/paimon/format/orc/orc_reader_builder.h | ORC reader builder creating ORC batch readers from InputStream. |
| src/paimon/format/orc/orc_output_stream_impl.h | ORC OutputStream wrapper interface over Paimon OutputStream. |
| src/paimon/format/orc/orc_output_stream_impl.cpp | ORC output wrapper implementation (length/write/close). |
| src/paimon/format/orc/orc_input_stream_impl.h | ORC InputStream wrapper interface over Paimon InputStream (+ metrics hook). |
| src/paimon/format/orc/orc_input_stream_impl.cpp | ORC input wrapper implementation (sync + async read paths). |
| src/paimon/format/orc/orc_input_output_stream_test.cpp | Stream-level unit tests for ORC input/output wrappers. |
| src/paimon/format/orc/orc_metrics.h | ORC metrics key constants. |
| src/paimon/format/orc/orc_memory_pool.h | Bridges ORC MemoryPool to Paimon MemoryPool. |
| src/paimon/format/orc/orc_format_defs.h | ORC option keys + defaults for read/write behavior. |
| src/paimon/format/orc/orc_file_format.h | Implements ORC FileFormat (reader/writer/stats extractor creation). |
| src/paimon/format/orc/orc_file_format_factory.h | Declares ORC FileFormatFactory. |
| src/paimon/format/orc/orc_file_format_factory.cpp | Implements and registers the ORC format factory. |
| src/paimon/format/orc/orc_adapter.h | Declares Arrow↔ORC adapter surface. |
| src/paimon/format/orc/orc_adapter.cpp | Implements Arrow↔ORC conversions (types + batch read/write). |
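
For the memory-pool bridge, one plausible design issue is that ORC's pool interface frees by pointer alone (`malloc(size)` / `free(ptr)`), while a size-aware pool wants the allocation size handed back on free. The sketch below shows one way to reconcile the two by remembering sizes in a map. It rests on assumptions: `SizedPool` and `PoolBridge` are hypothetical, the size-aware `Free(ptr, size)` signature is assumed rather than taken from Paimon, and a mutex-guarded `std::unordered_map` stands in for whatever concurrent map the real bridge may use.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <unordered_map>

// Hypothetical size-aware pool: Free() must be told the allocation size.
class SizedPool {
public:
    void* Malloc(uint64_t size) {
        void* p = std::malloc(static_cast<size_t>(size));
        if (p != nullptr) in_use_.fetch_add(size);
        return p;
    }
    void Free(void* p, uint64_t size) {
        in_use_.fetch_sub(size);
        std::free(p);
    }
    uint64_t in_use() const { return in_use_.load(); }

private:
    std::atomic<uint64_t> in_use_{0};
};

// Bridge with a pointer-only free(); it records each allocation's size
// so the size-aware pool can be called correctly on release.
class PoolBridge {
public:
    explicit PoolBridge(SizedPool* pool) : pool_(pool) {}

    char* malloc(uint64_t size) {
        char* p = static_cast<char*>(pool_->Malloc(size));
        if (p != nullptr) {
            std::lock_guard<std::mutex> lock(mu_);
            sizes_[p] = size;
        }
        return p;
    }

    void free(char* p) {
        if (p == nullptr) return;
        uint64_t size = 0;
        {
            std::lock_guard<std::mutex> lock(mu_);
            auto it = sizes_.find(p);
            if (it == sizes_.end()) return;  // pointer was not allocated here
            size = it->second;
            sizes_.erase(it);
        }
        pool_->Free(p, size);
    }

private:
    SizedPool* pool_;  // not owned
    std::mutex mu_;
    std::unordered_map<char*, uint64_t> sizes_;
};
```

The lookup on every `free` is the price of this approach; an alternative would be to store the size in a small header in front of each allocation, which avoids the map but changes alignment and pointer arithmetic.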

std::string data = "hello";
out_stream->write(data.data(), data.length());
// noted that OrcOutputStreamImpl::close() api do nothing
ASSERT_OK(out_stream->output_stream_->Close());
Comment on lines +21 to +25
#include <memory>

#include "orc/MemoryPool.hh"
#include "paimon/common/utils/concurrent_hash_map.h"
#include "paimon/memory/memory_pool.h"

#pragma once

#include <map>
Comment on lines +52 to +54
void OrcOutputStreamImpl::write(const void* buf, size_t length) {
Result<int32_t> write_len = output_stream_->Write(static_cast<const char*>(buf), length);
if (!write_len.ok()) {
Comment on lines +61 to +66
void OrcInputStreamImpl::read(void* buf, uint64_t length, uint64_t offset) {
if (metrics_) {
metrics_->IOCount.fetch_add(1);
}

Result<int32_t> read_bytes = input_stream_->Read(static_cast<char*>(buf), length, offset);
arrow::internal::checked_cast<const arrow::StructType&>(type).fields();
for (auto& arrow_field : arrow_fields) {
std::string field_name = arrow_field->name();
ARROW_ASSIGN_OR_RAISE(auto orc_subtype, GetOrcType(*arrow_field->type()));
arrow::internal::checked_cast<const arrow::MapType&>(type).key_field();
const auto& item_field =
arrow::internal::checked_cast<const arrow::MapType&>(type).item_field();
ARROW_ASSIGN_OR_RAISE(auto key_orc_type, GetOrcType(*key_field->type()));
const auto& item_field =
arrow::internal::checked_cast<const arrow::MapType&>(type).item_field();
ARROW_ASSIGN_OR_RAISE(auto key_orc_type, GetOrcType(*key_field->type()));
ARROW_ASSIGN_OR_RAISE(auto item_orc_type, GetOrcType(*item_field->type()));
Comment on lines +1547 to +1549
PAIMON_ASSIGN_OR_RAISE_FROM_ARROW(std::unique_ptr<::orc::Type> orc_subtype,
paimon::orc::GetOrcType(*field->type()));
SetAttributes(field, orc_subtype.get());
Comment on lines +914 to +918
Result<std::shared_ptr<arrow::Array>> OrcAdapter::AppendBatch(
const std::shared_ptr<arrow::DataType>& type, ::orc::ColumnVectorBatch* batch,
arrow::MemoryPool* pool) {
PAIMON_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ArrayBuilder> builder,
MakeArrowBuilder(type, batch, pool));
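
The quoted `OrcInputStreamImpl::read` fragment shows a positional read that bumps an `IOCount` counter and receives a `Result<int32_t>` byte count for a `uint64_t` request. As a sketch of how a wrapper might treat the case where fewer bytes come back than were asked for, the example below loops until the request is satisfied and throws if the source ends early. This is an assumption about one reasonable behavior, not a description of the PR's code: `IoMetrics`, `ChunkedSource`, and `ReadFully` are hypothetical, and the counter here is incremented per underlying read, whereas the quoted fragment increments it once per call.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical metrics holder; only an IOCount counter appears in the quoted code.
struct IoMetrics {
    std::atomic<uint64_t> IOCount{0};
};

// Hypothetical positional source that returns at most max_chunk bytes per call.
class ChunkedSource {
public:
    ChunkedSource(std::string data, uint64_t max_chunk)
        : data_(std::move(data)), max_chunk_(max_chunk) {}

    // Returns the number of bytes copied, or -1 if offset is past the end.
    int32_t Read(char* buf, uint64_t length, uint64_t offset) const {
        if (offset > data_.size()) return -1;
        uint64_t available = static_cast<uint64_t>(data_.size()) - offset;
        uint64_t n = std::min(std::min(length, max_chunk_), available);
        std::memcpy(buf, data_.data() + offset, static_cast<size_t>(n));
        return static_cast<int32_t>(n);
    }

private:
    std::string data_;
    uint64_t max_chunk_;
};

// Reads exactly `length` bytes starting at `offset`, looping over short reads.
// Throws if the source reports an error or ends before the request is satisfied.
void ReadFully(const ChunkedSource& src, IoMetrics* metrics, void* buf, uint64_t length,
               uint64_t offset) {
    char* out = static_cast<char*>(buf);
    uint64_t done = 0;
    while (done < length) {
        if (metrics != nullptr) metrics->IOCount.fetch_add(1);
        int32_t n = src.Read(out + done, length - done, offset + done);
        if (n <= 0) throw std::runtime_error("unexpected end of stream");
        done += static_cast<uint64_t>(n);
    }
}
```

Whether to loop or fail on the first short read depends on the contract of the underlying stream; if it guarantees full reads except at end of file, a single length check is enough.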