feat: add table scan, split generation and read infrastructure by lszskye · Pull Request #131 · apache/paimon-cpp

lszskye · 2026-06-25T09:07:37Z

Purpose

`AppendOnlySplitGenerator`

SplitGenerator implementation for append-only tables. Packs data files into splits using bin-packing ordered by bucket mode, with configurable target split size and open file cost.
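The packing step can be illustrated with a minimal sketch. It is not the code from this PR: `FileMeta`, `PackForOrdered`, and the choice to order files by minimum sequence number are assumptions made for illustration. Each file is weighted as `max(file_size, open_file_cost)`, so many small files do not all land in one split.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for data file metadata; not the paimon-cpp type.
struct FileMeta {
    int64_t file_size;
    int64_t min_sequence_number;
};

// Ordered bin packing: walks the files in order and starts a new split
// whenever adding the next file would exceed target_split_size.
// Ordering by min_sequence_number is an assumption of this sketch.
std::vector<std::vector<FileMeta>> PackForOrdered(std::vector<FileMeta> files,
                                                  int64_t target_split_size,
                                                  int64_t open_file_cost) {
    std::stable_sort(files.begin(), files.end(),
                     [](const FileMeta& a, const FileMeta& b) {
                         return a.min_sequence_number < b.min_sequence_number;
                     });
    std::vector<std::vector<FileMeta>> splits;
    std::vector<FileMeta> current;
    int64_t current_weight = 0;
    for (const FileMeta& file : files) {
        // A tiny file still costs at least open_file_cost.
        const int64_t weight = std::max(file.file_size, open_file_cost);
        if (!current.empty() && current_weight + weight > target_split_size) {
            splits.push_back(std::move(current));
            current.clear();  // reset the moved-from vector to a known state
            current_weight = 0;
        }
        current.push_back(file);
        current_weight += weight;
    }
    if (!current.empty()) {
        splits.push_back(std::move(current));
    }
    return splits;
}
```

A single file larger than the target still gets its own split, because a split is only closed when it already holds at least one file.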

`AppendOnlyTableRead`

TableRead implementation for append-only tables. Creates BatchReader instances from data splits using a chain of SplitRead operations configured via InternalReadContext.

`DataEvolutionBatchScan`

AbstractTableScan implementation that wraps a DataTableBatchScan with global index evaluation. Evaluates global indexes to produce row ranges and scores, then wraps resulting splits as indexed splits for efficient filtered reading.

`DataEvolutionSplitGenerator`

SplitGenerator implementation for data evolution tables. Groups files by sequence number and packs them into splits respecting target split size, ensuring files from the same evolution epoch stay together.
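One way to picture "group, then pack" is the sketch below. `EvolutionFile` and `PackBySequence` are hypothetical names, and treating one sequence number as one indivisible group is an assumption of this example rather than a statement about the PR's code.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical file descriptor; not the paimon-cpp type.
struct EvolutionFile {
    int64_t sequence_number;
    int64_t file_size;
};

// Groups files by sequence number, then packs whole groups into splits.
// A group is never divided, so files with the same sequence number
// always end up in the same split.
std::vector<std::vector<EvolutionFile>> PackBySequence(
        const std::vector<EvolutionFile>& files, int64_t target_split_size) {
    std::map<int64_t, std::vector<EvolutionFile>> groups;
    for (const EvolutionFile& file : files) {
        groups[file.sequence_number].push_back(file);
    }

    std::vector<std::vector<EvolutionFile>> splits;
    int64_t current_size = 0;
    for (const auto& entry : groups) {
        const std::vector<EvolutionFile>& group = entry.second;
        int64_t group_size = 0;
        for (const EvolutionFile& file : group) {
            group_size += file.file_size;
        }
        // Open a new split if this group would push the current one over target.
        if (splits.empty() || current_size + group_size > target_split_size) {
            splits.emplace_back();
            current_size = 0;
        }
        splits.back().insert(splits.back().end(), group.begin(), group.end());
        current_size += group_size;
    }
    return splits;
}
```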

`DataTableBatchScan`

AbstractTableScan implementation for batch planning. Uses a StartingScanner to determine the initial snapshot, generates splits via SnapshotReader, and supports push-down limit optimization to reduce the number of splits returned.
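The push-down limit idea can be sketched as follows. `SplitInfo` and `ApplyPushDownLimit` are hypothetical, and the sketch assumes each split reports an exact row count; the conditions under which the real scan may apply the optimization are not described here.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical split summary; assumes row_count is exact for the split.
struct SplitInfo {
    int64_t row_count;
};

// Keeps splits in order until their combined row count reaches the limit,
// then drops the rest so the reader never opens them.
std::vector<SplitInfo> ApplyPushDownLimit(const std::vector<SplitInfo>& splits,
                                          int64_t limit) {
    std::vector<SplitInfo> kept;
    int64_t rows = 0;
    for (const SplitInfo& split : splits) {
        if (rows >= limit) {
            break;
        }
        kept.push_back(split);
        rows += split.row_count;
    }
    return kept;
}
```

The kept splits may contain more rows than the limit; the reader is still expected to enforce the limit on the rows it returns.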

`DataTableStreamScan`

AbstractTableScan implementation for streaming planning. Performs an initial scan via StartingScanner, then incrementally follows new snapshots using FollowUpScanner to produce delta plans for continuous processing.
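The two-phase behavior (one initial plan, then one delta plan per new snapshot) can be modeled with a small state machine. `StreamPlanner` and its string "plans" are illustrative assumptions; the real classes return plan objects and consult a snapshot manager.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical planner that mimics the initial-then-incremental flow.
class StreamPlanner {
public:
    explicit StreamPlanner(int64_t latest_snapshot_id)
        : latest_(latest_snapshot_id) {}

    // Simulates a writer committing one more snapshot.
    void CommitSnapshot() { ++latest_; }

    // First call: a full plan of the latest snapshot.
    // Later calls: a delta plan for the next unseen snapshot,
    // or std::nullopt when no new snapshot exists yet.
    std::optional<std::string> Plan() {
        if (!next_.has_value()) {
            next_ = latest_ + 1;
            return "full:" + std::to_string(latest_);
        }
        if (*next_ > latest_) {
            return std::nullopt;
        }
        std::string plan = "delta:" + std::to_string(*next_);
        ++*next_;
        return plan;
    }

private:
    int64_t latest_;
    std::optional<int64_t> next_;  // unset until the initial scan has run
};
```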

`FallbackTableRead`

TableRead decorator that tries a main table read first and falls back to an alternative table read if the main one fails. Used for backward compatibility when reading older data formats.
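The decorator pattern described here can be sketched with plain function objects. `ReadFn` and `WithFallback` are hypothetical, and signaling failure with `std::nullopt` is an assumption of this example; the PR's classes may report failure differently.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical reader: returns data on success, std::nullopt on failure.
using ReadFn = std::function<std::optional<std::string>(const std::string&)>;

// Wraps two readers: the main reader is tried first, and the fallback
// reader is consulted only when the main reader fails.
ReadFn WithFallback(ReadFn main_read, ReadFn fallback_read) {
    return [main_read = std::move(main_read),
            fallback_read = std::move(fallback_read)](const std::string& split) {
        std::optional<std::string> result = main_read(split);
        if (result.has_value()) {
            return result;
        }
        return fallback_read(split);
    };
}
```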

`KeyValueTableRead`

TableRead implementation for primary-key tables with merge semantics. Creates readers that apply merge engine logic (deduplication, partial update, aggregation) across sorted runs. Supports ForceKeepDelete to retain delete records in output.
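As a rough illustration of the deduplication engine, the sketch below keeps the record with the highest sequence number per key and optionally retains delete records. It uses a `std::map` instead of a streaming merge over sorted runs, and `KeyValue`, `RowKind`, and `MergeDeduplicate` are hypothetical names, so treat it as a model of the semantics rather than of the implementation.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

enum class RowKind { kInsert, kDelete };

// Hypothetical record layout; not the paimon-cpp KeyValue type.
struct KeyValue {
    std::string key;
    int64_t sequence_number;
    RowKind kind;
    std::string value;
};

// Deduplicate semantics: for each key, the record with the highest
// sequence number wins. Delete records are dropped from the output
// unless force_keep_delete is true.
std::vector<KeyValue> MergeDeduplicate(
        const std::vector<std::vector<KeyValue>>& sorted_runs,
        bool force_keep_delete) {
    std::map<std::string, KeyValue> latest;
    for (const std::vector<KeyValue>& run : sorted_runs) {
        for (const KeyValue& kv : run) {
            auto it = latest.find(kv.key);
            if (it == latest.end()) {
                latest.emplace(kv.key, kv);
            } else if (kv.sequence_number > it->second.sequence_number) {
                it->second = kv;
            }
        }
    }
    std::vector<KeyValue> merged;
    for (const auto& entry : latest) {
        const KeyValue& kv = entry.second;
        if (kv.kind == RowKind::kDelete && !force_keep_delete) {
            continue;
        }
        merged.push_back(kv);
    }
    return merged;
}
```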

`MergeTreeSplitGenerator`

SplitGenerator implementation for merge-tree (primary-key) tables. Organizes files into sorted runs by level and sequence number, then packs non-overlapping key ranges into splits. Respects deletion vector settings and merge engine configuration.
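The key-range grouping can be sketched as an interval sweep: files whose key ranges overlap must be merged together, so they share a section, while sections with disjoint ranges can go to different splits. `KeyRangeFile` and `GroupIntoSections` are hypothetical, and string keys stand in for the real key comparator.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical file descriptor with an inclusive key range.
struct KeyRangeFile {
    std::string min_key;
    std::string max_key;
};

// Sorts files by min_key and sweeps left to right. A file joins the current
// section if its range overlaps the section's range so far; otherwise it
// starts a new section. Sections never overlap each other.
std::vector<std::vector<KeyRangeFile>> GroupIntoSections(
        std::vector<KeyRangeFile> files) {
    std::sort(files.begin(), files.end(),
              [](const KeyRangeFile& a, const KeyRangeFile& b) {
                  return a.min_key < b.min_key;
              });
    std::vector<std::vector<KeyRangeFile>> sections;
    std::string bound;  // largest max_key seen in the current section
    for (const KeyRangeFile& file : files) {
        if (sections.empty() || file.min_key > bound) {
            sections.emplace_back();
            bound = file.max_key;
        } else {
            bound = std::max(bound, file.max_key);
        }
        sections.back().push_back(file);
    }
    return sections;
}
```

After this step, sections could be packed into splits by size in the same way as append-only files, since merging only needs to happen within a section.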

`SnapshotReader`

Reads data files from a snapshot and generates splits. Orchestrates FileStoreScan for file discovery, SplitGenerator for split packing, and IndexFileHandler for deletion vector resolution. Supports scan mode, level filtering, value filtering, row range indexing, and real-bucket-only modes.

`StartingScanner`

Abstract helper for the first planning of a TableScan. Defines three result types: NoSnapshot (wait for data), CurrentSnapshot (initial plan with snapshot id), and NextSnapshot (skip current, start from next). Concrete implementations include full, static-from-snapshot, static-from-tag, continuous-from-snapshot, continuous-from-snapshot-full, and continuous-latest scanners.
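The three result types map naturally onto a tagged union. The sketch below uses `std::variant`; the struct fields and the `Describe` helper are assumptions for illustration and may not match how the PR represents results.

```cpp
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical result types mirroring the three cases described above.
struct NoSnapshot {};
struct CurrentSnapshot {
    int64_t snapshot_id;  // snapshot covered by the initial plan
};
struct NextSnapshot {
    int64_t next_snapshot_id;  // first snapshot the follow-up phase should read
};

using ScanResult = std::variant<NoSnapshot, CurrentSnapshot, NextSnapshot>;

// Shows how a caller might branch on the result of the first planning.
std::string Describe(const ScanResult& result) {
    if (std::holds_alternative<NoSnapshot>(result)) {
        return "wait";
    }
    if (const auto* current = std::get_if<CurrentSnapshot>(&result)) {
        return "plan snapshot " + std::to_string(current->snapshot_id);
    }
    return "start from snapshot " +
           std::to_string(std::get<NextSnapshot>(result).next_snapshot_id);
}
```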

`FollowUpScanner`

Abstract helper for incremental streaming planning. Determines whether a snapshot needs scanning and produces delta plans for new data. The concrete implementation listed in this PR is `DeltaFollowUpScanner`.

Snapshot Scanner Implementations

- `FullStartingScanner`: Scans all data in the latest snapshot for an initial full read.
- `StaticFromSnapshotStartingScanner`: Starts from a specific snapshot id for point-in-time reads.
- `StaticFromTagStartingScanner`: Starts from a named tag to read the snapshot that tag refers to.
- `ContinuousFromSnapshotStartingScanner`: Starts from a given snapshot and continues by streaming deltas.
- `ContinuousFromSnapshotFullStartingScanner`: Performs a full scan of a given snapshot, then continues with deltas.
- `ContinuousLatestStartingScanner`: Starts from the latest snapshot and streams forward.
- `DeltaFollowUpScanner`: Produces delta plans between consecutive snapshots for streaming consumption.

feat: add table scan, split generation and read infrastructure

e89a43a

lszskye wants to merge 1 commit into apache:main from lszskye:p12-7
