feat: add table scan, split generation and read infrastructure #131

Open
lszskye wants to merge 1 commit into apache:main from lszskye:p12-7

lszskye commented Jun 25, 2026

Contributor

Purpose

AppendOnlySplitGenerator

  • SplitGenerator implementation for append-only tables. Packs data files into splits using bin-packing ordered by bucket mode, with configurable target split size and open file cost.
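
To make the packing idea concrete, here is a minimal Python sketch of ordered bin-packing with a target split size and an open file cost. It is not the code in this PR: `DataFile`, `pack_ordered`, and the rule of weighting each file as `max(size, open_file_cost)` are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file's metadata."""
    name: str
    size: int  # bytes


def pack_ordered(files, target_split_size, open_file_cost):
    """Greedy ordered bin-packing: keep the input order and start a new
    split when adding the next file would exceed the target size."""
    splits, current, current_weight = [], [], 0
    for f in files:
        # Assumption: tiny files are weighted up to the open file cost so
        # that one split does not collect an unbounded number of them.
        weight = max(f.size, open_file_cost)
        if current and current_weight + weight > target_split_size:
            splits.append(current)
            current, current_weight = [], 0
        current.append(f)
        current_weight += weight
    if current:
        splits.append(current)
    return splits


files = [DataFile("a", 60), DataFile("b", 50), DataFile("c", 1),
         DataFile("d", 1), DataFile("e", 120)]
splits = pack_ordered(files, target_split_size=100, open_file_cost=10)
split_names = [[f.name for f in s] for s in splits]
```

In this sketch a file larger than the target size still gets a split of its own, which keeps the packing loop free of special cases.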

AppendOnlyTableRead

  • TableRead implementation for append-only tables. Creates BatchReader instances from data splits using a chain of SplitRead operations configured via InternalReadContext.

DataEvolutionBatchScan

  • AbstractTableScan implementation that wraps a DataTableBatchScan with global index evaluation. Evaluates global indexes to produce row ranges and scores, then wraps resulting splits as indexed splits for efficient filtered reading.

DataEvolutionSplitGenerator

  • SplitGenerator implementation for data evolution tables. Groups files by sequence number and packs them into splits respecting target split size, ensuring files from the same evolution epoch stay together.

DataTableBatchScan

  • AbstractTableScan implementation for batch planning. Uses a StartingScanner to determine the initial snapshot, generates splits via SnapshotReader, and supports push-down limit optimization to reduce the number of splits returned.
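
The push-down limit can be pictured as cutting the split list as soon as the accumulated row count covers the limit. The Python sketch below assumes every split knows its row count; `Split` and `apply_push_down_limit` are hypothetical names and do not come from this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Split:
    """Hypothetical split carrying a known row count."""
    name: str
    row_count: int


def apply_push_down_limit(splits: List[Split], limit: Optional[int]) -> List[Split]:
    """Return the shortest prefix of splits whose rows cover the limit."""
    if limit is None:
        return list(splits)
    kept, rows = [], 0
    for split in splits:
        if rows >= limit:
            break  # enough rows are already planned
        kept.append(split)
        rows += split.row_count
    return kept


all_splits = [Split("s1", 100), Split("s2", 200), Split("s3", 300)]
limited = [s.name for s in apply_push_down_limit(all_splits, 250)]
```

The reader still has to enforce the exact limit; the scan only avoids planning splits that cannot be needed.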

DataTableStreamScan

  • AbstractTableScan implementation for streaming planning. Performs an initial scan via StartingScanner, then incrementally follows new snapshots using FollowUpScanner to produce delta plans for continuous processing.
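
The two planning phases can be sketched as a small state machine: the first `plan()` call returns the full content of the latest snapshot, and later calls return one delta at a time. The Python below is only an illustration; the `Snapshot` fields and the rule that only `APPEND` snapshots are scanned during follow-up are assumptions, not taken from this PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    """Hypothetical snapshot: its commit kind, new files, and all live files."""
    id: int
    kind: str  # "APPEND" or "COMPACT"
    delta_files: List[str]
    all_files: List[str]


class StreamScanSketch:
    def __init__(self, snapshots: Dict[int, Snapshot]):
        self.snapshots = snapshots
        self.next_id: Optional[int] = None  # None until the initial plan is done

    def plan(self) -> List[str]:
        if self.next_id is None:
            if not self.snapshots:
                return []  # no snapshot yet, the caller retries later
            latest = max(self.snapshots)
            self.next_id = latest + 1
            return list(self.snapshots[latest].all_files)
        # Follow-up phase: advance through new snapshots, skipping those
        # that do not need scanning (assumed here: anything but APPEND).
        while self.next_id in self.snapshots:
            snapshot = self.snapshots[self.next_id]
            self.next_id += 1
            if snapshot.kind == "APPEND":
                return list(snapshot.delta_files)
        return []


store = {
    1: Snapshot(1, "APPEND", ["f1"], ["f1"]),
    2: Snapshot(2, "APPEND", ["f2"], ["f1", "f2"]),
}
scan = StreamScanSketch(store)
initial_plan = scan.plan()
idle_plan = scan.plan()
store[3] = Snapshot(3, "COMPACT", [], ["f12"])
store[4] = Snapshot(4, "APPEND", ["f3"], ["f12", "f3"])
delta_plan = scan.plan()
```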

FallbackTableRead

  • TableRead decorator that tries a main table read first and falls back to an alternative table read if the main one fails. Used for backward compatibility when reading older data formats.
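
The decorator pattern described here can be sketched in a few lines of Python. The names `FallbackReadSketch` and `UnsupportedFormatError`, and the choice to fall back only on that one error type, are assumptions for illustration; the PR may decide when to fall back differently.

```python
class UnsupportedFormatError(Exception):
    """Hypothetical error raised when a reader cannot handle a split."""


class FallbackReadSketch:
    """Tries the main read first and delegates to the fallback on failure."""

    def __init__(self, main_read, fallback_read):
        self.main_read = main_read
        self.fallback_read = fallback_read

    def create_reader(self, split):
        try:
            return self.main_read(split)
        except UnsupportedFormatError:
            # Only the expected failure triggers the fallback; other
            # errors propagate so that real bugs are not hidden.
            return self.fallback_read(split)


def read_v2(split):
    if split["format"] != "v2":
        raise UnsupportedFormatError(split["format"])
    return "v2-reader:" + split["name"]


def read_v1(split):
    return "v1-reader:" + split["name"]


table_read = FallbackReadSketch(read_v2, read_v1)
new_reader = table_read.create_reader({"name": "s1", "format": "v2"})
old_reader = table_read.create_reader({"name": "s2", "format": "v1"})
```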

KeyValueTableRead

  • TableRead implementation for primary-key tables with merge semantics. Creates readers that apply merge engine logic (deduplication, partial update, aggregation) across sorted runs. Supports ForceKeepDelete to retain delete records in output.
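
As an illustration of merge semantics, the Python sketch below applies a deduplicate-style merge across sorted runs: for each key, the record with the highest sequence number wins, and delete records are dropped unless explicitly kept. The record layout `(key, sequence, kind, value)`, the kind markers `"+I"` and `"-D"`, and the `keep_delete` flag are assumptions made for this example.

```python
import heapq
from itertools import groupby


def merge_deduplicate(sorted_runs, keep_delete=False):
    """Merge runs that are each sorted by key and keep the newest record per key."""
    # heapq.merge needs each input sorted by the same key function; every
    # run here holds at most one record per key, so that holds.
    merged = heapq.merge(*sorted_runs, key=lambda r: (r[0], r[1]))
    result = []
    for _, group in groupby(merged, key=lambda r: r[0]):
        latest = list(group)[-1]  # highest sequence number for this key
        if latest[2] == "-D" and not keep_delete:
            continue  # the key was deleted, emit nothing
        result.append(latest)
    return result


older_run = [(1, 1, "+I", "a"), (2, 2, "+I", "b"), (3, 3, "+I", "c")]
newer_run = [(2, 4, "-D", None), (3, 5, "+I", "c2")]
merged_rows = merge_deduplicate([older_run, newer_run])
merged_with_deletes = merge_deduplicate([older_run, newer_run], keep_delete=True)
```

Partial update and aggregation would replace the "take the last record" step with a function that folds all records of the group together.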

MergeTreeSplitGenerator

  • SplitGenerator implementation for merge-tree (primary-key) tables. Organizes files into sorted runs by level and sequence number, then packs non-overlapping key ranges into splits. Respects deletion vector settings and merge engine configuration.
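
One way to picture the key-range step is interval partitioning: files whose key ranges overlap must be merged together, so they land in the same section, while sections that do not overlap can be read independently. The Python sketch below shows only this grouping; `FileMeta` and `partition_by_key_range` are hypothetical names, and levels, deletion vectors, and size-based packing are left out.

```python
from dataclasses import dataclass


@dataclass
class FileMeta:
    """Hypothetical file metadata with an inclusive key range."""
    name: str
    min_key: int
    max_key: int


def partition_by_key_range(files):
    """Group files into sections so that no key range crosses a section."""
    sections, current, bound = [], [], None
    for f in sorted(files, key=lambda f: (f.min_key, f.max_key)):
        # A file starting after the largest key seen so far cannot overlap
        # anything in the current section, so it opens a new one.
        if current and f.min_key > bound:
            sections.append(current)
            current, bound = [], None
        current.append(f)
        bound = f.max_key if bound is None else max(bound, f.max_key)
    if current:
        sections.append(current)
    return sections


metas = [FileMeta("e", 20, 30), FileMeta("a", 1, 5), FileMeta("c", 10, 12),
         FileMeta("b", 3, 8), FileMeta("d", 12, 15)]
section_names = [[f.name for f in s] for s in partition_by_key_range(metas)]
```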

SnapshotReader

  • Reads data files from a snapshot and generates splits. Orchestrates FileStoreScan for file discovery, SplitGenerator for split packing, and IndexFileHandler for deletion vector resolution. Supports scan mode, level filtering, value filtering, row range indexing, and real-bucket-only modes.

StartingScanner

  • Abstract helper for the first planning of a TableScan. Defines three result types: NoSnapshot (wait for data), CurrentSnapshot (initial plan with snapshot id), and NextSnapshot (skip current, start from next). Concrete implementations include full, static-from-snapshot, static-from-tag, continuous-from-snapshot, and continuous-latest scanners.
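
The three result types behave like a tagged union that tells the scan where to continue. The Python sketch below mirrors the names from the description, but the fields and the `next_snapshot_to_follow` helper are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class NoSnapshot:
    """No snapshot exists yet; the caller should wait for data."""


@dataclass
class CurrentSnapshot:
    """An initial plan was produced from this snapshot."""
    snapshot_id: int
    splits: List[str] = field(default_factory=list)


@dataclass
class NextSnapshot:
    """Nothing to read now; start from this snapshot id."""
    next_snapshot_id: int


ScanResult = Union[NoSnapshot, CurrentSnapshot, NextSnapshot]


def next_snapshot_to_follow(result: ScanResult) -> Optional[int]:
    """Return the snapshot id the follow-up phase should look at next."""
    if isinstance(result, NoSnapshot):
        return None  # retry the starting scan later
    if isinstance(result, CurrentSnapshot):
        return result.snapshot_id + 1
    if isinstance(result, NextSnapshot):
        return result.next_snapshot_id
    raise TypeError(f"unexpected result type: {type(result).__name__}")
```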

FollowUpScanner

  • Abstract helper for incremental streaming planning. Determines whether a snapshot needs scanning and produces delta plans for new data. The concrete implementation listed in this PR is DeltaFollowUpScanner.

Snapshot Scanner Implementations

  • FullStartingScanner: Scans all data from the latest snapshot for initial full read.
  • StaticFromSnapshotStartingScanner: Starts from a specific snapshot id for point-in-time reads.
  • StaticFromTagStartingScanner: Starts from a named tag for labeled snapshot reads.
  • ContinuousFromSnapshotStartingScanner: Starts from a snapshot and continues streaming deltas.
  • ContinuousFromSnapshotFullStartingScanner: Full scan from a snapshot then continues with deltas.
  • ContinuousLatestStartingScanner: Starts from the latest snapshot and streams forward.
  • DeltaFollowUpScanner: Produces delta plans between consecutive snapshots for streaming consumption.
