feat: add table scan, split generation and read infrastructure #131
Open
lszskye wants to merge 1 commit into
Purpose
AppendOnlySplitGenerator: SplitGenerator implementation for append-only tables. Packs data files into splits using bin-packing ordered by bucket mode, with configurable target split size and open file cost.
AppendOnlyTableRead: TableRead implementation for append-only tables. Creates BatchReader instances from data splits using a chain of SplitRead operations configured via InternalReadContext.
DataEvolutionBatchScan: AbstractTableScan implementation that wraps a DataTableBatchScan with global index evaluation. Evaluates global indexes to produce row ranges and scores, then wraps the resulting splits as indexed splits for efficient filtered reading.
DataEvolutionSplitGenerator: SplitGenerator implementation for data evolution tables. Groups files by sequence number and packs them into splits respecting the target split size, ensuring files from the same evolution epoch stay together.
DataTableBatchScan: AbstractTableScan implementation for batch planning. Uses a StartingScanner to determine the initial snapshot, generates splits via SnapshotReader, and supports push-down limit optimization to reduce the number of splits returned.
DataTableStreamScan: AbstractTableScan implementation for streaming planning. Performs an initial scan via StartingScanner, then incrementally follows new snapshots using FollowUpScanner to produce delta plans for continuous processing.
FallbackTableRead: TableRead decorator that tries a main table read first and falls back to an alternative table read if the main one fails. Used for backward compatibility when reading older data formats.
KeyValueTableRead: TableRead implementation for primary-key tables with merge semantics. Creates readers that apply merge engine logic (deduplication, partial update, aggregation) across sorted runs. Supports ForceKeepDelete to retain delete records in output.
MergeTreeSplitGenerator: SplitGenerator implementation for merge-tree (primary-key) tables. Organizes files into sorted runs by level and sequence number, then packs non-overlapping key ranges into splits. Respects deletion vector settings and merge engine configuration.
SnapshotReader: Uses FileStoreScan for file discovery, SplitGenerator for split packing, and IndexFileHandler for deletion vector resolution. Supports scan mode, level filtering, value filtering, row range indexing, and real-bucket-only modes.
StartingScanner: Determines where a TableScan starts. Defines three result types: NoSnapshot (wait for data), CurrentSnapshot (initial plan with snapshot id), and NextSnapshot (skip current, start from next). Concrete implementations include full, static-from-snapshot, static-from-tag, continuous-from-snapshot, and continuous-latest scanners.
FollowUpScanner: Follows new snapshots after the initial scan; DataTableStreamScan uses it to produce delta plans.
Snapshot Scanner Implementations
FullStartingScanner: Scans all data from the latest snapshot for an initial full read.
StaticFromSnapshotStartingScanner: Starts from a specific snapshot id for point-in-time reads.
StaticFromTagStartingScanner: Starts from a named tag for labeled snapshot reads.
ContinuousFromSnapshotStartingScanner: Starts from a snapshot and continues streaming deltas.
ContinuousFromSnapshotFullStartingScanner: Performs a full scan from a snapshot, then continues with deltas.
ContinuousLatestStartingScanner: Starts from the latest snapshot and streams forward.
DeltaFollowUpScanner: Produces delta plans between consecutive snapshots for streaming consumption.
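
To make the split packing described for AppendOnlySplitGenerator more concrete, here is a minimal sketch of ordered bin-packing with a target split size and an open file cost. It is not the code from this PR: the function name `pack_ordered`, its parameters, and the rule that each file is weighted as `max(size, open_file_cost)` are assumptions for illustration, and Python is used only for brevity.

```python
from typing import List, Sequence


def pack_ordered(
    file_sizes: Sequence[int],
    target_split_size: int,
    open_file_cost: int,
) -> List[List[int]]:
    """Pack file sizes, in the given order, into splits of about target_split_size bytes."""
    splits: List[List[int]] = []
    current: List[int] = []
    current_weight = 0
    for size in file_sizes:
        # Assumption: every file is charged at least open_file_cost, so that
        # many tiny files do not all end up in a single split.
        weight = max(size, open_file_cost)
        # Close the current split when adding this file would exceed the target.
        if current and current_weight + weight > target_split_size:
            splits.append(current)
            current = []
            current_weight = 0
        current.append(size)
        current_weight += weight
    if current:
        splits.append(current)
    return splits
```

With a target of 100 and an open file cost of 20, `pack_ordered([60, 50, 10, 10, 100], 100, 20)` yields three splits. The two 10-byte files are each charged 20, which is why raising the open file cost spreads small files across more splits.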
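
The three StartingScanner result types can be sketched as small value classes, together with one way a streaming scan might decide which snapshot to follow next. This is a hypothetical model, not the PR's API: the field names `snapshot_id` and `next_snapshot_id`, the helper `next_id_to_follow`, and the assumption that follow-up begins at the snapshot after the one just planned are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class NoSnapshot:
    """No snapshot is available yet; the caller should wait for data."""


@dataclass(frozen=True)
class CurrentSnapshot:
    """An initial plan was produced for this snapshot id."""
    snapshot_id: int


@dataclass(frozen=True)
class NextSnapshot:
    """Nothing is planned now; streaming starts from this snapshot id."""
    next_snapshot_id: int


StartingResult = Union[NoSnapshot, CurrentSnapshot, NextSnapshot]


def next_id_to_follow(result: StartingResult) -> Optional[int]:
    """Return the snapshot id a follow-up scan would read next, or None to keep waiting."""
    if isinstance(result, NoSnapshot):
        return None
    if isinstance(result, CurrentSnapshot):
        # Assumption: the planned snapshot is consumed, so follow-up starts after it.
        return result.snapshot_id + 1
    return result.next_snapshot_id
```

Modeling the results as separate types forces the caller to handle the "wait for data" case explicitly, rather than encoding it as a sentinel snapshot id.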