Antalya 26.5: Parallelize reads from a single Parquet file in StorageFile #1970
Open
zvonand wants to merge 2 commits into
…next commit)
---
Original cherry-pick message follows:
Merge pull request #1806 from Altinity/feature/antalya-26.3/ClickHouse-ClickHouse-pr-104251
Antalya 26.3: Parallelize reads from a single Parquet file in StorageFile
# Conflicts:
#	src/Processors/Formats/Impl/ParquetBlockInputFormat.cpp
#	src/Processors/Formats/Impl/ParquetBlockInputFormat.h
Both ParquetBlockInputFormat.cpp and ParquetBlockInputFormat.h were new to antalya-26.5 (they existed on antalya-26.3 but had not been ported yet). Git reported a conflict because the cherry-pick diff tried to modify an existing file at line 1418, but the file was absent on the target branch.
The resolution takes the full file content from the merge commit (antalya-26.3 base + PR#1806 additions), which is the correct outcome:
- splitToBucketsByCount is added to both files (bucket-1: in the source PR's first-parent diff).
- filterByMatchingRowGroups is preserved (the first-parent diff does NOT remove it; only the feature-branch-vs-base diff does; keeping it is required because IBucketSplitter::filterByMatchingRowGroups is pure virtual on antalya-26.5).
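Why dropping filterByMatchingRowGroups would break the build can be illustrated with a minimal sketch. The interface below is a hypothetical, simplified stand-in: the real IBucketSplitter signatures are not shown in this PR, so the parameter and return types here are assumptions chosen only to demonstrate the pure-virtual rule.

```cpp
#include <type_traits>
#include <vector>

// Hypothetical, simplified interface. The real IBucketSplitter signatures
// are not part of this PR text; the types here are illustrative only.
struct IBucketSplitterSketch
{
    virtual ~IBucketSplitterSketch() = default;
    virtual std::vector<int> splitToBucketsByCount(int total, int buckets) = 0;
    virtual std::vector<int> filterByMatchingRowGroups(const std::vector<int> & row_groups) = 0;
};

// Overrides only one of the two pure virtual methods, so it stays abstract
// and cannot be instantiated.
struct IncompleteSplitter : IBucketSplitterSketch
{
    std::vector<int> splitToBucketsByCount(int, int) override { return {}; }
};

// Keeping the second override makes the class concrete again.
struct CompleteSplitter : IncompleteSplitter
{
    std::vector<int> filterByMatchingRowGroups(const std::vector<int> & row_groups) override
    {
        return row_groups;
    }
};

static_assert(std::is_abstract_v<IncompleteSplitter>);
static_assert(!std::is_abstract_v<CompleteSplitter>);
```

Under these assumptions, a splitter that lost its filterByMatchingRowGroups override would fail to compile wherever it is instantiated, which is consistent with the resolution keeping the method.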
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Reading a single large local Parquet file via `file()`/`File` engine is now parallelised across multiple sources, each handling a subset of row groups. This eliminates a `Resize 1 → N` bottleneck in the pipeline and brings single-file ClickBench performance close to the partitioned variant: Q23 goes from ~1.4s to ~0.55s, Q22 from ~0.9s to ~0.48s, Q27 from ~1.6s to ~0.54s on 96 vCPUs (ClickHouse#104251 by @alexey-milovidov).

Cherry-picked from ClickHouse#104251.
On ClickBench, single-file Parquet runs are 3–9× slower than the 100-file partitioned runs on the same data (e.g. on `c7a.metal-48xl`, Q23 is `8.90s` vs `0.99s`, Q22 `1.82s` vs `0.41s`, Q27 `1.21s` vs `0.45s`). The cause is in `StorageFile`: when reading a single splittable file it creates exactly one `ParquetV3BlockInputFormat` source, so the pipeline becomes `File 0 → 1` followed by `Resize 1 → 96`. That fan-out is a serialization point: every chunk has to leave the single source through one `read` before any of the 96 aggregators can touch it, so most cores sit idle.

The bucket-splitting machinery (`ParquetBucketSplitter`, `setBucketsToRead`, `FileBucketInfo`) already existed for cluster mode but was never wired into `StorageFile`. This PR wires it in:

- `IBucketSplitter::splitToBucketsByCount` returning roughly N contiguous row-group ranges; Parquet implements it.
- `FormatFactory::checkFormatHasSplitter` so callers can probe without throwing.
- `StorageFile::ReadFromFile::initializePipeline`, when reading exactly one local splittable file, asks the splitter for `max_num_streams` buckets and creates one `StorageFileSource` per bucket. Each source carries `fixed_file_path` + `file_bucket_info` and skips the shared `FilesIterator`.
- `ParquetV3BlockInputFormat::read` honours `buckets_to_read` in the trivial-count path so each bucket only reports its own row count.

Pipeline becomes `File × N 0 → 1` straight into the aggregators, matching the partitioned variant (#1806 by @zvonand).

Cherry-picked from #1806.
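For reviewers who want a mental model of the split, here is a minimal, self-contained sketch of dividing row groups into roughly equal contiguous ranges, plus a per-bucket row count in the spirit of the trivial-count path. It is not the code in this PR: `RowGroupRange`, `splitRowGroupsByCount` and `countRowsInRange` are hypothetical names, and the even-split policy (sizes differing by at most one) is an assumption about what "roughly N contiguous row-group ranges" means.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a bucket: half-open range [begin, end) of row-group indices.
using RowGroupRange = std::pair<size_t, size_t>;

// Split `num_row_groups` row groups into at most `num_buckets` contiguous,
// non-empty ranges whose sizes differ by at most one (assumed policy).
std::vector<RowGroupRange> splitRowGroupsByCount(size_t num_row_groups, size_t num_buckets)
{
    std::vector<RowGroupRange> ranges;
    if (num_row_groups == 0 || num_buckets == 0)
        return ranges;

    // Never create more buckets than there are row groups.
    const size_t buckets = std::min(num_row_groups, num_buckets);
    const size_t base = num_row_groups / buckets;
    const size_t extra = num_row_groups % buckets;

    size_t begin = 0;
    for (size_t i = 0; i < buckets; ++i)
    {
        // The first `extra` buckets take one additional row group.
        const size_t size = base + (i < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}

// Each bucket counts only the rows of its own row groups, so the per-bucket
// counts add up to the file total without double counting.
size_t countRowsInRange(const std::vector<size_t> & rows_per_group, RowGroupRange range)
{
    size_t total = 0;
    for (size_t i = range.first; i < range.second; ++i)
        total += rows_per_group[i];
    return total;
}
```

With the numbers from this PR (226 row groups, 96 streams), this policy would give 34 buckets of three row groups and 62 buckets of two. The actual splitter may distribute them differently.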
Results
96-vCPU box, `hits.parquet` (14 GiB, 226 row groups):

CPU utilisation on Q23 jumped from ~6× to ~18× of 96 cores. Aggregate results (`count`, `sum(UserID)`, `sum(length(URL))`, Q21, Q23) match the partitioned variant exactly. The remaining ~1.3× gap to partitioned is per-source initialization overhead: each bucket source still reads the 14 GB file's footer separately. Sharing parsed metadata for local files is the obvious next step but a much bigger change.

Documentation entry for user-facing changes