store: split dump batches by byte size to avoid 2 GiB Arrow column overflow (#6609) #6646
Open
cargopete wants to merge 1 commit into
Hey, sure, will test it out. We are still in the process of moving some subgraphs around, but we are using what Vincent made right now and it works fine. Once we are done with migrations we can try on the old db :)
Summary
Fixes #6609.
`graphman dump` panics on large tables. Each fetched batch is turned into a single Arrow
`RecordBatch` whose string/binary columns use the 32-bit-offset `Utf8`/`Binary` types, which
overflow once a single column accumulates more than ~2 GiB. The `VidBatcher` sizes fetches by
elapsed time with no byte ceiling, so on a large table a string/binary column crosses the
limit and Arrow panics on offset overflow.
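To make the limit concrete, here is a minimal sketch of the arithmetic only. It does not use the Arrow crate, and `first_offset_overflow` is a hypothetical helper, not something in this PR. It assumes what the summary states: a `Utf8`/`Binary` column stores cumulative byte offsets as `i32`, so the running total of value lengths must stay at or below `i32::MAX`.

```rust
/// Returns the index of the first value whose cumulative end offset no longer
/// fits in an `i32`, or `None` if every offset fits.
/// Hypothetical helper for illustration; not part of graph-node or Arrow.
fn first_offset_overflow(value_lens: &[u64]) -> Option<usize> {
    let mut end: u64 = 0;
    for (i, len) in value_lens.iter().enumerate() {
        // Offsets are cumulative: each value ends where the next one starts.
        end = end.saturating_add(*len);
        if end > i32::MAX as u64 {
            return Some(i);
        }
    }
    None
}
```

Two 1 GiB values in the same column are already enough to cross the limit, which is why a time-sized fetch with no byte ceiling can trigger the panic.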
Fix
Decouple the Arrow batch from the SQL fetch: split each fetched batch into
contiguous byte-bounded sub-slices (≤ 256 MiB each) and write one `RecordBatch` per slice.
Because a batch's total payload stays well under 2 GiB, no individual
column can reach the i32 offset limit, and Postgres' ~1 GiB per-field cap keeps even
a single oversized row safe. The on-disk Parquet format and the restore/reader path
are unchanged; a dump simply produces more row groups.
- `parquet/convert.rs`: `estimate_row_bytes`, a conservative per-row payload estimate.
- `relational/dump.rs`: `MAX_RECORD_BATCH_BYTES` (256 MiB) + `byte_bounded_slices`; the entity-table dump loop writes one `RecordBatch` per byte-bounded slice.
- The clamp loops (`vid` + `block_range_end`, fixed-width) and the tiny `data_sources$` table are not subject to the 2 GiB limit and are left unchanged.
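The splitting step described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the names `byte_bounded_slices` and `MAX_RECORD_BATCH_BYTES` come from the description, but the signature (a slice of per-row byte estimates in, index ranges out) and the body are guesses at one reasonable shape.

```rust
use std::ops::Range;

/// Cap from the description (256 MiB); the exact definition in the PR may differ.
const MAX_RECORD_BATCH_BYTES: usize = 256 * 1024 * 1024;

/// Splits rows `0..row_bytes.len()` into contiguous ranges whose estimated
/// sizes sum to at most `max_bytes`. A row that alone exceeds `max_bytes`
/// gets a range of its own, so no row is ever dropped.
/// Assumed signature for illustration only.
fn byte_bounded_slices(row_bytes: &[usize], max_bytes: usize) -> Vec<Range<usize>> {
    let mut slices = Vec::new();
    let mut start = 0;
    let mut acc: usize = 0;
    for (i, &bytes) in row_bytes.iter().enumerate() {
        // Close the current slice if it is non-empty and adding this row
        // would push it over the cap.
        if i > start && acc.saturating_add(bytes) > max_bytes {
            slices.push(start..i);
            start = i;
            acc = 0;
        }
        acc = acc.saturating_add(bytes);
    }
    // Flush the trailing slice, if any rows remain.
    if start < row_bytes.len() {
        slices.push(start..row_bytes.len());
    }
    slices
}
```

A caller would pass the per-row estimates for one fetched batch together with `MAX_RECORD_BATCH_BYTES` and write one `RecordBatch` per returned range. Because the ranges are contiguous and cover every index exactly once, row order and row count are preserved.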
Testing
- `byte_bounded_slices_split_and_preserve_rows` (packing, no-data-loss, per-slice cap, oversized-single-row).
- `cargo fmt`, `cargo clippy --all-targets`, and `cargo check --release` clean for `graph-store-postgres`.
- [...] a large populated database. @mindstyle85, this is the case you hit; a test run against your large subgraph would be much appreciated.