store: split dump batches by byte size to avoid 2 GiB Arrow column overflow (#6609) #6646

Open
cargopete wants to merge 1 commit into graphprotocol:master from cargopete:fix/6609-dump-large-subgraph-2gb-overflow

@cargopete

Contributor

Summary

Fixes #6609. graphman dump panics on large tables. Each fetched batch is turned
into a single Arrow RecordBatch whose string/binary columns use the 32-bit-offset
Utf8/Binary types, which overflow once a single column accumulates more than
~2 GiB. The VidBatcher sizes fetches by elapsed time with no byte ceiling, so on a
large table a string/binary column crosses the limit and Arrow panics on offset
overflow.
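
To make the failure mode concrete, here is a minimal sketch of the offset arithmetic. It does not use the Arrow crate. It only models the fact that Utf8/Binary columns store each value's cumulative end offset as an i32. The function name `first_overflowing_value` is hypothetical and not part of this PR.

```rust
/// Hypothetical helper: given the byte length of each value in one
/// string/binary column, return the index of the first value whose
/// cumulative end offset no longer fits in an i32, or None if all fit.
fn first_overflowing_value(value_lens: &[usize]) -> Option<usize> {
    // Accumulate in i64 so the running total itself cannot wrap.
    let mut end: i64 = 0;
    for (i, len) in value_lens.iter().enumerate() {
        end += *len as i64;
        // 32-bit-offset columns need every end offset to be <= i32::MAX.
        if i32::try_from(end).is_err() {
            return Some(i);
        }
    }
    None
}
```

With 256 MiB values, seven fit in one column (1.75 GiB) and the eighth pushes the end offset to exactly 2 GiB, one byte past `i32::MAX`.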

Fix

Decouple the Arrow batch from the SQL fetch: split each fetched batch into
contiguous byte-bounded sub-slices (≤ 256 MiB each) and write one RecordBatch
per slice. Because each slice's total payload stays well under 2 GiB, no individual
column can reach the i32 offset limit. A single row larger than the cap still gets
its own slice, and Postgres' ~1 GiB per-field cap keeps that slice's columns below
the limit too. The on-disk Parquet format and the restore/reader path
are unchanged — a dump simply produces more row groups.

  • parquet/convert.rs: estimate_row_bytes — conservative per-row payload estimate.
  • relational/dump.rs: MAX_RECORD_BATCH_BYTES (256 MiB) + byte_bounded_slices;
    the entity-table dump loop writes one RecordBatch per byte-bounded slice.
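
For reviewers who want the idea without opening the diff, the splitting can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: it takes precomputed per-row byte estimates (standing in for what `estimate_row_bytes` returns) and yields index ranges, whereas the real `byte_bounded_slices` may have a different signature and operate on rows directly.

```rust
use std::ops::Range;

/// 256 MiB cap per RecordBatch, as stated in the description above.
const MAX_RECORD_BATCH_BYTES: usize = 256 * 1024 * 1024;

/// Sketch: split rows into contiguous index ranges whose estimated
/// payload stays at or below `max_bytes`. A single row that exceeds
/// the cap on its own is emitted as a one-row slice.
fn byte_bounded_slices(row_bytes: &[usize], max_bytes: usize) -> Vec<Range<usize>> {
    let mut slices = Vec::new();
    let mut start = 0;
    let mut acc = 0usize;
    for (i, &bytes) in row_bytes.iter().enumerate() {
        // Close the current slice if adding this row would exceed the cap.
        // `i > start` ensures a slice is never closed while empty, so an
        // oversized row still lands in a slice of its own.
        if i > start && acc.saturating_add(bytes) > max_bytes {
            slices.push(start..i);
            start = i;
            acc = 0;
        }
        acc = acc.saturating_add(bytes);
    }
    // Flush the trailing slice, if any rows remain.
    if start < row_bytes.len() {
        slices.push(start..row_bytes.len());
    }
    slices
}
```

The ranges are contiguous and cover every row exactly once, which is the no-data-loss property the new unit test checks. Each range would then be turned into one RecordBatch.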

The clamp loops (vid + block_range_end, fixed-width) and the tiny data_sources$
table are not subject to the 2 GiB limit and are left unchanged.

Testing

  • New unit test byte_bounded_slices_split_and_preserve_rows (packing, no-data-loss,
    per-slice cap, oversized-single-row).
  • Existing parquet dump/restore module tests still pass (29).
  • cargo fmt, cargo clippy --all-targets, and cargo check --release clean for
    graph-store-postgres.

⚠️ Not yet validated end-to-end against a real >2 GiB-per-column table — that requires
a large populated database. @mindstyle85 this is the case you hit; a test run against
your large subgraph would be much appreciated.

@mindstyle85


hey, sure will test it out, we are still in the process of moving some subgraphs around, but we are using what Vincent made right now and it works fine

once we are done with migrations we can try on the old db :)

Development

Successfully merging this pull request may close these issues.

[Bug] Graphman dump/restore cant handle larger subgraphs
