store: split dump batches by byte size to avoid 2 GiB Arrow column overflow (#6609) by cargopete · Pull Request #6646 · graphprotocol/graph-node

cargopete · 2026-06-25T11:23:39Z

Summary

Fixes #6609. graphman dump panics on large tables. Each fetched batch is turned
into a single Arrow RecordBatch whose string/binary columns use the 32-bit-offset
Utf8/Binary types, which overflow once a single column accumulates more than
~2 GiB. The VidBatcher sizes fetches by elapsed time with no byte ceiling, so on a
large table a string/binary column crosses the limit and Arrow panics on offset
overflow.
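
For illustration, the overflow can be reproduced with plain arithmetic and no Arrow dependency. The sketch below assumes the usual variable-width layout, where each value's end offset is the running sum of value lengths stored as an i32; `first_overflowing_value` is a hypothetical helper written for this note and is not part of graph-node or arrow-rs.

```rust
/// Walk the value lengths of one string/binary column and track the running
/// end offset, as a 32-bit-offset Utf8/Binary column would store it.
/// Returns the index of the first value whose end offset no longer fits in i32.
/// (Hypothetical helper for illustration only.)
fn first_overflowing_value(value_lens: &[usize]) -> Option<usize> {
    let mut end: i64 = 0;
    for (i, len) in value_lens.iter().enumerate() {
        end += *len as i64;
        // i32::MAX is 2_147_483_647, one byte short of 2 GiB.
        if i32::try_from(end).is_err() {
            return Some(i);
        }
    }
    None
}
```

Two 1 GiB values in the same column are already enough to overflow, which is why a time-sized fetch with no byte ceiling can hit the limit.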

Fix

Decouple the Arrow batch from the SQL fetch: split each fetched batch into
contiguous byte-bounded sub-slices (≤ 256 MiB each) and write one RecordBatch
per slice. Because each slice's estimated payload stays well under 2 GiB, no
individual column can reach the i32 offset limit. A single row larger than the
cap gets a slice of its own, and Postgres' ~1 GiB per-field cap keeps each of
its columns below the limit. The on-disk Parquet format and the restore/reader
path are unchanged; a dump simply produces more row groups.
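
A minimal sketch of the slicing idea, assuming per-row byte estimates are already available as a slice of `usize`. The name `byte_bounded_slices` and the 256 MiB constant come from this PR, but the signature and body below are assumptions made for illustration and may differ from the actual code in relational/dump.rs.

```rust
use std::ops::Range;

/// Cap on the estimated payload of one RecordBatch (256 MiB, as in the PR).
const MAX_RECORD_BATCH_BYTES: usize = 256 * 1024 * 1024;

/// Split rows into contiguous index ranges whose estimated payload stays at
/// or below `max_bytes`. Signature is hypothetical, for illustration only.
fn byte_bounded_slices(row_bytes: &[usize], max_bytes: usize) -> Vec<Range<usize>> {
    let mut slices = Vec::new();
    let mut start = 0;
    let mut acc = 0usize;
    for (i, &bytes) in row_bytes.iter().enumerate() {
        // Close the current slice if adding this row would exceed the cap.
        // A slice is never closed while empty, so a row that is larger than
        // the cap on its own still ends up in a slice by itself.
        if i > start && acc.saturating_add(bytes) > max_bytes {
            slices.push(start..i);
            start = i;
            acc = 0;
        }
        acc = acc.saturating_add(bytes);
    }
    // Flush the trailing slice, if any rows remain.
    if start < row_bytes.len() {
        slices.push(start..row_bytes.len());
    }
    slices
}
```

The ranges are contiguous and cover every row exactly once, so writing one RecordBatch per range loses no data. Only an oversized single row can exceed the cap.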

parquet/convert.rs: estimate_row_bytes — conservative per-row payload estimate.
relational/dump.rs: MAX_RECORD_BATCH_BYTES (256 MiB) + byte_bounded_slices;
the entity-table dump loop writes one RecordBatch per byte-bounded slice.
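
As a rough sketch of what a conservative per-row estimate could look like: the `Value` enum and the 8-byte per-value overhead below are assumptions made for this example and do not reflect the real types or constants in parquet/convert.rs. The idea is to over-count slightly, so that a slice never holds more payload than its estimate suggests.

```rust
/// Hypothetical stand-in for a dumped column value (not graph-node's type).
enum Value {
    Null,
    Int(i64),
    Text(String),
    Bytes(Vec<u8>),
    List(Vec<Value>),
}

/// Assumed fixed allowance per value for offsets and validity bookkeeping.
const PER_VALUE_OVERHEAD: usize = 8;

/// Estimate one value as its payload bytes plus the fixed allowance.
fn estimate_value_bytes(v: &Value) -> usize {
    let payload = match v {
        Value::Null => 0,
        Value::Int(_) => 8,
        Value::Text(s) => s.len(),
        Value::Bytes(b) => b.len(),
        // Lists count every element, including each element's allowance.
        Value::List(items) => items.iter().map(estimate_value_bytes).sum(),
    };
    PER_VALUE_OVERHEAD + payload
}

/// Sum the estimates of all values in a row.
fn estimate_row_bytes(row: &[Value]) -> usize {
    row.iter().map(estimate_value_bytes).sum()
}
```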

The clamp loops (vid + block_range_end, fixed-width) and the tiny data_sources$
table are not subject to the 2 GiB limit and are left unchanged.

Testing

New unit test byte_bounded_slices_split_and_preserve_rows (packing, no-data-loss,
per-slice cap, oversized-single-row).
Existing parquet dump/restore module tests still pass (29).
cargo fmt, cargo clippy --all-targets, and cargo check --release clean for
graph-store-postgres.

⚠️ Not yet validated end-to-end against a real >2 GiB-per-column table — that requires
a large populated database. @mindstyle85 this is the case you hit; a test run against
your large subgraph would be much appreciated.


mindstyle85 · 2026-06-25T11:32:11Z

hey, sure will test it out, we are still in the process of moving some subgraphs around, but we are using what Vincent made right now and it works fine

once we are done with migrations we can try on the old db :)

Commit e22a220: store: split dump batches by byte size to avoid 2GB Arrow column overflow

cargopete wants to merge 1 commit into graphprotocol:master from cargopete:fix/6609-dump-large-subgraph-2gb-overflow