Skip to content

Summa: fix broadcast-root world rank to include the grid offset#571

Merged
evaleev merged 1 commit into
masterfrom
evaleev/fix/summa-bcast-root-offset
Jun 30, 2026
Merged

Summa: fix broadcast-root world rank to include the grid offset#571
evaleev merged 1 commit into
masterfrom
evaleev/fix/summa-bcast-root-offset

Conversation

@evaleev

@evaleev evaleev commented Jun 30, 2026

Copy link
Copy Markdown
Member

Problem

Summa::get_row_group_root/get_col_group_root reconstruct the broadcast root's world rank by hand as rank_row*proc_cols + group_root (row; symmetric for col), omitting ProcGrid::rank_offset_. The sparse group factory make_group inserts that root via ProcGrid::map_col/map_row, which do add rank_offset_. On an offset grid (proc_h_ > 1 — the h-grouped 3-D batched SUMMA), rank_offset_ > 0, so the hand-computed world rank is not the value stored in the group: madness::Group::rank() returns -1, tripping MADNESS_ASSERT(group_root >= 0) in WorldGopInterface::bcast → SIGABRT.

The broken branch is only reached when the broadcast group is pruned below proc_cols/proc_rows (a block-sparse operand), so the bug needs both a pruned sparse group and a nonzero offset (batched 3-D grid). Plain 2-D dense grids have rank_offset_ == 0, which is why it stayed latent (and why np=1,2 CI never caught it). Encountered in MPQC PNO-CCSD (block-sparse (g·C) ToT×dense SUMMA) on 8 ranks.

Fix

Compute world_root via the same map_col/map_row primitive make_group uses, so the root is guaranteed to be the in-group owner; TA_ASSERT(group_root >= 0) to fail at the source. Applied to both row and col helpers.

Verification

  • Reverting only the fix reproduces MADNESS ASSERTION FAILED at np=8 in general_product_distributed_suite/dist_sparse (passes at np≤2 where proc_h_=1).
  • With the fix, dist_sparse and the full general_product_distributed_suite pass at np=4 and np=8; no regressions in dist_eval_contraction_eval/expressions/expressions_sparse at np=2,8.
  • New regression: tests/proc_grid.cpp::summa_bcast_root_offset (np=1/2/8) — validates that on an offset ProcGrid, map_col/map_row include rank_offset_ so the root the fix uses is the in-group world rank.

get_row_group_root/get_col_group_root reconstructed the broadcast root's world
rank by hand as rank_row*proc_cols + group_root (row) / symmetric (col),
omitting ProcGrid::rank_offset_. make_group inserts the root via
ProcGrid::map_col/map_row, which DO add rank_offset_, so on an offset grid
(proc_h_ > 1, the h-grouped 3-d batched SUMMA) the hand-computed world rank is
not the value stored in the group: madness::Group::rank() returns -1, tripping
MADNESS_ASSERT(group_root >= 0) in WorldGopInterface::bcast and aborting.

The branch is only reached when the broadcast group is pruned below proc_cols/
proc_rows (a block-sparse operand), so the bug needs both sparsity AND a nonzero
offset (batched 3-d grid) -- e.g. MPQC PNO-CCSD block-sparse (g.C) ToT*dense
SUMMA on >= 4 ranks. Ordinary 2-d grids have rank_offset_ == 0, so it stayed
latent.

Compute world_root via the same map_col/map_row primitive make_group uses, so
the root is guaranteed to be the in-group owner; TA_ASSERT(group_root >= 0) to
fail at the source. Adds tests/proc_grid.cpp regression summa_bcast_root_offset.
evaleev added a commit to ValeevGroup/SeQuant that referenced this pull request Jun 30, 2026
…fix)

Pulls in the fix for get_row_group_root/get_col_group_root omitting
ProcGrid::rank_offset_ when reconstructing the broadcast root's world rank, which
aborted block-sparse batched (proc_h_>1) SUMMA contractions on >=4 ranks.
TiledArray PR ValeevGroup/tiledarray#571.
@evaleev evaleev merged commit 35c1d52 into master Jun 30, 2026
16 of 17 checks passed
@evaleev evaleev deleted the evaleev/fix/summa-bcast-root-offset branch June 30, 2026 10:53
evaleev added a commit to ValeevGroup/SeQuant that referenced this pull request Jun 30, 2026
…x merged)

TiledArray PR ValeevGroup/tiledarray#571 (broadcast-root rank_offset_ fix) is
merged; track master tip instead of the feature-branch commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant