Summa: fix broadcast-root world rank to include the grid offset #571
Merged
get_row_group_root/get_col_group_root reconstructed the broadcast root's world rank by hand as rank_row*proc_cols + group_root (row) / symmetric (col), omitting ProcGrid::rank_offset_. make_group inserts the root via ProcGrid::map_col/map_row, which DO add rank_offset_, so on an offset grid (proc_h_ > 1, the h-grouped 3-D batched SUMMA) the hand-computed world rank is not the value stored in the group: madness::Group::rank() returns -1, tripping MADNESS_ASSERT(group_root >= 0) in WorldGopInterface::bcast and aborting. The branch is only reached when the broadcast group is pruned below proc_cols/proc_rows (a block-sparse operand), so the bug needs both sparsity AND a nonzero offset (batched 3-D grid) -- e.g. MPQC PNO-CCSD block-sparse (g.C) ToT*dense SUMMA on >= 4 ranks. Ordinary 2-D grids have rank_offset_ == 0, so it stayed latent. Compute world_root via the same map_col/map_row primitive make_group uses, so the root is guaranteed to be the in-group owner; TA_ASSERT(group_root >= 0) to fail at the source. Adds tests/proc_grid.cpp regression summa_bcast_root_offset.
evaleev added a commit to ValeevGroup/SeQuant that referenced this pull request on Jun 30, 2026
…fix) Pulls in the fix for get_row_group_root/get_col_group_root omitting ProcGrid::rank_offset_ when reconstructing the broadcast root's world rank, which aborted block-sparse batched (proc_h_>1) SUMMA contractions on >=4 ranks. TiledArray PR ValeevGroup/tiledarray#571.
evaleev added a commit to ValeevGroup/SeQuant that referenced this pull request on Jun 30, 2026
…x merged) TiledArray PR ValeevGroup/tiledarray#571 (broadcast-root rank_offset_ fix) is merged; track master tip instead of the feature-branch commit.
Problem
Summa::get_row_group_root/get_col_group_root reconstruct the broadcast root's world rank by hand as rank_row*proc_cols + group_root (row; symmetric for col), omitting ProcGrid::rank_offset_. The sparse group factory make_group inserts that root via ProcGrid::map_col/map_row, which do add rank_offset_. On an offset grid (proc_h_ > 1, the h-grouped 3-D batched SUMMA), rank_offset_ > 0, so the hand-computed world rank is not the value stored in the group: madness::Group::rank() returns -1, tripping MADNESS_ASSERT(group_root >= 0) in WorldGopInterface::bcast → SIGABRT.

The broken branch is only reached when the broadcast group is pruned below proc_cols/proc_rows (a block-sparse operand), so the bug needs both a pruned sparse group and a nonzero offset (batched 3-D grid). Plain 2-D dense grids have rank_offset_ == 0, which is why it stayed latent (and why np=1,2 CI never caught it). Encountered in MPQC PNO-CCSD (block-sparse (g·C) ToT × dense SUMMA) on 8 ranks.

Fix
Compute world_root via the same map_col/map_row primitive make_group uses, so the root is guaranteed to be the in-group owner; TA_ASSERT(group_root >= 0) to fail at the source. Applied to both row and col helpers.

Verification
- Before the fix: MADNESS ASSERTION FAILED at np=8 in general_product_distributed_suite/dist_sparse (passes at np≤2, where proc_h_=1).
- After the fix: dist_sparse and the full general_product_distributed_suite pass at np=4 and np=8; no regressions in dist_eval_contraction_eval / expressions / expressions_sparse at np=2,8.
- New regression test: tests/proc_grid.cpp::summa_bcast_root_offset (np=1/2/8) validates that, on an offset ProcGrid, map_col/map_row include rank_offset_, so the root the fix uses is the in-group world rank.