Add NCCL GIN / symmetric memory tutorial #3932
Draft
d4l3k wants to merge 3 commits into
Adds an unstable tutorial covering GPU-initiated networking with NCCL and PyTorch symmetric memory: enabling the NCCL backend, device-initiated one-shot all-reduce, one-sided put/signal operations, writing custom communication kernels in Python with CuTe DSL, multi-node GIN requirements, and pointers to the NCCL device API for custom C++ kernels.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3932
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 1d236c3 with merge base cb473bc.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replaces the C++-only GIN section with Python examples: nccl4py exposes the NCCL device API (including GIN put/wait_signal) to CuTe DSL kernels, so GPU-initiated RDMA is now reachable from Python. Adapted from the nccl4py cute example, using torch.distributed for bootstrap and nccl.torch.empty for NCCL-allocated tensors.

Also adds a section on combining symmetric memory with nccl4py: wrapping the process group's communicator via _comm_ptr(), registering symm_mem tensors as NCCL windows, and the reverse register_external_nccl_comm bridge.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
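For readers unfamiliar with the put / wait_signal pattern named in this commit, here is a minimal CPU-only model of its semantics. It is a sketch, not the nccl4py or PyTorch API: the `Window` class and its `put_with_signal` / `wait_for_signal` methods are hypothetical names chosen for illustration, and a Python thread stands in for the remote peer.

```python
import threading


class Window:
    """Hypothetical stand-in for a remotely writable buffer plus a signal slot."""

    def __init__(self, size):
        self.data = [0] * size
        self._signal = 0
        self._cond = threading.Condition()

    def put_with_signal(self, payload):
        # One-sided write: the sender deposits the data, then raises the
        # signal. The receiver takes no part in the transfer itself.
        with self._cond:
            self.data[: len(payload)] = payload
            self._signal += 1
            self._cond.notify_all()

    def wait_for_signal(self, expected=1, timeout=5.0):
        # The receiver blocks until the signal count reaches `expected`,
        # which guarantees the payload written before it is visible.
        with self._cond:
            arrived = self._cond.wait_for(
                lambda: self._signal >= expected, timeout
            )
            if not arrived:
                raise TimeoutError("signal never arrived")
            return list(self.data)


window = Window(4)
sender = threading.Thread(target=window.put_with_signal, args=([1, 2, 3, 4],))
sender.start()
received = window.wait_for_signal()
sender.join()
print(received)  # [1, 2, 3, 4]
```

The point of the model is the ordering guarantee: data is written first and the signal second, so a receiver that observes the signal can safely read the payload.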
010903c to 1d236c3
Adds a new unstable tutorial on GPU-Initiated Networking (GIN) with NCCL and PyTorch distributed, via the NCCL backend of torch.distributed._symmetric_memory.

Contents

- Enabling the NCCL backend (symm_mem.set_backend("NCCL") / TORCH_SYMMMEM=NCCL), including eager process group init and the warm-up collective requirement
- Device-initiated one-shot all-reduce: one_shot_all_reduce on symmetric tensors
- One-sided put/signal operations: torch.ops.symm_mem.nccl_put_with_signal / nccl_wait_for_signal, plus notes on nccl_put / nccl_get and the handle-level put_signal / wait_signal (NCCL 2.29+)
- Custom communication kernels in Python with CuTe DSL (hdl.get_buffer), adapted from the NVIDIA CUTLASS distributed examples
- Pointers to the ncclGin C++ device API for extension authors

Since GIN itself is a device-side API not directly exposed in Python, the tutorial is framed around symmetric memory as the user-facing API, with GIN presented as the transport that services window operations across nodes.

All Python API calls in the examples were verified against pytorch/pytorch main (op registrations in nccl_extension.cu, pybind signatures, and patterns from test/distributed/test_nccl.py); the CuTe DSL example is adapted from the official NVIDIA CUTLASS examples/python/CuTeDSL distributed examples.

The tutorial is a static .rst in unstable_source/ (multi-GPU torchrun code cannot execute in the docs build), registered with a card and toctree entry in unstable_index.rst. Validated with make html-noplot (clean build, no new warnings) and lintrunner.

🤖 Generated with Claude Code
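As a reviewer aid, the data movement behind the one-shot all-reduce item can be modelled on the CPU. The sketch below is not the tutorial's code and does not call PyTorch: `one_shot_all_reduce_model` is a hypothetical helper, and plain Python lists stand in for symmetric tensors that every rank can read directly.

```python
def one_shot_all_reduce_model(symmetric_buffers):
    """Return each rank's result after a one-shot sum all-reduce.

    symmetric_buffers[r] models the buffer owned by rank r. With symmetric
    memory, every rank can read every peer's buffer directly.
    """
    world_size = len(symmetric_buffers)
    length = len(symmetric_buffers[0])
    results = []
    for _rank in range(world_size):
        # One shot: each rank reads all peers' buffers and reduces locally,
        # with no separate reduce-scatter and all-gather phases.
        out = [0.0] * length
        for peer in range(world_size):
            for i, value in enumerate(symmetric_buffers[peer]):
                out[i] += value
        results.append(out)
    return results


# Four modelled ranks; rank r holds the value r + 1 in every element.
buffers = [[float(rank + 1)] * 4 for rank in range(4)]
reduced = one_shot_all_reduce_model(buffers)
print(reduced[0])  # [10.0, 10.0, 10.0, 10.0]
```

Every rank ends up with the same sum because each one performs the full reduction itself; this trades extra reads for a single communication step, which tends to suit small messages.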