From e600eed5a769388f9f7d37e484d749a806d26363 Mon Sep 17 00:00:00 2001
From: Tristan Rice
Date: Wed, 1 Jul 2026 23:25:11 +0000
Subject: [PATCH 1/3] Add NCCL GIN / symmetric memory tutorial

Adds an unstable tutorial covering GPU-initiated networking with NCCL
and PyTorch symmetric memory: enabling the NCCL backend,
device-initiated one-shot all-reduce, one-sided put/signal operations,
writing custom communication kernels in Python with CuTe DSL,
multi-node GIN requirements, and pointers to the NCCL device API for
custom C++ kernels.

Co-Authored-By: Claude Fable 5
---
 unstable_index.rst                    |   8 +
 unstable_source/nccl_gin_tutorial.rst | 519 ++++++++++++++++++++++++++
 2 files changed, 527 insertions(+)
 create mode 100644 unstable_source/nccl_gin_tutorial.rst

diff --git a/unstable_index.rst b/unstable_index.rst
index 6e3cfd4364c..38b1666d8ae 100644
--- a/unstable_index.rst
+++ b/unstable_index.rst
@@ -122,6 +122,13 @@ decide if we want to upgrade the level of commitment or to fail fast.
    :link: unstable/context_parallel.html
    :tags: Distributed, Context Parallel
 
+.. customcarditem::
+   :header: GPU-Initiated Networking with NCCL and PyTorch Symmetric Memory
+   :card_description: Program GPU-initiated communication with the NCCL backend of PyTorch symmetric memory
+   :image: _static/img/thumbnails/cropped/generic-pytorch-logo.png
+   :link: unstable/nccl_gin_tutorial.html
+   :tags: Distributed, Symmetric Memory, NCCL
+
 .. Integration
 .. customcarditem::
    :header: Out-of-tree extension autoloading in Python
@@ -150,6 +157,7 @@ decide if we want to upgrade the level of commitment or to fail fast.
    unstable/context_parallel
    unstable/flight_recorder_tutorial
+   unstable/nccl_gin_tutorial
    unstable/inductor_cpp_wrapper_tutorial
    unstable/inductor_windows
    unstable/vmap_recipe
diff --git a/unstable_source/nccl_gin_tutorial.rst b/unstable_source/nccl_gin_tutorial.rst
new file mode 100644
index 00000000000..1c3145da9da
--- /dev/null
+++ b/unstable_source/nccl_gin_tutorial.rst
@@ -0,0 +1,519 @@
+GPU-Initiated Networking with NCCL and PyTorch Symmetric Memory
+===============================================================
+
+.. note::
+   |edit| View and edit this tutorial in `GitHub `__.
+
+.. grid:: 2
+
+    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+       :class-card: card-prerequisites
+
+       * What GPU-Initiated Networking (GIN) is and how it relates to the
+         NCCL device API
+       * How to allocate and exchange symmetric memory tensors with the
+         NCCL backend of ``torch.distributed._symmetric_memory``
+       * How to use one-sided put and signal operations between ranks
+       * How to run device-initiated collectives such as
+         ``one_shot_all_reduce``
+       * How to write a custom communication kernel in Python with the
+         CuTe DSL
+
+    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+       :class-card: card-prerequisites
+
+       * PyTorch 2.11 or later (nightly recommended)
+       * NCCL 2.28 or later (2.28.7 or later for GIN over the network)
+       * A host with two or more CUDA GPUs
+       * For multi-node GIN: RDMA-capable NICs (ConnectX-4 or newer) with
+         GPUDirect RDMA
+       * For the custom kernel section: ``nvidia-cutlass-dsl`` 4.5 or
+         later
+       * Familiarity with `PyTorch Distributed `__
+
+Introduction
+------------
+
+In the traditional PyTorch distributed model, communication is
+*host-initiated*: the CPU enqueues each collective (for example,
+``dist.all_reduce``) onto a CUDA stream, and NCCL launches a kernel that
+performs the communication. This model works well for large, structured
+collectives, but it adds CPU launch latency to every operation and makes
+it hard to fuse communication with computation inside a single kernel.
+
+NCCL 2.28 introduced a *device API* that turns this model around:
+communication can be initiated directly from GPU code, without a round
+trip through the CPU. The device API has three building blocks:
+
+* **LSA (Load/Store Accessible)**: peers reachable over NVLink or PCIe
+  P2P are accessed with direct loads and stores.
+* **Multimem**: uses NVLink SHARP multicast on supported hardware.
+* **GIN (GPU-Initiated Networking)**: a CUDA kernel initiates RDMA
+  transfers over the network to remote nodes. This is the piece that
+  extends device-initiated communication beyond a single machine.
+
+PyTorch exposes this functionality through `Symmetric Memory
+`__.
+A symmetric memory tensor is allocated with the same size on every rank
+and registered with NCCL as a *window*, which makes it remotely
+accessible by all peers. Once a window is established, PyTorch can run
+device-initiated collectives and one-sided operations (put, get,
+signal) on it. Within a node, transfers use LSA; across nodes, NCCL
+uses GIN under the hood.
+
+.. note::
+   ``torch.distributed._symmetric_memory`` is an unstable API. Names
+   and signatures may change between releases. The examples in this
+   tutorial were written against PyTorch nightly builds.
+
+Enabling the NCCL backend for symmetric memory
+----------------------------------------------
+
+Symmetric memory supports several allocation backends (``CUDA``,
+``NVSHMEM``, and ``NCCL``). To use the NCCL device API, select the
+``NCCL`` backend before allocating any tensors:
+
+.. code:: python
+
+    import torch.distributed._symmetric_memory as symm_mem
+
+    symm_mem.set_backend("NCCL")
+
+Alternatively, set the environment variable ``TORCH_SYMMMEM=NCCL``
+before starting the process.
+
+The NCCL backend requires an eagerly initialized NCCL communicator.
+Pass ``device_id`` to ``init_process_group`` so that the communicator
+is created up front, and issue one warm-up collective before the first
+symmetric memory allocation:
+
+.. code:: python
+
+    dist.init_process_group(backend="nccl", device_id=device)
+    dist.all_reduce(torch.ones(1, device=device))
+
+A first example: device-initiated all-reduce
+--------------------------------------------
+
+The following script allocates a symmetric memory tensor on each rank,
+establishes the NCCL windows through ``rendezvous``, and runs a
+one-shot all-reduce. A one-shot all-reduce is a single fused kernel:
+each rank reads its peers' buffers directly and reduces them locally,
+with no separate communication kernel launched by the host. Save this
+program as ``symm_mem_all_reduce.py``:
+
+.. code:: python
+
+    # file: symm_mem_all_reduce.py
+    import os
+
+    import torch
+    import torch.distributed as dist
+    import torch.distributed._symmetric_memory as symm_mem
+
+
+    def main():
+        rank = int(os.environ["RANK"])
+        local_rank = int(os.environ["LOCAL_RANK"])
+        device = torch.device("cuda", local_rank)
+        torch.cuda.set_device(device)
+
+        dist.init_process_group(backend="nccl", device_id=device)
+        symm_mem.set_backend("NCCL")
+
+        # Warm up the NCCL communicator before the first allocation.
+        dist.all_reduce(torch.ones(1, device=device))
+        group_name = dist.group.WORLD.group_name
+
+        # Allocate a symmetric tensor. Every rank must allocate the
+        # same size, and the allocation must happen on all ranks.
+        t = symm_mem.empty(4096, dtype=torch.float32, device=device)
+        t.fill_(rank + 1)
+
+        # Establish the symmetric memory windows. This is a collective
+        # operation and returns a handle for one-sided operations.
+        symm_mem.rendezvous(t, group=group_name)
+
+        # Device-initiated all-reduce over the symmetric tensor.
+        res = torch.ops.symm_mem.one_shot_all_reduce(t, "sum", group_name)
+
+        expected = sum(range(1, dist.get_world_size() + 1))
+        assert res.eq(expected).all().item()
+        if rank == 0:
+            print(f"one_shot_all_reduce OK, every element == {expected}")
+
+        dist.destroy_process_group()
+
+
+    if __name__ == "__main__":
+        main()
+
+Run it with ``torchrun`` on a machine with at least two GPUs:
+
+.. code:: shell
+
+    torchrun --nnodes=1 --nproc_per_node=2 symm_mem_all_reduce.py
+
+You should see:
+
+.. code:: shell
+
+    one_shot_all_reduce OK, every element == 3
+
+Two things distinguish this from a regular ``dist.all_reduce``:
+
+1. The input tensor must be a rendezvoused symmetric memory tensor —
+   the kernel relies on every rank's buffer being remotely accessible.
+2. The whole operation is one kernel that both communicates and
+   reduces. There is no separate NCCL collective launch, which makes
+   the operation cheap for small messages and easy to capture in a
+   CUDA graph.
+
+One-sided communication: put and signal
+---------------------------------------
+
+Collectives are symmetric by nature: every rank participates in the
+same operation at the same time. The device API also enables
+*one-sided* operations, where a rank writes into a peer's buffer and
+notifies it, without the peer posting a matching receive. This is the
+communication style used by MoE token dispatch, pipeline transfers, and
+other irregular patterns.
+
+The ``torch.ops.symm_mem`` namespace provides NCCL put/get operations
+that run as device kernels. In the following example, ranks are paired:
+each odd rank writes its buffer into the previous even rank's symmetric
+tensor and raises a signal; the even rank blocks until the signal
+arrives, then reads the delivered data. Save it as
+``symm_mem_put_signal.py``:
+
+.. code:: python
+
+    # file: symm_mem_put_signal.py
+    import os
+
+    import torch
+    import torch.distributed as dist
+    import torch.distributed._symmetric_memory as symm_mem
+
+
+    def main():
+        rank = int(os.environ["RANK"])
+        local_rank = int(os.environ["LOCAL_RANK"])
+        device = torch.device("cuda", local_rank)
+        torch.cuda.set_device(device)
+
+        dist.init_process_group(backend="nccl", device_id=device)
+        symm_mem.set_backend("NCCL")
+        dist.all_reduce(torch.ones(1, device=device))
+        group_name = dist.group.WORLD.group_name
+
+        t = symm_mem.empty(1024, dtype=torch.float32, device=device)
+        t.fill_(rank)
+        hdl = symm_mem.rendezvous(t, group=group_name)
+
+        # Make sure all ranks finished writing their initial values
+        # before any peer starts writing remotely.
+        dist.barrier()
+
+        signal_val = 1
+        if rank % 2 == 1:
+            # Write our buffer into the peer's symmetric tensor and
+            # raise its signal so it knows the data arrived.
+            peer = rank - 1
+            torch.ops.symm_mem.nccl_put_with_signal(t, signal_val, peer)
+        elif rank + 1 < hdl.world_size:
+            # Block until the peer's put has been delivered.
+            peer = rank + 1
+            torch.ops.symm_mem.nccl_wait_for_signal(t, signal_val)
+            torch.cuda.synchronize()
+            assert t.eq(peer).all().item()
+            print(f"rank {rank}: received data from rank {peer}")
+
+        dist.barrier()
+        dist.destroy_process_group()
+
+
+    if __name__ == "__main__":
+        main()
+
+Run it the same way:
+
+.. code:: shell
+
+    torchrun --nnodes=1 --nproc_per_node=2 symm_mem_put_signal.py
+
+Expected output:
+
+.. code:: shell
+
+    rank 0: received data from rank 1
+
+A few notes on the one-sided primitives:
+
+* ``torch.ops.symm_mem.nccl_put(tensor, peer)`` writes the local
+  symmetric tensor into ``peer``'s corresponding buffer, and
+  ``torch.ops.symm_mem.nccl_get(tensor, peer)`` reads a peer's buffer
+  into the local one. ``nccl_put_with_signal`` additionally raises a
+  signal on the destination rank after the data lands, and
+  ``nccl_wait_for_signal`` blocks the stream until that signal arrives.
+* With NCCL 2.29 or later, the handle also exposes host-initiated
+  one-sided signaling: ``hdl.put_signal(dst_rank=peer)`` and
+  ``hdl.wait_signal(src_rank=peer)``.
+* Because the operations are one-sided, the target rank does not post
+  a matching receive. You are responsible for ordering remote writes
+  against local reads, using signals or a barrier.
+
+Going multi-node: where GIN takes over
+--------------------------------------
+
+Nothing in the code above changes when you scale from one node to
+several. When a peer is reachable over NVLink or PCIe, the NCCL backend
+services these operations with direct loads and stores (LSA). When a
+peer lives on another node, NCCL services the same window operations
+with GIN: the GPU posts RDMA operations to the NIC directly from the
+kernel, without waking up the CPU.
+
+Launch across two nodes the usual way:
+
+.. code:: shell
+
+    # On node 0
+    torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
+        --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
+        symm_mem_ring_put.py
+
+    # On node 1
+    torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
+        --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
+        symm_mem_ring_put.py
+
+For GIN to be available, the cluster must satisfy NCCL's requirements:
+
+* NCCL 2.28.7 or later
+* CUDA 12.2 or later, driver 510.40.3 or later
+* ConnectX-4 or newer NICs with ``rdma-core`` 44 or later
+* GPUDirect RDMA, via DMA-BUF (Linux kernel 6.1 or later) or the
+  ``nvidia-peermem`` module
+* Full NIC connectivity between all rails; ``NCCL_CROSS_NIC=0`` is not
+  supported with the device API
+
+NCCL chooses between two GIN transports automatically: GDAKI
+(GPUDirect Async Kernel-Initiated), where the GPU drives the NIC
+doorbells directly, and a CPU proxy transport that works on a wider
+range of NICs by relaying GPU requests through lock-free queues. Run
+with ``NCCL_DEBUG=INFO`` to see which transport was selected. If
+symmetric window registration is not available (for example, an old
+NCCL or missing RDMA support), rendezvous or the device-initiated ops
+raise an error; check the debug log for the reason.
+
+Writing custom communication kernels in Python with CuTe DSL
+------------------------------------------------------------
+
+The operations used so far are prebuilt kernels that ship with
+PyTorch. Symmetric memory also lets you write your *own* communication
+kernels, and with the `CuTe DSL
+`__
+(``nvidia-cutlass-dsl``) you can do it in Python. The key enabler is
+``hdl.get_buffer(peer, shape, dtype)``: it returns a regular CUDA
+tensor whose data pointer is the mapped address of a *peer's*
+symmetric buffer. A kernel that loads from that tensor performs a
+remote read over NVLink or PCIe, so you can pass the peer buffers into
+a CuTe DSL kernel like any local tensor.
+
+The following example implements a one-shot all-reduce as a custom
+kernel: every rank reads all peers' buffers directly and accumulates
+them into a local output. It is adapted from the `distributed CuTe DSL
+examples
+`__
+in the CUTLASS repository. Install the DSL with
+``pip install nvidia-cutlass-dsl`` (Python 3.12 recommended) and save
+the program as ``cute_all_reduce.py``:
+
+.. code:: python
+
+    # file: cute_all_reduce.py
+    import os
+
+    import torch
+    import torch.distributed as dist
+    import torch.distributed._symmetric_memory as symm_mem
+
+    import cutlass
+    import cutlass.cute as cute
+    from cutlass.cute.runtime import from_dlpack
+
+
+    @cute.kernel
+    def all_reduce_kernel(
+        inputs: list[cute.Tensor],
+        gOut: cute.Tensor,
+        thr_layout: cute.Layout,
+        val_layout: cute.Layout,
+    ):
+        tidx, _, _ = cute.arch.thread_idx()
+        bidx, _, _ = cute.arch.block_idx()
+
+        blk_coord = ((None, None), bidx)
+        local_tile_out = gOut[blk_coord]
+        local_tile_list = [t[blk_coord] for t in inputs]
+
+        copy_atom = cute.make_copy_atom(
+            cute.nvgpu.CopyUniversalOp(), inputs[0].element_type
+        )
+        tiled_copy = cute.make_tiled_copy_tv(copy_atom, thr_layout, val_layout)
+        thr_copy = tiled_copy.get_slice(tidx)
+
+        thr_tensor_list = [thr_copy.partition_S(t) for t in local_tile_list]
+        thr_out = thr_copy.partition_D(local_tile_out)
+        frg_acc = cute.make_fragment_like(thr_out)
+        frg_acc.fill(0.0)
+
+        # Each iteration loads the same tile from a different rank's
+        # buffer. Loads from peer tensors are remote reads.
+        for thr in thr_tensor_list:
+            frg = cute.make_fragment_like(thr)
+            cute.copy(copy_atom, thr, frg)
+            frg_acc.store(frg.load() + frg_acc.load())
+
+        cute.copy(copy_atom, frg_acc, thr_out)
+
+
+    @cute.jit
+    def all_reduce(
+        inputs: list[cute.Tensor],
+        output: cute.Tensor,
+        copy_bits: cutlass.Constexpr = 128,
+    ):
+        vector_size = copy_bits // inputs[0].element_type.width
+        thr_layout = cute.make_ordered_layout((4, 32), order=(1, 0))
+        val_layout = cute.make_ordered_layout((4, vector_size), order=(1, 0))
+        tiler_mn, tv_layout = cute.make_layout_tv(thr_layout, val_layout)
+
+        divided_inputs = [cute.zipped_divide(t, tiler_mn) for t in inputs]
+        gOut = cute.zipped_divide(output, tiler_mn)
+        all_reduce_kernel(divided_inputs, gOut, thr_layout, val_layout).launch(
+            grid=[cute.size(gOut, mode=[1]), 1, 1],
+            block=[cute.size(tv_layout, mode=[0]), 1, 1],
+        )
+
+
+    def main():
+        rank = int(os.environ["RANK"])
+        local_rank = int(os.environ["LOCAL_RANK"])
+        device = torch.device("cuda", local_rank)
+        torch.cuda.set_device(device)
+
+        dist.init_process_group(backend="nccl", device_id=device)
+        symm_mem.set_backend("NCCL")
+        dist.all_reduce(torch.ones(1, device=device))
+        world_size = dist.get_world_size()
+
+        M, N = 1024, 512
+        t = symm_mem.empty((M, N), dtype=torch.float32, device=device)
+        hdl = symm_mem.rendezvous(t, dist.group.WORLD)
+        t.random_(0, 100)
+        output = torch.zeros((M, N), device=device)
+
+        # One tensor view per rank, each backed by that rank's
+        # symmetric buffer mapped into this process.
+        peer_tensors = [
+            hdl.get_buffer(r, t.shape, t.dtype) for r in range(world_size)
+        ]
+
+        compiled = cute.compile(
+            all_reduce,
+            [from_dlpack(p) for p in peer_tensors],
+            from_dlpack(output),
+        )
+
+        # Every rank must finish writing its input before any peer
+        # reads it, and no rank may exit before all reads finish.
+        dist.barrier()
+        compiled([from_dlpack(p) for p in peer_tensors], from_dlpack(output))
+        dist.barrier()
+
+        expected = t.clone()
+        dist.all_reduce(expected)
+        torch.testing.assert_close(output, expected)
+        if rank == 0:
+            print("custom CuTe DSL all_reduce OK")
+
+        dist.destroy_process_group()
+
+
+    if __name__ == "__main__":
+        main()
+
+Run it with ``torchrun``:
+
+.. code:: shell
+
+    torchrun --nnodes=1 --nproc_per_node=2 cute_all_reduce.py
+
+You should see:
+
+.. code:: shell
+
+    custom CuTe DSL all_reduce OK
+
+Note the division of labor: PyTorch symmetric memory handles all the
+setup (allocation, window registration, peer mapping), and the CuTe
+DSL kernel is ordinary tile-based code — the only distributed aspect
+is that some of its input tensors happen to live on other GPUs.
+Custom Python kernels like this one can reach peers over NVLink or
+PCIe (the LSA path) within a node. Cross-node RDMA through GIN is not
+accessible from Python kernels today; that requires the C++ device
+API described in the next section.
+
+Writing custom device kernels with GIN
+--------------------------------------
+
+The operations above are host-visible entry points to device-initiated
+communication. The full power of GIN — issuing puts, gets, and signals
+from *inside your own CUDA kernel*, interleaved with computation — is
+exposed at the NCCL level through ``nccl_device.h``. A kernel obtains a
+``ncclGin`` object from a device communicator and calls primitives such
+as:
+
+.. code:: cpp
+
+    // NCCL device API (C++/CUDA), sketch only
+    ncclGin gin(devComm, /*context=*/0);
+    gin.put(team, peer, remoteWindow, dstOffset,
+            localWindow, srcOffset, bytes,
+            ncclGin_SignalInc{signalIndex});
+    gin.waitSignal(signalIndex, expectedValue);
+    gin.flush();
+
+PyTorch does not yet expose these device-side primitives in Python.
+Today they are the domain of C++/CUDA extension authors: you can obtain
+the NCCL communicator backing a process group, create a device
+communicator with GIN resources, and launch your own kernels against
+the symmetric memory windows that PyTorch registered. See the
+`NCCL device API documentation
+`__
+for the complete device-side interface, and the
+`kraken repository `__ for
+examples of device-initiated communication kernels written against
+PyTorch symmetric memory.
+
+Conclusion
+----------
+
+In this tutorial, we used the NCCL backend of PyTorch symmetric memory
+to program GPU-initiated communication: we allocated symmetric tensors,
+established NCCL windows with ``rendezvous``, ran a fused
+device-initiated all-reduce, exchanged data with one-sided put and
+signal primitives, and wrote a custom all-reduce kernel in Python with
+the CuTe DSL. We also saw how the same code scales across nodes, where
+NCCL's GPU-Initiated Networking (GIN) services window operations with
+kernel-initiated RDMA.
+
+For further reading:
+
+* `PyTorch Symmetric Memory documentation `__
+* `PyTorch SymmetricMemory deep dive on dev-discuss `__
+* `NCCL device API documentation `__
+* `CuTe DSL documentation `__ and `distributed examples `__
+* `NVIDIA blog: Fusing Communication and Compute with the NCCL 2.28 Device API `__
+* `GPU-Initiated Networking for NCCL (paper) `__

From a736da5fddf55c5f5b7f17a2f4bd4652a1aaf16c Mon Sep 17 00:00:00 2001
From: Tristan Rice
Date: Wed, 1 Jul 2026 23:57:33 +0000
Subject: [PATCH 2/3] Fix CuTe DSL documentation link

Co-Authored-By: Claude Fable 5
---
 unstable_source/nccl_gin_tutorial.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/unstable_source/nccl_gin_tutorial.rst b/unstable_source/nccl_gin_tutorial.rst
index 1c3145da9da..c681f82a4d7 100644
--- a/unstable_source/nccl_gin_tutorial.rst
+++ b/unstable_source/nccl_gin_tutorial.rst
@@ -312,7 +312,7 @@ Writing custom communication kernels in Python with CuTe DSL
 The operations used so far are prebuilt kernels that ship with
 PyTorch. Symmetric memory also lets you write your *own* communication
 kernels, and with the `CuTe DSL
-`__
+`__
 (``nvidia-cutlass-dsl``) you can do it in Python. The key enabler is
 ``hdl.get_buffer(peer, shape, dtype)``: it returns a regular CUDA
 tensor whose data pointer is the mapped address of a *peer's*
@@ -514,6 +514,6 @@ For further reading:
 * `PyTorch Symmetric Memory documentation `__
 * `PyTorch SymmetricMemory deep dive on dev-discuss `__
 * `NCCL device API documentation `__
-* `CuTe DSL documentation `__ and `distributed examples `__
+* `CuTe DSL documentation `__ and `distributed examples `__
 * `NVIDIA blog: Fusing Communication and Compute with the NCCL 2.28 Device API `__
 * `GPU-Initiated Networking for NCCL (paper) `__

From 1d236c301893480d9e95b86431e7d964dd6bb826 Mon Sep 17 00:00:00 2001
From: Tristan Rice
Date: Thu, 2 Jul 2026 03:29:21 +0000
Subject: [PATCH 3/3] Add GIN put example using nccl4py CuTe DSL device API

Replaces the C++-only GIN section with Python examples: nccl4py exposes
the NCCL device API (including GIN put/wait_signal) to CuTe DSL
kernels, so GPU-initiated RDMA is now reachable from Python. Adapted
from the nccl4py cute example, using torch.distributed for bootstrap
and nccl.torch.empty for NCCL-allocated tensors.

Also adds a section on combining symmetric memory with nccl4py:
wrapping the process group's communicator via _comm_ptr(), registering
symm_mem tensors as NCCL windows, and the reverse
register_external_nccl_comm bridge.

Co-Authored-By: Claude Fable 5
---
 unstable_source/nccl_gin_tutorial.rst | 293 ++++++++++++++++++++++----
 1 file changed, 253 insertions(+), 40 deletions(-)

diff --git a/unstable_source/nccl_gin_tutorial.rst b/unstable_source/nccl_gin_tutorial.rst
index c681f82a4d7..45b77c0dee5 100644
--- a/unstable_source/nccl_gin_tutorial.rst
+++ b/unstable_source/nccl_gin_tutorial.rst
@@ -16,8 +16,8 @@ GPU-Initiated Networking with NCCL and PyTorch Symmetric Memory
        * How to use one-sided put and signal operations between ranks
        * How to run device-initiated collectives such as
          ``one_shot_all_reduce``
-       * How to write a custom communication kernel in Python with the
-         CuTe DSL
+       * How to write custom communication kernels in Python with the
+         CuTe DSL, including GIN puts via nccl4py
 
     .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
        :class-card: card-prerequisites
@@ -27,8 +27,8 @@ GPU-Initiated Networking with NCCL and PyTorch Symmetric Memory
        * A host with two or more CUDA GPUs
        * For multi-node GIN: RDMA-capable NICs (ConnectX-4 or newer) with
          GPUDirect RDMA
-       * For the custom kernel section: ``nvidia-cutlass-dsl`` 4.5 or
-         later
+       * For the custom kernel sections: ``nvidia-cutlass-dsl`` 4.5 or
+         later, and ``nccl4py`` 0.3 or later for the GIN example
        * Familiarity with `PyTorch Distributed `__
 
 Introduction
@@ -460,42 +460,252 @@ Note the division of labor: PyTorch symmetric memory handles all the
 setup (allocation, window registration, peer mapping), and the CuTe
 DSL kernel is ordinary tile-based code — the only distributed aspect
 is that some of its input tensors happen to live on other GPUs.
-Custom Python kernels like this one can reach peers over NVLink or
-PCIe (the LSA path) within a node. Cross-node RDMA through GIN is not
-accessible from Python kernels today; that requires the C++ device
-API described in the next section.
+Kernels written this way reach peers with direct loads and stores
+(the LSA path), so they work over NVLink and PCIe within a node. To
+initiate *network* transfers from inside a Python kernel, you need the
+GIN device API, which the next section covers.
 
-Writing custom device kernels with GIN
---------------------------------------
+Calling GIN from Python kernels with nccl4py
+--------------------------------------------
+
+`nccl4py `__, the official Python
+binding for NCCL, exposes the NCCL device API — including GIN — to
+CuTe DSL kernels through its ``nccl.core.device.cute`` module. This
+lets a Python kernel issue RDMA puts and signal waits directly,
+the same primitives a C++ kernel would use through ``nccl_device.h``.
+
+This path does not go through
+``torch.distributed._symmetric_memory``: you create a NCCL
+communicator with nccl4py, register windows on it yourself, and build
+a device communicator with GIN resources. nccl4py provides PyTorch
+interop, so the buffers can still be regular ``torch.Tensor`` objects:
+``nccl.torch.empty`` allocates a tensor from NCCL's allocator (a
+requirement for window registration), and ``register_window`` accepts
+it directly.
+
+The following example transfers a buffer from rank 0 to rank 1 with a
+single GIN put issued from inside a CuTe DSL kernel, then waits on the
+delivery signal on the receiving side. It is adapted from the
+`nccl4py CuTe DSL example
+`__
+in the NCCL repository, with ``torch.distributed`` replacing MPI for
+the bootstrap. Install nccl4py with ``pip install nccl4py[cu12]`` (or
+``[cu13]``) and save the program as ``cute_gin_put.py``:
+
+.. code:: python
+
+    # file: cute_gin_put.py
+    import os
+
+    import torch
+    import torch.distributed as dist
+
+    import cutlass
+    import cutlass.cute as cute
+    import nccl.core as nccl
+    import nccl.core.device.cute as nccl_cute
+
+    NUM_ELEMS = 1024 * 1024 // 8  # 1 MiB of int64
+    DST_RANK = 1
+    SIGNAL_ID = 1
+
+
+    @cute.kernel
+    def gin_put_kernel(dev_comm, send_win, recv_win):
+        dev_comm = nccl_cute.DevComm(dev_comm)
+        send_win = nccl_cute.Window(send_win)
+        recv_win = nccl_cute.Window(recv_win)
+        team = dev_comm.team_world
+        gin = dev_comm.gin(nccl_cute.GinBackendMask.ALL, 0)
+        coop = nccl_cute.cta()
+
+        send = send_win.tensor(cutlass.Int64, cute.make_layout(NUM_ELEMS))
+        recv = recv_win.tensor(cutlass.Int64, cute.make_layout(NUM_ELEMS))
+
+        if team.rank == 0:
+            # RDMA put issued by the GPU: write our send window into
+            # the peer's recv window and raise its signal on delivery.
+            gin.put(
+                team,
+                DST_RANK,
+                recv_win, recv,  # destination window + tensor (remote)
+                send_win, send,  # source window + tensor (local)
+                coop,
+                is_signal=True,
+                signal_id=SIGNAL_ID,
+                signal_op=0,
+                signal_op_arg=1,
+            )
+        if team.rank == DST_RANK:
+            # Block inside the kernel until the put has landed.
+            gin.wait_signal(coop, signal=SIGNAL_ID, least=1)
+
+
+    @cute.jit
+    def gin_put(dev_comm: cutlass.Int64,
+                send_win: cutlass.Int64,
+                recv_win: cutlass.Int64):
+        gin_put_kernel(dev_comm, send_win, recv_win).launch(
+            grid=[1, 1, 1], block=[32, 1, 1], cooperative=True
+        )
+
+
+    def main():
+        rank = int(os.environ["RANK"])
+        world_size = int(os.environ["WORLD_SIZE"])
+        local_rank = int(os.environ["LOCAL_RANK"])
+        torch.cuda.set_device(local_rank)
+
+        # A CPU process group, used only to share the NCCL unique id.
+        dist.init_process_group(backend="gloo")
+        uid_bytes = [bytes(nccl.get_unique_id()) if rank == 0 else None]
+        dist.broadcast_object_list(uid_bytes, src=0)
+        uid = nccl.UniqueId.from_bytes(uid_bytes[0])
+        comm = nccl.Communicator.init(
+            nranks=world_size, rank=rank, unique_id=uid
+        )
+
+        # NCCL-allocated torch tensors, registered as NCCL windows.
+        send_buf = nccl.torch.empty(NUM_ELEMS, dtype=torch.int64)
+        recv_buf = nccl.torch.empty(NUM_ELEMS, dtype=torch.int64)
+        if rank == 0:
+            send_buf.copy_(torch.arange(NUM_ELEMS))
+        else:
+            send_buf.zero_()
+        recv_buf.zero_()
+        torch.cuda.synchronize()
+
+        send_win = comm.register_window(send_buf)
+        recv_win = comm.register_window(recv_buf)
+
+        # Request GIN resources when creating the device communicator.
+        reqs = nccl.NCCLDevCommRequirements(
+            gin_connection_type=nccl.NcclGinConnectionType.FULL,
+            gin_signal_count=SIGNAL_ID + 1,
+        )
+        dev_comm = comm.create_dev_comm(requirements=reqs)
+
+        gin_put(dev_comm.ptr, send_win.handle, recv_win.handle)
+        torch.cuda.synchronize()
+
+        if rank == DST_RANK:
+            expected = torch.arange(NUM_ELEMS, device=recv_buf.device)
+            torch.testing.assert_close(recv_buf, expected)
+            print(f"rank {rank}: GIN put received correctly")
+
+        dev_comm.close()
+        send_win.close()
+        recv_win.close()
+        comm.destroy()
+        dist.destroy_process_group()
+
+
+    if __name__ == "__main__":
+        main()
+
+Run it with two ranks:
+
+.. code:: shell
+
+    torchrun --nnodes=1 --nproc_per_node=2 cute_gin_put.py
+
+You should see:
+
+.. code:: shell
+
+    rank 1: GIN put received correctly
+
+Unlike the LSA examples, the transfer here goes through the NIC even
+when both ranks share a machine, so the host must meet the GIN
+hardware requirements listed in the previous section. Compared to the
+symmetric memory path, this API is lower level: you manage the
+communicator, windows, signal budget
+(``gin_signal_count``), and synchronization yourself, but in exchange
+a Python kernel can interleave computation with network communication
+— the pattern behind MoE token dispatch and fused
+communication-compute kernels.
+
+Combining symmetric memory with nccl4py
+---------------------------------------
+
+The previous example created its own NCCL communicator, separate from
+the process group. You can also combine the two worlds: keep using
+symmetric memory for allocation and for the prebuilt
+``torch.ops.symm_mem`` operations, and use nccl4py to run custom GIN
+kernels on the *same* buffers and the *same* communicator.
+
+Two bridges make this work:
+
+1. ``ProcessGroupNCCL`` exposes its NCCL communicator as an opaque
+   pointer through ``_comm_ptr()``, and ``nccl.Communicator`` accepts
+   such a pointer in its constructor. This avoids creating and
+   bootstrapping a second communicator.
+2. Tensors from ``symm_mem.empty`` with the NCCL backend are allocated
+   with NCCL's allocator (``ncclMemAlloc``), which is exactly what
+   window registration requires, so they can be registered with
+   ``register_window`` directly.
+
+.. code:: python
+
+    import torch
+    import torch.distributed as dist
+    import torch.distributed._symmetric_memory as symm_mem
+    import nccl.core as nccl
+
+    device = torch.device("cuda", local_rank)
+    dist.init_process_group(backend="nccl", device_id=device)
+    symm_mem.set_backend("NCCL")
+    dist.all_reduce(torch.ones(1, device=device))
 
-The operations above are host-visible entry points to device-initiated
-communication. The full power of GIN — issuing puts, gets, and signals
-from *inside your own CUDA kernel*, interleaved with computation — is
-exposed at the NCCL level through ``nccl_device.h``. A kernel obtains a
-``ncclGin`` object from a device communicator and calls primitives such
-as:
-
-.. code:: cpp
-
-    // NCCL device API (C++/CUDA), sketch only
-    ncclGin gin(devComm, /*context=*/0);
-    gin.put(team, peer, remoteWindow, dstOffset,
-            localWindow, srcOffset, bytes,
-            ncclGin_SignalInc{signalIndex});
-    gin.waitSignal(signalIndex, expectedValue);
-    gin.flush();
-
-PyTorch does not yet expose these device-side primitives in Python.
-Today they are the domain of C++/CUDA extension authors: you can obtain
-the NCCL communicator backing a process group, create a device
-communicator with GIN resources, and launch your own kernels against
-the symmetric memory windows that PyTorch registered. See the
-`NCCL device API documentation
+    # Symmetric tensor: usable with torch.ops.symm_mem.* as usual.
+    t = symm_mem.empty(NUM_ELEMS, dtype=torch.int64, device=device)
+    hdl = symm_mem.rendezvous(t, group=dist.group.WORLD.group_name)
+
+    # Wrap the process group's existing NCCL communicator. Do NOT call
+    # destroy() on this wrapper -- the process group owns the comm.
+    pg_backend = dist.group.WORLD._get_backend(device)
+    comm = nccl.Communicator(pg_backend._comm_ptr())
+
+    # Register the symmetric tensor as a window and request GIN
+    # resources; from here the GIN kernel example above applies as is.
+    win = comm.register_window(t)
+    reqs = nccl.NCCLDevCommRequirements(
+        gin_connection_type=nccl.NcclGinConnectionType.FULL,
+        gin_signal_count=1,
+    )
+    dev_comm = comm.create_dev_comm(requirements=reqs)
+
+The same buffer is now reachable three ways: host-initiated
+collectives (``dist.all_reduce``), device-initiated symmetric memory
+operations (``torch.ops.symm_mem.*``), and your own CuTe DSL GIN
+kernels through the window handle.
+
+The reverse bridge also exists:
+``torch.distributed._symmetric_memory.register_external_nccl_comm``
+publishes a communicator created outside PyTorch (for example with
+nccl4py) into the symmetric memory registry under a group name, so
+``symm_mem.rendezvous`` can use it.
+
+A few warnings for this pattern:
+
+* ``_comm_ptr()`` is a private API, and collectives launched on the
+  communicator outside the process group are not monitored by
+  PyTorch's watchdog. The wrapper must not outlive or destroy the
+  process group's communicator.
+* ``register_window`` and ``create_dev_comm`` are collective calls:
+  every rank must make them in the same order.
+* Windows registered this way are separate from the window that
+  ``rendezvous`` registers internally; close them (``win.close()``,
+  ``dev_comm.close()``) before ``destroy_process_group``.
+
+For C++/CUDA extension authors, the same primitives are available
+natively through ``nccl_device.h`` (``ncclGin::put``, signals,
+barriers); see the `NCCL device API documentation
 `__
-for the complete device-side interface, and the
+for the complete interface, and the
 `kraken repository `__ for
-examples of device-initiated communication kernels written against
-PyTorch symmetric memory.
+more examples of device-initiated communication kernels written
+against PyTorch symmetric memory.
 
 Conclusion
 ----------
@@ -504,10 +714,12 @@ In this tutorial, we used the NCCL backend of PyTorch symmetric memory
 to program GPU-initiated communication: we allocated symmetric tensors,
 established NCCL windows with ``rendezvous``, ran a fused
 device-initiated all-reduce, exchanged data with one-sided put and
-signal primitives, and wrote a custom all-reduce kernel in Python with
-the CuTe DSL. We also saw how the same code scales across nodes, where
-NCCL's GPU-Initiated Networking (GIN) services window operations with
-kernel-initiated RDMA.
+signal primitives, and wrote custom kernels in Python with the CuTe
+DSL — a peer load/store all-reduce over symmetric memory, and a GIN
+put issued from inside a kernel through nccl4py. We also saw how the
+symmetric memory code scales across nodes, where NCCL's GPU-Initiated
+Networking (GIN) services window operations with kernel-initiated
+RDMA.
 
 For further reading:
 
@@ -515,5 +727,6 @@ For further reading:
 * `PyTorch SymmetricMemory deep dive on dev-discuss `__
 * `NCCL device API documentation `__
 * `CuTe DSL documentation `__ and `distributed examples `__
+* `nccl4py `__ and its `CuTe DSL device API examples `__
 * `NVIDIA blog: Fusing Communication and Compute with the NCCL 2.28 Device API `__
 * `GPU-Initiated Networking for NCCL (paper) `__