SemiAnalysisAI · Oseltamivir · Jul 3, 2026
diff --git a/.github/workflows/collectivex-sweep.yml b/.github/workflows/collectivex-sweep.yml
diff --git a/experimental/CollectiveX/.gitignore b/experimental/CollectiveX/.gitignore
@@ -0,0 +1,15 @@
+__pycache__/
+*.pyc
+results/
+unsupported/
+.shards/
+.cx_workloads/
+.cx_backend/
+/matrix_full.json
+gpucore.*
+
+# Local plans and infrastructure inventory.
+goal.md
+notes.md
+configs/platforms.yaml
+private-infra.md
diff --git a/experimental/CollectiveX/README.md b/experimental/CollectiveX/README.md
@@ -0,0 +1,156 @@
+# CollectiveX
+
+<div align="center">
+
+**English** | [中文](./README_zh.md)
+
+</div>
+
+CollectiveX is an experimental MoE expert-parallel communication benchmark. It measures dispatch,
+combine, and paired roundtrip latency across EP libraries and accelerator systems.
+
+> Publication hold: historical schema 3-5 data is diagnostic. No current dataset is approved for
+> rankings, recommendations, or regression baselines.
+
+## v1 Execution Profile
+
+Every scheduled case is BF16 with backend-tuned resources and packed placement. The explicit mode
+selects one of two contracts:
+
+- Normal mode uses `layout-and-dispatch-v1`, rank-deduplicated token payloads, and activation-only
+  combine. Uniform core coverage and one Zipf sensitivity remain; EPLB is measured only as the Zipf
+  remedy.
+- Low-latency mode uses `expert-packed-weighted-combine-v1`, token-expert payloads, and gate-weighted
+  combine through genuine DeepEP V1 or UCCL low-latency APIs. It is decode-only and never shares a
+  ranking cohort with normal mode. Other backends are explicitly unsupported for this suite.
+
+Both modes use `fixed-512-v1`: 64 trials x 8 timed iterations with 32 synchronized full roundtrip
+warmups before each measured component at every trial/point. Roundtrip is measured first; each
+iteration takes the cross-rank maximum before nearest-rank p50/p90/p95/p99, and roundtrip p99 is the
+headline latency. A stdlib integer counter produces byte-identical routing and gate weights.
+
+The canonical matrix covers H100, H200, B200, B300, GB200, GB300, MI325X, and MI355X. It requests
+608 cases / 1,600 token points: 364 runnable cases / 940 points, emitted as 58 executable workflow
+shards/allocation cells, plus 244 explicit unsupported cases / 660 points. `sweep_matrix.py`
+materializes every token ladder and rejects missing, stale, malformed, or altered shard controls.
+Shards are emitted round-robin by SKU so the bounded GHA matrix uses every runner pool early.
+
+| Systems | EP8 | EP16 |
+|---|---|---|
+| H100/H200/B200/B300 | 1x8 NVLink, scale-up | 2x8 NVLink + RDMA, scale-out |
+| MI325X/MI355X | 1x8 XGMI, scale-up | 2x8 XGMI + RDMA, scale-out |
+| GB200/GB300 | 2x4 MNNVL, scale-up | 4x4 MNNVL, scale-up |
+
+Physical host count does not determine scope: both GB topologies stay inside one 72-GPU MNNVL
+scale-up domain.
+
+| Backend | Current scope |
+|---|---|
+| DeepEP V1 | Image-pinned `deep_ep.Buffer`: normal and native low-latency APIs; upstream v1.2.1 on x86 and the image's GB fork on arm64 |
+| DeepEP V2 | PR #605 `ElasticBuffer` plus #630: LSA for scale-up and GIN for x86 EP16 scale-out; source/SASS-bound reproducible JIT |
+| DeepEP Hybrid | Pinned `HybridEPBuffer`: x86 EP16 multi-domain RDMA/DOCA; GB EP8/EP16 in one MNNVL communication domain |
+| UCCL | Pinned 0.1.1 wheel and wrapper with normal and native low-latency APIs on Hopper; Blackwell is explicitly unsupported |
+| NCCL/RCCL A2A | Portable rank-deduplicated payload plus expert/routing-metadata reference |
+| MoRI | EP8 uses MI325X AsyncLL or MI355X IntraNode; EP16 pins InterNodeV1 over 2x8 XGMI + RDMA |
+
+FlashInfer is outside v1 because its exercised EP path failed intermittently at runtime. It is not
+misreported as a platform capability limitation and can return after a stable pinned path is proven.
+
+DeepEP V2 means the `ElasticBuffer` implementation introduced by
+[DeepEP PR #605](https://github.com/deepseek-ai/DeepEP/pull/605), not a newer legacy `Buffer` build.
+The pinned source is the minimal upstream [PR #630](https://github.com/deepseek-ai/DeepEP/pull/630)
+follow-up: its parent is the #605 merge tree and its only source change fixes pure scale-up
+initialization when GIN is unavailable. Scale-up cases request NCCL Device API LSA and fail closed
+unless the realized LSA team covers the full EP world. x86 EP16 scale-out cases instead require the
+hybrid path with GIN, two logical scale-out domains represented by two physical RDMA ranks, and eight
+scale-up ranks per domain; GB EP16 remains MNNVL scale-up and therefore uses LSA. The isolated build
+records the API, source, loaded libraries, generated JIT source, executable SASS, and raw CUBIN
+diagnostics. The current H100 runner pool is explicitly unsupported for V2 because NCCL 2.30.4
+reports that its EP8 communicator lacks Device API symmetric-memory support; re-enabling that pool
+requires an all-rank CUDA P2P/LSA-capable runtime. Other NVIDIA SKUs remain unvalidated until their
+GPU outcomes pass the native correctness and publication gates.
+
+Removed v1 axes include cached-layout `[cl]`, runtime-visible `[rv]`, FP8, quantized combine,
+extra routing distributions, activation profiles, uneven allocation, placement permutations, model
+envelopes, and scaling studies.
+
+## Workflow And Artifacts
+
+`.github/workflows/collectivex-sweep.yml` generates a public-SKU matrix, extracts a strict ignored
+`.shards/<id>.json` control, executes one allocation per shard, privacy-checks result JSON, and uploads
+raw GitHub artifacts. Raw producers are diagnostic-only; they cannot self-promote evidence.
+
+GitHub artifacts are transient publisher input; Vercel storage, GCP, Neon, managed databases, and
+managed object stores are out of scope. `publisher.py` ingests complete downloaded workflow
+artifacts into an operator-selected local workspace, verifies or promotes explicit bundle IDs, and
+emits the sanitized content-addressed dataset. It never runs on GPU workers and is not a service.
+
+The frontend requires no store or environment variable. Until measured v1 data passes promotion,
+its route generates a deterministic synthetic publication in memory and marks every series
+diagnostic; generated values cannot create cohorts, rankings, recommendations, or regression claims.
+Publishing measured results will be an explicit reviewed change that replaces the generator input
+with the sanitized publisher output. The validation contract and promotion gates are in
+[docs/methodology.md](docs/methodology.md).
+
+## Runner Configuration
+
+Runner-local Slurm and storage values use a strict per-SKU JSON document at
+`$XDG_CONFIG_HOME/inferencex/collectivex.json` or `COLLECTIVEX_OPERATOR_CONFIG`. The mode-0600,
+same-owner, non-symlink file is outside the checkout and never uploaded. Unknown runners, fields,
+duplicate keys, endpoint literals, unsafe paths, and non-JSON input fail closed; configuration is
+never evaluated as shell. GHA passes encrypted `COLLECTIVEX_OPERATOR_CONFIG_V1` content only to the
+launcher, which validates it, exports the selected SKU's allowlisted values, and deletes the
+temporary copy before allocation. Required JSON fields are:
+
+| SKU | Variables |
+|---|---|
+| `h100-dgxc`, `b200-dgxc` | `partition`, `account`, `squash_dir`, `stage_dir` |
+| `h200-dgxc` | `partition`, `squash_dir`, `stage_dir` |
+| `b300` | `partition`, `account`, `squash_dir`, `stage_dir` |
+| `gb200` | `partition`, `account`, ordered `storage_roots` |
+| `gb300` | `partition`, `account`, `squash_dir`, `stage_dir`, `enroot_cache_path` |
+| `mi325x`, `mi355x` | `partition`, `squash_dir`, `stage_dir` |
+
+Every selected non-MNNVL EP16 placement additionally requires `socket_ifname` and `rdma_devices`
+for its operator-approved fabric; optional
+`ib_gid_index` and `rdma_service_level` are also allowlisted. CollectiveX does not heuristically
+select a management route or HCA. After allocation, every non-MNNVL scale-out node must prove that
+all configured interfaces and active HCA ports exist before backend setup. Scale-up and MNNVL jobs
+clear these overrides. Scale-out NCCL/RCCL is pinned to `IB` with exact-match HCA selectors so a
+socket fallback fails instead of being mislabeled as RDMA.
+
+`stage_dir` is a pre-existing, runner-owned, non-symlinked base outside the checkout and workflow
+workspace. It is not group- or world-writable and is visible at the same path on the runner and every
+allocated node. Jobs create only a marked mode-0700 execution child, prove cross-node read/write
+visibility, and remove that exact child after allocation teardown; they never mount the runner
+checkout or create a stage beneath image storage on AMD.
+
+Before import, each Docker Hub tag is resolved with bounded registry requests and must match its
+pinned digest; digest-qualified overrides are rejected. Enroot imports use a fixed filesystem epoch
+and a versioned, registry-digest-bound cache key. Every mounted squash is freshly hashed. The
+verified registry digest and local squash hash are both recorded. Image-provided DeepEP is checked
+against exact wheel and installed-file fingerprints; source-built backends use pinned commits and
+runtime-verified GPU targets. DeepEP V2's mode-0700 cluster-local build cache is keyed by a versioned
+build recipe, verified image, architecture, upstream trees, and dependency pins; only its fixed
+`/cx-cache` mount reaches the container, and it never enters result artifacts.
+Pinned V2 and Hybrid sources are fetched once per workflow. Each job validates the complete archive,
+extracts only its exact backend root, permits only contained relative leaf symlinks to archived
+regular files, and revalidates the Git tree and submodule pins before staging.
+Compute containers receive an explicit environment allowlist. Private host, address, device, NIC,
+credential, workspace, and path data stays in encrypted config, ignored operator notes, or bounded
+mode-0600 runner logs; it is never uploaded.
+
+## Local Checks
+
+```bash
+uv run --with-requirements experimental/CollectiveX/requirements.txt \
+  python -m unittest discover experimental/CollectiveX/tests -p 'test_*.py'
+uv run --with-requirements experimental/CollectiveX/requirements.txt \
+  python experimental/CollectiveX/sweep_matrix.py --backends all --out /tmp/cx-matrix.json >/dev/null
+uv run --with-requirements experimental/CollectiveX/requirements.txt \
+  python experimental/CollectiveX/publisher.py --store-root "$COLLECTIVEX_STORE_ROOT" verify
+bash -n experimental/CollectiveX/runtime/*.sh experimental/CollectiveX/launchers/*.sh
+```
+
+Core paths are `capability.py`, `configs/`, `contracts.py`, `schemas/`, `sweep_matrix.py`,
+`publisher.py`, `runtime/`, `launchers/`, and `tests/`.