Finalize CollectiveX v1 cross-vendor EP benchmark suite / 完成 CollectiveX v1 跨厂商 EP 基准测试套件 by Oseltamivir · Pull Request #2004 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-07-03T17:38:00Z

Summary

Finalizes the isolated CollectiveX v1 expert-parallel communication benchmark under
experimental/CollectiveX/. The branch is ready for three complete no-canary qualification runs;
it does not claim or include promoted v1 results yet.

Benchmark contract

Covers H100, H200, B200, B300, GB200, GB300, MI325X, and MI355X.
Supports DeepEP V1, DeepEP V2 PR #605 with the official PR #630 scale-up fix, DeepEP Hybrid, UCCL, MoRI, and an NCCL/RCCL reference.
Resolves 38 runnable allocation shards and 360 requested cases / 840 token points: 228 runnable
cases / 532 points plus 132 explicit unsupported cases / 308 points.
Uses 8 timed iterations x 64 trials, 32 synchronized full-roundtrip warmups before every measured
component/trial/point, and exactly 512 observations for every case.
Standardizes combine as activation-only unweighted rank-sum. Dispatch gate weights remain
oracle-checked but are not returned through the timed combine path.
Keeps uniform routing as the headline; Zipf and Zipf+EPLB are experimental sensitivity evidence.
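
The counts in the contract above are internally consistent, which a few lines of arithmetic can confirm. The sketch below only restates numbers from this description; the variable names are illustrative and are not taken from the repository.

```python
# Figures copied from the benchmark contract above; names are illustrative.
ITERS, TRIALS, WARMUP = 8, 64, 32

runnable_cases, runnable_points = 228, 532
unsupported_cases, unsupported_points = 132, 308

# 8 timed iterations x 64 trials gives the fixed observation count.
observations = ITERS * TRIALS
requested_cases = runnable_cases + unsupported_cases
requested_points = runnable_points + unsupported_points

print(observations)      # 512
print(requested_cases)   # 360
print(requested_points)  # 840
```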

Qualification fixes

Fetches and pin-validates source-built DeepEP backends on the GitHub-hosted setup job, then passes a same-run, three-day source artifact to self-hosted jobs; canonical runners never fall back to an upstream network fetch.
Uses exact, command-scoped Git trust for UID-remapped shared filesystems and rejects wildcard trust.
Normalizes inherited B300 source-root permissions to private mode 0700 and preserves detailed
source-staging diagnostics only in private logs.
Labels the DeepEP extension as deep_ep._C in public evidence while hashing the actual binary.
Disables GIN only for declared scale-up cases, then requires NCCL's realized LSA team to cover the
full EP world; a smaller realized domain fails before timing or publication.
Selects the DeepEP V2 JIT namespace after SM/QP, topology, device, and timeout inputs agree across
ranks, and requires exactly one artifact for each of the five expected kernels.

Correctness and artifacts

The native oracle validates expert-specific payloads, destinations, source identity, multiplicity,
weights, receive counts, combine values, and input immutability on every rank. Provenance binds the
verified image and squash bytes, implementation/build identity, loaded collective runtime, runtime
fingerprint, and generated-kernel evidence.

GitHub result artifacts are transient delivery inputs to an isolated local content-addressed filesystem
publisher. The pinned-source artifact is execution-only, is rejected by the publisher, and expires after
three days. Promotion requires exactly three complete independent runs from one source SHA, exact
coverage, stable p50/p99 evidence, stable ordering, and complete controlled cohorts. No managed
database, managed object store, or third-party result hosting is introduced.
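
As a rough illustration of the promotion rule, the sketch below checks only the structural conditions: exactly three runs, one source SHA, completeness, and exact coverage. The `Run` type and `promotion_blockers` function are hypothetical and do not reflect the publisher's actual code. The p50/p99 and ordering stability checks are left out because their thresholds are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical summary of one qualification run."""
    source_sha: str
    complete: bool
    case_ids: frozenset


def promotion_blockers(runs, expected_case_ids, required_runs=3):
    """Return the reasons a cohort cannot be promoted; an empty list means no structural blocker."""
    blockers = []
    if len(runs) != required_runs:
        blockers.append(f"need exactly {required_runs} runs, got {len(runs)}")
    if len({r.source_sha for r in runs}) > 1:
        blockers.append("runs come from more than one source SHA")
    for i, run in enumerate(runs):
        if not run.complete:
            blockers.append(f"run {i} is incomplete")
        if run.case_ids != expected_case_ids:
            blockers.append(f"run {i} coverage differs from the expected matrix")
    return blockers


expected = frozenset({"case-a", "case-b"})
good_runs = [Run("abc123", True, expected)] * 3
print(promotion_blockers(good_runs, expected))  # []
```

Returning a list of reasons rather than a boolean keeps every failed condition visible in one pass.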

The tracked tree and reachable branch history contain none of the private runner endpoint literals.
experimental/CollectiveX/configs/platforms.yaml is absent from Git, ignored, and used only as a
local operator note.

Validation

131 Python contract/unit tests.
Byte-identical matrix generation across two runs, with SHA-256
292e05f8faccaa4971eda527a327190a9943e99d4f71611987f7b95f57f253e8.
All 360 cases share timing 8:64:32, 512 samples per point, and one warmup contract.
Compileall, JSON schema parsing, bash -n, ShellCheck, Actionlint, and git diff --check.
Bilingual documentation parity and exact private-endpoint scans across tracked files and reachable
Git history.
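
A byte-identity check of this kind can be sketched as follows. `generate_matrix` is a hypothetical stand-in for the real generator, and the case fields are invented for illustration. The point is only that deterministic serialization (sorted keys, fixed separators) makes two generations comparable byte for byte and by digest.

```python
import hashlib
import json


def generate_matrix():
    """Hypothetical stand-in for the real matrix generator."""
    cases = [
        {"case_id": f"case-{i}", "iters": 8, "trials": 64, "warmup": 32}
        for i in range(3)
    ]
    # Sorted keys and fixed separators keep the serialized bytes deterministic.
    return json.dumps({"cases": cases}, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the serialized matrix."""
    return hashlib.sha256(data).hexdigest()


first, second = generate_matrix(), generate_matrix()
print(first == second)                          # True
print(sha256_hex(first) == sha256_hex(second))  # True
```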

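Chinese description

This PR completes the isolated CollectiveX v1 expert-parallel (EP) communication benchmark located under experimental/CollectiveX/.
The branch is ready to run three complete, canary-free qualification rounds; no promoted v1 results are claimed or submitted yet.

Benchmark contract

Covers H100, H200, B200, B300, GB200, GB300, MI325X, and MI355X.
Supports DeepEP V1, DeepEP V2 PR #605 with the official PR #630 scale-up fix, DeepEP Hybrid, UCCL, MoRI, and an NCCL/RCCL reference backend.
Generates 38 runnable allocation shards covering 360 requested cases / 840 token data points: 228 runnable cases /
532 data points, plus 132 cases explicitly marked unsupported / 308 data points.
Every case uses 8 timed iterations x 64 trials; 32 synchronized full-roundtrip warmups run before each component, trial,
and data point is measured, yielding exactly 512 observations.
All backends use the same activation-only, unweighted rank-sum combine. Dispatch gate weights are still checked by the
oracle but are not returned through the measured combine path.
Uniform routing is the headline result; Zipf and Zipf+EPLB serve only as experimental sensitivity evidence.

Qualification fixes

The GitHub-hosted setup job fetches the source-built DeepEP backends and validates their exact pinned versions, then hands them to self-hosted jobs through a source artifact that is retained for three days and usable only within the same run; canonical runs never fall back to a runner-side upstream network fetch.
Enables only exact, single-command-scoped Git trust for UID-remapped shared filesystems and rejects wildcard trust.
Normalizes the inherited permissions of the B300 shared directory to private 0700; detailed source-staging diagnostics are kept only in private logs.
Public evidence uses the stable deep_ep._C label while hashing the actual extension binary contents.
Disables GIN only for cases declared as scale-up, and requires the LSA team that NCCL actually establishes to cover the entire EP world;
if the realized domain is smaller, the run fails before timing and publication.
Selects the DeepEP V2 JIT namespace only after the SM/QP, topology, device, and timeout parameters agree across all ranks, and strictly
requires one artifact for each of the five expected kernels.

Correctness and artifacts

The native oracle validates, on every rank, expert-specific payloads, destinations, source identity, multiplicity, weights, receive counts,
combine values, and input immutability. Provenance binds the verified image and squash contents, implementation/build identity, the
collective runtime actually loaded, the runtime fingerprint, and generated-kernel evidence.

GitHub result artifacts serve only as transient delivery inputs and are ultimately written to an isolated local content-addressed filesystem publisher. The pinned-source artifact is used only for execution, is explicitly rejected by the publisher, and expires after three days. Promotion is allowed only when three complete independent runs from the same source
SHA all satisfy exact coverage, p50/p99 stability, ordering stability, and complete controlled cohorts.
No managed database, managed object store, or third-party result hosting service is introduced.

Neither the tracked files nor the reachable branch history contain private runner endpoint literals.
experimental/CollectiveX/configs/platforms.yaml is not tracked by Git, is ignored, and is used only as a local operator
note.

Validation

131 Python contract/unit tests.
Two matrix generations are byte-identical, with SHA-256
292e05f8faccaa4971eda527a327190a9943e99d4f71611987f7b95f57f253e8.
All 360 cases use the same 8:64:32 timing, 512 samples per data point, and the same warmup contract.
Compileall, JSON Schema parsing, bash -n, ShellCheck, Actionlint, and git diff --check.
Chinese/English documentation parity, plus exact private-endpoint scans over tracked files and reachable Git history.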

claude · 2026-07-03T18:12:16Z

+  rsync -a --delete --delete-excluded \
+    --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
+    --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
+    --exclude='goal.md' --exclude='notes.md' \
+    "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/" >/dev/null 2>&1 \
+    || cx_die "staging CollectiveX failed"


🔴 The setup step writes the shard JSON to experimental/CollectiveX/results/.shard_${matrix.id}.json and sets CX_SHARD_FILE=results/.shard_${matrix.id}.json (a relative path). However, cx_stage_repo (runtime/common.sh:145-150) rsyncs the CollectiveX tree with --exclude='results/' --delete-excluded, which drops the shard file. As a result, for every staged single-tray SKU (b300 always; gb200/gb300 with EP4 via CX_NODES<=1), the [ -f "$CX_SHARD_FILE" ] guard at run_in_container.sh:458 fails and execution falls into the single-bench else branch (line 556+), silently running one wrong-config default (uniform/decode/bf16, empty case_id) instead of the shard's N scheduled cases. Downstream make_bundle will catch this via missing_identity/coverage, but only after GPU allocation was spent on the wrong workload. Cheap fixes: allow-list the shard file through the rsync (put --include='results/.shard_*.json' ahead of the excludes and exclude 'results/*' rather than 'results/', since rsync patterns are matched relative to the transfer root and the first matching rule wins); copy the shard file into the stage dir after the rsync; or resolve CX_SHARD_FILE against the original repo root in run_in_container.sh's SHARD guard the way the rack (EP8) launchers already do (see launch_gb300-nv.sh:92-93 / launch_gb200-nv.sh cx_ep_cases).

Extended reasoning...

The bug

The sweep workflow's shard-fanout step writes the resolved case list to experimental/CollectiveX/results/.shard_${matrix.id}.json:

    # .github/workflows/collectivex-sweep.yml
    env:
      CX_SHARD_FILE: results/.shard_${{ matrix.id }}.json   # RELATIVE path
    ...
    - name: Extract shard from matrix artifact
      working-directory: experimental/CollectiveX
      run: |
        ...
        json.dump({...,'cases':s['cases']}, open('results/.shard_${{ matrix.id }}.json','w'))

The physical file therefore lands at $REPO/experimental/CollectiveX/results/.shard_<id>.json, and CX_SHARD_FILE=results/.shard_<id>.json is interpreted relative to the container's cwd, which is /ix/experimental/CollectiveX.

For every SKU that requires CX_STAGE_DIR (b300 always; gb200/gb300 with EP4 via the CX_NODES<=1 delegate path in launch_gb200-nv.sh:57 / launch_gb300-nv.sh:47), the launcher calls:

    # launch_b300.sh:34, launch_gb200-nv.sh:52, launch_gb300-nv.sh:24
    MOUNT_SRC="$(cx_stage_repo "$REPO_ROOT" "$CX_STAGE_DIR")"

which rsyncs the tree with an exclude that drops results/:

    # experimental/CollectiveX/runtime/common.sh:145-150
    rsync -a --delete --delete-excluded \
      --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
      --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
      --exclude='goal.md' --exclude='notes.md' \
      "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/"

Both --exclude='results/' and --delete-excluded guarantee that the shard file the workflow just wrote is missing from the stage dir.

The consequence at runtime

The container mounts $MOUNT_SRC:/ix, cwd=/ix/experimental/CollectiveX. Inside run_in_container.sh, the SHARD guard resolves CX_SHARD_FILE relative to that cwd:

    # runtime/run_in_container.sh:458
    if [ -n "${CX_SHARD_FILE:-}" ] && [ -f "${CX_SHARD_FILE:-/nonexistent}" ]; then
      # SHARD mode: sweep every scheduled case
      ...
    else
      # Single-bench (workflow_dispatch) path
      # uses ${CX_MODE:-normal}, ${CX_PHASE:-decode}, ${CX_ROUTING:-uniform},
      # ${CX_DISPATCH_DTYPE:-bf16}, empty CX_CASE_ID/CX_SUITE/CX_WORKLOAD_NAME, ...

The path resolves to /ix/experimental/CollectiveX/results/.shard_<id>.json, which is missing because rsync excluded it. The test therefore fails, and the else branch runs a single default case that carries none of the shard's identity: one case instead of the intended N-case sweep.

Why the rack (EP8) paths escape

The rack-scale launchers iterate cases themselves in the launcher on the SUBMIT host (not inside the container). Their case-list helpers explicitly resolve the shard file against the original checkout when the relative path misses:

    # launch_gb300-nv.sh cx_ep8_cases (and launch_gb200-nv.sh cx_ep_cases)
    local sf="${CX_SHARD_FILE:-}"
    [ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "$CX_DIR/$sf" ] && sf="$CX_DIR/$sf"

The same workaround is absent from run_in_container.sh:458, so the EP4 single-tray path — which shares the b300/gb200-EP4/gb300-EP4 launchers with the staged mount — hits the missing file.
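
The fallback idiom used by the rack launchers can be modelled in a few lines of Python. This is a sketch of the resolution logic only, not a port of the launcher code: `resolve_shard_file`, its parameters, and the `.shard_demo.json` name are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_shard_file(shard_file, cwd, fallback_root):
    """Return an existing path for shard_file, or None.

    Keep the path if it exists relative to cwd; otherwise retry
    relative to fallback_root (the original checkout).
    """
    if not shard_file:
        return None
    candidate = Path(cwd) / shard_file
    if candidate.is_file():
        return candidate
    fallback = Path(fallback_root) / shard_file
    return fallback if fallback.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    checkout = Path(tmp) / "checkout"
    (checkout / "results").mkdir(parents=True)
    (checkout / "results" / ".shard_demo.json").write_text('{"cases": []}')
    staged = Path(tmp) / "staged"   # staged copy that lacks results/
    staged.mkdir()

    found = resolve_shard_file("results/.shard_demo.json", staged, checkout)
    found_in_checkout = found is not None and found.parent.parent == checkout
    missing = resolve_shard_file("results/.shard_other.json", staged, checkout)

print(found_in_checkout)  # True
print(missing)            # None
```

The fallback only helps when the original checkout is reachable from wherever the guard runs, which is the case on the submit host.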

Affected sweeps

Every single-tray staged shard in the v1 promoted matrix, per sweep_matrix.py + configs/suites.yaml platforms:

b300 (all shards; launch_b300.sh is single-node)

gb200 EP4 (CX_NODES<=1 -> run_in_container.sh)

gb300 EP4 (CX_NODES<=1 -> run_in_container.sh)

The h100-dgxc/h200-dgxc/b200-dgxc/mi325x/mi355x paths do not set CX_STAGE_DIR in this workflow (cx_stage_repo becomes a no-op) and are unaffected.

Concrete walk-through (b300 shard)

Setup job resolves matrix; writes experimental/CollectiveX/results/.shard_b300-deepep.json on the checkout with e.g. 24 cases (varied phase/dtype/routing/eplb across ep-core-v1 + ep-routing-v1).

Sweep job on the b300 runner exports CX_SHARD_FILE=results/.shard_b300-deepep.json, checks out the repo, and calls launch_b300.sh.

launch_b300.sh:34 -> cx_stage_repo rsyncs to $CX_STAGE_DIR/job_<id>/experimental/CollectiveX/ with --exclude='results/' --delete-excluded. The shard file is not copied.

srun --container-workdir=$MOUNT_DIR/experimental/CollectiveX ... run_in_container.sh. cwd inside container = /ix/experimental/CollectiveX.

run_in_container.sh:458 tests [ -f "results/.shard_b300-deepep.json" ] -> that resolves to /ix/experimental/CollectiveX/results/.shard_b300-deepep.json -> missing.

Execution falls into the else branch at line 556+. It dispatches ${CX_BENCH} once with CX_MODE=normal, CX_PHASE=decode, CX_ROUTING=uniform, CX_DISPATCH_DTYPE=bf16, empty CX_CASE_ID, empty CX_SUITE, empty CX_WORKLOAD_NAME, empty CX_REQUIRED_PUBLICATION.

One result JSON is produced with no case_id and mismatched identity; the other 23 scheduled cases never run.

The aggregate job's make_bundle.py validate_expected_coverage computes missing_identity + missing + identity_mismatch against matrix_full.json and raises SystemExit(...). The whole aggregate fails, after b300 GPU time was spent on the wrong workload.

Impact

For every b300/gb200-EP4/gb300-EP4 shard promoted through v1 (three of the eight SKUs in ep-core-v1 + ep-routing-v1), the sweep silently runs one wrong-config default point instead of the scheduled N-case sweep. Bundle validation catches the divergence but only post-hoc, so the failure is loud yet wasteful: GPU allocations spent, aggregate job red, invalidating the v1 dataset this PR is producing.

Fix

Any one of:

1. Allow the shard file through the rsync in cx_stage_repo (runtime/common.sh:146). Patterns are matched relative to the transfer root and the first matching rule wins, so the include must come first and the exclude must target the directory's contents:

    rsync -a --delete --delete-excluded \
      --include='results/.shard_*.json' \
      --exclude='__pycache__/' --exclude='results/*' ...

2. Copy the shard file into the stage dir after the rsync completes:

    [ -n "${CX_SHARD_FILE:-}" ] && [ -f "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" ] \
      && cp -a "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" \
         "$stage_dir/experimental/CollectiveX/$CX_SHARD_FILE"

3. Mirror the rack (EP8) launcher workaround in run_in_container.sh:458:

    sf="${CX_SHARD_FILE:-}"
    # $CX_DIR is not set inside the container; use the fixed workdir instead.
    [ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "/ix/experimental/CollectiveX/$sf" ] \
      && sf="/ix/experimental/CollectiveX/$sf"
    if [ -n "$sf" ] && [ -f "$sf" ]; then
      ...

Approach (1) or (2) is the smallest change with the least surface area.
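
The copy-after-staging approach can be illustrated with a small Python analogue. It is a sketch under stated assumptions: `stage_tree` is a hypothetical helper, `shutil.copytree` with an ignore filter stands in for the rsync excludes, and rsync's `--delete` behavior is not modelled.

```python
import shutil
import tempfile
from pathlib import Path


def stage_tree(src, dst, shard_rel=None):
    """Copy src to dst without results/, then restore one shard file if requested."""
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("__pycache__", "results"))
    if shard_rel:
        shard_src = Path(src) / shard_rel
        if shard_src.is_file():
            shard_dst = Path(dst) / shard_rel
            # results/ was skipped by the copy, so recreate it for the shard file.
            shard_dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(shard_src, shard_dst)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "CollectiveX"
    (src / "results").mkdir(parents=True)
    (src / "results" / ".shard_demo.json").write_text("{}")
    (src / "results" / "old_result.json").write_text("{}")
    dst = Path(tmp) / "stage" / "CollectiveX"

    stage_tree(src, dst, "results/.shard_demo.json")
    staged = sorted(p.name for p in (dst / "results").iterdir())

print(staged)  # ['.shard_demo.json']
```

Stale result files stay out of the stage directory while the one file the container needs is carried across.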

claude · 2026-07-03T18:12:17Z

+    elif _run(["ibstat", "-l"]):
+        devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]
+    return {


🟡 _rdma() calls _run(["ibstat", "-l"]) twice at env_capture.py:178-179 — once in the elif condition and once in the comprehension body. If the second invocation returns None (which _run does on shutil.which miss, TimeoutExpired/OSError, or nonzero exit), .splitlines() raises AttributeError and takes down env_capture.py under run_in_container.sh's set -euo pipefail. The trigger is genuinely rare (both calls are microseconds apart on a stable IB stack, and this branch runs only when ibv_devinfo is absent), so nit — but the fix is a one-line refactor mirroring the ibv_devinfo branch just above.

Extended reasoning...

The defect. env_capture._rdma() has an asymmetry between its two RDMA-listing branches:

    listing = _run(["ibv_devinfo", "-l"])   # assigned once, iterated once
    if listing:
        for line in listing.splitlines()[1:]:
            ...
    elif _run(["ibstat", "-l"]):            # called once (as a truthiness check)
        devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]  # called AGAIN

The ibv_devinfo branch just above does the right thing: assign once, reuse. The ibstat branch does not.

Why the crash is theoretical but real. _run() returns None on any of: shutil.which(cmd[0]) failing (line 51), subprocess.TimeoutExpired/OSError (line 57), or out.returncode != 0 (line 59). If the first call returns a truthy string but the second returns None — a transient OS timer glitch, an OOM-killed helper, a stray nonzero exit under load — then None.splitlines() raises AttributeError. Under run_in_container.sh's set -euo pipefail (line 33), that aborts the whole shard step before any GPU benchmark runs.

Step-by-step proof of the theoretical crash path:

Node has ibstat in $PATH but no ibv_devinfo (a real config: MI355X-style stacks with ibstat only).

First call: _run(["ibstat", "-l"]) succeeds → returns "mlx5_0\nmlx5_1\n" → elif condition is truthy.

Second call: a transient nonzero exit (e.g. ibstat racing an IB-driver reload, timer wraparound, PID-namespace hiccup) → out.returncode != 0 → _run returns None.

None.splitlines() → AttributeError: 'NoneType' object has no attribute 'splitlines' → Python exits nonzero → set -e aborts run_in_container.sh → the shard step fails before GPU work.

Why this is nit, not normal. Every verifier converged on the same practical assessment: ibstat -l is a fast local device listing with no network/filesystem dependency, so a transient failure between two back-to-back calls (microseconds apart) is extremely improbable. The elif branch itself only runs when ibv_devinfo is absent, which is uncommon on the target runners since both binaries come from the same InfiniBand userspace stack. And env_capture.py produces a diagnostic/provenance artifact — even a genuine crash here would break provenance capture, not the benchmark measurement. The defect exists but doesn't justify blocking merge.

The fix. One-line refactor to mirror the ibv_devinfo branch:

    else:
        listing = _run(["ibstat", "-l"])
        if listing:
            devices = [d.strip() for d in listing.splitlines() if d.strip()]

Same idiom the file uses immediately above. Eliminates the wasted subprocess call and the theoretical None-deref in one change. Worth doing as a follow-up cleanup, but the PR does not need to block for it.
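
A self-contained sketch of the call-once idiom follows. The `_run` here is reconstructed from the description above (None on a missing tool, timeout/OSError, or nonzero exit) and is an assumption, not the file's actual code. `ibstat_devices` and `flaky_run` are hypothetical names used to exercise the race.

```python
import shutil
import subprocess


def _run(cmd, timeout=5):
    """Return stdout, or None if the tool is missing, times out, or exits nonzero."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return None
    return out.stdout if out.returncode == 0 else None


def ibstat_devices(run=_run):
    """List device names from `ibstat -l`, invoking the command exactly once."""
    listing = run(["ibstat", "-l"])
    if not listing:
        return []
    return [d.strip() for d in listing.splitlines() if d.strip()]


# Fake runner that succeeds once and then fails, like the race described above.
calls = []


def flaky_run(cmd):
    calls.append(cmd)
    return "mlx5_0\nmlx5_1\n" if len(calls) == 1 else None


print(ibstat_devices(flaky_run))  # ['mlx5_0', 'mlx5_1']
print(len(calls))                 # 1
```

Because the result is bound once, a failure on a hypothetical second invocation can no longer reach `.splitlines()`.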

claude · 2026-07-03T18:12:19Z

+    "required_publication": env("CX_REQUIRED_PUBLICATION") or None,
+    "backend": backend,
+    "phase": phase,
+    "ep": integer("CX_EP", integer("CX_NGPUS", 1)),
+    "gpus_per_node": integer("CX_GPUS_PER_NODE", integer("CX_NGPUS", 1)),
+    "scale_up_domain": integer("CX_SCALE_UP_DOMAIN", integer("CX_NGPUS", 1)),
+    "dispatch_dtype": env("CX_DISPATCH_DTYPE", "bf16"),
+    "mode": env("CX_MODE", "normal"),
+    "contract": env("CX_MEASUREMENT_CONTRACT", "layout-and-dispatch-v1"),
+    "routing": env("CX_ROUTING", "uniform"),
+    "eplb": enabled("CX_EPLB"),
+    "combine_quant_mode": env("CX_COMBINE_QUANT_MODE", "none"),
+    "resource_mode": env("CX_RESOURCE_MODE", "tuned"),
+    "activation_profile": env("CX_ACTIVATION_PROFILE", "normal"),
+    "placement": env("CX_PLACEMENT", "packed"),
+    "routing_step": env("CX_ROUTING_STEP", "0"),
+    "uneven_tokens": env("CX_UNEVEN_TOKENS", "none"),
+    "tokens_ladder": env("CX_TOKENS_LADDER"),
+    "canonical": enabled("CX_CANONICAL"),
+    "sampling_contract": "fixed-512-v1",
+    "samples_per_point": integer("CX_SAMPLES_PER_POINT", 512),
+    "iters": integer("CX_ITERS", 8),
+    "trials": integer("CX_TRIALS", 64),
+    "warmup": integer("CX_WARMUP", 32),
+    "warmup_semantics": env(
+        "CX_WARMUP_SEMANTICS", "full-roundtrip-per-trial-point-v1"
+    ),


🟡 cx_emit_ep_failed_case (runtime/common.sh:256-287) builds failure.case without the hidden/topk/experts/nodes keys, but every matrix case emitted by sweep_matrix.py always carries all four. On the first sweep where any case exhausts its retries (flashinfer intermittent MNNVL, HybridEP/UCCL empty-rank, any deterministic rc=5), make_bundle's _identity_differences reports the same case_id four times as hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1, and validate_expected_coverage piles on by re-listing that case in missing, so the aggregate job aborts with a dual-report that hides the real signal (the case failed all retries — the intended fail-closed behavior). Fix in either place is fine: add the four fields to cx_emit_ep_failed_case from CX_HIDDEN/CX_TOPK/CX_EXPERTS (defaults 7168/8/256) and CX_NGPUS/SLURM_NNODES, or make _identity_differences skip these fields when the actual doc is a failed-case.

Extended reasoning...

The observed behavior

With the PR merged and any sweep that produces a failed-case record for a scheduled case, the aggregate job will fail with a message like:

bundle: expected-matrix coverage failed ( missing_identity=0 missing=['cxv1-...'] extra=[] duplicates=[] identity_mismatch=['cxv1-...:hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1'])

The same case_id appears in both missing and identity_mismatch, and the mismatch string names four fields that have nothing to do with why the case actually failed.

Step-by-step proof

Take a concrete promoted case, say h100-dgxc/deepep/decode under ep-core-v1 (uniform, canonical, deepseek-v3-v1 defaults). sweep_matrix.py:181-186 builds the matrix entry with:

    {
        ...,
        "hidden": "",    # h==7168 -> "" sentinel
        "topk": "",      # t==8 -> ""
        "experts": "",   # e==256 -> ""
        "nodes": "1",    # always str
        ...
    }

When every one of the 4 flashinfer attempts wedges on the intermittent MNNVL completion-flag deadlock (documented in run_in_container.sh around line 526), the last attempt's cx_emit_ep_failed_case writes a failed_*.json whose failure.case dict is missing the four keys entirely — the emitter reads CX_DISPATCH_DTYPE/CX_MODE/etc. but has no CX_HIDDEN/CX_TOPK/CX_EXPERTS/SLURM_NNODES reads.

aggregate_results.py keeps that failed-case doc as the newest for that case_id. Then make_bundle.py runs validate_expected_coverage:

_expected_case_identity(matrix_case) — "hidden" in case is true (value ""), so identity["hidden"] = int("" or 7168) = 7168. Same for topk/experts (8/256). "nodes" in case is true, identity["nodes"] = int("1") = 1. Expected identity contains {hidden: 7168, topk: 8, experts: 256, nodes: 1, ...}.

_actual_case_identity(failed_doc) (the failed-case branch, line 184-195) copies failure.case verbatim, calls _expected_case_identity. None of hidden/topk/experts/nodes are in that dict, so the if field in case: guard skips all four. Actual identity contains everything except the four scheduled shape fields.

_identity_differences iterates the expected identity's items; actual_identity.get("hidden") is None, None != 7168 -> hidden=None!=7168. Same for the other three.

validate_expected_coverage (line 294-298) hits the differences branch, appends the case_id to identity_mismatch, and does not add it to actual{}. Then missing = set(expected) - set(actual) (line 301) also contains that case_id. Line 319 raises the dual-report SystemExit.

validate_results.py:validate_doc's failed-case schema (v5, ~lines 234-243) requires a different, smaller field set that happens to match what the emitter writes, so it stays silent about this desync. Only make_bundle notices, and only in a way that obscures the real cause.
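
The mismatch string can be reproduced with a simplified model of the identity comparison. The functions below are illustrative reconstructions of the behavior described in the steps above, not the code in make_bundle.py, and they cover only the four shape fields.

```python
DEFAULTS = {"hidden": 7168, "topk": 8, "experts": 256}


def case_identity(case):
    """Normalize only the shape fields that are present; "" means 'use the default'."""
    identity = {}
    for field, default in DEFAULTS.items():
        if field in case:
            identity[field] = int(case[field] or default)
    if "nodes" in case:
        identity["nodes"] = int(case["nodes"])
    return identity


def identity_differences(expected, actual):
    """Report every expected field whose actual value differs or is absent."""
    return [f"{k}={actual.get(k)}!={v}" for k, v in expected.items() if actual.get(k) != v]


matrix_case = {"hidden": "", "topk": "", "experts": "", "nodes": "1"}
failed_case = {}  # failure.case as described above: no shape fields at all

diff = identity_differences(case_identity(matrix_case), case_identity(failed_case))
print(",".join(diff))
# hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1
```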

Why this fires in practice

The PR explicitly builds in retry logic — CX_FLASHINFER_RETRIES defaults to 3 attempts, and both the container and rack launchers loop attempts and preserve a failed_*.json when all attempts fail. Retry-exhaustion is expected behavior for known intermittents, but the aggregate step will now report those as identity_mismatch + missing for hidden/topk/experts/nodes — the least informative signal possible.

Impact

Bundle validation still correctly rejects the incomplete run (the intended fail-closed behavior), and no incorrect data ships, so this is a diagnostic-clarity regression rather than a correctness bug. It will, however, cost real triage time in CI: an operator staring at hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1 will not obviously infer "one flashinfer case exhausted its retries."

Fix

Either add the four fields to cx_emit_ep_failed_case (read CX_HIDDEN/CX_TOPK/CX_EXPERTS with defaults 7168/8/256, and CX_NGPUS/SLURM_NNODES for nodes), or teach _identity_differences/_actual_case_identity to drop these fields when the actual doc is a failed-case. Either way the two validators stay in sync.
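
The first option can be modelled in Python as follows. The real emitter is a shell function, so this is only a sketch of the field derivation: `shape_fields_from_env` is a hypothetical name, and taking `nodes` from SLURM_NNODES with a default of 1 is an assumption about how the two suggested sources would be combined.

```python
import os


def shape_fields_from_env(environ=os.environ):
    """Shape fields for a failed-case record, using the defaults named above."""
    def integer(name, default):
        raw = environ.get(name, "")
        return int(raw) if raw.strip() else default

    return {
        "hidden": integer("CX_HIDDEN", 7168),
        "topk": integer("CX_TOPK", 8),
        "experts": integer("CX_EXPERTS", 256),
        # Assumption: node count comes from SLURM_NNODES, defaulting to 1.
        "nodes": integer("SLURM_NNODES", 1),
    }


print(shape_fields_from_env({}))
# {'hidden': 7168, 'topk': 8, 'experts': 256, 'nodes': 1}
print(shape_fields_from_env({"CX_TOPK": "4", "SLURM_NNODES": "2"})["topk"])  # 4
```

Treating an empty value as "use the default" matches the "" sentinel the matrix uses for default shapes.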

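Freeze the 38-shard cross-vendor EP matrix on one 32-warmup, 512-observation protocol. Add native correctness, closed provenance, three-allocation promotion gates, and an isolated content-addressed filesystem publisher.

Close defects exposed by rejected allocations: isolate AMD Enroot state; correct MoRI output shape and unweighted combine semantics; standardize activation-only combine across every adapter; stage pinned DeepEP sources before compute allocation; authenticate reusable build outputs; normalize Hybrid enum identity; query loaded NCCL/RCCL runtimes; and harden cleanup and failure classification.

Normalize inherited B300 source-root permissions. Keep DeepEP V2 on PR #605 while pinning the official PR #630 scale-up fix, publish a stable extension evidence label with the real binary hash, require the realized NCCL LSA team to cover the full EP world when GIN is disabled, and key the exact five-kernel JIT evidence by realized topology and device code-generation inputs.

Chinese: Complete the isolated CollectiveX v1 expert-parallel benchmark suite. Freeze the 38-shard cross-vendor matrix on a uniform 32 warmups and 512 observations, and add native correctness checks, strict provenance, a promotion gate of three independent allocations, and a local content-addressed filesystem publisher. Fix the problems exposed by rejected allocations: isolate AMD Enroot state; correct the MoRI output shape and unweighted combine semantics; unify the activation-only combine boundary across all adapters; stage pinned-version DeepEP sources before compute-node allocation; verify reusable build artifacts; normalize Hybrid enum identity; read versions from the NCCL/RCCL runtime libraries actually loaded; and harden cleanup and failure classification. Normalize the inherited permissions of the B300 shared directory. DeepEP V2 stays on the PR #605 implementation and pins the official PR #630 pure scale-up fix; extension evidence is recorded under a stable label while the real binary hash is kept; when GIN is disabled, the LSA team that NCCL actually establishes must cover the entire EP world; and the evidence for the five expected JIT kernels is isolated by the actual topology and device code-generation parameters.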

Oseltamivir requested a review from a team July 3, 2026 17:38

github-project-automation Bot added this to InferenceMAX Board Jul 3, 2026

claude Bot reviewed Jul 3, 2026

Oseltamivir force-pushed the collectivex branch 4 times, most recently from 758fa52 to 1c5b901 Compare July 4, 2026 01:11

github-advanced-security AI found potential problems Jul 4, 2026

Comment thread experimental/CollectiveX/tests/test_sampling_contract.py Fixed

Oseltamivir force-pushed the collectivex branch 3 times, most recently from 7e5f80a to 28cbac4 Compare July 4, 2026 03:21

functionstackx changed the title ~~CollectiveX v1: cross-vendor EP benchmark suite~~ CollectiveX v1: cross-vendor EP benchmark suite / CollectiveX v1：跨厂商 EP 基准测试套件 Jul 4, 2026

Oseltamivir force-pushed the collectivex branch from 28cbac4 to aa318f7 Compare July 4, 2026 06:58

Oseltamivir force-pushed the collectivex branch 13 times, most recently from 63c2335 to 57efb35 Compare July 4, 2026 13:38

Oseltamivir force-pushed the collectivex branch from 57efb35 to 4ff5841 Compare July 4, 2026 13:42

Oseltamivir wants to merge 1 commit into main from collectivex

2 participants
