Finalize CollectiveX v1 cross-vendor EP benchmark suite / 完成 CollectiveX v1 跨厂商 EP 基准测试套件 #2004

Open
Oseltamivir wants to merge 1 commit into main from collectivex

Oseltamivir commented Jul 3, 2026

Collaborator

Summary

Finalizes the isolated CollectiveX v1 expert-parallel communication benchmark under
experimental/CollectiveX/. The branch is ready for three complete no-canary qualification runs;
it does not claim or include promoted v1 results yet.

Benchmark contract

  • Covers H100, H200, B200, B300, GB200, GB300, MI325X, and MI355X.
  • Supports DeepEP V1, DeepEP V2 PR #605
    with the official PR #630 scale-up fix,
    DeepEP Hybrid, UCCL, MoRI, and an NCCL/RCCL reference.
  • Resolves 38 runnable allocation shards and 360 requested cases / 840 token points: 228 runnable
    cases / 532 points plus 132 explicit unsupported cases / 308 points.
  • Uses 8 timed iterations x 64 trials, 32 synchronized full-roundtrip warmups before every measured
    component/trial/point, and exactly 512 observations for every case.
  • Standardizes combine as activation-only unweighted rank-sum. Dispatch gate weights remain
    oracle-checked but are not returned through the timed combine path.
  • Keeps uniform routing as the headline; Zipf and Zipf+EPLB are experimental sensitivity evidence.
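
The counts in the contract above can be cross-checked with simple arithmetic. This is a minimal sketch using only the figures quoted in the list; the variable names are illustrative and are not taken from the suite.

```python
# Figures quoted in the benchmark contract; the names are illustrative only.
TIMED_ITERS = 8
TRIALS = 64
RUNNABLE_CASES, RUNNABLE_POINTS = 228, 532
UNSUPPORTED_CASES, UNSUPPORTED_POINTS = 132, 308

# 8 timed iterations x 64 trials gives the fixed observation count.
observations = TIMED_ITERS * TRIALS
# Runnable plus explicitly unsupported must equal what was requested.
requested_cases = RUNNABLE_CASES + UNSUPPORTED_CASES
requested_points = RUNNABLE_POINTS + UNSUPPORTED_POINTS

print(observations, requested_cases, requested_points)  # 512 360 840
```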

Qualification fixes

  • Fetches and pin-validates source-built DeepEP backends on the GitHub-hosted setup job, then passes a same-run, three-day source artifact to self-hosted jobs; canonical runners never fall back to an upstream network fetch.
  • Uses exact, command-scoped Git trust for UID-remapped shared filesystems and rejects wildcard trust.
  • Normalizes inherited B300 source-root permissions to private mode 0700 and preserves detailed
    source-staging diagnostics only in private logs.
  • Labels the DeepEP extension as deep_ep._C in public evidence while hashing the actual binary.
  • Disables GIN only for declared scale-up cases, then requires NCCL's realized LSA team to cover the
    full EP world; a smaller realized domain fails before timing or publication.
  • Selects the DeepEP V2 JIT namespace after SM/QP, topology, device, and timeout inputs agree across
    ranks, and requires exactly one artifact for each of the five expected kernels.

Correctness and artifacts

The native oracle validates expert-specific payloads, destinations, source identity, multiplicity,
weights, receive counts, combine values, and input immutability on every rank. Provenance binds the
verified image and squash bytes, implementation/build identity, loaded collective runtime, runtime
fingerprint, and generated-kernel evidence.

GitHub result artifacts are transient delivery inputs to an isolated local content-addressed filesystem
publisher. The pinned-source artifact is execution-only, is rejected by the publisher, and expires after
three days. Promotion requires exactly three complete independent runs from one source SHA, exact
coverage, stable p50/p99 evidence, stable ordering, and complete controlled cohorts. No managed
database, managed object store, or third-party result hosting is introduced.
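
As a rough illustration of the promotion rule, the sketch below checks only the run-count, single-source-SHA, and exact-coverage conditions. The `Run` record and `eligible_for_promotion` function are hypothetical and are not part of the publisher; the p50/p99 stability, ordering, and cohort checks are deliberately left out because their criteria are not spelled out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical summary of one qualification run."""
    source_sha: str
    complete: bool
    case_ids: frozenset


def eligible_for_promotion(runs, expected_case_ids, required_runs=3):
    """Check run count, single source SHA, and exact coverage only."""
    if len(runs) != required_runs:
        return False
    # Every run must come from the same source SHA.
    if len({r.source_sha for r in runs}) != 1:
        return False
    expected = frozenset(expected_case_ids)
    # Exact coverage: no missing and no extra cases in any run.
    return all(r.complete and r.case_ids == expected for r in runs)
```

A run set with two runs, mixed SHAs, or one run missing a case would be rejected by this check before any stability analysis is attempted.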

The tracked tree and reachable branch history contain none of the private runner endpoint literals.
experimental/CollectiveX/configs/platforms.yaml is absent from Git, ignored, and used only as a
local operator note.

Validation

  • 131 Python contract/unit tests.
  • Byte-identical matrix generation at
    292e05f8faccaa4971eda527a327190a9943e99d4f71611987f7b95f57f253e8.
  • All 360 cases share timing 8:64:32, 512 samples per point, and one warmup contract.
  • Compileall, JSON schema parsing, bash -n, ShellCheck, Actionlint, and git diff --check.
  • Bilingual documentation parity and exact private-endpoint scans across tracked files and reachable
    Git history.
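
A byte-identical generation check of the kind listed above could look like the following sketch. `generate_matrix` is a hypothetical stand-in for the real generator and serializes a toy matrix, so the digest it prints is not the 292e05f8... value quoted in the list.

```python
import hashlib
import json


def generate_matrix() -> bytes:
    """Hypothetical stand-in for the matrix generator; returns serialized bytes."""
    matrix = {"shards": [{"id": "example", "cases": ["case-a", "case-b"]}]}
    # sort_keys and fixed separators keep the serialization deterministic.
    return json.dumps(matrix, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


first, second = generate_matrix(), generate_matrix()
assert first == second, "matrix generation is not byte-identical"
digest = sha256_hex(first)
print(digest)
```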

中文说明

本 PR 完成位于 experimental/CollectiveX/ 的隔离式 CollectiveX v1 专家并行(EP)通信基准测试。
当前分支已准备执行三轮完整、无 canary 的资格验证;目前尚未宣称或提交任何已晋级的 v1 结果。

基准测试约定

  • 覆盖 H100、H200、B200、B300、GB200、GB300、MI325X 和 MI355X。
  • 支持 DeepEP V1、DeepEP V2 PR #605
    及官方 PR #630 scale-up 修复、DeepEP Hybrid、
    UCCL、MoRI,以及 NCCL/RCCL 参考后端。
  • 生成 38 个可运行分配分片,共请求 360 个用例 / 840 个 token 数据点:其中 228 个可运行用例 /
    532 个数据点,另有 132 个明确标记为不支持的用例 / 308 个数据点。
  • 每个用例统一采用 8 次计时迭代 x 64 次试验;每个组件、试验和数据点测量前执行 32 次同步完整
    往返预热,最终严格得到 512 个观测值。
  • 所有后端统一采用 activation-only、unweighted rank-sum combine。Dispatch gate weights 仍由
    oracle 校验,但不会通过被测 combine 路径返回。
  • Uniform routing 作为主结果;Zipf 和 Zipf+EPLB 仅作为实验性敏感度证据。

资格验证修复

  • 由 GitHub 托管的 setup job 获取并校验源码构建型 DeepEP 后端的精确版本,再通过保留三天、仅供同一次运行使用的源码产物交给 self-hosted job;规范运行不会回退到 runner 侧的上游网络获取。
  • 对 UID 重映射共享文件系统仅启用精确、单命令作用域的 Git 信任,并拒绝通配信任。
  • 将 B300 共享目录继承的权限规范化为私有 0700;详细源码暂存诊断只保留在私有日志中。
  • 公共证据使用稳定的 deep_ep._C 标签,同时哈希真实 extension 二进制内容。
  • 仅对声明为 scale-up 的用例禁用 GIN,并要求 NCCL 实际建立的 LSA team 覆盖整个 EP world;
    如果实际 domain 更小,则在计时和发布前直接失败。
  • 在所有 rank 的 SM/QP、拓扑、设备及超时参数一致后才选择 DeepEP V2 JIT namespace,并严格
    要求五个预期 kernel 各有一份产物。

正确性与产物

原生 oracle 在每个 rank 校验专家特定的 payload、目标、源身份、重复次数、权重、接收计数、
combine 数值及输入不可变性。溯源信息绑定已验证的镜像与 squash 内容、实现/构建身份、实际加载的
collective runtime、运行时指纹及生成 kernel 证据。

GitHub 结果产物仅作为临时传输输入,最终写入隔离的本地内容寻址文件系统发布器。固定版本源码产物只用于执行,发布器会明确拒绝该产物,并在三天后过期。只有来自同一 source
SHA 的三轮完整独立运行同时满足精确覆盖、p50/p99 稳定性、排序稳定性及完整受控 cohort,才允许
晋级。不引入托管数据库、托管对象存储或第三方结果托管服务。

受跟踪文件和可达分支历史均不包含私有 runner endpoint 字面值。
experimental/CollectiveX/configs/platforms.yaml 不受 Git 跟踪,已被忽略,仅作为本地 operator
备注使用。

验证

  • 131 个 Python 约定/单元测试。
  • 两次矩阵生成逐字节一致,SHA-256 为
    292e05f8faccaa4971eda527a327190a9943e99d4f71611987f7b95f57f253e8
  • 全部 360 个用例统一使用 8:64:32 timing、每个数据点 512 个样本及同一预热约定。
  • Compileall、JSON Schema 解析、bash -n、ShellCheck、Actionlint 和 git diff --check
  • 中英文文档一致性,以及对受跟踪文件和可达 Git 历史执行的私有 endpoint 精确扫描。

Comment on lines +145 to +150
rsync -a --delete --delete-excluded \
--exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
--exclude='configs/platforms.yaml' --exclude='private-infra.md' \
--exclude='goal.md' --exclude='notes.md' \
"$repo_root/experimental/CollectiveX" "$stage_dir/experimental/" >/dev/null 2>&1 \
|| cx_die "staging CollectiveX failed"

Contributor

🔴 The setup step writes the shard JSON to experimental/CollectiveX/results/.shard_${matrix.id}.json and sets CX_SHARD_FILE=results/.shard_${matrix.id}.json (relative), but cx_stage_repo (runtime/common.sh:145-150) rsyncs the CollectiveX tree with --exclude='results/' --delete-excluded and drops the shard file — so for every staged single-tray SKU (b300 always; gb200/gb300 with EP4 via CX_NODES<=1), the [ -f "$CX_SHARD_FILE" ] guard at run_in_container.sh:458 fails and execution falls into the single-bench else branch (line 556+), silently running one wrong-config default (uniform/decode/bf16, empty case_id) instead of the shard's N scheduled cases. Downstream make_bundle will catch this via missing_identity/coverage but only after GPU allocation was spent on the wrong workload. Cheap fix: allow-list the shard file through the rsync (--include='experimental/CollectiveX/results/' --include='experimental/CollectiveX/results/.shard_*.json' before the results/ exclude), copy the shard file into the stage dir after the rsync, or resolve CX_SHARD_FILE against the original repo root in run_in_container.sh's SHARD guard the way the rack (EP8) launchers already do (see launch_gb300-nv.sh:92-93 / launch_gb200-nv.sh cx_ep_cases).

Extended reasoning...

The bug

The sweep workflow's shard-fanout step writes the resolved case list to experimental/CollectiveX/results/.shard_${matrix.id}.json:

# .github/workflows/collectivex-sweep.yml
env:
  CX_SHARD_FILE: results/.shard_${{ matrix.id }}.json   # RELATIVE path
...
- name: Extract shard from matrix artifact
  working-directory: experimental/CollectiveX
  run: |
    ...
    json.dump({...,'cases':s['cases']}, open('results/.shard_${{ matrix.id }}.json','w'))

The physical file therefore lands at $REPO/experimental/CollectiveX/results/.shard_<id>.json, and CX_SHARD_FILE=results/.shard_<id>.json is interpreted relative to the container's cwd, which is /ix/experimental/CollectiveX.

For every SKU that requires CX_STAGE_DIR (b300 always; gb200/gb300 with EP4 via the CX_NODES<=1 delegate path in launch_gb200-nv.sh:57 / launch_gb300-nv.sh:47), the launcher calls:

# launch_b300.sh:34, launch_gb200-nv.sh:52, launch_gb300-nv.sh:24
MOUNT_SRC="$(cx_stage_repo "$REPO_ROOT" "$CX_STAGE_DIR")"

which rsyncs the tree with an exclude that drops results/:

# experimental/CollectiveX/runtime/common.sh:145-150
rsync -a --delete --delete-excluded \
  --exclude='__pycache__/' --exclude='results/' --exclude='.cx_workloads/' \
  --exclude='configs/platforms.yaml' --exclude='private-infra.md' \
  --exclude='goal.md' --exclude='notes.md' \
  "$repo_root/experimental/CollectiveX" "$stage_dir/experimental/"

Both --exclude='results/' and --delete-excluded guarantee that the shard file the workflow just wrote is missing from the stage dir.

The consequence at runtime

The container mounts $MOUNT_SRC:/ix, cwd=/ix/experimental/CollectiveX. Inside run_in_container.sh, the SHARD guard resolves CX_SHARD_FILE relative to that cwd:

# runtime/run_in_container.sh:458
if [ -n "${CX_SHARD_FILE:-}" ] && [ -f "${CX_SHARD_FILE:-/nonexistent}" ]; then
  # SHARD mode — sweep every scheduled case
  ...
else
  # Single-bench (workflow_dispatch) path
  # uses ${CX_MODE:-normal}, ${CX_PHASE:-decode}, ${CX_ROUTING:-uniform},
  # ${CX_DISPATCH_DTYPE:-bf16}, empty CX_CASE_ID/CX_SUITE/CX_WORKLOAD_NAME, ...

The file resolves to /ix/experimental/CollectiveX/results/.shard_<id>.json, which is missing because rsync excluded it, so the test fails and the else branch runs a single default case that carries none of the shard's identity in place of the intended N-case sweep.

Why the rack (EP8) paths escape

The rack-scale launchers iterate cases themselves in the launcher on the SUBMIT host (not inside the container). Their case-list helpers explicitly resolve the shard file against the original checkout when the relative path misses:

# launch_gb300-nv.sh cx_ep8_cases (and launch_gb200-nv.sh cx_ep_cases)
local sf="${CX_SHARD_FILE:-}"
[ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "$CX_DIR/$sf" ] && sf="$CX_DIR/$sf"

The same workaround is absent from run_in_container.sh:458, so the EP4 single-tray path — which shares the b300/gb200-EP4/gb300-EP4 launchers with the staged mount — hits the missing file.
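
For illustration, the fallback that the rack launchers apply in shell can be expressed in Python as below. `resolve_shard_file` and its `fallback_root` parameter are hypothetical names; the logic mirrors the one-line shell test shown above, where the fallback root is the original checkout (`$CX_DIR`).

```python
import os


def resolve_shard_file(shard_file: str, fallback_root: str) -> str:
    """Return shard_file, or fallback_root/shard_file when only that one exists."""
    # A relative shard_file is resolved against the current working directory.
    if shard_file and not os.path.isfile(shard_file):
        candidate = os.path.join(fallback_root, shard_file)
        if os.path.isfile(candidate):
            return candidate
    # Unchanged when it already exists, is empty, or no fallback was found.
    return shard_file
```

The fallback only helps when `fallback_root` points at a tree that still contains the shard file, which is the case on the submit host but not in a stage directory that excluded results/.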

Affected sweeps

Every single-tray staged shard in the v1 promoted matrix, per sweep_matrix.py + configs/suites.yaml platforms:

  • b300 (all shards; launch_b300.sh is single-node)
  • gb200 EP4 (CX_NODES<=1 -> run_in_container.sh)
  • gb300 EP4 (CX_NODES<=1 -> run_in_container.sh)

The h100-dgxc/h200-dgxc/b200-dgxc/mi325x/mi355x paths do not set CX_STAGE_DIR in this workflow (cx_stage_repo becomes a no-op) and are unaffected.

Concrete walk-through (b300 shard)

  1. Setup job resolves matrix; writes experimental/CollectiveX/results/.shard_b300-deepep.json on the checkout with e.g. 24 cases (varied phase/dtype/routing/eplb across ep-core-v1 + ep-routing-v1).
  2. Sweep job on the b300 runner exports CX_SHARD_FILE=results/.shard_b300-deepep.json, checks out the repo, and calls launch_b300.sh.
  3. launch_b300.sh:34 -> cx_stage_repo rsyncs to $CX_STAGE_DIR/job_<id>/experimental/CollectiveX/ with --exclude='results/' --delete-excluded. The shard file is not copied.
  4. srun --container-workdir=$MOUNT_DIR/experimental/CollectiveX ... run_in_container.sh. cwd inside container = /ix/experimental/CollectiveX.
  5. run_in_container.sh:458 tests [ -f "results/.shard_b300-deepep.json" ] -> that resolves to /ix/experimental/CollectiveX/results/.shard_b300-deepep.json -> missing.
  6. Execution falls into the else branch at line 556+. It dispatches ${CX_BENCH} once with CX_MODE=normal, CX_PHASE=decode, CX_ROUTING=uniform, CX_DISPATCH_DTYPE=bf16, empty CX_CASE_ID, empty CX_SUITE, empty CX_WORKLOAD_NAME, empty CX_REQUIRED_PUBLICATION.
  7. One result JSON is produced with no case_id and mismatched identity; the other 23 scheduled cases never run.
  8. Aggregate job's make_bundle.py validate_expected_coverage computes missing_identity + missing + identity_mismatch against matrix_full.json and raises SystemExit(...). The whole aggregate fails, after b300 GPU time was spent on the wrong workload.

Impact

For every b300/gb200-EP4/gb300-EP4 shard promoted through v1 (three of the eight SKUs in ep-core-v1 + ep-routing-v1), the sweep silently runs one wrong-config default point instead of the scheduled N-case sweep. Bundle validation catches the divergence but only post-hoc, so the failure is loud yet wasteful: GPU allocations spent, aggregate job red, invalidating the v1 dataset this PR is producing.

Fix

Any one of:

  1. Allow the shard file through the rsync in cx_stage_repo (runtime/common.sh:146):

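    # Filter patterns are matched relative to the transfer root (the parent of
    # CollectiveX), so anchor them there; results/* keeps other result files out.
    rsync -a --delete --delete-excluded \
      --include='/CollectiveX/results/' \
      --include='/CollectiveX/results/.shard_*.json' \
      --exclude='/CollectiveX/results/*' \
      --exclude='__pycache__/' --exclude='results/' ...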
  2. Copy the shard file into the stage dir after the rsync completes:

    [ -n "${CX_SHARD_FILE:-}" ] && [ -f "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" ] \
      && cp -a "$repo_root/experimental/CollectiveX/$CX_SHARD_FILE" \
              "$stage_dir/experimental/CollectiveX/$CX_SHARD_FILE"
  3. Mirror the rack (EP8) launcher workaround in run_in_container.sh:458:

    sf="${CX_SHARD_FILE:-}"
    # $CX_DIR is not set inside the container; use the fixed workdir instead.
    [ -n "$sf" ] && [ ! -f "$sf" ] && [ -f "/ix/experimental/CollectiveX/$sf" ] \
      && sf="/ix/experimental/CollectiveX/$sf"
    if [ -n "$sf" ] && [ -f "$sf" ]; then ...

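Approach (1) or (2) is the smallest change with the least surface area. Approach (3) helps only if the original checkout is also reachable from inside the container: as written, /ix/experimental/CollectiveX is both the staged mount and the container's cwd, so the fallback path points at the same missing file.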

Comment thread experimental/CollectiveX/env_capture.py Outdated
Comment on lines +178 to +180
elif _run(["ibstat", "-l"]):
    devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]
    return {

Contributor

🟡 _rdma() calls _run(["ibstat", "-l"]) twice at env_capture.py:178-179 — once in the elif condition and once in the comprehension body. If the second invocation returns None (which _run does on shutil.which miss, TimeoutExpired/OSError, or nonzero exit), .splitlines() raises AttributeError and takes down env_capture.py under run_in_container.sh's set -euo pipefail. The trigger is genuinely rare (both calls are microseconds apart on a stable IB stack, and this branch runs only when ibv_devinfo is absent), so nit — but the fix is a one-line refactor mirroring the ibv_devinfo branch just above.

Extended reasoning...

The defect. env_capture._rdma() has an asymmetry between its two RDMA-listing branches:

listing = _run(["ibv_devinfo", "-l"])   # assigned once, iterated once
if listing:
    for line in listing.splitlines()[1:]:
        ...
elif _run(["ibstat", "-l"]):             # called once (as a truthiness check)
    devices = [d.strip() for d in _run(["ibstat", "-l"]).splitlines() if d.strip()]  # called AGAIN

The ibv_devinfo branch just above does the right thing: assign once, reuse. The ibstat branch does not.

Why the crash is theoretical but real. _run() returns None on any of: shutil.which(cmd[0]) failing (line 51), subprocess.TimeoutExpired/OSError (line 57), or out.returncode != 0 (line 59). If the first call returns a truthy string but the second returns None — a transient OS timer glitch, an OOM-killed helper, a stray nonzero exit under load — then None.splitlines() raises AttributeError. Under run_in_container.sh's set -euo pipefail (line 33), that aborts the whole shard step before any GPU benchmark runs.

Step-by-step proof of the theoretical crash path:

  1. Node has ibstat in $PATH but no ibv_devinfo (a real config: MI355X-style stacks with ibstat only).
  2. First call: _run(["ibstat", "-l"]) succeeds → returns "mlx5_0\nmlx5_1\n" → elif condition is truthy.
  3. Second call: a transient nonzero exit (e.g. ibstat racing an IB-driver reload, timer wraparound, PID-namespace hiccup) → out.returncode != 0 → _run returns None.
  4. None.splitlines() → AttributeError: 'NoneType' object has no attribute 'splitlines' → Python exits nonzero → set -e aborts run_in_container.sh → the shard step fails before GPU work.

Why this is nit, not normal. Every verifier converged on the same practical assessment: ibstat -l is a fast local device listing with no network/filesystem dependency, so a transient failure between two back-to-back calls (microseconds apart) is extremely improbable. The elif branch itself only runs when ibv_devinfo is absent, which is uncommon on the target runners since both binaries come from the same InfiniBand userspace stack. And env_capture.py produces a diagnostic/provenance artifact — even a genuine crash here would break provenance capture, not the benchmark measurement. The defect exists but doesn't justify blocking merge.

The fix. One-line refactor to mirror the ibv_devinfo branch:

else:
    listing = _run(["ibstat", "-l"])
    if listing:
        devices = [d.strip() for d in listing.splitlines() if d.strip()]

Same idiom the file uses immediately above. Eliminates the wasted subprocess call and the theoretical None-deref in one change. Worth doing as a follow-up cleanup, but the PR does not need to block for it.

Comment on lines +260 to +286
"required_publication": env("CX_REQUIRED_PUBLICATION") or None,
"backend": backend,
"phase": phase,
"ep": integer("CX_EP", integer("CX_NGPUS", 1)),
"gpus_per_node": integer("CX_GPUS_PER_NODE", integer("CX_NGPUS", 1)),
"scale_up_domain": integer("CX_SCALE_UP_DOMAIN", integer("CX_NGPUS", 1)),
"dispatch_dtype": env("CX_DISPATCH_DTYPE", "bf16"),
"mode": env("CX_MODE", "normal"),
"contract": env("CX_MEASUREMENT_CONTRACT", "layout-and-dispatch-v1"),
"routing": env("CX_ROUTING", "uniform"),
"eplb": enabled("CX_EPLB"),
"combine_quant_mode": env("CX_COMBINE_QUANT_MODE", "none"),
"resource_mode": env("CX_RESOURCE_MODE", "tuned"),
"activation_profile": env("CX_ACTIVATION_PROFILE", "normal"),
"placement": env("CX_PLACEMENT", "packed"),
"routing_step": env("CX_ROUTING_STEP", "0"),
"uneven_tokens": env("CX_UNEVEN_TOKENS", "none"),
"tokens_ladder": env("CX_TOKENS_LADDER"),
"canonical": enabled("CX_CANONICAL"),
"sampling_contract": "fixed-512-v1",
"samples_per_point": integer("CX_SAMPLES_PER_POINT", 512),
"iters": integer("CX_ITERS", 8),
"trials": integer("CX_TRIALS", 64),
"warmup": integer("CX_WARMUP", 32),
"warmup_semantics": env(
    "CX_WARMUP_SEMANTICS", "full-roundtrip-per-trial-point-v1"
),

Contributor

🟡 cx_emit_ep_failed_case (runtime/common.sh:256-287) builds failure.case without the hidden/topk/experts/nodes keys, but every matrix case emitted by sweep_matrix.py always carries all four. On the first sweep where any case exhausts its retries (flashinfer intermittent MNNVL, HybridEP/UCCL empty-rank, any deterministic rc=5), make_bundle's _identity_differences reports the same case_id four times as hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1, and validate_expected_coverage piles on by re-listing that case in missing, so the aggregate job aborts with a dual-report that hides the real signal (the case failed all retries — the intended fail-closed behavior). Fix in either place is fine: add the four fields to cx_emit_ep_failed_case from CX_HIDDEN/CX_TOPK/CX_EXPERTS (defaults 7168/8/256) and CX_NGPUS/SLURM_NNODES, or make _identity_differences skip these fields when the actual doc is a failed-case.

Extended reasoning...

The observed behavior

With the PR merged and any sweep that produces a failed-case record for a scheduled case, the aggregate job will fail with a message like:

bundle: expected-matrix coverage failed (
  missing_identity=0 missing=['cxv1-...'] extra=[] duplicates=[]
  identity_mismatch=['cxv1-...:hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1'])

The same case_id appears in both missing and identity_mismatch, and the mismatch string names four fields that have nothing to do with why the case actually failed.

Step-by-step proof

Take a concrete promoted case, say h100-dgxc/deepep/decode under ep-core-v1 (uniform, canonical, deepseek-v3-v1 defaults). sweep_matrix.py:181-186 builds the matrix entry with:

{
  ...,
  "hidden": "",     # h==7168 -> "" sentinel
  "topk": "",       # t==8    -> ""
  "experts": "",    # e==256  -> ""
  "nodes": "1",     # always str
  ...
}

When every one of the 4 flashinfer attempts wedges on the intermittent MNNVL completion-flag deadlock (documented in run_in_container.sh around line 526), the last attempt's cx_emit_ep_failed_case writes a failed_*.json whose failure.case dict is missing the four keys entirely — the emitter reads CX_DISPATCH_DTYPE/CX_MODE/etc. but has no CX_HIDDEN/CX_TOPK/CX_EXPERTS/SLURM_NNODES reads.

aggregate_results.py keeps that failed-case doc as the newest for that case_id. Then make_bundle.py runs validate_expected_coverage:

  1. _expected_case_identity(matrix_case): "hidden" in case is true (value ""), so identity["hidden"] = int("" or 7168) = 7168. Same for topk/experts (8/256). "nodes" in case is true, so identity["nodes"] = int("1") = 1. Expected identity contains {hidden: 7168, topk: 8, experts: 256, nodes: 1, ...}.
  2. _actual_case_identity(failed_doc) (the failed-case branch, line 184-195) copies failure.case verbatim, calls _expected_case_identity. None of hidden/topk/experts/nodes are in that dict, so the if field in case: guard skips all four. Actual identity contains everything except the four scheduled shape fields.
  3. _identity_differences iterates the expected identity's items; actual_identity.get("hidden") is None, None != 7168 -> hidden=None!=7168. Same for the other three.
  4. validate_expected_coverage (line 294-298) hits the differences branch, appends the case_id to identity_mismatch, and does not add it to actual{}. Then missing = set(expected) - set(actual) (line 301) also contains that case_id. Line 319 raises the dual-report SystemExit.
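
The four steps above can be reproduced with a small stand-alone model. The helper names below (`case_identity`, `identity_differences`) are simplified, hypothetical versions of the make_bundle.py functions and cover only the four shape fields, using the defaults and sentinel handling described in the steps.

```python
DEFAULTS = {"hidden": 7168, "topk": 8, "experts": 256}


def case_identity(case: dict) -> dict:
    """Normalize the shape fields that are present; absent keys stay absent."""
    identity = {}
    for field, default in DEFAULTS.items():
        if field in case:
            identity[field] = int(case[field] or default)  # "" means default
    if "nodes" in case:
        identity["nodes"] = int(case["nodes"])
    return identity


def identity_differences(expected: dict, actual: dict) -> list:
    """Report every expected field whose actual value differs or is missing."""
    return [f"{k}={actual.get(k)}!={v}" for k, v in expected.items() if actual.get(k) != v]


matrix_case = {"hidden": "", "topk": "", "experts": "", "nodes": "1"}
failed_case = {}  # failure.case as emitted today: no shape keys at all

diffs = identity_differences(case_identity(matrix_case), case_identity(failed_case))
print(",".join(diffs))
# hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1
```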

validate_results.py:validate_doc's failed-case schema (v5, ~lines 234-243) requires a different, smaller field set that happens to match what the emitter writes, so it stays silent about this desync. Only make_bundle notices, and only in a way that obscures the real cause.

Why this fires in practice

The PR explicitly builds in retry logic — CX_FLASHINFER_RETRIES defaults to 3 attempts, and both the container and rack launchers loop attempts and preserve a failed_*.json when all attempts fail. Retry-exhaustion is expected behavior for known intermittents, but the aggregate step will now report those as identity_mismatch + missing for hidden/topk/experts/nodes — the least informative signal possible.

Impact

Bundle validation still correctly rejects the incomplete run (the intended fail-closed behavior), and no incorrect data ships, so this is a diagnostic-clarity regression rather than a correctness bug. It will, however, cost real triage time in CI: an operator staring at hidden=None!=7168,topk=None!=8,experts=None!=256,nodes=None!=1 will not obviously infer "one flashinfer case exhausted its retries."

Fix

Either add the four fields to cx_emit_ep_failed_case (read CX_HIDDEN/CX_TOPK/CX_EXPERTS with defaults 7168/8/256, and CX_NGPUS/SLURM_NNODES for nodes), or teach _identity_differences/_actual_case_identity to drop these fields when the actual doc is a failed-case. Either way the two validators stay in sync.
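
A minimal sketch of the first option, written in the style of the emitter's `integer(...)` helper. The `integer` reimplementation, the `shape_fields` name, and the choice of SLURM_NNODES with a default of 1 for `nodes` are assumptions; the real emitter may need to derive `nodes` from CX_NGPUS instead.

```python
import os


def integer(name: str, default: int, environ=os.environ) -> int:
    """Read an integer environment variable; use default when unset or empty."""
    value = environ.get(name, "")
    return int(value) if value.strip() else default


def shape_fields(environ=os.environ) -> dict:
    """Shape keys to merge into failure.case so both validators see them."""
    return {
        "hidden": integer("CX_HIDDEN", 7168, environ),
        "topk": integer("CX_TOPK", 8, environ),
        "experts": integer("CX_EXPERTS", 256, environ),
        "nodes": integer("SLURM_NNODES", 1, environ),  # assumed source for nodes
    }
```

With these keys present, a retry-exhausted case would compare equal on shape, and the aggregate report would be left with only the signal that the case failed.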

Oseltamivir force-pushed the collectivex branch 4 times, most recently from 758fa52 to 1c5b901 (July 4, 2026 01:11)
Comment thread experimental/CollectiveX/tests/test_sampling_contract.py Fixed
Oseltamivir force-pushed the collectivex branch 3 times, most recently from 7e5f80a to 28cbac4 (July 4, 2026 03:21)
functionstackx changed the title from "CollectiveX v1: cross-vendor EP benchmark suite" to "CollectiveX v1: cross-vendor EP benchmark suite / CollectiveX v1:跨厂商 EP 基准测试套件" (Jul 4, 2026)
functionstackx changed the title from "CollectiveX v1: cross-vendor EP benchmark suite / CollectiveX v1:跨厂商 EP 基准测试套件" to "CollectiveX v1: cross-vendor EP benchmark suite / CollectiveX v1:跨厂商 EP 基准测试套件 / CollectiveX v1: 크로스 벤더 EP 벤치마크 스위트" (Jul 4, 2026)
Oseltamivir changed the title from "CollectiveX v1: cross-vendor EP benchmark suite / CollectiveX v1:跨厂商 EP 基准测试套件 / CollectiveX v1: 크로스 벤더 EP 벤치마크 스위트" to "Finalize CollectiveX v1 cross-vendor EP benchmark suite / 完成 CollectiveX v1 跨厂商 EP 基准测试套件" (Jul 4, 2026)
Oseltamivir force-pushed the collectivex branch 13 times, most recently from 63c2335 to 57efb35 (July 4, 2026 13:38)
Freeze the 38-shard cross-vendor EP matrix on one 32-warmup, 512-observation protocol. Add native correctness, closed provenance, three-allocation promotion gates, and an isolated content-addressed filesystem publisher.

Close defects exposed by rejected allocations: isolate AMD Enroot state; correct MoRI output shape and unweighted combine semantics; standardize activation-only combine across every adapter; stage pinned DeepEP sources before compute allocation; authenticate reusable build outputs; normalize Hybrid enum identity; query loaded NCCL/RCCL runtimes; and harden cleanup and failure classification.

Normalize inherited B300 source-root permissions. Keep DeepEP V2 on PR #605 while pinning the official PR #630 scale-up fix, publish a stable extension evidence label with the real binary hash, require the realized NCCL LSA team to cover the full EP world when GIN is disabled, and key the exact five-kernel JIT evidence by realized topology and device code-generation inputs.

中文:完成隔离式 CollectiveX v1 专家并行基准测试套件。固定 38 个分片的跨厂商矩阵,统一采用 32 次预热和 512 个观测值,并加入原生正确性校验、严格溯源、三次独立分配晋级门槛及本地内容寻址文件系统发布器。

修复已拒绝分配暴露的问题:隔离 AMD Enroot 状态;修正 MoRI 输出形状及无权重 combine 语义;统一所有 adapter 的 activation-only combine 边界;在计算节点分配前暂存固定版本的 DeepEP 源码;校验可复用构建产物;规范化 Hybrid 枚举身份;从实际加载的 NCCL/RCCL 运行库读取版本;同时强化清理和失败分类。

规范化 B300 共享目录继承的权限。DeepEP V2 保持 PR #605 实现,并固定使用官方 PR #630 的纯 scale-up 修复;以稳定标签记录 extension 证据,同时保留真实二进制哈希;禁用 GIN 时要求 NCCL 实际建立的 LSA team 覆盖整个 EP world;并使用实际拓扑及设备代码生成参数隔离五个预期 JIT kernel 的证据。