Skip to content

[NV] llm-d-vllm: Add llm-d to the InferenceX benchmarking framework#2050

Open
ezrasilvera wants to merge 24 commits into
mainfrom
llm-d-initial
Open

[NV] llm-d-vllm: Add llm-d to the InferenceX benchmarking framework#2050
ezrasilvera wants to merge 24 commits into
mainfrom
llm-d-initial

Conversation

@ezrasilvera

Copy link
Copy Markdown
Collaborator

Adding llm-d into InferenceX Starting with DeepSeek-V4-Pro FP4 GB200 disaggregated benchmarks (mid-curve + high-tpt)

Summary

Adds llm-d (vLLM) as a new benchmark framework in InferenceX, with a no-k8s multi-node path (Envoy + EPP + P/D sidecar over SLURM/pyxis), and theDeepSeek-V4-Pro FP4 GB200 P/D-disaggregated recipes:

  • dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe - 1P1D DEP8, conc 256/512/1024
  • dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe - 2P1D DEP8, conc 4096

Both keys share one engine/EPP recipe (benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-mid-curve-megamoe.yaml);
topology and concurrency come from the master-config keys, not the recipe.

Notable

  • These keys mirror dsv4-fp4-gb200-dynamo-vllm - same SKU, model, ISL/OSL, disagg TP1/DP8/EP8 layout, and the same concurrency points (256/512/1024/4096).
  • Emits the same agg_*.json rows through process_result.py (framework=llm-d-vllm, X=conc, Y=tput_per_gpu) and the same _gpus_/_ctx_/_gen_ result-filename convention the multinode ingest already parses. The only downstream addition is
    the framework label in InferenceX-app (separate PR).
  • Every reported point is a full 100%-completion sweep: (e.g. 12288/12288 at conc4096).
  • One recipe, two topologies: Mid-curve (1P1D) and high-tpt (2P1D) share a single engine/EPP recipe; node counts and concurrency live in the master-config keys - so adding a curve point is a config-key change, not a new recipe file.

Scope

  • GB200 DeepSeek-V4-Pro FP4 only. No changes to other SKUs/models/frameworks.

Known merge-blocker

  • The image: on both keys is a personal build (ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23). It should be republished to the org registry and the image: lines updated before merge.

Related

Before merge

  • Republish image to org registry; update image: on both keys.
  • Canonicalize perf-changelog.yaml pr-link from pull/XXX to this PR's number.
  • Full sweep green.

Add a new llm-d-vllm disaggregated (prefill/decode) multi-node benchmark
path to InferenceX, orchestrated entirely by SLURM (no Kubernetes). vLLM
engines with llm-d EPP routing, an Envoy front-end, and NIXL KV transfer.

- benchmarks/llm-d/: container image (Dockerfile on vLLM v0.23 base),
  Envoy + EPP config, binaries.env, and an extraction helper that
  decouples epp/sidecar/envoy from the vLLM image.
- benchmarks/multi_node/llm-d/: server.sh orchestrator (per-role TP/EP,
  prefill/decode coordinators, PREFILL_WORKERS multi-engine prefill),
  job.slurm (pyxis/enroot + docker paths, per-node model probe), submit.sh.
- runners/launch_gb200-nv.sh: dispatch branch for FRAMEWORK=llm-d-vllm.
- benchmarks/benchmark_lib.sh: anchor a relative lm_eval task yaml to the
  repo root so evals resolve inside the serving container.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Route dsv4 models through the vLLM tokenizer wrapper and allow
deepseek_v4 as a --tokenizer-mode choice, bypassing HF AutoTokenizer
which does not recognize the model on the benchmark image. Adds
--trust-remote-code / --use-chat-template / --dsv4 passthroughs.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
- benchmarks/multi_node/dsv4_fp4_gb200_llm-d-vllm.sh: GB200 wrapper
  (GPUS_PER_NODE=4) for the DSV4-Pro FP4 llm-d-vllm path.
- llm-d-recipes/dsv4-fp4-gb200-mid-curve-megamoe.yaml: 1P+1D DEP8 recipe.
- nvidia-master.yaml: register the mid-curve-megamoe (conc 256/512/1024)
  and high-tpt-megamoe (Option-B 2P+1D, conc 4096) config keys.
- perf-changelog.yaml: changelog entry for the two keys.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…diagnostics

The decode leader's prefill-readiness curl had no connect/max timeout, so a
blackholed prefill endpoint made a single curl block forever - the 5-min
deadline is only checked between calls, so it never fired. A 2P
(PREFILL_WORKERS=2) run wedged ~7h on this before being cancelled.

- Add --connect-timeout 5 --max-time 10 so an unreachable endpoint trips the
  5-min deadline and FATALs cleanly instead of hanging the whole run.
- Probe every prefill node (ALL_IPS[0:PREFILL_NODES]), not just IPS[0]:
  endpoints.yaml lists one prefill endpoint per node, and PREFILL_WORKERS>1
  spans multiple independent DP engines, so IPS[0] only covered engine 1.
- On failure, dump network diagnostics - source IP toward target,
  /proc/net/route on-link-vs-gateway subnet classification, bash /dev/tcp L4
  connect test, ping, verbose curl - to separate a server-not-ready problem
  from a subnet/routing/firewall problem. Log the subnet layout up front too.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Two teardown fixes so a failed/hung benchmark can no longer wedge the whole
SLURM allocation (a conc4096 2P run hung 8h until GitHub killed it at the
max-execution limit, losing the server-log tarball):

- server.sh coordinator: write the .bench_done release marker from an EXIT
  trap, so job.slurm scancels the allocation on ANY coordinator exit (failed
  bench, set -e abort, EPP/Envoy bring-up error), not only a clean finish.
- benchmark_lib.sh: optional `timeout` wrapper around the bench client, gated
  on BENCH_TIMEOUT_S (default 2400s, set by server.sh); a hung client (asyncio
  shutdown wedged on thousands of failed in-flight requests) is killed instead
  of pinning 24 GPUs to TIME_LIMIT. The bench call is also made non-fatal so a
  multi-conc sweep records every point it can and always reaches teardown.

Replace the leader-only .prom scraper with metrics_scrape.py, which scrapes
EVERY prefill and decode vLLM (on the vLLM port, not the sidecar) plus EPP into
one time-correlated CSV. This makes per-engine load visible - crucially whether
EPP fans out across all prefill engines - which the old PREFILL_LEADER_IP-only
scraper could not show for PREFILL_WORKERS>1.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
dsv4-fp4-gb200-llm-d-vllm-smoke-2p1d: two independent 1-node DEP4 prefill
engines (PREFILL_NODES=2 PREFILL_WORKERS=2) + 1 DEP4 decode = 3 nodes / 12 GPUs,
ISL/OSL 1024/128, single conc 128 x ~512 prompts, RANDOM_RANGE_RATIO=1.0. Runs
in minutes; validates the per-engine metrics scraper and confirms EPP fans out
across both prefill engines before spending the 6-node high-tpt run. Reuses the
mid-curve recipe.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
The 3-node DEP4 smoke (1 node/engine) OOMs at model load: DSV4-Pro's MegaMoE
needs each EP rank to hold 1/4 of the experts at EP4 (~2x the per-GPU weight of
the validated EP8), which does not fit in HBM (run 28588980717:
"CUDA Error: out of memory at cumem_allocator.cpp:139"). DEP8 (2 nodes/engine)
is the minimum viable EP. Smoke is now PREFILL_NODES=4 PREFILL_WORKERS=2 (2x
DEP8) + DECODE_NODES=2 = 6 nodes, same layout as high-tpt but a tiny workload
(conc 128), so it validates the scraper + EPP fan-out fast on the real topology.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
LLMD_LB_MODE (recipe env, default hybrid). multi-port switches both roles to
--data-parallel-multi-port-external-lb + --data-parallel-supervisor-port and
drops --api-server-count, so each DP rank serves on VLLM_PORT+rank (base+global
rank; a worker node binds VLLM_PORT+START_RANK..). The decode pd-sidecar gets
--data-parallel-size=$DP_SIZE, which per data_parallel.go makes it expose
SIDECAR_PORT+i -> vLLM VLLM_PORT+i; per node only its local rank ports work, so
endpoints.yaml lists each node's local rank ports (node j: base + j*GPUS_PER_NODE
.. +GPUS_PER_NODE). Health waits + the metrics scraper follow the same per-rank
ports (scraper hits the vLLM port, never the sidecar). Mirrors the upstream
GENERIC wide-ep-lws guide (DeepSeek-R1), which uses per-rank ports for P/D; our
DSV4 path was hybrid-lb (one port/node).

Recipe dsv4-fp4-gb200-megamoe-mp.yaml = mid-curve engine config + LLMD_LB_MODE
multi-port + EPP scorers matched to the upstream router config (drop
kv-cache-utilization-scorer; prefill = prefix-cache/queue/active-request,
decode = active-request only). Keys smoke-2p1d-mp (fast validation) and
high-tpt-megamoe-mp (conc4096 A/B vs the hybrid high-tpt key) select it. Hybrid
path is unchanged (default), so existing keys are byte-identical.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
mp-smoke 28596473295 showed every prefill node (leader AND worker) binds
VLLM_PORT+local_index (8200-8203), not base+global rank: --data-parallel-start-rank
only sets DP rank IDs, the HTTP servers are always the node-local range. My
base+global assumption made the health gate probe a dead worker port
(10.30.1.39:8204 not ready). Fix everywhere to base+local:
  - HEALTH_PORT / SIDECAR_HEALTH_PORT = VLLM_PORT / SIDECAR_PORT (local rank-0).
  - sidecar --data-parallel-size = DP_SIZE_LOCAL (each node exposes its 4 local
    ranks at SIDECAR_PORT..+3 -> vLLM VLLM_PORT..+3).
  - endpoints + scraper: each node lists base..base+GPUS_PER_NODE-1 (same range
    on every node, distinct address).
Also fix a double 'export export LLMD_LB_MODE' from overlapping edits.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
… bundle

At conc4096 both Dynamo (8153 tok/s/GPU) and upstream llm-d PR #1737 (8858)
succeed on the IDENTICAL 2P TP1/DP8/EP8 layout we use, while ours collapses to
8.3% completion. The difference is the engine bundle, so this recipe applies it:
prefill gpu-mem 0.95, max-model-len 9280, max-num-seqs 16, max-num-batched-tokens
32768, no-enable-flashinfer-autotune; decode gpu-mem 0.9, max-model-len 9280,
no-flashinfer-autotune. Same EPP/env/topology + kv producer-consumer + hybrid-lb
as mid-curve, to isolate the engine bundle. The max-model-len/no-autotune/
max-num-seqs guards are what let this 32768/0.95 bundle bring up without the two
OOMs our earlier speed-bundle hit. Keys tuned-smoke + tuned-high-tpt select it.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
The no-kube Envoy front door (benchmarks/llm-d/envoy.yaml) defined the epp
ext_proc cluster and the original_dst cluster with no circuit_breakers block,
so Envoy applied its DEFAULT per-cluster thresholds of max_requests=1024 and
max_pending_requests=1024. The ext_proc filter holds one long-lived HTTP/2
stream to EPP for the entire lifetime of every request, so at max-concurrency
4096 only 1024 active + 1024 pending = 2048 requests ever reached EPP; the
remaining ~18432 (warmups + prompts) got a fast 500 and never saw the engine.

This capped every high-tpt run at exactly 1024 completions / 2048 disagg
decisions, identically across engine configs (mid-curve, Dynamo-tuned) and LB
modes (hybrid, multi-port) - because the cap was the front door, not the
engine. It is why Dynamo (no Envoy) and upstream llm-d (k8s gateway) hit
8000+ tok/s/GPU on the same TP1/DP8/EP8 2P1D layout where we stalled at ~1783.

Fix: add circuit_breakers with thresholds far above max-concurrency to both
clusters. Evidence: envoy_access RESPONSE_CODE histogram = 2048x200 + 18432x500
in all three walled runs; disagg_decision_total pinned at 2048 in all three.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…0000

Align the circuit_breakers thresholds added in the previous commit to the exact
values from the authoritative upstream preset - the llm-d-router-standalone Helm
chart (llm-d-inference-scheduler config/charts/llm-d-router-standalone/values.yaml
lines 191-208): max_connections/max_pending_requests/max_requests = 40000 on both
the ext_proc (EPP) and original_dst clusters, plus max_retries 1024 on ext_proc.
Both clear max-concurrency 4096 with wide headroom; 40000 just matches upstream
rather than an arbitrary ceiling.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
… upstream)

The prior CB commit lifted the epp ext_proc cluster to 40000 but left the
original_dst cluster at 1000000. Both clear max-concurrency 4096 identically
(this is the exact config the 8452 tok/s/GPU high-tpt confirmation run used),
but pin original_dst to 40000 too so both clusters match each other and the
upstream llm-d-router-standalone preset.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…util + decode prefix-cache)

Align the mid-curve/high-tpt recipe's EPP scheduling profiles to the upstream
generic wide-ep-lws router config
(llm-d guides/wide-ep-lws/router/wide-ep-lws.values.yaml):
  prefill = prefix-cache(3) + queue(2) + active-request(2)
  decode  = active-request only

Removed: kv-cache-utilization-scorer from prefill (upstream does not use it; its
weight becomes active-request(2)); and prefix-cache-scorer from decode - at
weight 3 it outweighed active-request(2) and pinned decode to one pod on a
shared prompt prefix, producing the ~2:1 decode routing imbalance (decode-0 1389
vs decode-1 659). Upstream decode routes on active-request alone and balances.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…keys

Pre-PR cleanup of debug/experiment scaffolding added while root-causing the
conc4096 wall (now fixed via the Envoy circuit-breaker change). Removed:
- benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-megamoe-mp.yaml (multi-port
  LB experiment - dead: NIXL KV-expiry)
- benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-megamoe-tuned.yaml (Dynamo
  engine-bundle A/B - superseded; the circuit-breaker fix, not engine args, was
  the wall)
- nvidia-master.yaml keys: smoke-2p1d, smoke-2p1d-mp, high-tpt-megamoe-mp,
  tuned-smoke, tuned-high-tpt (experiment/smoke keys tied to the above)

Kept: mid-curve-megamoe + high-tpt-megamoe keys, the envoy circuit_breakers fix,
the teardown/timeout fix, and the DSV4 tokenizer plumbing. Deferred to a
follow-up (after the mid-curve rerun validates and its scraper CSV is analysed):
removing the per-engine vLLM metrics scraper and the multi-port branches in
server.sh.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…pper

- Remove metrics_scrape.py + its scraper bring-up/stop/conc-tag in server.sh
- Remove dead multi-port LB path (LLMD_LB_MODE and all branches); hybrid only
- Remove EPP --v=4 and NCCL/UCX transport debug logging (keep UCX_TLS pin)
- Rename dsv4_fp4_gb200_llm-d-vllm.sh -> ...-disagg.sh; update launcher SCRIPT_NAME

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Comment-only cleanup across the llm-d-vllm GB200 assets:
- recipe header rewritten to describe the common per-engine shape and the
  two topologies (mid-curve 1P1D / high-tpt 2P1D) it backs
- drop historical tuning-log comments from the recipe, launcher, envoy.yaml,
  binaries.env, and Dockerfile
Verified: no non-comment lines changed; envoy circuit_breakers (40000) intact.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Include dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe (2P1D DEP8, conc 4096)
alongside the mid-curve key so a labeled PR sweep validates both topologies,
not just mid-curve.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase As a PR reviewer and CODEOWNER, I have reviewed this and have.

For PR verification, add the full-sweep-enabled or full-sweep-fail-fast label to this PR — the benchmark sweep only runs on labeled PRs.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs


感谢你的贡献!请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md,然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人,你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板,包括保留英文语句 As a PR reviewer and CODEOWNER, I have reviewed this and have

如需进行 PR 验证,请为此 PR 添加 full-sweep-enabledfull-sweep-fail-fast 标签 — 基准测试 sweep 仅在带有标签的 PR 上运行。

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档

2 similar comments
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase As a PR reviewer and CODEOWNER, I have reviewed this and have.

For PR verification, add the full-sweep-enabled or full-sweep-fail-fast label to this PR — the benchmark sweep only runs on labeled PRs.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs


感谢你的贡献!请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md,然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人,你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板,包括保留英文语句 As a PR reviewer and CODEOWNER, I have reviewed this and have

如需进行 PR 验证,请为此 PR 添加 full-sweep-enabledfull-sweep-fail-fast 标签 — 基准测试 sweep 仅在带有标签的 PR 上运行。

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase As a PR reviewer and CODEOWNER, I have reviewed this and have.

For PR verification, add the full-sweep-enabled or full-sweep-fail-fast label to this PR — the benchmark sweep only runs on labeled PRs.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs


感谢你的贡献!请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md,然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人,你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板,包括保留英文语句 As a PR reviewer and CODEOWNER, I have reviewed this and have

如需进行 PR 验证,请为此 PR 添加 full-sweep-enabledfull-sweep-fail-fast 标签 — 基准测试 sweep 仅在带有标签的 PR 上运行。

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档

Comment thread benchmarks/llm-d/binaries.env Outdated
Comment on lines +10 to +12
#
# This is currently not used as we build an use an image that contains
# all binaries

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The comment on lines 10-12 says "This is currently not used as we build an use an image that contains all binaries," but benchmarks/multi_node/llm-d/job.slurm (lines 236-241 of the pyxis branch) explicitly sources this file and uses LLMD_BIN_DIR to conditionally bind-mount epp/pd-sidecar/envoy — so the file IS wired into the runtime as an optional override. Also a typo: "build an use" → "build and use". Nit — no runtime impact, but the stale comment will mislead maintainers into thinking they can delete the file.

Extended reasoning...

What the comment claims vs. what the code does

Lines 10-12 of benchmarks/llm-d/binaries.env state:

This is currently not used as we build an use an image that contains all binaries

However, the pyxis branch of benchmarks/multi_node/llm-d/job.slurm explicitly sources this file and consumes LLMD_BIN_DIR to build optional bind-mounts:

[[ -f "${DI_REPO_DIR}/benchmarks/llm-d/binaries.env" ]] && \
    source "${DI_REPO_DIR}/benchmarks/llm-d/binaries.env"
for _bin in epp pd-sidecar envoy; do
    if [[ -n "${LLMD_BIN_DIR:-}" && -x "${LLMD_BIN_DIR}/${_bin}" ]]; then
        PYXIS_MOUNTS+=",${LLMD_BIN_DIR}/${_bin}:/usr/local/bin/${_bin}:ro"
        ...
    fi
done

So binaries.env is wired into the runtime — it acts as the single source of truth for LLMD_BIN_DIR (defaulted to /mnt/lustre01/users-public/sa-shared/llm-d-bins), which the pyxis path uses to swap in mounted binaries and skip a combined-image rebuild.

Step-by-step proof

  1. Author enables the pyxis path (the default on GB200 via LLMD_CONTAINER_ENGINE=pyxis set in runners/launch_gb200-nv.sh).
  2. job.slurm reaches the pyxis branch and sources binaries.env, populating LLMD_BIN_DIR.
  3. If a user has run extract-binaries.sh (which also sources this file), the three binaries exist under $LLMD_BIN_DIR and get bind-mounted into every container.
  4. This is the entire point of the mounted-binary path documented in extract-binaries.sh and the Dockerfile — running a stock vllm/vllm-openai image without rebuilding the combined image.

The comment therefore contradicts both the sibling files (extract-binaries.sh, Dockerfile, envoy.yaml docstrings) that describe binaries.env as the single source of truth, and job.slurm itself.

Addressing the counter-argument

One could argue the comment is "arguably accurate for the DEFAULT deployment path" — in the current recipe the combined image bakes binaries in, so the -x check makes the mount loop a no-op unless the operator has explicitly run extract-binaries.sh. That's a fair characterization of when the file has effect, but "currently not used" is stronger than "currently a no-op by default" and it flatly contradicts the sourcing on job.slurm:236. A future maintainer trying to trim "unused" scaffolding would delete this file and quietly break the mounted-binary path.

Impact and fix

No runtime impact — the code works today. The concern is purely documentation drift in a new file this PR is introducing. Fix in one edit:

# This file is sourced by benchmarks/multi_node/llm-d/job.slurm to
# optionally bind-mount extracted binaries in place of the combined
# image's baked-in copies (see extract-binaries.sh). It is a no-op
# by default (the -x check fails when the binaries have not been
# extracted); the Dockerfile also reads the same URLs when building
# and using an image that contains all binaries.

Also fix the "build an use" → "build and use" typo while you're in there.

Comment thread benchmarks/multi_node/llm-d/README.md Outdated
Comment on lines +80 to +95
### Why one entry per *leader* (not per node)

In the wide-EP guide each instance is a single vLLM engine that spans
multiple nodes via `--data-parallel-hybrid-lb`. With hybrid-lb, the
leader pod (`LWS_WORKER_INDEX=0`) accepts external traffic and
distributes it internally across the local DP ranks; in our LWS-free
SLURM mapping, the prefill-leader and decode-leader are the only nodes
addressable from outside. Adding an entry per worker would cause EPP to
route directly to a worker, bypassing the engine's internal load
balancing.

If we later want to expose all pods of an instance (the alternative
hybrid-lb interpretation: external LB across nodes too), we can extend
the loop in `server.sh` to emit one entry per `IPS[i]` in the prefill
range and one per `IPS[i]` in the decode range, all carrying the same
role label. EPP then load-balances across them via `random-picker`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The README section "Why one entry per leader (not per node)" (lines 80-89) plus the example endpoints.yaml at lines 57-69 both document per-leader emission, but server.sh add_role() (lines 334-343) actually emits ONE endpoint per node — its own comment says "ONE endpoint per node at base_port (hybrid LB: ...)". The lines 91-95 "future work" section describing per-node emission as hypothetical has already been implemented. Docs-only staleness (per-node is the intended behavior — see server.sh:289-291 and its consistency with the multi-node coordinator health poll in lines 519-535), but future maintainers may misdiagnose whether current behavior is intentional. Please update the README to reflect actual per-node emission, or add a note explaining why the leader-only text was preserved.

Extended reasoning...

What the bug is

The new file benchmarks/multi_node/llm-d/README.md (added by this PR) describes the endpoints.yaml file-discovery contract as emitting one entry per leader. Concretely:

  • Line 55: "…writes /tmp/endpoints.yaml inside the container with one entry per leader"
  • Lines 57-69: the example YAML shows only prefill-0 (PREFILL_LEADER_IP) and decode-0 (DECODE_LEADER_IP)
  • Lines 80-89: an entire section titled "Why one entry per leader (not per node)" justifies leader-only emission and warns: "Adding an entry per worker would cause EPP to route directly to a worker, bypassing the engine's internal load balancing"
  • Lines 91-95: describes per-node emission as a hypothetical future extension ("we can extend the loop … to emit one entry per IPS[i]")

But the code at benchmarks/multi_node/llm-d/server.sh:334-348 (the add_role() block that writes endpoints.yaml) does exactly the opposite:

def add_role(role, ips, base_port):
    # ONE endpoint per node at base_port (hybrid LB: the node's api-server /
    # sidecar internally load-balances its local DP ranks).
    for i, ip in enumerate(ips):
        endpoints.append({'name': f'{role}-{i}', ...})

add_role('prefill', prefill_ips, VLLM_PORT)     # prefill_ips = all_ips[:pn]
add_role('decode', decode_ips, SIDECAR_PORT)    # decode_ips = all_ips[pn:pn + dn]

The inline comment at server.sh:339-340 literally says the opposite of the README section header: "ONE endpoint per node at base_port".

Step-by-step proof

Take the new high-tpt 2P+1D key (dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe, configs/nvidia-master.yaml). Topology: PREFILL_NODES=4, DECODE_NODES=2, PREFILL_WORKERS=2.

  1. submit.sh sbatches 6 nodes. job.slurm collects their IPs in rank order into ALL_IPS = [P0,P1,P2,P3,D0,D1].
  2. On the decode leader, server.sh runs the coordinator Python block:
    • prefill_ips = all_ips[:4] = [P0, P1, P2, P3]
    • decode_ips = all_ips[4:6] = [D0, D1]
  3. add_role('prefill', prefill_ips, 8200) iterates 4 times → emits prefill-0..prefill-3.
  4. add_role('decode', decode_ips, 8000) iterates 2 times → emits decode-0..decode-1.
  5. Final endpoints.yaml has 6 entries — not the 2 (prefill-0 + decode-0) the README example shows.

For the 1P+1D mid-curve key it emits 1 prefill + 2 decode entries (still per-node, not per-leader).

Why this is intentional (i.e. the code is right, docs are stale)

Several signals confirm per-node emission is the deliberate design, not a bug:

  1. server.sh:289-291 comment explicitly justifies it: "Each decode node runs its own sidecar (SIDECAR_PORT -> local decode vLLM), and endpoints.yaml lists one decode endpoint per node so EPP fans out across all decode ranks."
  2. server.sh:519-535 (the decode coordinator's cross-node prefill health poll) iterates over _prefill_ips=( "${_ALL_IPS[@]:0:${PREFILL_NODES}}" ) — polling every prefill node, not just the leader. That poll is only meaningful if all prefill nodes are individually addressable endpoints.
  3. --api-server-count is set on every EP-enabled node (server.sh:236-240), plus --data-parallel-hybrid-lb — meaning each node has its own api-server binding, so per-node endpoints are actually the correct interpretation of hybrid-lb.
  4. PREFILL_WORKERS=2 in the high-tpt key requires per-node emission: with 2 independent DP engines across 4 prefill nodes, per-leader emission would only expose one of the two engines.

Impact

Zero runtime impact — code is self-consistent and produces the intended routing behavior. But a future maintainer reading the README's "Why one entry per leader (not per node)" section will:

  • Believe the current behavior is per-leader when it isn't.
  • Reference the warning about "EPP would bypass internal load balancing" and (incorrectly) conclude the current implementation is broken.
  • Not realize the "future work" paragraph at lines 91-95 has already been implemented.

Suggested fix

Update benchmarks/multi_node/llm-d/README.md:

  • Line 55: change "one entry per leader" → "one entry per node".
  • Lines 57-69: expand the example YAML to show all prefill and decode nodes (e.g. prefill-0..prefill-N-1, decode-0..decode-M-1).
  • Lines 80-89: rewrite the "Why one entry per leader" section to explain why per-node is used (hybrid-lb + independent api-servers + PREFILL_WORKERS multi-engine).
  • Lines 91-95: delete or rewrite the "future work" paragraph since per-node emission is what the code already does. The leader-only mode could be described here as the hypothetical alternative if desired.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d7e4180 - README now documents one endpoint per node (not per leader) to match server.sh add_role(), with corrected example ports (prefill 8200 / decode 8000) and a rewritten rationale section.

Comment on lines +9078 to +9080
# Build source: benchmarks/llm-d/Dockerfile.
dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe:
image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Both new keys (mid-curve line 9080, high-tpt line 9125) pin image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23, a personal ghcr namespace rather than the SemiAnalysisAI org registry. The PR description's "Before merge" checklist and the in-file comment at 9075-9077 both flag this — surfacing here as a merge-time reminder so it isn't missed. Republish to ghcr.io/semianalysisai/... and update both image: lines before merge.

Extended reasoning...

What the bug is

Both new master-config keys added by this PR pin their engine image to a personal ghcr namespace:

  • configs/nvidia-master.yaml:9080 (dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe): image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23
  • configs/nvidia-master.yaml:9125 (dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe): image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23

The PR author has explicitly flagged this in two places already: the PR description's "Known merge-blocker" section ("The image: on both keys is a personal build... It should be republished to the org registry and the image: lines updated before merge") and an in-file comment at lines 9075-9077 ("PR follow-up (merge-blocking): the image: on both llm-d-vllm keys below (mid-curve + high-tpt) is a personal build. Republish it to the org registry (ghcr.io/semianalysisai/...) and update these image: lines before merge").

Why it matters

Reproducibility. On the GB200 pyxis path, runners/launch_gb200-nv.sh derives SQUASH_FILE="${SQUASH_DIR}/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" — so the enroot-import cache key is a direct function of the image string. Any historical rerun (e.g., re-ingesting a failed sweep, reproducing a curve point months later) resolves the image via that exact URL. A personal ghcr namespace can be retagged, deleted, or made private without notice by the individual owner, silently breaking reruns.

Step-by-step proof

  1. configs/nvidia-master.yaml:9080 sets image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23 on the mid-curve key.
  2. configs/nvidia-master.yaml:9125 sets the same image on the high-tpt key.
  3. runners/launch_gb200-nv.sh (the llm-d-vllm branch, top of file) reads $IMAGE and constructs ENROOT_URL="docker://ghcr.io#ezrasilvera/llm-d-nokube-vllm:vllm0.23" for enroot import.
  4. enroot import authenticates and pulls against ezrasilvera's personal ghcr namespace on every fresh cluster / cache miss.
  5. If ezrasilvera retags/deletes/private-marks this tag, step 4 fails; the sweep can no longer be reproduced.

Addressing the refutation

One verifier argued this duplicates the author's own annotations (PR description merge-blocker + in-file comment) and should be dropped as "no new information." That's a fair point, and it's why I'm filing this as nit rather than normal — I'm not claiming the author is unaware. But surfacing it in the code-review pass provides a merge-time reminder against the actual diff (as opposed to a checklist item in the description that can be overlooked when the reviewer clicks Merge). If this shows up alongside other findings in the same review, it's cheap to satisfy; if the maintainer knows the image swap is already in flight, the reminder can be dismissed in a second.

Fix

Republish the combined image (built from benchmarks/llm-d/Dockerfile) to ghcr.io/semianalysisai/ and update both image: lines. The enroot cache key changes with the image string, so this is a valid config change to land in this PR rather than a follow-up — landing it here keeps the initial ingest and all historical reruns pinned to the org-owned image.

Comment thread perf-changelog.yaml Outdated
Comment on lines +4441 to +4444
- "Add DeepSeek-V4-Pro FP4 GB200 llm-d-vllm disaggregated multinode benchmarks"
- "Mid-curve 1P+1D DEP8 (conc 256/512/1024)"
- "High-tpt 2P+1D DEP8 (conc 4096)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new perf-changelog entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (literal XXX placeholder at perf-changelog.yaml:4444). Every other entry in this file resolves to a real PR number — this one alone will 404 if merged as-is. Fix by replacing XXX with 2050 before merge. (The author already listed this under 'Before merge' in the PR description, so filing as a friendly nit rather than a blocker.)

Extended reasoning...

What the bug is. The perf-changelog entry appended by this PR (perf-changelog.yaml:4441-4444) carries a literal pull/XXX placeholder rather than the real PR number:

- config-keys:
    - dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe
    - dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe
  description:
    - "Add DeepSeek-V4-Pro FP4 GB200 llm-d-vllm disaggregated multinode benchmarks"
    - "Mid-curve 1P+1D DEP8 (conc 256/512/1024)"
    - "High-tpt 2P+1D DEP8 (conc 4096)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

Why the current diff is inconsistent. Every other entry in perf-changelog.yaml resolves to a real PR (the immediately preceding entry ends with pull/2001; earlier entries reference pull/1990, pull/1978, etc.). The pr-link contract of this file is to trace which PR introduced which benchmark set — a literal XXX breaks that contract.

Impact if merged as-is. Clicking the link in the changelog will hit https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which resolves to no PR (GitHub returns 404 for numeric paths that don't exist, or worse silently redirects). No functionality is affected — the sweep still runs, the config keys still work — but the diagnostic "which PR introduced this row?" trail is broken for the two new dsv4-fp4-gb200-llm-d-vllm keys.

Why existing checks don't catch it. There is no linter validating perf-changelog pr-link values against real PR numbers, so nothing stops the placeholder from shipping. The PR description's 'Before merge' checklist explicitly lists this: 'Canonicalize perf-changelog.yaml pr-link from pull/XXX to this PR's number' — the author has already flagged it as a known merge-blocker to canonicalize.

Step-by-step proof.

  1. Open perf-changelog.yaml at line 4444 → observe pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  2. grep -c 'pull/XXX' perf-changelog.yaml → 1 (only this new entry).
  3. grep -oE 'pull/[0-9]+' perf-changelog.yaml | tail -3 → all resolve to real numeric PRs (2001, 1990, 1978).
  4. Attempt https://github.com/SemiAnalysisAI/InferenceX/pull/XXX in a browser → GitHub 404.
  5. Expected value: pull/2050 (this PR's number, per the PR metadata).

Fix. One-character-scope edit: replace XXX with 2050 on perf-changelog.yaml:4444. No other files need to change.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d7e4180 - pr-link canonicalized from pull/XXX to pull/2050.

- llm-d/README.md: document endpoints.yaml as one entry per node (not per
  leader) to match server.sh add_role(); fix example ports (prefill 8200,
  decode 8000) and rewrite the rationale section (C2)
- perf-changelog.yaml: canonicalize pr-link XXX -> 2050 (C4)

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

…ent (C1)

Bump the EPP and P/D routing-sidecar source images from the moving
-dev:main tags to release :v0.9.0 in both binaries.env and the
combined-image Dockerfile, so a rebuild is reproducible. Verified both
tags ship linux/arm64 and still expose /app/epp and /app/pd-sidecar
(statically-linked aarch64). Envoy pin unchanged.

Also reword the binaries.env header (PR #2050 review C1): the old
comment claimed the file is 'not used', but job.slurm still sources it
and conditionally bind-mounts the binaries via LLMD_BIN_DIR. Clarify
it is an optional mounted-binary override, a no-op by default, and fix
the 'build an use' -> 'build and use' typo.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

…ild-arg

BuildKit rejects `COPY --from=${VAR}` with 'variable expansion is not
supported for --from'. Promote VLLM_BASE and the three source-image ARGs
(EPP/sidecar/envoy) to global scope, bind each to a named FROM stage
(epp_src / sidecar_src / envoy_src), and COPY --from those stage aliases.
No change to defaults or --build-arg override behavior; still in sync
with binaries.env.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…sses

The eval-only jobs failed 'eval metadata concurrency does not match
workflow request' even though gsm8k accuracy passed (0.95 >= 0.91). Cause:
run_eval/append_lm_eval_summary read EVAL_CONCURRENT_REQUESTS and CONC
(benchmark_lib.sh), never EVAL_CONC, so with CONC empty meta_env.json
recorded conc=1 while validate_scores.py --expected-concs checked 4096.

Mirror the AMD multi-node servers: before run_eval, set
EVAL_CONCURRENT_REQUESTS from EVAL_CONC (fallback: max of the x-delimited
BENCH_MAX_CONCURRENCY) and export CONC so the stamped concurrency matches
the workflow request.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Add dsv4-fp4-gb200-llm-d-vllm-max-tpt-megamoe: 3 prefill (DEP8) + 1 decode
(DEP8) at conc=4096, mirroring NVIDIA/srt-slurm disagg-gb200-max-tpt-megamoe
(engine-identical to high-tpt; only prefill scale-out 2->3), so it reuses the
same CONFIG_FILE. Validated standalone via e2e run 28711505502 at 8797
tok/s/GPU (median TTFT 58 ms) - above high-tpt's 8241 tok/s/GPU, since the
extra prefill instance relieves the conc=4096 prefill bottleneck.

Adds a matching perf-changelog entry under PR #2050.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Move the max-tpt config-key to the top of the changelog entry so it
dispatches first in the serialized multi-node matrix. The prior sweep
(28714892207) had max-tpt land last; a transient decode-node slowdown
degraded that point (4.5x TPOT vs high-tpt on an identical decode
config). Running it first surfaces its result early and re-tests the
transient before the shorter mid/high legs finish.

Ordering only - all three points still run and land on the graph.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant