[AMD] MiniMax-M3 MXFP8 MI355X vLLM: nightly + AITER-on TP4 + emulatin linear / MiniMax-M3 MXFP8 MI355X vLLM：升级 nightly + 启用 AITER TP4 + emulation linear by hongxiayang · Pull Request #2003 · SemiAnalysisAI/InferenceX

hongxiayang · 2026-07-03T14:41:09Z

Bump minimaxm3-fp8-mi355x-vllm to nightly-09663abde
enable AITER for TP-only explicitly via --moe-backend aiter,
use --linear-backend emulation (beats stock native MXFP8 linear),
--max-num-batched-tokens 32768, and
a single TP4 conc 1-512 sweep for 1k1k and 8k1k.

Thanks for @chunfangamd to help inferenceX setup

中文说明

将 minimaxm3-fp8-mi355x-vllm 镜像升级至最新 nightly，显式启用 AITER（--moe-backend aiter），使用 --linear-backend emulation（优于原生 MXFP8 linear），设置 --max-num-batched-tokens 32768，并配置 TP4 单节点 1k1k/8k1k 并发 1-512 扫描。

github-actions · 2026-07-03T14:41:19Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

感谢你的贡献！对于 vLLM 与 SGLang，请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致，请先创建一个 PR，之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准，使整个 ML 社区都能从你的辛勤工作中受益！谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动（flake），重新运行失败的任务即可解决。如果选择重新运行失败的任务，PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档：https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言，PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准，然后再请求核心维护者审阅。

如需更多帮助，PR 作者可通过 Slack 联系核心维护者。

hongxiayang · 2026-07-03T20:13:23Z

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/28667371208

claude · 2026-07-03T20:13:36Z

Claude finished @hongxiayang's task in 6m 3s —— View job

Review complete

Gather context (PR diff, changed files)
Review benchmark script changes (verified --moe-backend aiter and --linear-backend emulation are valid backend choices in the vLLM nightly at commit 09663abde)
Validate master config & perf-changelog rules (changelog entry present, appended at end of file, image is public Docker Hub)
Post review feedback

Summary: One 🔴 blocking issue — the new perf-changelog entry's pr-link is a placeholder pull/XXX and should be pull/2003 (inline suggestion posted). One 🟡 warning — the AITER master-switch gate was inverted so the (now-unused) EP branch sets VLLM_ROCM_USE_AITER=0 while --moe-backend aiter is hardcoded, an untested combination if an ep: search-space entry is ever re-added; please confirm intent or note it in the script comment. Everything else checks out: config sweep simplification matches the changelog, the image is publicly accessible, and the serve command follows the repo's formatting and EP-conditional patterns.

claude · 2026-07-03T20:19:25Z

+    - "Add --max-num-batched-tokens 32768 (env MAX_NUM_BATCHED_TOKENS) to enlarge the per-step prefill budget and improve TP4 throughput at high concurrency"
+    - "Enable the AITER master switch for TP-only (no-EP) runs via --moe-backend aiter: the earlier degenerate-output issue that forced it off for TP-only is fixed by vllm-project/vllm#47158, so TP4 uses the AITER_MXFP8 MoE path (verified GSM8K 0.9613 flex / 0.9621 strict on this nightly)"
+    - "Simplify both search spaces to a single TP4 conc 1-512 sweep for 1k1k and 8k1k (drop TP8 and TP4/EP4: TP8 has poor throughput/GPU and plain TP4 matches or beats TP4/EP4 at high concurrency)"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX


🔴 BLOCKING: pr-link is a placeholder (pull/XXX) instead of the actual PR number.
Why it matters: The perf-changelog is the record tying config/perf changes to their PRs; every other entry has a real link, and this one will be a dead link once merged.
Fix:

Suggested change

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/2003

claude · 2026-07-03T20:19:31Z

+# https://github.com/vllm-project/vllm/pull/47158 fix this
 if printf '%s\n' "${PARALLEL_ARGS[@]}" | grep -qxF -- '--enable-expert-parallel'; then
-    export VLLM_ROCM_USE_AITER=1
-else
    export VLLM_ROCM_USE_AITER=0
+else
+    export VLLM_ROCM_USE_AITER=1
 fi


🟡 WARNING: The EP branch now sets VLLM_ROCM_USE_AITER=0 while --moe-backend aiter is hardcoded unconditionally on the serve command (line 73) — the exact inverse of the previously validated EP configuration (which relied on VLLM_ROCM_USE_AITER=1 with the aiter fused-MoE path auto-selected).
Why it matters: No current search-space entry uses EP (this PR drops tp4/ep4), so nothing breaks in CI today. But the script still accepts EP_SIZE/DP_ATTENTION from configs, and anyone re-adding an ep: entry (which existed until this PR) will silently run an untested combination: forced aiter MoE backend with the AITER master switch off. Per vLLM's rocm_aiter_ops, is_fused_moe_enabled() requires VLLM_ROCM_USE_AITER=1, so master-switch-gated aiter paths (shared-experts fusion, attention/norm ops) behave differently from what was validated. If EP is intentionally unsupported now, please say so in the comment; the current # vllm-project/vllm#47158 fix this comment doesn't explain why EP flips the switch off.

claude · 2026-07-03T20:20:45Z

+# https://github.com/vllm-project/vllm/pull/47158 fix this
 if printf '%s\n' "${PARALLEL_ARGS[@]}" | grep -qxF -- '--enable-expert-parallel'; then
-    export VLLM_ROCM_USE_AITER=1
-else
    export VLLM_ROCM_USE_AITER=0
+else
+    export VLLM_ROCM_USE_AITER=1
 fi

+# Larger per-step prefill token budget to improve TP4 throughput at high


🟡 The EP branch of the AITER master-switch gate silently flipped from VLLM_ROCM_USE_AITER=1 to 0 with no explanation (only the TP-only flip is justified by #47158 in the PR description). The current search space is TP4-only so the EP branch is dead code and there is no CI impact, but this is a latent trap for future EP re-introduction and the sibling minimaxm3_fp8_mi355x_mtp.sh still carries the original EP=1/TP=0 gate with the detailed rationale comment, so the two scripts now silently disagree on AITER gating policy. Suggest either deleting the now-dead EP branch here or restoring EP=1 with a comment explaining the divergence; the replacement comment '# vllm-project/vllm#47158 fix this' is also ungrammatical and no longer explains WHY EP gets 0.

Extended reasoning...

Nit / maintainability — no CI impact today.

The AITER master-switch conditional at benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh:54-58 was flipped on both branches by this PR:

Branch Before After

EP-on VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER=0

TP-only VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER=1

The PR description and the new perf-changelog entry justify only the TP-only flip — vllm-project/vllm#47158 fixed the earlier degenerate-output issue that had forced AITER off for TP-only. The EP branch flip from 1→0 has no explanation, and the replacement comment # https://github.com/vllm-project/vllm/pull/47158 fix this is ungrammatical and only speaks to the TP-only side.

Why this is only a nit, not a bug: the refutation is correct that the current sweep never exercises the EP branch. configs/amd-master.yaml:2487-2494 was simplified to a single { tp: 4, conc-start: 1, conc-end: 512 } for both 1k1k and 8k1k, with no ep field and no dp-attn. So both DP_ATTENTION and EP_SIZE > 1 are false at runtime, --enable-expert-parallel never lands in PARALLEL_ARGS, and every sweep row falls through the else branch (VLLM_ROCM_USE_AITER=1). Additionally, --moe-backend aiter is now set unconditionally on the vllm serve command line, so the AITER MoE path is forced regardless of the master switch value.

Why it is still worth flagging:

The EP conditional is still physically present in the script (lines 55-57). Any future re-introduction of an EP row to the search space, or a manual EP_SIZE=4/DP_ATTENTION=true run, would silently take the regressed EP=0 setting instead of the previously validated EP=1.

The sibling benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x_mtp.sh still carries the original logic (EP=1, else=0) with the detailed rationale comment intact at lines 88-96. The MTP recipe is on the older nightly-4559c43a image and has not yet moved to the nightly-09663abde that carries #47158's fix — but the divergence is now silent (no comment explains why the base recipe gates AITER differently on EP), which makes future maintenance harder.

The new comment is ungrammatical (fix this) and no longer documents why EP gets AITER=0. The old comment explicitly named the invariant: with EP the fused AITER MoE path is the auto-selected backend, and with TP-only AITER produced degenerate MiniMax-M3 output.

Step-by-step proof this is latent-only today:

configs/amd-master.yaml:2487-2494 — search space is - { tp: 4, conc-start: 1, conc-end: 512 } with no ep/dp-attn field.

In minimaxm3_fp8_mi355x.sh, DP_ATTENTION is unset → the if [ "$DP_ATTENTION" = "true" ] branch is false. EP_SIZE defaults to 1 → the elif [ "$EP_SIZE" -gt 1 ] branch is false. PARALLEL_ARGS = (--tensor-parallel-size 4).

printf ... | grep -qxF -- --enable-expert-parallel → no match → else → VLLM_ROCM_USE_AITER=1.

So all TP4 rows get AITER=1 as intended; EP=0 branch is unreachable this sweep.

If a future PR re-adds { tp: 4, ep: 4, ... } (which existed in the pre-PR search space!), that row would now set VLLM_ROCM_USE_AITER=0 — different from the pre-PR behavior for the same config, with no comment to warn the maintainer.

Suggested fix (pick one):

Delete the now-dead EP branch entirely: export VLLM_ROCM_USE_AITER=1 unconditionally, since the sweep no longer includes EP configs and #47158 makes AITER=1 safe for TP-only.

Or restore VLLM_ROCM_USE_AITER=1 on the EP branch and update the comment to explain that both branches now use AITER post-#47158 (matching the sibling MTP script's rationale, which will presumably follow when it upgrades to the new nightly).

Either fixes the latent trap and the stale/ungrammatical comment together.

github-actions · 2026-07-03T22:28:46Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28683732009
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28683732009

github-actions · 2026-07-03T23:49:27Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28686016620
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28686016620

…n linear Bump minimaxm3-fp8-mi355x-vllm to nightly-09663abde..., enable AITER for TP-only (vllm#47158 fix) via --moe-backend aiter, use --linear-backend emulation (beats stock native MXFP8 linear), --max-num-batched-tokens 32768, and a single TP4 conc 1-512 sweep for 1k1k and 8k1k. Co-authored-by: Cursor <cursoragent@cursor.com>

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

…ecipe Match minimaxm3_fp8_mi355x.sh: add --moe-backend aiter, --linear-backend emulation, and --max-num-batched-tokens 32768 to the EAGLE3 MTP recipe (keeping --speculative-config and the EAGLE3 in-place patch). Make DRAFT_MODEL env-overridable for local testing. Verified locally on nightly-09663abde (TP4 conc512, local eagle3 draft): 5120/5120 completed, eagle3 patch a no-op (nightly natively supports SupportsEagle3). Co-authored-by: Cursor <cursoragent@cursor.com>

Make the ported MTP recipe changes take effect: bump the minimaxm3-fp8-mi355x-vllm-mtp image to nightly-09663abde... and simplify its search space to a single TP4 conc 1-512 sweep (drop TP8/EP layouts, matching the non-MTP config). Add the changelog entry. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-07-04T04:23:41Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28690689417
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28690689417

hongxiayang · 2026-07-04T04:40:46Z

@functionstackx error again Error: File was unable to be removed Error: EACCES: permission denied, rmdir '/it-share/gharunners2/gharunner06/actions-runner/_work/InferenceX/InferenceX/benchmark_logs/logs/slurm_job-32929'

functionstackx · 2026-07-04T04:50:19Z

@hongxiayang here is the claude command to fix it. this error happens when someone from AMD runs an github workflow that generates files as ROOT

https://github.com/SemiAnalysisAI/InferenceX/blob/main/.claude/commands/clean-amd-mi355-runner-root-files.md

functionstackx · 2026-07-04T04:55:08Z

@hongxiayang The root-owned files on gharunner06 (/it-share/gharunners2/gharunner06/actions-runner/_work/InferenceX/InferenceX/benchmark_logs/logs/slurm_job-32929) have been cleaned up — I ran the /clean-amd-mi355-runner-root-files Claude command to scan and verify all MI355X TW runners. The scan came back clean (zero root-owned files across all gharunners*/gharunner* workspaces), so the offending files were already removed since the last failure. I've rerun the failed 8k1k jobs from run 28690689417.

Root cause reminder: multi-node disagg benchmarks submit Slurm jobs whose containers write logs as root into the runner workspace (benchmark_logs/logs/slurm_job-*). When a job is cancelled, teardown is skipped and root-owned dirs get stranded — the runner user can't delete them, which breaks actions/checkout workspace cleanup for every subsequent job on that runner.

Could you please advise your coworkers (cc @billishyahao @seungrokj @YukioZzz) to avoid generating files as root in the runner workspace? If Slurm containers must run as root, the cleanup/teardown script should chown or rm those files before the job exits. This would prevent the EACCES errors from recurring.

functionstackx · 2026-07-04T04:57:44Z

@hongxiayang adding as an reminder to CONTRIBUTING.md see #2043

Never write as root into the runner workspace — use /tmp or a dedicated staging path instead.
If root writes are unavoidable, add a trap cleanup EXIT that removes root-owned files before the job exits.
Test by cancelling a running benchmark mid-flight and verifying no root files remain.

One stranded root-owned directory bricks every subsequent job on that runner and blocks the entire AMD MI355X sweep queue for all contributors.

github-actions · 2026-07-04T08:13:22Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28690689417
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28690689417

hongxiayang · 2026-07-04T14:34:27Z

@functionstackx Thank you for help and resolve the issue. Can this PR be merged?

github-project-automation Bot added this to InferenceMAX Board Jul 3, 2026

hongxiayang marked this pull request as ready for review July 3, 2026 20:13

hongxiayang requested a review from a team July 3, 2026 20:13

hongxiayang requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners July 3, 2026 20:13

claude Bot reviewed Jul 3, 2026

View reviewed changes

hongxiayang added the full-sweep-enabled label Jul 3, 2026

hongxiayang force-pushed the hy/mm3-mxfp8-configs branch from 17583d7 to e0974d5 Compare July 3, 2026 22:28

hongxiayang and others added 4 commits July 4, 2026 01:24

address feedback

8699b80

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

hongxiayang force-pushed the hy/mm3-mxfp8-configs branch from e0974d5 to c696685 Compare July 4, 2026 01:25

functionstackx changed the title ~~[AMD] MiniMax-M3 MXFP8 MI355X vLLM: nightly + AITER-on TP4 + emulatin linear~~ [AMD] MiniMax-M3 MXFP8 MI355X vLLM: nightly + AITER-on TP4 + emulatin linear / MiniMax-M3 MXFP8 MI355X vLLM：升级 nightly + 启用 AITER TP4 + emulation linear Jul 4, 2026

functionstackx mentioned this pull request Jul 4, 2026

[Klaud Cold] CONTRIBUTING: warn against root-owned files on AMD runners / 贡献指南：警告勿在 AMD runner 工作区留下 root 所属文件 #2043

Merged

2 tasks

	pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
	pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/2003

Branch	Before	After
EP-on	`VLLM_ROCM_USE_AITER=1`	`VLLM_ROCM_USE_AITER=0`
TP-only	`VLLM_ROCM_USE_AITER=0`	`VLLM_ROCM_USE_AITER=1`

Uh oh!

Conversation

hongxiayang commented Jul 3, 2026 • edited by functionstackx Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

中文说明

Uh oh!

github-actions Bot commented Jul 3, 2026

Uh oh!

hongxiayang commented Jul 3, 2026

Uh oh!

claude Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review complete

Uh oh!

claude Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jul 3, 2026

Uh oh!

github-actions Bot commented Jul 3, 2026

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

hongxiayang commented Jul 4, 2026

Uh oh!

functionstackx commented Jul 4, 2026

Uh oh!

functionstackx commented Jul 4, 2026

Uh oh!

functionstackx commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

hongxiayang commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hongxiayang commented Jul 3, 2026 •

edited by functionstackx

Loading

claude Bot commented Jul 3, 2026 •

edited

Loading

functionstackx commented Jul 4, 2026 •

edited

Loading