Skip to content

[AMD] MiniMax-M3 MXFP8 MI355X vLLM: nightly + AITER-on TP4 + emulatin linear / MiniMax-M3 MXFP8 MI355X vLLM:升级 nightly + 启用 AITER TP4 + emulation linear#2003

Open
hongxiayang wants to merge 4 commits into
mainfrom
hy/mm3-mxfp8-configs
Open

[AMD] MiniMax-M3 MXFP8 MI355X vLLM: nightly + AITER-on TP4 + emulatin linear / MiniMax-M3 MXFP8 MI355X vLLM:升级 nightly + 启用 AITER TP4 + emulation linear#2003
hongxiayang wants to merge 4 commits into
mainfrom
hy/mm3-mxfp8-configs

Conversation

@hongxiayang

@hongxiayang hongxiayang commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Bump minimaxm3-fp8-mi355x-vllm to nightly-09663abde
enable AITER for TP-only explicitly via --moe-backend aiter,
use --linear-backend emulation (beats stock native MXFP8 linear),
--max-num-batched-tokens 32768, and
a single TP4 conc 1-512 sweep for 1k1k and 8k1k.

Thanks for @chunfangamd to help inferenceX setup

中文说明

minimaxm3-fp8-mi355x-vllm 镜像升级至最新 nightly,显式启用 AITER(--moe-backend aiter),使用 --linear-backend emulation(优于原生 MXFP8 linear),设置 --max-num-batched-tokens 32768,并配置 TP4 单节点 1k1k/8k1k 并发 1-512 扫描。

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。

如需更多帮助,PR 作者可通过 Slack 联系核心维护者。

@hongxiayang hongxiayang marked this pull request as ready for review July 3, 2026 20:13
@hongxiayang hongxiayang requested a review from a team July 3, 2026 20:13
@hongxiayang

Copy link
Copy Markdown
Collaborator Author

@claude

claude Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Claude finished @hongxiayang's task in 6m 3s —— View job


Review complete

  • Gather context (PR diff, changed files)
  • Review benchmark script changes (verified --moe-backend aiter and --linear-backend emulation are valid backend choices in the vLLM nightly at commit 09663abde)
  • Validate master config & perf-changelog rules (changelog entry present, appended at end of file, image is public Docker Hub)
  • Post review feedback

Summary: One 🔴 blocking issue — the new perf-changelog entry's pr-link is a placeholder pull/XXX and should be pull/2003 (inline suggestion posted). One 🟡 warning — the AITER master-switch gate was inverted so the (now-unused) EP branch sets VLLM_ROCM_USE_AITER=0 while --moe-backend aiter is hardcoded, an untested combination if an ep: search-space entry is ever re-added; please confirm intent or note it in the script comment. Everything else checks out: config sweep simplification matches the changelog, the image is publicly accessible, and the serve command follows the repo's formatting and EP-conditional patterns.

Comment thread perf-changelog.yaml Outdated
- "Add --max-num-batched-tokens 32768 (env MAX_NUM_BATCHED_TOKENS) to enlarge the per-step prefill budget and improve TP4 throughput at high concurrency"
- "Enable the AITER master switch for TP-only (no-EP) runs via --moe-backend aiter: the earlier degenerate-output issue that forced it off for TP-only is fixed by vllm-project/vllm#47158, so TP4 uses the AITER_MXFP8 MoE path (verified GSM8K 0.9613 flex / 0.9621 strict on this nightly)"
- "Simplify both search spaces to a single TP4 conc 1-512 sweep for 1k1k and 8k1k (drop TP8 and TP4/EP4: TP8 has poor throughput/GPU and plain TP4 matches or beats TP4/EP4 at high concurrency)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 BLOCKING: pr-link is a placeholder (pull/XXX) instead of the actual PR number.
Why it matters: The perf-changelog is the record tying config/perf changes to their PRs; every other entry has a real link, and this one will be a dead link once merged.
Fix:

Suggested change
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/2003

Comment on lines 54 to 59
# https://github.com/vllm-project/vllm/pull/47158 fix this
if printf '%s\n' "${PARALLEL_ARGS[@]}" | grep -qxF -- '--enable-expert-parallel'; then
export VLLM_ROCM_USE_AITER=1
else
export VLLM_ROCM_USE_AITER=0
else
export VLLM_ROCM_USE_AITER=1
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 WARNING: The EP branch now sets VLLM_ROCM_USE_AITER=0 while --moe-backend aiter is hardcoded unconditionally on the serve command (line 73) — the exact inverse of the previously validated EP configuration (which relied on VLLM_ROCM_USE_AITER=1 with the aiter fused-MoE path auto-selected).
Why it matters: No current search-space entry uses EP (this PR drops tp4/ep4), so nothing breaks in CI today. But the script still accepts EP_SIZE/DP_ATTENTION from configs, and anyone re-adding an ep: entry (which existed until this PR) will silently run an untested combination: forced aiter MoE backend with the AITER master switch off. Per vLLM's rocm_aiter_ops, is_fused_moe_enabled() requires VLLM_ROCM_USE_AITER=1, so master-switch-gated aiter paths (shared-experts fusion, attention/norm ops) behave differently from what was validated. If EP is intentionally unsupported now, please say so in the comment; the current # vllm-project/vllm#47158 fix this comment doesn't explain why EP flips the switch off.

Comment on lines +54 to +61
# https://github.com/vllm-project/vllm/pull/47158 fix this
if printf '%s\n' "${PARALLEL_ARGS[@]}" | grep -qxF -- '--enable-expert-parallel'; then
export VLLM_ROCM_USE_AITER=1
else
export VLLM_ROCM_USE_AITER=0
else
export VLLM_ROCM_USE_AITER=1
fi

# Larger per-step prefill token budget to improve TP4 throughput at high

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The EP branch of the AITER master-switch gate silently flipped from VLLM_ROCM_USE_AITER=1 to 0 with no explanation (only the TP-only flip is justified by #47158 in the PR description). The current search space is TP4-only so the EP branch is dead code and there is no CI impact, but this is a latent trap for future EP re-introduction and the sibling minimaxm3_fp8_mi355x_mtp.sh still carries the original EP=1/TP=0 gate with the detailed rationale comment, so the two scripts now silently disagree on AITER gating policy. Suggest either deleting the now-dead EP branch here or restoring EP=1 with a comment explaining the divergence; the replacement comment '# vllm-project/vllm#47158 fix this' is also ungrammatical and no longer explains WHY EP gets 0.

Extended reasoning...

Nit / maintainability — no CI impact today.

The AITER master-switch conditional at benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh:54-58 was flipped on both branches by this PR:

Branch Before After
EP-on VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER=0
TP-only VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER=1

The PR description and the new perf-changelog entry justify only the TP-only flipvllm-project/vllm#47158 fixed the earlier degenerate-output issue that had forced AITER off for TP-only. The EP branch flip from 1→0 has no explanation, and the replacement comment # https://github.com/vllm-project/vllm/pull/47158 fix this is ungrammatical and only speaks to the TP-only side.

Why this is only a nit, not a bug: the refutation is correct that the current sweep never exercises the EP branch. configs/amd-master.yaml:2487-2494 was simplified to a single { tp: 4, conc-start: 1, conc-end: 512 } for both 1k1k and 8k1k, with no ep field and no dp-attn. So both DP_ATTENTION and EP_SIZE > 1 are false at runtime, --enable-expert-parallel never lands in PARALLEL_ARGS, and every sweep row falls through the else branch (VLLM_ROCM_USE_AITER=1). Additionally, --moe-backend aiter is now set unconditionally on the vllm serve command line, so the AITER MoE path is forced regardless of the master switch value.

Why it is still worth flagging:

  1. The EP conditional is still physically present in the script (lines 55-57). Any future re-introduction of an EP row to the search space, or a manual EP_SIZE=4/DP_ATTENTION=true run, would silently take the regressed EP=0 setting instead of the previously validated EP=1.
  2. The sibling benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x_mtp.sh still carries the original logic (EP=1, else=0) with the detailed rationale comment intact at lines 88-96. The MTP recipe is on the older nightly-4559c43a image and has not yet moved to the nightly-09663abde that carries #47158's fix — but the divergence is now silent (no comment explains why the base recipe gates AITER differently on EP), which makes future maintenance harder.
  3. The new comment is ungrammatical (fix this) and no longer documents why EP gets AITER=0. The old comment explicitly named the invariant: with EP the fused AITER MoE path is the auto-selected backend, and with TP-only AITER produced degenerate MiniMax-M3 output.

Step-by-step proof this is latent-only today:

  1. configs/amd-master.yaml:2487-2494 — search space is - { tp: 4, conc-start: 1, conc-end: 512 } with no ep/dp-attn field.
  2. In minimaxm3_fp8_mi355x.sh, DP_ATTENTION is unset → the if [ "$DP_ATTENTION" = "true" ] branch is false. EP_SIZE defaults to 1 → the elif [ "$EP_SIZE" -gt 1 ] branch is false. PARALLEL_ARGS = (--tensor-parallel-size 4).
  3. printf ... | grep -qxF -- --enable-expert-parallel → no match → elseVLLM_ROCM_USE_AITER=1.
  4. So all TP4 rows get AITER=1 as intended; EP=0 branch is unreachable this sweep.
  5. If a future PR re-adds { tp: 4, ep: 4, ... } (which existed in the pre-PR search space!), that row would now set VLLM_ROCM_USE_AITER=0 — different from the pre-PR behavior for the same config, with no comment to warn the maintainer.

Suggested fix (pick one):

  • Delete the now-dead EP branch entirely: export VLLM_ROCM_USE_AITER=1 unconditionally, since the sweep no longer includes EP configs and #47158 makes AITER=1 safe for TP-only.
  • Or restore VLLM_ROCM_USE_AITER=1 on the EP branch and update the comment to explain that both branches now use AITER post-#47158 (matching the sibling MTP script's rationale, which will presumably follow when it upgrades to the new nightly).

Either fixes the latent trap and the stale/ungrammatical comment together.

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

hongxiayang and others added 4 commits July 4, 2026 01:24
…n linear

Bump minimaxm3-fp8-mi355x-vllm to nightly-09663abde..., enable AITER for
TP-only (vllm#47158 fix) via --moe-backend aiter, use --linear-backend
emulation (beats stock native MXFP8 linear), --max-num-batched-tokens
32768, and a single TP4 conc 1-512 sweep for 1k1k and 8k1k.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
…ecipe

Match minimaxm3_fp8_mi355x.sh: add --moe-backend aiter, --linear-backend
emulation, and --max-num-batched-tokens 32768 to the EAGLE3 MTP recipe
(keeping --speculative-config and the EAGLE3 in-place patch). Make
DRAFT_MODEL env-overridable for local testing. Verified locally on
nightly-09663abde (TP4 conc512, local eagle3 draft): 5120/5120 completed,
eagle3 patch a no-op (nightly natively supports SupportsEagle3).

Co-authored-by: Cursor <cursoragent@cursor.com>
Make the ported MTP recipe changes take effect: bump the
minimaxm3-fp8-mi355x-vllm-mtp image to nightly-09663abde... and simplify
its search space to a single TP4 conc 1-512 sweep (drop TP8/EP layouts,
matching the non-MTP config). Add the changelog entry.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hongxiayang hongxiayang force-pushed the hy/mm3-mxfp8-configs branch from e0974d5 to c696685 Compare July 4, 2026 01:25
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

@functionstackx functionstackx changed the title [AMD] MiniMax-M3 MXFP8 MI355X vLLM: nightly + AITER-on TP4 + emulatin linear [AMD] MiniMax-M3 MXFP8 MI355X vLLM: nightly + AITER-on TP4 + emulatin linear / MiniMax-M3 MXFP8 MI355X vLLM:升级 nightly + 启用 AITER TP4 + emulation linear Jul 4, 2026
@hongxiayang

Copy link
Copy Markdown
Collaborator Author

@functionstackx error again Error: File was unable to be removed Error: EACCES: permission denied, rmdir '/it-share/gharunners2/gharunner06/actions-runner/_work/InferenceX/InferenceX/benchmark_logs/logs/slurm_job-32929'

@functionstackx

Copy link
Copy Markdown
Collaborator

@hongxiayang here is the claude command to fix it. this error happens when someone from AMD runs an github workflow that generates files as ROOT

https://github.com/SemiAnalysisAI/InferenceX/blob/main/.claude/commands/clean-amd-mi355-runner-root-files.md

@functionstackx

Copy link
Copy Markdown
Collaborator

@hongxiayang The root-owned files on gharunner06 (/it-share/gharunners2/gharunner06/actions-runner/_work/InferenceX/InferenceX/benchmark_logs/logs/slurm_job-32929) have been cleaned up — I ran the /clean-amd-mi355-runner-root-files Claude command to scan and verify all MI355X TW runners. The scan came back clean (zero root-owned files across all gharunners*/gharunner* workspaces), so the offending files were already removed since the last failure. I've rerun the failed 8k1k jobs from run 28690689417.

Root cause reminder: multi-node disagg benchmarks submit Slurm jobs whose containers write logs as root into the runner workspace (benchmark_logs/logs/slurm_job-*). When a job is cancelled, teardown is skipped and root-owned dirs get stranded — the runner user can't delete them, which breaks actions/checkout workspace cleanup for every subsequent job on that runner.

Could you please advise your coworkers (cc @billishyahao @seungrokj @YukioZzz) to avoid generating files as root in the runner workspace? If Slurm containers must run as root, the cleanup/teardown script should chown or rm those files before the job exits. This would prevent the EACCES errors from recurring.

@functionstackx

functionstackx commented Jul 4, 2026

Copy link
Copy Markdown
Collaborator

@hongxiayang adding as an reminder to CONTRIBUTING.md see #2043

  1. Never write as root into the runner workspace — use /tmp or a dedicated staging path instead.
  2. If root writes are unavoidable, add a trap cleanup EXIT that removes root-owned files before the job exits.
  3. Test by cancelling a running benchmark mid-flight and verifying no root files remain.

One stranded root-owned directory bricks every subsequent job on that runner and blocks the entire AMD MI355X sweep queue for all contributors.

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

@hongxiayang

Copy link
Copy Markdown
Collaborator Author

@functionstackx Thank you for help and resolve the issue. Can this PR be merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants