[Klaud Cold] Add /debug-mi300-enroot-pyxis command #2078
…ures
Encodes the enroot-nsenter userns diagnosis: signature matching, an srun-based read-only sysctl survey via the amd-vultr-mi300 controller, the approval-gated apparmor_restrict_unprivileged_userns fix with sysctl.d persistence, sweep reruns, and the provisioning-image follow-up. The new /debug-mi300-enroot-pyxis command is meant for diagnosing enroot/pyxis container launch failures on the MI300X cluster.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
LGTM — documentation-only addition of a new debugging slash command.
Extended reasoning...
Overview
This PR adds a single new file, .claude/commands/debug-mi300-enroot-pyxis.md, which documents a debugging workflow for enroot/pyxis user-namespace failures on the MI300X Vultr cluster. It captures the error signature, controller-access pattern via ssh amd-vultr-mi300 + srun, a read-only sysctl survey, the approval-gated fix, and a note about the durable provisioning-image fix.
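To make the read-only survey concrete, here is a minimal Python sketch of how it could be driven from a workstation. It is not taken from the runbook: the function names, the exact srun flags, and the "hostname value" output format are assumptions for illustration. It only reads the sysctl and never writes it.

```python
SYSCTL_KEY = "kernel.apparmor_restrict_unprivileged_userns"


def build_survey_command(nodes, controller="amd-vultr-mi300"):
    """Build an ssh argv that reads the sysctl on each node via srun.

    Assumption: srun is run on the controller and starts one task per node,
    each printing "<hostname> <value>". Nothing is modified on the nodes.
    """
    remote = (
        f"srun -w {','.join(nodes)} -N {len(nodes)} "
        f"sh -c 'echo \"$(hostname) $(sysctl -n {SYSCTL_KEY})\"'"
    )
    return ["ssh", controller, remote]


def classify_nodes(survey_output):
    """Split survey lines into (restricted, baseline) node lists.

    A value of 1 means unprivileged user namespaces are restricted;
    0 is the working baseline described in this PR. Lines that do not
    look like "<hostname> <0|1>" are ignored.
    """
    restricted, baseline = [], []
    for line in survey_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[1] not in ("0", "1"):
            continue
        (restricted if parts[1] == "1" else baseline).append(parts[0])
    return sorted(restricted), sorted(baseline)


if __name__ == "__main__":
    # Made-up sample output; real values come from running the command.
    sample = "chi-mi300x-043 1\nchi-mi300x-049 0\nsome unrelated line\n"
    print(build_survey_command(["chi-mi300x-043", "chi-mi300x-049"]))
    print(classify_nodes(sample))
```

Passing the argv list to subprocess.run without a shell keeps the `$(...)` substitutions from expanding locally, so they are evaluated on the compute nodes.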
Security risks
None from this change itself — it is a Markdown runbook, not executable code. The fix it describes (setting kernel.apparmor_restrict_unprivileged_userns=0) does weaken a kernel mitigation, but the file appropriately gates that action behind explicit user approval (AskUserQuestion) and only aligns nodes to the cluster's existing working baseline. No secrets, credentials, or sensitive endpoints are introduced.
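The approval gate can be illustrated with a small Python sketch. This is a hypothetical helper, not the runbook's implementation: `plan_fix` and its `approved` flag stand in for the AskUserQuestion step, and the srun invocation is an assumed form. It only builds command strings and refuses to do so without approval.

```python
import shlex

SYSCTL_KEY = "kernel.apparmor_restrict_unprivileged_userns"
PERSIST_PATH = "/etc/sysctl.d/99-enroot-userns.conf"


def plan_fix(nodes, approved):
    """Return the per-node fix commands, or refuse if approval is missing.

    Each command sets the sysctl at runtime (sysctl -w) and writes the
    sysctl.d file so the setting survives a reboot. Nothing is executed here.
    """
    if not approved:
        raise PermissionError("fix weakens a kernel mitigation; explicit approval required")
    per_node = (
        f"sysctl -w {SYSCTL_KEY}=0 && "
        f"echo {SYSCTL_KEY}=0 > {PERSIST_PATH}"
    )
    # shlex.quote keeps the && and > inside the sh -c argument.
    return [f"srun -w {node} -N 1 sh -c {shlex.quote(per_node)}" for node in nodes]


if __name__ == "__main__":
    for command in plan_fix(["chi-mi300x-043"], approved=True):
        print(command)
```

Separating planning from execution lets the exact commands be shown to the user before anything runs on a node.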
Level of scrutiny
Low. This is a .claude/commands/*.md slash-command reference file with no impact on production code, CI, or benchmark behavior. It only takes effect when a user explicitly invokes the command interactively.
Other factors
No bugs were found by the bug hunting system, there is no prior reviewer discussion to reconcile, and the runbook matches the incident described in the PR description (validated live on nodes chi-mi300x-043/049/057/058, with canaries #2062/#2063/#2069 rerun). Safe to auto-approve.
Summary
Adds .claude/commands/debug-mi300-enroot-pyxis.md, encoding today's MI300X debugging session: the "enroot-nsenter: failed to create user namespace: Permission denied" signature, root-on-controller access via ssh amd-vultr-mi300 with srun to reach compute nodes (no direct root SSH), a read-only survey of kernel.apparmor_restrict_unprivileged_userns across nodes, and the approval-gated fix (sysctl -w plus /etc/sysctl.d/99-enroot-userns.conf) matching the cluster's working baseline.
Also covers why unshare -U works even on broken nodes (Ubuntu's stock AppArmor profile covers unshare), the rerun step, and the durable fix (bake the sysctl into the provisioning image; freshly provisioned nodes regress otherwise).
🤖 Generated with Claude Code
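As a rough illustration of the signature-matching step, the Python sketch below scans job logs for the error string quoted in the summary. The helper name and the dict-of-logs input shape are assumptions; how the runbook actually collects logs is not specified here.

```python
import re

# Error signature quoted in the PR summary.
SIGNATURE = re.compile(
    r"enroot-nsenter: failed to create user namespace: Permission denied"
)


def nodes_with_signature(log_by_node):
    """Return the sorted node names whose log text contains the signature.

    log_by_node maps a node name to that node's job log as a string
    (a hypothetical input shape chosen for this sketch).
    """
    return sorted(
        node for node, log in log_by_node.items() if SIGNATURE.search(log)
    )


if __name__ == "__main__":
    # Made-up log snippets for demonstration only.
    logs = {
        "chi-mi300x-043": "pyxis: importing image\nenroot-nsenter: failed to create user namespace: Permission denied\n",
        "chi-mi300x-049": "pyxis: importing image\njob finished\n",
    }
    print(nodes_with_signature(logs))
```

Nodes flagged this way are candidates for the sysctl survey, since other failures can also surface as container start errors.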