[Klaud Cold] CONTRIBUTING: warn against root-owned files on AMD runners / Contributing guide: warn against leaving root-owned files in the AMD runner workspace #2043
…rs / Contributing guide: warn against leaving root-owned files in the AMD runner workspace
Add a section to CONTRIBUTING.md (and its _zh counterpart) documenting that Slurm containers must not strand root-owned files in GitHub Actions runner workspaces on the AMD MI355X TW cluster. A single stranded directory bricks the entire runner queue for everyone. Includes rules for avoiding the problem and a pointer to the recovery procedure.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
LGTM — documentation-only addition with EN/zh in sync.
Extended reasoning...
Overview
This PR adds a new warning section to CONTRIBUTING.md and its Chinese counterpart CONTRIBUTING_zh.md about not leaving root-owned files in GitHub Actions runner workspaces on the AMD MI355X TW cluster. It documents three rules (no root writes to workspace, cleanup trap if unavoidable, test the teardown path) and links to the existing recovery procedure at .claude/commands/clean-amd-mi355-runner-root-files.md (verified to exist).
Security risks
None. This is purely documentation guidance — no executable code, config, or workflow changes.
Level of scrutiny
Low. This is a pure Markdown edit with no functional impact on the build, CI, or runtime. Both language versions are updated in sync per repo convention, and the referenced command file exists.
Other factors
No bugs flagged by the bug hunting system, no prior reviews on the timeline, and the change follows the same bilingual pattern as recent merged PRs in the tree. The motivation (recurring runner brick incidents) is well-established and the guidance matches operational reality reflected in AGENTS.md and existing command files.
Summary
Adds a new section to CONTRIBUTING.md (and its CONTRIBUTING_zh.md counterpart) warning contributors not to leave root-owned files in GitHub Actions runner workspaces on the AMD MI355X TW cluster.
Problem: Multi-node Slurm containers run as root and write logs into the runner workspace. When a job is cancelled before teardown, root-owned directories get stranded. The runner user cannot delete them, causing EACCES: permission denied errors at actions/checkout that brick the runner for every subsequent job. Since all AMD MI355X sweeps share the same runner pool, one stranded directory blocks the entire queue for everyone.
New rules documented:
- Do not write to the runner workspace as root; use /tmp or a dedicated staging path
- If root writes are unavoidable, add a cleanup trap (trap cleanup EXIT) that removes root-owned files before exit
- Test the teardown path
Includes a pointer to the recovery procedure in .claude/commands/clean-amd-mi355-runner-root-files.md.
Motivated by recurring incidents on gharunner06, most recently #2003 (comment).
Test plan
CONTRIBUTING.md and CONTRIBUTING_zh.md updated in sync
🤖 Generated with Claude Code
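The first two rules above could look roughly like the sketch below. It is an illustration, not the text added to CONTRIBUTING.md: STAGING_DIR, the slurm-job-logs prefix, and run_job are hypothetical names, and the real job scripts may be organized differently. Note that a hard kill (SIGKILL) cannot be trapped, which is one reason writing outside the workspace is the safer default.

```shell
#!/usr/bin/env bash
# Sketch only: STAGING_DIR and run_job are hypothetical names for illustration.
set -euo pipefail

# Rule 1: write job output to a staging directory outside the runner workspace.
STAGING_DIR="$(mktemp -d "${TMPDIR:-/tmp}/slurm-job-logs.XXXXXX")"

# Rule 2: remove whatever this (possibly root) process created before exiting.
cleanup() {
    rm -rf -- "$STAGING_DIR"
}
trap cleanup EXIT
# Turn common termination signals into a normal exit so the EXIT trap runs.
trap 'exit 130' INT
trap 'exit 143' TERM

# Placeholder for the real workload; it only writes into the staging directory.
run_job() {
    echo "job output" > "$STAGING_DIR/job.log"
}

run_job
```

If writing into the workspace as root really cannot be avoided, the same trap would point at that workspace path instead of a temporary directory.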
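Chinese description
Adds a new section to CONTRIBUTING.md and its Chinese version CONTRIBUTING_zh.md warning contributors not to leave root-owned files in GitHub Actions runner workspaces on the AMD MI355X TW cluster. If log files written by Slurm containers running as root are not cleaned up after a job is cancelled, every subsequent job fails with EACCES: permission denied, blocking the entire runner queue. Three rules are added: do not write to the workspace as root, always add a cleanup trap, and test the teardown path.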
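One possible way to exercise the third rule (testing the teardown path) is to cancel a test job and then check the workspace for entries the runner user does not own. The helper names below (list_foreign_entries, check_workspace) are hypothetical and are not part of this PR or the linked recovery procedure; the sketch assumes it runs as the runner user.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the repository.
set -euo pipefail

# Print every entry under a directory that is not owned by the current user.
list_foreign_entries() {
    local dir="$1"
    find "$dir" ! -user "$(id -u)" -print
}

# Return non-zero if the directory contains entries owned by someone else.
check_workspace() {
    local dir="$1" foreign
    # Unreadable subdirectories make find exit non-zero; keep what it did print.
    foreign="$(list_foreign_entries "$dir" 2>/dev/null || true)"
    if [ -n "$foreign" ]; then
        printf 'Entries not owned by uid %s:\n%s\n' "$(id -u)" "$foreign" >&2
        return 1
    fi
    echo "workspace clean: $dir"
}

# Run the check only when a workspace path is given as the first argument.
if [ "$#" -gt 0 ]; then
    check_workspace "$1"
fi
```

In a GitHub Actions step this could be invoked with "$GITHUB_WORKSPACE" as the argument after a deliberately cancelled test run.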