Skip to content

fix(supervisor): drop sandbox child capability bounding set#2001

Open
alangou wants to merge 1 commit into
mainfrom
alangou/os-173-high-docker-sandboxes-run-as-root-with-sys_admin-and
Open

fix(supervisor): drop sandbox child capability bounding set#2001
alangou wants to merge 1 commit into
mainfrom
alangou/os-173-high-docker-sandboxes-run-as-root-with-sys_admin-and

Conversation

@alangou

@alangou alangou commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Drop the Linux capability bounding set in the common sandbox child privilege-drop path so workloads and openshell connect shells cannot regain container-granted capabilities after exec.

Related Issue

Closes #1452

Changes

  • Added Linux-only capability bounding set reduction in drop_privileges() after initgroups/setgid and before setuid.
  • Drops capabilities from 0..=cap_last_cap, with CAP_SETPCAP dropped last and a graceful fallback when CAP_SETPCAP is unavailable.
  • Added focused Linux-only regression coverage and documented the child-process invariant in architecture/sandbox.md.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E not applicable; no e2e files changed

Additional checks run:

  • mise run fmt && mise run e2e && mise run e2e:kubernetes && mise run e2e:docker
  • cargo test -p openshell-supervisor-process --lib process::tests -- --nocapture
  • cargo test -p openshell-supervisor-process --lib ssh::tests::pre_exec_always_calls_drop_privileges -- --nocapture

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

Reduce the Linux capability bounding set in the common privilege-drop path before executing sandbox workloads or connect shells.

Signed-off-by: Adrien Langou <alangou@nvidia.com>

#[cfg(target_os = "linux")]
fn kernel_cap_last_cap() -> u32 {
std::fs::read_to_string(PROC_CAP_LAST_CAP_PATH)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought here was whether there's an existing crate that would provide the functionality we need. Codex suggests that https://crates.io/crates/capctl may be applicable:

  - Best fit: capctl (https://crates.io/crates/capctl) / GitHub (https://github.com/cptpcrd/capctl)
    Pure Rust, direct Linux prctl() capability API, and it has the exact bounding-set surface we need: bounding::drop,
    bounding::ensure_dropped, bounding::clear, bounding::clear_unknown, bounding::read/probe, plus effective-set inspection through
    CapState. The important bit: it explicitly handles newer kernel capabilities via clear_unknown(), which matches the PR’s current /proc/
    sys/kernel/cap_last_cap intent better than most wrappers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that using a dedicated crate would make the parsing cleaner.

For this patch, I intentionally reused the same low-level mechanism that the codebase already uses for Linux process hardening. The goal was to keep the change narrow, avoid adding a new dependency, and avoid rewriting working capability-related code as part of a defense-in-depth fix.

My concern with adding a capabilities crate only for parsing is that it would be a bit inconsistent: if we introduce a crate that can also wrap the prctl capability operations, then we should probably use it consistently for the whole Linux capabilities path instead of mixing manual prctl calls with crate-based parsing.

So I’d prefer to keep this patch minimal, then consider a separate cleanup/refactor later if we want to move all Linux capability handling to a dedicated crate.

If we are ok to delay this fix for the NeMoClaw team I'm happy to add the crate and rewrite the capability handling with that crate. If the fix can't wait more then I can tackle this in a separate PR as well

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok to do this as a follow-up and apply it consistently.

@elezar elezar left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A codex-assisted review:

I think this still fails closed incorrectly for the Podman path. drop_capability_bounding_set() returns Ok(()) when CAP_SETPCAP is
not effective, and also when the first PR_CAPBSET_DROP returns EPERM. But the Podman driver currently drops SETPCAP from the
supervisor container, so that path leaves the child bounding set unchanged while still spawning the workload/connect shell.

Could we either keep SETPCAP available to the supervisor until child setup, or fail the spawn when the bounding set cannot be cleared
and is still nonempty? This may also be a good place to use capctl rather than custom /proc parsing and raw prctl;
capctl::caps::bounding::clear() plus an explicit “EPERM is only OK if the bounding set is already empty” check would make the invariant
clearer.

The current regression test skips when CAP_SETPCAP is unavailable, so it would not catch the Podman-relevant failure mode.

@johntmyers

Copy link
Copy Markdown
Collaborator

A codex-assisted review:

I think this still fails closed incorrectly for the Podman path....

+1. Seems like different drivers can do different drops on their own which would impact this common code that runs driver-agnostically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(sandbox): support --cap-drop / capability scoping on sandbox create

3 participants