fix(server): prevent exec relays from hanging on idle connections by Gal-Zaidman · Pull Request #1992 · NVIDIA/OpenShell

Gal-Zaidman · 2026-06-24T19:07:53Z

Summary

Gateway ExecSandbox calls could hang indefinitely after the command finished, when a supervisor session reset mid-exec orphaned the relay channel — the exec loop blocked on channel.wait() with no liveness backstop, so callers hung until their own deadline. This adds SSH and HTTP/2 keepalives and bounds the post-exit wait so a wedged/orphaned relay fails fast instead of hanging.
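The difference between keepalive-based liveness and an output-idle timeout can be shown with a small standalone sketch. This is not the gateway's code: `KeepalivePolicy`, `KeepaliveMonitor`, and the interval and miss-count values are hypothetical names and numbers chosen for illustration, and the PR does not state the values it configures. The point is that command output is never an input to the liveness decision; only unanswered probes are.

```rust
use std::time::Duration;

/// Hypothetical keepalive settings; the PR does not state the real values.
#[derive(Debug, Clone, Copy)]
struct KeepalivePolicy {
    interval: Duration,
    max_missed: u32,
}

#[derive(Debug, PartialEq, Eq)]
enum Liveness {
    Healthy,
    Dead,
}

/// Counts consecutive unanswered keepalive probes for one relay.
struct KeepaliveMonitor {
    policy: KeepalivePolicy,
    missed: u32,
}

impl KeepaliveMonitor {
    fn new(policy: KeepalivePolicy) -> Self {
        Self { policy, missed: 0 }
    }

    /// Called once per keepalive interval. `peer_replied` says whether the
    /// peer answered the previous probe. Command output is deliberately not
    /// a parameter: a silent command on a healthy relay stays alive.
    fn on_tick(&mut self, peer_replied: bool) -> Liveness {
        if peer_replied {
            self.missed = 0;
        } else {
            self.missed += 1;
        }
        if self.missed >= self.policy.max_missed {
            Liveness::Dead
        } else {
            Liveness::Healthy
        }
    }

    /// Approximate time needed to declare a silent peer dead.
    fn detection_bound(&self) -> Duration {
        self.policy.interval * self.policy.max_missed
    }
}

fn main() {
    let policy = KeepalivePolicy {
        interval: Duration::from_secs(15),
        max_missed: 3,
    };

    // A channel-silent exec on a healthy relay: probes keep being answered.
    let mut healthy = KeepaliveMonitor::new(policy);
    for _ in 0..100 {
        assert_eq!(healthy.on_tick(true), Liveness::Healthy);
    }

    // An orphaned relay: probes go unanswered until the limit is reached.
    let mut orphaned = KeepaliveMonitor::new(policy);
    assert_eq!(orphaned.on_tick(false), Liveness::Healthy);
    assert_eq!(orphaned.on_tick(false), Liveness::Healthy);
    assert_eq!(orphaned.on_tick(false), Liveness::Dead);

    println!("dead peer detected within about {:?}", orphaned.detection_bound());
}
```

Under this model an orphaned relay is shed after a bounded delay, while a long-running command that prints nothing is never penalized for being quiet.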

Related Issue

Closes #1990

Changes

Exec relay russh client: SSH keepalive so a wedged/orphaned relay is shed (returns an error) instead of parking on channel.wait() forever. Channel-silent execs (e.g. an agent that redirects stdout to a file) stay alive while the relay is healthy — liveness is probed via keepalive, not output-idle.
Bound how long the gateway waits for the trailing channel close after a command reports its exit status.
Return UNAVAILABLE when a relay closes before reporting an exit status, instead of a misleading exit code 1.
Server-side HTTP/2 keepalive (with the required Timer) on supervisor multiplex connections, to reduce the session resets that orphan relays.
Documented the relay-liveness backstops in architecture/gateway.md.
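The bounded post-exit wait and the UNAVAILABLE mapping described above can be sketched with the standard library alone. This is a simplified model, not the gateway's implementation: `RelayEvent`, `ExecError`, `ExecResult`, `run_exec_loop`, and the `close_grace` parameter are hypothetical names, and a `std::sync::mpsc` channel stands in for the SSH relay channel. A disconnected sender plays the role of a relay that was shed after missed keepalives.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::Duration;

/// Simplified stand-in for messages arriving on an exec relay channel.
enum RelayEvent {
    Output(Vec<u8>),
    ExitStatus(u32),
    Close,
}

#[derive(Debug, PartialEq, Eq)]
enum ExecError {
    /// The relay ended before reporting an exit status. A gRPC layer would
    /// surface this as UNAVAILABLE rather than inventing exit code 1.
    Unavailable,
}

#[derive(Debug, PartialEq, Eq)]
struct ExecResult {
    exit_code: u32,
    stdout: Vec<u8>,
}

fn run_exec_loop(
    events: Receiver<RelayEvent>,
    close_grace: Duration,
) -> Result<ExecResult, ExecError> {
    let mut stdout = Vec::new();
    let mut exit_code: Option<u32> = None;

    loop {
        let event = match exit_code {
            // Before the exit status: no output-idle timer. A dead relay
            // shows up as a disconnected channel instead.
            None => match events.recv() {
                Ok(event) => event,
                Err(_) => return Err(ExecError::Unavailable),
            },
            // After the exit status: wait only a bounded time for the
            // trailing close, then return what we already have.
            Some(_) => match events.recv_timeout(close_grace) {
                Ok(event) => event,
                Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
            },
        };

        match event {
            RelayEvent::Output(bytes) => stdout.extend_from_slice(&bytes),
            RelayEvent::ExitStatus(code) => exit_code = Some(code),
            RelayEvent::Close => break,
        }
    }

    match exit_code {
        Some(code) => Ok(ExecResult { exit_code: code, stdout }),
        None => Err(ExecError::Unavailable),
    }
}

fn main() {
    // The command exits but the trailing close never arrives: the sender is
    // kept alive, so only the grace period ends the wait.
    let (tx, rx) = mpsc::channel();
    tx.send(RelayEvent::Output(b"done\n".to_vec())).unwrap();
    tx.send(RelayEvent::ExitStatus(0)).unwrap();
    let result = run_exec_loop(rx, Duration::from_millis(50));
    assert_eq!(
        result,
        Ok(ExecResult { exit_code: 0, stdout: b"done\n".to_vec() })
    );

    // The relay closes without ever reporting an exit status.
    let (tx2, rx2) = mpsc::channel();
    tx2.send(RelayEvent::Close).unwrap();
    assert_eq!(
        run_exec_loop(rx2, Duration::from_millis(50)),
        Err(ExecError::Unavailable)
    );
    drop(tx);
}
```

Keeping the exit status as an `Option` makes the "closed before exit status" case explicit, so it cannot silently fall back to a default exit code.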

Testing

mise run pre-commit — rust:format:check, cargo clippy -D warnings, and markdownlint are clean for this change (ran individually). Note: the local mise run pre-commit aborts on its python:proto step due to a missing grpc_tools dev dependency in the venv, unrelated to this change; CI runs the full suite.
Unit tests added/updated — none; the exec loop requires a live SSH relay, so there is no isolated unit surface.
E2E tests added/updated — N/A; reproducing the fix path needs a mid-exec supervisor-session reset, which isn't reliably stageable in an in-process e2e.

Manually validated on a Kubernetes deployment: rebuilt and deployed the gateway image, confirmed the gateway is healthy, that long channel-silent execs are not killed by the keepalive, and that the previously-observed multi-sandbox hang no longer reproduces.

Checklist

Follows Conventional Commits
Commits are signed off (DCO)

Add HTTP/2 keepalive on supervisor multiplex connections so half-dead sessions cannot leave in-flight exec relays parked indefinitely. Configure SSH keepalive on exec relay clients so long silent commands are not timed out on stdout idle alone; wedged or orphaned relays fail after missed keepalives instead. After a command reports exit status, bound how long the gateway waits for the trailing channel close. Return UNAVAILABLE when a relay closes before reporting exit status rather than defaulting to exit code 1.

Signed-off-by: Gal Zaidman <gzaidman@nvidia.com>

copy-pr-bot · 2026-06-24T19:07:57Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-06-24T19:09:33Z

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

Gal-Zaidman · 2026-06-24T19:11:59Z

I have read the DCO document and I hereby sign the DCO.

Gal-Zaidman · 2026-06-24T19:12:23Z

recheck

TaylorMutch · 2026-06-24T20:24:00Z

/ok to test b4878be

TaylorMutch · 2026-06-24T20:24:14Z

@Gal-Zaidman have you been able to verify this resolves the issue in your environment?

Gal-Zaidman · 2026-06-25T08:08:06Z

> @Gal-Zaidman have you been able to verify this resolves the issue in your environment?

Yes. I just ran a job with 80 concurrent agents, each running an SWE-bench task with long execs (that is how harbor works), and saw zero hangs.
Before the fix, even with 20 agents, more than half would hang.

Gal-Zaidman requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners June 24, 2026 19:07

Gal-Zaidman wants to merge 1 commit into NVIDIA:main from Gal-Zaidman:1990-exec-relay-idle-timeout
