fix(server): prevent exec relays from hanging on idle connections #1992

Open
Gal-Zaidman wants to merge 1 commit into NVIDIA:main from Gal-Zaidman:1990-exec-relay-idle-timeout
@Gal-Zaidman


Summary

Gateway ExecSandbox calls could hang indefinitely after the command had finished. When a supervisor session reset mid-exec, it orphaned the relay channel, and the exec loop blocked on channel.wait() with no liveness backstop, so callers hung until their own deadline expired. This change adds SSH and HTTP/2 keepalives and bounds the post-exit wait so that a wedged or orphaned relay fails fast instead of hanging.
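
To make the failure mode concrete, here is a minimal, std-only sketch of an exec loop with a bounded post-exit wait. It is not the gateway's code: the real loop is async and built on russh, whereas this model uses a blocking `std::sync::mpsc` channel, and `ChannelEvent`, `ExecError`, `drain_exec`, and `close_grace` are all hypothetical names chosen for illustration.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical events a relay channel can deliver (illustrative only).
#[derive(Debug)]
pub enum ChannelEvent {
    Data(Vec<u8>),
    ExitStatus(u32),
    Close,
}

/// Hypothetical error type; `Unavailable` stands in for gRPC UNAVAILABLE.
#[derive(Debug, PartialEq, Eq)]
pub enum ExecError {
    Unavailable(&'static str),
}

/// Drains a relay channel and returns the exit code plus collected output.
///
/// Before an exit status arrives the loop blocks without a timeout, because
/// liveness is assumed to be the keepalive's job (a dead relay drops the
/// sender). Once the exit status is known, the total wait for the trailing
/// close is capped at `close_grace`.
pub fn drain_exec(
    rx: &Receiver<ChannelEvent>,
    close_grace: Duration,
) -> Result<(u32, Vec<u8>), ExecError> {
    let mut output = Vec::new();
    let mut exit: Option<(u32, Instant)> = None;

    loop {
        let event = match exit {
            None => rx.recv().map_err(|_| RecvTimeoutError::Disconnected),
            Some((_, deadline)) => {
                // Remaining budget shrinks even if late data keeps arriving.
                rx.recv_timeout(deadline.saturating_duration_since(Instant::now()))
            }
        };

        match event {
            Ok(ChannelEvent::Data(bytes)) => output.extend_from_slice(&bytes),
            Ok(ChannelEvent::ExitStatus(code)) => {
                exit = Some((code, Instant::now() + close_grace));
            }
            // Close, a dropped relay, or an expired grace period all end the
            // loop. Without an exit status the result is an error rather
            // than a made-up exit code.
            Ok(ChannelEvent::Close) | Err(_) => {
                return match exit {
                    Some((code, _)) => Ok((code, output)),
                    None => Err(ExecError::Unavailable(
                        "relay closed before reporting an exit status",
                    )),
                };
            }
        }
    }
}
```

In this model, a relay that reports its exit status but never sends the trailing close returns after at most `close_grace`, and a relay that disappears without an exit status yields `Unavailable` instead of a fabricated code.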

Related Issue

Closes #1990

Changes

  • Exec relay russh client: SSH keepalive so a wedged/orphaned relay is shed (returns an error) instead of parking on channel.wait() forever. Channel-silent execs (e.g. an agent that redirects stdout to a file) stay alive while the relay is healthy — liveness is probed via keepalive, not output-idle.
  • Bound how long the gateway waits for the trailing channel close after a command reports its exit status.
  • Return UNAVAILABLE when a relay closes before reporting an exit status, instead of a misleading exit code 1.
  • Server-side HTTP/2 keepalive (with the required Timer) on supervisor multiplex connections, to reduce the session resets that orphan relays.
  • Documented the relay-liveness backstops in architecture/gateway.md.
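
The first bullet's distinction between keepalive liveness and output idleness can be sketched as a small counter. This is a std-only model under stated assumptions, not the russh implementation: `KeepaliveTracker`, `Liveness`, and the interval and miss limit used in the usage note are hypothetical and not taken from this PR.

```rust
use std::time::Duration;

/// Hypothetical liveness verdict for one relay (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Liveness {
    Healthy,
    /// The peer missed the allowed number of consecutive keepalives.
    Dead,
}

/// Tracks unanswered keepalive probes. Command output is deliberately not
/// an input here: a silent command on a healthy relay never counts as idle.
#[derive(Debug)]
pub struct KeepaliveTracker {
    interval: Duration,
    max_missed: u32,
    missed: u32,
}

impl KeepaliveTracker {
    pub fn new(interval: Duration, max_missed: u32) -> Self {
        Self { interval, max_missed, missed: 0 }
    }

    /// Record that one probe interval elapsed with no reply.
    pub fn on_probe_unanswered(&mut self) -> Liveness {
        self.missed += 1;
        self.liveness()
    }

    /// Record a keepalive reply; any reply resets the miss counter.
    pub fn on_reply(&mut self) {
        self.missed = 0;
    }

    pub fn liveness(&self) -> Liveness {
        if self.missed >= self.max_missed {
            Liveness::Dead
        } else {
            Liveness::Healthy
        }
    }

    /// Upper bound on how long a wedged relay goes unnoticed in this model.
    pub fn worst_case_detection(&self) -> Duration {
        self.interval * self.max_missed
    }
}
```

With an assumed 15-second interval and a limit of 3 misses, a relay that keeps answering probes stays `Healthy` however long the command is silent, while a wedged relay is reported `Dead` on the third unanswered probe, about 45 seconds after it stopped responding.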

Testing

  • mise run pre-commit: rust:format:check, cargo clippy -D warnings, and markdownlint are clean for this change (each was run individually). Note: the local mise run pre-commit aborts at its python:proto step because the grpc_tools dev dependency is missing from the venv, which is unrelated to this change; CI runs the full suite.
  • Unit tests added/updated — none; the exec loop requires a live SSH relay, so there is no isolated unit surface.
  • E2E tests added/updated — N/A; reproducing the fix path needs a mid-exec supervisor-session reset, which isn't reliably stageable in an in-process e2e.

Manually validated on a Kubernetes deployment: rebuilt and deployed the gateway image, confirmed the gateway is healthy, that long channel-silent execs are not killed by the keepalive, and that the previously-observed multi-sandbox hang no longer reproduces.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

Add HTTP/2 keepalive on supervisor multiplex connections so half-dead
sessions cannot leave in-flight exec relays parked indefinitely. Configure
SSH keepalive on exec relay clients so long silent commands are not timed
out on stdout idle alone; wedged or orphaned relays fail after missed
keepalives instead.

After a command reports exit status, bound how long the gateway waits for
the trailing channel close. Return UNAVAILABLE when a relay closes before
reporting exit status rather than defaulting to exit code 1.

Signed-off-by: Gal Zaidman <gzaidman@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 24, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Jun 24, 2026


All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@Gal-Zaidman

Author

I have read the DCO document and I hereby sign the DCO.

@Gal-Zaidman

Author

recheck

@TaylorMutch

Collaborator

/ok to test b4878be

@TaylorMutch

Collaborator

@Gal-Zaidman have you been able to verify this resolves the issue in your environment?

@Gal-Zaidman

Gal-Zaidman commented Jun 25, 2026

Author

@Gal-Zaidman have you been able to verify this resolves the issue in your environment?

Yes, I just ran a job with 80 concurrent agents, each running an SWE bench task with long execs (that is how harbor works), and saw zero hangs.
Before the fix, even with 20 agents, more than half would hang.

Development

Successfully merging this pull request may close these issues.

bug: ExecSandbox hangs indefinitely after the command exits when a supervisor session resets mid-exec

2 participants