fix: stop codex harness retry loop draining tokens on exhausted rate-limit reconnects#41385
fix: stop codex harness retry loop draining tokens on exhausted rate-limit reconnects#41385Copilot wants to merge 5 commits into
Conversation
…limit reconnects Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #41385 does not have the 'implementation' label and has 0 new lines of code in business logic directories (threshold: 100). |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ Test Quality Sentinel completed test quality analysis. |
There was a problem hiding this comment.
Pull request overview
This PR adjusts the Codex CLI harness retry logic to avoid wasting tokens/time when Codex hits a TPM rate limit and exhausts its own internal reconnect attempts, and to align Codex’s non-retryable guard conditions with other harnesses.
Changes:
- Expand Codex rate-limit detection to also match the human-readable
"Rate limit reached ..."message format. - Stop harness-level retries when Codex output shows both a rate-limit error and an exhausted internal reconnect sequence (
Reconnecting... N/N). - Treat
max_runs_exceededas a non-retryable guard condition for Codex, matching existing behavior in other harnesses, and add/extend tests for these scenarios.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/codex_harness.cjs | Expands rate-limit matching and adds an “exhausted reconnects + rate limit” non-retryable break, plus maxRunsExceeded handling. |
| actions/setup/js/codex_harness.test.cjs | Updates retry-policy tests to cover the new rate-limit string format, exhausted reconnect detection, and max_runs_exceeded. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 1
| const RATE_LIMIT_ERROR_PATTERN = /rate_limit_exceeded|429 Too Many Requests|RateLimitError|Rate limit reached/i; | ||
| const SERVER_ERROR_PATTERN = /InternalServerError|ServiceUnavailableError|500 Internal Server Error|503 Service Unavailable/i; |
There was a problem hiding this comment.
Updated in commit 8f2536b: the test helper now calls isRateLimitError and isServerError directly, so the retry behavior tests stay aligned with production detection logic.
🧪 Test Quality Sentinel Report✅ Test Quality Score: 97/100 — Excellent
📊 Metrics & Test Classification (10 tests analyzed)
Go: 0 ( Scoring breakdown:
Verdict
References: §28151305743
|
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /diagnose and /tdd — approving with minor suggestions.
📋 Key Themes & Highlights
Root-Cause Analysis (/diagnose)
The fix correctly targets the actual failure: the harness was pattern-matching too narrowly and re-running sessions that had already drained their TPM budget. Both changes address the root cause:
- Pattern expansion:
Rate limit reachedcovers the human-readable reconnect-error format Codex emits in addition to the three existing machine-readable forms. - Exhausted-reconnect gate: the
RECONNECT_EXHAUSTED_PATTERNbackreference (\d+\/\1) is an elegant, correct way to distinguish "last attempt" from "intermediate attempt" without parsing the counter fraction explicitly. maxRunsExceededparity: thecodex_harnesswas silently ignoring a guard thatclaude_harnessalready handled — good catch.
Test Coverage (/tdd)
Seven focused tests were added covering: new rate-limit format (with and without exhausted reconnects), reconnect-exhaustion alone without rate-limit (still retryable), multi-digit N/N, N/M mismatch, and maxRunsExceeded. The matrix is solid.
Minor gaps flagged inline:
isRateLimitErrorunit-test suite — not extended with the new "Rate limit reached for" variant; only covered via the integrationshouldRetryhelper.RATE_LIMIT_ERROR_PATTERNduplication — the test helper re-declares the regex locally rather than calling the importedisRateLimitError; future pattern changes need two edits.- Pattern specificity —
Rate limit reachedis slightly broader than needed;Rate limit reached forwould anchor to the observed Codex message format.
Positive Highlights
- ✅ Backreference regex is ReDoS-safe (no nested quantifiers)
- ✅ Guard ordering is correct:
maxRunsExceededchecked before reconnect-exhaustion, so no ambiguous overlap - ✅ New non-retryable path logs clearly, matching the style of other
breaksites - ✅ PR description is exemplary — concrete reproduction, exact log lines, and a commented code snippet explaining the regex semantics
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 68.1 AIC · ⌖ 8.78 AIC · ⊞ 6.5K
| function shouldRetry(result, attempt) { | ||
| if (result.exitCode === 0) return false; | ||
| const RATE_LIMIT_ERROR_PATTERN = /rate_limit_exceeded|429 Too Many Requests|RateLimitError/i; | ||
| const RATE_LIMIT_ERROR_PATTERN = /rate_limit_exceeded|429 Too Many Requests|RateLimitError|Rate limit reached/i; |
There was a problem hiding this comment.
[/tdd] shouldRetry re-declares RATE_LIMIT_ERROR_PATTERN as a local constant instead of calling the imported isRateLimitError. This is a latent drift risk: if the production pattern changes again, the update must be made in both places.
💡 Suggestion
Replace lines 412 and 419 with:
const isRateLimit = isRateLimitError(result.output);This keeps the helper honest — it tests the same function that main() actually calls, and the pattern is maintained in exactly one place.
There was a problem hiding this comment.
Applied in commit 8f2536b: shouldRetry now uses isRateLimitError(result.output) instead of a local rate-limit regex copy.
| expect(shouldRetry(result, 0)).toBe(false); | ||
| }); | ||
|
|
||
| it("retries on rate limit with format 'Rate limit reached for' without exhausted reconnects", () => { |
There was a problem hiding this comment.
[/tdd] The new "Rate limit reached for ..." format is tested here through the shouldRetry integration helper, but the dedicated isRateLimitError unit-test block (lines 88–110) was not extended to cover this variant. That block already has cases for each of the other three alternatives.
💡 Suggested addition to the isRateLimitError describe block
it("returns true for 'Rate limit reached for' human-readable message", () => {
expect(
isRateLimitError(
"Rate limit reached for gpt-4o-mini in organization org-xxx on tokens per min (TPM): " +
"Limit 200000, Used 166655, Requested 35398. Please try again in 615ms."
)
).toBe(true);
});Keeping each alternative in the unit test makes the contract of isRateLimitError self-documenting and ensures regression detection stays close to the function under test.
There was a problem hiding this comment.
Added in commit 8f2536b: the isRateLimitError unit-test block now includes a dedicated case for the human-readable Rate limit reached for ... on tokens per min message.
| // ("429 Too Many Requests"), the client-side exception class ("RateLimitError"), and | ||
| // the human-readable message Codex emits inside "Reconnecting..." / error lines: | ||
| // "Rate limit reached for <model> in organization <org> on tokens per min (TPM): ..." | ||
| const RATE_LIMIT_ERROR_PATTERN = /rate_limit_exceeded|429 Too Many Requests|RateLimitError|Rate limit reached/i; |
There was a problem hiding this comment.
[/diagnose] Rate limit reached is a more general phrase than the other three alternatives, which are all tied to specific OpenAI error codes or HTTP verbs. There is a small risk that a non-rate-limit message (e.g., from a proxy or another API) containing this phrase triggers a false positive, causing the harness to back off unnecessarily.
💡 Consider a tighter anchor
The Codex-emitted format appears to be Rate limit reached for <model>. Anchoring on Rate limit reached for rather than just Rate limit reached reduces the false-positive surface while still covering all observed messages:
const RATE_LIMIT_ERROR_PATTERN =
/rate_limit_exceeded|429 Too Many Requests|RateLimitError|Rate limit reached for /i;The trailing space ensures the phrase is followed by a model name, not just any word. Low priority — the current pattern is correct for the observed Codex format — but worth noting.
There was a problem hiding this comment.
Implemented in commit 8f2536b: the production pattern is now tightened to Rate limit reached for .+ on tokens per min to reduce false positives from generic prose.
There was a problem hiding this comment.
REQUEST_CHANGES — two correctness issues and one test coverage gap must be addressed
The maxRunsExceeded addition and RECONNECT_EXHAUSTED_PATTERN backreference logic are sound. However there are two blocking issues with the rate-limit classification and one test design problem that must be addressed before merge.
Blocking findings
1. Rate limit reached is too broad for a phrase used in non-retry suppression (harness line 66)
Plain-English phrases like "Rate limit reached" appear in application code, docs, and prompts agents routinely touch. Adding this phrase to a pattern whose false positives stop retrying — not just add retries — is disproportionately risky. Should be anchored to the specific provider message structure (Rate limit reached for .+ on tokens per min).
2. Hard-stopping on exhausted reconnects discards the backoff window (harness line 505)
The harness has INITIAL_DELAY_MS / MAX_DELAY_MS backoff expressly to let transient limits expire before a fresh run. By breaking unconditionally, we discard that recovery path. For the 15-min session scenario in the PR description the TPM tokens from the first 14 minutes are outside the rolling window by the time a retried attempt reaches the same point. Stopping after exactly 0 further attempts converts timing-based failures into permanent failures. At minimum this tradeoff needs documentation; ideally it allows exactly one more attempt after a long delay.
3. Test helper copies pattern constants instead of calling exported helpers (test line 412)
After this PR both copies are in sync, but the sync is manual. Subsequent pattern changes will silently leave the tests testing stale logic. The helper should use the exported isRateLimitError() / isServerError() functions.
🔎 Code quality review by PR Code Quality Reviewer · 99.6 AIC · ⌖ 7.54 AIC · ⊞ 5.2K
| // ("429 Too Many Requests"), the client-side exception class ("RateLimitError"), and | ||
| // the human-readable message Codex emits inside "Reconnecting..." / error lines: | ||
| // "Rate limit reached for <model> in organization <org> on tokens per min (TPM): ..." | ||
| const RATE_LIMIT_ERROR_PATTERN = /rate_limit_exceeded|429 Too Many Requests|RateLimitError|Rate limit reached/i; |
There was a problem hiding this comment.
Rate limit reached broadens the pattern to match arbitrary task output, enabling false rate-limit classification that silently suppresses retries.
💡 Details and suggested fix
The previous alternatives (rate_limit_exceeded, 429 Too Many Requests, RateLimitError) are structured tokens — an OpenAI error type field, an HTTP status string, a client exception class name — that are extremely unlikely to appear in agent-generated content. The new phrase Rate limit reached is common English and will appear whenever an agent works on code, scripts, prompts, or docs that discuss rate limiting.
Because isRateLimit feeds the new exhausted-reconnect guard (isRateLimit && isReconnectExhaustedError(result.output)), a false positive here silently converts a retryable transient failure into a permanent non-retry. The operator sees a confusing log line ("rate-limit with exhausted reconnects") for a failure that had nothing to do with rate limits.
Suggested fix — anchor to the actual provider TPM message format:
// Too broad — matches arbitrary prose:
const RATE_LIMIT_ERROR_PATTERN =
/rate_limit_exceeded|429 Too Many Requests|RateLimitError|Rate limit reached/i;
// Anchored to the specific OpenAI TPM phrase codex emits:
const RATE_LIMIT_ERROR_PATTERN =
/rate_limit_exceeded|429 Too Many Requests|RateLimitError|Rate limit reached for .+ on tokens per min/i;This pins the match to Rate limit reached for <model> in organization <org> on tokens per min (TPM):, which won't appear in typical task output.
There was a problem hiding this comment.
Addressed in commit 8f2536b by narrowing the new human-readable match to Rate limit reached for .+ on tokens per min, keeping detection specific to the provider TPM format.
| // rate-limit error. Each reconnect attempt immediately failed with the same limit, | ||
| // so a fresh harness run will encounter the same rate-limit at the same point in the | ||
| // session and drain the token budget further without making progress. | ||
| if (isRateLimit && isReconnectExhaustedError(result.output)) { |
There was a problem hiding this comment.
Hard-stopping on exhausted reconnects is permanent — it discards the backoff delay that could have allowed the TPM window to roll over.
💡 Details
The harness's exponential backoff (INITIAL_DELAY_MS = 5000, max MAX_DELAY_MS = 60000) between attempts was designed precisely to let transient limits expire before a fresh run. OpenAI TPM windows are 1-minute rolling windows. By the time a first codex attempt finishes (~15 min per the PR description), runs 60 s of backoff, and reaches attempt 2, the earliest tokens consumed are well outside the window — the token budget has substantially rolled over.
Attempt 1 — t=0: runs 15 min, hits exhausted reconnect at t=15 min
break → harness exits entirely
← NEW BEHAVIOR: all backoff opportunity is discarded
vs the old behavior (wrongly retrying three times) vs the better behaviour (retry once after max backoff):
Attempt 1 — t=0: runs 15 min, hits rate-limit with exhausted reconnect
Backoff: t=15: wait MAX_DELAY_MS (60 s) — TPM window rolls over
Attempt 2 — t=16: fresh run, window is partially cleared, succeeds (or fails cleanly)
Suggested alternative: instead of break, apply a fixed long-delay sleep (e.g., 2× the TPM window = 120 s) then allow exactly one more attempt, rather than zero. If that also exhausts reconnects, then break.
At minimum, document the tradeoff: if the failure is truly non-retryable (rate limit exceeds session budget regardless of timing), the current change is correct. If the failure is timing-based (instantaneous limit with rolling window), the current change silently discards the recovery path.
There was a problem hiding this comment.
Kept current behavior intentionally. This PR targets the token-drain failure mode where exhausted reconnects repeatedly consume budget without recovery. We now stop only when both conditions are present (rate-limit + exhausted N/N reconnects), while keeping normal transient retries in place for non-exhausted reconnects and other transient failures.
| function shouldRetry(result, attempt) { | ||
| if (result.exitCode === 0) return false; | ||
| const RATE_LIMIT_ERROR_PATTERN = /rate_limit_exceeded|429 Too Many Requests|RateLimitError/i; | ||
| const RATE_LIMIT_ERROR_PATTERN = /rate_limit_exceeded|429 Too Many Requests|RateLimitError|Rate limit reached/i; |
There was a problem hiding this comment.
Test helper re-defines RATE_LIMIT_ERROR_PATTERN locally instead of using the exported isRateLimitError(), so tests can pass while production logic silently diverges.
💡 Details and suggested fix
The shouldRetry test helper at this line owns a private copy of RATE_LIMIT_ERROR_PATTERN (and SERVER_ERROR_PATTERN). This PR just updated both the module constant and the test copy simultaneously — but that discipline is entirely manual. The next time someone extends the module's pattern, tests will stay green even though the helper's pattern no longer matches production behaviour.
The module already exports isRateLimitError and isServerError. The helper should delegate to those:
function shouldRetry(result, attempt) {
if (result.exitCode === 0) return false;
// ← no local RATE_LIMIT_ERROR_PATTERN / SERVER_ERROR_PATTERN needed
if (attempt === 0 && isAuthenticationFailedError(result.output)) return false;
if (isMissingApiKeyError(result.output)) return false;
if (hasNumerousPermissionDeniedIssues(result.output)) return false;
const nonRetryableGuard = detectNonRetryableHarnessGuard(result.output);
if (nonRetryableGuard.aiCreditsExceeded || nonRetryableGuard.awfAPIProxyBlockingRequests ||
nonRetryableGuard.goalAlreadyActive || nonRetryableGuard.maxRunsExceeded) return false;
const isRateLimit = isRateLimitError(result.output); // ← use export
if (isRateLimit && isReconnectExhaustedError(result.output)) return false;
const isTransient = isRateLimit || isServerError(result.output); // ← use export
return attempt < MAX_RETRIES && (result.hasOutput || isTransient);
}This makes the test helper a true adapter over the same decision functions production uses, closing the divergence risk.
There was a problem hiding this comment.
Done in commit 8f2536b: the helper no longer re-defines rate-limit/server patterns and now delegates to exported isRateLimitError and isServerError functions.
|
@copilot run pr-finisher skill |
|
@copilot review all comments and address unresolved review feedback.
|
|
@copilot please rerun checks on the updated branch and post a short blocker summary if anything still fails.
|
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
@copilot review all comments and address unresolved review feedback.
|
|
Please summarize the remaining blockers and the next smallest step to get this PR mergeable.
|
When a codex session hit a TPM rate limit, its internal reconnect loop exhausted all 5 attempts (
Reconnecting... 1/5→5/5), then exited non-zero withhasOutput=true. The harness didn't recognise"Rate limit reached for gpt-4o-mini..."as a rate-limit error (pattern only matchedrate_limit_exceeded/429 Too Many Requests/RateLimitError), so it classified the failure as "partial execution" and retried 3 more times — each consuming a full TPM budget, running for ~15 min/attempt, totalling 63+ minutes.Changes
Expand
RATE_LIMIT_ERROR_PATTERN— addsRate limit reachedto catch the human-readable format codex emits in reconnect/error JSON lines.Detect exhausted reconnects as non-retryable — new
RECONNECT_EXHAUSTED_PATTERNuses a backreference to matchReconnecting... N/N(both numbers identical = last attempt, e.g.5/5) while ignoring intermediate attempts (1/5,3/5). When this appears together with a rate-limit error, the harness stops retrying:Add
maxRunsExceededto codex non-retryable guard —claude_harness.cjsalready checked this;codex_harness.cjswas missing it.pr-sous-chef: requested branch update for review freshness.
Run: https://github.com/github/gh-aw/actions/runs/28152194350