Skip to content

Fix concurrent capture worker rebuild after a reaped browser#309

Open
probably-jaden wants to merge 1 commit into
mainfrom
fix/capture-worker-rebuild-loop-state
Open

Fix concurrent capture worker rebuild after a reaped browser#309
probably-jaden wants to merge 1 commit into
mainfrom
fix/capture-worker-rebuild-loop-state

Conversation

@probably-jaden

Copy link
Copy Markdown
Contributor

When the stall supervisor kills a wedged browser process, the worker's dead sync-Playwright instance leaves its asyncio loop registered as running on that thread. The rebuild's fresh `sync_playwright().start()` then refuses with "Playwright Sync API inside the asyncio loop", the exception propagates out of the worker, and the whole capture run dies.

Fix: capture the thread's loop-state baseline when a worker starts, and restore it (`asyncio.events._set_running_loop`) after abandoning a dead fetcher, so the rebuilt fetcher can start cleanly. The dead loop is deliberately leaked; its browser processes are already swept by the reaper. Includes regression tests for the rebuild path (browser death mid-run, and a failed rebuild no longer killing the run).

Observed in practice: any capture run large enough to hit one hung page (>~200 URLs at concurrency 5) crashed within minutes; with this fix the same runs complete.

🤖 Generated with Claude Code

When the supervisor kills a stalled browser, the worker's rebuild
(sync_playwright().start() via the fetcher factory) raised "Playwright
Sync API inside the asyncio loop" and killed the whole run: the dead
instance's event loop stays registered as running on the worker thread,
and _close_quietly can't clear it because Playwright teardown is
thread-affine.

- pipeline: reset the worker thread's running-loop marker to its
  pre-fetcher baseline before rebuilding (and at worker teardown),
  abandoning the poisoned loop instead of touching thread-affine
  Playwright internals.
- pipeline: a failed rebuild no longer propagates out of the worker;
  the URL keeps its error outcome and the next URL retries the rebuild.
- playwright_fetcher: __exit__ still calls playwright.stop() when
  browser.close() fails on an already-dead browser, so same-thread
  teardown un-registers the loop properly.
- tests: fake sync-Playwright-alike fetcher reproducing the poisoned
  thread state; both new tests fail on the previous code.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant