[FIX]: Replace SysV init with native systemd unit to fix gunicorn duplicate-master race #1139
Open
x15sr71 wants to merge 2 commits into
In raising this pull request, I confirm the following (please check boxes):
My familiarity with the project is as follows (check one):
Problem
Under restarts, two gunicorn master groups can end up bound to `sampleplatform.sock` simultaneously: one stale/orphaned, one serving. The orphan group holds ~370 MB of otherwise-reclaimable memory and contributes to the platform's intermittent outages on the 2 GB VM.

Root cause (and how each change addresses it)
`install/platform` is a SysV init script that `systemd-sysv-generator` wraps into a unit with `Type=forking`, `GuessMainPID=no`, `KillMode=process`. Combined with gunicorn's `--daemon` flag (which double-forks and detaches), the failure unfolds in a chain, and each link is fixed by a specific change in this PR:

| Link in the chain | Change that fixes it |
|---|---|
| `--daemon` double-forks and detaches, so the process systemd launched exits → systemd can't track the master | `bootstrap_gunicorn.py`: remove `--daemon`/`--pid`; the new unit runs gunicorn in the foreground |
| `GuessMainPID=no` + detached master → `systemctl show` reports `MainPID=0` (systemd has no PID to act on) | `platform.service`: `Type=simple` + foreground gunicorn → systemd records the real master PID |
| `KillMode=process` + `MainPID=0` → `systemctl stop` kills nothing; journal logs `Unit process <pid> (gunicorn) remains running after unit stopped` | `platform.service`: `KillMode=control-group` → stop reaps the entire cgroup (master + workers + git subprocesses) |
| […] (`Found left-over process <pid> (gunicorn)... Ignoring`), leaves both → two masters on one socket | |
| `stop` does `kill $(cat gunicorn.pid)`, which no-ops when the pidfile is missing (a transient condition during reexec) | `install/platform` removed entirely; supervision is now systemd's, not a pidfile's |

Evidence (production journal + app log):
- `systemctl show platform -p MainPID -p KillMode -p GuessMainPID` on prod returned `MainPID=0`, `KillMode=process`, `GuessMainPID=no`, confirming systemd never tracked the master.
- `MainPID` is `0` for the unit's entire retained lifetime (never non-zero).
- […] `control-group` kill would reap them all.
- […] `error.log` since at least 2026-03-10 (4 independently confirmed in the systemd journal). Triggers include any service restart: `apt-daily-upgrade`'s systemd reexec is confirmed for one event; manual restarts and an OS upgrade account for others.

Changes
- `install/platform.service`: native systemd unit (`Type=simple`, `KillMode=control-group`, foreground gunicorn).
- `install/platform`: the SysV init script (the root-cause source the generator wrapped), removed entirely.
- `bootstrap_gunicorn.py`: drop `--daemon`/`--pid`; foreground only. (Note: the unit's `ExecStart` invokes gunicorn directly; this script is retained for manual debugging and is no longer the deployment path.)
- `install/install.sh`: remove the old init script and its rc.d links, install and enable the native unit, drop `update-rc.d ... defaults` (invalid without a SysV script).
- `install/installation.md`: `systemctl` commands replace `/etc/init.d/platform`.

Testing
Test bed: a physical x86_64 laptop running Ubuntu 24.04, hosting an isolated Ubuntu 24.04.4 (Noble) VM via Multipass with systemd as PID 1. The supervision tests were run inside that VM to keep them off the host and reproduce a clean, prod-faithful systemd environment.
Matched to production on every factor that governs this bug:
- `-w 4`, unix socket, `-m 007`, `--timeout 120`

Reproduced the bug under the old SysV-generated unit: confirmed `MainPID=0`, `KillMode=process`, `GuessMainPID=no`, matching production exactly.

Verified the fix under the native unit:
- `MainPID` = non-zero: systemd tracks the master.
- `KillMode=control-group`.
- […] `systemctl restart` cycles → exactly 5 processes every time, never accumulating duplicates.
- `systemctl stop` → 0 gunicorn processes: the whole cgroup is reaped, directly fixing the "remains running after unit stopped" failure that caused the orphans.

Additional checks:
- `bootstrap_gunicorn.py` (no `--daemon`, `subprocess.run`) was confirmed to boot gunicorn cleanly in the foreground.
- `install.sh` `${root_dir}` path substitution was verified to rewrite every path in the unit correctly (no malformed paths).
- […] (`/usr/bin/gunicorn`), confirmed importable by `www-data`, so the unit's `ExecStart` path is correct for prod.

Scope of testing (stated honestly):
- […] `/home` to isolate it from filesystem-permission noise; the `www-data` / `/var/www` socket ownership (`770 www-data:www-data`) and nginx integration were verified separately on the production VM.
- `install/install.sh` changes were reviewed and the unit-generation step verified, but the script was not run end-to-end through a full platform install; it is applied on the live VM via the controlled cutover below.

Deployment
Because the two duplicate masters are already running, a merge won't touch them, and deploying over the current state could leave the service in a bad state (a third master, or a failed/hung start). The safe sequence, once the PR is merged and the code is on the VM:
1. Stop and fully clear the old setup: stop the service, hard-kill any remaining gunicorn processes (both masters and all workers), and remove the stale `sampleplatform.sock` and `gunicorn.pid`. Confirm no gunicorn processes remain before continuing.
2. Remove the SysV init script (`/etc/init.d/platform`) and its rc.d links. This is essential: if it's left in place, `systemd-sysv-generator` keeps emitting a competing unit of the same name, so you'd have two definitions for `platform.service` and which one wins is non-deterministic.
3. Install and reload the native unit: copy `install/platform.service` into `/etc/systemd/system/`, then `systemctl daemon-reload`. The make-or-break check here is `systemctl cat platform.service`: it must show the file under `/etc/systemd/system/`, not "generated by systemd-sysv-generator". If it still says generated, step 2 didn't fully take; stop and fix that first.
4. Start and verify: `systemctl enable --now platform.service`, then confirm `MainPID` is non-zero (it was `0` under the old unit), `KillMode=control-group`, exactly one master + four workers, the socket is back at `770 www-data:www-data`, and the app responds. A couple of `systemctl restart` cycles should hold at one master with no duplicates.

After this one-time cutover, subsequent deploys are clean: the native unit is in the repo, so future installs use it directly.
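
The checks in the last two steps could be scripted along these lines. This is a minimal sketch, not part of this PR: the function names (`unit_is_native`, `unit_props_ok`, `count_masters`) are hypothetical, each one reads command output on stdin so it can be tried without touching the service, and it assumes gunicorn processes report the command name `gunicorn` (as the journal messages quoted in this PR suggest).

```shell
#!/bin/sh
# Hypothetical helpers for the cutover checks; each reads command output on stdin.

# stdin: output of `systemctl cat platform.service`.
# Succeeds only if the unit is loaded from /etc/systemd/system/ and nothing
# in the output mentions systemd-sysv-generator.
unit_is_native() {
    awk 'NR == 1 && $0 != "# /etc/systemd/system/platform.service" { bad = 1 }
         /systemd-sysv-generator/ { bad = 1 }
         END { exit (bad || NR == 0) }'
}

# stdin: output of `systemctl show platform.service -p MainPID -p KillMode`.
# Succeeds only if MainPID is non-zero and KillMode is control-group.
unit_props_ok() {
    awk -F= '$1 == "MainPID" && $2 + 0 > 0 { pid_ok = 1 }
             $1 == "KillMode" && $2 == "control-group" { kill_ok = 1 }
             END { exit !(pid_ok && kill_ok) }'
}

# stdin: "PID PPID COMM" lines, e.g. from `ps -eo pid=,ppid=,comm=`.
# Prints the number of gunicorn processes whose parent is not itself
# a gunicorn process, i.e. the number of masters.
count_masters() {
    awk '$3 == "gunicorn" { pid[$1] = 1; parent[$1] = $2 }
         END { n = 0; for (p in pid) if (!(parent[p] in pid)) n++; print n }'
}

# Example use on the VM (commented out so the sketch has no side effects):
#   systemctl cat platform.service | unit_is_native || echo "still generated"
#   systemctl show platform.service -p MainPID -p KillMode | unit_props_ok \
#       || echo "MainPID/KillMode not as expected"
#   [ "$(ps -eo pid=,ppid=,comm= | count_masters)" -eq 1 ] \
#       || echo "expected exactly one master"
```

Reading from stdin rather than calling `systemctl` directly keeps the helpers usable against saved output from the production journal as well as on the live VM.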
Note on the gunicorn path:
This unit sets `ExecStart=/usr/bin/gunicorn`, which matches the current production VM (verified: gunicorn is installed and importable at that path). systemd requires an absolute path in `ExecStart`, so the previous PATH-resolved gunicorn invocation (used by the old `bootstrap_gunicorn.py`) can't be carried over directly. Separately, `install.sh` installs the gunicorn3 package, which provides `/usr/bin/gunicorn3`, so on a clean install from the script, the path may need alignment. This is a pre-existing packaging inconsistency that this PR's absolute-path requirement surfaces; it's outside the scope of the race fix, but flagging it so the canonical gunicorn install can be confirmed.