[FIX]: Replace SysV init with native systemd unit to fix gunicorn duplicate-master race by x15sr71 · Pull Request #1139 · CCExtractor/sample-platform

x15sr71 · 2026-06-28T19:28:53Z

In raising this pull request, I confirm the following (please check boxes):

I have read and understood the contributors guide.
I have checked that another pull request for this purpose does not exist.
I have considered, and confirmed that this submission will be valuable to others.
I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
I give this submission freely, and claim no ownership to its content.

My familiarity with the project is as follows (check one):

I have never used the project.
I have used the project briefly.
I have used the project extensively, but have not contributed previously.
I am an active contributor to the project.

Problem

Across service restarts, two gunicorn master process groups can end up bound to sampleplatform.sock at the same time: one stale and orphaned, one serving. The orphaned group holds ~370 MB of otherwise reclaimable memory and contributes to the platform's intermittent outages on the 2 GB VM.

Root cause (and how each change addresses it)

install/platform is a SysV init script that systemd-sysv-generator wraps into a unit with Type=forking, GuessMainPID=no, KillMode=process. Combined with gunicorn's --daemon flag (which double-forks and detaches), the failure unfolds in a chain — each link is fixed by a specific change in this PR:

| # | Root-cause mechanism (observed on prod) | Fix in this PR |
|---|---|---|
| 1 | gunicorn `--daemon` double-forks and detaches, so the process systemd launched exits → systemd can't track the master | `bootstrap_gunicorn.py`: remove `--daemon`/`--pid`; the new unit runs gunicorn in the foreground |
| 2 | `GuessMainPID=no` + detached master → `systemctl show` reports `MainPID=0` (systemd has no PID to act on) | `platform.service`: `Type=simple` + foreground gunicorn → systemd records the real master PID |
| 3 | `KillMode=process` + `MainPID=0` → `systemctl stop` kills nothing; journal logs `Unit process <pid> (gunicorn) remains running after unit stopped` | `platform.service`: `KillMode=control-group` → stop reaps the entire cgroup (master + workers + git subprocesses) |
| 4 | Next start spawns a fresh master, finds orphaned workers (`Found left-over process <pid> (gunicorn)... Ignoring`), leaves both → two masters on one socket | Fixed transitively by 1–3: with a tracked master and a clean stop, no orphans survive to be stacked on |
| 5 | The init script's `stop` does `kill $(cat gunicorn.pid)`, which no-ops when the pidfile is missing (a transient condition during reexec) | `install/platform` removed entirely; supervision is now systemd's, not a pidfile's |
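
To make rows 2 and 3 concrete, here is a deliberately simplified model of which processes a stop signals under each `KillMode`. It is only a sketch for reasoning about the table: the function name is made up for illustration, and real systemd behavior (control processes, `KillMode=mixed`, SIGKILL escalation) is not modelled.

```python
def pids_signalled_on_stop(kill_mode: str, main_pid: int, cgroup_pids: set[int]) -> set[int]:
    """Toy model of which PIDs `systemctl stop` signals for a unit.

    kill_mode   -- "process" or "control-group" (other modes are not modelled)
    main_pid    -- the unit's MainPID; 0 means systemd tracks no main process
    cgroup_pids -- every PID currently in the unit's cgroup
    """
    if kill_mode == "control-group":
        # Everything left in the unit's cgroup is signalled.
        return set(cgroup_pids)
    if kill_mode == "process":
        # Only the main process is signalled; with MainPID=0 there is none.
        return {main_pid} & cgroup_pids if main_pid else set()
    raise ValueError(f"kill mode not modelled: {kill_mode!r}")


# Hypothetical PIDs: one master (100) and four workers (101-104).
cgroup = {100, 101, 102, 103, 104}
old_unit = pids_signalled_on_stop("process", 0, cgroup)
new_unit = pids_signalled_on_stop("control-group", 100, cgroup)
```

Under these assumptions the old generated unit signals nothing on stop, while the native unit signals the master and every worker.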

Evidence (production journal + app log):

systemctl show platform -p MainPID -p KillMode -p GuessMainPID on prod returned MainPID=0, KillMode=process, GuessMainPID=no — confirming systemd never tracked the master.
Across the journal, MainPID is 0 for the unit's entire retained lifetime (never non-zero).
The duplicate-creation sequence (stop → "remains running after unit stopped" → start → "Found left-over process… Ignoring") is logged on multiple dates.
The unit's cgroup was observed holding all running gunicorn processes (master + 4 workers + git children), confirming a control-group kill would reap them all.
Recurrence: ≥11 occurrences in the application error.log since at least 2026-03-10 (4 independently confirmed in the systemd journal). Triggers include any service restart — apt-daily-upgrade's systemd reexec confirmed for one event; manual restarts and an OS upgrade account for others.
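
For anyone who wants to repeat the `systemctl show` check, a minimal sketch of how its `Key=Value` output could be parsed and evaluated follows. The helper names are illustrative and not part of this PR; the sample text uses the values reported from prod above.

```python
import subprocess


def parse_show_output(text: str) -> dict[str, str]:
    """Parse the Key=Value lines printed by `systemctl show -p ...`."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def master_is_tracked(props: dict[str, str]) -> bool:
    """True when systemd has recorded a real main PID for the unit."""
    return props.get("MainPID", "0") != "0"


def read_unit_props(unit: str = "platform") -> dict[str, str]:
    """Query a live system (not exercised here; needs systemd on the host)."""
    result = subprocess.run(
        ["systemctl", "show", unit, "-p", "MainPID", "-p", "KillMode", "-p", "GuessMainPID"],
        capture_output=True, text=True, check=True,
    )
    return parse_show_output(result.stdout)


# Values as reported from production under the generated unit.
prod_props = parse_show_output("MainPID=0\nKillMode=process\nGuessMainPID=no\n")
```

With the production values, `master_is_tracked(prod_props)` is false, which matches the diagnosis that systemd never tracked the master.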

Changes

New: install/platform.service — native systemd unit (Type=simple, KillMode=control-group, foreground gunicorn).
Removed: install/platform — the SysV init script (the root-cause source the generator wrapped).
Edited: bootstrap_gunicorn.py — drop --daemon/--pid; foreground only. (Note: the unit's ExecStart invokes gunicorn directly; this script is retained for manual debugging and is no longer the deployment path.)
Edited: install/install.sh — remove the old init script + its rc.d links, install + enable the native unit, drop update-rc.d ... defaults (invalid without a SysV script).
Edited: install/installation.md — systemctl commands replace /etc/init.d/platform.
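
As a rough illustration of the `bootstrap_gunicorn.py` change, the sketch below builds a foreground command line from the launch flags listed under Testing and runs it with `subprocess.run`. It is not the file's actual contents: the `run:app` target and the function names are placeholders assumed for the example.

```python
import subprocess
from pathlib import Path


def build_gunicorn_argv(root_dir: Path, app: str = "run:app") -> list[str]:
    """Build a foreground gunicorn command line (no --daemon, no --pid).

    `app` is a placeholder WSGI target, not confirmed by this PR.
    """
    return [
        "gunicorn",
        "-w", "4",                                        # four workers
        "-b", f"unix:{root_dir / 'sampleplatform.sock'}",  # bind to the unix socket
        "-m", "007",                                      # umask, so the socket ends up 770
        "--timeout", "120",
        app,
    ]


def run_foreground(argv: list[str]) -> int:
    """Run gunicorn attached to the caller; blocks until it exits."""
    return subprocess.run(argv).returncode


argv = build_gunicorn_argv(Path("/var/www/sample-platform"))  # hypothetical install path
```

Because nothing daemonizes, whoever launches the command (a shell during manual debugging, or systemd via the unit) stays the parent of the master.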

Testing

Test bed: a physical x86_64 laptop running Ubuntu 24.04, hosting an isolated Ubuntu 24.04.4 (Noble) VM via Multipass with systemd as PID 1. The supervision tests were run inside that VM to keep them off the host and reproduce a clean, prod-faithful systemd environment.

Matched to production on every factor that governs this bug:

| Factor | Production | Test VM |
|---|---|---|
| OS | Ubuntu 24.04.4 LTS (Noble) | Ubuntu 24.04.4 LTS (Noble) |
| Arch | x86_64 | x86_64 |
| Python | 3.12 | 3.12 |
| gunicorn | 20.1.0 | 20.1.0 |
| systemd as PID 1 | yes | yes |
| Launch flags | `-w 4`, unix socket, `-m 007`, `--timeout 120` | same |

Reproduced the bug under the old SysV-generated unit: confirmed MainPID=0, KillMode=process, GuessMainPID=no — matching production exactly.

Verified the fix under the native unit:

MainPID = non-zero — systemd tracks the master.
KillMode=control-group.
Exactly 1 master + 4 workers.
5 consecutive systemctl restart cycles → exactly 5 processes every time, never accumulating duplicates.
systemctl stop → 0 gunicorn processes — the whole cgroup reaped, directly fixing the "remains running after unit stopped" failure that caused the orphans.
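
The "exactly 1 master + 4 workers" check can be expressed as a small classification over a process table, for example rows taken from `ps -eo pid,ppid,comm`. This is a sketch of the idea rather than the script used during testing; the `Proc` type and the PIDs are made up for illustration.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    comm: str


def split_masters_workers(procs: list[Proc]) -> tuple[list[int], list[int]]:
    """Classify gunicorn processes by parentage.

    A master is a gunicorn process whose parent is not gunicorn;
    a worker is a gunicorn process whose parent is gunicorn.
    """
    gunicorn = {p.pid: p for p in procs if p.comm == "gunicorn"}
    masters = sorted(pid for pid, p in gunicorn.items() if p.ppid not in gunicorn)
    workers = sorted(pid for pid, p in gunicorn.items() if p.ppid in gunicorn)
    return masters, workers


# Healthy state: one master, four workers, plus an unrelated git child.
healthy = [Proc(100, 1, "gunicorn")] + [Proc(100 + i, 100, "gunicorn") for i in range(1, 5)]
healthy.append(Proc(200, 101, "git"))

# Bug state: a second, orphaned master group stacked on the same socket.
duplicated = healthy + [Proc(300, 1, "gunicorn")] + [Proc(300 + i, 300, "gunicorn") for i in range(1, 5)]
```

In the healthy table the function reports one master; in the duplicated table it reports two, which is the condition this PR is meant to rule out.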

Additional checks:

The edited bootstrap_gunicorn.py (no --daemon, subprocess.run) was confirmed to boot gunicorn cleanly in the foreground.
The install.sh ${root_dir} path substitution was verified to rewrite every path in the unit correctly (no malformed paths).
Production uses the system gunicorn (/usr/bin/gunicorn), confirmed importable by www-data — so the unit's ExecStart path is correct for prod.
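
The path-substitution check can be illustrated with a Python analogue. The template below is an assumed shape for the unit, assembled from the settings named in this description (`Type=simple`, `KillMode=control-group`, the launch flags, `/usr/bin/gunicorn`); the `User`/`Group` lines, the `run:app` target, and the directory name are assumptions, and install.sh's real substitution mechanism is not shown here.

```python
from string import Template

# Assumed unit layout; install/platform.service in the PR is authoritative.
UNIT_TEMPLATE = Template("""\
[Unit]
Description=Sample platform (gunicorn)
After=network.target

[Service]
Type=simple
User=www-data
Group=www-data
WorkingDirectory=${root_dir}
ExecStart=/usr/bin/gunicorn -w 4 -b unix:${root_dir}/sampleplatform.sock -m 007 --timeout 120 run:app
KillMode=control-group

[Install]
WantedBy=multi-user.target
""")


def render_unit(root_dir: str) -> str:
    """Substitute root_dir into the template, rejecting unusable paths."""
    cleaned = root_dir.rstrip("/")
    if not cleaned.startswith("/"):
        raise ValueError("root_dir must be an absolute path below /")
    return UNIT_TEMPLATE.substitute(root_dir=cleaned)


def has_malformed_paths(rendered: str) -> bool:
    """Detect leftover placeholders or doubled slashes after substitution."""
    return "$" in rendered or "//" in rendered


rendered = render_unit("/var/www/sample-platform/")  # hypothetical install path
```

A trailing slash on the input is the typical source of a doubled slash, which is why it is stripped before substitution.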

Scope of testing (stated honestly):

RAM was not matched (VM had more) — irrelevant here, as this is purely systemd supervision behavior, not memory-dependent.
The supervision test ran as a non-root user in /home to isolate it from filesystem-permission noise; the www-data / /var/www socket ownership (770 www-data:www-data) and nginx integration were verified separately on the production VM.
The runtime behavior of the unit is what was validated. The install/install.sh changes were reviewed and the unit-generation step verified, but the script was not run end-to-end through a full platform install; it is applied on the live VM via the controlled cutover below.

Deployment

⚠️ This is best applied as a one-time manual cutover on the live VM, rather than a straight merge-and-auto-deploy — the running processes need handling that a deploy alone won't do.

Because the two duplicate masters are already running, a merge won't touch them, and deploying over the current state could leave the service in a bad state (a third master, or a failed/hung start). The safe sequence, once the PR is merged and the code is on the VM:

1. Stop and fully clear the old setup: stop the service, hard-kill any remaining gunicorn processes (both masters and all workers), and remove the stale sampleplatform.sock and gunicorn.pid. Confirm no gunicorn processes remain before continuing.
2. Remove the SysV init script (/etc/init.d/platform) and its rc.d links. This is essential: while the script is in place, systemd-sysv-generator can keep emitting a wrapper unit of the same name. A native unit under /etc/systemd/system should take precedence over the generated one, but a leftover script lets the old definition resurface whenever the native unit is missing or not yet loaded.
3. Install and reload the native unit: copy install/platform.service into /etc/systemd/system/, then run systemctl daemon-reload. The make-or-break check here is systemctl cat platform.service: it must show the file under /etc/systemd/system/, not "generated by systemd-sysv-generator". If it still says generated, the native unit is not the one systemd loaded; stop and recheck steps 2 and 3 first.
4. Start and verify: run systemctl enable --now platform.service, then confirm MainPID is non-zero (it was 0 under the old unit), KillMode=control-group, exactly one master + four workers, the socket is back at 770 www-data:www-data, and the app responds. A couple of systemctl restart cycles should hold at one master with no duplicates.
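
The `systemctl cat` check could be scripted along these lines. This is a sketch that assumes the output begins with a `# <path>` comment naming the unit file and that generated units mention systemd-sysv-generator in their body; the sample strings are illustrative rather than captured from the VM, and the function name is made up.

```python
def unit_is_native(cat_output: str, expected_dir: str = "/etc/systemd/system/") -> bool:
    """Return True when `systemctl cat` output points at the native unit.

    Assumes the first line is a '# <path>' comment naming the source file.
    """
    lines = cat_output.splitlines()
    if not lines:
        return False
    from_expected_dir = lines[0].startswith("# " + expected_dir)
    generated = "systemd-sysv-generator" in cat_output
    return from_expected_dir and not generated


# Illustrative outputs, not captured from the production VM.
native_sample = "# /etc/systemd/system/platform.service\n[Unit]\nDescription=Sample platform\n"
generated_sample = (
    "# /run/systemd/generator.late/platform.service\n"
    "# Automatically generated by systemd-sysv-generator\n\n[Unit]\n"
)
```

If the function returns false after the daemon-reload, the cutover should pause there, as described in the steps above.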

After this one-time cutover, subsequent deploys are clean — the native unit is in the repo, so future installs use it directly.

Note on the gunicorn path:

This unit sets ExecStart=/usr/bin/gunicorn, which matches the current production VM (verified — gunicorn is installed and importable at that path). systemd requires an absolute path in ExecStart, so the previous PATH-resolved gunicorn invocation (used by the old bootstrap_gunicorn.py) can't be carried over directly. Separately, install.sh installs the gunicorn3 package, which provides /usr/bin/gunicorn3 — so on a clean install from the script, the path may need alignment. This is a pre-existing packaging inconsistency that this PR's absolute-path requirement surfaces; it's outside the scope of the race fix, but flagging it so the canonical gunicorn install can be confirmed.
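
One way to confirm which absolute path applies on a given machine is to probe the candidates in order. The sketch below is only an illustration of that check; the function name is hypothetical and the candidate list simply reflects the two paths mentioned in this note.

```python
import os
from typing import Iterable, Optional


def first_executable(candidates: Iterable[str]) -> Optional[str]:
    """Return the first absolute path that is an executable regular file."""
    for path in candidates:
        if os.path.isabs(path) and os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


# The two paths discussed above; the result depends on the host.
gunicorn_path = first_executable(["/usr/bin/gunicorn", "/usr/bin/gunicorn3"])
```

A `None` result on a freshly installed machine would indicate that the unit's `ExecStart` needs aligning with whatever the install script actually provides.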

sonarqubecloud · 2026-06-28T19:45:56Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

x15sr71 requested review from canihavesomecoffee and thealphadollar as code owners June 28, 2026 19:28

Replace SysV init with native systemd unit to fix gunicorn duplicate-master race

f51478e

x15sr71 force-pushed the fix/gunicorn-systemd-unit branch from 555041c to f51478e Compare June 28, 2026 19:41

Add trailing newline to bootstrap_gunicorn.py

ef96a6d

x15sr71 wants to merge 2 commits into CCExtractor:master from x15sr71:fix/gunicorn-systemd-unit