[FIX]: Replace SysV init with native systemd unit to fix gunicorn duplicate-master race#1139

Open
x15sr71 wants to merge 2 commits into CCExtractor:master from x15sr71:fix/gunicorn-systemd-unit

@x15sr71 x15sr71 commented Jun 28, 2026

Contributor

In raising this pull request, I confirm the following (please check boxes):

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.

My familiarity with the project is as follows (check one):

  • I have never used the project.
  • I have used the project briefly.
  • I have used the project extensively, but have not contributed previously.
  • I am an active contributor to the project.

Problem

Under restarts, two gunicorn master groups can end up bound to sampleplatform.sock simultaneously — one stale/orphaned, one serving. The orphan group holds ~370 MB of otherwise-reclaimable memory and contributes to the platform's intermittent outages on the 2 GB VM.

Root cause (and how each change addresses it)

install/platform is a SysV init script that systemd-sysv-generator wraps into a unit with Type=forking, GuessMainPID=no, KillMode=process. Combined with gunicorn's --daemon flag (which double-forks and detaches), the failure unfolds in a chain — each link is fixed by a specific change in this PR:

| # | Root-cause mechanism (observed on prod) | Fix in this PR |
|---|---|---|
| 1 | gunicorn --daemon double-forks and detaches, so the process systemd launched exits → systemd can't track the master | bootstrap_gunicorn.py: remove --daemon/--pid; the new unit runs gunicorn in the foreground |
| 2 | GuessMainPID=no + detached master → systemctl show reports MainPID=0 (systemd has no PID to act on) | platform.service: Type=simple + foreground gunicorn → systemd records the real master PID |
| 3 | KillMode=process + MainPID=0 → systemctl stop kills nothing; journal logs "Unit process <pid> (gunicorn) remains running after unit stopped" | platform.service: KillMode=control-group → stop reaps the entire cgroup (master + workers + git subprocesses) |
| 4 | Next start spawns a fresh master, finds orphaned workers ("Found left-over process <pid> (gunicorn)... Ignoring"), leaves both → two masters on one socket | Fixed transitively by 1–3: with a tracked master and a clean stop, no orphans survive to be stacked on |
| 5 | The init script's stop does kill $(cat gunicorn.pid), which no-ops when the pidfile is missing (a transient condition during reexec) | install/platform removed entirely; supervision is now systemd's, not a pidfile's |

Evidence (production journal + app log):

  • systemctl show platform -p MainPID -p KillMode -p GuessMainPID on prod returned MainPID=0, KillMode=process, GuessMainPID=no — confirming systemd never tracked the master.
  • Across the journal, MainPID is 0 for the unit's entire retained lifetime (never non-zero).
  • The duplicate-creation sequence (stop → "remains running after unit stopped" → start → "Found left-over process… Ignoring") is logged on multiple dates.
  • The unit's cgroup was observed holding all running gunicorn processes (master + 4 workers + git children), confirming a control-group kill would reap them all.
  • Recurrence: ≥11 occurrences in the application error.log since at least 2026-03-10 (4 independently confirmed in the systemd journal). Triggers include any service restart — apt-daily-upgrade's systemd reexec confirmed for one event; manual restarts and an OS upgrade account for others.
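
The MainPID/KillMode check above can be scripted. The following is a minimal sketch, not code from this PR: it assumes the KEY=VALUE line format that `systemctl show -p ...` prints, and the helper names `parse_show` and `stop_may_leave_processes` are hypothetical.

```python
def parse_show(output: str) -> dict:
    """Parse `systemctl show -p ...` output (one KEY=VALUE per line) into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            props[key.strip()] = value.strip()
    return props


def master_is_untracked(props: dict) -> bool:
    """True when systemd has no main PID recorded for the unit."""
    return props.get("MainPID", "0") == "0"


def stop_may_leave_processes(props: dict) -> bool:
    """The failure condition described in this PR: no tracked master
    combined with KillMode=process, so a stop has nothing to signal."""
    return master_is_untracked(props) and props.get("KillMode") == "process"


# Sample text matching the values reported from prod in this PR.
sample = "MainPID=0\nKillMode=process\nGuessMainPID=no\n"
props = parse_show(sample)
print(stop_may_leave_processes(props))
```

In practice the input would come from something like `subprocess.run(["systemctl", "show", "platform", "-p", "MainPID", "-p", "KillMode"], capture_output=True, text=True).stdout`.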

Changes

  • New: install/platform.service — native systemd unit (Type=simple, KillMode=control-group, foreground gunicorn).
  • Removed: install/platform — the SysV init script (the root-cause source the generator wrapped).
  • Edited: bootstrap_gunicorn.py — drop --daemon/--pid; foreground only. (Note: the unit's ExecStart invokes gunicorn directly; this script is retained for manual debugging and is no longer the deployment path.)
  • Edited: install/install.sh — remove the old init script + its rc.d links, install + enable the native unit, drop update-rc.d ... defaults (invalid without a SysV script).
  • Edited: install/installation.md, where systemctl commands replace /etc/init.d/platform.
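
For illustration, a foreground-only launcher in the spirit of the edited bootstrap_gunicorn.py could look like the sketch below. It is not the file from this PR: the application module `run:app` and the function names are assumptions, and only the flags listed under Testing (-w 4, unix socket, -m 007, --timeout 120) are taken from this description.

```python
import subprocess


def build_gunicorn_cmd(socket_path, app="run:app", workers=4,
                       umask="007", timeout=120,
                       gunicorn="/usr/bin/gunicorn"):
    """Build a foreground gunicorn command line.

    Deliberately has no --daemon and no --pid: the process that is
    launched stays the master, so a supervisor can track it directly.
    `app` is a hypothetical module:callable name.
    """
    return [
        gunicorn,
        "-w", str(workers),
        "--bind", f"unix:{socket_path}",
        "-m", umask,
        "--timeout", str(timeout),
        app,
    ]


def run_foreground(cmd):
    """Run gunicorn and block until it exits; return its exit code."""
    return subprocess.run(cmd).returncode


# Example (not executed here):
#   raise SystemExit(run_foreground(build_gunicorn_cmd("/tmp/sampleplatform.sock")))
```

Because `subprocess.run` blocks, the launcher's lifetime equals the master's lifetime, which is the property the native unit relies on.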

Testing

Test bed: a physical x86_64 laptop running Ubuntu 24.04, hosting an isolated Ubuntu 24.04.4 (Noble) VM via Multipass with systemd as PID 1. The supervision tests were run inside that VM to keep them off the host and reproduce a clean, prod-faithful systemd environment.

Matched to production on every factor that governs this bug:

| Factor | Production | Test VM |
|---|---|---|
| OS | Ubuntu 24.04.4 LTS (Noble) | Ubuntu 24.04.4 LTS (Noble) |
| Arch | x86_64 | x86_64 |
| Python | 3.12 | 3.12 |
| gunicorn | 20.1.0 | 20.1.0 |
| systemd as PID 1 | yes | yes |
| Launch flags | -w 4, unix socket, -m 007, --timeout 120 | same |

Reproduced the bug under the old SysV-generated unit: confirmed MainPID=0, KillMode=process, GuessMainPID=no — matching production exactly.

Verified the fix under the native unit:

  • MainPID = non-zero — systemd tracks the master.
  • KillMode=control-group.
  • Exactly 1 master + 4 workers.
  • 5 consecutive systemctl restart cycles → exactly 5 processes every time, never accumulating duplicates.
  • systemctl stop → 0 gunicorn processes: the whole cgroup is reaped, directly fixing the "remains running after unit stopped" failure that caused the orphans.
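
The process-count checks above could be automated along these lines. This is a sketch under assumptions, not the procedure used for the results reported here: it assumes input in the format of `ps -eo pid=,ppid=,comm=` and that gunicorn processes show up with a command name starting with "gunicorn"; the helper names are hypothetical.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    comm: str


def parse_ps(output: str) -> list:
    """Parse `ps -eo pid=,ppid=,comm=` output into Proc records."""
    procs = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            procs.append(Proc(int(parts[0]), int(parts[1]), parts[2].strip()))
    return procs


def gunicorn_topology(procs):
    """Split gunicorn processes into (roots, workers).

    A root is a gunicorn process whose parent is not gunicorn: either a
    master, or an orphan re-parented elsewhere. A healthy service has
    exactly one root; more than one indicates stacked or orphaned groups.
    """
    g = {p.pid: p for p in procs if p.comm.startswith("gunicorn")}
    roots = sorted(pid for pid, p in g.items() if p.ppid not in g)
    workers = sorted(pid for pid, p in g.items() if p.ppid in g)
    return roots, workers


# Hypothetical listing: one master (PID 100) with four workers.
sample = """\
100 1 gunicorn
101 100 gunicorn
102 100 gunicorn
103 100 gunicorn
104 100 gunicorn
200 1 nginx
"""
roots, workers = gunicorn_topology(parse_ps(sample))
print(len(roots), len(workers))
```

Running the same check after each restart cycle, and after a stop (expecting no roots and no workers), mirrors the manual verification listed above.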

Additional checks:

  • The edited bootstrap_gunicorn.py (no --daemon, subprocess.run) was confirmed to boot gunicorn cleanly in the foreground.
  • The install.sh ${root_dir} path substitution was verified to rewrite every path in the unit correctly (no malformed paths).
  • Production uses the system gunicorn (/usr/bin/gunicorn), confirmed importable by www-data — so the unit's ExecStart path is correct for prod.
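
As an illustration of the path-substitution check, the sketch below renders a unit template and flags malformed lines. The template text is an assumption for illustration only and is not the platform.service file from this PR: the description, application module `run:app`, socket location, and User/Group lines are hypothetical; only Type=simple, KillMode=control-group, /usr/bin/gunicorn, and the launch flags come from this description. install.sh does the real substitution in shell; Python's `string.Template` stands in for it here.

```python
from string import Template

# Hypothetical unit template; ${root_dir} is replaced at install time.
UNIT_TEMPLATE = """\
[Unit]
Description=Sample Platform (gunicorn)
After=network.target

[Service]
Type=simple
User=www-data
Group=www-data
WorkingDirectory=${root_dir}
ExecStart=/usr/bin/gunicorn -w 4 --bind unix:${root_dir}/sampleplatform.sock -m 007 --timeout 120 run:app
KillMode=control-group

[Install]
WantedBy=multi-user.target
"""


def render_unit(root_dir: str) -> str:
    """Substitute root_dir, trimming a trailing slash to avoid '//' in paths."""
    return Template(UNIT_TEMPLATE).substitute(root_dir=root_dir.rstrip("/"))


def find_malformed_lines(unit_text: str) -> list:
    """Return lines with an unsubstituted placeholder or a doubled slash."""
    return [line for line in unit_text.splitlines()
            if "${" in line or "//" in line]


rendered = render_unit("/var/www/sample-platform/")
print(find_malformed_lines(rendered))
```

An empty list from `find_malformed_lines` corresponds to the "no malformed paths" result stated above.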

Scope of testing (stated honestly):

  • RAM was not matched (VM had more) — irrelevant here, as this is purely systemd supervision behavior, not memory-dependent.
  • The supervision test ran as a non-root user in /home to isolate it from filesystem-permission noise; the www-data / /var/www socket ownership (770 www-data:www-data) and nginx integration were verified separately on the production VM.
  • The runtime behavior of the unit is what was validated. The install/install.sh changes were reviewed and the unit-generation step verified, but the script was not run end-to-end through a full platform install; it is applied on the live VM via the controlled cutover below.

Deployment

⚠️ This is best applied as a one-time manual cutover on the live VM, rather than a straight merge-and-auto-deploy — the running processes need handling that a deploy alone won't do.

Because the two duplicate masters are already running, a merge won't touch them, and deploying over the current state could leave the service in a bad state (a third master, or a failed/hung start). The safe sequence, once the PR is merged and the code is on the VM:

  1. Stop and fully clear the old setup — stop the service, hard-kill any remaining gunicorn processes (both masters and all workers), and remove the stale sampleplatform.sock and gunicorn.pid. Confirm no gunicorn processes remain before continuing.

  2. Remove the SysV init script (/etc/init.d/platform) and its rc.d links. This is essential: if it's left in place, systemd-sysv-generator keeps emitting a competing unit of the same name, so you'd have two definitions for platform.service and which one wins is non-deterministic.

  3. Install and reload the native unit — copy install/platform.service into /etc/systemd/system/, then systemctl daemon-reload. The make-or-break check here is systemctl cat platform.service: it must show the file under /etc/systemd/system/, not "generated by systemd-sysv-generator". If it still says generated, step 2 didn't fully take — stop and fix that first.

  4. Start and verify: run systemctl enable --now platform.service, then confirm MainPID is non-zero (it was 0 under the old unit), KillMode=control-group, exactly one master + four workers, the socket is back at 770 www-data:www-data, and the app responds. A couple of systemctl restart cycles should hold at one master with no duplicates.

After this one-time cutover, subsequent deploys are clean — the native unit is in the repo, so future installs use it directly.
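
The make-or-break check in step 3 can be expressed as a small script. This is a sketch under assumptions: it relies on `systemctl cat` printing a `# /path/to/unit` header line before the unit contents, and on generated wrappers mentioning systemd-sysv-generator in their text. The function names and sample strings are hypothetical.

```python
def unit_origin(cat_output: str) -> str:
    """Return the file path from the first '# /path' header line, or ''."""
    for line in cat_output.splitlines():
        if line.startswith("# /"):
            return line[2:].strip()
    return ""


def is_native_unit(cat_output: str) -> bool:
    """True when the unit is loaded from /etc/systemd/system/ and the
    text carries no sysv-generator marker."""
    origin = unit_origin(cat_output)
    generated = "systemd-sysv-generator" in cat_output
    return origin.startswith("/etc/systemd/system/") and not generated


# Hypothetical outputs for the two cases described in step 3.
native = "# /etc/systemd/system/platform.service\n[Service]\nType=simple\n"
wrapped = ("# /run/systemd/generator.late/platform.service\n"
           "# Automatically generated by systemd-sysv-generator\n"
           "[Service]\nType=forking\n")
print(is_native_unit(native), is_native_unit(wrapped))
```

If the check fails during the cutover, the init script or its rc.d links are most likely still present, so step 2 should be repeated before starting the service.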

Note on the gunicorn path:

This unit sets ExecStart=/usr/bin/gunicorn, which matches the current production VM (verified — gunicorn is installed and importable at that path). systemd requires an absolute path in ExecStart, so the previous PATH-resolved gunicorn invocation (used by the old bootstrap_gunicorn.py) can't be carried over directly. Separately, install.sh installs the gunicorn3 package, which provides /usr/bin/gunicorn3 — so on a clean install from the script, the path may need alignment. This is a pre-existing packaging inconsistency that this PR's absolute-path requirement surfaces; it's outside the scope of the race fix, but flagging it so the canonical gunicorn install can be confirmed.
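
One way to confirm which binary a given host provides, sketched here as an assumption rather than as part of this PR, is to resolve the candidates on PATH and use the resulting absolute path for ExecStart. The candidate names come from the note above; `resolve_gunicorn` is a hypothetical helper.

```python
import os
import shutil


def resolve_gunicorn(candidates=("gunicorn", "gunicorn3"), which=shutil.which):
    """Return the absolute path of the first candidate found on PATH, else None.

    `which` is injectable so the lookup can be exercised without touching
    the real PATH.
    """
    for name in candidates:
        path = which(name)
        if path:
            return os.path.abspath(path)
    return None


# On a real host, resolve_gunicorn() returns whichever binary is installed.
print(resolve_gunicorn())
```

The result is host-dependent by design: on the current production VM it should be /usr/bin/gunicorn, while a clean install from install.sh may yield /usr/bin/gunicorn3.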

@x15sr71 x15sr71 force-pushed the fix/gunicorn-systemd-unit branch from 555041c to f51478e Compare June 28, 2026 19:41