Skip to content

feat(email): durable email outbox with push-based retry consumer#327

Merged
hhvrc merged 15 commits into
developfrom
feat/email-outbox-retry-queue
Jun 30, 2026
Merged

feat(email): durable email outbox with push-based retry consumer#327
hhvrc merged 15 commits into
developfrom
feat/email-outbox-retry-queue

Conversation

@hhvrc

@hhvrc hhvrc commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

What this does

Transactional emails (account activation, password reset, email-change verification, email-change notice) were sent inline during the HTTP request. If the mail provider was down, the process restarted, or the network blipped, the email was simply lost - with no record and no way to resend.

This PR makes email delivery durable and retryable via a transactional outbox delivered by Hangfire: the request handler writes an EmailOutboxMessage row in the same transaction as the business change and commits; the Cron host picks it up and hands it to Hangfire, which sends it and retries until it succeeds or is exhausted. Rows are never auto-deleted, so failures stay visible and resendable.

The split is deliberate: the outbox row owns transactional integrity and the audit fact ("this user is owed this email, here is its delivery state"); Hangfire owns the delivery machinery (retry back-off, persistent scheduling, crash requeue, and an operator dashboard) instead of a hand-rolled worker.

How it works

  1. Enqueue, don't send. AccountService creates the request row (e.g. UserPasswordReset) and an EmailOutboxMessage (status Pending) in one transaction, then publishes a Redis notification. The API host does nothing else - it never sends.
  2. Hand-off in Cron. EmailOutboxConsumer (a coalescing background service in the Cron host) is woken by the Redis notification (email-outbox-pending), with a 1-minute sweep as a safety net for any missed notification. The push and the sweep share one capacity-1 wake signal, so they can never fire on top of each other. It claims Pending rows with FOR UPDATE SKIP LOCKED and enqueues a Hangfire job per email, flipping each row to Queued.
  3. Hangfire delivers. EmailOutboxJob.SendAsync(id) carries [AutomaticRetry(Attempts = 10)]: a transient failure throws so Hangfire reschedules it on its built-in back-off curve; a permanent/skip outcome marks the row Failed; success marks it Sent. On the final attempt (detected via PerformContext.RetryCount) the row is marked Failed too, so an exhausted message never stays stuck looking Queued. Crash mid-send is recovered by Hangfire's requeue. The precise attempt count + per-retry history live in the Hangfire dashboard.
  4. Single send path. IEmailOutboxDispatcher is the only place email is sent. It maps the message type + dynamic jsonb payload to a template, mints the secret token lazily at send time (re-minted on every resend), and classifies the result. IEmailService returns EmailSendResult (Sent / Transient / Permanent) instead of throwing.

Security note

Because a reset/verification link can't be reconstructed from a stored hash, tokens are minted at send time and only the hash is persisted. The queue therefore never contains a usable link, and a resend produces a fresh token - the desired behaviour for security links anyway. Existing token tables are unchanged: the hash is seeded on insert and rotated on send.


Why the diff is large (and what's actually in it)

52 files, +2996 / −339. The line count is dominated by generated code and a project move, not new logic:

Category Lines Notes
EF migration + model snapshot (autogenerated) ~1,743 (~58%) dotnet ef migrations add output; the bulk is the Designer/snapshot EF rewrites wholesale. Not hand-written, not reviewed line-by-line.
Email stack moved API → Cron large Providers, templates, dispatcher, options, and .liquid files relocated so sending lives in the Cron host. Mostly moves (namespace-only changes); a few git-detected renames, the rest paired add/delete.
New delivery logic ~280 EmailOutboxConsumer (159) + EmailOutboxJob (120) - the real surface area to review.
Tests / docs - EmailOutboxMessageTests; XML docs explaining the design so it's self-documenting.

So the substantive new code is the ~280-line consumer + Hangfire job; the rest is generated schema, a project relocation, the outbox entity/enums, and documentation.


Outbox + Hangfire (the design)

This is a transactional outbox with Hangfire as the relay/executor - the two layers own different things and are combined on purpose:

  • The outbox row owns transactional integrity + the audit fact. The email intent commits in the same transaction as the change that caused it, so there's no window where the business row commits but the email is lost, or where an email is scheduled for a change that then rolls back. The row is also the domain answer to "is this user owed this email, and what's its state" - one query gives type / status / last error / sent-or-failed.
  • Hangfire owns the delivery machinery. Once the consumer hands a row off, Hangfire provides the retry back-off, the persistent retry schedule, crash requeue, and the dashboard (with per-job history and one-click requeue) - the parts a bespoke worker would otherwise reinvent. This also keeps email delivery consistent with how the rest of the codebase already runs background work (Hangfire in the Cron host).

The API host stays free of email-send work entirely; it only writes the row and rings the bell.


Scope / roadmap

One self-contained PR. Follow-ups (now trivial - they just insert a row): admin "send any email type", admin "list email types", and a Cron prune job for old Sent rows.

Deployment note: the Cron host now sends email, so it needs the OpenShock:Mail config section (and, for a real provider, Mailjet/SMTP credentials). The API host no longer needs that config or the templates.

Testing

  • Full solution builds clean.
  • Migration regenerated via tooling (then renamed back to its original timestamp to keep the diff a modification, not a delete + re-add); model has no pending changes; existing token tables untouched.
  • Common.Tests green (265 passing).
  • Existing Docker-based MailTests are consistent with async delivery (they already wait for mail, parse the minted token, expect the same link formats) but were not run in this environment.

Replace inline transactional email sends with a database-backed outbox so
unsent mail is never lost and can be retried/resent.

- Add EmailOutboxMessage (Common): jsonb payload, email_type/email_status
  pg enums, three states (Sending/Sent/Failed), attempt/lease/backoff fields.
- EmailOutboxWorker (API hosted service): push-triggered via Redis pub/sub
  with a 15s poll fallback; claims rows with FOR UPDATE SKIP LOCKED so it is
  safe in every replica; exponential backoff + jitter, retries on 5xx/429/
  transport errors. EmailOutboxRetryPolicy holds the pure, tested state machine.
- IEmailOutboxDispatcher: single send path; mints each token lazily at send
  time (re-minted on resend) so the queue never stores a usable link.
- IEmailService now returns EmailSendResult (Sent/Transient/Permanent) and
  no longer throws on provider failures; Mailjet/SMTP/None updated.
- AccountService: activation, OAuth activation, password reset and email
  change now enqueue durable outbox rows (token hash seeded at insert,
  re-minted at send) and notify the consumer instead of sending inline.
- Migration AddEmailOutbox (tooling-generated); existing token tables unchanged.
- Tests: EmailOutboxRetryPolicyTests (10), EmailOutboxMessageTests (2).
@stage-review

stage-review Bot commented Jun 30, 2026

Copy link
Copy Markdown

@hhvrc hhvrc self-assigned this Jun 30, 2026
@hhvrc hhvrc requested a review from LucHeart June 30, 2026 10:53
Comment thread API/Services/Email/Outbox/EmailOutboxWorker.cs Fixed
Comment thread API/Services/Email/Outbox/EmailOutboxWorker.cs Fixed
Comment thread Cron/Services/Email/Smtp/SmtpEmailService.cs Fixed
Comment thread Cron/Services/Email/Smtp/SmtpEmailService.cs Fixed
hhvrc added 2 commits June 30, 2026 13:18
CI caught two runtime failures (build passed, inserts 500'd):
- Postgres enum labels for email_type must be snake_case to match Npgsql's
  default member translator (e.g. password_reset), not camelCase.
- Dictionary<string,string> -> jsonb needs Npgsql dynamic JSON; enable it via
  ConfigureDataSource(EnableDynamicJson) rather than a hand-rolled converter.
Both verified against a real Postgres and the full MailTests suite (16/16).

Hardening from a self-review pass:
- Redis enqueue notification is now best-effort: a publish failure can no
  longer fail a request whose outbox row already committed (poll delivers it).
- Deduplicate the password-reset / email-change validity checks in the
  dispatcher into a shared helper.
- Fix back-off overflow for large attempt counts (cap in seconds before
  converting to TimeSpan).
- Correct stale doc comments (jsonb via dynamic JSON; token hash seeded then
  re-minted, not nullable).

Migration keeps its original id (only the enum labels changed).
…tom worker

Move email sending out of the API host and into the Cron host, and hand the
delivery machinery (retry back-off, scheduling, crash recovery, dashboard) to
Hangfire rather than the bespoke worker.

- Delete EmailOutboxWorker (polling loop, FOR UPDATE SKIP LOCKED claim, lease)
  and EmailOutboxRetryPolicy (+ test); sending is now a per-email Hangfire job
  EmailOutboxJob.SendAsync with [AutomaticRetry]. Transient failure throws so
  Hangfire retries; the job marks the row Failed on its last attempt (detected
  via PerformContext.RetryCount) so an exhausted row never stays stuck Queued.
- Keep the transactional outbox: the API writes the Pending row in the business
  transaction and publishes the Redis nudge, nothing more.
- Move the whole email stack (providers, templates, dispatcher, options, .liquid
  files) API -> Cron so no ASP.NET/DI logic or MailKit/Fluid lands in Common.
- New coalescing EmailOutboxConsumer in Cron: Redis push + 1-minute safety sweep
  share one wake signal, claim Pending rows, and enqueue Hangfire jobs.
- EmailStatus -> Pending, Queued, Sent, Failed; drop next_attempt_at,
  attempt_started_at, attempt_count (Hangfire owns retry state). Migration
  updated in place; anti-Hangfire docs rewritten for the hybrid design.
Comment thread Cron/Services/Email/Outbox/EmailOutboxJob.cs Fixed
Comment thread Cron/Services/Email/Outbox/EmailOutboxJob.cs Fixed
Comment thread Cron/Services/Email/Smtp/SmtpEmailService.cs Fixed
Comment thread Cron/Services/Email/Smtp/SmtpEmailService.cs Fixed
hhvrc added 3 commits June 30, 2026 14:25
…e tests

The API integration MailTests asserted real SMTP delivery, but delivery moved
to the Cron host, so the API-only test host no longer sent anything and they
failed. Boot the Cron pipeline in-process alongside the API (shared
Postgres/Redis/Mailpit) so the outbox is actually drained and delivered.

- WebApplicationFactory boots cronhost::Program (aliased to avoid the Program
  clash); its consumer + Hangfire deliver the rows the API enqueues.
- Register FrontendOptions in Cron/Program (the dispatcher builds links with it).
- Make Hangfire's QueuePollInterval configurable: production keeps the 15s
  default (no extra DB load); the test sets 0.5s so jobs land in the wait window.
- Copy Cron's Liquid templates next to the test binaries (relative-path load).
- Add CronOutboxQueueTests: drive the outbox job's state machine through each
  case (Sent / Skipped / Permanent / Transient-retry / already-terminal no-op).
…batch delivery

Make email_outbox the single source of truth for delivery state (status,
attempt_count, next_attempt_at, last_error) instead of handing retry and
scheduling to Hangfire. The delivery job claims due rows with FOR UPDATE
SKIP LOCKED, leases them (Sending), and records the outcome on the row.

- Replace the per-email Hangfire job + [AutomaticRetry] and the custom
  BackgroundService poll loop with EmailOutboxDeliveryJob ([CronJob] every
  minute) + EmailOutboxNotificationListener (enqueues it on the Redis pending
  nudge for instant delivery). No background polling loop.
- Add Sending/Skipped states; Skipped is a terminal no-op kept distinct from
  Failed so operators never requeue it.
- Self-owned retry: EmailOutboxRetryPolicy (capped-exponential backoff, 11
  attempts = parity with the old AutomaticRetry) plus a delivery lease so a
  crashed send is reclaimed.
- xmin optimistic-concurrency token so a lapsed-lease reclaim can't clobber
  attempt count / terminal state; per-message reload + isolation so one
  failure can't abort the batch; drain loop for backlogs.
- Pure EmailOutboxStateMachine unit-tested in new Cron.Tests project; DB
  round-trip / enum-mapping coverage in EmailOutboxPersistenceTests.
Comment thread Cron/Jobs/EmailOutboxDeliveryJob.cs Dismissed
Comment thread Cron/Jobs/EmailOutboxDeliveryJob.cs Dismissed
hhvrc and others added 9 commits June 30, 2026 17:00
…deterministic

WebApplicationFactory stops the Cron host at WebApplication.Build(), so the
recurring [CronJob] email-outbox sweep registered after Build() in Cron/Program.cs
never runs under tests. That left email-flow integration tests depending solely on
the one-shot Redis "pending" nudge with no periodic fallback, so mail was never
drained inside the Mailpit wait window and 8 flow tests failed.

Run the same EmailOutboxDeliveryJob.Execute() the production sweep runs, on a fast
250ms loop in the test host. It is concurrency-safe (FOR UPDATE SKIP LOCKED + xmin),
so it composes with the still-live notification listener.
Add an opaque per-row coalesce key so only the newest pending request of a kind is
delivered (account activation, password reset, email-change verification), while the
email-change notice keeps delivering every occurrence.

- EmailOutboxMessage.CoalesceKey (nullable) + coalesce_key column/index; the domain
  stamps the key via EmailOutboxCoalesceKeys (activation/pwreset/emailchange:{userId}).
- Delivery job skips a row as Skipped("Superseded by a newer request") when a strictly
  newer sibling shares its key; null key never coalesces. Fixes stale redrive double-sends
  and out-of-order delivery without requiring delivery ordering.
- Per-type EmailOutboxMessage factories (ForAccountActivation/ForPasswordReset/
  ForEmailVerification/ForEmailChangeNotice) own the (type, payload key, coalesce key)
  triple so call sites can't drift; AccountService call sites collapsed to one line each.
- Rewrite the two sibling-request mail tests to assert newest-wins delivery (older row
  Skipped) and preserve completion-time stamp-rotation invalidation via atomic DB seeding.
- Document the outbox design in docs/email-outbox-design.md.
…n instead of xmin

Guard the deferred terminal write with attempt_count (already incremented on every claim)
instead of the row's xmin: mark it IsConcurrencyToken so the write only conflicts on an
actual lease reclaim, with no false-conflict surface from unrelated updates to the row.

The AddEmailOutbox migration is regenerated with the EF tool (MigrationOpenShockContext)
rather than hand-edited, so it now reflects the current model (coalesce_key column + index,
attempt_count fencing token, no xmin shadow property) as tool-generated output.
…rocess Cron

The API's job for transactional email is to atomically write the EmailOutboxMessage row;
it never sends. Booting the Cron host + a delivery pump inside the API tests was wrong and
caused races (a background drain loop mutating rows that schema/predicate tests seed, and
two executors racing the coalescing).

- WebApplicationFactory no longer boots the Cron host, SMTP server, or a delivery loop -
  just the API + Postgres + Redis.
- MailTests now assert the correct outbox row is enqueued (type, recipient, coalesce key,
  payload) or that nothing is enqueued for rejected requests; the emailed-link end-to-end
  flows and newest-wins coalescing move to Cron.IntegrationTests.
- Drop the Cron project reference, SmtpTemplates copy, and the now-unused Mailpit fixtures.
The delivery behaviour removed from the API tests now lives in a Cron-side integration project
that boots the real Cron host against throwaway Postgres / Redis / Mailpit containers. Delivery
is driven deterministically - tests seed outbox rows then call RunDeliveryAsync() to run one pass
of the real EmailOutboxDeliveryJob, so nothing races the assertions.

Covers: password-reset delivery with lazy token minting (the emailed token verifies against the
stored hash), the self-contained change notice, and newest-wins coalescing (older sibling Skipped,
newest Sent, exactly one email). The Cron host doesn't migrate on boot, so the factory applies
migrations to the fresh database first.
…ML-doc warning

The coalescing test seeded both reset rows in one transaction, so they shared a created_at
(Postgres CURRENT_TIMESTAMP is transaction-start time). The tie fell through to a UUIDv7 id
comparison, whose Postgres ordering doesn't match .NET creation order, so "newest" came out
inverted and the newer row was wrongly skipped. Real requests are separate transactions with
distinct timestamps, so production was never affected - seed each sibling in its own
transaction to mirror that and make the test deterministic.

Also drop the lone Create() <param> tag that tripped CS1573 (fold it into the summary).
@hhvrc hhvrc merged commit 47029f0 into develop Jun 30, 2026
22 checks passed
@hhvrc hhvrc deleted the feat/email-outbox-retry-queue branch June 30, 2026 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants