24 changes: 24 additions & 0 deletions CLAUDE.md
@@ -110,6 +110,7 @@ On every push to `main`, `.github/workflows/release.yml` compares the `pyproject
- `MACHINES`: MachinesConfig instance managing all machine configurations
- `USERS`: UsersConfig instance managing all user data
- `SLACK_HANDLER`: Optional SlackHandler for Slack integration
- `WEBHOOK_NOTIFIER`: Optional WebhookNotifier (see Status-Change Webhook below); `None` unless `STATUS_WEBHOOK_URL` is set
- `START_TIME`: Server start timestamp for uptime tracking

**Configuration System**:
@@ -196,8 +197,30 @@ as `/api/machine/*`):

**Admin APIs** (`/api/*`):
- `POST /api/reload-users`: Hot-reload users.json without restart
- `GET /api/machines`: List all machines and their current status as JSON
(sorted by name). Read-only; intended for external consumers such as the
Equipment Status Board. Each entry is `Machine.status_dict` (name,
display_name, derived `status`, relay, oops, locked_out, current_user,
last_checkin, last_update).
- `GET /metrics`: Prometheus metrics endpoint
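
As a rough illustration of how an external consumer might use the `GET /api/machines` response, the sketch below groups machine names by their derived `status`. It assumes the response body wraps the entries as `{"machines": [...]}`; the function name `group_by_status` and the sample data are hypothetical and not part of this codebase.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_status(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group machine names by the derived ``status`` string.

    ``response`` is assumed to be the decoded JSON body of
    ``GET /api/machines``, shaped as ``{"machines": [entry, ...]}``.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in response.get("machines", []):
        grouped[entry["status"]].append(entry["name"])
    return dict(grouped)


# Hypothetical response body, trimmed to the two fields used here.
sample = {
    "machines": [
        {"name": "bandsaw", "status": "idle"},
        {"name": "lathe", "status": "oops"},
        {"name": "planer", "status": "idle"},
    ]
}
print(group_by_status(sample))
```

Because the endpoint returns entries sorted by name, each per-status list keeps that order.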

### Status-Change Webhook

When `STATUS_WEBHOOK_URL` is set, `WebhookNotifier` (`src/dm_mac/webhook.py`)
POSTs a JSON webhook to that URL on every *meaningful* machine status change —
never on ordinary MCU heartbeats. Events: `login`, `logout`, `unauthorized`,
`unknown_fob`, `override_login`, `oops`, `unoops`, `lockout`, `unlock`,
`reboot`. The payload is `Machine.status_dict` plus `event`, `timestamp`, and a
distinct `user` field (the event actor, e.g. who logged out — differs from
`current_user`). Delivery is fire-and-forget (an `asyncio` task, like the Slack
notifications) with a per-attempt timeout and bounded exponential-backoff
retry, so the MCU response is never blocked; failures are logged and dropped.
`notify()` is called only from the meaningful status-change code paths in
`models/machine.py`, so heartbeats never fire it. The notifier is resolved via
`current_app` in request-context (MCU/API) paths and via the passed Slack
handler's app (`slack.quart`) in the Slack-command path, which runs without a
request context.
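
The delivery behavior described above (per-attempt timeout, bounded exponential backoff, log and drop on exhaustion) can be sketched roughly as follows. This is not the code in `src/dm_mac/webhook.py`: the function name `deliver_with_retry`, the injectable `send` and `sleep` callables, and the default attempt count, delay, and timeout are all assumptions chosen for illustration.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict

logger = logging.getLogger(__name__)

Payload = Dict[str, Any]


async def deliver_with_retry(
    send: Callable[[Payload], Awaitable[None]],
    payload: Payload,
    attempts: int = 3,        # assumed bound; the real value may differ
    base_delay: float = 0.5,  # assumed first backoff delay, in seconds
    timeout: float = 5.0,     # assumed per-attempt timeout, in seconds
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Try ``send(payload)`` up to ``attempts`` times; return True on success."""
    for attempt in range(attempts):
        try:
            # Each attempt gets its own timeout.
            await asyncio.wait_for(send(payload), timeout=timeout)
            return True
        except Exception as exc:  # timeouts and transport errors alike
            logger.warning(
                "webhook attempt %d/%d failed: %s", attempt + 1, attempts, exc
            )
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            await sleep(base_delay * (2 ** attempt))
    # Retries exhausted: log once and drop the event.
    logger.error("webhook dropped after %d attempts", attempts)
    return False
```

Passing `send` and `sleep` in as parameters is only a convenience of this sketch; it lets the retry logic be exercised in tests without a real HTTP client or real delays.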

### Logging

Custom `RequestFormatter` adds request context (`remote_addr`, `url`) to all logs when available. The `AUTH` logger is used specifically for authentication/authorization decisions.
@@ -218,6 +241,7 @@ Optional for MAC server:
- `SLACK_SIGNING_SECRET`: Slack Signing Secret
- `SLACK_CONTROL_CHANNEL_ID`: Private admin channel ID
- `SLACK_OOPS_CHANNEL_ID`: Public channel for oops/maintenance notices
- `STATUS_WEBHOOK_URL`: If set, URL to POST status-change webhooks to (see Status-Change Webhook above); disabled when unset

## Testing Notes

172 changes: 172 additions & 0 deletions docs/features/completed/esb-support.md
@@ -0,0 +1,172 @@
# ESB Support

> Status: **completed**. This document is the original feature request plus the
> implementation plan and per-milestone progress notes recorded as the work was
> carried out.

## Overview

Goal: implement the required minimal changes in this application to support https://github.com/DecaturMakers/equipment-status-board/issues/10.

At a high level, this means:

1. If we don't already have one, a simple API for retrieving a list of all machines and their current status.
2. Ability to fire a webhook (non-blocking/async or from a background worker) when a machine's status is changed (user login/logout, unauthorized RFID fob, oops/lockout/clear), using a webhook destination configured via environment variable. The webhook should include information on the currently-logged-in user if there is one. This should only fire on meaningful status changes, not on every update from the MCU.
3. If we don't already have them, API endpoints to oops a machine, put it in maintenance lockout, or clear oops/lockout.

Our approach is to minimize change in this application, and let ESB do the heavy lifting; we just need to send events to it via webhook and give it the appropriate API endpoints to call.

## Implementation Plan

Commit-message prefix for this feature: ``ESB Support - {Milestone}.{Task}``.

### Findings / Scope

Mapping the three high-level requirements onto the current codebase:

1. **Machine list/status API** — *does not exist as JSON.* Today status is only
available via the Prometheus ``/metrics`` endpoint and the Slack ``status``
command. We will add a small read-only JSON endpoint.
2. **Status-change webhook** — *does not exist.* The meaningful status changes
already have well-defined chokepoints in ``models/machine.py`` where Slack
notifications fire (login, logout, unauthorized/unknown fob, override login,
oops, un-oops, lockout, unlock, reboot). We will fire the webhook from those
same sites so it only fires on meaningful changes, never on ordinary MCU
heartbeats.
3. **Oops / lockout / clear API endpoints** — *already exist* in
``views/machine.py``: ``POST``/``DELETE /api/machine/oops/<name>`` and
``POST``/``DELETE /api/machine/locked_out/<name>``. ESB "clear" = ``DELETE``
on whichever of the two is set. No new control endpoints are required; we
only verify and document them.
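
To make the "clear" mapping in item 3 concrete, here is a minimal sketch of how a consumer might turn a machine's status entry into the `DELETE` requests needed to clear it. The helper name `clear_requests` is hypothetical, and issuing both requests when both flags are set is an assumption; only the two endpoint paths come from the existing API.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import quote


def clear_requests(entry: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return the (method, path) pairs that would clear oops/lockout.

    ``entry`` is a per-machine status object with ``name``, ``oops`` and
    ``locked_out`` keys.
    """
    name = quote(entry["name"], safe="")  # keep the name path-safe
    requests: List[Tuple[str, str]] = []
    if entry.get("oops"):
        requests.append(("DELETE", f"/api/machine/oops/{name}"))
    if entry.get("locked_out"):
        requests.append(("DELETE", f"/api/machine/locked_out/{name}"))
    return requests
```

A machine with neither flag set yields an empty list, so the caller has nothing to send.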

The ESB issue also mentions storing 100–500 recent activity events per machine.
Per the "let ESB do the heavy lifting" directive, **ESB stores that history from
the webhook events**; MAC does not add history storage.

### Design Decisions (confirmed with maintainer)

* **Webhook destination:** ``STATUS_WEBHOOK_URL`` environment variable. When
unset, webhook firing is disabled entirely (mirrors how Slack is disabled
without its tokens).
* **Authentication:** none. The POST is sent with no auth header.
* **Delivery:** fire-and-forget (``asyncio.create_task``, like Slack) so the MCU
response is never blocked, with a bounded **retry-with-backoff** in the
background task and a per-attempt timeout. Failures are logged after retries
are exhausted; ESB can reconcile via the status API.

### Shared status representation

A single ``status_dict`` (and derived ``status`` string) will be used by **both**
the list API and the webhook payload so they never drift:

```json
{
  "name": "planer",
  "display_name": "Planer",
  "status": "in_use",
  "relay": true,
  "oops": false,
  "locked_out": false,
  "current_user": { "account_id": "123", "full_name": "Jane Doe" },
  "last_checkin": 1720000000.0,
  "last_update": 1720000000.0
}
```

``status`` is one of ``idle``, ``in_use``, ``oops``, ``locked_out``, or
``unknown``. ``current_user`` is ``null`` when nobody is logged in.
``last_checkin`` and ``last_update`` are epoch seconds, or ``null`` if the
machine has never checked in or been updated.
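
One plausible way to derive the ``status`` string from the other fields is sketched below. The precedence order (lockout, then oops, then never-checked-in, then in-use, then idle) and the use of ``current_user`` to detect "in use" are assumptions made for illustration; the actual ``Machine.status`` property may order or detect these differently.

```python
from typing import Any, Dict, Optional


def derive_status(
    locked_out: bool,
    oops: bool,
    current_user: Optional[Dict[str, Any]],
    last_checkin: Optional[float],
) -> str:
    """Collapse machine state into one summary string.

    Assumed precedence (not confirmed by the source):
    locked_out > oops > unknown > in_use > idle.
    """
    if locked_out:
        return "locked_out"
    if oops:
        return "oops"
    if last_checkin is None:
        # The machine has never checked in.
        return "unknown"
    if current_user is not None:
        return "in_use"
    return "idle"
```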

The webhook payload is this dict plus an ``event`` field and a firing
``timestamp``. Event values: ``login``, ``logout``, ``unauthorized``,
``unknown_fob``, ``override_login``, ``oops``, ``unoops``, ``lockout``,
``unlock``, ``reboot``.
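
A minimal sketch of assembling that payload from a ``status_dict`` follows. The helper name ``build_payload`` and the ``now`` parameter are hypothetical. The ``user`` actor field is included because the completed implementation (see the Milestone 2 notes) adds it alongside ``event`` and ``timestamp``; rejecting unknown event names is an assumption of this sketch.

```python
import time
from typing import Any, Dict, Optional

# Event names listed in this document.
EVENTS = frozenset(
    {
        "login", "logout", "unauthorized", "unknown_fob", "override_login",
        "oops", "unoops", "lockout", "unlock", "reboot",
    }
)


def build_payload(
    status_dict: Dict[str, Any],
    event: str,
    user: Optional[Dict[str, Any]] = None,
    now: Optional[float] = None,
) -> Dict[str, Any]:
    """Return the shared status dict extended with the webhook-only fields."""
    if event not in EVENTS:
        raise ValueError(f"unknown webhook event: {event!r}")
    payload = dict(status_dict)  # shallow copy; the caller's dict is untouched
    payload["event"] = event
    payload["timestamp"] = time.time() if now is None else now
    payload["user"] = user  # the event actor, which may differ from current_user
    return payload
```

Copying the dict rather than mutating it keeps the list API and the webhook reading from the same source without either one leaking fields into the other.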

### Milestone 1 — ESB-facing status & control API — ✅ COMPLETE

Added ``Machine.status`` and ``Machine.status_dict`` (shared representation),
the ``GET /api/machines`` endpoint with ``MachineStatus`` / ``MachinesListResponse``
schemas, and unit tests for both the model properties and the endpoint. Verified
the existing oops/lockout endpoints already satisfy requirement #3 and are
well-covered (no code change needed). The ``nox -s tests`` session passes at 97% coverage.

* **1.1** Add a ``status`` property (derived string) and a ``status_dict``
property on ``Machine`` (reading ``self.state``), producing the shared
representation above. ``current_user`` serializes to ``{account_id, full_name}``
or ``null``.
* **1.2** Add ``GET /api/machines`` returning ``{"machines": [status_dict, ...]}``
for all configured machines, sorted by name. Add ``pydantic`` response schemas
in ``models/api_schemas.py`` and document the route with ``quart_schema``
(tagged ``Admin``). Read-only and unauthenticated, consistent with the existing
API surface.
* **1.3** Confirm the existing oops/lockout endpoints satisfy ESB's control
needs; add any missing test coverage. No behavior change expected.
* **1.4** Unit tests for ``status``/``status_dict`` and the new endpoint
(idle/in-use/oops/locked-out, with and without a current user).

### Milestone 2 — Status-change webhook — ✅ COMPLETE

Added ``src/dm_mac/webhook.py`` (``WebhookNotifier``: fire-and-forget delivery
via ``aiohttp`` with bounded exponential-backoff retry, enabled by
``STATUS_WEBHOOK_URL``), wired it into ``create_app`` as ``WEBHOOK_NOTIFIER``,
and fired ``notify(...)`` from the meaningful status-change sites in
``models/machine.py`` (login / logout / unauthorized / unknown_fob /
override_login / oops / unoops / lockout / unlock / reboot), including the
always-enabled RFID-tracking path. The notifier is resolved via
``current_app`` in the request-context (MCU/API) paths and via the passed
Slack handler's app (``slack.quart``) in the Slack-command path, which runs
without a request context. Payloads reuse ``Machine.status_dict`` plus
``event``, ``timestamp``, and a distinct ``user`` (event actor) field. Unit
tests cover the notifier (payload/retry/backoff/disabled) and the wiring
(each event fires; heartbeats do not; Slack path; disabled no-op). All
``nox -s tests`` and ``nox -s mypy`` pass; ``webhook.py`` at 100% coverage.

* **2.1** New module ``src/dm_mac/webhook.py`` with a ``WebhookNotifier`` class:
* Constructed from ``STATUS_WEBHOOK_URL``.
* ``notify(machine, event, user=None)`` builds the payload from
``machine.status_dict`` + ``event`` + ``timestamp`` and spawns a
fire-and-forget ``asyncio.create_task`` delivery coroutine.
* Delivery coroutine POSTs JSON via ``aiohttp`` (already a dependency) with a
per-attempt timeout and retry-with-backoff (a small, bounded number of
attempts with exponential backoff); logs a single error after exhaustion.
* **2.2** In ``create_app``/``main`` (``__init__.py``), instantiate the notifier
when ``STATUS_WEBHOOK_URL`` is set and store it as
``app.config["WEBHOOK_NOTIFIER"]`` (default ``None``), exactly like
``SLACK_HANDLER``.
* **2.3** Fire ``notify(...)`` from the same async sites that emit Slack
messages, so webhooks track meaningful changes only:
* ``MachineState._handle_rfid_insert`` → ``login`` / ``unauthorized`` /
``unknown_fob`` / ``override_login``
* ``MachineState._handle_rfid_remove`` → ``logout``
* ``MachineState._handle_reboot`` → ``reboot`` (and logout of any prior user)
* ``Machine.oops`` / ``MachineState._handle_oops`` → ``oops``
* ``Machine.unoops`` → ``unoops``; ``Machine.lockout`` → ``lockout``;
``Machine.unlock`` → ``unlock``
Each site fetches the notifier via ``current_app.config.get("WEBHOOK_NOTIFIER")``
and no-ops when ``None``.
* **2.4** Unit tests: payload shape per event, disabled-when-unset, retry/backoff
on failure (mock ``aiohttp``), and non-firing on ordinary heartbeat updates.
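
The scheduling side of tasks 2.1 to 2.3 (spawn a background task, no-op when the notifier is ``None``) might look roughly like the sketch below. ``SketchNotifier`` and ``fire`` are hypothetical stand-ins, not the real ``WebhookNotifier``; the plain ``config`` dict stands in for ``app.config``, and keeping task references in a set is a design choice assumed here so pending tasks are not garbage-collected.

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict, Optional, Set

Payload = Dict[str, Any]


class SketchNotifier:
    """Hypothetical stand-in for WebhookNotifier; only scheduling is shown."""

    def __init__(
        self, deliver: Callable[[Payload], Coroutine[Any, Any, None]]
    ) -> None:
        self._deliver = deliver
        self._tasks: Set["asyncio.Task[None]"] = set()

    def notify(self, payload: Payload) -> None:
        # Schedule delivery and return immediately (fire-and-forget).
        task = asyncio.create_task(self._deliver(payload))
        # Hold a reference until the task finishes.
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)


def fire(config: Dict[str, Any], payload: Payload) -> bool:
    """Notify if a notifier is configured; otherwise do nothing."""
    notifier: Optional[SketchNotifier] = config.get("WEBHOOK_NOTIFIER")
    if notifier is None:
        return False
    notifier.notify(payload)
    return True
```

``notify`` must be called from inside a running event loop, which holds for the async call sites listed above.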

### Milestone 3 — Acceptance Criteria — ✅ COMPLETE

Updated documentation (`docs/source/configuration.rst` env-var table;
`docs/source/http-api.rst` with the Machine Status API and Status-Change
Webhook sections; `CLAUDE.md` app-config, API endpoints, env vars, and a new
Status-Change Webhook subsection). `README.rst` is a pointer to the full docs
and needed no change. `GET /api/machines` is picked up automatically by the
generated OpenAPI spec, and `dm_mac.webhook` by `sphinx-apidoc`. New code has
unit-test coverage (`webhook.py` at 100%). All `nox` sessions pass: `tests`,
`mypy`, `pre-commit`, `typeguard`, `docs`, and `safety`.

Note: `nox -s safety` was failing on `main` due to a pre-existing transitive
`msgpack==1.2.0` advisory (GHSA-6v7p-g79w-8964), unrelated to this feature. To
satisfy the "all nox sessions passing" criterion, `msgpack ^1.2.1` is pinned in
`pyproject.toml`. Because `msgpack` is only pulled in via `cachecontrol` (a
dev-group dependency) and is not used at runtime, the pin lives in
`group.dev.dependencies` so it does not expand the production dependency set.

* **3.1** Update documentation: ``README.rst`` (env var + endpoints if listed),
``docs/source/configuration.rst`` (add ``STATUS_WEBHOOK_URL`` to the env-var
table), ``docs/source/http-api.rst`` (document ``GET /api/machines`` and the
webhook payload/events), and ``CLAUDE.md`` (env vars + architecture notes).
Match the existing style and verbosity.
* **3.2** Ensure all new code has appropriate unit-test coverage.
* **3.3** All ``nox`` sessions pass (``tests``, ``mypy``, ``pre-commit``,
``safety``, ``typeguard``, ``docs``).
* **3.4** Move this file from ``docs/features/`` to ``docs/features/completed/``.
3 changes: 3 additions & 0 deletions docs/source/configuration.rst
@@ -99,6 +99,9 @@ Environment Variables
* - ``SLACK_OOPS_CHANNEL_ID``
- no
- If using the Slack integration, the Channel ID of the public channel where Oops and maintenance notices will be posted, and where machine status can be checked.
* - ``STATUS_WEBHOOK_URL``
- no
- If set, the URL to POST a JSON status-change webhook to on every meaningful machine event (login, logout, unauthorized or unknown fob, override login, oops, un-oops, lockout, unlock, reboot). Disabled when unset. See :ref:`http-api.status-webhook`.

.. _configuration.machine-state-dir:

1 change: 1 addition & 0 deletions docs/source/dm_mac.rst
@@ -27,3 +27,4 @@ Submodules
dm_mac.neongetter
dm_mac.slack_handler
dm_mac.utils
dm_mac.webhook
8 changes: 8 additions & 0 deletions docs/source/dm_mac.webhook.rst
@@ -0,0 +1,8 @@
dm\_mac.webhook module
======================

.. automodule:: dm_mac.webhook
   :members:
   :private-members:
   :show-inheritance:
   :undoc-members:
87 changes: 87 additions & 0 deletions docs/source/http-api.rst
@@ -82,6 +82,93 @@ The ``display`` field is byte-identical between pre-feature and post-feature
servers for any given (machine, operator, machine state) tuple. ``second_relay``
configuration never causes LCD changes.

Machine Status API
------------------

``GET /api/machines``

Returns the current status of every configured machine as JSON, sorted by
machine name. This read-only endpoint is intended for external consumers (such
as the `Equipment Status Board
<https://github.com/DecaturMakers/equipment-status-board>`_) to poll or
reconcile machine state. It is included in the OpenAPI spec above. Each machine
entry has the shape:

::

    {
      "name": "planer",
      "display_name": "Planer",
      "status": "in_use",
      "relay": true,
      "oops": false,
      "locked_out": false,
      "current_user": {"account_id": "123", "full_name": "Jane Doe"},
      "last_checkin": 1720000000.0,
      "last_update": 1720000000.0
    }

``status`` is a derived summary — one of ``locked_out``, ``oops``, ``in_use``,
``idle``, or ``unknown`` (never checked in). ``current_user`` is ``null`` when
no user is logged in. ``last_checkin`` and ``last_update`` are epoch seconds:
``last_checkin`` is ``null`` until the machine's first check-in, and
``last_update`` is ``null`` until its first *meaningful* state change. These are
independent — a machine that has only sent idle heartbeats has a ``last_checkin``
while ``last_update`` is still ``null``.

.. _http-api.status-webhook:

Status-Change Webhook
---------------------

When the ``STATUS_WEBHOOK_URL`` environment variable is set (see
:ref:`configuration.env-vars`), the server POSTs a JSON webhook to that URL on
every *meaningful* machine status change — never on ordinary MCU heartbeats.
This lets an external consumer (such as the Equipment Status Board) maintain
its own activity history and live status without polling.

The webhook fires for these ``event`` values: ``login``, ``logout``,
``unauthorized`` (known user lacking authorization), ``unknown_fob``,
``override_login``, ``oops``, ``unoops``, ``lockout``, ``unlock``, and
``reboot`` (MCU reboot detected).

The request body is the same per-machine object as ``GET /api/machines`` (so
``current_user`` means the same thing) plus three fields:

* ``event`` — the status-change event name (see above).
* ``timestamp`` — epoch seconds when the event fired.
* ``user`` — the user *involved in this event* (the actor), as
``{"account_id", "full_name"}`` or ``null``. This differs from
``current_user`` for events like ``logout`` and ``unauthorized`` where a user
acted but is not (or is no longer) logged in.

::

    POST <STATUS_WEBHOOK_URL>
    Content-Type: application/json

    {
      "name": "planer",
      "display_name": "Planer",
      "status": "idle",
      "relay": false,
      "oops": false,
      "locked_out": false,
      "current_user": null,
      "last_checkin": 1720000000.0,
      "last_update": 1720000000.5,
      "event": "logout",
      "timestamp": 1720000000.5,
      "user": {"account_id": "123", "full_name": "Jane Doe"}
    }

Delivery is fire-and-forget so it never blocks the MCU response: each webhook
is sent on a background task that retries with exponential backoff and a
per-attempt timeout, giving up (and logging an error) after a few attempts. No
authentication header is sent. Consumers should treat delivery as best-effort
and reconcile via ``GET /api/machines`` when needed. See
:py:mod:`dm_mac.webhook` for details.

Prometheus Metrics
------------------
