Demo: train an MNIST MLP classifier in SQL by alxmrs · Pull Request #196 · xqlsystems/xarray-sql

alxmrs · 2026-06-28T15:08:35Z

Stacked on the ERA5/gradient-descent demo branch (#195). Adds benchmarks/mnist_mlp.py.

What this is

A one-hidden-layer MLP (196 → 32 tanh → 10 softmax, on 2×2-pooled 14×14 MNIST) trained by gradient descent where every gradient is computed in SQL over data registered as xarray. The optimisation loop is plain Python; all the math is relational.

Reverse-mode autodiff expressed as relational algebra:

matmul = join + GROUP BY SUM — a layer's pre-activation is SUM(input · weight) grouped by (sample, unit).
local derivatives = grad() — the hidden activation's Jacobian is grad(tanh(z), z), the autograd feature doing the calculus per (sample, unit).
cotangent propagation = join; parameter gradients = join + GROUP BY AVG.

The only hand-written gradient is softmax + cross-entropy's delta = softmax - onehot (softmax couples classes through a per-sample normaliser, an aggregate grad does not cross — staying faithful to SQL).

Reaches ~83% test accuracy in ~45s; downloads MNIST on first run. PEP 723 inline deps, uv run benchmarks/mnist_mlp.py.

🤖 Generated with Claude Code

Generated by Claude Code

Introduce `src/autograd.rs`, the Rust core of the autograd feature: a `differentiate(&Expr, wrt)` function that symbolically differentiates a DataFusion logical `Expr` tree with respect to a named column and returns a new `Expr` built from ordinary SQL expressions. The design mirrors JAX's per-primitive rule registry (defjvp and friends): each node type has a differentiation rule and the chain rule composes them as the tree is walked. A small 0/1-folding simplifier keeps output compact, playing the role of JAX's Zero tangents and add_tangents. Because each table row is an independent evaluation point, differentiating a column expression and letting DataFusion evaluate it row-by-row is the relational equivalent of vmap(grad(f)). This first cut implements scalar `grad`: rules for +, -, *, / (sum, product, quotient), unary chain rule for sin/cos/tan, asin/acos/atan, exp/ln/log2/ log10/sqrt, sinh/cosh/tanh, abs, and power() with constant base or exponent. Unsupported nodes/functions return a clear NotImplemented error rather than a silently wrong derivative. The engine operates purely on DataFusion `Expr`, keeping the eventual Python<->Rust transport (SQL text, Substrait, or proto) pluggable. Covered by 11 unit tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Add the `grad` marker UDF and a plan-level rewriter (`rewrite_grad_calls`) to the autograd engine, plus a `grad_rewrite` PyO3 function that bridges the differentiation engine into the datafusion-python SessionContext. Because the native extension links its own copy of DataFusion, expressions cross the Python<->Rust boundary as Substrait protobuf. Python produces the logical plan as Substrait; `grad_rewrite` consumes it into a DataFusion LogicalPlan, rewrites every `grad(expr, column)` ScalarFunction into the symbolic derivative via `differentiate`, and re-produces Substrait bytes for Python to consume and execute. The custom xarray table provider round-trips because Substrait serializes table scans by name (resolved against the registry on consume), so the rewrite context only needs empty tables with matching schemas. `grad` is registered as a marker ScalarUDF that carries the differentiation request intact through parsing, planning, and serialization; it is always rewritten away before execution and errors if it ever reaches invoke. Deps: datafusion-substrait 52 and prost 0.14 (matching the substrait crate). Building now requires `protoc` (the substrait crate codegens from .proto). Verified end to end (produce -> grad_rewrite -> consume -> execute) against analytic derivatives for cos, the product rule, and exp with 0.0 error. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Wire the autograd surface into XarrayContext so users can write calculus directly in SQL: ctx.sql("SELECT grad(sin(val), val) AS d_val, sin(val) AS val FROM t") On construction the context registers the `grad` marker UDF so such queries parse and plan. XarrayContext.sql() detects `grad(` (a cheap regex gate so ordinary queries are untouched) and routes through _sql_with_autograd: it plans the query, produces the logical plan as Substrait, calls the native grad_rewrite to differentiate every grad(expr, column) symbolically, then consumes the rewritten Substrait back into an executable DataFrame. Table scans are resolved by name on the consume side, so _table_schemas() passes the (name, schema) of each registered table to the rewrite. Schema- qualified tables (mixed-dimension datasets) are skipped for now and noted as a follow-up. Adds tests/test_autograd.py covering sin/cos, product and quotient rules, power, exp, the non-grad passthrough, and a clear error for unsupported functions — all checked against numpy analytic derivatives. Existing SQL tests still pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Adding datafusion-substrait pulls in the `substrait` crate, whose build script generates Rust from .proto files and requires `protoc`. Without it the Rust/maturin builds fail. - ci.yml, ci-build.yml, ci-rust.yml: add arduino/setup-protoc before the build (covers Linux, macOS and Windows runners). - publish.yml: setup-protoc for the macOS/Windows wheel job; for the manylinux maturin-action jobs install protoc inside the container via before-script-linux (arch-aware download). The sdist job is unchanged as it packages source without compiling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Extend the autograd surface from scalar grad() to multi-input Jacobians. SELECT jacobian(sin(x) * y, [x, y]) AS jac FROM g -- per row: [d/dx, d/dy] = [cos(x)*y, sin(x)] (a List<Float64>) `jacobian(expr, [c1, c2, ...])` differentiates `expr` with respect to each listed column and returns the gradient row as an array. Using a SQL array for the inputs keeps the marker at fixed arity two (avoiding variadic-UDF issues): the `[c1, c2, ...]` parses to make_array(c1, c2, ...), from which the rewrite extracts the input columns; the result is built with make_array of the partials. Array/list columns round-trip through Substrait, verified end to end. The single grad() marker is generalized into a reusable MarkerUdf (with grad_marker()/jacobian_marker() constructors and per-marker return types), and the plan rewrite dispatches on the function name. A full Jacobian can also be written as separate scalar grad() columns, which already worked; both forms are covered by tests against numpy analytic derivatives. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Drop the jacobian(expr, [cols]) -> List<Float64> form: a nested array column breaks the long/tidy data model (a cell should be one value aligned to its coordinates). The same Jacobian is expressed in-model as several scalar columns, e.g. grad(f, x) AS dfdx, grad(f, y) AS dfdy. Add forward- and reverse-mode gradients as scalar SQL functions: * jvp(expr, column, tangent) -> d(expr)/d(column) * tangent (forward) * vjp(expr, column, cotangent) -> cotangent * d(expr)/d(column) (reverse) A multi-input directional derivative is the sum of per-input jvp terms; both stay scalar, so they round-trip cleanly through Substrait and back to xarray. Engine: unify grad and jvp behind a single `linearize` (forward-mode chain rule with a pluggable leaf rule) — grad is a one-hot seed, jvp an arbitrary seed per input. This mirrors JAX's structure and removes rule duplication. vjp is cotangent * grad; for a scalar output forward and reverse coincide (asserted by a jvp/vjp agreement test), differing only in seed placement. Tests: 15 Rust unit tests and 11 Python integration tests (incl. jvp/vjp semantics, the multi-input sum, and jvp==vjp for a unit seed), all checked against numpy analytic derivatives. fmt/clippy/ruff/mypy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Mixed-dimension datasets register as schema-qualified tables (e.g. era5.surface / era5.time_x_level). The autograd rewrite consumes the plan in a throwaway context that registers an empty table per scanned name, but register_table("era5.time_x", ...) failed with "failed to resolve schema: era5" because the namespace did not exist. Add ensure_schema(): before registering each table, parse its name into a TableReference and, for qualified names, create the schema namespace (MemorySchemaProvider) in the default catalog if absent. The Python side already resolves qualified names via ctx.table(name).schema(); only the Rust rewrite context needed the namespace. Tests: a mixed-dimension fixture exercising grad on both the 2D surface and 3D atmosphere tables, against numpy analytic derivatives. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Nested calls such as grad(grad(f, x), x) already yield higher-order derivatives: the plan rewrite walks expressions bottom-up (transform_up), so the inner grad is differentiated to a plain expression first and the outer grad differentiates that result. No code change was needed; this adds tests and documents the behavior. - Rust: a unit test that differentiation composes (d2/dx2 sin = -sin). - Python: second derivatives of sin (-sin) and x^3 (6x) and the third derivative of sin (-cos), against numpy. - Doc: note higher-order support in the module overview. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Document and test that differentiating through SUM/AVG is just linearity: AGG(grad(f, x)) == d/dx AGG(f). Writing grad inside the aggregate composes with SQL scoping (the marker rewrites to plain SQL before the aggregate runs), so it needs no special machinery -- enough to express gradient descent in SQL. Adds tests for SUM/AVG(grad(...)) and an end-to-end gradient-descent convergence test, plus a note in the module overview. The runnable benchmark scripts live on stacked demo branches to keep this feature branch reviewable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Generalize _table_schemas() to enumerate the catalog instead of only the xarray-registered datasets, so the Substrait rewrite can resolve grad() queries that reference plain DataFusion tables too -- e.g. in-memory MemTables holding model parameters or intermediate results. This makes grad compose with ordinary relational state (a parameter table you INSERT into), not only gridded xarray data. Adds a test differentiating an expression whose coefficient lives in an in-memory table cross-joined to the xarray data. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Expose the autograd engine as a "calculus compiler": differentiate_sql(expr, wrt, columns) parses a SQL scalar expression (parse_sql_expr), differentiates it with the engine, and unparses the derivative back to SQL (expr_to_sql). Where grad(...) rewrites a whole plan via Substrait, this hands back a single derivative expression as text -- usable where the Substrait round-trip can't carry a grad marker, e.g. embedding a precomputed update rule inside a recursive -CTE training loop (Substrait has no recursion). Exposed as xarray_sql. differentiate_sql; covered by a round-trip test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Make grad()/jvp()/vjp() work inside any query shape (recursive CTEs, DML, subqueries) by rewriting the calls as SQL text before planning, rather than round-tripping the logical plan through Substrait (which could not represent those shapes). Closes the gap tracked in #197. XarrayContext.sql() now hands a query containing a marker to the native rewrite_grad_sql, which parses the statement with sqlparser, differentiates each marker call with the existing engine, and renders the derivative back into the SQL in place. Because it runs before planning, every query shape the parser accepts is supported, and the result is ordinary SQL the stock datafusion-python context plans and executes directly. This removes the Substrait round-trip entirely: the datafusion-substrait and prost dependencies, the grad_rewrite/_sql_with_autograd/_table_schemas plumbing, the marker-UDF registration, and the protoc steps in CI. Unlike the FFI alternative, it needs no datafusion fork and no custom datafusion-python wheel. The grad surface is unchanged (same SQL, same results); marker arguments use unqualified column names, matching existing usage, since differentiation is syntactic and runs before binding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Stacked demo branch (on the autograd feature) holding the runnable benchmark scripts, kept out of the core branch so it stays reviewable. * grad_era5.py: symbolic grad over real ARCO-ERA5 data (wind-speed sensitivity checked exactly; saturation vapour pressure checked against the closed-form Clausius-Clapeyron slope). The queries ORDER BY latitude DESC, longitude to match ERA5's native order, so results line up with the xarray reference with no sorting on either side (single partition, so the order survives to_dataset). * grad_descent.py: gradient descent as ONE declarative recursive-CTE query. differentiate_sql compiles the per-row update rule to SQL once; a recursive CTE then iterates it. No Python loop. Fit matches numpy least-squares. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

A one-hidden-layer MLP (196->32 tanh->10 softmax, on 2x2-pooled 14x14 MNIST) trained by gradient descent with every gradient computed in SQL. The images are registered as xarray (the library's core); the model weights and per-step intermediates are DataFusion in-memory tables (register_record_batches), so a matmul is a join over them and there's no xarray pivot per step. Reverse-mode autodiff as relational algebra: matmul = join + GROUP BY SUM; the hidden activation's local Jacobian = grad(tanh(z), z); cotangent propagation = join; parameter gradients = join + GROUP BY AVG. The only hand-written gradient is softmax + cross-entropy's delta = softmax - onehot. ~83% test accuracy in ~20s. Adds a benchmarks README entry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017mDoFJgsm9kS7SicGoCVF6

Rewrite mnist_mlp.py so the whole model and its entire training history live in a single append-only table model(step, layer, i, j, val): every parameter is a row tagged by generation, and a training step appends the next generation's rows rather than mutating anything. Each step is a single SQL statement (forward, grad(tanh(z),z) backprop, parameter update); evaluation is SQL too (a forward pass with ROW_NUMBER() for the argmax). Python no longer holds the weights or computes any gradients — it only sequences the steps. A 2-layer net can't be one recursive CTE (the recursive relation may be referenced only once, but W1/W2 are used several times per step) and unrolling the steps as non-recursive CTEs blows up exponentially (DataFusion inlines CTEs; no MATERIALIZED). Materialising between steps is therefore host-driven; the thin loop does exactly that. Reaches ~83% test accuracy over 60 steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Make the architecture itself data. The whole model is one xr.Dataset: each layer's weight is a data_var w{L} over its boundary dims (u{L}, u{L+1}), sharing the dims that connect adjacent layers (the join keys). The dim sizes are the layer widths and the number of weights is the depth, so differing neuron counts are just differing dim sizes — no padding, because the relational long form is naturally ragged. from_dataset splits the one Dataset into a table per weight; changing WIDTHS trains a different network with the same code. One generic contract()-based loop trains a net of any depth: forward contracts each layer, backward is the same contraction transposed (VJP of a contraction is a contraction) with grad(tanh(z), z) for the local derivative. Validated exact against numpy at depth 3. Training metrics are a relation too: each logged step appends a (step, loss, train_acc, test_acc) row to a metrics table rather than a Python list. The trained model, predictions, and metrics all come back out as xarray via to_dataset. ~83% test accuracy in ~13s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Two simplifications collapse the model to a single relation: - Bias folded into the weights (an nn.Linear): each layer's bias is the weight of a constant-1 input, kept as the row inp=width of the same weight array, so a layer is one matrix. - A layer dimension: every layer's weight lives in one weight(layer, inp, out) array, so forward/backward filter on the layer COLUMN instead of referencing a table per layer. The model is one xr.Dataset with a layer dim (NaN-padded for the ragged pyramid, dropped on seed); from_dataset registers it; the update is one query over the whole weight relation. A single contract() and a generic loop train a net of any depth (validated exact against numpy at depth 3). Tensors.put now unifies batch nullability so UNION results register cleanly. Faster too (~6s vs ~13s) at the same ~83% test accuracy; model and metrics still round-trip to xarray. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

claude added 10 commits June 27, 2026 22:44

alxmrs force-pushed the claude/xarray-sql-era5-demo branch from afb1036 to fdb17fb Compare June 28, 2026 15:17

alxmrs force-pushed the claude/xarray-sql-mnist-demo branch from 0810348 to 29b49dc Compare June 28, 2026 15:21

alxmrs force-pushed the claude/xarray-sql-era5-demo branch from fdb17fb to 8f97173 Compare June 28, 2026 15:47

alxmrs force-pushed the claude/xarray-sql-mnist-demo branch from 29b49dc to f2126da Compare June 28, 2026 15:47

alxmrs force-pushed the claude/xarray-sql-era5-demo branch from 8f97173 to 27c02d4 Compare June 28, 2026 15:58

alxmrs force-pushed the claude/xarray-sql-mnist-demo branch from f2126da to d9728c3 Compare June 28, 2026 15:58

Base automatically changed from claude/xarray-sql-era5-demo to claude/xarray-sql-autograd-73ovqq June 30, 2026 13:31

alxmrs force-pushed the claude/xarray-sql-autograd-73ovqq branch from a4fc101 to 7b1e530 Compare June 30, 2026 13:34

claude added 2 commits June 30, 2026 16:40

alxmrs force-pushed the claude/xarray-sql-mnist-demo branch from d9728c3 to b8d3e83 Compare June 30, 2026 13:40

alxmrs and others added 3 commits June 30, 2026 17:09

alxmrs force-pushed the claude/xarray-sql-autograd-73ovqq branch from 7b1e530 to 14b2697 Compare July 4, 2026 14:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Demo: train an MNIST MLP classifier in SQL#196

Demo: train an MNIST MLP classifier in SQL#196
alxmrs wants to merge 17 commits into
claude/xarray-sql-autograd-73ovqqfrom
claude/xarray-sql-mnist-demo

alxmrs commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

alxmrs commented Jun 28, 2026

What this is

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants