Skip to content

Support a mutable/appendable table for in-SQL iterative state (e.g. optimizer trajectories) #194

Description

@alxmrs

Motivation

The autograd work (#192) makes it possible to run gradient descent in SQL — the gradient of a loss is AVG(grad(loss, θ)), and an update is θ_new = θ - lr * AVG(grad(loss, θ)). A natural, SQL-native way to express the optimizer is a parameter table that grows one row per step, so the whole training trajectory becomes a relation you can query (e.g. the loss curve in a single GROUP BY). See the discussion in #192 (comment).

Today that "append a new generation" step can't happen in SQL. The tables registered via XarrayContext.from_dataset are backed by a read-only, lazy Arrow stream provider (LazyArrowStreamTable), so there is no INSERT INTO. The demo works around it by rebuilding the whole history as an xarray Dataset and re-registering it each step:

def register_params():
    ctx.deregister_table("params")
    ds = xr.Dataset({"a": (("step",), a_hist), "b": (("step",), b_hist)},
                    coords={"step": steps})
    ctx.from_dataset("params", ds, chunks={"step": len(steps)})

It's correct and fine for a demo, but it's the one un-SQL-native part of an otherwise fully-relational training loop, and it re-materializes O(history) each step.

What we'd want

Some way to do, in SQL:

INSERT INTO params
SELECT :next_step,
       cur.a - lr * AVG(grad(loss, a)),
       cur.b - lr * AVG(grad(loss, b))
FROM d CROSS JOIN (SELECT a, b FROM params WHERE step = :k) cur
GROUP BY cur.a, cur.b;

so the optimizer's state lives in the engine and the Python side only drives iteration (or, eventually, doesn't — see "recursive CTE" below).

Design tension (why "xarray-backed append" specifically is awkward)

The obvious framing — "make xarray tables appendable" — fights the data model:

  • xarray/Zarr tables model immutable, dense, gridded arrays. A SQL row is one cell (coords…, values…). "Appending a row" to a gridded array isn't an INSERT; it's growing/concatenating along a coordinate dimension (a rechunk/xr.concat), which only makes sense for some tables (a 1-D params(step)) and not others (a 3-D (time, lat, lon) field).
  • The real need that surfaced is a small mutable scratch table for iteration state, which is conceptually separate from the large immutable data tables we read from Zarr.

So I'd frame this as "support mutable state tables," not "make xarray reads mutable."

Options

  1. Mutable in-memory scratch table (recommended if we do anything). Register a DataFusion-native mutable table (e.g. a MemTable, which already supports INSERT INTO in DataFusion) for iteration state, kept separate from xarray-backed tables. Work needed: expose a way to create/register such a table from Python, and make the autograd Substrait rewrite aware of it (it currently resolves table schemas from _registered_datasets, so a non-xarray table would need to be included in _table_schemas() / registered as an empty table on the rewrite side).
  2. Appendable xarray/Zarr-backed provider. A TableProvider that supports insert_into by appending along a dimension (or writing to a Zarr store). More work, and only natural for 1-D / append-along-one-dim tables; risks implying mutability for gridded data where it doesn't belong.
  3. Recursive CTE for the whole loop. The most ambitious "fully in SQL" answer: express N steps as WITH RECURSIVE. Blocked today — DataFusion's recursive-CTE support is limited (aggregates in the recursive term), and recursive CTEs + scalar subqueries don't round-trip through Substrait yet (the demo already hit "Cannot convert <subquery> to Substrait"). Tracking-only for now.

Do we think we should implement this?

My recommendation: not a priority, and not as "appendable xarray tables." Reasons:

  • The current re-registration workaround is correct and adequate for the demo; the per-step re-materialization cost is small for the state tables this is about.
  • "Append to xarray-backed tables" cuts against the immutable-gridded-array model that is the library's core value; baking mutability into that provider adds surface area and conceptual confusion for little gain.
  • If/when we want in-engine iteration, the clean path is a separate mutable scratch table (option 1), not mutating the data tables — and even that is arguably premature until there's a use case beyond the demo.

Filing this mainly to capture the idea and the design reasoning so the trade-off is on record. Happy to be argued out of it — option 1 is genuinely appealing for the "differential database" direction (a training loop whose state is a table), and a recursive-CTE loop (option 3) would be the dream once the engine/Substrait support lands.

Context: came out of building the autograd feature (#192) and its gradient-descent / MNIST-MLP-in-SQL demos.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions