Support a mutable/appendable table for in-SQL iterative state (e.g. optimizer trajectories)

## Motivation

The autograd work (#192) makes it possible to run **gradient descent in SQL** — the gradient of a loss is `AVG(grad(loss, θ))`, and an update is `θ_new = θ - lr * AVG(grad(loss, θ))`. A natural, SQL-native way to express the optimizer is a **parameter table that grows one row per step**, so the whole training trajectory becomes a relation you can query (e.g. the loss curve in a single `GROUP BY`). See the discussion in https://github.com/alxmrs/xarray-sql/pull/192#discussion_r3488049015.

Today that "append a new generation" step can't happen in SQL. The tables registered via `XarrayContext.from_dataset` are backed by a **read-only, lazy** Arrow stream provider (`LazyArrowStreamTable`), so there is no `INSERT INTO`. The demo works around it by rebuilding the whole history as an xarray `Dataset` and **re-registering** it each step:

```python
def register_params():
    ctx.deregister_table("params")
    ds = xr.Dataset({"a": (("step",), a_hist), "b": (("step",), b_hist)},
                    coords={"step": steps})
    ctx.from_dataset("params", ds, chunks={"step": len(steps)})
```

It's correct and fine for a demo, but it's the one un-SQL-native part of an otherwise fully-relational training loop, and it re-materializes O(history) each step.

## What we'd want

Some way to do, in SQL:

```sql
INSERT INTO params
SELECT :next_step,
       cur.a - lr * AVG(grad(loss, a)),
       cur.b - lr * AVG(grad(loss, b))
FROM d CROSS JOIN (SELECT a, b FROM params WHERE step = :k) cur
GROUP BY cur.a, cur.b;
```

so the optimizer's state lives in the engine and the Python side only drives iteration (or, eventually, doesn't — see "recursive CTE" below).

## Design tension (why "xarray-backed append" specifically is awkward)

The obvious framing — "make xarray tables appendable" — fights the data model:

- xarray/Zarr tables model **immutable, dense, gridded arrays**. A SQL row is one cell `(coords…, values…)`. "Appending a row" to a gridded array isn't an `INSERT`; it's growing/concatenating along a coordinate dimension (a rechunk/`xr.concat`), which only makes sense for some tables (a 1-D `params(step)`) and not others (a 3-D `(time, lat, lon)` field).
- The real need that surfaced is a **small mutable scratch table for iteration state**, which is conceptually separate from the large immutable data tables we read from Zarr.

So I'd frame this as "support mutable state tables," not "make xarray reads mutable."

## Options

1. **Mutable in-memory scratch table (recommended if we do anything).** Register a DataFusion-native mutable table (e.g. a `MemTable`, which already supports `INSERT INTO` in DataFusion) for iteration state, kept separate from xarray-backed tables. Work needed: expose a way to create/register such a table from Python, and make the autograd Substrait rewrite aware of it (it currently resolves table schemas from `_registered_datasets`, so a non-xarray table would need to be included in `_table_schemas()` / registered as an empty table on the rewrite side).
2. **Appendable xarray/Zarr-backed provider.** A `TableProvider` that supports `insert_into` by appending along a dimension (or writing to a Zarr store). More work, and only natural for 1-D / append-along-one-dim tables; risks implying mutability for gridded data where it doesn't belong.
3. **Recursive CTE for the whole loop.** The most ambitious "fully in SQL" answer: express N steps as `WITH RECURSIVE`. Blocked today — DataFusion's recursive-CTE support is limited (aggregates in the recursive term), and recursive CTEs + scalar subqueries don't round-trip through Substrait yet (the demo already hit "Cannot convert &lt;subquery&gt; to Substrait"). Tracking-only for now.

## Do we think we should implement this?

My recommendation: **not a priority, and not as "appendable xarray tables."** Reasons:

- The current re-registration workaround is correct and adequate for the demo; the per-step re-materialization cost is small for the state tables this is about.
- "Append to xarray-backed tables" cuts against the immutable-gridded-array model that is the library's core value; baking mutability into that provider adds surface area and conceptual confusion for little gain.
- If/when we want in-engine iteration, the clean path is a **separate mutable scratch table** (option 1), not mutating the data tables — and even that is arguably premature until there's a use case beyond the demo.

Filing this mainly to capture the idea and the design reasoning so the trade-off is on record. Happy to be argued out of it — option 1 is genuinely appealing for the "differential database" direction (a training loop whose state is a table), and a recursive-CTE loop (option 3) would be the dream once the engine/Substrait support lands.

_Context: came out of building the autograd feature (#192) and its gradient-descent / MNIST-MLP-in-SQL demos._

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support a mutable/appendable table for in-SQL iterative state (e.g. optimizer trajectories) #194

Motivation

What we'd want

Design tension (why "xarray-backed append" specifically is awkward)

Options

Do we think we should implement this?

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Support a mutable/appendable table for in-SQL iterative state (e.g. optimizer trajectories) #194

Description

Motivation

What we'd want

Design tension (why "xarray-backed append" specifically is awkward)

Options

Do we think we should implement this?

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions