Motivation
The autograd work (#192) makes it possible to run gradient descent in SQL — the gradient of a loss is AVG(grad(loss, θ)), and an update is θ_new = θ - lr * AVG(grad(loss, θ)). A natural, SQL-native way to express the optimizer is a parameter table that grows one row per step, so the whole training trajectory becomes a relation you can query (e.g. the loss curve in a single GROUP BY). See the discussion in #192 (comment).
Today that "append a new generation" step can't happen in SQL. The tables registered via XarrayContext.from_dataset are backed by a read-only, lazy Arrow stream provider (LazyArrowStreamTable), so there is no INSERT INTO. The demo works around it by rebuilding the whole history as an xarray Dataset and re-registering it each step:
def register_params():
ctx.deregister_table("params")
ds = xr.Dataset({"a": (("step",), a_hist), "b": (("step",), b_hist)},
coords={"step": steps})
ctx.from_dataset("params", ds, chunks={"step": len(steps)})
It's correct and fine for a demo, but it's the one un-SQL-native part of an otherwise fully-relational training loop, and it re-materializes O(history) each step.
What we'd want
Some way to do, in SQL:
INSERT INTO params
SELECT :next_step,
cur.a - lr * AVG(grad(loss, a)),
cur.b - lr * AVG(grad(loss, b))
FROM d CROSS JOIN (SELECT a, b FROM params WHERE step = :k) cur
GROUP BY cur.a, cur.b;
so the optimizer's state lives in the engine and the Python side only drives iteration (or, eventually, doesn't — see "recursive CTE" below).
Design tension (why "xarray-backed append" specifically is awkward)
The obvious framing — "make xarray tables appendable" — fights the data model:
- xarray/Zarr tables model immutable, dense, gridded arrays. A SQL row is one cell
(coords…, values…). "Appending a row" to a gridded array isn't an INSERT; it's growing/concatenating along a coordinate dimension (a rechunk/xr.concat), which only makes sense for some tables (a 1-D params(step)) and not others (a 3-D (time, lat, lon) field).
- The real need that surfaced is a small mutable scratch table for iteration state, which is conceptually separate from the large immutable data tables we read from Zarr.
So I'd frame this as "support mutable state tables," not "make xarray reads mutable."
Options
- Mutable in-memory scratch table (recommended if we do anything). Register a DataFusion-native mutable table (e.g. a
MemTable, which already supports INSERT INTO in DataFusion) for iteration state, kept separate from xarray-backed tables. Work needed: expose a way to create/register such a table from Python, and make the autograd Substrait rewrite aware of it (it currently resolves table schemas from _registered_datasets, so a non-xarray table would need to be included in _table_schemas() / registered as an empty table on the rewrite side).
- Appendable xarray/Zarr-backed provider. A
TableProvider that supports insert_into by appending along a dimension (or writing to a Zarr store). More work, and only natural for 1-D / append-along-one-dim tables; risks implying mutability for gridded data where it doesn't belong.
- Recursive CTE for the whole loop. The most ambitious "fully in SQL" answer: express N steps as
WITH RECURSIVE. Blocked today — DataFusion's recursive-CTE support is limited (aggregates in the recursive term), and recursive CTEs + scalar subqueries don't round-trip through Substrait yet (the demo already hit "Cannot convert <subquery> to Substrait"). Tracking-only for now.
Do we think we should implement this?
My recommendation: not a priority, and not as "appendable xarray tables." Reasons:
- The current re-registration workaround is correct and adequate for the demo; the per-step re-materialization cost is small for the state tables this is about.
- "Append to xarray-backed tables" cuts against the immutable-gridded-array model that is the library's core value; baking mutability into that provider adds surface area and conceptual confusion for little gain.
- If/when we want in-engine iteration, the clean path is a separate mutable scratch table (option 1), not mutating the data tables — and even that is arguably premature until there's a use case beyond the demo.
Filing this mainly to capture the idea and the design reasoning so the trade-off is on record. Happy to be argued out of it — option 1 is genuinely appealing for the "differential database" direction (a training loop whose state is a table), and a recursive-CTE loop (option 3) would be the dream once the engine/Substrait support lands.
Context: came out of building the autograd feature (#192) and its gradient-descent / MNIST-MLP-in-SQL demos.
Motivation
The autograd work (#192) makes it possible to run gradient descent in SQL — the gradient of a loss is
AVG(grad(loss, θ)), and an update isθ_new = θ - lr * AVG(grad(loss, θ)). A natural, SQL-native way to express the optimizer is a parameter table that grows one row per step, so the whole training trajectory becomes a relation you can query (e.g. the loss curve in a singleGROUP BY). See the discussion in #192 (comment).Today that "append a new generation" step can't happen in SQL. The tables registered via
XarrayContext.from_datasetare backed by a read-only, lazy Arrow stream provider (LazyArrowStreamTable), so there is noINSERT INTO. The demo works around it by rebuilding the whole history as an xarrayDatasetand re-registering it each step:It's correct and fine for a demo, but it's the one un-SQL-native part of an otherwise fully-relational training loop, and it re-materializes O(history) each step.
What we'd want
Some way to do, in SQL:
so the optimizer's state lives in the engine and the Python side only drives iteration (or, eventually, doesn't — see "recursive CTE" below).
Design tension (why "xarray-backed append" specifically is awkward)
The obvious framing — "make xarray tables appendable" — fights the data model:
(coords…, values…). "Appending a row" to a gridded array isn't anINSERT; it's growing/concatenating along a coordinate dimension (a rechunk/xr.concat), which only makes sense for some tables (a 1-Dparams(step)) and not others (a 3-D(time, lat, lon)field).So I'd frame this as "support mutable state tables," not "make xarray reads mutable."
Options
MemTable, which already supportsINSERT INTOin DataFusion) for iteration state, kept separate from xarray-backed tables. Work needed: expose a way to create/register such a table from Python, and make the autograd Substrait rewrite aware of it (it currently resolves table schemas from_registered_datasets, so a non-xarray table would need to be included in_table_schemas()/ registered as an empty table on the rewrite side).TableProviderthat supportsinsert_intoby appending along a dimension (or writing to a Zarr store). More work, and only natural for 1-D / append-along-one-dim tables; risks implying mutability for gridded data where it doesn't belong.WITH RECURSIVE. Blocked today — DataFusion's recursive-CTE support is limited (aggregates in the recursive term), and recursive CTEs + scalar subqueries don't round-trip through Substrait yet (the demo already hit "Cannot convert <subquery> to Substrait"). Tracking-only for now.Do we think we should implement this?
My recommendation: not a priority, and not as "appendable xarray tables." Reasons:
Filing this mainly to capture the idea and the design reasoning so the trade-off is on record. Happy to be argued out of it — option 1 is genuinely appealing for the "differential database" direction (a training loop whose state is a table), and a recursive-CTE loop (option 3) would be the dream once the engine/Substrait support lands.
Context: came out of building the autograd feature (#192) and its gradient-descent / MNIST-MLP-in-SQL demos.