DataFusion: aggregating a dictionary column across partitions overflows the key type ("Dictionary key bigger than the key type")

Tracking issue for an upstream DataFusion bug (to file against `apache/datafusion`). One of two blockers for dictionary-encoding coordinate columns (#217).

## Summary

`GROUP BY` over a dictionary-typed column fails when the *combined* dictionary across partitions exceeds the per-batch key width:

```
DataFusion error: Arrow error: Dictionary key bigger than the key type
```

Each input partition is a **valid** `Dictionary(Int8, Int64)` — its dictionary has ≤128 distinct values, so an `Int8` key is legal *per batch*. But the partitions carry **disjoint** values, so when the aggregate combines them the combined dictionary exceeds 128 entries and the `Int8` key overflows.

## Minimal repro

Pure `datafusion` + `pyarrow`, no xarray-sql, no network: [`repros/datafusion/dict_key_overflow.py`](https://github.com/xqlsystems/xarray-sql/blob/claude/datafusion-upstream-repros-fs1bqv/repros/datafusion/dict_key_overflow.py) (branch `claude/datafusion-upstream-repros-fs1bqv`; a paste-in Rust test is in the [README](https://github.com/xqlsystems/xarray-sql/blob/claude/datafusion-upstream-repros-fs1bqv/repros/datafusion/README.md)).

```python
# 50 partitions, each Dictionary(Int8, Int64) of 100 DISJOINT values
#   -> combined cardinality 5000 >> Int8 max (127)
ctx.sql("SELECT k, SUM(v) FROM t GROUP BY k")  # -> "Dictionary key bigger than the key type"
```

The disjoint values matter: when every partition shares the *same* dictionary values, arrow unifies them and there is no overflow — which is why this is intermittent and version-sensitive in the wild (it does not reproduce on arrow-rs 58.3 when the values coincide).

## Expected

Combining dictionary-typed columns across partitions should not fail on a key type that was valid for each input batch. It should widen the key (Int8 → Int16 → …) or decode, since a producer cannot know per batch how large the *combined* dictionary will become downstream.

## Environment

datafusion 54.0.0 / pyarrow 23.0.0 (arrow-rs 58.3.0).

## Impact on xarray-sql

Blocks dictionary-encoding coordinate columns (#217): an unchunked coordinate repeated across partitions, or a chunked coordinate whose partitions hold disjoint slices, can exceed a narrow key under streaming aggregation. Reported downstream in #217 by @ghostiee-11.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DataFusion: aggregating a dictionary column across partitions overflows the key type ("Dictionary key bigger than the key type") #220

Summary

Minimal repro

Expected

Environment

Impact on xarray-sql

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

DataFusion: aggregating a dictionary column across partitions overflows the key type ("Dictionary key bigger than the key type") #220

Description

Summary

Minimal repro

Expected

Environment

Impact on xarray-sql

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions