Tracking issue for an upstream DataFusion bug (to file against apache/datafusion). One of two blockers for dictionary-encoding coordinate columns (#217).
Summary
GROUP BY over a dictionary-typed column fails when the combined dictionary across partitions exceeds the per-batch key width:
DataFusion error: Arrow error: Dictionary key bigger than the key type
Each input partition is a valid Dictionary(Int8, Int64) — its dictionary has ≤128 distinct values, so an Int8 key is legal per batch. But the partitions carry disjoint values, so when the aggregate combines them the combined dictionary exceeds 128 entries and the Int8 key overflows.
Minimal repro
Pure datafusion + pyarrow, no xarray-sql, no network: repros/datafusion/dict_key_overflow.py (branch claude/datafusion-upstream-repros-fs1bqv; a paste-in Rust test is in the README).
# 50 partitions, each Dictionary(Int8, Int64) of 100 DISJOINT values
# -> combined cardinality 5000 >> Int8 max (127)
ctx.sql("SELECT k, SUM(v) FROM t GROUP BY k") # -> "Dictionary key bigger than the key type"
The disjoint values matter: when every partition shares the same dictionary values, arrow unifies them and there is no overflow — which is why this is intermittent and version-sensitive in the wild (it does not reproduce on arrow-rs 58.3 when the values coincide).
Expected
Combining dictionary-typed columns across partitions should not fail on a key type that was valid for each input batch. It should widen the key (Int8 → Int16 → …) or decode, since a producer cannot know per batch how large the combined dictionary will become downstream.
Environment
datafusion 54.0.0 / pyarrow 23.0.0 (arrow-rs 58.3.0).
Impact on xarray-sql
Blocks dictionary-encoding coordinate columns (#217): an unchunked coordinate repeated across partitions, or a chunked coordinate whose partitions hold disjoint slices, can exceed a narrow key under streaming aggregation. Reported downstream in #217 by @ghostiee-11.
Tracking issue for an upstream DataFusion bug (to file against
apache/datafusion). One of two blockers for dictionary-encoding coordinate columns (#217).Summary
GROUP BYover a dictionary-typed column fails when the combined dictionary across partitions exceeds the per-batch key width:Each input partition is a valid
Dictionary(Int8, Int64)— its dictionary has ≤128 distinct values, so anInt8key is legal per batch. But the partitions carry disjoint values, so when the aggregate combines them the combined dictionary exceeds 128 entries and theInt8key overflows.Minimal repro
Pure
datafusion+pyarrow, no xarray-sql, no network:repros/datafusion/dict_key_overflow.py(branchclaude/datafusion-upstream-repros-fs1bqv; a paste-in Rust test is in the README).The disjoint values matter: when every partition shares the same dictionary values, arrow unifies them and there is no overflow — which is why this is intermittent and version-sensitive in the wild (it does not reproduce on arrow-rs 58.3 when the values coincide).
Expected
Combining dictionary-typed columns across partitions should not fail on a key type that was valid for each input batch. It should widen the key (Int8 → Int16 → …) or decode, since a producer cannot know per batch how large the combined dictionary will become downstream.
Environment
datafusion 54.0.0 / pyarrow 23.0.0 (arrow-rs 58.3.0).
Impact on xarray-sql
Blocks dictionary-encoding coordinate columns (#217): an unchunked coordinate repeated across partitions, or a chunked coordinate whose partitions hold disjoint slices, can exceed a narrow key under streaming aggregation. Reported downstream in #217 by @ghostiee-11.