Describe the bug
After the DataFusion 54 upgrade (#1562), importing datafusion or performing any Arrow-backed operation segfaults (SIGSEGV) when the installed PyArrow is 24.0.0 on macOS (arm64). The crash happens on the very first Arrow allocation made through the bindings, for example when building the literal lit(pa.scalar(0, type=pa.int32())). That is exactly what python/datafusion/functions/spark.py does at module import, so even a bare import datafusion crashes.
This is a regression introduced on the 54 upgrade branch; it does not affect the released datafusion-python 53.0.0.
Symptoms
import datafusion (or any operation that constructs an Arrow value) terminates the process with Segmentation fault: 11 (exit code 139). The native crash report points into PyArrow's own bundled mimalloc, not into our code:
mi_theap_malloc_zero_aligned_at_overalloc <- SIGSEGV (mimalloc v3 thread-heap)
mi_theap_realloc_zero_aligned_at
arrow::MimallocAllocator::ReallocateAligned
arrow::PoolBuffer::Resize
arrow::NumericBuilder<Int32Type>::FinishInternal
arrow::py::ConvertPySequence
__pyx_pw_7pyarrow_3lib_191scalar <- pa.scalar(0, type=int32())
Root cause
There are two independent mimalloc runtimes in the process:
- datafusion-python installs mimalloc as the Rust #[global_allocator] (crates/core/src/lib.rs, enabled by the default mimalloc feature).
- PyArrow 24 ships and defaults to its own bundled mimalloc memory pool.
The DataFusion 54 dependency bump moved libmimalloc-sys 0.1.44 -> 0.1.49 (the mimalloc crate 0.1.48 -> 0.1.52), which changed the bundled allocator from mimalloc v2 to mimalloc v3. PyArrow 24 also bundles mimalloc v3. Two mimalloc-v3 runtimes collide at the macOS process-global level (malloc-zone / thread-local-heap initialization), corrupting each other's thread heap and faulting on the first allocation.
The 53.0.0 release shipped mimalloc v2 (libmimalloc-sys 0.1.44), which coexists fine with PyArrow's v3 pool — which is why no released version is affected.
Affected versions / platforms
- PyArrow: 24.0.0 triggers it. PyArrow 20.0.0 through 23.0.1 are unaffected (verified against the 54-branch build).
- datafusion-python: the in-progress 54 upgrade branch. Released 53.0.0 is not affected (verified with PyArrow 20–24).
- Platforms: confirmed on macOS arm64. Linux is expected to be unaffected because PyArrow defaults to jemalloc there (only one mimalloc in the process). Windows defaults to mimalloc like macOS, so it is potentially affected, but the macOS-specific malloc-zone vector may not apply — needs verification in CI.
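The platform expectations above depend on which allocator PyArrow actually selects by default on each OS. A minimal sketch for checking this directly on a given machine or CI runner, rather than assuming it, is shown here. The helper name arrow_default_pool_backend is hypothetical; pa.default_memory_pool(), MemoryPool.backend_name and the ARROW_DEFAULT_MEMORY_POOL environment variable are existing PyArrow/Arrow features.

```python
import os
from typing import Optional


def arrow_default_pool_backend() -> Optional[str]:
    """Return the backend name of PyArrow's default memory pool.

    Returns None when PyArrow is not installed. This only imports pyarrow,
    not datafusion, so it does not load the Rust global allocator.
    """
    try:
        import pyarrow as pa
    except ImportError:
        return None
    return pa.default_memory_pool().backend_name


if __name__ == "__main__":
    # An explicit override, if set, changes which pool Arrow picks by default.
    print("ARROW_DEFAULT_MEMORY_POOL =", os.environ.get("ARROW_DEFAULT_MEMORY_POOL"))
    print("default pool backend =", arrow_default_pool_backend())
```

Running this on the Linux and Windows runners would show whether PyArrow's default pool there is mimalloc, which is the precondition for having two mimalloc runtimes in the process.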
Reproduction
On macOS arm64, with a 54-branch checkout of datafusion-python and pyarrow==24.0.0, remove the "v2" feature flag from the mimalloc dependency in Cargo.toml, rebuild, and run:
import datafusion # segfaults here (spark.py builds an int32 literal at import)
or, isolating the allocation:
import pyarrow as pa
from datafusion import lit
lit(pa.scalar(0, type=pa.int32())) # SIGSEGV
Suggested fix
Workaround to be introduced for the 54.0.0 release: pin the bundled allocator to the mimalloc v2 line so that two mimalloc-v3 runtimes never coexist. libmimalloc-sys (and the mimalloc crate) expose a v2 feature for this; adding it to the mimalloc feature list in crates/core/Cargo.toml keeps the Rust global allocator (no performance loss, no PyArrow pin) and resolves the crash. This has been verified locally: with the v2 feature, the 54-branch build runs cleanly against PyArrow 24.0.0.
A longer-term fix should investigate making two mimalloc-v3 instances coexist (or platform-gating the allocator), and we should add a CI smoke test that imports datafusion and constructs an Arrow literal against the newest PyArrow on macOS so this regression cannot return silently.
Acceptance / testing
The fix must include test coverage: a smoke test (run on macOS, and ideally Windows) that imports datafusion and builds an Arrow-backed literal under the newest supported PyArrow, asserting no crash.