Segfault with PyArrow 24 on macOS when using mimalloc v3 #1607

Description

@timsaucer

Describe the bug

After the DataFusion 54 upgrade (#1562), importing datafusion, or performing any Arrow-backed operation, segfaults (SIGSEGV) when the installed PyArrow is 24.0.0 on macOS (arm64). The crash happens on the very first Arrow allocation made through the bindings, for example when building a literal with lit(pa.scalar(0, type=pa.int32())). That is exactly what python/datafusion/functions/spark.py does at module import, so even a bare import datafusion crashes.

This is a regression introduced on the 54 upgrade branch; it does not affect the released datafusion-python 53.0.0.

Symptoms

import datafusion (or any operation that constructs an Arrow value) terminates the process with Segmentation fault: 11 (exit code 139). The native crash report points into PyArrow's own bundled mimalloc, not into our code:

mi_theap_malloc_zero_aligned_at_overalloc        <- SIGSEGV (mimalloc v3 thread-heap)
mi_theap_realloc_zero_aligned_at
arrow::MimallocAllocator::ReallocateAligned
arrow::PoolBuffer::Resize
arrow::NumericBuilder<Int32Type>::FinishInternal
arrow::py::ConvertPySequence
__pyx_pw_7pyarrow_3lib_191scalar                 <- pa.scalar(0, type=int32())
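
For reference, exit code 139 is the shell's way of reporting "killed by signal 11": shells conventionally report 128 plus the signal number, while Python's subprocess module reports the same event as a negative return code. The helper below is only an illustrative sketch (describe_exit_status is a hypothetical name, not part of datafusion-python), and it assumes the 128 + N convention applies, since a program could in principle also exit normally with a code above 128.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status or a Python subprocess returncode."""
    if status < 0:
        # subprocess convention on POSIX: -N means "killed by signal N"
        signum = -status
    elif status > 128:
        # shell convention: 128 + N means "killed by signal N"
        signum = status - 128
    else:
        return f"exited normally with code {status}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform
        name = "unknown signal"
    return f"terminated by {name} ({signum})"


print(describe_exit_status(139))  # terminated by SIGSEGV (11)
```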

Root cause

There are two independent mimalloc runtimes in the process:

  • datafusion-python installs mimalloc as the Rust #[global_allocator] (crates/core/src/lib.rs, enabled by the default mimalloc feature).
  • PyArrow 24 ships and defaults to its own bundled mimalloc memory pool.

The DataFusion 54 dependency bump moved libmimalloc-sys 0.1.44 -> 0.1.49 (the mimalloc crate 0.1.48 -> 0.1.52), which changed the bundled allocator from mimalloc v2 to mimalloc v3. PyArrow 24 also bundles mimalloc v3. Two mimalloc-v3 runtimes collide at the macOS process-global level (malloc-zone / thread-local-heap initialization), corrupting each other's thread heap and faulting on the first allocation.

The 53.0.0 release shipped mimalloc v2 (libmimalloc-sys 0.1.44), which coexists fine with PyArrow's v3 pool — which is why no released version is affected.

Affected versions / platforms

  • PyArrow: 24.0.0 triggers it. PyArrow 20.0.0 through 23.0.1 are unaffected (verified against the 54-branch build).
  • datafusion-python: the in-progress 54 upgrade branch. Released 53.0.0 is not affected (verified with PyArrow 20–24).
  • Platforms: confirmed on macOS arm64. Linux is expected to be unaffected because PyArrow defaults to jemalloc there (only one mimalloc in the process). Windows defaults to mimalloc like macOS, so it is potentially affected, but the macOS-specific malloc-zone vector may not apply — needs verification in CI.
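
When checking whether a given machine falls into the affected combination, it may help to print the platform, the PyArrow version, and the memory pool backend PyArrow actually selected. The following is a minimal sketch: describe_environment is a hypothetical helper, and it deliberately does not import datafusion, since that import is what crashes on an affected setup. It relies on pyarrow.default_memory_pool().backend_name and falls back to None if PyArrow is not installed.

```python
import platform


def describe_environment() -> dict:
    """Collect the details relevant to this bug without importing datafusion."""
    info = {
        "python": platform.python_version(),
        "system": platform.system(),    # e.g. "Darwin", "Linux", "Windows"
        "machine": platform.machine(),  # e.g. "arm64", "x86_64"
        "pyarrow": None,
        "arrow_pool_backend": None,
    }
    try:
        import pyarrow as pa
    except ImportError:
        # PyArrow is not installed; report platform details only
        return info
    info["pyarrow"] = pa.__version__
    # Name of the default pool backend: "mimalloc", "jemalloc", or "system"
    info["arrow_pool_backend"] = pa.default_memory_pool().backend_name
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Per the analysis above, a report showing Darwin, arm64, PyArrow 24.0.0, and the mimalloc backend would match the configuration where the crash was confirmed.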

Reproduction

On macOS arm64, with pyarrow==24.0.0 installed, build datafusion-python from the 54 branch without the "v2" feature flag on mimalloc in Cargo.toml (remove the flag if the workaround is already applied). Then run:

import datafusion  # segfaults here (spark.py builds an int32 literal at import)

or, isolating the allocation:

import pyarrow as pa
from datafusion import lit
lit(pa.scalar(0, type=pa.int32()))  # SIGSEGV

Suggested fix

Workaround to be introduced for the 54.0.0 release: pin the bundled allocator to the mimalloc v2 line so that two mimalloc-v3 runtimes never coexist. libmimalloc-sys (and the mimalloc crate) expose a v2 feature for this. Adding it to the mimalloc feature list in crates/core/Cargo.toml keeps mimalloc as the Rust global allocator (no performance loss, no PyArrow pin) and resolves the crash. This has been verified locally: with the v2 feature enabled, the 54-branch build runs cleanly against PyArrow 24.0.0.

A longer-term fix should investigate making two mimalloc-v3 instances coexist (or platform-gating the allocator), and we should add a CI smoke test that imports datafusion and constructs an Arrow literal against the newest PyArrow on macOS so this regression cannot return silently.

Acceptance / testing

The fix must include test coverage: a smoke test (run on macOS, and ideally Windows) that imports datafusion and builds an Arrow-backed literal under the newest supported PyArrow, asserting no crash.
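
One possible shape for such a smoke test is sketched below. Because a segfault kills the interpreter, the snippet under test is run in a child process so that the test runner survives and can report the failure. The names run_isolated, fatal_signal, assert_snippet_survives, and REPRO_SNIPPET are hypothetical, and how this would be wired into the project's CI is an assumption left open here.

```python
import signal
import subprocess
import sys
from typing import Optional

# The reproduction from this issue, executed only inside a child interpreter.
REPRO_SNIPPET = """
import pyarrow as pa
from datafusion import lit
lit(pa.scalar(0, type=pa.int32()))
"""


def run_isolated(code: str, timeout: float = 120.0) -> subprocess.CompletedProcess:
    """Run `code` in a fresh interpreter so a crash cannot take down the caller."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def fatal_signal(proc: subprocess.CompletedProcess) -> Optional[str]:
    """Return the name of the signal that killed the child, or None."""
    if proc.returncode >= 0:
        return None
    # On POSIX, subprocess reports death by signal N as returncode -N
    try:
        return signal.Signals(-proc.returncode).name
    except ValueError:
        return f"signal {-proc.returncode}"


def assert_snippet_survives(code: str) -> None:
    """Fail with a readable message if the child crashed or exited non-zero."""
    proc = run_isolated(code)
    sig = fatal_signal(proc)
    assert sig is None, f"child interpreter was killed by {sig}"
    assert proc.returncode == 0, f"child exited with {proc.returncode}: {proc.stderr}"
```

A pytest test could then simply call assert_snippet_survives(REPRO_SNIPPET). On Windows a native crash shows up as a large non-zero return code rather than a negative one, so the second assertion is the one that would catch it there.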
