fix: keep element order when writing F-contiguous chunks with vlen codecs#4116
fix: keep element order when writing F-contiguous chunks with vlen codecs#4116oldrobotdev wants to merge 3 commits into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4116 +/- ##
==========================================
+ Coverage 93.50% 93.58% +0.08%
==========================================
Files 90 90
Lines 11981 11981
==========================================
+ Hits 11203 11213 +10
+ Misses 778 768 -10
🚀 New features to boost your workflow:
|
|
thanks for this fix! Could you check if the other numcodecs codecs (delta, fixedscaleoffset, etc) are affected by the same bug? |
|
Checked, and yes: Delta, FixedScaleOffset and PackBits all round-trip transposed data for an F-contiguous chunk that exactly covers the selection, both when calling the numcodecs codec directly and through a zarr array with the filter configured. The path is the generic adapter in codecs/numcodecs/_codecs.py, which hands chunk_data.as_ndarray_like() to numcodecs unchanged while decode reshapes in C order, so every array-array filter goes through the same exposed path. I pushed the same ascontiguousarray treatment for the array-array and array-bytes adapter encode paths, with regression tests for the three filters above and the changelog entry updated to match. One probe gotcha worth mentioning for review: a test pattern that is symmetric under transpose passes even on broken code (arange % 3 on a 16-wide array is symmetric, since 16 ≡ 1 mod 3), so the tests use arange and a lower-triangular bool mask. I'll open the numcodecs issue about the underlying order='A' behavior next, since direct numcodecs users stay exposed either way. |
…decs The numcodecs vlen codecs flatten their input with order='A', so an F-contiguous object-dtype chunk is encoded in transposed element order while decode reshapes in C order, silently scrambling the round-trip. Make both vlen codec wrappers pass a C-contiguous array to numcodecs; this copies only the object-pointer array and only when the chunk is not already C-contiguous.
The generic numcodecs adapter codecs pass the chunk to numcodecs unchanged, so F-contiguous chunks hit the same order='A' flattening as the vlen codecs: Delta, FixedScaleOffset and PackBits all round-trip transposed data for an F-contiguous chunk that exactly covers the selection. Apply the same C-contiguity normalization in the array-array and array-bytes adapter encode paths.
96b892e to
c0be9f4
Compare
Summary
Fixes #3558. Writing an F-contiguous object-dtype array to a chunked string or bytes array scrambles the stored data whenever a chunk is covered exactly by the selection: the write path passes the chunk view to numcodecs without a copy, numcodecs flattens it with order='A' (column-major for F-contiguous input), and decode reshapes in C order. The wrappers in codecs/vlen_utf8.py now pass np.ascontiguousarray(...) to the codec, a no-op for C-contiguous chunks and a copy of the object-pointer array otherwise. Regression tests cover vlen-utf8 and vlen-bytes at both the exact-cover and partial-chunk shapes; the exact-cover cases fail on current main.
For reviewers
The one design question is the fix layer: this could live in numcodecs instead (change the flatten to C order), but that touches the v2 out= decode path where the old behavior was self-consistent for F-ordered arrays, so I kept the change on the zarr side where decode already commits to C order. The repro and codec-level probe are in #3558.
Author attestation
TODO
docs/user-guide/*.md(not applicable)changes/