fix: keep element order when writing F-contiguous chunks with vlen codecs #4116

Open
oldrobotdev wants to merge 3 commits into zarr-developers:main from oldrobotdev:fix-vlen-f-contiguous-write

Conversation

@oldrobotdev

Summary

Fixes #3558. Writing an F-contiguous object-dtype array to a chunked string or bytes array scrambles the stored data whenever a chunk is covered exactly by the selection: the write path passes the chunk view to numcodecs without a copy, numcodecs flattens it with order='A' (column-major for F-contiguous input), and decode reshapes in C order. The wrappers in codecs/vlen_utf8.py now pass np.ascontiguousarray(...) to the codec, a no-op for C-contiguous chunks and a copy of the object-pointer array otherwise. Regression tests cover vlen-utf8 and vlen-bytes at both the exact-cover and partial-chunk shapes; the exact-cover cases fail on current main.
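The ordering mismatch described above can be reproduced with NumPy alone. The following is a minimal sketch, not the zarr or numcodecs code: it assumes the encode side behaves like `reshape(-1, order="A")` and the decode side like a C-order reshape, as the summary states, and the variable names are illustrative only.

```python
import numpy as np

# A 2x3 object-dtype chunk stored in Fortran (column-major) memory order.
chunk = np.asfortranarray(
    np.array([["a", "b", "c"], ["d", "e", "f"]], dtype=object)
)

# Encode side (assumed): order="A" follows the memory layout, so an
# F-contiguous array is flattened column by column.
flat_a = chunk.reshape(-1, order="A")

# Decode side (assumed): the flat sequence is reshaped in C (row-major) order.
scrambled = flat_a.reshape(chunk.shape)

# With the normalization applied first, order="A" sees a C-contiguous array.
normalized = np.ascontiguousarray(chunk)
round_trip = normalized.reshape(-1, order="A").reshape(chunk.shape)

# For input that is already C-contiguous, no data is copied.
c_chunk = np.array([["a", "b", "c"], ["d", "e", "f"]], dtype=object)
no_copy = np.shares_memory(np.ascontiguousarray(c_chunk), c_chunk)

print(scrambled.tolist())   # [['a', 'd', 'b'], ['e', 'c', 'f']]
print(round_trip.tolist())  # [['a', 'b', 'c'], ['d', 'e', 'f']]
print(no_copy)              # True
```

Only the array of object pointers is copied for the F-contiguous case; the Python string objects themselves are shared between `chunk` and `normalized`.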

For reviewers

The one design question is the fix layer: this could live in numcodecs instead (change the flatten to C order), but that touches the v2 out= decode path where the old behavior was self-consistent for F-ordered arrays, so I kept the change on the zarr side where decode already commits to C order. The repro and codec-level probe are in #3558.

Author attestation

  • I am a human, these are my changes, and I have reviewed and understood every change and can explain why each is correct.

TODO

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions (no user-facing API change)
  • New/modified features documented in docs/user-guide/*.md (not applicable)
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions Bot added and then removed the "needs release notes" label (automatically applied to PRs which haven't added release notes) on Jul 2, 2026
@codecov

codecov Bot commented Jul 2, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 93.58%. Comparing base (4b68c2c) to head (c0be9f4).

Files with missing lines                 Patch %   Lines
src/zarr/codecs/numcodecs/_codecs.py     50.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4116      +/-   ##
==========================================
+ Coverage   93.50%   93.58%   +0.08%     
==========================================
  Files          90       90              
  Lines       11981    11981              
==========================================
+ Hits        11203    11213      +10     
+ Misses        778      768      -10     
Files with missing lines                 Coverage Δ
src/zarr/codecs/vlen_utf8.py             92.42% <ø> (+15.15%) ⬆️
src/zarr/codecs/numcodecs/_codecs.py     96.38% <50.00%> (ø)

@d-v-b

d-v-b commented Jul 2, 2026

Contributor

Thanks for this fix! Could you check whether the other numcodecs codecs (Delta, FixedScaleOffset, etc.) are affected by the same bug?

@oldrobotdev

Author

Checked, and yes: Delta, FixedScaleOffset and PackBits all round-trip transposed data for an F-contiguous chunk that exactly covers the selection, both when calling the numcodecs codec directly and through a zarr array with the filter configured. The path is the generic adapter in codecs/numcodecs/_codecs.py, which hands chunk_data.as_ndarray_like() to numcodecs unchanged while decode reshapes in C order, so every array-array filter goes through the same exposed path.
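To illustrate why an array-array filter is exposed in the same way, here is a toy delta filter written in plain NumPy. It is a hypothetical stand-in, not `numcodecs.Delta`: the functions `toy_delta_encode` and `toy_delta_decode` are invented for this sketch and only mirror the behavior described in this comment (flatten with order="A" on encode, reshape in C order on decode).

```python
import numpy as np

def toy_delta_encode(arr):
    # Assumption: the filter flattens its input following memory layout.
    flat = arr.reshape(-1, order="A")
    enc = np.empty_like(flat)
    enc[0] = flat[0]
    enc[1:] = np.diff(flat)
    return enc

def toy_delta_decode(enc, shape):
    # Assumption: the caller reshapes the decoded values in C order.
    return np.cumsum(enc).reshape(shape)

data = np.arange(9).reshape(3, 3)
f_chunk = np.asfortranarray(data)

# Unnormalized F-contiguous input: the round trip yields the transpose.
broken = toy_delta_decode(toy_delta_encode(f_chunk), f_chunk.shape)

# Normalizing to C order before encoding restores the original.
fixed = toy_delta_decode(
    toy_delta_encode(np.ascontiguousarray(f_chunk)), f_chunk.shape
)

print(np.array_equal(broken, data.T))  # True
print(np.array_equal(fixed, data))     # True
```

The filter itself is lossless in both cases; only the element order that the encoder sees differs from the order that the decoder's caller assumes.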

I pushed the same ascontiguousarray treatment for the array-array and array-bytes adapter encode paths, with regression tests for the three filters above and the changelog entry updated to match. One probe gotcha worth mentioning for review: a test pattern that is symmetric under transpose passes even on broken code (arange % 3 on a 16-wide array is symmetric, since 16 ≡ 1 mod 3), so the tests use arange and a lower-triangular bool mask.
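The probe gotcha can be checked directly. This sketch assumes a square 16x16 test array (the comment only says "16-wide"), and `survives_transpose` is a helper invented for the illustration.

```python
import numpy as np

n = 16  # assumed square shape for the illustration

# Element [i, j] is (16*i + j) % 3, which equals (i + j) % 3 because 16 % 3 == 1.
symmetric = (np.arange(n * n) % 3).reshape(n, n)

# The two patterns the tests use instead.
plain = np.arange(n * n).reshape(n, n)
mask = np.tril(np.ones((n, n), dtype=bool))

def survives_transpose(a):
    # True when a transposing bug would leave the data unchanged.
    return np.array_equal(a, a.T)

print(survives_transpose(symmetric))  # True: this pattern cannot detect the bug
print(survives_transpose(plain))      # False
print(survives_transpose(mask))       # False
```

A pattern that equals its own transpose round-trips identically whether or not the element order is preserved, so it gives a passing test on broken code.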

I'll open the numcodecs issue about the underlying order='A' behavior next, since direct numcodecs users stay exposed either way.

…decs

The numcodecs vlen codecs flatten their input with order='A', so an
F-contiguous object-dtype chunk is encoded in transposed element order
while decode reshapes in C order, silently scrambling the round-trip.
Make both vlen codec wrappers pass a C-contiguous array to numcodecs;
this copies only the object-pointer array and only when the chunk is
not already C-contiguous.

The generic numcodecs adapter codecs pass the chunk to numcodecs
unchanged, so F-contiguous chunks hit the same order='A' flattening as
the vlen codecs: Delta, FixedScaleOffset and PackBits all round-trip
transposed data for an F-contiguous chunk that exactly covers the
selection. Apply the same C-contiguity normalization in the array-array
and array-bytes adapter encode paths.
@oldrobotdev oldrobotdev force-pushed the fix-vlen-f-contiguous-write branch from 96b892e to c0be9f4 Compare July 3, 2026 06:17

Development

Successfully merging this pull request may close these issues.

Wrong array write when writing chunked array from numpy string data with "F" order

2 participants