cuda.core: make managed-prefetch test page-size aware #2268

Open
rparolin wants to merge 1 commit into NVIDIA:main from rparolin:fix/managed-prefetch-page-size

@rparolin

Collaborator

Summary

TestPrefetchBatch.test_per_buffer_location started failing on the CTK 13.x linux-aarch64 CI configs after the runners moved to the nvidia-64k (64 KB page-size) kernel:

>       assert last0 == _HOST_LOCATION_ID
E       assert 0 == -1

Root cause

The test hardcoded a 4096-byte allocation and assumed two pooled buffers landed on separate physical pages. Managed-memory prefetch and CU_MEM_RANGE_ATTRIBUTE_LAST_PREFETCH_LOCATION operate at page granularity, and ManagedMemoryResource is a pool, so the two allocate(4096) calls are packed adjacently (pointers 4 KB apart).

  • 4 KB pages: each 4 KB buffer is its own page → per-buffer prefetch is independent → passes.
  • 64 KB pages: both buffers share one 64 KB page → prefetching bufs[1] to the device migrates the whole shared page → querying bufs[0] reports device 0 instead of host (-1) → assert 0 == -1.

The prefetch itself worked correctly; the test's premise (that each allocation can be prefetched independently) only holds when the buffer size is at least the page size and the buffers do not share a page. The failure is latent on any genuine 64 KB-page platform (Grace / Grace-Blackwell), so reverting the runner kernel only masks it.
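The page-sharing arithmetic above can be checked without a GPU. The following is a minimal sketch, not code from this PR: the helper names and the base address are hypothetical, and it only models which pages two adjacent 4 KB buffers touch under each page size.

```python
def pages_spanned(ptr: int, nbytes: int, page_size: int) -> range:
    """Return the indices of every page touched by [ptr, ptr + nbytes)."""
    first = ptr // page_size
    last = (ptr + nbytes - 1) // page_size
    return range(first, last + 1)


def buffers_share_page(ptr_a: int, ptr_b: int, nbytes: int, page_size: int) -> bool:
    """True if at least one page is touched by both buffers."""
    a = pages_spanned(ptr_a, nbytes, page_size)
    b = pages_spanned(ptr_b, nbytes, page_size)
    # Two half-open index ranges overlap when each starts before the other ends.
    return a.start < b.stop and b.start < a.stop


# Hypothetical 64 KB-aligned base address; the pool packs the second
# 4 KB buffer immediately after the first, as described above.
BASE = 0x7F00_0000_0000
BUF_SIZE = 4096
ptr0, ptr1 = BASE, BASE + BUF_SIZE

shares_4k = buffers_share_page(ptr0, ptr1, BUF_SIZE, 4 * 1024)
shares_64k = buffers_share_page(ptr0, ptr1, BUF_SIZE, 64 * 1024)

print(f"4 KB pages:  buffers share a page = {shares_4k}")
print(f"64 KB pages: buffers share a page = {shares_64k}")
```

With 4 KB pages the two buffers fall on separate pages, so their prefetch locations are tracked independently; with 64 KB pages they fall on the same page, which matches the reported assert 0 == -1.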

Fix

  • Derive _MANAGED_TEST_ALLOCATION_SIZE from mmap.PAGESIZE so each buffer occupies a full page on every platform (no hardcoded page size).
  • Add a precondition asserting the two buffers sit on distinct physical pages, so a future pool-packing change fails loudly instead of silently migrating a shared page.

Verification

pytest tests/memory/test_managed_ops.py: 34 passed, 1 skipped on an x86 (4 KB page) RTX 5880 Ada box, including the guarded test_per_buffer_location.

🤖 Generated with Claude Code

TestPrefetchBatch.test_per_buffer_location hardcoded a 4096-byte
allocation and assumed two pooled buffers landed on separate physical
pages. Managed-memory prefetch and CU_MEM_RANGE_ATTRIBUTE_LAST_PREFETCH_LOCATION
operate at page granularity, so on nvidia-64k aarch64 kernels both
4 KB buffers shared one 64 KB page; prefetching buf[1] to the device
migrated the shared page and buf[0]'s host prefetch reported device 0
(assert 0 == -1).

Derive the allocation size from mmap.PAGESIZE so each buffer occupies a
full page on every platform, and add a precondition asserting the two
buffers sit on distinct pages so a pool-packing regression fails loudly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rparolin added this to the cuda.core next milestone on Jun 25, 2026
rparolin added the bug, test, and cuda.core labels on Jun 25, 2026
rparolin requested review from kkraus14 and leofang on Jun 25, 2026, 21:36
rparolin self-assigned this on Jun 25, 2026
@mdboom

mdboom commented Jun 25, 2026

Contributor

Should fix #2267.
