Skip to content

ghenvtest#2

Closed
rppt wants to merge 1 commit into
masterfrom
ghenv
Closed

ghenvtest#2
rppt wants to merge 1 commit into
masterfrom
ghenv

Conversation

@rppt

@rppt rppt commented May 9, 2026

Copy link
Copy Markdown
Owner

No description provided.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
@rppt rppt closed this May 9, 2026
rppt pushed a commit that referenced this pull request Jun 2, 2026
…kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 7.1, take #2

- Add the pKVM side of the workaround for ARM's erratum 4193714, provided
  that the EL3 firmware does its part of the job. KVM will refuse to
  initialise otherwise.

- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
  running under NV with E2H==0.

- Correctly deal with permission faults in guest_memfd memslots.

- Fix the steal-time selftest after the infrastructure was reworked.

- Make sure the host cannot pass a non-sensical clock update to the
  EL2 tracing infrastructure.

- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
  ability to run arm64 guests, which will inevitably lead to arm64
  code being directly used on s390.

- Make sure that EL2 is configured with both exception entry and exit
  being Context Synchronization Events.

- Handle the current vcpu being NULL on EL2 panic.

- Fix the selftest_vcpu memcache being empty at the point of donation or
  sharing.

- Check that the memcache has enough capacity before engaging on the
  share/donate path.

- Fix __deactivate_fgt() to use its parameter rather than a variable
  in the macro context.
rppt pushed a commit that referenced this pull request Jun 2, 2026
'pi2_timer' needs to be initialized in all error paths of dualpi2_init():
otherwise, a failure in qdisc_create_dflt() causes the following crash in
dualpi2_destroy():

  # tc qdisc add dev crash0 handle 1: root dualpi2
  BUG: kernel NULL pointer dereference, address: 0000000000000010
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] SMP PTI
  CPU: 4 UID: 0 PID: 471 Comm: tc Tainted: G            E       7.1.0-rc1-virtme #2 PREEMPT(full)
  Tainted: [E]=UNSIGNED_MODULE
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  RIP: 0010:hrtimer_active+0x39/0x60
  Code: f9 eb 23 0f b6 41 38 3c 01 0f 87 87 64 c0 ff 83 e0 01 75 33 48 39 4a 50 74 28 44 3b 42 10 75 06 48 3b 51 30 74 21 48 8b 51 30 <44> 8b 42 10 41 f6 c0 01 74 cf f3 90 44 8b 42 10 41 f6 c0 01 74 c3
  RSP: 0018:ffffd0db80b93620 EFLAGS: 00010282
  RAX: ffffffffc0400320 RBX: ffff8cf24a4c86b8 RCX: ffff8cf24a4c86b8
  RDX: 0000000000000000 RSI: ffff8cf2429c2ab0 RDI: ffff8cf24a4c86b8
  RBP: 00000000fffffff4 R08: 0000000000000003 R09: 0000000000000000
  R10: 0000000000000001 R11: ffff8cf24a39c500 R12: ffff8cf24822c000
  R13: ffffd0db80b936c0 R14: ffffffffc02cf360 R15: 00000000ffffffff
  FS:  00007fbc01706580(0000) GS:ffff8cf2dc759000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000010 CR3: 0000000008e02003 CR4: 0000000000172ef0
  Call Trace:
   <TASK>
   hrtimer_cancel+0x15/0x40
   dualpi2_destroy+0x20/0x40 [sch_dualpi2]
   qdisc_create+0x230/0x570
   tc_modify_qdisc+0x716/0xc10
   rtnetlink_rcv_msg+0x188/0x780
   netlink_rcv_skb+0xcd/0x150
   netlink_unicast+0x1ba/0x290
   netlink_sendmsg+0x242/0x4d0
   ____sys_sendmsg+0x39e/0x3e0
   ___sys_sendmsg+0xe1/0x130
   __sys_sendmsg+0xad/0x110
   do_syscall_64+0x14f/0xf80
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7fbc0188b08e
  Code: 4d 89 d8 e8 94 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 03 ff ff ff 0f 1f 00 f3 0f 1e fa
  RSP: 002b:00007fff593260e0 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc0188b08e
  RDX: 0000000000000000 RSI: 00007fff59326190 RDI: 0000000000000003
  RBP: 00007fff593260f0 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000202 R12: 000055f06124f260
  R13: 0000000069fca043 R14: 000055f061255640 R15: 000055f06124d3f8
   </TASK>
  Modules linked in: sch_dualpi2(E)
  CR2: 0000000000000010

[1] https://lore.kernel.org/netdev/2e78e01c504c633ebdff18d041833cf2e079a3a4.1607020450.git.dcaratti@redhat.com/
[2] https://lore.kernel.org/netdev/20200725201707.16909-1-xiyou.wangcong@gmail.com/

v2:
 - rebased on top of latest net.git

Fixes: 320d031 ("sched: Struct definition and parsing of dualpi2 qdisc")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/1faca91179702b31da5d87653e1e036543e32722.1778259798.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rppt pushed a commit that referenced this pull request Jun 2, 2026
The reset actions of the DEFZA adapters are exceedingly slow, taking up
to 30 seconds to complete by the device spec and typically in the range
of 10 seconds in reality, as required for the device RTOS to boot, still
quite a lot.  Therefore a state machine is used that's interrupt driven,
however a safety mechanism is required in case of adapter malfunction,
so that if no state change interrupt has arrived in time, then the
situation is taken care of.

The safety mechanism depends on the origin of the reset.  For regular
adapter initialisation at the device probe time a sleep is requested.
However a reset is also required by the device spec when the adapter has
transitioned into the halted state, such as in response to a PC Trace
event in the course of ring fault recovery, possibly a common network
event.  In that case no sleep is possible as a device halt is reported
at the hardirq level.

A timer is therefore set up to ensure progress in case no adapter state
change interrupt has arrived in time, but as from commit 168f6b6
("timers: Use del_timer_sync() even on UP") a warning is issued as the
timer is deleted in the hardirq handler upon an expected state change:

  defza: v.1.1.4  Oct  6 2018  Maciej W. Rozycki
  tc2: DEC FDDIcontroller 700 or 700-C at 0x18000000, irq 4
  tc2: resetting the board...
  ------------[ cut here ]------------
  WARNING: kernel/time/timer.c:1611 at __timer_delete_sync+0x104/0x120, CPU#0: swapper/0/0
  Modules linked in:
  CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.0.0-dirty #2 VOLUNTARY
  Stack : 9800000002027d08 00000000140120e0 0000000000000000 ffffffff8089d468
          0000000000000000 0000000000000000 ffffffff807ed6b8 ffffffff80897458
          ffffffff80897400 9800000002027b88 0000000000000000 7070617773203a6d
          0000000000000000 9800000002027ba4 0000000000001000 6465746e69617420
          0000000000000000 ffffffff807ed6b8 00000000140120e0 0000000000000009
          000000000000064b ffffffff800dd14c 0000000000000036 9800000002184000
          0000000000000000 0000000000000020 0000000000000000 ffffffff80910000
          ffffffff8085c000 9800000002027c70 0000000000000001 ffffffff80045fa0
          0000000000000000 0000000000000000 0000000000000000 0000000000000009
          000000000000064b ffffffff800502b8 ffffffff807ed6b8 ffffffff80045fa0
          ...
  Call Trace:
  [<ffffffff800502b8>] show_stack+0x28/0xf0
  [<ffffffff80045fa0>] dump_stack_lvl+0x48/0x7c
  [<ffffffff80068c98>] __warn+0xa0/0x128
  [<ffffffff8004120c>] warn_slowpath_fmt+0x64/0xa4
  [<ffffffff800dd14c>] __timer_delete_sync+0x104/0x120
  [<ffffffff804934ac>] fza_interrupt+0xc74/0xeb8
  [<ffffffff800c6390>] __handle_irq_event_percpu+0x70/0x228
  [<ffffffff800c6560>] handle_irq_event_percpu+0x18/0x78
  [<ffffffff800cc320>] handle_percpu_irq+0x50/0x80
  [<ffffffff800c5970>] generic_handle_irq+0x90/0xd0
  [<ffffffff806e956c>] do_IRQ+0x1c/0x30
  [<ffffffff8004ad4c>] handle_int+0x148/0x154
  [<ffffffff800ab7c0>] do_idle+0x40/0x108
  [<ffffffff800abb0c>] cpu_startup_entry+0x2c/0x38
  [<ffffffff806dfec8>] kernel_init+0x0/0x108

  ---[ end trace 0000000000000000 ]---
  tc2: OK
  tc2: model 700 (DEFZA-AA), MMF PMD, address 08-00-2b-xx-xx-xx
  tc2: ROM rev. 1.0, firmware rev. 1.2, RMC rev. A, SMT ver. 1
  tc2: link unavailable
  ------------[ cut here ]------------
  WARNING: kernel/time/timer.c:1611 at __timer_delete_sync+0x104/0x120, CPU#0: swapper/0/0
  Modules linked in:
  CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G        W           7.0.0-dirty #2 VOLUNTARY
  Tainted: [W]=WARN
  Stack : 9800000002027d08 00000000140120e0 0000000000000000 ffffffff8089d468
          0000000000000000 0000000000000000 ffffffff807ed6b8 ffffffff80897458
          ffffffff80897400 9800000002027b88 0000000000000000 0000000000000000
          0000000000000000 9800000002027ba4 0000000000001000 0000000000000000
          0000000000000000 ffffffff807ed6b8 00000000140120e0 0000000000000009
          000000000000064b ffffffff800dd14c 0000000000000036 9800000002184000
          0000000000000000 0000000000000020 0000000000000000 ffffffff80910000
          ffffffff8085c000 9800000002027c70 0000000000000001 ffffffff80045fa0
          0000000000000000 0000000000000000 0000000000000000 0000000000000009
          000000000000064b ffffffff800502b8 ffffffff807ed6b8 ffffffff80045fa0
          ...
  Call Trace:
  [<ffffffff800502b8>] show_stack+0x28/0xf0
  [<ffffffff80045fa0>] dump_stack_lvl+0x48/0x7c
  [<ffffffff80068c98>] __warn+0xa0/0x128
  [<ffffffff8004120c>] warn_slowpath_fmt+0x64/0xa4
  [<ffffffff800dd14c>] __timer_delete_sync+0x104/0x120
  [<ffffffff804934ac>] fza_interrupt+0xc74/0xeb8
  [<ffffffff800c6390>] __handle_irq_event_percpu+0x70/0x228
  [<ffffffff800c6560>] handle_irq_event_percpu+0x18/0x78
  [<ffffffff800cc320>] handle_percpu_irq+0x50/0x80
  [<ffffffff800c5970>] generic_handle_irq+0x90/0xd0
  [<ffffffff806e956c>] do_IRQ+0x1c/0x30
  [<ffffffff8004ad4c>] handle_int+0x148/0x154
  [<ffffffff806de8a4>] arch_local_irq_disable+0x4/0x28
  [<ffffffff800ab7d0>] do_idle+0x50/0x108
  [<ffffffff800abb0c>] cpu_startup_entry+0x2c/0x38
  [<ffffffff806dfec8>] kernel_init+0x0/0x108

  ---[ end trace 0000000000000000 ]---
  tc2: registered as fddi0

The immediate origin of the new warning is the switch away from aliasing
del_timer_sync() to del_timer() (timer_delete_sync() to timer_delete()
in terms of current function names) for UP configurations, which however
is the only choice for this driver anyway as no SMP hardware supports
the TURBOchannel bus this device interfaces to.  Therefore there is a
very remote issue only this is a sign of.

Specifically if an adapter reset issued upon a transition to the halted
state times out and first triggers fza_reset_timer() for another reset
assertion, which then schedules fza_reset_timer() for reset deassertion
and then that second call is pre-empted after poking at the hardware,
but before the timer has been rearmed and owing to high system load
causing exceedingly high scheduling latency control is not handed back
before a transition to the uninitialised state has caused the timer to
be deleted even before it has been started, then fza_reset_timer() will
be called yet again and issue another reset even though by then the
adapter has already recovered.

Prevent this situation from happening by switching to timer_delete() for
the transition to the halted state and protect the code region affected
with a spinlock, also to make sure add_timer() has not been called twice
in a row due to an execution race between the interrupt handler and the
timer handler (though it could only happen on SMP, but let's keep the
driver clean).  It's a very unlikely sequence of events to happen and
therefore there's no point in trying to be overly clever about it, such
as by placing printk() calls outside the protection.  For the transition
to the uninitialised state switch to timer_delete_sync_try() instead, so
that a timer isn't deleted that's just been rearmed by the timer handler
and needs to watch for the device to come out of reset again (again, an
SMP scenario only).

Retain timer_delete_sync() invocations outside the hardirq context for a
stray timer not to fire once device structures have been released.

Fixes: 61414f5 ("FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rppt pushed a commit that referenced this pull request Jun 2, 2026
Mathias Stearn reports that since v6.19, there are two big issues
affecting rseq:

(1) On arm64 specifically, rseq critical sections aren't aborted when
    they should be.

(2) The 'cpu_id_start' field is no longer written by the kernel in all
    cases it used to be, including some cases where TCMalloc depends on
    the kernel clobbering the field.

This patch fixes issue #1. This patch DOES NOT fix issue #2, which will
need to be addressed by other patches.

The arm64-specific brokenness is a result of commits:

  2fc0e4b ("rseq: Record interrupt from user space")
  39a1675 ("rseq: Optimize event setting")

The first commit failed to add a call to rseq_note_user_irq_entry() on
arm64. Thus arm64 never sets rseq_event::user_irq to record that it may
be necessary to abort an active rseq critical section upon return to
userspace. On its own, this commit had no functional impact as the value
of rseq_event::user_irq was not consumed.

The second commit relied upon rseq_event::user_irq to determine whether
or not to bother to perform rseq work when returning to userspace. As
rseq_event::user_irq wasn't set on arm64, this work would be skipped,
and consequently an active rseq critical section would not be aborted.

Fix this by giving arm64 syscall-specific entry/exit paths, and
performing the relevant logic in syscall and non-syscall paths,
including calling rseq_note_user_irq_entry() for non-syscall entry.

Currently arm64 cannot use syscall_enter_from_user_mode(),
syscall_exit_to_user_mode(), and irqentry_exit_to_user_mode(), due to
ordering constraints with exception masking, and risk of ABI breakage
for syscall tracing/audit/etc. For the moment the entry/exit logic is
left as arm64-specific, directly using enter_from_user_mode() and
exit_to_user_mode(), but mirroring the generic code.

I intend to follow up with refactoring/cleanup, as we did for kernel
mode entry paths in commit:

  041aa7a ("entry: Split preemption from irqentry_exit_to_kernel_mode()")

... which will allow arm64 to use the GENERIC_IRQ_ENTRY functions directly.

Fixes: 39a1675 ("rseq: Optimize event setting")
Reported-by: Mathias Stearn <mathias@mongodb.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/regressions/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com/
Link: https://patch.msgid.link/20260508142023.3268622-1-mark.rutland@arm.com
rppt pushed a commit that referenced this pull request Jun 2, 2026
…g-a-bridge-port'

Ido Schimmel says:

====================
bridge: mcast: Fix a possible use-after-free when removing a bridge port

Patch #1 fixes a possible use-after-free when removing a bridge port.

Patch #2 adds a test case that triggers the problem.

In net-next we can:

1. Add DEBUG_NET_WARN_ON_ONCE() when a port multicast context is
de-initialized while enabled.

2. When de-initializing a port multicast context, synchronously shutdown
all the timers that were initialized when the context was initialized.
====================

Link: https://patch.msgid.link/20260517121122.188333-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rppt pushed a commit that referenced this pull request Jun 2, 2026
Sashiko reported that the PF driver accepts arbitrary MAC address from
from VF mailbox messages without proper validation, creating a security
vulnerability [1].

In enetc_msg_pf_set_vf_primary_mac_addr(), the MAC address is extracted
directly from the message buffer (cmd->mac.sa_data) and programmed into
hardware via pf->ops->set_si_primary_mac() without any validity checks.
A malicious VF can configure a multicast, broadcast, or all-zero MAC
address. Therefore, a validation to check the MAC address provided by VF
is required.

However, simply checking the MAC address is not enough, because it also
has the potential TOCTOU race [2]: The code reads the MAC address from
the DMA buffer to validate it via is_valid_ether_addr(), if validation
passes, reads the same DMA buffer a second time when calling
enetc_pf_set_primary_mac_addr() to program the hardware. A malicious VF
can exploit this window by overwriting the MAC address in the DMA buffer
between the validation check and the hardware programming, bypassing the
validation entirely.

Therefore, allocate a local buffer in enetc_msg_handle_rxmsg() and copy
the message content from the DMA buffer via memcpy() before processing.
This ensures the PF operates on a stable snapshot that the VF cannot
modify.

Link: https://sashiko.dev/#/patchset/20260511080805.2052495-1-wei.fang%40nxp.com #1
Link: https://sashiko.dev/#/patchset/20260513103021.2190593-1-wei.fang%40nxp.com #2
Fixes: beb74ac ("enetc: Add vf to pf messaging support")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Link: https://patch.msgid.link/20260520064421.91569-5-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rppt pushed a commit that referenced this pull request Jun 2, 2026
With PROVE_LOCKING on an Snapdragon X1 and VM reclaim pressure, we see:

   ======================================================
   WARNING: possible circular locking dependency detected
   7.0.0-debug+ #43 Tainted: G        W
   ------------------------------------------------------
   kswapd0/82 is trying to acquire lock:
   ffff800080ec3870 (reservation_ww_class_acquire){+.+.}-{0:0}, at: msm_gem_shrinker_scan+0x17c/0x400 [msm]

   but task is already holding lock:
   ffffc31709b263b8 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x88/0x988

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #2 (fs_reclaim){+.+.}-{0:0}:
          __lock_acquire+0x4d0/0xad0
          lock_acquire.part.0+0xc4/0x248
          lock_acquire+0x8c/0x248
          fs_reclaim_acquire+0xd0/0xf0
          dma_resv_lockdep+0x224/0x348
          do_one_initcall+0x84/0x5d0
          do_initcalls+0x194/0x1d8
          kernel_init_freeable+0x128/0x180
          kernel_init+0x2c/0x160
          ret_from_fork+0x10/0x20

   -> #1 (reservation_ww_class_mutex){+.+.}-{4:4}:
          __lock_acquire+0x4d0/0xad0
          lock_acquire.part.0+0xc4/0x248
          lock_acquire+0x8c/0x248
          dma_resv_lockdep+0x1a8/0x348
          do_one_initcall+0x84/0x5d0
          do_initcalls+0x194/0x1d8
          kernel_init_freeable+0x128/0x180
          kernel_init+0x2c/0x160
          ret_from_fork+0x10/0x20

   -> #0 (reservation_ww_class_acquire){+.+.}-{0:0}:
          check_prev_add+0x114/0x790
          validate_chain+0x594/0x6f0
          __lock_acquire+0x4d0/0xad0
          lock_acquire.part.0+0xc4/0x248
          lock_acquire+0x8c/0x248
          drm_gem_lru_scan+0x1ac/0x440
          msm_gem_shrinker_scan+0x17c/0x400 [msm]
          do_shrink_slab+0x150/0x4a0
          shrink_slab+0x144/0x460
          shrink_one+0x9c/0x1b0
          shrink_many+0x27c/0x5c0
          shrink_node+0x344/0x550
          balance_pgdat+0x2c0/0x988
          kswapd+0x11c/0x318
          kthread+0x10c/0x128
          ret_from_fork+0x10/0x20

   other info that might help us debug this:
   Chain exists of:
     reservation_ww_class_acquire --> reservation_ww_class_mutex --> fs_reclaim
    Possible unsafe locking scenario:
          CPU0                    CPU1
          ----                    ----
     lock(fs_reclaim);
                                  lock(reservation_ww_class_mutex);
                                  lock(fs_reclaim);
     lock(reservation_ww_class_acquire);

    *** DEADLOCK ***
   1 lock held by kswapd0/82:
    #0: ffffc31709b263b8 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x88/0x988

   stack backtrace:
   CPU: 4 UID: 0 PID: 82 Comm: kswapd0 Tainted: G        W           7.0.0-debug+ #43 PREEMPT(full)
   Tainted: [W]=WARN
   Hardware name: LENOVO 21BX0016US/21BX0016US, BIOS N3HET94W (1.66 ) 09/15/2025
   Call trace:
    show_stack+0x20/0x40 (C)
    dump_stack_lvl+0x9c/0xd0
    dump_stack+0x18/0x30
    print_circular_bug+0x114/0x120
    check_noncircular+0x178/0x198
    check_prev_add+0x114/0x790
    validate_chain+0x594/0x6f0
    __lock_acquire+0x4d0/0xad0
    lock_acquire.part.0+0xc4/0x248
    lock_acquire+0x8c/0x248
    drm_gem_lru_scan+0x1ac/0x440
    msm_gem_shrinker_scan+0x17c/0x400 [msm]
    do_shrink_slab+0x150/0x4a0
    shrink_slab+0x144/0x460
    shrink_one+0x9c/0x1b0
    shrink_many+0x27c/0x5c0
    shrink_node+0x344/0x550
    balance_pgdat+0x2c0/0x988
    kswapd+0x11c/0x318
    kthread+0x10c/0x128
    ret_from_fork+0x10/0x20

kswapd0 holding fs_reclaim calls the MSM shrinker, which calls
dma_resv_lock. This in turn acquires fs_reclaim.

Fix this deadlock by using dma_resv_trylock() instead, dropping the
subsequently unused passed wait-wound lock 'ticket'.

Cc: stable@vger.kernel.org
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Fixes: fe4952b ("drm/msm: Convert vm locking")
Patchwork: https://patchwork.freedesktop.org/patch/723564/
Message-ID: <20260508065722.18785-1-daniel@quora.org>
[rob: fixup compile errors, replace lockdep splat with something legible]
Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
rppt pushed a commit that referenced this pull request Jun 2, 2026
…softlockup

We hit a real softlockup in an internal stress test environment.  The
workload was LTP memory/swap stress on a large arm64 machine, with 320
CPUs, about 1TB memory and an 8.6GB swap device.  The system was under
heavy load and the swap device had a large number of full clusters.  The
softlockup was triggered during a stress test after about 3 days.

So, add periodic cond_resched() calls during large full_clusters
reclaim operations to prevent softlockup issues.

Detailed call trace as follow:

PID: 3817773  TASK: ffff0883bb28b780  CPU: 48   COMMAND: "kworker/48:7"
   #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
   #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
   #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
   ...
  #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
  #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
  #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
  #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
  #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
  #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
  #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c

Link: https://lore.kernel.org/20260506130919.2298807-1-kerayhuang@tencent.com
Fixes: 5168a68 ("mm, swap: avoid over reclaim of full clusters")
Signed-off-by: Zijiang Huang <kerayhuang@tencent.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: albinwyang <albinwyang@tencent.com>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Youngjun Park <youngjun.park@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
rppt pushed a commit that referenced this pull request Jun 14, 2026
Since the start of the git history, brport_store() always acquired the
bridge lock. Back then this decision made sense: The bridge lock
protects the STP state of the bridge and its ports and at that time the
function was only used by two STP related attributes (cost and
priority).

Nowadays, brport_store() processes a lot more attributes and most of
them do not need the bridge lock:

* Bridge flags: Only require RTNL. Read locklessly by the data path.
  Annotations can be added in net-next.

* FDB port flushing: Only requires the FDB lock.

* Multicast attributes: Only require the multicast lock.

* Group forward mask: Only requires RTNL. Read locklessly by the data
  path. Annotations can be added in net-next.

* Backup port: Only requires RTNL. Read locklessly by the data path.

This is a problem as the bridge calls dev_set_promiscuity() when certain
bridge port flags change and this function can sleep since the commit
cited below, resulting in a splat such as [1].

Fix this by reducing the scope of the bridge lock and only take it when
processing the two STP related attributes that require it. Remove the
now stale comment from br_switchdev_set_port_flag(). The
SWITCHDEV_F_DEFER flag can be removed in net-next.

[1]
BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 372, name: bash
preempt_count: 201, expected: 0
RCU nest depth: 0, expected: 0
5 locks held by bash/372:
#0: ffff88810c51c3f0 (sb_writers#7){.+.+}-{0:0}, at: ksys_write (fs/read_write.c:740)
#1: ffff888115ce9480 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter (fs/kernfs/file.c:343)
#2: ffff88810b9fd330 (kn->active#37){.+.+}-{0:0}, at: kernfs_fop_write_iter (fs/kernfs/file.c:80 fs/kernfs/file.c:344)
#3: ffffffffa59473a0 (rtnl_mutex){+.+.}-{4:4}, at: brport_store (net/bridge/br_sysfs_if.c:326)
#4: ffff8881099d2d58 (&br->lock){+...}-{3:3}, at: brport_store (./include/linux/spinlock.h:348 net/bridge/br_sysfs_if.c:345)
Preemption disabled at:
 0x0
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120)
__might_resched.cold (kernel/sched/core.c:9163)
netif_rx_mode_run (net/core/dev_addr_lists.c:1262)
netif_rx_mode_sync (net/core/dev_addr_lists.c:1428)
dev_set_promiscuity (net/core/dev_api.c:289)
br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172)
br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747)
store_learning (net/bridge/br_sysfs_if.c:79 net/bridge/br_sysfs_if.c:235)
brport_store (net/bridge/br_sysfs_if.c:346)
kernfs_fop_write_iter (fs/kernfs/file.c:352)
new_sync_write (fs/read_write.c:595)
vfs_write (fs/read_write.c:688)
ksys_write (fs/read_write.c:740)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)

Fixes: 78cd408 ("net: add missing instance lock to dev_set_promiscuity")
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526064818.272516-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rppt pushed a commit that referenced this pull request Jun 14, 2026
Ido Schimmel says:

====================
bridge: Fix sleep in atomic context

Under certain circumstances the bridge driver can call
dev_set_promiscuity() while holding the bridge spin lock. This is a
problem as dev_set_promiscuity() might sleep.

Patches #1-#2 fix the problem in the netlink and sysfs configuration
paths by only taking the lock where it is actually needed, thereby
avoiding calling dev_set_promiscuity() from an atomic context.

Patch #3 adds test cases for both configuration paths in rtnetlink.sh
which already includes test cases for similar issues.

Note that dev_set_promiscuity() can sleep either when it takes the net
device mutex or when calling netif_rx_mode_sync(). I encountered the
problem with the latter, but blamed the former since it came earlier.
====================

Link: https://patch.msgid.link/20260526064818.272516-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rppt pushed a commit that referenced this pull request Jun 14, 2026
…softlockup

We hit a real softlockup in an internal stress test environment.  The
workload was LTP memory/swap stress on a large arm64 machine, with 320
CPUs, about 1TB memory and an 8.6GB swap device.  The system was under
heavy load and the swap device had a large number of full clusters.  The
softlockup was triggered during a stress test after about 3 days.

So, add periodic cond_resched() calls during large full_clusters
reclaim operations to prevent softlockup issues.

Detailed call trace as follow:

PID: 3817773  TASK: ffff0883bb28b780  CPU: 48   COMMAND: "kworker/48:7"
   #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
   #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
   #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
   ...
  #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
  #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
  #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
  #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
  #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
  #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
  #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c

Link: https://lore.kernel.org/20260506130919.2298807-1-kerayhuang@tencent.com
Fixes: 5168a68 ("mm, swap: avoid over reclaim of full clusters")
Signed-off-by: Zijiang Huang <kerayhuang@tencent.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: albinwyang <albinwyang@tencent.com>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Youngjun Park <youngjun.park@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
rppt pushed a commit that referenced this pull request Jun 14, 2026
The KCOV selftest enables coverage by setting current->kcov_mode to
KCOV_MODE_TRACE_PC without installing a coverage area.  If an interrupt
records coverage in that window, the access should fault and expose the
bug.

When building for QEMU raspi0 (Raspberry Pi Zero, ARMv6, CONFIG_CPU_V6K=y,
CONFIG_CURRENT_POINTER_IN_TPIDRURO=y) with GCC 13.3.0, the store that
enables the mode is removed.  The generated kcov_init() code only stores
zero after the wait loop:

  mrc	15, 0, r3, cr13, cr0, {3}
  str	r4, [r3, #2028]

where r4 is zero.  There is no store of KCOV_MODE_TRACE_PC before the
loop, so the selftest reports success without exercising coverage.

Use WRITE_ONCE() for the temporary mode stores.  With the same compiler
and config, kcov_init() contains the intended mode store:

  mov	r3, #2
  mrc	15, 0, r2, cr13, cr0, {3}
  str	r3, [r2, #2028]

Now that the KCOV selftest is actually executed, it may expose KCOV
instrumentation issues depending on the kernel config.  That is expected
for a selftest that was intended to catch coverage from interrupt paths.

Link: https://lore.kernel.org/20260526114715.38280-1-kmehltretter@gmail.com
Fixes: 6cd0dd9 ("kcov: Add interrupt handling self test")
Assisted-by: Codex:gpt-5
Signed-off-by: Karl Mehltretter <kmehltretter@gmail.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
rppt pushed a commit that referenced this pull request Jun 14, 2026
Revisit and reorganize the locking and lock coverage of the
ap->lock spinlock as used in the two sysfs functions
se_bind_store() and se_associate_store().

A kernel run reported a possible deadlock situation, caused by
holding the spinlock (ap->lock) while triggering a uevent.
The fix rearranges the code protected by the spinlock by excluding
the uevent invocation, which does not require protection.

Additionally, the start of the protected region is moved earlier
to cover more lines, ensuring a consistent view of the AP queue
state between reading and updating its struct fields.

	=====================================================
	WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
	7.1.0-20260601.rc6.git12.516b5dbd4d4a.300.fc44.s390x+debug #1 Not tainted
	-----------------------------------------------------
	setupseguest.sh/11034 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
	000001c991f498e8 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc_cache_noprof+0x5a/0x6d0
	and this task is already holding:
	000000c4a1a12378 (&aq->lock){+.-.}-{2:2}, at: se_bind_store+0x96/0x3a0
	which would create a new lock dependency:
	 (&aq->lock){+.-.}-{2:2} -> (fs_reclaim){+.+.}-{0:0}
	but this new dependency connects a SOFTIRQ-irq-safe lock:
	 (&aq->lock){+.-.}-{2:2}
	... which became SOFTIRQ-irq-safe at:
	  __lock_acquire+0x5ae/0x15a0
	  lock_acquire+0x14c/0x400
	  _raw_spin_lock_bh+0x58/0xb0
	  ap_tasklet_fn+0x72/0xd0
	  tasklet_action_common+0x174/0x1b0
	  handle_softirqs+0x180/0x5c0
	  irq_exit_rcu+0x196/0x200
	  do_ext_irq+0x12a/0x4d0
	  ext_int_handler+0xc6/0xf0
	  folio_zero_user+0x1c6/0x240
	  folio_zero_user+0x182/0x240
	  vma_alloc_anon_folio_pmd+0xa0/0x1d0
	  __do_huge_pmd_anonymous_page+0x3a/0x200
	  __handle_mm_fault+0x56c/0x590
	  handle_mm_fault+0xa2/0x370
	  do_exception+0x292/0x590
	  __do_pgm_check+0x136/0x3e0
	  pgm_check_handler+0x114/0x160
	to a SOFTIRQ-irq-unsafe lock:
	 (fs_reclaim){+.+.}-{0:0}
	... which became SOFTIRQ-irq-unsafe at:
	...
	  __lock_acquire+0x5ae/0x15a0
	  lock_acquire+0x14c/0x400
	  __fs_reclaim_acquire+0x44/0x50
	  fs_reclaim_acquire+0xbe/0x100
	  fs_reclaim_correct_nesting+0x20/0x70
	  dotest+0x5e/0x148
	  locking_selftest+0x2854/0x2a88
	  start_kernel+0x3b2/0x4f0
	  startup_continue+0x2e/0x40
	other info that might help us debug this:
	 Possible interrupt unsafe locking scenario:
	       CPU0                    CPU1
	       ----                    ----
	  lock(fs_reclaim);
				       local_irq_disable();
				       lock(&aq->lock);
				       lock(fs_reclaim);
	  <Interrupt>
	    lock(&aq->lock);
	 *** DEADLOCK ***
	4 locks held by setupseguest.sh/11034:
	 #0: 000000c485d01440 (sb_writers#4){.+.+}-{0:0}, at: vfs_write+0x2fc/0x380
	 #1: 000000c4d2283288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x12a0x270
	 #2: 000000c4a1830e48 (kn->active#172){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1e/0x270
	 #3: 000000c4a1a12378 (&aq->lock){+.-.}-{2:2}, at: se_bind_store+0x96/0x3a0
	the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
	-> (&aq->lock){+.-.}-{2:2} {
	   HARDIRQ-ON-W at:
			    __lock_acquire+0x5ae/0x15a0
			    lock_acquire+0x14c/0x400
			    _raw_spin_lock_bh+0x58/0xb0
			    ap_queue_init_state+0x2e/0x50
			    ap_scan_domains+0x5d6/0x620
			    ap_scan_adapter+0x4c0/0x810
			    ap_scan_bus+0x70/0x350
			    ap_scan_bus_wq_callback+0x56/0x80
			    process_one_work+0x2ba/0x820
			    worker_thread+0x21a/0x400
			    kthread+0x164/0x190
			    __ret_from_fork+0x4c/0x340
			    ret_from_fork+0xa/0x30
	   IN-SOFTIRQ-W at:
			    __lock_acquire+0x5ae/0x15a0
			    lock_acquire+0x14c/0x400
			    _raw_spin_lock_bh+0x58/0xb0
			    ap_tasklet_fn+0x72/0xd0
			    tasklet_action_common+0x174/0x1b0
			    handle_softirqs+0x180/0x5c0
			    irq_exit_rcu+0x196/0x200
			    do_ext_irq+0x12a/0x4d0
			    ext_int_handler+0xc6/0xf0
			    folio_zero_user+0x1c6/0x240
			    folio_zero_user+0x182/0x240
			    vma_alloc_anon_folio_pmd+0xa0/0x1d0
			    __do_huge_pmd_anonymous_page+0x3a/0x200
			    __handle_mm_fault+0x56c/0x590
			    handle_mm_fault+0xa2/0x370
			    do_exception+0x292/0x590
			    __do_pgm_check+0x136/0x3e0
			    pgm_check_handler+0x114/0x160
	   INITIAL USE at:
			   __lock_acquire+0x5ae/0x15a0
			   lock_acquire+0x14c/0x400
			   _raw_spin_lock_bh+0x58/0xb0
			   ap_queue_init_state+0x2e/0x50
			   ap_scan_domains+0x5d6/0x620
			   ap_scan_adapter+0x4c0/0x810
			   ap_scan_bus+0x70/0x350
			   ap_scan_bus_wq_callback+0x56/0x80
			   process_one_work+0x2ba/0x820
			   worker_thread+0x21a/0x400
			   kthread+0x164/0x190
			   __ret_from_fork+0x4c/0x340
			   ret_from_fork+0xa/0x30
	 }
	 ... key      at: [<000001c9936e8aa0>] __key.7+0x0/0x10
	the dependencies between the lock to be acquired
	 and SOFTIRQ-irq-unsafe lock:
	-> (fs_reclaim){+.+.}-{0:0} {
	   HARDIRQ-ON-W at:
			    __lock_acquire+0x5ae/0x15a0
			    lock_acquire+0x14c/0x400
			    __fs_reclaim_acquire+0x44/0x50
			    fs_reclaim_acquire+0xbe/0x100
			    fs_reclaim_correct_nesting+0x20/0x70
			    dotest+0x5e/0x148
			    locking_selftest+0x2854/0x2a88
			    start_kernel+0x3b2/0x4f0
			    startup_continue+0x2e/0x40
	   SOFTIRQ-ON-W at:
			    __lock_acquire+0x5ae/0x15a0
			    lock_acquire+0x14c/0x400
			    __fs_reclaim_acquire+0x44/0x50
			    fs_reclaim_acquire+0xbe/0x100
			    fs_reclaim_correct_nesting+0x20/0x70
			    dotest+0x5e/0x148
			    locking_selftest+0x2854/0x2a88
			    start_kernel+0x3b2/0x4f0
			    startup_continue+0x2e/0x40
	   INITIAL USE at:
			   __lock_acquire+0x5ae/0x15a0
			   lock_acquire+0x14c/0x400
			   __fs_reclaim_acquire+0x44/0x50
			   fs_reclaim_acquire+0xbe/0x100
			   fs_reclaim_correct_nesting+0x20/0x70
			   dotest+0x5e/0x148
			   locking_selftest+0x2854/0x2a88
			   start_kernel+0x3b2/0x4f0
			   startup_continue+0x2e/0x40
	 }
	 ... key      at: [<000001c991f498e8>] __fs_reclaim_map+0x0/0x30
	 ... acquired at:
	   check_prev_add+0x178/0xf40
	   __lock_acquire+0x12aa/0x15a0
	   lock_acquire+0x14c/0x400
	   __fs_reclaim_acquire+0x44/0x50
	   fs_reclaim_acquire+0xbe/0x100
	   __kmalloc_cache_noprof+0x5a/0x6d0
	   kobject_uevent_env+0xd4/0x420
	   ap_send_se_bind_uevent+0x48/0x70
	   se_bind_store+0x146/0x3a0
	   kernfs_fop_write_iter+0x18c/0x270
	   vfs_write+0x23c/0x380
	   ksys_write+0x88/0x120
	   __do_syscall+0x170/0x750
	   system_call+0x72/0x90
	stack backtrace:
	CPU: 6 UID: 0 PID: 11034 Comm: setupseguest.sh Not tainted 7.1.0-20260601.rc6.git2.516b5dbd4d4a.300.fc44.s390x+debug #1 PREEMPT
	Hardware name: IBM 9175 ME1 701 (KVM/Linux)
	Call Trace:
	 [<000001c98ffa0a7e>] dump_stack_lvl+0xae/0x108
	 [<000001c9900a6d7a>] print_bad_irq_dependency+0x47a/0x480
	 [<000001c9900a7184>] check_irq_usage+0x404/0x4c0
	 [<000001c9900a73b8>] check_prev_add+0x178/0xf40
	 [<000001c9900aaf1a>] __lock_acquire+0x12aa/0x15a0
	 [<000001c9900ab35c>] lock_acquire+0x14c/0x400
	 [<000001c9903be454>] __fs_reclaim_acquire+0x44/0x50
	 [<000001c9903be51e>] fs_reclaim_acquire+0xbe/0x100
	 [<000001c9903cf4ca>] __kmalloc_cache_noprof+0x5a/0x6d0
	 [<000001c9910ca9d4>] kobject_uevent_env+0xd4/0x420
	 [<000001c990d84098>] ap_send_se_bind_uevent+0x48/0x70
	 [<000001c990d87416>] se_bind_store+0x146/0x3a0
	 [<000001c99057da7c>] kernfs_fop_write_iter+0x18c/0x270
	 [<000001c99047712c>] vfs_write+0x23c/0x380
	 [<000001c990477438>] ksys_write+0x88/0x120
	 [<000001c9910f64e0>] __do_syscall+0x170/0x750
	 [<000001c99110a412>] system_call+0x72/0x90
	INFO: lockdep is turned off.

Fixes: 4179c39 ("s390/ap: Implement SE bind and associate uevents")
Reported-by: Ingo Franzki <ifranzki@linux.ibm.com>
Suggested-by: Finn Callies <fcallies@linux.ibm.com>
Reviewed-by: Finn Callies <fcallies@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
rppt pushed a commit that referenced this pull request Jun 14, 2026
vntb_epf_peer_db_set() raises an MSI interrupt to notify the RC side of
a doorbell event. pci_epc_raise_irq(..., PCI_IRQ_MSI, interrupt_num)
takes a 1-based MSI interrupt number.

The ntb_hw_epf driver reserves MSI #1 for link events, so doorbells
would naturally start at MSI #2 (doorbell bit 0 -> MSI #2). However,
pci-epf-vntb has historically applied an extra offset and mapped doorbell
bit 0 to MSI #3. This matches the legacy behavior of ntb_hw_epf and has
been preserved since commit e35f56b ("PCI: endpoint: Support NTB
transfer between RC and EP").

This offset has not surfaced as a functional issue because:
- ntb_hw_epf typically allocates enough MSI vectors, so the off-by-one
  still hits a valid MSI vector, and
- ntb_hw_epf does not implement .db_vector_count()/.db_vector_mask(), so
  client drivers such as ntb_transport effectively ignore the vector
  number and schedule all QPs.

Correcting the MSI number would break interoperability with peers
running older kernels.

Document the legacy offset to avoid confusion when enabling
per-db-vector handling.

Signed-off-by: Koichiro Den <den@valinux.co.jp>
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260513024923.451765-2-den@valinux.co.jp
rppt pushed a commit that referenced this pull request Jun 14, 2026
…ernel/git/ath/ath

Jeff Johnson says:
==================
ath.git patches for v7.2 (PR #2)

For ath12k:
- Add thermal throttling and cooling device support
- Add support for handling incumbent signal interference in 6 GHz
- Add support for channel 177 in the 5 GHz band

In addition, a large number of cleanup and minor bug fixing across
all supported drivers.
==================

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
rppt pushed a commit that referenced this pull request Jun 14, 2026
strset_add_str_mem() might reallocate the strset data buffer in order to
accommodate the provided string 's'. However, if 's' points to a string
already present in the buffer, it becomes dangling after the realloc.
This leads to a use-after-free when attempting to memcpy() the string
into the new buffer.

One scenario that triggers this problematic path is when resolve_btfids
attempts to patch kfunc prototypes using existing BTF parameter names:

 | resolve_btfids: function bpf_list_push_back_impl already exists in BTF
 | Segmentation fault (core dumped)

Compiling resolve_btfids with fsanitize=address generates a detailed
report of the UAF:

 | =================================================================
 | ERROR: AddressSanitizer: heap-use-after-free on address 0x7f4c4a500bd4
 | ==1507892==ERROR: AddressSanitizer: heap-use-after-free on address 0x7f4c4a500bd4 at pc 0x55d25155a2a8 bp 0x7ffcef879060 sp 0x7ffcef878818
 | READ of size 5 at 0x7f4c4a500bd4 thread T0
 |     #0 0x55d25155a2a7 in memcpy (tools/bpf/resolve_btfids/resolve_btfids+0xcf2a7)
 |     #1 0x55d2515d708e in strset__add_str tools/lib/bpf/strset.c:162:2
 |     #2 0x55d2515c730b in btf__add_str tools/lib/bpf/btf.c:2109:8
 |     #3 0x55d2515c9020 in btf__add_func_param tools/lib/bpf/btf.c:3108:14
 |     #4 0x55d25159f0b5 in process_kfunc_with_implicit_args tools/bpf/resolve_btfids/main.c:1196:9
 |     #5 0x55d25159e004 in btf2btf tools/bpf/resolve_btfids/main.c:1229:9
 |     #6 0x55d25159cee7 in main tools/bpf/resolve_btfids/main.c:1535:6
 |     #7 0x7f4c78e29f76 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
 |     #8 0x7f4c78e2a026 in __libc_start_main csu/../csu/libc-start.c:360:3
 |     #9 0x55d2514bb860 in _start (tools/bpf/resolve_btfids/resolve_btfids+0x30860)
 |
 | 0x7f4c4a500bd4 is located 13268 bytes inside of 2829000-byte region [0x7f4c4a4fd800,0x7f4c4a7b02c8)
 | freed by thread T0 here:
 |     #0 0x55d25155b700 in realloc (tools/bpf/resolve_btfids/resolve_btfids+0xd0700)
 |     #1 0x55d2515c426c in libbpf_reallocarray tools/lib/bpf/./libbpf_internal.h:220:9
 |     #2 0x55d2515c426c in libbpf_add_mem tools/lib/bpf/btf.c:224:13
 |
 | previously allocated by thread T0 here:
 |     #0 0x55d25155b2e3 in malloc (tools/bpf/resolve_btfids/resolve_btfids+0xd02e3)
 |     #1 0x55d2515d6e7d in strset__new tools/lib/bpf/strset.c:58:20

While resolve_btfids could be refactored to avoid this call path, let's
instead fix this issue at the source in strset__add_str() and avoid
similar scenarios.

Let's check if set->strs_data was reallocated and whether 's' points to
an internal string within the old strset buffer. In such case, 's' is
reconstructed to point to the new buffer.

While already here, also fix strset__find_str() which suffers from the
same problem by factoring out the common operations into a new helper
function strset_str_append().

Fixes: 90d76d3 ("libbpf: Extract internal set-of-strings datastructure APIs")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Suggested-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260523162722.2718940-1-cmllamas@google.com
rppt pushed a commit that referenced this pull request Jun 14, 2026
Ihor Solodrai says:

====================
bpf: Implement stack_map_get_build_id_offset_sleepable()

The series introduces stack_map_get_build_id_offset_sleepable(),
fixing a gap with parsing build_id in sleepable context in stackmap.c

In particular, this fixes a deadlock in
stack_map_get_build_id_offset() doing a blocking __kernel_read(),
which happens since commit 777a856 ("lib/buildid: use
__kernel_read() for sleepable context").

See previous revisions for more details.
---

v6->v7:
  * Addressed feedback from Andrii (mostly patch #2):
    * implement proper CONFIG_PER_VMA_LOCK=n support, following a
      VMA locking pattern similar to one used in PROCMAP_QUERY
    * change the contract of stack_map_lock_vma(): if a non-NULL VMA
      is returned, then a read lock is held
      * remove now unnecessary vma_locked flag
    * and various other nits
  * Add vma_is_anonymous() checks where appropriate (AIs)
v6: https://lore.kernel.org/bpf/20260521225022.2695755-1-ihor.solodrai@linux.dev/

v5->v6:
  * Misc refactoring (Andrii):
    * add stack_map_build_id_set_valid() helper
    * simplify control flow in stack_map_get_build_id_offset_sleepable()
v5: https://lore.kernel.org/bpf/20260515005244.1333013-1-ihor.solodrai@linux.dev/

v4->v5:
  * Add comments explaining mmap_read_trylock() (Shakeel)
  * Rebase on bpf-next (Alexei)
v4: https://lore.kernel.org/bpf/20260514184727.1067141-1-ihor.solodrai@linux.dev/

v3->v4:
  * Change Fixes tag in patch #2 (AI)
  * Nit in caching implementation (Mykyta)
v3: https://lore.kernel.org/bpf/20260512032906.2670326-1-ihor.solodrai@linux.dev/

v2->v3:
  * Split patch #2 in two: stack_map_get_build_id_offset_sleepable()
    implementation, and then introduce caching
  * Drop taking mmap_lock if CONFIG_PER_VMA_LOCK=n, fall back to raw
    IPs instead
  * Cache vm_{start,end} in addition to prev_file (Mykyta)
v2: https://lore.kernel.org/bpf/20260409010604.1439087-1-ihor.solodrai@linux.dev/

v1->v2:
  * Addressed feedback from Puranjay:
    * split out a small refactoring patch
    * use mmap_read_trylock()
    * take into account CONFIG_PER_VMA_LOCK
    * replace find_vma() with vma_lookup()
    * cache prev_build_id to avoid re-parsing the same file
  * Snapshot vm_pgoff and vm_start before unlocking (AI)
  * To avoid repetitive unlocking statements, introduce struct
    stack_map_vma_lock to hold relevant lock state info and add an
    unlock helper
v1: https://lore.kernel.org/bpf/20260407223003.720428-1-ihor.solodrai@linux.dev/

---
====================

Link: https://patch.msgid.link/20260525223948.1920986-1-ihor.solodrai@linux.dev
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
rppt pushed a commit that referenced this pull request Jun 14, 2026
Daniel Borkmann says:

====================
More gen_loader fixes #2

Another small follow-up from the sashiko findings about signed loaders.
In particular, closing the gap to reject exclusive maps in iterators.
====================

Link: https://patch.msgid.link/20260602133052.423725-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
rppt pushed a commit that referenced this pull request Jun 14, 2026
Leon Hwang says:

====================
bpf: Check tail zero of bpf_map_info and bpf_prog_info

Check the tail bytes of bpf_map_info and bpf_prog_info due to padding
when getting map info and prog info via BPF_OBJ_GET_INFO_BY_FD, which
was discussed in the thread
"bpf: Check tail zero of bpf_common_attr using offsetofend" [1].

Links:
[1] https://lore.kernel.org/bpf/20260518145446.6794-2-leon.hwang@linux.dev/

Changes:
v2 -> v3:
* Add "__u32 :32" to bpf_map_info and bpf_prog_info (per Alexei).
* v2: https://lore.kernel.org/bpf/20260604150505.99129-1-leon.hwang@linux.dev/

v1 -> v2:
* Collect Acked-by tags from Mykyta, thanks.
* Update Fixes tag in patch #2 (per bot+bpf-ci)
* v1: https://lore.kernel.org/bpf/20260603144518.67065-1-leon.hwang@linux.dev/
====================

Link: https://patch.msgid.link/20260605155249.20772-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
rppt pushed a commit that referenced this pull request Jun 14, 2026
…el-in-lwt'

Leon Hwang says:

====================
bpf: Update transport_header when encapsulating UDP tunnel in lwt

Currently, bpf_lwt_push_ip_encap() does not update skb->transport_header.
When a driver, e.g. ice, reuses the stale skb->transport_header to
offload checksum computation to NIC hardware, VxLAN packets encapsulated
by bpf_lwt_push_encap() helper may be dropped due to incorrect checksum.

Update skb->transport_header in bpf_lwt_push_ip_encap() whenever the
encapsulated packet uses UDP, so checksum offload works correctly.

Changes:
v3 -> v4:
* Address comments from Emil:
  * Make the logic of skb_set_transport_header() clearer in patch #1.
  * Fold the code of fexit_lwt_push_ip_encap() into test_lwt_ip_encap.c in
    patch #2.
  * Resolve assorted issues of test in patch #2.
* v3: https://lore.kernel.org/bpf/20260601150203.20352-1-leon.hwang@linux.dev/

v2 -> v3:
* Drop patch #1 and #2 of v2 that aim to resolve potential issues
  reported by sashiko (per Alexei).
* Check target IP version and UDP tunnel in test (per sashiko).
* v2: https://lore.kernel.org/bpf/20260529151351.69911-1-leon.hwang@linux.dev/

v1 -> v2:
* Address sashiko's reviews:
  * Fix TOCTOU issue in lwt to avoid changing hdr after checks.
  * Add check iph->ihl < 5 in lwt to avoid infinite-loop in MIPS driver.
  * Update comment style in selftests with BPF comment style.
* v1: https://lore.kernel.org/bpf/20260525142650.2569-1-leon.hwang@linux.dev/
====================

Link: https://patch.msgid.link/20260602150931.49629-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
rppt pushed a commit that referenced this pull request Jun 14, 2026
Richard Fitzgerald <rf@opensource.cirrus.com> says:

These are for-next.

They are not urgent because it only leaks memory if the driver failed to
component_probe or is removed, which wouldn't happen in normal use.

This series fixes some memory leaks:
- The memory allocated by wm_adsp/cs_dsp was not freed.
- If component_probe() failed it didn't clean up.

The addition of this cleanup in patch #3 exposes an existing possible
double-free of the debugfs, which is fixed in patch #2.

Link: https://patch.msgid.link/20260610093432.557375-1-rf@opensource.cirrus.com
rppt pushed a commit that referenced this pull request Jun 14, 2026
The length for the internal output buffer is calculated incorrectly, which
can result overflow when a too small buffer is provided.

Fix the bug by allocating internal output with the size of the maximum
length of the cryptographic primitive instead of caller provided size.

Link: https://lore.kernel.org/keyrings/20260531024914.3712130-1-jarkko@kernel.org/
Cc: stable@vger.kernel.org # v4.20+
Fixes: 00d60fd ("KEYS: Provide keyctls to drive the new key type ops for asymmetric keys [ver #2]")
Reported-by: Alessandro Groppo <ale.grpp@gmail.com>
Tested-by: Alessandro Groppo <ale.grpp@gmail.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
rppt pushed a commit that referenced this pull request Jun 14, 2026
A deadlock occurs in the audit subsystem when duplicating
executable-related rules.

When a file is moved (e.g., via do_renameat2()), the VFS layer locks
the parent directory (I_MUTEX_PARENT), which synchronously triggers an
fsnotify_move event. If an existing executable audit rule matches the
file being moved, the audit subsystem catches this event and calls
audit_dupe_exe() to duplicate the watch and update the rule. Then,
audit_alloc_mark() would call kern_path_parent() to resolve the path,
leading to a blind attempt to acquire the exact same I_MUTEX_PARENT lock
already held by the task, resulting in the following recursive locking
deadlock:

 ============================================
 WARNING: possible recursive locking detected
 6.12.0-55.27.1.el10_0.x86_64+debug #1 Not tainted
 --------------------------------------------
 mv/5099 is trying to acquire lock:
 ffff888132845358 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3},
 at: __kern_path_locked+0x10a/0x2f0

 but task is already holding lock:
 ffff888132846b58 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3},
 at: lock_two_directories+0x13f/0x2b0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&inode->i_sb->s_type->i_mutex_dir_key/1);
   lock(&inode->i_sb->s_type->i_mutex_dir_key/1);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

  6 locks held by mv/5099:
  #0: ffff888112a9c440 (sb_writers#13)
  at: do_renameat2+0x34c/0xbc0
  #1: ffff888112a9c790 (&type->s_vfs_rename_key#3)
  at: do_renameat2+0x415/0xbc0
  #2: ffff888132846b58 (&inode->i_sb->s_type->i_mutex_dir_key/1)
  at: lock_two_directories+0x13f/0x2b0
  #3: ffff888132845358 (&inode->i_sb->s_type->i_mutex_dir_key/5)
  at: lock_two_directories+0x175/0x2b0
  #4: ffffffffb3a1fb10 (&fsnotify_mark_srcu)
  at: fsnotify+0x454/0x28a0
  #5: ffffffffaf886230 (audit_filter_mutex)
  at: audit_update_watch+0x36/0x11e0

 stack backtrace:
 Call Trace:
  <TASK>
  dump_stack_lvl+0x6f/0xb0
  print_deadlock_bug.cold+0xbd/0xca
  validate_chain+0x83a/0xf00
  __lock_acquire+0xcac/0x1d20
  lock_acquire.part.0+0x11b/0x360
  down_write_nested+0x9f/0x230
  __kern_path_locked+0x10a/0x2f0
  kern_path_locked+0x26/0x40
  audit_alloc_mark+0xfb/0x4f0
  audit_dupe_exe+0x6c/0xe0
  audit_dupe_rule+0x6c2/0xc00
  audit_update_watch+0x4cc/0x11e0
  audit_watch_handle_event+0x12c/0x1b0
  send_to_group+0x5d0/0x8b0
  fsnotify+0x615/0x28a0
  fsnotify_move+0x1d8/0x630
  vfs_rename+0xdcd/0x1df0
  do_renameat2+0x9d4/0xbc0
  __x64_sys_renameat+0x192/0x260
  do_syscall_64+0x92/0x180
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7f0491fe8c4e
 Code: 0f 1f 40 00 48 8b 15 c1 e1 16 00 f7 d8 64 89 02 b8 ff ff ff ff
 c3 66 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca b8 08 01 00 00 0f 05 <48>
 3d 00 f0 ff ff 77 0a c3 66 0f 1f 84 00 00 00 00 00 48 8b 15 89
 RSP: 002b:00007ffc7210bf38 EFLAGS: 00000246 ORIG_RAX: 0000000000000108
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0491fe8c4e
 RDX: 0000000000000003 RSI: 00007ffc7210e6c8 RDI: 00000000ffffff9c
 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
 R10: 00005575eb2dae2a R11: 0000000000000246 R12: 00005575eb2dae2a
 R13: 00007ffc7210e6c8 R14: 0000000000000003 R15: 00000000ffffff9c
  </TASK>

The aforementioned deadlock can be consistently reproduced by running
the script below:

 audit-dupe-exe-deadlock.sh
 --------------------------
 #!/bin/bash
 auditctl -D
 mkdir -p /tmp/foo
 touch /tmp/file
 auditctl -a always,exit -F exe=/tmp/file -F path=/tmp/file -S all -k dr
 mv /tmp/file /tmp/foo/file
 rm -Rf /tmp/foo

This patch fixes the issue by introducing struct audit_watch_ctx to pass
the fsnotify event context down to audit_alloc_mark(). By utilizing the
already-resolved directory inode provided by the event, we bypass the
kern_path_parent() path resolution entirely, safely avoiding the
recursive lock. Furthermore, it explicitly allows duplicate fsnotify
marks (allow_dups = 1) during the rename update, allowing the new rule's
mark to safely coexist with the old rule's mark until the old rule is
freed.

P.S.: This issue was identified and reproduced during a comprehensive
code coverage analysis of the audit subsystem. The full report is
available at the link below:

https://people.redhat.com/rrobaina/audit-code-coverage-analysis.pdf

P.P.S: With the permission of both Ricardo and Nathan, I've squashed a
fixup patch from Nathan that addresses a compile time error when
CONFIG_AUDITSYSCALL=n.

Cc: stable@kernel.org
Fixes: 34d99af ("audit: implement audit by executable")
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
[PM: my link metadata into the msg, apply fix from NC]
Signed-off-by: Paul Moore <paul@paul-moore.com>
rppt pushed a commit that referenced this pull request Jun 14, 2026
tl;dr: Use stop_machine() and a state machine based on the
"MULTI_STOP" pattern to implement core TDX module update logic.

Long version:

TDX module updates require careful synchronization with other TDX
operations. The requirements are (#1/#2 reflect current behavior that
must be preserved):

1. SEAMCALLs need to be callable from both process and IRQ contexts.
2. SEAMCALLs need to be able to run concurrently across CPUs
3. During updates, only update-related SEAMCALLs are permitted; all
   other SEAMCALLs shouldn't be called.
4. During updates, all online CPUs must participate in the update work.

No single lock primitive satisfies all requirements. For instance,
rwlock_t handles #1/#2 but fails #4: CPUs spinning with IRQs disabled
cannot be directed to perform update work.

Use stop_machine() as it is the only well-understood mechanism that can
meet all requirements.

And TDX module updates consist of several steps (See Intel Trust Domain
Extensions (Intel TDX) Module Base Architecture Specification, Chapter
"TD-Preserving TDX module Update"). Ordering requirements between steps
mandate lockstep synchronization across all CPUs.

multi_cpu_stop() provides a good example of executing a multi-step task
in lockstep across CPUs, but it does not synchronize the individual
steps inside the callback itself.

Implement a similar state machine as the skeleton for TDX module
updates. Each state represents one step in the update flow, and the
state advances only after all CPUs acknowledge completion of the current
step. This acknowledgment mechanism provides the required lockstep
execution.

The update flow is intentionally simpler than multi_cpu_stop() in two ways:

  a) use a spinlock to protect the control data instead of atomic_t and
     explicit memory barriers.

  b) omit touch_nmi_watchdog() and rcu_momentary_eqs(), which exist
     there for debugging and are not strictly needed for this update flow

Potential alternative to stop_machine()
=======================================
An alternative approach is to lock all KVM entry points and kick all
vCPUs. Here, KVM entry points refer to KVM VM/vCPU ioctl entry points,
implemented in KVM common code (virt/kvm). Adding a locking mechanism
there would affect all architectures KVM supports. And to lock only TDX
vCPUs, new logic would be needed to identify TDX vCPUs, which the KVM
common code currently lacks. This would add significant complexity and
maintenance overhead to KVM for this TDX-specific use case, so don't take
this approach.

[ dhansen: normal changelog/style munging ]

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260520133909.409394-15-chao.gao@intel.com
rppt pushed a commit that referenced this pull request Jun 14, 2026
Check for full 64-bit mode, not just long mode, when truncating the
virtual address as part of INVLPGA emulation.  Compatibility mode doesn't
support 64-bit addressing.

Note, the FIXME still applies, e.g. if the guest deliberately targeted
EAX while in 64-bit via an address size override.  That flaw isn't worth
fixing as it would require decoding the code stream, which would open an
entirely different can of worms, and in practice no sane guest would shove
garbage into RAX[63:32] and execute INVLPGA.

Note #2, VMSAVE, VMLOAD, and VMRUN all suffer from the same architectural
flaw of not providing the full linear address in a VMCB exit information
field, because, quoting the APM verbatim:

  the linear address is available directly from the guest rAX register

(VMSAVE, VMLOAD, and VMRUN take a physical address, but their behavior
with respect to rAX is otherwise identical).

Fixes: bc9eff6 ("KVM: SVM: Use default rAX size for INVLPGA emulation")
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260529222223.870923-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
rppt pushed a commit that referenced this pull request Jun 14, 2026
…L changes

When (conditionally) reprogramming fixed counters, use the hardware value
of FIXED_CTR_CTRL to detect changes, not the guest's original value.  For
guests with a mediated PMU, overwriting fixed_ctr_ctrl_hw at the start of
reprogramming without actually reacting to changes in fixed_ctr_ctrl_hw can
lead to KVM ignoring PMU event filters.

E.g. if the guest attempts to enable a fixed PMC that is disallowed, and
then toggles a different PMC in a subsequent WRMSR, KVM will update
pmu->fixed_ctr_ctrl_hw and reprogram the PMC that is changing, but not the
others that are now effectively enabled in pmu->fixed_ctr_ctrl_hw.

Note, the perf-based PMU is unaffected, as it doesn't use fixed_ctr_ctrl_hw
(which is also why keying off fixed_ctr_ctrl_hw works for both PMUs.

Note #2, fixed_ctr_ctrl_hw won't mess up pmc_in_use either, because the
latter isn't used by the mediated PMU.  Its purpose is solely to release
perf events that are no longer being actively used, and the meadiated PMU
obviously doesn't create perf events.

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260528005419.0228F1F00A3A@smtp.kernel.org
Link: https://patch.msgid.link/20260603231905.1738487-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
rppt pushed a commit that referenced this pull request Jun 14, 2026
The etb10 miscdevice uses drvdata->reading as a shared exclusivity gate
for userspace buffer access. etb_open() claims that gate with
local_cmpxchg(), and etb_release() clears it with local_set().

That gate is shared per-device state rather than CPU-local state. A
running system can reach it whenever /dev/<etb> is opened, closed, and
reopened by different tasks while the device remains registered, so the
same drvdata->reading variable may be claimed on one CPU and later
cleared on another.

This code used to use atomic_t for the same gate, but commit
27b10da ("coresight: etb10: moving to local atomic operations")
changed it to local_t even though the access pattern remained cross-task
and cross-CPU. Restore atomic_t together with atomic_cmpxchg() and
atomic_set() so the exclusivity gate again uses a primitive intended
for shared state.

The issue was found on Linux v6.18.21 by our static analysis tool while
scanning surviving local_t-on-shared-state sites, and then manually
reviewed against the live etb10 file-op path.

It was runtime-validated with a reproducible QEMU no-device KCSAN PoC
that kept the same report-local contract:

  1. use one shared struct etb_drvdata carrier and its
     drvdata->reading gate;
  2. call etb_open() and etb_release() sequentially on that gate to
     confirm the original claim/clear path;
  3. bind the open side to CPU0 and the release side to CPU1 for the
     same gate to show cross-CPU ownership;
  4. run bound workers that repeatedly race etb_open() and
     etb_release() on the same gate until KCSAN reports a target hit.

The harness recorded:

  L1 passed open=1 release=1
  reading_after_open=1 reading_after_release=0
  L2 passed open_cpu=0 release_cpu=1
  cross_cpu_release=1 reading_after=0 open_ret=0

Representative KCSAN excerpt from the no-device validation run:

  BUG: KCSAN: data-race in etb_open.constprop.0.isra.0 [vuln_msv]

  write to 0xffffffffc0003810 of 4 bytes by task 216 on cpu 1:
   etb_open.constprop.0.isra.0+0x38/0x80 [vuln_msv]
   l3_worker_thread_fn+0x4f/0xf0 [vuln_msv]
   kthread+0x17e/0x1c0
   ret_from_fork+0x22/0x30

  read to 0xffffffffc0003810 of 4 bytes by task 215 on cpu 0:
   etb_open.constprop.0.isra.0+0x18/0x80 [vuln_msv]
   l3_worker_thread_fn+0x4f/0xf0 [vuln_msv]
   kthread+0x17e/0x1c0
   ret_from_fork+0x22/0x30

  value changed: 0x00000000 -> 0x00000001

  Reported by Kernel Concurrency Sanitizer on:
  CPU: 0 PID: 215 Comm: etb10_l3_a Tainted: G           O       6.1.66 #2

This no-device harness is not a real ETB10 hardware end-to-end run, but
it preserves the same shared drvdata->reading gate and the same
etb_open()/etb_release() claim/clear contract. No real ETB10 hardware
was available for runtime testing.

Build-tested with:
  make olddefconfig
  make -j"$(nproc)" drivers/hwtracing/coresight/coresight-etb10.o

Fixes: 27b10da ("coresight: etb10: moving to local atomic operations")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20260528165201.319452-1-runyu.xiao@seu.edu.cn
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant