linux/net/core
Taehee Yoo 59a931c5b7 xdp: fix invalid wait context of page_pool_destroy()
If the driver uses a page pool, it creates a page pool with
page_pool_create().
The reference count of page pool is 1 as default.
A page pool will be destroyed only when a reference count reaches 0.
page_pool_destroy() is used to destroy page pool, it decreases a
reference count.
When a page pool is destroyed, ->disconnect() is called, which is
mem_allocator_disconnect().
This function internally acquires mutex_lock().

If the driver uses XDP, it registers a memory model with
xdp_rxq_info_reg_mem_model().
The xdp_rxq_info_reg_mem_model() internally increases a page pool
reference count if a memory model is a page pool.
Now the reference count is 2.

To destroy a page pool, the driver should call both page_pool_destroy()
and xdp_unreg_mem_model().
The xdp_unreg_mem_model() internally calls page_pool_destroy().
Only page_pool_destroy() decreases a reference count.

If a driver calls page_pool_destroy() then xdp_unreg_mem_model(), we
will face an invalid wait context warning.
Because xdp_unreg_mem_model() calls page_pool_destroy() with
rcu_read_lock().
The page_pool_destroy() internally acquires mutex_lock().

Splat looks like:
=============================
[ BUG: Invalid wait context ]
6.10.0-rc6+ #4 Tainted: G W
-----------------------------
ethtool/1806 is trying to lock:
ffffffff90387b90 (mem_id_lock){+.+.}-{4:4}, at: mem_allocator_disconnect+0x73/0x150
other info that might help us debug this:
context-{5:5}
3 locks held by ethtool/1806:
stack backtrace:
CPU: 0 PID: 1806 Comm: ethtool Tainted: G W 6.10.0-rc6+ #4 f916f41f172891c800f2fed
Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0603 11/01/2021
Call Trace:
<TASK>
dump_stack_lvl+0x7e/0xc0
__lock_acquire+0x1681/0x4de0
? _printk+0x64/0xe0
? __pfx_mark_lock.part.0+0x10/0x10
? __pfx___lock_acquire+0x10/0x10
lock_acquire+0x1b3/0x580
? mem_allocator_disconnect+0x73/0x150
? __wake_up_klogd.part.0+0x16/0xc0
? __pfx_lock_acquire+0x10/0x10
? dump_stack_lvl+0x91/0xc0
__mutex_lock+0x15c/0x1690
? mem_allocator_disconnect+0x73/0x150
? __pfx_prb_read_valid+0x10/0x10
? mem_allocator_disconnect+0x73/0x150
? __pfx_llist_add_batch+0x10/0x10
? console_unlock+0x193/0x1b0
? lockdep_hardirqs_on+0xbe/0x140
? __pfx___mutex_lock+0x10/0x10
? tick_nohz_tick_stopped+0x16/0x90
? __irq_work_queue_local+0x1e5/0x330
? irq_work_queue+0x39/0x50
? __wake_up_klogd.part.0+0x79/0xc0
? mem_allocator_disconnect+0x73/0x150
mem_allocator_disconnect+0x73/0x150
? __pfx_mem_allocator_disconnect+0x10/0x10
? mark_held_locks+0xa5/0xf0
? rcu_is_watching+0x11/0xb0
page_pool_release+0x36e/0x6d0
page_pool_destroy+0xd7/0x440
xdp_unreg_mem_model+0x1a7/0x2a0
? __pfx_xdp_unreg_mem_model+0x10/0x10
? kfree+0x125/0x370
? bnxt_free_ring.isra.0+0x2eb/0x500
? bnxt_free_mem+0x5ac/0x2500
xdp_rxq_info_unreg+0x4a/0xd0
bnxt_free_mem+0x1356/0x2500
bnxt_close_nic+0xf0/0x3b0
? __pfx_bnxt_close_nic+0x10/0x10
? ethnl_parse_bit+0x2c6/0x6d0
? __pfx___nla_validate_parse+0x10/0x10
? __pfx_ethnl_parse_bit+0x10/0x10
bnxt_set_features+0x2a8/0x3e0
__netdev_update_features+0x4dc/0x1370
? ethnl_parse_bitset+0x4ff/0x750
? __pfx_ethnl_parse_bitset+0x10/0x10
? __pfx___netdev_update_features+0x10/0x10
? mark_held_locks+0xa5/0xf0
? _raw_spin_unlock_irqrestore+0x42/0x70
? __pm_runtime_resume+0x7d/0x110
ethnl_set_features+0x32d/0xa20

To fix this problem, it uses rhashtable_lookup_fast() instead of
rhashtable_lookup() with rcu_read_lock().
Using xa without rcu_read_lock() here is safe.
xa is freed by __xdp_mem_allocator_rcu_free() and this is called by
call_rcu() of mem_xa_remove().
The mem_xa_remove() is called by page_pool_destroy() if a reference
count reaches 0.
The xa is already protected by the reference count mechanism well in the
control plane.
So removing rcu_read_lock() for page_pool_destroy() is safe.

Fixes: c3f812cea0 ("page_pool: do not release pool until inflight == 0.")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240712095116.3801586-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-14 20:40:21 -07:00
..
bpf_sk_storage.c netlink: introduce type-checking attribute iteration 2024-03-29 15:06:02 -07:00
datagram.c net: fix rc7's __skb_datagram_iter() 2024-07-09 11:24:54 -07:00
dev_addr_lists_test.c net: dev_addr_lists: move locking out of init/exit in kunit 2024-04-15 10:26:35 +01:00
dev_addr_lists.c
dev_ioctl.c net: partial revert of the "Make timestamping selectable: series 2023-11-18 18:42:37 -08:00
dev.c net: add softirq safety to netdev_rename_lock 2024-06-21 12:18:34 +01:00
dev.h net: move sysctl_skb_defer_max to net_hotdata 2024-04-30 18:46:52 -07:00
drop_monitor.c drop_monitor: replace spin_lock by raw_spin_lock 2024-04-15 09:54:15 +01:00
dst_cache.c net: dst_cache: add two DEBUG_NET warnings 2024-06-03 18:50:09 -07:00
dst.c net: dst: Make dst_destroy() static and return void. 2024-02-06 11:45:53 +01:00
failover.c net: failover: use IFF_NO_ADDRCONF flag to prevent ipv6 addrconf 2022-12-12 15:18:25 -08:00
fib_notifier.c
fib_rules.c fib: rules: no longer hold RTNL in fib_nl_dumprule() 2024-04-12 19:09:31 -07:00
filter.c bpf: Avoid splat in pskb_pull_reason 2024-06-14 17:20:21 +02:00
flow_dissector.c ip_tunnel: convert __be16 tunnel flags to bitmaps 2024-04-01 10:49:28 +01:00
flow_offload.c tc: flower: Enable offload support IPSEC SPI field. 2023-08-02 10:09:32 +01:00
gen_estimator.c treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
gen_stats.c net: Remove the obsolte u64_stats_fetch_*_irq() users (net). 2022-10-28 20:13:54 -07:00
gro_cells.c net: move netdev_max_backlog to net_hotdata 2024-03-07 21:12:42 -08:00
gro.c net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment 2024-05-13 14:44:06 -07:00
gso.c net: introduce struct net_hotdata 2024-03-07 21:12:41 -08:00
hotdata.c net: move sysctl_mem_pcpu_rsv to net_hotdata 2024-04-30 18:46:52 -07:00
hwbm.c
ieee8021q_helpers.c net: add IEEE 802.1q specific helpers 2024-05-08 10:35:09 +01:00
link_watch.c net: add netdev_set_operstate() helper 2024-02-14 11:20:13 +00:00
lwt_bpf.c lwt: Fix return values of BPF xmit ops 2023-08-18 16:05:26 +02:00
lwtunnel.c xfrm: lwtunnel: squelch kernel warning in case XFRM encap type is not available 2022-10-12 10:45:51 +02:00
Makefile net: add IEEE 802.1q specific helpers 2024-05-08 10:35:09 +01:00
neighbour.c net: Remove the now superfluous sentinel elements from ctl_table array 2024-05-03 13:29:41 +01:00
net_namespace.c netns: Make get_net_ns() handle zero refcount net 2024-06-18 10:59:52 +02:00
net_test.c pfcp: always set pfcp metadata 2024-04-01 10:49:28 +01:00
net-procfs.c net: make softnet_data.dropped an atomic_t 2024-04-01 11:28:32 +01:00
net-sysfs.c net: no longer acquire RTNL in threaded_show() 2024-05-03 15:14:01 -07:00
net-sysfs.h
net-traces.c udp6: add a missing call into udp_fail_queue_rcv_skb tracepoint 2023-07-07 09:16:52 +01:00
netclassid_cgroup.c cgroup, netclassid: on modifying netclassid in cgroup, only consider the main process. 2023-10-16 16:36:53 -07:00
netdev-genl-gen.c netdev: support dumping a single netdev in qstats 2024-04-23 10:09:49 -07:00
netdev-genl-gen.h netdev: add per-queue statistics 2024-03-07 21:13:25 -08:00
netdev-genl.c netdev-genl: fix error codes when outputting XDP features 2024-06-14 18:04:29 -07:00
netevent.c
netpoll.c netpoll: Fix race condition in netpoll_owner_active 2024-04-30 19:03:47 -07:00
netprio_cgroup.c
of_net.c net: Explicitly include correct DT includes 2023-07-27 20:33:16 -07:00
page_pool_priv.h net: page_pool: report when page pool was destroyed 2023-11-28 15:48:39 +01:00
page_pool_user.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2024-03-07 10:29:36 -08:00
page_pool.c dma-mapping updates for Linux 6.10 2024-05-20 10:23:39 -07:00
pktgen.c net: pktgen: Use wait_event_freezable_timeout() for freezable kthread 2023-12-27 14:34:52 +00:00
ptp_classifier.c
request_sock.c tcp: make sure init the accept_queue's spinlocks once 2024-01-19 21:13:25 -08:00
rtnetlink.c rtnetlink: make the "split" NLM_DONE handling generic 2024-06-05 12:34:54 +01:00
scm.c af_unix: Add dead flag to struct scm_fp_list. 2024-05-10 18:52:45 -07:00
secure_seq.c
selftests.c net: fill in MODULE_DESCRIPTION()s under net/core 2023-10-28 11:29:27 +01:00
skbuff.c Revert "net: mirror skb frag ref/unref helpers" 2024-05-03 16:05:53 -07:00
skmsg.c skmsg: Skip zero length skb in sk_msg_recvmsg 2024-07-09 10:24:50 +02:00
sock_destructor.h
sock_diag.c sock_diag: remove sock_diag_mutex 2024-01-23 15:13:55 +01:00
sock_map.c sock_map: avoid race between sock_map_close and sk_psock_put 2024-05-28 12:05:19 +02:00
sock_reuseport.c soreuseport: Fix socket selection for SO_INCOMING_CPU. 2022-10-25 11:35:16 +02:00
sock.c net: do not leave a dangling sk pointer, when socket creation fails 2024-06-20 10:43:14 +02:00
stream.c net: Return error from sk_stream_wait_connect() if sk_wait_event() fails 2023-12-15 10:48:51 +00:00
sysctl_net_core.c net: Remove the now superfluous sentinel elements from ctl_table array 2024-05-03 13:29:41 +01:00
timestamping.c net: partial revert of the "Make timestamping selectable: series 2023-11-18 18:42:37 -08:00
tso.c net: tso: inline tso_count_descs() 2022-12-12 15:04:39 -08:00
utils.c net: core: inet[46]_pton strlen len types 2022-11-01 21:14:39 -07:00
xdp.c xdp: fix invalid wait context of page_pool_destroy() 2024-07-14 20:40:21 -07:00