-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmbk/nIACgkQ6rmadz2v
bTqxuBAAnqW81Rr0nORIxeJMbyo4EiFuYHGk6u5BYP9NPzqHroUPCLVmSP7Hp/Ta
CJjsiZeivZsGa6Qlc3BCa4hHNpqP5WE1C/73svSDn7/99EfxdSBtirpMVFUPsUtn
DDb5chNpvnxKNS8Mw5Ty8wBrdbXHMlSx+IfaFHpv0Yn6EAcuF4UdoEUq2l3PqhfD
Il9Zm127eViPGAP+o+TBZFfW+rRw8d0ngqeRq2GvJ8ibNEDWss+GmBI1Dod7d+fC
dUDg96Ipdm1a5Xz7dnH80eXz9JHdpu6qhQrQMKKArnlpJElrKiOf9b17ZcJoPQOR
ZnstEnUyVnrWROZxUuKY72+2tx3TuSf+L9uZqFHNx3Ix5FIoS+tFbHf4b8SxtsOb
hb2X7SigdGqhQDxUT+IPeO5hsJlIvG1/VYxMXxgc++rh9DjL06hDLUSH1WBSU0fC
kFQ7HrcpAlVHtWmGbwwUyVjD+KC/qmZBTAnkcYT4C62WZVytSCnihIuSFAvV1tpZ
SSIhVPyQ599UoZIiQYihp0S4qP74FotCtErWSrThneh2Cl8kDsRq//lV1nj/PTV8
CpTvz4VCFDFTgthCfd62fP95EwW5K+aE3NjGTPW/9Hx/0+J/1tT+yqWsrToGaruf
TbrqtzQhpclz9UEqA+696cVAXNj9uRU4AoD3YIg72kVnRlkgYd0=
=MDwh
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Introduce '__attribute__((bpf_fastcall))' for helpers and kfuncs with
corresponding support in LLVM.
It is similar to existing 'no_caller_saved_registers' attribute in
GCC/LLVM with a provision for backward compatibility. It allows
compilers generate more efficient BPF code assuming the verifier or
JITs will inline or partially inline a helper/kfunc with such
attribute. bpf_cast_to_kern_ctx, bpf_rdonly_cast,
bpf_get_smp_processor_id are the first set of such helpers.
- Harden and extend ELF build ID parsing logic.
When called from sleepable context the relevants parts of ELF file
will be read to find and fetch .note.gnu.build-id information. Also
harden the logic to avoid TOCTOU, overflow, out-of-bounds problems.
- Improvements and fixes for sched-ext:
- Allow passing BPF iterators as kfunc arguments
- Make the pointer returned from iter_next method trusted
- Fix x86 JIT convergence issue due to growing/shrinking conditional
jumps in variable length encoding
- BPF_LSM related:
- Introduce few VFS kfuncs and consolidate them in
fs/bpf_fs_kfuncs.c
- Enforce correct range of return values from certain LSM hooks
- Disallow attaching to other LSM hooks
- Prerequisite work for upcoming Qdisc in BPF:
- Allow kptrs in program provided structs
- Support for gen_epilogue in verifier_ops
- Important fixes:
- Fix uprobe multi pid filter check
- Fix bpf_strtol and bpf_strtoul helpers
- Track equal scalars history on per-instruction level
- Fix tailcall hierarchy on x86 and arm64
- Fix signed division overflow to prevent INT_MIN/-1 trap on x86
- Fix get kernel stack in BPF progs attached to tracepoint:syscall
- Selftests:
- Add uprobe bench/stress tool
- Generate file dependencies to drastically improve re-build time
- Match JIT-ed and BPF asm with __xlated/__jited keywords
- Convert older tests to test_progs framework
- Add support for RISC-V
- Few fixes when BPF programs are compiled with GCC-BPF backend
(support for GCC-BPF in BPF CI is ongoing in parallel)
- Add traffic monitor
- Enable cross compile and musl libc
* tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (260 commits)
btf: require pahole 1.21+ for DEBUG_INFO_BTF with default DWARF version
btf: move pahole check in scripts/link-vmlinux.sh to lib/Kconfig.debug
btf: remove redundant CONFIG_BPF test in scripts/link-vmlinux.sh
bpf: Call the missed kfree() when there is no special field in btf
bpf: Call the missed btf_record_free() when map creation fails
selftests/bpf: Add a test case to write mtu result into .rodata
selftests/bpf: Add a test case to write strtol result into .rodata
selftests/bpf: Rename ARG_PTR_TO_LONG test description
selftests/bpf: Fix ARG_PTR_TO_LONG {half-,}uninitialized test
bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error
bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged types
bpf: Fix helper writes to read-only maps
bpf: Remove truncation test in bpf_strtol and bpf_strtoul helpers
bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit
selftests/bpf: Add tests for sdiv/smod overflow cases
bpf: Fix a sdiv overflow issue
libbpf: Add bpf_object__token_fd accessor
docs/bpf: Add missing BPF program types to docs
docs/bpf: Add constant values for linkages
bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
...
In cases when synchronizing DMA operations is necessary,
xsk_buff_alloc_batch() returns a single buffer instead of the requested
count. This puts the pressure on drivers that use batch API as they have
to check for this corner case on their side and take care of allocations
by themselves, which feels counter productive. Let us improve the core
by looping over xp_alloc() @max times when slow path needs to be taken.
Another issue with current interface, as spotted and fixed by Dries, was
that when driver called xsk_buff_alloc_batch() with @max == 0, for slow
path case it still allocated and returned a single buffer, which should
not happen. By introducing the logic from first paragraph we kill two
birds with one stone and address this problem as well.
Fixes: 47e4075df3 ("xsk: Batched buffer allocation for the pool")
Reported-and-tested-by: Dries De Winter <ddewinter@synamedia.com>
Co-developed-by: Dries De Winter <ddewinter@synamedia.com>
Signed-off-by: Dries De Winter <ddewinter@synamedia.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://patch.msgid.link/20240911191019.296480-1-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZuH9UQAKCRDbK58LschI
g0/zAP99WOcCBp1M/jSTUOba230+eiol7l5RirDEA6wu7TqY2QEAuvMG0KfCCpTI
I0WqStrK1QMbhwKPodJC1k+17jArKgw=
=jfMU
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-09-11
We've added 12 non-merge commits during the last 16 day(s) which contain
a total of 20 files changed, 228 insertions(+), 30 deletions(-).
There's a minor merge conflict in drivers/net/netkit.c:
00d066a4d4 ("netdev_features: convert NETIF_F_LLTX to dev->lltx")
d966087948 ("netkit: Disable netpoll support")
The main changes are:
1) Enable bpf_dynptr_from_skb for tp_btf such that this can be used
to easily parse skbs in BPF programs attached to tracepoints,
from Philo Lu.
2) Add a cond_resched() point in BPF's sock_hash_free() as there have
been several syzbot soft lockup reports recently, from Eric Dumazet.
3) Fix xsk_buff_can_alloc() to account for queue_empty_descs which
got noticed when zero copy ice driver started to use it,
from Maciej Fijalkowski.
4) Move the xdp:xdp_cpumap_kthread tracepoint before cpumap pushes skbs
up via netif_receive_skb_list() to better measure latencies,
from Daniel Xu.
5) Follow-up to disable netpoll support from netkit, from Daniel Borkmann.
6) Improve xsk selftests to not assume a fixed MAX_SKB_FRAGS of 17 but
instead gather the actual value via /proc/sys/net/core/max_skb_frags,
also from Maciej Fijalkowski.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
sock_map: Add a cond_resched() in sock_hash_free()
selftests/bpf: Expand skb dynptr selftests for tp_btf
bpf: Allow bpf_dynptr_from_skb() for tp_btf
tcp: Use skb__nullable in trace_tcp_send_reset
selftests/bpf: Add test for __nullable suffix in tp_btf
bpf: Support __nullable argument suffix for tp_btf
bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv
selftests/xsk: Read current MAX_SKB_FRAGS from sysctl knob
xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc()
tcp_bpf: Remove an unused parameter for bpf_tcp_ingress()
bpf, sockmap: Correct spelling skmsg.c
netkit: Disable netpoll support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Link: https://patch.msgid.link/20240911211525.13834-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a netdev_dmabuf_binding struct which represents the
dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
rx queues on the netdevice. On the binding, the dma_buf_attach
& dma_buf_map_attachment will occur. The entries in the sg_table from
mapping will be inserted into a genpool to make it ready
for allocation.
The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
holds the dma-buf offset of the base of the chunk and the dma_addr of
the chunk. Both are needed to use allocations that come from this chunk.
We create a new type that represents an allocation from the genpool:
net_iov. We setup the net_iov allocation size in the
genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
allocated by the page pool and given to the drivers.
The user can unbind the dmabuf from the netdevice by closing the netlink
socket that established the binding. We do this so that the binding is
automatically unbound even if the userspace process crashes.
The binding and unbinding leaves an indicator in struct netdev_rx_queue
that the given queue is bound, and the binding is actuated by resetting
the rx queue using the queue API.
The netdev_dmabuf_binding struct is refcounted, and releases its
resources only when all the refs are released.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # excluding netlink
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240910171458.219195-4-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We have STAT_FILL_EMPTY test case in xskxceiver that tries to process
traffic with fill queue being empty which currently fails for zero copy
ice driver after it started to use xsk_buff_can_alloc() API. That is
because xsk_queue::queue_empty_descs is currently only increased from
alloc APIs and right now if driver sees that xsk_buff_pool will be
unable to provide the requested count of buffers, it bails out early,
skipping calls to xsk_buff_alloc{_batch}().
Mentioned statistic should be handled in xsk_buff_can_alloc() from the
very beginning, so let's add this logic now. Do it by open coding
xskq_cons_has_entries() and bumping queue_empty_descs in the middle when
fill queue currently has no entries.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240904162808.249160-1-maciej.fijalkowski@intel.com
We recently found out that extending xsk_umem_reg might be a bit
complicated due to not enforcing padding to be zero [0]. Add
a couple of things to make it less error-prone:
1. Remove xdp_umem_reg_v2 since its sizeof is the same as xdp_umem_reg
2. Add BUILD_BUG_ON that checks that the size of xdp_umem_reg_v1 is less
than xdp_umem_reg; presumably, when we get to v2, there is gonna
be a similar line to enforce that sizeof(v2) > sizeof(v1)
3. Add BUILD_BUG_ON to make sure the last field plus its size matches
the overall struct size. The intent is to demonstrate that we don't
have any lingering padding.
0: https://lore.kernel.org/bpf/ZqI29QE+5JnkdPmE@boxer/T/#me03113f7c2458fd08f3c4114a7a9472ac3646c98
Reported-by: Julian Schindel <mail@arctic-alpaca.de>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20240726222048.1397869-1-sdf@fomichev.me
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Julian reports that commit 341ac980ea ("xsk: Support tx_metadata_len")
can break existing use cases which don't zero-initialize xdp_umem_reg
padding. Introduce new XDP_UMEM_TX_METADATA_LEN to make sure we
interpret the padding as tx_metadata_len only when being explicitly
asked.
Fixes: 341ac980ea ("xsk: Support tx_metadata_len")
Reported-by: Julian Schindel <mail@arctic-alpaca.de>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20240713015253.121248-2-sdf@fomichev.me
The bpf_net_ctx_get_.*_flush_list() are used at the top of the function.
This means the variable is always assigned even if unused. By moving the
function to where it is used, it is possible to delay the initialisation
until it is unavoidable.
Not sure how much this gains in reality but by looking at bq_enqueue()
(in devmap.c) gcc pushes one register less to the stack. \o/.
Move flush list retrieval to where it is used.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Every NIC driver utilizing XDP should invoke xdp_do_flush() after
processing all packages. With the introduction of the bpf_net_context
logic the flush lists (for dev, CPU-map and xsk) are lazy initialized
only if used. However xdp_do_flush() tries to flush all three of them so
all three lists are always initialized and the likely empty lists are
"iterated".
Without the usage of XDP but with CONFIG_DEBUG_NET the lists are also
initialized due to xdp_do_check_flushed().
Jakub suggest to utilize the hints in bpf_net_context and avoid invoking
the flush function. This will also avoiding initializing the lists which
are otherwise unused.
Introduce bpf_net_ctx_get_all_used_flush_lists() to return the
individual list if not-empty. Use the logic in xdp_do_flush() and
xdp_do_check_flushed(). Remove the not needed .*_check_flush().
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The per-CPU flush lists, which are accessed from within the NAPI callback
(xdp_do_flush() for instance), are per-CPU. There are subject to the
same problem as struct bpf_redirect_info.
Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and
xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the
access. The lists initialized on first usage (similar to
bpf_net_ctx_get_ri()).
Cc: "Björn Töpel" <bjorn@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-16-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 2863d665ea.
This patch introduced a potential kernel crash when multiple napi instances
redirect to the same AF_XDP socket. By removing the queue_index check, it is
possible for multiple napi instances to access the Rx ring at the same time,
which will result in a corrupted ring state which can lead to a crash when
flushing the rings in __xsk_flush(). This can happen when the linked list of
sockets to flush gets corrupted by concurrent accesses. A quick and small fix
is not possible, so let us revert this for now.
Reported-by: Yuval El-Hanany <YuvalE@radware.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/xdp-newbies/8100DBDC-0B7C-49DB-9995-6027F6E63147@radware.com
Link: https://lore.kernel.org/bpf/20240604122927.29080-2-magnus.karlsson@gmail.com
XSk infra's been using its own DMA sync shortcut to try avoiding
redundant function calls. Now that there is a generic one, remove
the custom implementation and rely on the generic helpers.
xsk_buff_dma_sync_for_cpu() doesn't need the second argument anymore,
remove it.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
syzbot reported an illegal copy in xsk_setsockopt() [1]
Make sure to validate setsockopt() @optlen parameter.
[1]
BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline]
BUG: KASAN: slab-out-of-bounds in xsk_setsockopt+0x909/0xa40 net/xdp/xsk.c:1420
Read of size 4 at addr ffff888028c6cde3 by task syz-executor.0/7549
CPU: 0 PID: 7549 Comm: syz-executor.0 Not tainted 6.8.0-syzkaller-08951-gfe46a7dd189e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
print_address_description mm/kasan/report.c:377 [inline]
print_report+0x169/0x550 mm/kasan/report.c:488
kasan_report+0x143/0x180 mm/kasan/report.c:601
copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
copy_from_sockptr include/linux/sockptr.h:55 [inline]
xsk_setsockopt+0x909/0xa40 net/xdp/xsk.c:1420
do_sock_setsockopt+0x3af/0x720 net/socket.c:2311
__sys_setsockopt+0x1ae/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xfb/0x240
entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7fb40587de69
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb40665a0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007fb4059abf80 RCX: 00007fb40587de69
RDX: 0000000000000005 RSI: 000000000000011b RDI: 0000000000000006
RBP: 00007fb4058ca47a R08: 0000000000000002 R09: 0000000000000000
R10: 0000000020001980 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007fb4059abf80 R15: 00007fff57ee4d08
</TASK>
Allocated by task 7549:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
__kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
kasan_kmalloc include/linux/kasan.h:211 [inline]
__do_kmalloc_node mm/slub.c:3966 [inline]
__kmalloc+0x233/0x4a0 mm/slub.c:3979
kmalloc include/linux/slab.h:632 [inline]
__cgroup_bpf_run_filter_setsockopt+0xd2f/0x1040 kernel/bpf/cgroup.c:1869
do_sock_setsockopt+0x6b4/0x720 net/socket.c:2293
__sys_setsockopt+0x1ae/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
do_syscall_64+0xfb/0x240
entry_SYSCALL_64_after_hwframe+0x6d/0x75
The buggy address belongs to the object at ffff888028c6cde0
which belongs to the cache kmalloc-8 of size 8
The buggy address is located 1 bytes to the right of
allocated 2-byte region [ffff888028c6cde0, ffff888028c6cde2)
The buggy address belongs to the physical page:
page:ffffea0000a31b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888028c6c9c0 pfn:0x28c6c
anon flags: 0xfff00000000800(slab|node=0|zone=1|lastcpupid=0x7ff)
page_type: 0xffffffff()
raw: 00fff00000000800 ffff888014c41280 0000000000000000 dead000000000001
raw: ffff888028c6c9c0 0000000080800057 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112cc0(GFP_USER|__GFP_NOWARN|__GFP_NORETRY), pid 6648, tgid 6644 (syz-executor.0), ts 133906047828, free_ts 133859922223
set_page_owner include/linux/page_owner.h:31 [inline]
post_alloc_hook+0x1ea/0x210 mm/page_alloc.c:1533
prep_new_page mm/page_alloc.c:1540 [inline]
get_page_from_freelist+0x33ea/0x3580 mm/page_alloc.c:3311
__alloc_pages+0x256/0x680 mm/page_alloc.c:4569
__alloc_pages_node include/linux/gfp.h:238 [inline]
alloc_pages_node include/linux/gfp.h:261 [inline]
alloc_slab_page+0x5f/0x160 mm/slub.c:2175
allocate_slab mm/slub.c:2338 [inline]
new_slab+0x84/0x2f0 mm/slub.c:2391
___slab_alloc+0xc73/0x1260 mm/slub.c:3525
__slab_alloc mm/slub.c:3610 [inline]
__slab_alloc_node mm/slub.c:3663 [inline]
slab_alloc_node mm/slub.c:3835 [inline]
__do_kmalloc_node mm/slub.c:3965 [inline]
__kmalloc_node+0x2db/0x4e0 mm/slub.c:3973
kmalloc_node include/linux/slab.h:648 [inline]
__vmalloc_area_node mm/vmalloc.c:3197 [inline]
__vmalloc_node_range+0x5f9/0x14a0 mm/vmalloc.c:3392
__vmalloc_node mm/vmalloc.c:3457 [inline]
vzalloc+0x79/0x90 mm/vmalloc.c:3530
bpf_check+0x260/0x19010 kernel/bpf/verifier.c:21162
bpf_prog_load+0x1667/0x20f0 kernel/bpf/syscall.c:2895
__sys_bpf+0x4ee/0x810 kernel/bpf/syscall.c:5631
__do_sys_bpf kernel/bpf/syscall.c:5738 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5736 [inline]
__x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5736
do_syscall_64+0xfb/0x240
entry_SYSCALL_64_after_hwframe+0x6d/0x75
page last free pid 6650 tgid 6647 stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1140 [inline]
free_unref_page_prepare+0x95d/0xa80 mm/page_alloc.c:2346
free_unref_page_list+0x5a3/0x850 mm/page_alloc.c:2532
release_pages+0x2117/0x2400 mm/swap.c:1042
tlb_batch_pages_flush mm/mmu_gather.c:98 [inline]
tlb_flush_mmu_free mm/mmu_gather.c:293 [inline]
tlb_flush_mmu+0x34d/0x4e0 mm/mmu_gather.c:300
tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:392
exit_mmap+0x4b6/0xd40 mm/mmap.c:3300
__mmput+0x115/0x3c0 kernel/fork.c:1345
exit_mm+0x220/0x310 kernel/exit.c:569
do_exit+0x99e/0x27e0 kernel/exit.c:865
do_group_exit+0x207/0x2c0 kernel/exit.c:1027
get_signal+0x176e/0x1850 kernel/signal.c:2907
arch_do_signal_or_restart+0x96/0x860 arch/x86/kernel/signal.c:310
exit_to_user_mode_loop kernel/entry/common.c:105 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:201 [inline]
syscall_exit_to_user_mode+0xc9/0x360 kernel/entry/common.c:212
do_syscall_64+0x10a/0x240 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x6d/0x75
Memory state around the buggy address:
ffff888028c6cc80: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
ffff888028c6cd00: fa fc fc fc fa fc fc fc 00 fc fc fc 06 fc fc fc
>ffff888028c6cd80: fa fc fc fc fa fc fc fc fa fc fc fc 02 fc fc fc
^
ffff888028c6ce00: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
ffff888028c6ce80: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
Fixes: 423f38329d ("xsk: add umem fill queue support and mmap")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: "Björn Töpel" <bjorn@kernel.org>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240404202738.3634547-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZeEKVAAKCRDbK58LschI
g7oYAQD5Jlv4fIVTvxvfZrTTZ2tU+OsPa75mc8SDKwpash3YygEA8kvESy8+t6pg
D6QmSf1DIZdFoSp/bV+pfkNWMeR8gwg=
=mTAj
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-02-29
We've added 119 non-merge commits during the last 32 day(s) which contain
a total of 150 files changed, 3589 insertions(+), 995 deletions(-).
The main changes are:
1) Extend the BPF verifier to enable static subprog calls in spin lock
critical sections, from Kumar Kartikeya Dwivedi.
2) Fix confusing and incorrect inference of PTR_TO_CTX argument type
in BPF global subprogs, from Andrii Nakryiko.
3) Larger batch of riscv BPF JIT improvements and enabling inlining
of the bpf_kptr_xchg() for RV64, from Pu Lehui.
4) Allow skeleton users to change the values of the fields in struct_ops
maps at runtime, from Kui-Feng Lee.
5) Extend the verifier's capabilities of tracking scalars when they
are spilled to stack, especially when the spill or fill is narrowing,
from Maxim Mikityanskiy & Eduard Zingerman.
6) Various BPF selftest improvements to fix errors under gcc BPF backend,
from Jose E. Marchesi.
7) Avoid module loading failure when the module trying to register
a struct_ops has its BTF section stripped, from Geliang Tang.
8) Annotate all kfuncs in .BTF_ids section which eventually allows
for automatic kfunc prototype generation from bpftool, from Daniel Xu.
9) Several updates to the instruction-set.rst IETF standardization
document, from Dave Thaler.
10) Shrink the size of struct bpf_map resp. bpf_array,
from Alexei Starovoitov.
11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer,
from Benjamin Tissoires.
12) Fix bpftool to be more portable to musl libc by using POSIX's
basename(), from Arnaldo Carvalho de Melo.
13) Add libbpf support to gcc in CORE macro definitions,
from Cupertino Miranda.
14) Remove a duplicate type check in perf_event_bpf_event,
from Florian Lehner.
15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them
with notrace correctly, from Yonghong Song.
16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible
array to fix build warnings, from Kees Cook.
17) Fix resolve_btfids cross-compilation to non host-native endianness,
from Viktor Malik.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
selftests/bpf: Test if shadow types work correctly.
bpftool: Add an example for struct_ops map and shadow type.
bpftool: Generated shadow variables for struct_ops maps.
libbpf: Convert st_ops->data to shadow type.
libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
bpf, arm64: use bpf_prog_pack for memory management
arm64: patching: implement text_poke API
bpf, arm64: support exceptions
arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT
bpf: add is_async_callback_calling_insn() helper
bpf: introduce in_sleepable() helper
bpf: allow more maps in sleepable bpf programs
selftests/bpf: Test case for lacking CFI stub functions.
bpf: Check cfi_stubs before registering a struct_ops type.
bpf: Clarify batch lookup/lookup_and_delete semantics
bpf, docs: specify which BPF_ABS and BPF_IND fields were zero
bpf, docs: Fix typos in instruction-set.rst
selftests/bpf: update tcp_custom_syncookie to use scalar packet offset
bpf: Shrink size of struct bpf_map/bpf_array.
...
====================
Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xsk_build_skb() allocates a page and adds it to the skb via
skb_add_rx_frag() and specifies 0 for truesize. This leads to a warning
in skb_add_rx_frag() with CONFIG_DEBUG_NET enabled because size is
larger than truesize.
Increasing truesize requires to add the same amount to socket's
sk_wmem_alloc counter in order not to underflow the counter during
release in the destructor (sock_wfree()).
Pass the size of the allocated page as truesize to skb_add_rx_frag().
Add this mount to socket's sk_wmem_alloc counter.
Fixes: cf24f5a5fe ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20240202163221.2488589-1-bigeasy@linutronix.de
Add support for directing a packet to any socket bound to the same
umem. This makes it possible to use the XDP program to select what
socket the packet should be received on. The user can populate the
XSKMAP with various sockets and as long as they share the same umem,
the XDP program can pick any one of them.
Suggested-by: Yuval El-Hanany <yuvale@radware.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240205123553.22180-2-magnus.karlsson@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is
used by drivers to notify data path whether xdp_buff contains fragments
or not. Data path looks up mentioned flag on first buffer that occupies
the linear part of xdp_buff, so drivers only modify it there. This is
sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on
stack or it resides within struct representing driver's queue and
fragments are carried via skb_frag_t structs. IOW, we are dealing with
only one xdp_buff.
ZC mode though relies on list of xdp_buff structs that is carried via
xsk_buff_pool::xskb_list, so ZC data path has to make sure that
fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
xsk_buff_free() could misbehave if it would be executed against xdp_buff
that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can
take place when within supplied XDP program bpf_xdp_adjust_tail() is
used with negative offset that would in turn release the tail fragment
from multi-buffer frame.
Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would
result in releasing all the nodes from xskb_list that were produced by
driver before XDP program execution, which is not what is intended -
only tail fragment should be deleted from xskb_list and then it should
be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never
make it up to user space, so from AF_XDP application POV there would be
no traffic running, however due to free_list getting constantly new
nodes, driver will be able to feed HW Rx queue with recycled buffers.
Bottom line is that instead of traffic being redirected to user space,
it would be continuously dropped.
To fix this, let us clear the mentioned flag on xsk_buff_pool side
during xdp_buff initialization, which is what should have been done
right from the start of XSK multi-buffer support.
Fixes: 1bbc04de60 ("ice: xsk: add RX multi-buffer support")
Fixes: 1c9ba9c146 ("i40e: xsk: add RX multi-buffer support")
Fixes: 24ea50127e ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-3-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add missing xsk_buff_free() call when __xsk_rcv_zc() failed to produce
descriptor to XSK Rx queue.
Fixes: 24ea50127e ("xsk: support mbuf on ZC RX")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-2-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Following patch is going to use RCU instead of
sock_diag_table_mutex acquisition.
This patch is a preparation, no change of behavior yet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmWAz2EACgkQ6rmadz2v
bToqrw/9EwroZCc8GEHOKAlb/fzrMvn92rLo0ZW/cGN84QJPnx4zM6Zo0+fgLaaN
oqqztwMUwdzGC3uX3FfVXaaLKbJ/MeHeL9BXFZNW8zkRHciw4R7kIBhOdPnHyET7
uT+rQ4xPe1Mt7e9PjepKlSL5mEsxWfBkdUgsdn19Z2Vjdfr9mZMhYWYMJGcfTCD1
TwxHKBPhq5fN3IsshmMBB8IrRp1HStUKb65MgZ4dI22LJXxTsFkx5XMFXcmuqvkH
NhKj8jDcPEEh31bYcb6aG2Z4onw5F2lquygjk1Qyy5cyw45m/ipJKAXKdAyvJG+R
VZCWOET/9wbRwFSK5wxwihCuKghFiofK52i2PcGtXZh0PCouyZZneSJOKM0yVWKO
BvuJBxK4ETRnQyN6ZxhuJiEXG3/YMBBhyR2TX1LntVK9ct/k7qFVzATG49J39/sR
SYMbptBRj4a5oMJ1qn0nFVEDFkg0jTnTDNnsEpcz60Ayt6EsJ1XosO5yz2huf861
xgRMTKMseyG1/uV45tQ8ZPzbSPpBxjUi9Dl3coYsIm1a+y6clWUXcarONY5KVrpS
CR98DuFgl+E7dXuisd/Kz2p2KxxSPq8nytsmLlgOvrUqhwiXqB+TKN8EHgIapVOt
l1A5LrzXFTcGlT9MlaWBqEIy83Bu1nqQqbxrAFOE0k8A5jomXaw=
=stU2
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-12-18
This PR is larger than usual and contains changes in various parts
of the kernel.
The main changes are:
1) Fix kCFI bugs in BPF, from Peter Zijlstra.
End result: all forms of indirect calls from BPF into kernel
and from kernel into BPF work with CFI enabled. This allows BPF
to work with CONFIG_FINEIBT=y.
2) Introduce BPF token object, from Andrii Nakryiko.
It adds an ability to delegate a subset of BPF features from privileged
daemon (e.g., systemd) through special mount options for userns-bound
BPF FS to a trusted unprivileged application. The design accommodates
suggestions from Christian Brauner and Paul Moore.
Example:
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei.
- Complete precision tracking support for register spills
- Fix verification of possibly-zero-sized stack accesses
- Fix access to uninit stack slots
- Track aligned STACK_ZERO cases as imprecise spilled registers.
It improves the verifier "instructions processed" metric from single
digit to 50-60% for some programs.
- Fix verifier retval logic
4) Support for VLAN tag in XDP hints, from Larysa Zaremba.
5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu.
End result: better memory utilization and lower I$ miss for calls to BPF
via BPF trampoline.
6) Fix race between BPF prog accessing inner map and parallel delete,
from Hou Tao.
7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu.
It allows BPF interact with IPSEC infra. The intent is to support
software RSS (via XDP) for the upcoming ipsec pcpu work.
Experiments on AWS demonstrate single tunnel pcpu ipsec reaching
line rate on 100G ENA nics.
8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao.
9) BPF file verification via fsverity, from Song Liu.
It allows BPF progs get fsverity digest.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits)
bpf: Ensure precise is reset to false in __mark_reg_const_zero()
selftests/bpf: Add more uprobe multi fail tests
bpf: Fail uprobe multi link with negative offset
selftests/bpf: Test the release of map btf
s390/bpf: Fix indirect trampoline generation
selftests/bpf: Temporarily disable dummy_struct_ops test on s390
x86/cfi,bpf: Fix bpf_exception_cb() signature
bpf: Fix dtor CFI
cfi: Add CFI_NOSEAL()
x86/cfi,bpf: Fix bpf_struct_ops CFI
x86/cfi,bpf: Fix bpf_callback_t CFI
x86/cfi,bpf: Fix BPF JIT call
cfi: Flip headers
selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment
selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test
selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment
bpf: Limit the number of kprobes when attaching program to multiple kprobes
bpf: Limit the number of uprobes when attaching program to multiple uprobes
bpf: xdp: Register generic_kfunc_set with XDP programs
selftests/bpf: utilize string values for delegate_xxx mount options
...
====================
Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 94ecc5ca4d ("xsk: Add cb area to struct xdp_buff_xsk") has added
a buffer for custom data to xdp_buff_xsk. Particularly, this memory is used
for data, consumed by XDP hints kfuncs. It does not always change on
a per-packet basis and some parts can be set for example, at the same time
as RX queue info.
Add functions to fill all cbs in xsk_buff_pool with the same metadata.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20231205210847.28460-8-larysa.zaremba@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In xsk_poll(), checking available events and setting mask bits should
be executed only when a socket has been bound. Setting mask bits for
unbound socket is meaningless.
Currently, it checks events even when xsk_check_common() failed.
To prevent this, we move goto location (skip_tx) after that checking.
Fixes: 1596dae2f1 ("xsk: check IFF_UP earlier in Tx path")
Signed-off-by: Yewon Choi <woni9911@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20231201061048.GA1510@libra05
For XDP_COPY mode, add a UMEM option XDP_UMEM_TX_SW_CSUM
to call skb_checksum_help in transmit path. Might be useful
to debugging issues with real hardware. I also use this mode
in the selftests.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231127190319.1190813-9-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Accept only the flags that the kernel knows about to make
sure we can extend this field in the future. Note that only
in XDP_COPY mode we propagate the error signal back to the user
(via sendmsg). For zerocopy mode we silently skip the metadata
for the descriptors that have wrong flags (since we process
the descriptors deep in the driver).
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231127190319.1190813-8-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).
The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,_complete}
are static inline, we don't incur any extra overhead doing
indirect calls.
The benefit of this scheme is as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- don't need any extra flags to maintain to keep track of what
offloads are implemented; if the callback is implemented - the offload
is supported (used by netlink reporting code)
Two offloads are defined right now:
1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata
area upon completion (tx_timestamp field)
XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes
SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass
metadata pointer).
The struct is forward-compatible and can be extended in the future
by appending more fields.
Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231127190319.1190813-3-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For zerocopy mode, tx_desc->addr can point to an arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is no way currently to populate skb metadata.
Introduce new tx_metadata_len umem config option that indicates how many
bytes to treat as metadata. Metadata bytes come prior to tx_desc address
(same as in RX case).
The size of the metadata has mostly the same constraints as XDP:
- less than 256 bytes
- 8-byte aligned (compared to 4-byte alignment on xdp, due to 8-byte
timestamp in the completion)
- non-zero
This data is not interpreted in any way right now.
Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231127190319.1190813-2-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to all the sock diag modules in one fell swoop.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the previous implementation, when multiple xsk sockets were
associated with a single xsk_buff_pool, a situation could arise
where the xsk_tx_list maintained data at the front for one xsk
socket while starving the xsk sockets at the back of the list.
This could result in issues such as the inability to transmit packets,
increased latency, and jitter. To address this problem, we introduce
a new variable called tx_budget_spent, which limits each xsk to transmit
a maximum of MAX_PER_SOCKET_BUDGET tx descriptors. This allocation ensures
equitable opportunities for subsequent xsk sockets to send tx descriptors.
The value of MAX_PER_SOCKET_BUDGET is set to 32.
Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20231023125732.82261-1-huangjie.albert@bytedance.com
A few drivers were missing a xdp_do_flush() invocation after
XDP_REDIRECT.
Add three helper functions each for one of the per-CPU lists. Return
true if the per-CPU list is non-empty and flush the list.
Add xdp_do_check_flushed() which invokes each helper functions and
creates a warning if one of the functions had a non-empty list.
Hide everything behind CONFIG_DEBUG_NET.
Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20231016125738.Yt79p1uF@linutronix.de
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
kernel/bpf/verifier.c
829955981c ("bpf: Fix verifier log for async callback return values")
a923819fb2 ("bpf: Treat first argument as return value for bpf_throw")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller reported the following issue:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 2807 at mm/vmalloc.c:3247 __vmalloc_node_range (mm/vmalloc.c:3361)
Modules linked in:
CPU: 0 PID: 2807 Comm: repro Not tainted 6.6.0-rc2+ #12
Hardware name: Generic DT based system
unwind_backtrace from show_stack (arch/arm/kernel/traps.c:258)
show_stack from dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
dump_stack_lvl from __warn (kernel/panic.c:633 kernel/panic.c:680)
__warn from warn_slowpath_fmt (./include/linux/context_tracking.h:153 kernel/panic.c:700)
warn_slowpath_fmt from __vmalloc_node_range (mm/vmalloc.c:3361 (discriminator 3))
__vmalloc_node_range from vmalloc_user (mm/vmalloc.c:3478)
vmalloc_user from xskq_create (net/xdp/xsk_queue.c:40)
xskq_create from xsk_setsockopt (net/xdp/xsk.c:953 net/xdp/xsk.c:1286)
xsk_setsockopt from __sys_setsockopt (net/socket.c:2308)
__sys_setsockopt from ret_fast_syscall (arch/arm/kernel/entry-common.S:68)
xskq_get_ring_size() uses struct_size() macro to safely calculate the
size of struct xsk_queue and q->nentries of desc members. But the
syzkaller repro was able to set q->nentries with the value initially
taken from copy_from_sockptr() high enough to return SIZE_MAX by
struct_size(). The next PAGE_ALIGN(size) is such case will overflow
the size_t value and set it to 0. This will trigger WARN_ON_ONCE in
vmalloc_user() -> __vmalloc_node_range().
The issue is reproducible on 32-bit arm kernel.
Fixes: 9f78bf330a ("xsk: support use vaddr as ring")
Reported-by: syzbot+fae676d3cf469331fc89@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000c84b4705fb31741e@google.com/T/
Reported-by: syzbot+b132693e925cbbd89e26@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000e20df20606ebab4f@google.com/T/
Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+fae676d3cf469331fc89@syzkaller.appspotmail.com
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://syzkaller.appspot.com/bug?extid=fae676d3cf469331fc89
Link: https://lore.kernel.org/bpf/20231007075148.1759-1-andrew.kanner@gmail.com
This is a followup of 8bf43be799 ("net: annotate data-races
around sk->sk_priority").
sk->sk_priority can be read and written without holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Userspace applications indicate their multi-buffer capability to xsk
using XSK_USE_SG socket bind flag. For sockets using shared umem the
bind flag may contain XSK_USE_SG only for the first socket. For any
subsequent socket the only option supported is XDP_SHARED_UMEM.
Add option XDP_UMEM_SG_FLAG in umem config flags to store the
multi-buffer handling capability when indicated by XSK_USE_SG option in
bing flag by the first socket. Use this to derive multi-buffer capability
for subsequent sockets in xsk core.
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Fixes: 81470b5c3c ("xsk: introduce XSK_USE_SG bind flag for xsk socket")
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230907035032.2627879-1-tirthendu.sarkar@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fix a use-after-free error that is possible if the xsk_diag interface
is used after the socket has been unbound from the device. This can
happen either due to the socket being closed or the device
disappearing. In the early days of AF_XDP, the way we tested that a
socket was not bound to a device was to simply check if the netdevice
pointer in the xsk socket structure was NULL. Later, a better system
was introduced by having an explicit state variable in the xsk socket
struct. For example, the state of a socket that is on the way to being
closed and has been unbound from the device is XSK_UNBOUND.
The commit in the Fixes tag below deleted the old way of signalling
that a socket is unbound, setting dev to NULL. This in the belief that
all code using the old way had been exterminated. That was
unfortunately not true as the xsk diagnostics code was still using the
old way and thus does not work as intended when a socket is going
down. Fix this by introducing a test against the state variable. If
the socket is in the state XSK_UNBOUND, simply abort the diagnostic's
netlink operation.
Fixes: 18b1ab7aa7 ("xsk: Fix race at socket teardown")
Reported-by: syzbot+822d1359297e2694f873@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+822d1359297e2694f873@syzkaller.appspotmail.com
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230831100119.17408-1-magnus.karlsson@gmail.com
Currently, xsk_build_skb() is a function that builds skb in two possible
ways and then is ended with common error handling.
We can distinguish four possible error paths and handling in xsk_build_skb():
1. sock_alloc_send_skb fails: Retry (skb is NULL).
2. skb_store_bits fails : Free skb and retry.
3. MAX_SKB_FRAGS exceeded: Free skb, cleanup and drop packet.
4. alloc_page fails for frag: Retry page allocation w/o freeing skb
1] and 3] can happen in xsk_build_skb_zerocopy(), which is one of the
two code paths responsible for building skb. Common error path in
xsk_build_skb() assumes that in case errno != -EAGAIN, skb is a valid
pointer, which is wrong as kernel test robot reports that in
xsk_build_skb_zerocopy() other errno values are returned for skb being
NULL.
To fix this, set -EOVERFLOW as error when MAX_SKB_FRAGS are exceeded
and packet needs to be dropped in both xsk_build_skb() and
xsk_build_skb_zerocopy() and use this to distinguish against all other
error cases. Also, add explicit kfree_skb() for 3] so that handling
of 1], 2], and 3] becomes identical where allocation needs to be retried.
Fixes: cf24f5a5fe ("xsk: add support for AF_XDP multi-buffer on Tx path")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Closes: https://lore.kernel.org/r/202307210434.OjgqFcbB-lkp@intel.com
Link: https://lore.kernel.org/bpf/20230823144713.2231808-1-tirthendu.sarkar@intel.com
Fix a refcount underflow problem reported by syzbot that can happen
when a system is running out of memory. If xp_alloc_tx_descs() fails,
and it can only fail due to not having enough memory, then the error
path is triggered. In this error path, the refcount of the pool is
decremented as it has incremented before. However, the reference to
the pool in the socket was not nulled. This means that when the socket
is closed later, the socket teardown logic will think that there is a
pool attached to the socket and try to decrease the refcount again,
leading to a refcount underflow.
I chose this fix as it involved adding just a single line. Another
option would have been to move xp_get_pool() and the assignment of
xs->pool to after the if-statement and using xs_umem->pool instead of
xs->pool in the whole if-statement resulting in somewhat simpler code,
but this would have led to much more churn in the code base perhaps
making it harder to backport.
Fixes: ba3beec2ec ("xsk: Fix possible crash when multiple sockets are created")
Reported-by: syzbot+8ada0057e69293a05fd4@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230809142843.13944-1-magnus.karlsson@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW
42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP
Tngj6Yid60o39/IZXXblhV37HfSiyQ8=
=/kVE
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2023-08-03
We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).
The main changes are:
1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
Daniel Borkmann
2) Support new insns from cpu v4 from Yonghong Song
3) Non-atomically allocate freelist during prefill from YiFei Zhu
4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu
5) Add tracepoint to xdp attaching failure from Leon Hwang
6) struct netdev_rx_queue and xdp.h reshuffling to reduce
rebuild time from Jakub Kicinski
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
net: invert the netdevice.h vs xdp.h dependency
net: move struct netdev_rx_queue out of netdevice.h
eth: add missing xdp.h includes in drivers
selftests/bpf: Add testcase for xdp attaching failure tracepoint
bpf, xdp: Add tracepoint to xdp attaching failure
selftests/bpf: fix static assert compilation issue for test_cls_*.c
bpf: fix bpf_probe_read_kernel prototype mismatch
riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
libbpf: fix typos in Makefile
tracing: bpf: use struct trace_entry in struct syscall_tp_t
bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
netfilter: bpf: Only define get_proto_defrag_hook() if necessary
bpf: Fix an array-index-out-of-bounds issue in disasm.c
net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
docs/bpf: Fix malformed documentation
bpf: selftests: Add defrag selftests
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Support not connecting client socket
netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
...
====================
Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
struct netdev_rx_queue is touched in only a few places
and having it defined in netdevice.h brings in the dependency
on xdp.h, because struct xdp_rxq_info gets embedded in
struct netdev_rx_queue.
In prep for removal of xdp.h from netdevice.h move all
the netdev_rx_queue stuff to a new header.
We could technically break the new header up to avoid
the sysfs.h include but it's so rarely included it
doesn't seem to be worth it at this point.
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
sk->sk_mark is often read while another thread could change the value.
Fixes: 4a19ec5800 ("[NET]: Introducing socket mark socket option.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify xskq_cons_read_desc_batch() in a way that each processed
descriptor will be checked if it is an EOP one or not and act
accordingly to that.
Change the behavior of mentioned function to break the processing when
stumbling upon invalid descriptor instead of skipping it. Furthermore,
let us give only full packets down to ZC driver.
With these two assumptions ZC drivers will not have to take care of an
intermediate state of incomplete frames, which will simplify its
implementations a lot.
Last but not least, stop processing when count of frags would exceed
max supported segments on underlying device.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-15-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Given that skb_shared_info relies on skb_frag_t, in order to support
xskb chaining, introduce xdp_buff_xsk::xskb_list_node and
xsk_buff_pool::xskb_list.
This is needed so ZC drivers can add frags as xskb nodes which will make
it possible to handle it both when producing AF_XDP Rx descriptors as
well as freeing/recycling all the frags that a single frame carries.
Speaking of latter, update xsk_buff_free() to take care of list nodes.
For the former (adding as frags), introduce xsk_buff_add_frag() for ZC
drivers usage that is going to be used to add a frag to xskb list from
pool.
xsk_buff_get_frag() will be utilized by XDP_TX and, on contrary, will
return xdp_buff.
One of the previous patches added a wrapper for ZC Rx so implement xskb
list walk and production of Rx descriptors there.
On bind() path, bail out if socket wants to use ZC multi-buffer but
underlying netdev does not support it.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-12-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Descriptors with zero length are not supported by many NICs. To preserve
uniform behavior discard any zero length desc as invvalid desc.
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-10-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For transmitting an AF_XDP packet, allocate skb while processing the
first desc and copy data to it. The 'XDP_PKT_CONTD' flag in 'options'
field of the desc indicates the EOP status of the packet. If the current
desc is not EOP, store the skb, release the current desc and go
on to read the next descs.
Allocate a page for each subsequent desc, copy data to it and add it as
a frag in the skb stored in xsk. On processing EOP, transmit the skb
with frags. Addresses contained in descs have been already queued in
consumer queue and skb destructor updated the completion count.
On transmit failure cancel the releases, clear the descs from the
completion queue and consume the skb for retrying packet transmission.
For any invalid descriptor (invalid length/address/options) in the middle
of a packet, all pending descriptors will be dropped by xsk core along
with the invalid one and the next descriptor is treated as the start of
a new packet.
Maximum supported frames for a packet is MAX_SKB_FRAGS + 1. If it is
exceeded, all descriptors accumulated so far are dropped.
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-9-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In Tx path, xsk core reserves space for each desc to be transmitted in
the completion queue and it's address contained in it is stored in the
skb destructor arg. After successful transmission the skb destructor
submits the addr marking completion.
To handle multiple descriptors per packet, now along with reserving
space for each descriptor, the corresponding address is also stored in
completion queue. The number of pending descriptors are stored in skb
destructor arg and is used by the skb destructor to update completions.
Introduce 'skb' in xdp_sock to store a partially built packet when
__xsk_generic_xmit() must return before it sees the EOP descriptor for
the current packet so that packet building can resume in next call of
__xsk_generic_xmit().
Helper functions are introduced to set and get the pending descriptors
in the skb destructor arg. Also, wrappers are introduced for storing
descriptor addresses, submitting and cancelling (for unsuccessful
transmissions) the number of completions.
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-7-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>