mirror of
https://github.com/torvalds/linux.git
synced 2024-11-11 14:42:24 +00:00
3055ddd654
1172483 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Andrii Nakryiko
|
3055ddd654 |
libbpf: misc internal libbpf clean ups around log fixup
Normalize internal constants, field names, and comments related to log fixup. Also add explicit `ext_idx` alias for relocation where relocation is pointing to extern description for additional information. No functional changes, just a clean up before subsequent additions. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230418002148.3255690-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Yonghong Song
|
49859de997 |
selftests/bpf: Add a selftest for checking subreg equality
Add a selftest to ensure subreg equality if source register upper 32bit is 0. Without previous patch, the test will fail verification. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230417222139.360607-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Yonghong Song
|
3be49f7955 |
bpf: Improve verifier u32 scalar equality checking
In [1], I tried to remove bpf-specific codes to prevent certain llvm optimizations, and add llvm TTI (target transform info) hooks to prevent those optimizations. During this process, I found if I enable llvm SimplifyCFG:shouldFoldTwoEntryPHINode transformation, I will hit the following verification failure with selftests: ... 8: (18) r1 = 0xffffc900001b2230 ; R1_w=map_value(off=560,ks=4,vs=564,imm=0) 10: (61) r1 = *(u32 *)(r1 +0) ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 11: (79) r2 = *(u64 *)(r6 +152) ; R2_w=scalar() R6=ctx(off=0,imm=0) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 12: (55) if r2 != 0xb9fbeef goto pc+10 ; R2_w=195018479 13: (bc) w2 = w1 ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (test < __NR_TESTS) 14: (a6) if w1 < 0x9 goto pc+1 16: R0=2 R1_w=scalar(umax=8,var_off=(0x0; 0xf)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=ctx(off=0,imm=0) R10=fp0 ; 16: (27) r2 *= 28 ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) 17: (18) r3 = 0xffffc900001b2118 ; R3_w=map_value(off=280,ks=4,vs=564,imm=0) 19: (0f) r3 += r2 ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) R3_w=map_value(off=280,ks=4,vs=564,umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) 20: (61) r2 = *(u32 *)(r3 +0) R3 unbounded memory access, make sure to bounds check any such access processed 97 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 10 mark_read 6 -- END PROG LOAD LOG -- libbpf: prog 'ingress_fwdns_prio100': failed to load: -13 libbpf: failed to load object 'test_tc_dtime' libbpf: failed to load BPF skeleton 'test_tc_dtime': -13 ... At insn 14, with condition 'w1 < 9', register r1 is changed from an arbitrary u32 value to `scalar(umax=8,var_off=(0x0; 0xf))`. 
Register r2, however, remains as an arbitrary u32 value. Current verifier won't claim r1/r2 equality if the previous mov is alu32 ('w2 = w1'). If r1 upper 32bit value is not 0, we indeed cannot claim r1/r2 equality after 'w2 = w1'. But in this particular case, we know r1 upper 32bit value is 0, so it is safe to claim r1/r2 equality. This patch does exactly that. For a 32bit subreg mov, if the src register upper 32bit is 0, it is okay to claim equality between src and dst registers. With this patch, the above verification sequence becomes ... 8: (18) r1 = 0xffffc9000048e230 ; R1_w=map_value(off=560,ks=4,vs=564,imm=0) 10: (61) r1 = *(u32 *)(r1 +0) ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 11: (79) r2 = *(u64 *)(r6 +152) ; R2_w=scalar() R6=ctx(off=0,imm=0) ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC) 12: (55) if r2 != 0xb9fbeef goto pc+10 ; R2_w=195018479 13: (bc) w2 = w1 ; R1_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) ; if (test < __NR_TESTS) 14: (a6) if w1 < 0x9 goto pc+1 ; R1_w=scalar(id=6,umin=9,umax=4294967295,var_off=(0x0; 0xffffffff)) ... from 14 to 16: R0=2 R1_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R2_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R6=ctx(off=0,imm=0) R10=fp0 16: (27) r2 *= 28 ; R2_w=scalar(umax=224,var_off=(0x0; 0xfc)) 17: (18) r3 = 0xffffc9000048e118 ; R3_w=map_value(off=280,ks=4,vs=564,imm=0) 19: (0f) r3 += r2 20: (61) r2 = *(u32 *)(r3 +0) ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R3_w=map_value(off=280,ks=4,vs=564,umax=224,var_off=(0x0; 0xfc),s32_max=252,u32_max=252) ... and eventually the bpf program can be verified successfully. [1] https://reviews.llvm.org/D147968 Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230417222134.359714-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
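The equality rule described in the commit above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `toy_reg`, `toy_mov32` and `toy_refine_lt` are hypothetical names, the model tracks nothing but an unsigned upper bound and a link id, and it is not the verifier's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy scalar register: an unsigned upper bound plus a link id.
 * Hypothetical model; the real verifier tracks far more state. */
struct toy_reg {
    uint64_t umax;   /* largest value the register may hold */
    uint32_t id;     /* 0 means "not linked to any other register" */
};

static uint32_t toy_next_id = 1;

/* Model of 'wDST = wSRC', which zeroes the upper 32 bits of dst. */
static void toy_mov32(struct toy_reg *dst, struct toy_reg *src)
{
    if (src->umax <= UINT32_MAX) {
        /* Upper half of src is known to be 0, so dst equals src. */
        if (src->id == 0)
            src->id = toy_next_id++;
        *dst = *src;
    } else {
        /* Upper half of src may be non-zero: dst is just some u32. */
        dst->umax = UINT32_MAX;
        dst->id = 0;
    }
}

/* Model of the taken branch of 'if wREG < bound': tighten the tested
 * register and every register sharing its id. */
static void toy_refine_lt(struct toy_reg *regs, int nregs, int tested,
                          uint64_t bound)
{
    uint32_t id = regs[tested].id;

    assert(bound > 0);
    /* A 32-bit compare bounds the 64-bit value only if the upper half is 0. */
    if (regs[tested].umax > UINT32_MAX)
        return;
    for (int i = 0; i < nregs; i++) {
        bool linked = (i == tested) || (id != 0 && regs[i].id == id);

        if (linked && regs[i].umax > bound - 1)
            regs[i].umax = bound - 1;
    }
}
```

With r1 bounded by a u32 load, `toy_mov32` links r2 to r1, so the later `w1 < 9` test narrows both registers to at most 8, and `r2 *= 28` stays within 224 as in the second log.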
Sean Young
|
69a8c792cd |
bpf: lirc program type should not require SYS_CAP_ADMIN
Make it possible to load lirc program type with just CAP_BPF. There is nothing exceptional about lirc programs that means they require CAP_SYS_ADMIN. In order to attach or detach a lirc program type you need permission to open /dev/lirc0; if you have permission to do that, you can alter all sorts of lirc receiving options. Changing the IR protocol decoder is no different. Right now on a typical distribution /dev/lirc devices are only read/write by root. Ideally we would make them group read/write like other devices so that local users can use them without becoming root. Signed-off-by: Sean Young <sean@mess.org> Link: https://lore.kernel.org/r/ZD0ArKpwnDBJZsrE@gofer.mess.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Daniel Borkmann
|
59e498a328 |
bpf: Set skb redirect and from_ingress info in __bpf_tx_skb
There are some use-cases where it is desirable to use bpf_redirect() in combination with ifb device, which currently is not supported, for example, around filtering inbound traffic with BPF to then push it to ifb which holds the qdisc for shaping in contrast to doing that on the egress device. Toke mentions the following case related to OpenWrt: Because there's not always a single egress on the other side. These are mainly home routers, which tend to have one or more WiFi devices bridged to one or more ethernet ports on the LAN side, and a single upstream WAN port. And the objective is to control the total amount of traffic going over the WAN link (in both directions), to deal with bufferbloat in the ISP network (which is sadly still all too prevalent). In this setup, the traffic can be split arbitrarily between the links on the LAN side, and the only "single bottleneck" is the WAN link. So we install both egress and ingress shapers on this, configured to something like 95-98% of the true link bandwidth, thus moving the queues into the qdisc layer in the router. It's usually necessary to set the ingress bandwidth shaper a bit lower than the egress due to being "downstream" of the bottleneck link, but it does work surprisingly well. We usually use something like a matchall filter to put all ingress traffic on the ifb, so doing the redirect from BPF has not been an immediate requirement thus far. However, it does seem a bit odd that this is not possible, and we do have a BPF-based filter that layers on top of this kind of setup, which currently uses u32 as the ingress filter and so it could presumably be improved to use BPF instead if that was available. 
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com> Reported-by: Yafang Shao <laoar.shao@gmail.com> Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Alexei Starovoitov
|
d40f4f6813 |
Merge branch 'Remove KF_KPTR_GET kfunc flag'
David Vernet says: ==================== We've managed to improve the UX for kptrs significantly over the last 9 months. All of the existing use cases which previously had KF_KPTR_GET kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *) have all been updated to be synchronized using RCU. In other words, their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU | KF_ACQUIRE kfuncs, with the pointers themselves also being readable from maps in an RCU read region thanks to the types being RCU safe. While KF_KPTR_GET was a logical starting point for kptrs, it's become clear that they're not the correct abstraction. KF_KPTR_GET is a flag that essentially does nothing other than enforcing that the argument to a function is a pointer to a referenced kptr map value. At first glance, that's a useful thing to guarantee to a kfunc. It gives kfuncs the ability to try and acquire a reference on that kptr without requiring the BPF prog to do something like this: struct kptr_type *in_map, *new = NULL; in_map = bpf_kptr_xchg(&map->value, NULL); if (in_map) { new = bpf_kptr_type_acquire(in_map); in_map = bpf_kptr_xchg(&map->value, in_map); if (in_map) bpf_kptr_type_release(in_map); } That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is the only alternative, it's better than nothing. However, the problem with any KF_KPTR_GET kfunc lies in the fact that it always requires some kind of synchronization in order to safely do an opportunistic acquire of the kptr in the map. This is because a BPF program running on another CPU could do a bpf_kptr_xchg() on that map value, and free the kptr after it's been read by the KF_KPTR_GET kfunc. 
For example, the now-removed bpf_task_kptr_get() kfunc did the following: struct task_struct *bpf_task_kptr_get(struct task_struct **pp) { struct task_struct *p; rcu_read_lock(); p = READ_ONCE(*pp); /* If p is non-NULL, it could still be freed by another CPU, * so we have to do an opportunistic refcount_inc_not_zero() * and return NULL if the task will be freed after the * current RCU read region. */ if (p && !refcount_inc_not_zero(&p->rcu_users)) p = NULL; rcu_read_unlock(); return p; } In other words, the kfunc uses RCU to ensure that the task remains valid after it's been peeked from the map. However, this is completely redundant with just defining a KF_RCU kfunc that itself does a refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now does. So, the question of whether KF_KPTR_GET is useful is actually, "Are there any synchronization mechanisms / safety flags that are required by certain kptrs, but which are not provided by the verifier to kfuncs?" The answer to that question today is "No", because every kptr we currently care about is RCU protected. Even if the answer ever became "yes", the proper way to support that referenced kptr type would be to add support for whatever synchronization mechanism it requires in the verifier, rather than giving kfuncs a flag that says, "Here's a pointer to a referenced kptr in a map, do whatever you need to do." With all that said -- so as to allow us to consolidate the kfunc API, and simplify the verifier, this patchset removes the KF_KPTR_GET kfunc flag. --- This is v2 of this patchset v1: https://lore.kernel.org/all/20230415103231.236063-1-void@manifault.com/ Changelog: ---------- v1 -> v2: - Fix KF_RU -> KF_RCU typo in commit summary for patch 2/3, and in cover letter (Alexei) - In order to reduce churn, don't shift all KF_* flags down by 1. We'll just fill the now-empty slot the next time we add a flag (Alexei) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
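The "opportunistic acquire" that the cover letter above refers to can be sketched in user space with C11 atomics. Everything here is an assumption made for illustration: `toy_obj`, `toy_ref_inc_not_zero`, `toy_ref_put` and `toy_acquire` are hypothetical names, and the sketch simply assumes the object's memory stays valid while it is inspected, which is the guarantee RCU provides in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_obj {
    atomic_uint refs;   /* 0 means the object is already being torn down */
    int payload;
};

/* Take a reference only if the count is still non-zero. */
static bool toy_ref_inc_not_zero(struct toy_obj *o)
{
    unsigned int old = atomic_load(&o->refs);

    while (old != 0) {
        /* On failure 'old' is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool toy_ref_put(struct toy_obj *o)
{
    unsigned int old = atomic_fetch_sub(&o->refs, 1);

    assert(old > 0);
    return old == 1;
}

/* Peek at a shared slot and try to take a reference on what it holds.
 * Assumption: the pointed-to memory cannot be reused while we look at it. */
static struct toy_obj *toy_acquire(_Atomic(struct toy_obj *) *slot)
{
    struct toy_obj *p = atomic_load(slot);

    if (p && !toy_ref_inc_not_zero(p))
        p = NULL;
    return p;
}
```

The acquire path needs no flag of its own: once the caller is guaranteed that the memory outlives the read, an increment-unless-zero is enough to decide whether a reference can still be taken.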
David Vernet
|
530474e6d0 |
bpf,docs: Remove KF_KPTR_GET from documentation
A prior patch removed KF_KPTR_GET from the kernel. Now that it's no longer accessible to kfunc authors, this patch removes it from the BPF kfunc documentation. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230416084928.326135-4-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
David Vernet
|
7b4ddf3920 |
bpf: Remove KF_KPTR_GET kfunc flag
We've managed to improve the UX for kptrs significantly over the last 9 months. All of the existing use cases which previously had KF_KPTR_GET kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *) have all been updated to be synchronized using RCU. In other words, their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU | KF_ACQUIRE kfuncs, with the pointers themselves also being readable from maps in an RCU read region thanks to the types being RCU safe. While KF_KPTR_GET was a logical starting point for kptrs, it's become clear that they're not the correct abstraction. KF_KPTR_GET is a flag that essentially does nothing other than enforcing that the argument to a function is a pointer to a referenced kptr map value. At first glance, that's a useful thing to guarantee to a kfunc. It gives kfuncs the ability to try and acquire a reference on that kptr without requiring the BPF prog to do something like this: struct kptr_type *in_map, *new = NULL; in_map = bpf_kptr_xchg(&map->value, NULL); if (in_map) { new = bpf_kptr_type_acquire(in_map); in_map = bpf_kptr_xchg(&map->value, in_map); if (in_map) bpf_kptr_type_release(in_map); } That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is the only alternative, it's better than nothing. However, the problem with any KF_KPTR_GET kfunc lies in the fact that it always requires some kind of synchronization in order to safely do an opportunistic acquire of the kptr in the map. This is because a BPF program running on another CPU could do a bpf_kptr_xchg() on that map value, and free the kptr after it's been read by the KF_KPTR_GET kfunc. 
For example, the now-removed bpf_task_kptr_get() kfunc did the following: struct task_struct *bpf_task_kptr_get(struct task_struct **pp) { struct task_struct *p; rcu_read_lock(); p = READ_ONCE(*pp); /* If p is non-NULL, it could still be freed by another CPU, * so we have to do an opportunistic refcount_inc_not_zero() * and return NULL if the task will be freed after the * current RCU read region. */ if (p && !refcount_inc_not_zero(&p->rcu_users)) p = NULL; rcu_read_unlock(); return p; } In other words, the kfunc uses RCU to ensure that the task remains valid after it's been peeked from the map. However, this is completely redundant with just defining a KF_RCU kfunc that itself does a refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now does. So, the question of whether KF_KPTR_GET is useful is actually, "Are there any synchronization mechanisms / safety flags that are required by certain kptrs, but which are not provided by the verifier to kfuncs?" The answer to that question today is "No", because every kptr we currently care about is RCU protected. Even if the answer ever became "yes", the proper way to support that referenced kptr type would be to add support for whatever synchronization mechanism it requires in the verifier, rather than giving kfuncs a flag that says, "Here's a pointer to a referenced kptr in a map, do whatever you need to do." With all that said -- so as to allow us to consolidate the kfunc API, and simplify the verifier a bit, this patch removes KF_KPTR_GET, and all relevant logic from the verifier. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230416084928.326135-3-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
David Vernet
|
09b501d905 |
bpf: Remove bpf_kfunc_call_test_kptr_get() test kfunc
We've managed to improve the UX for kptrs significantly over the last 9 months. All of the prior main use cases, struct bpf_cpumask *, struct task_struct *, and struct cgroup *, have all been updated to be synchronized mainly using RCU. In other words, their KF_ACQUIRE kfunc calls are all KF_RCU, and the pointers themselves are MEM_RCU and can be accessed in an RCU read region in BPF. In a follow-on change, we'll be removing the KF_KPTR_GET kfunc flag. This patch prepares for that by removing the bpf_kfunc_call_test_kptr_get() kfunc, and all associated selftests. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230416084928.326135-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Alexei Starovoitov
|
7a0788fe83 |
Merge branch 'Shared ownership for local kptrs'
Dave Marchevsky says:
====================
This series adds support for refcounted local kptrs to the verifier. A local
kptr is 'refcounted' if its type contains a struct bpf_refcount field:
struct refcounted_node {
long data;
struct bpf_list_node ll;
struct bpf_refcount ref;
};
bpf_refcount is used to implement shared ownership for local kptrs.
Motivating usecase
==================
If a struct has two collection node fields, e.g.:
struct node {
long key;
long val;
struct bpf_rb_node rb;
struct bpf_list_node ll;
};
It's not currently possible to add a node to both the list and rbtree:
long bpf_prog(void *ctx)
{
struct node *n = bpf_obj_new(typeof(*n));
if (!n) { /* ... */ }
bpf_spin_lock(&lock);
bpf_list_push_back(&head, &n->ll);
bpf_rbtree_add(&root, &n->rb, less); /* Assume a reasonable less() */
bpf_spin_unlock(&lock);
}
The above program will fail verification due to current owning / non-owning ref
logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
passed to bpf_rbtree_add. The only way to get an owning reference for the node
that was added is to bpf_list_pop_{front,back} it.
More generally, verifier ownership semantics expect that a node has one
owner (program, collection, or stashed in map) with exclusive ownership
of the node's lifetime. The owner frees the node's underlying memory when it
itself goes away.
Without a shared ownership concept it's impossible to express many real-world
usecases such that they pass verification.
Semantic Changes
================
Before this series, the verifier could make this statement: "whoever has the
owning reference has exclusive ownership of the referent's lifetime". As
demonstrated in the previous section, this implies that a BPF program can't
have an owning reference to some node if that node is in a collection. If
such a state were possible, the node would have multiple owners, each thinking
they have exclusive ownership. In order to support shared ownership it's
necessary to modify the exclusive ownership semantic.
After this series' changes, an owning reference has ownership of the referent's
lifetime, but it's not necessarily exclusive. The referent's underlying memory
is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
used for collection insert.
This change doesn't affect UX of owning or non-owning references much:
* insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
an owning reference arg, as ownership still must be passed to the
collection in a shared-ownership world.
* non-owning references still refer to valid memory without claiming
any ownership.
One important conclusion that followed from the "exclusive ownership" statement
is no longer valid, though. In exclusive-ownership world, if a BPF prog has
an owning reference to a node, the verifier can conclude that no collection has
ownership of it. This conclusion was used to avoid runtime checking in the
implementations of insert and remove operations ("has the node already been
{inserted, removed}?").
In a shared-ownership world the aforementioned conclusion is no longer valid,
which necessitates doing runtime checking in insert and remove operation
kfuncs, and those functions possibly failing to insert or remove anything.
Luckily the verifier changes necessary to go from exclusive to shared ownership
were fairly minimal. Patches in this series which do change verifier semantics
generally have some summary dedicated to explaining why certain usecases
Just Work for shared ownership without verifier changes.
Implementation
==============
The changes in this series can be categorized as follows:
* struct bpf_refcount opaque field + plumbing
* support for refcounted kptrs in bpf_obj_new and bpf_obj_drop
* bpf_refcount_acquire kfunc
* enables shared ownership by bumping refcount + acquiring owning ref
* support for possibly-failing collection insertion and removal
* insertion changes are more complex
If a patch's changes have some nuance to their effect - or lack of effect - on
verifier behavior, the patch summary talks about it at length.
Patch contents:
* Patch 1 removes btf_field_offs struct
* Patch 2 adds struct bpf_refcount and associated plumbing
* Patch 3 modifies semantics of bpf_obj_drop and bpf_obj_new to handle
refcounted kptrs
* Patch 4 adds bpf_refcount_acquire
* Patches 5-7 add support for possibly-failing collection insert and remove
* Patch 8 centralizes constructor-like functionality for local kptr types
* Patch 9 adds tests for new functionality
base-commit:
|
||
Dave Marchevsky
|
6147f15131 |
selftests/bpf: Add refcounted_kptr tests
Test refcounted local kptr functionality added in previous patches in the series. Usecases which pass verification: * Add refcounted local kptr to both tree and list. Then, read and - possibly, depending on test variant - delete from tree, then list. * Also test doing read-and-maybe-delete in opposite order * Stash a refcounted local kptr in a map_value, then add it to a rbtree. Read from both, possibly deleting after tree read. * Add refcounted local kptr to both tree and list. Then, try reading and deleting twice from one of the collections. * bpf_refcount_acquire of just-added non-owning ref should work, as should bpf_refcount_acquire of owning ref just out of bpf_obj_new Usecases which fail verification: * The simple successful bpf_refcount_acquire cases from above should both fail to verify if the newly-acquired owning ref is not dropped Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-10-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Dave Marchevsky
|
3e81740a90 |
bpf: Centralize btf_field-specific initialization logic
All btf_fields in an object are 0-initialized by memset in bpf_obj_init. This might not be a valid initial state for some field types, in which case kfuncs that use the type will properly initialize their input if it's been 0-initialized. Some BPF graph collection types and kfuncs do this: bpf_list_{head,node} and bpf_rb_node. An earlier patch in this series added the bpf_refcount field, for which the 0 state indicates that the refcounted object should be free'd. bpf_obj_init treats this field specially, setting refcount to 1 instead of relying on scattered "refcount is 0? Must have just been initialized, let's set to 1" logic in kfuncs. This patch extends this treatment to list and rbtree field types, allowing most scattered initialization logic in kfuncs to be removed. Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case they'll be 0-initialized without passing through the newly-added logic, so scattered initialization logic must remain for these collection root types. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-9-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Dave Marchevsky
|
404ad75a36 |
bpf: Migrate bpf_rbtree_remove to possibly fail
This patch modifies bpf_rbtree_remove to account for possible failure due to the input rb_node already not being in any collection. The function can now return NULL, and does when the aforementioned scenario occurs. As before, on successful removal an owning reference to the removed node is returned. Adding KF_RET_NULL to bpf_rbtree_remove's kfunc flags - now KF_RET_NULL | KF_ACQUIRE - provides the desired verifier semantics: * retval must be checked for NULL before use * if NULL, retval's ref_obj_id is released * retval is a "maybe acquired" owning ref, not a non-owning ref, so it will live past end of critical section (bpf_spin_unlock), and thus can be checked for NULL after the end of the CS BPF programs must add checks ============================ This does change bpf_rbtree_remove's verifier behavior. BPF program writers will need to add NULL checks to their programs, but the resulting UX looks natural: bpf_spin_lock(&glock); n = bpf_rbtree_first(&ghead); if (!n) { /* ... */} res = bpf_rbtree_remove(&ghead, &n->node); bpf_spin_unlock(&glock); if (!res) /* Newly-added check after this patch */ return 1; n = container_of(res, /* ... */); /* Do something else with n */ bpf_obj_drop(n); return 0; The "if (!res)" check above is the only addition necessary for the above program to pass verification after this patch. bpf_rbtree_remove no longer clobbers non-owning refs ==================================================== An issue arises when bpf_rbtree_remove fails, though. Consider this example: struct node_data { long key; struct bpf_list_node l; struct bpf_rb_node r; struct bpf_refcount ref; }; long failed_sum; void bpf_prog() { struct node_data *n = bpf_obj_new(/* ... */); struct bpf_rb_node *res; n->key = 10; bpf_spin_lock(&glock); bpf_list_push_back(&some_list, &n->l); /* n is now a non-owning ref */ res = bpf_rbtree_remove(&some_tree, &n->r, /* ... 
*/); if (!res) failed_sum += n->key; /* not possible */ bpf_spin_unlock(&glock); /* if (res) { do something useful and drop } ... */ } The bpf_rbtree_remove in this example will always fail. Similarly to bpf_spin_unlock, bpf_rbtree_remove is a non-owning reference invalidation point. The verifier clobbers all non-owning refs after a bpf_rbtree_remove call, so the "failed_sum += n->key" line will fail verification, and in fact there's no good way to get information about the node which failed to be removed after the invalidation. This patch removes non-owning reference invalidation from bpf_rbtree_remove to allow the above usecase to pass verification. The logic for why this is now possible is as follows: Before this series, bpf_rbtree_remove couldn't fail and thus assumed that its input, a non-owning reference, was in the tree. But it's easy to construct an example where two non-owning references pointing to the same underlying memory are acquired and passed to rbtree_remove one after another (see rbtree_api_release_aliasing in selftests/bpf/progs/rbtree_fail.c). So it was necessary to clobber non-owning refs to prevent this case and, more generally, to enforce "non-owning ref is definitely in some collection" invariant. This series removes that invariant and the failure / runtime checking added in this patch provide a clean way to deal with the aliasing issue - just fail to remove. Because the aliasing issue prevented by clobbering non-owning refs is no longer an issue, this patch removes the invalidate_non_owning_refs call from verifier handling of bpf_rbtree_remove. Note that bpf_spin_unlock - the other caller of invalidate_non_owning_refs - clobbers non-owning refs for a different reason, so its clobbering behavior remains unchanged. No BPF program changes are necessary for programs to remain valid as a result of this clobbering change. 
A valid program before this patch passed verification with its non-owning refs having shorter (or equal) lifetimes due to more aggressive clobbering. Also, update existing tests to check bpf_rbtree_remove retval for NULL where necessary, and move rbtree_api_release_aliasing from progs/rbtree_fail.c to progs/rbtree.c since it's now expected to pass verification. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-8-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
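The "just fail to remove" behaviour described in the commit above can be modelled in plain C. This is a hedged sketch, not the kernel's implementation: it uses a doubly linked list instead of an rbtree to stay short, and `toy_node`, `toy_list` and the `linked` flag are hypothetical stand-ins for whatever runtime check the real kfunc performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_node {
    struct toy_node *prev, *next;
    bool linked;              /* runtime "is this node in a collection?" */
    long key;
};

struct toy_list {
    struct toy_node head;     /* sentinel; points at itself when empty */
};

static void toy_list_init(struct toy_list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->head.linked = true;
    l->head.key = 0;
}

static void toy_list_push_back(struct toy_list *l, struct toy_node *n)
{
    assert(!n->linked);       /* failing inserts are out of scope here */
    n->prev = l->head.prev;
    n->next = &l->head;
    l->head.prev->next = n;
    l->head.prev = n;
    n->linked = true;
}

/* Returns the node on success, NULL if it was not in any collection. */
static struct toy_node *toy_list_remove(struct toy_node *n)
{
    if (!n->linked)
        return NULL;
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = NULL;
    n->linked = false;
    return n;
}
```

Two aliases of the same node can both be passed to `toy_list_remove`: the first call unlinks it, the second sees the cleared flag and returns NULL instead of corrupting the list, which mirrors why callers must now check the return value.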
Dave Marchevsky
|
de67ba3968 |
selftests/bpf: Modify linked_list tests to work with macro-ified inserts
The linked_list tests use macros and function pointers to reduce code duplication. Earlier in the series, bpf_list_push_{front,back} were modified to be macros, expanding to invoke actual kfuncs bpf_list_push_{front,back}_impl. Due to this change, a code snippet like: void (*p)(void *, void *) = (void *)&bpf_list_##op; p(hexpr, nexpr); meant to do bpf_list_push_{front,back}(hexpr, nexpr), will no longer work as it's no longer valid to do &bpf_list_push_{front,back} since they're no longer functions. This patch fixes issues of this type, along with two other minor changes - one improvement and one fix - both related to the node argument to list_push_{front,back}. * The fix: migration of list_push tests away from (void *, void *) func ptr uncovered that some tests were incorrectly passing pointer to node, not pointer to struct bpf_list_node within the node. This patch fixes such issues (CHECK(..., f) -> CHECK(..., &f->node)) * The improvement: In linked_list tests, the struct foo type has two list_node fields: node and node2, at byte offsets 0 and 40 within the struct, respectively. Currently node is used in ~all tests involving struct foo and lists. The verifier needs to do some work to account for the offset of bpf_list_node within the node type, so using node2 instead of node exercises that logic more in the tests. This patch migrates linked_list tests to use node2 instead of node. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-7-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
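Why `&bpf_list_##op` stops working once the insert helpers become macros can be shown with a small stand-alone example. The names below (`toy_push_back_impl`, `toy_push_back`, `toy_push_back_fn`, `toy_meta`) are hypothetical, and the "hidden" parameters are simply ignored here, whereas in the kernel the verifier fills them in.

```c
#include <assert.h>
#include <stddef.h>

struct toy_meta { int unused; };   /* stand-in for verifier-supplied data */

static int toy_push_calls;

/* The real function takes two extra parameters the caller never fills in. */
static int toy_push_back_impl(int *head, int *node,
                              struct toy_meta *meta, size_t off)
{
    (void)meta;
    (void)off;
    *head += *node;
    toy_push_calls++;
    return 0;
}

/* Convenience macro: hides the extra arguments by passing zeros. */
#define toy_push_back(head, node) toy_push_back_impl(head, node, NULL, 0)

/* '&toy_push_back' would not compile, because a macro has no address.
 * A small wrapper function gives callers something they can point at. */
static int toy_push_back_fn(int *head, int *node)
{
    return toy_push_back(head, node);
}
```

Code that previously stored the helper in a two-argument function pointer has to target a wrapper like `toy_push_back_fn` (or call the macro directly), since only the `_impl` function is an addressable symbol.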
Dave Marchevsky
|
d2dcc67df9 |
bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail
Consider this code snippet: struct node { long key; bpf_list_node l; bpf_rb_node r; bpf_refcount ref; } int some_bpf_prog(void *ctx) { struct node *n = bpf_obj_new(/*...*/), *m; bpf_spin_lock(&glock); bpf_rbtree_add(&some_tree, &n->r, /* ... */); m = bpf_refcount_acquire(n); bpf_rbtree_add(&other_tree, &m->r, /* ... */); bpf_spin_unlock(&glock); /* ... */ } After bpf_refcount_acquire, n and m point to the same underlying memory, and that node's bpf_rb_node field is being used by the some_tree insert, so overwriting it as a result of the second insert is an error. In order to properly support refcounted nodes, the rbtree and list insert functions must be allowed to fail. This patch adds such support. The kfuncs bpf_rbtree_add, bpf_list_push_{front,back} are modified to return an int indicating success/failure, with 0 -> success, nonzero -> failure. bpf_obj_drop on failure ======================= Currently the only reason an insert can fail is the example above: the bpf_{list,rb}_node is already in use. When such a failure occurs, the insert kfuncs will bpf_obj_drop the input node. This allows the insert operations to logically fail without changing their verifier owning ref behavior, namely the unconditional release_reference of the input owning ref. With insert that always succeeds, ownership of the node is always passed to the collection, since the node always ends up in the collection. With a possibly-failed insert w/ bpf_obj_drop, ownership of the node is always passed either to the collection (success), or to bpf_obj_drop (failure). Regardless, it's correct to continue unconditionally releasing the input owning ref, as something is always taking ownership from the calling program on insert. Keeping owning ref behavior unchanged results in a nice default UX for insert functions that can fail. 
If the program's reaction to a failed insert is "fine, just get rid of this owning ref for me and let me go on with my business", then there's no reason to check for failure since that's default behavior. e.g.: long important_failures = 0; int some_bpf_prog(void *ctx) { struct node *n, *m, *o; /* all bpf_obj_new'd */ bpf_spin_lock(&glock); bpf_rbtree_add(&some_tree, &n->node, /* ... */); bpf_rbtree_add(&some_tree, &m->node, /* ... */); if (bpf_rbtree_add(&some_tree, &o->node, /* ... */)) { important_failures++; } bpf_spin_unlock(&glock); } If we instead chose to pass ownership back to the program on failed insert - by returning NULL on success or an owning ref on failure - programs would always have to do something with the returned ref on failure. The most likely action is probably "I'll just get rid of this owning ref and go about my business", which ideally would look like: if (n = bpf_rbtree_add(&some_tree, &n->node, /* ... */)) bpf_obj_drop(n); But bpf_obj_drop isn't allowed in a critical section and inserts must occur within one, so in reality error handling would become a hard-to-parse mess. For refcounted nodes, we can replicate the "pass ownership back to program on failure" logic with this patch's semantics, albeit in an ugly way: struct node *n = bpf_obj_new(/* ... */), *m; bpf_spin_lock(&glock); m = bpf_refcount_acquire(n); if (bpf_rbtree_add(&some_tree, &n->node, /* ... */)) { /* Do something with m */ } bpf_spin_unlock(&glock); bpf_obj_drop(m); bpf_refcount_acquire is used to simulate "return owning ref on failure". This should be an uncommon occurrence, though. Addition of two verifier-fixup'd args to collection inserts =========================================================== The actual bpf_obj_drop kfunc is bpf_obj_drop_impl(void *, struct btf_struct_meta *), with bpf_obj_drop macro populating the second arg with 0 and the verifier later filling in the arg during insn fixup. 
Because bpf_rbtree_add and bpf_list_push_{front,back} now might do bpf_obj_drop, these kfuncs need a btf_struct_meta parameter that can be passed to bpf_obj_drop_impl. Similarly, because the 'node' param to those insert functions is the bpf_{list,rb}_node within the node type, and bpf_obj_drop expects a pointer to the beginning of the node, the insert functions need to be able to find the beginning of the node struct. A second verifier-populated param is necessary: the offset of {list,rb}_node within the node type. These two new params allow the insert kfuncs to correctly call __bpf_obj_drop_impl: beginning_of_node = bpf_rb_node_ptr - offset; if (already_inserted) __bpf_obj_drop_impl(beginning_of_node, btf_struct_meta->record); Similarly to other kfuncs with "hidden" verifier-populated params, the insert functions are renamed with an _impl suffix and a macro is provided for common usage. For example, bpf_rbtree_add kfunc is now bpf_rbtree_add_impl and bpf_rbtree_add is now a macro which sets "hidden" args to 0. Due to the two new args BPF progs will need to be recompiled to work with the new _impl kfuncs. This patch also rewrites the "hidden argument" explanation to more directly say why the BPF program writer doesn't need to populate the arguments with anything meaningful. How does this new logic affect non-owning references? ===================================================== Currently, non-owning refs are valid until the end of the critical section in which they're created. We can make this guarantee because, if a non-owning ref exists, the referent was added to some collection. The collection will drop() its nodes when it goes away, but it can't go away while our program is accessing it, so that's not a problem. If the referent is removed from the collection in the same CS that it was added in, it can't be bpf_obj_drop'd until after CS end.
Those are the only two ways to free the referent's memory and neither can happen until after the non-owning ref's lifetime ends. At first glance, having these collection insert functions potentially bpf_obj_drop their input seems like it breaks the "can't be bpf_obj_drop'd until after CS end" line of reasoning. But we care about the memory not being _freed_ until CS end, and a previous patch in the series modified bpf_obj_drop such that it doesn't free refcounted nodes until refcount == 0. So the statement can be more accurately rewritten as "can't be free'd until after CS end". We can prove that this rewritten statement holds for any non-owning reference produced by collection insert functions: * If the input to the insert function is _not_ refcounted * We have an owning reference to the input, and can conclude it isn't in any collection * Inserting a node in a collection turns owning refs into non-owning, and since our input type isn't refcounted, there's no way to obtain additional owning refs to the same underlying memory * Because our node isn't in any collection, the insert operation cannot fail, so bpf_obj_drop will not execute * If bpf_obj_drop is guaranteed not to execute, there's no risk of memory being free'd * Otherwise, the input to the insert function is refcounted * If the insert operation fails due to the node's list_head or rb_root already being in some collection, there was some previous successful insert which passed refcount to the collection * We have an owning reference to the input; it must have been acquired via bpf_refcount_acquire, which bumped the refcount * refcount must be >= 2 since there's a valid owning reference and the node is already in a collection * Insert triggering bpf_obj_drop will decr refcount to >= 1, never resulting in a free So although we may do bpf_obj_drop during the critical section, this will never result in memory being free'd, and no changes to non-owning ref logic are needed in this patch.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-6-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
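The failing-insert rule described in the commit message above can be modelled in plain userspace C. This is only a sketch under stated assumptions: every `model_*` name is hypothetical, the `in_tree` flag stands in for a real tree, and nothing here is the kernel's implementation. It shows the two points the message argues: the start of the node is recovered from the embedded field pointer minus its offset, and a failed insert drops a reference without freeing memory, because the collection still holds one.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins; none of these are the kernel's types or kfuncs. */
struct model_rb_node { int in_tree; };

struct node {
    long key;
    int refcount;               /* stands in for bpf_refcount */
    struct model_rb_node r;
};

static int frees;               /* counts how many times memory was released */

/* Drop one owning reference; free only when the count reaches zero. */
static void model_obj_drop(struct node *n)
{
    if (--n->refcount == 0) {
        free(n);
        frees++;
    }
}

/* Allocate a node that starts with a single owning reference. */
static struct node *model_obj_new(long key)
{
    struct node *n = calloc(1, sizeof(*n));

    if (!n)
        return NULL;
    n->key = key;
    n->refcount = 1;
    return n;
}

/* Hand out a second owning reference to the same memory. */
static struct node *model_refcount_acquire(struct node *n)
{
    n->refcount++;
    return n;
}

/*
 * Insert that takes the embedded field plus its offset in the node type,
 * mirroring the two hidden arguments the message describes. Returns 0 on
 * success. On failure the caller's owning reference is dropped.
 */
static int model_rbtree_add(struct model_rb_node *field, size_t off)
{
    struct node *beginning_of_node = (struct node *)((char *)field - off);

    if (field->in_tree) {
        model_obj_drop(beginning_of_node);
        return -1;
    }
    field->in_tree = 1;         /* the collection now owns this reference */
    return 0;
}
```

Calling `model_rbtree_add(&n->r, offsetof(struct node, r))` twice on the same memory, once through `n` and once through a reference from `model_refcount_acquire(n)`, makes the second call fail while leaving the refcount at 1 and `frees` at 0.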
Dave Marchevsky
|
7c50b1cb76 |
bpf: Add bpf_refcount_acquire kfunc
Currently, BPF programs can interact with the lifetime of refcounted local kptrs in the following ways: bpf_obj_new - Initialize refcount to 1 as part of new object creation bpf_obj_drop - Decrement refcount and free object if it's 0 collection add - Pass ownership to the collection. No change to refcount but collection is responsible for bpf_obj_dropping it In order to be able to add a refcounted local kptr to multiple collections we need to be able to increment the refcount and acquire a new owning reference. This patch adds a kfunc, bpf_refcount_acquire, implementing such an operation. bpf_refcount_acquire takes a refcounted local kptr and returns a new owning reference to the same underlying memory as the input. The input can be either owning or non-owning. To reinforce why this is safe, consider the following code snippets: struct node *n = bpf_obj_new(typeof(*n)); // A struct node *m = bpf_refcount_acquire(n); // B In the above snippet, n will be alive with refcount=1 after (A), and since nothing changes that state before (B), it's obviously safe. If n is instead added to some rbtree, we can still safely refcount_acquire it: struct node *n = bpf_obj_new(typeof(*n)); struct node *m; bpf_spin_lock(&glock); bpf_rbtree_add(&groot, &n->node, less); // A m = bpf_refcount_acquire(n); // B bpf_spin_unlock(&glock); In the above snippet, after (A) n is a non-owning reference, and after (B) m is an owning reference pointing to the same memory as n. Although n has no ownership of that memory's lifetime, it's guaranteed to be alive until the end of the critical section, and n would be clobbered if we were past the end of the critical section, so it's safe to bump refcount. Implementation details: * From verifier's perspective, bpf_refcount_acquire handling is similar to bpf_obj_new and bpf_obj_drop. Like the former, it returns a new owning reference matching input type, although like the latter, type can be inferred from concrete kptr input. 
Verifier changes in {check,fixup}_kfunc_call and check_kfunc_args are largely copied from aforementioned functions' verifier changes. * An exception to the above is the new KF_ARG_PTR_TO_REFCOUNTED_KPTR arg, indicated by new "__refcounted_kptr" kfunc arg suffix. This is necessary in order to handle both owning and non-owning input without adding special-casing to "__alloc" arg handling. Also a convenient place to confirm that input type has bpf_refcount field. * The implemented kfunc is actually bpf_refcount_acquire_impl, with 'hidden' second arg that the verifier sets to the type's struct_meta in fixup_kfunc_call. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
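The "_impl function plus wrapper macro" convention mentioned in the implementation details above can be illustrated with a small userspace sketch. All names here (`model_acquire_impl`, `model_acquire`, `struct counted`) are hypothetical; in the kernel the verifier fills in the hidden argument during fixup, whereas this model simply ignores it.

```c
#include <stddef.h>

/* Minimal refcounted object for the model. */
struct counted { int refcount; };

/* Model of an "_impl" function: the second parameter is the hidden one. */
static struct counted *model_acquire_impl(struct counted *p, void *meta__ign)
{
    (void)meta__ign;    /* a verifier would rewrite this; the model ignores it */
    p->refcount++;
    return p;           /* same memory, one more owning reference */
}

/* Wrapper macro so callers never spell out the hidden argument. */
#define model_acquire(p) model_acquire_impl((p), NULL)
```

The macro keeps program source unchanged if the hidden parameter list grows, at the cost that programs must be recompiled against the new `_impl` signature.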
Dave Marchevsky
|
1512217c47 |
bpf: Support refcounted local kptrs in existing semantics
A local kptr is considered 'refcounted' when it is of a type that has a bpf_refcount field. When such a kptr is created, its refcount should be initialized to 1; when destroyed, the object should be free'd only if a refcount decr results in 0 refcount. Existing logic always frees the underlying memory when destroying a local kptr, and 0-initializes all btf_record fields. This patch adds checks for "is local kptr refcounted?" and new logic for that case in the appropriate places. This patch focuses on changing existing semantics and thus conspicuously does _not_ provide a way for BPF programs to increment refcount. That follows later in the series. __bpf_obj_drop_impl is modified to do the right thing when it sees a refcounted type. Container types for graph nodes (list, tree, stashed in map) are migrated to use __bpf_obj_drop_impl as a destructor for their nodes instead of each having custom destruction code in their _free paths. Now that "drop" isn't a synonym for "free" when the type is refcounted it makes sense to centralize this logic. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Dave Marchevsky
|
d54730b50b |
bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types meant for use in BPF programs. Similarly to other opaque types like bpf_spin_lock and bpf_rb_node, the verifier needs to know where in user-defined struct types a bpf_refcount can be located, so necessary btf_record plumbing is added to enable this. bpf_refcount is sized to hold a refcount_t. Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in btf_record as refcount_off in addition to being in the field array. Caching refcount_off makes sense for this field because further patches in the series will modify functions that take local kptrs (e.g. bpf_obj_drop) to change their behavior if the type they're operating on is refcounted. So enabling fast "is this type refcounted?" checks is desirable. No such verifier behavior changes are introduced in this patch, just logic to recognize 'struct bpf_refcount' in btf_record. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Dave Marchevsky
|
cd2a807901 |
bpf: Remove btf_field_offs, use btf_record's fields instead
The btf_field_offs struct contains (offset, size) for btf_record fields, sorted by offset. btf_field_offs is always used in conjunction with btf_record, which has btf_field 'fields' array with (offset, type), the latter of which btf_field_offs' size is derived from via btf_field_type_size. This patch adds a size field to struct btf_field and sorts btf_record's fields by offset, making it possible to get rid of btf_field_offs. Less data duplication and less code complexity results. Since btf_field_offs' lifetime closely followed the btf_record used to populate it, most complexity wins are from removal of initialization code like: if (btf_record_successfully_initialized) { foffs = btf_parse_field_offs(rec); if (IS_ERR_OR_NULL(foffs)) // free the btf_record and return err } Other changes in this patch are pretty mechanical: * foffs->field_off[i] -> rec->fields[i].offset * foffs->field_sz[i] -> rec->fields[i].size * Sort rec->fields in btf_parse_fields before returning * It's possible that this is necessary independently of other changes in this patch. btf_record_find in syscall.c expects btf_record's fields to be sorted by offset, yet there's no explicit sorting of them before this patch; the record's fields are populated in the order they're read from the BTF struct definition. BTF docs don't say anything about the sortedness of struct fields. * All functions taking struct btf_field_offs * input now instead take struct btf_record *. All callsites of these functions already have access to the correct btf_record. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
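The reason the message above insists on sorting can be shown with a short userspace sketch: a lookup that binary-searches by offset only works once the array is ordered by offset. `struct model_field` and the `model_*` helpers are hypothetical simplifications, not the kernel's btf_field or btf_record_find.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for a record field: just an offset and a size. */
struct model_field {
    unsigned int offset;
    unsigned int size;
};

/* Order fields by ascending offset. */
static int cmp_offset(const void *a, const void *b)
{
    const struct model_field *fa = a, *fb = b;

    return (fa->offset > fb->offset) - (fa->offset < fb->offset);
}

/* Sort once, right after the fields have been parsed. */
static void model_sort_fields(struct model_field *f, size_t cnt)
{
    qsort(f, cnt, sizeof(*f), cmp_offset);
}

/* Binary search by offset; only valid once the array is sorted. */
static const struct model_field *
model_find_field(const struct model_field *f, size_t cnt, unsigned int offset)
{
    struct model_field key = { .offset = offset, .size = 0 };

    return bsearch(&key, f, cnt, sizeof(*f), cmp_offset);
}
```

Keeping the size next to the offset in the same element is what lets a single sorted array replace the separate (offset, size) table.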
Rong Tao
|
4a1e885c6d |
samples/bpf: sampleip: Replace PAGE_OFFSET with _text address
The PAGE_OFFSET macro (0xffff880000000000) in sampleip_user.c is inaccurate. For example, on aarch64 the value depends on the CONFIG_ARM64_VA_BITS build option, which defaults to 48 and gives a PAGE_OFFSET of 0xffff800000000000. With the value hardcoded in sampleip_user.c, kernel addresses sampled by sampleip are wrongly reported as (user): $ sudo ./sampleip 1 Sampling at 99 Hertz for 1 seconds. Ctrl-C also ends. ADDR KSYM COUNT 0xffff80000810ceb8 (user) 1 0xffffb28ec880 (user) 1 0xffff8000080c82b8 (user) 1 0xffffb23fed24 (user) 1 0xffffb28944fc (user) 1 0xffff8000084628bc (user) 1 0xffffb2a935c0 (user) 1 0xffff80000844677c (user) 1 0xffff80000857a3a4 (user) 1 ... A few example addresses on aarch64 with CONFIG_ARM64_VA_BITS=48: $ sudo head /proc/kallsyms ffff8000080a0000 T _text ffff8000080b0000 t gic_handle_irq ffff8000080b0000 T _stext ffff8000080b0000 T __irqentry_text_start ffff8000080b00b0 t gic_handle_irq ffff8000080b0230 t gic_handle_irq ffff8000080b03b4 T __irqentry_text_end ffff8000080b03b8 T __softirqentry_text_start ffff8000080b03c0 T __do_softirq ffff8000080b0718 T __entry_text_start We just need to replace PAGE_OFFSET with the address of _text from /proc/kallsyms to solve this problem: $ sudo ./sampleip 1 Sampling at 99 Hertz for 1 seconds. Ctrl-C also ends. ADDR KSYM COUNT 0xffffb2892ab0 (user) 1 0xffffb2b1edfc (user) 1 0xffff800008462834 __arm64_sys_ppoll 1 0xffff8000084b87f4 eventfd_read 1 0xffffb28e6788 (user) 1 0xffff8000081e96d8 rcu_all_qs 1 0xffffb2ada878 (user) 1 ... Signed-off-by: Rong Tao <rongtao@cestc.cn> Link: https://lore.kernel.org/r/tencent_A0E82E0BEE925285F8156D540731DF805F05@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
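The approach described above, using the address of _text as the boundary between user and kernel addresses, can be sketched as follows. This is a hedged illustration: `model_find_ksym` and `model_is_kernel_addr` are hypothetical helpers that parse kallsyms-formatted text from a buffer, and the sample itself may obtain the address differently.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan kallsyms-formatted text ("<hex addr> <type> <name>" per line) and
 * return the address of the symbol called name, or 0 if it is absent.
 */
static unsigned long long model_find_ksym(const char *buf, const char *name)
{
    while (*buf) {
        unsigned long long addr;
        char type;
        char sym[256];

        if (sscanf(buf, "%llx %c %255s", &addr, &type, sym) == 3 &&
            strcmp(sym, name) == 0)
            return addr;

        buf = strchr(buf, '\n');    /* advance to the next line */
        if (!buf)
            break;
        buf++;
    }
    return 0;
}

/* An address is treated as kernel text when it is at or above _text. */
static int model_is_kernel_addr(unsigned long long addr,
                                unsigned long long text)
{
    return addr >= text;
}
```

With the addresses from the listing above, 0xffff800008462834 is classified as a kernel address and 0xffffb2892ab0 as a user address once `_text` resolves to 0xffff8000080a0000.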
Ilya Leoshkevich
|
1cf3bfc60f |
bpf: Support 64-bit pointers to kfuncs
test_ksyms_module fails to emit a kfunc call targeting a module on s390x, because the verifier stores the difference between kfunc address and __bpf_call_base in bpf_insn.imm, which is s32, and modules are roughly (1 << 42) bytes away from the kernel on s390x. Fix by keeping BTF id in bpf_insn.imm for BPF_PSEUDO_KFUNC_CALLs, and storing the absolute address in bpf_kfunc_desc. Introduce bpf_jit_supports_far_kfunc_call() in order to limit this new behavior to the s390x JIT. Otherwise other JITs need to be modified, which is not desired. Introduce bpf_get_kfunc_addr() instead of exposing both find_kfunc_desc() and struct bpf_kfunc_desc. In addition to sorting kfuncs by imm, also sort them by offset, in order to handle conflicting imms from different modules. Do this on all architectures in order to simplify code. Factor out resolving specialized kfuncs (XDP and dynptr) from fixup_kfunc_call(). This was required in the first place, because fixup_kfunc_call() uses find_kfunc_desc(), which returns a const pointer, so it's not possible to modify kfunc addr without stripping const, which is not nice. It also removes repetition of code like: if (bpf_jit_supports_far_kfunc_call()) desc->addr = func; else insn->imm = BPF_CALL_IMM(func); and separates kfunc_desc_tab fixups from kfunc_call fixups. Suggested-by: Jiri Olsa <olsajiri@gmail.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230412230632.885985-1-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
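The arithmetic behind the failure described above is easy to check in isolation: a displacement of about 1 << 42 bytes cannot be represented in a signed 32-bit immediate. The helpers below are a hypothetical sketch of that range check only, not the verifier's code.

```c
#include <stdint.h>

/* True when a signed 64-bit displacement can be stored in a 32-bit imm. */
static int model_fits_s32(int64_t delta)
{
    return delta >= INT32_MIN && delta <= INT32_MAX;
}

/* Displacement of a call target from a base address, both as integers. */
static int64_t model_call_delta(uint64_t target, uint64_t base)
{
    return (int64_t)(target - base);
}
```

A target 1 << 42 bytes above the base fails `model_fits_s32`, which is why the message stores a BTF id in the immediate and keeps the absolute address elsewhere.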
Yafang
|
c11bd04648 |
bpf: Add preempt_count_{sub,add} into btf id deny list
The recursion check in __bpf_prog_enter* and __bpf_prog_exit* leaves preempt_count_{sub,add} unprotected. When attaching a trampoline to them, we get a panic as follows: [ 867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf (stack is 0000000046a46a15..00000000537e7b28) [ 867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI [ 867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4 [ 867.843100] Call Trace: [ 867.843101] <TASK> [ 867.843104] asm_exc_int3+0x3a/0x40 [ 867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0 [ 867.843135] __bpf_prog_enter_recur+0x17/0x90 [ 867.843148] bpf_trampoline_6442468108_0+0x2e/0x1000 [ 867.843154] ? preempt_count_sub+0x1/0xa0 [ 867.843157] preempt_count_sub+0x5/0xa0 [ 867.843159] ? migrate_enable+0xac/0xf0 [ 867.843164] __bpf_prog_exit_recur+0x2d/0x40 [ 867.843168] bpf_trampoline_6442468108_0+0x55/0x1000 ... [ 867.843788] preempt_count_sub+0x5/0xa0 [ 867.843793] ? migrate_enable+0xac/0xf0 [ 867.843829] __bpf_prog_exit_recur+0x2d/0x40 [ 867.843837] BUG: IRQ stack guard page was hit at 0000000099bd8228 (stack is 00000000b23e2bc4..000000006d95af35) [ 867.843841] BUG: IRQ stack guard page was hit at 000000005ae07924 (stack is 00000000ffd69623..0000000014eb594c) [ 867.843843] BUG: IRQ stack guard page was hit at 00000000028320f0 (stack is 00000000034b6438..0000000078d1bcec) [ 867.843842] bpf_trampoline_6442468108_0+0x55/0x1000 ... That is because in __bpf_prog_exit_recur, preempt_count_{sub,add} are called after prog->active is decreased. Fix this by adding these two functions to the btf ids deny list. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yafang <laoar.shao@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jiri Olsa <olsajiri@gmail.com> Acked-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20230413025248.79764-1-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
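The recursion described above can be reproduced in a toy userspace model. Everything here is hypothetical: `model_helper` stands in for preempt_count_sub, `model_trampoline` for the enter/exit pair, and `MODEL_MAX_DEPTH` for the finite stack. The point it illustrates is only the ordering problem: because the recursion guard is released before the hooked helper is called, each nested call sees the guard as free and goes one level deeper.

```c
/* Userspace model; names are hypothetical and no kernel code is involved. */
#define MODEL_MAX_DEPTH 64   /* stands in for the finite kernel stack */

static int active;       /* per-program recursion guard */
static int depth;        /* current nesting of trampoline calls */
static int max_depth;    /* deepest nesting observed */
static int hooked;       /* 1 when a trampoline is attached to the helper */

static void model_trampoline(void);

/* Stands in for the hooked helper: runs the trampoline when attached. */
static void model_helper(void)
{
    if (hooked && depth < MODEL_MAX_DEPTH)
        model_trampoline();
}

static void model_trampoline(void)
{
    depth++;
    if (depth > max_depth)
        max_depth = depth;

    if (++active == 1) {
        /* the attached program would run here */
    }
    active--;            /* the guard is released first ... */
    model_helper();      /* ... then the hooked helper is called */

    depth--;
}
```

With `hooked` set, a single call to `model_helper()` nests until the depth cap is hit; refusing the attachment in the first place, as a deny list does, keeps `max_depth` at 0.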
Alexei Starovoitov
|
75860b5201 |
selftests/bpf: Workaround for older vm_sockets.h.
Some distros ship an older vm_sockets.h that lacks VMADDR_CID_LOCAL, which causes the selftests build to fail: /tmp/work/bpf/bpf/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c:261:18: error: ‘VMADDR_CID_LOCAL’ undeclared (first use in this function); did you mean ‘VMADDR_CID_HOST’? 261 | addr->svm_cid = VMADDR_CID_LOCAL; | ^~~~~~~~~~~~~~~~ | VMADDR_CID_HOST Work around this issue by defining it on demand. Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
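"Defining it on demand" usually means a preprocessor guard like the one sketched below. The value 1 is an assumption about what newer headers define, and `model_local_cid` is a hypothetical helper added only so the fragment can be exercised; the header include that would normally precede the guard is left out so the fragment compiles on its own.

```c
/*
 * In a real build the system header would be included first; the guard
 * then only takes effect when that header is too old to provide the macro.
 */
#ifndef VMADDR_CID_LOCAL
#define VMADDR_CID_LOCAL 1   /* assumed to match the upstream header value */
#endif

/* Hypothetical helper showing the macro is usable either way. */
static unsigned int model_local_cid(void)
{
    return VMADDR_CID_LOCAL;
}
```

Because the guard is a no-op when the macro already exists, the same source builds against both old and new headers.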
Alexei Starovoitov
|
c04135ab35 |
selftests/bpf: Fix merge conflict due to SYS() macro change.
Fix merge conflict between bpf/bpf-next trees due to change of arguments in SYS() macro. Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Jakub Kicinski
|
c2865b1122 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZDhSiwAKCRDbK58LschI g8cbAQCH4xrquOeDmYyGXFQGchHZAIj++tKg8ABU4+hYeJtrlwEA6D4W6wjoSZRk mLSptZ9qro8yZA86BvyPvlBT1h9ELQA= =StAc -----END PGP SIGNATURE----- Daniel Borkmann says: ==================== pull-request: bpf-next 2023-04-13 We've added 260 non-merge commits during the last 36 day(s) which contain a total of 356 files changed, 21786 insertions(+), 11275 deletions(-). The main changes are: 1) Rework BPF verifier log behavior and implement it as a rotating log by default with the option to retain old-style fixed log behavior, from Andrii Nakryiko. 2) Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params, from Christian Ehrig. 3) Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton, from Alexei Starovoitov. 4) Optimize hashmap lookups when key size is multiple of 4, from Anton Protopopov. 5) Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps, from David Vernet. 6) Add support for stashing local BPF kptr into a map value via bpf_kptr_xchg(). This is useful e.g. for rbtree node creation for new cgroups, from Dave Marchevsky. 7) Fix BTF handling of is_int_ptr to skip modifiers to work around tracing issues where a program cannot be attached, from Feng Zhou. 8) Migrate a big portion of test_verifier unit tests over to test_progs -a verifier_* via inline asm to ease {read,debug}ability, from Eduard Zingerman. 9) Several updates to the instruction-set.rst documentation which is subject to future IETF standardization (https://lwn.net/Articles/926882/), from Dave Thaler. 10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register known bits information propagation, from Daniel Borkmann. 
11) Add skb bitfield compaction work related to BPF with the overall goal to make more of the sk_buff bits optional, from Jakub Kicinski. 12) BPF selftest cleanups for build id extraction which stand on their own from the upcoming integration work of build id into struct file object, from Jiri Olsa. 13) Add fixes and optimizations for xsk descriptor validation and several selftest improvements for xsk sockets, from Kal Conley. 14) Add BPF links for struct_ops and enable switching implementations of BPF TCP cong-ctls under a given name by replacing backing struct_ops map, from Kui-Feng Lee. 15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable offset stack read as earlier Spectre checks cover this, from Luis Gerhorst. 16) Fix issues in copy_from_user_nofault() for BPF and other tracers to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner and Alexei Starovoitov. 17) Add --json-summary option to test_progs in order for CI tooling to ease parsing of test results, from Manu Bretelle. 18) Batch of improvements and refactoring to prep for upcoming bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator, from Martin KaFai Lau. 19) Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations, from Quentin Monnet. 20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting the module name from BTF of the target and searching kallsyms of the correct module, from Viktor Malik. 21) Improve BPF verifier handling of '<const> <cond> <non_const>' to better detect whether in particular jmp32 branches are taken, from Yonghong Song. 22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock. A built-in cc or one from a kernel module is already able to write to app_limited, from Yixin Shen. Conflicts: Documentation/bpf/bpf_devel_QA.rst |
||
Jakub Kicinski
|
800e68c44f |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts: tools/testing/selftests/net/config |
||
Linus Torvalds
|
829cca4d17 |
Including fixes from bpf, and bluetooth.
Not all that quiet given spring celebrations, but "current" fixes are thinning out, which is encouraging. One outstanding regression in the mlx5 driver when using old FW, not blocking but we're pushing for a fix. Current release - new code bugs: - eth: enetc: workaround for unresponsive pMAC after receiving express traffic Previous releases - regressions: - rtnetlink: restore RTM_NEW/DELLINK notification behavior, keep the pid/seq fields 0 for backward compatibility Previous releases - always broken: - sctp: fix a potential overflow in sctp_ifwdtsn_skip - mptcp: - use mptcp_schedule_work instead of open-coding it and make the worker check stricter, to avoid scheduling work on closed sockets - fix NULL pointer dereference on fastopen early fallback - skbuff: fix memory corruption due to a race between skb coalescing and releasing clones confusing page_pool reference counting - bonding: fix neighbor solicitation validation on backup slaves - bpf: tcp: use sock_gen_put instead of sock_put in bpf_iter_tcp - bpf: arm64: fixed a BTI error on returning to patched function - openvswitch: fix race on port output leading to inf loop - sfp: initialize sfp->i2c_block_size at sfp allocation to avoid returning a different errno than expected - phy: nxp-c45-tja11xx: unregister PTP, purge queues on remove - Bluetooth: fix printing errors if LE Connection times out - Bluetooth: assorted UaF, deadlock and data race fixes - eth: macb: fix memory corruption in extended buffer descriptor mode Misc: - adjust the XDP Rx flow hash API to also include the protocol layers over which the hash was computed Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmQ4ZrsACgkQMUZtbf5S IruWUQ/9F+HlnHf3Sv08zGlnV++vLaJ/Ld8C2YNYNxRwuoJvcCyikQ/ZfUKdKAoS kCf0XqFD2SMl8wHpCQBhK4uXvKBdBMx6L6wEp7dbDciGl/+5yihe9opBBXKekWbB ULRZcZE7RACri/QsXQhD7Y8p530xnYWQXO8ZMjR3vOAWxplJtBDNDnXi7hqtxQpW 
Vzwl1ntvD1msmxhvy0UZrgeesL8R3UckFvqYEqnINeyd8E8HB1dAOg899/ZPUbjA UoEw5VsSBSr9DE7+Fs6uD8trBxQ1CrdRVJjhRhwivHk8/Ro7dIzjcVV30ei3wucz 0RiNvCqypkeLeRrcVlSk8lR5r9FBGvhDMFbcGM8lHnxSc0WB+Sj+2iup4fpTE8/p VUIvhhzuBuXU4Sc022pm6BL5DBSKdnJRquFq6XCTwnQM6v7fvzu1yWNXsQom8Nbi 1/ZcFdj27FHwMpU0JPZ4PFxT7Ta830UyulVZuyWA+zEzlN7DvW3O7bGQC72GEuID Xc58D4kVtywzbntFmUjuhXCD/6vvD5WW5orLpMWg5SH9F14prt3C9OFSpTwTTq+W uPBEslhnhhCPecTNo2iFPLX3bN67n8KDVUWua1AHaqmcK7QFGs0njJGGLpFdHMll SuNgrNrtQE9EHH8XL6VbSD2zf35ZfoKVg6qvL3oeLzXkGkZrnls= =W+J2 -----END PGP SIGNATURE----- Merge tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, and bluetooth. Not all that quiet given spring celebrations, but "current" fixes are thinning out, which is encouraging. One outstanding regression in the mlx5 driver when using old FW, not blocking but we're pushing for a fix. Current release - new code bugs: - eth: enetc: workaround for unresponsive pMAC after receiving express traffic Previous releases - regressions: - rtnetlink: restore RTM_NEW/DELLINK notification behavior, keep the pid/seq fields 0 for backward compatibility Previous releases - always broken: - sctp: fix a potential overflow in sctp_ifwdtsn_skip - mptcp: - use mptcp_schedule_work instead of open-coding it and make the worker check stricter, to avoid scheduling work on closed sockets - fix NULL pointer dereference on fastopen early fallback - skbuff: fix memory corruption due to a race between skb coalescing and releasing clones confusing page_pool reference counting - bonding: fix neighbor solicitation validation on backup slaves - bpf: tcp: use sock_gen_put instead of sock_put in bpf_iter_tcp - bpf: arm64: fixed a BTI error on returning to patched function - openvswitch: fix race on port output leading to inf loop - sfp: initialize sfp->i2c_block_size at sfp allocation to avoid returning a different errno than expected - phy: nxp-c45-tja11xx: unregister PTP, 
purge queues on remove - Bluetooth: fix printing errors if LE Connection times out - Bluetooth: assorted UaF, deadlock and data race fixes - eth: macb: fix memory corruption in extended buffer descriptor mode Misc: - adjust the XDP Rx flow hash API to also include the protocol layers over which the hash was computed" * tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits) selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type veth: bpf_xdp_metadata_rx_hash add xdp rss hash type mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type xdp: rss hash types representation selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters skbuff: Fix a race between coalescing and releasing SKBs net: macb: fix a memory corruption in extended buffer descriptor mode selftests: add the missing CONFIG_IP_SCTP in net config udp6: fix potential access to stale information selftests: openvswitch: adjust datapath NL message declaration selftests: mptcp: userspace pm: uniform verify events mptcp: fix NULL pointer dereference on fastopen early fallback mptcp: stricter state check in mptcp_worker mptcp: use mptcp_schedule_work instead of open-coding it net: enetc: workaround for unresponsive pMAC after receiving express traffic sctp: fix a potential overflow in sctp_ifwdtsn_skip net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume() rtnetlink: Restore RTM_NEW/DELLINK notification behavior net: ti/cpsw: Add explicit platform_device.h and of_platform.h includes ... |
||
Linus Torvalds
|
4413ad01e2 |
Devicetree fixes for v6.2, part 3:
- Fix interaction between fw_devlink and DT overlays causing devices to not be probed - Fix the compatible string for loongson,cpu-interrupt-controller -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAmQ1t60ACgkQ+vtdtY28 YcNfvBAApQjSAtrvYlHH5Lp6ANCzNZiXJO2FX5kpPtcdIHMLB0Soo4z2IE6B9V1r e61dw5j21CuRDlsoOG1odg9//02/KK2Dgz7ebisKWVF+1UuWbps1stNuXO3MbLUj Cq4GH4EUs5JwED145jOhLWWq2/bkymJvgWVU8n3Q/q/uL+Fjm4aJxgx92p6IZdN3 CixxhXBAkWLOs9ij8f6bsSUts28XoPZsk+kucdXc+83UninAXJC2KzuvQga8mBPF MGuxQTXmD5vgdPyvqj1D9U3uqkDE6HudrUDXr1yq9esPObjvUTkh09/Wm7OqiDu9 GBUyhD+3GaBK18rKiL0JSDGbGamNR9BaFshovuPEmlAtaoHv8nbv/MmfTnCWwjjs 5lQ7LmOJCuRuQmcTOR8q1csVhUXxssNGaOaclOOXou/crItmSDlLAaj6XLRTPt45 4jPiNKgDH4Pj2vqGBeYnhNyPyc+Y5IVc88pV8kUxfsqnMluzsoLC+0JADXNMhk1T 3sfecpceQav4TFhPOcMIHAkgldBnPQomW6laEn4Ul+dAyAes7q6Y0SvjQy03gDKU LY5QIsHLs5YZXur8TYSbU7Yt44hr4SAm0uz1kcCArlmNtidcjYuw1tLAWnGTxZrx 5q+ZuQ6NTiPKwxTK0Zhf+HqZdzx2IE6JXaPBOeOdYxjTSWGmcKk= =x2A8 -----END PGP SIGNATURE----- Merge tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree fixes from Rob Herring: - Fix interaction between fw_devlink and DT overlays causing devices to not be probed - Fix the compatible string for loongson,cpu-interrupt-controller * tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: treewide: Fix probing of devices in DT overlays dt-bindings: interrupt-controller: loongarch: Fix mismatched compatible |
||
Linus Torvalds
|
531f27ad5e |
This is just a revert of the AMD fix, because the fix had
broken some laptops. We are working on a proper solution. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEElDRnuGcz/wPCXQWMQRCzN7AZXXMFAmQ4YHcACgkQQRCzN7AZ XXO3QRAAlEE9OIeq6yhrnmQyhXTysZNCjMigkppIWAjY96twBifKo2PyaALb5v9O VPlbLDUB250RInrakOjjY3AFGiXP/UEYBsMk7L5UGS3bloyrBLyFEKoXccAO0qO6 pPlVlLN+iszsO+vx2fSiE65o6ZMZTU9FWcroTbvDfO228e9xeq1mjH4d4H88sr90 /PUhK/CT13zdTYX/eB8rF9IwlqAGRbuoNPr70M8cpJckVBs/BDt/EQLdBNzScY2l 5zcaYcpxQlLI+uiBjf3kHKXVcOZwSKYMFK04306blgWhSAxEqMsM3aiwk3C612Ri fc70ONHl7OXSnIREcobIEq22Ehd3L/TEI0br76DkWkurQB29YT4WyEWAj1lmlzCY afHhX5d+ipVybOW3rQfCTdf/U23jVLrvA+n1bwsJqEpACsEXHyfHA+0oidBeiW2O 62wmP9wxjUiO2AkIfNJuMdf12BdK+r0Rvk/5mRIZTUKOkx7B1T/zVHCnt9Qir2BT 0O/3wUTz23onrR/1OnLiSOYQfmlly0/jZu2jyFYoIFxlB2imeqKfFnB0QHjzXOxe BeBTXPa1Y7r4ESNjt5z78MvlGIZLXf+PLx7nfxfVgClu3SUauNPFTFVkv45RJLWU AcL33B5OpmDT1EFxToPNto/kFTa4SzPm1dxEHpNcdURNEaq2KfQ= =FCMQ -----END PGP SIGNATURE----- Merge tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fix from Linus Walleij: "This is just a revert of the AMD fix, because the fix broke some laptops. We are working on a proper solution" * tag 'pinctrl-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: Revert "pinctrl: amd: Disable and mask interrupts on resume" |
||
Linus Torvalds
|
f1be7b6c16 |
drm-fixes for -rc7
- two fbcon regressions - amdgpu: dp mst, smu13 - i915: dual link dsi for tgl+ - armada, nouveau, drm/sched, fbmem -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEb4nG6jLu8Y5XI+PfTA9ye/CYqnEFAmQ4XEYACgkQTA9ye/CY qnFDIA/9FZJx4rW3rNl1fSN9UufQuDBq86kOECbG53VM9QcZVdDaV35cWBEyqG4p Ue8M3aZwltZ0+x6qeZ0rV+qZKlD3KYidq/MgW0YGPdVvSdrdVk0OOc4N6uQU1P1v Phr0hFHVao0+bWQE5gdK9S7DW23m/ys0y0C+LXVRNWWf4kVh5DFSjaqEo+SU8Gj1 6GOUWf+rH44r4cHqx4sgZ8r79jWdc3Bjb6OLSE3YVhyU2cffVJ8+AgIA6JyStOR0 tw0NKfAMI0BBlEewvvNWqKcLvZ7qz5bG+byy6amzX8bLmdyyz2BarM5MKSf7F0NY BC2GiQKUEh+hLsvUhLdqYV2iLKRI2Qjd0ESYF9WO2UElsTEf/2tuGKD5WTMmdJLG 1BjEC1lQL/uSHWy+9qpppEKkmQHo+6MLlvbbbNVxgz8n/ysxnJGTe5ntGRWc88nR 7yrHILf5Ry7/4D+PBctZqalf/JfklrY3hIuTl6XTgLE6eKM5Wgc2xn0ItmUXQD9i B18TCUiElr/FlMyw/oTWcstrI9rjrAv9FBZ0qoEqkXCfu7TPT+tJ+6oLq2rCJKUh 4GuzOm7PYFBu0lof8Cmthz/Z8GDKJrdgdfDfp8yJp7jmIz2q8Xm3yqG66xSQKuAE dOJpHIrQgQAxgTvsSAALXmlPN2XAp5dHBaS4vXPgbFNKWkWi3H0= =aI2h -----END PGP SIGNATURE----- Merge tag 'drm-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm Pull drm fixes from Daniel Vetter: - two fbcon regressions - amdgpu: dp mst, smu13 - i915: dual link dsi for tgl+ - armada, nouveau, drm/sched, fbmem * tag 'drm-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm: fbcon: set_con2fb_map needs to set con2fb_map! fbcon: Fix error paths in set_con2fb_map drm/amd/pm: correct the pcie link state check for SMU13 drm/amd/pm: correct SMU13.0.7 max shader clock reporting drm/amd/pm: correct SMU13.0.7 pstate profiling clock settings drm/amd/display: Pass the right info to drm_dp_remove_payload drm/armada: Fix a potential double free in an error handling path fbmem: Reject FB_ACTIVATE_KD_TEXT from userspace drm/nouveau/fb: add missing sysmen flush callbacks drm/i915/dsi: fix DSS CTL register offsets for TGL+ drm/scheduler: Fix UAF race in drm_sched_entity_push_job() |
||
Jakub Kicinski
|
d0f89c4c1d |
bpf-for-netdev
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-04-13
We've added 6 non-merge commits during the last 1 day(s) which contain
a total of 14 files changed, 205 insertions(+), 38 deletions(-).
The main changes are:
1) One late straggler fix on the XDP hints side which fixes
bpf_xdp_metadata_rx_hash kfunc API before the release goes out
in order to provide information on the RSS hash type,
from Jesper Dangaard Brouer.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
xdp: rss hash types representation
selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
====================
Link: https://lore.kernel.org/r/20230413192939.10202-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
||
Daniel Vetter
|
cab2932213 |
Short summary of fixes pull:
* armada: Fix double free
* fb: Clear FB_ACTIVATE_KD_TEXT in ioctl
* nouveau: Add missing callbacks
* scheduler: Fix use-after-free error
Merge tag 'drm-misc-fixes-2023-04-13' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Short summary of fixes pull:
* armada: Fix double free
* fb: Clear FB_ACTIVATE_KD_TEXT in ioctl
* nouveau: Add missing callbacks
* scheduler: Fix use-after-free error
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20230413184233.GA8148@linux-uq9g
|
||
Daniel Borkmann
|
8c5c2a4898 |
bpf, sockmap: Revert buggy deadlock fix in the sockhash and sockmap
syzbot reported a splat and bisected it to recent commit |
||
Alexei Starovoitov
|
b65ef48c95 |
Merge branch 'XDP-hints: change RX-hash kfunc bpf_xdp_metadata_rx_hash'
Jesper Dangaard Brouer says:
====================
The current API for bpf_xdp_metadata_rx_hash() returns the raw RSS hash value,
but doesn't provide information on the RSS hash type (part of 6.3-rc).
This patchset proposes changing the function call signature by adding
a pointer value argument for providing the RSS hash type.
The patchset also removes all bpf_printk calls from the xdp_hw_metadata
program that we expect driver developers to use. Instead, counters are
introduced for relaying e.g. skip and fail info.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
Jesper Dangaard Brouer
|
0f26b74e7d |
selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
Update BPF selftests to use the new RSS type argument for kfunc
bpf_xdp_metadata_rx_hash.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132894068.340624.8914711185697163690.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
Jesper Dangaard Brouer
|
9123397aee |
mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type
via matching individual Completion Queue Entry (CQE) status bits.
Fixes:
|
||
Jesper Dangaard Brouer
|
96b1a098f3 |
veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type.
The veth driver currently only supports XDP-hints based on the SKB code path.
The SKB has lost information about the RSS hash type by compressing
the information down to a single bitfield, skb->l4_hash, which only records
whether this was an L4 hash value.
In preparation for veth, the xdp_rss_hash_type has an L4 indication
bit that allows us to return a meaningful L4 indication when working
with SKB based packets.
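The fallback described above can be sketched in plain C. This is only an illustration under stated assumptions, not the veth driver code: `struct sketch_skb`, `sketch_rx_hash()` and the `SK_RSS_TYPE_*` constants are hypothetical stand-ins, and only the `l4_hash` bit mirrors a field named in the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical RSS type values; the real enum may use other names and bits. */
enum {
	SK_RSS_TYPE_NONE   = 0,
	SK_RSS_TYPE_L4_ANY = 1 << 3,	/* "some L4 header was hashed" */
};

/* Hypothetical, reduced view of an SKB: hash value plus the single L4 bit. */
struct sketch_skb {
	uint32_t hash;
	bool l4_hash;
};

/* Report the hash and the best RSS type that can still be recovered. */
static int sketch_rx_hash(const struct sketch_skb *skb,
			  uint32_t *hash, uint32_t *rss_type)
{
	*hash = skb->hash;
	/* Only one bit survived in the SKB, so only L4 vs. "unknown" is known. */
	*rss_type = skb->l4_hash ? SK_RSS_TYPE_L4_ANY : SK_RSS_TYPE_NONE;
	return 0;
}
```

A caller would pass two output pointers and check the returned type for the L4 bit; the L3 protocol and the exact L4 protocol are not recoverable in this sketch.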
Fixes:
|
||
Jesper Dangaard Brouer
|
67f245c2ec |
mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type
via mapping table.
The mlx5 hardware can also identify and RSS hash IPSEC traffic. This indicates
that the hash includes the SPI (Security Parameters Index) as part of the IPSEC hash.
Extend xdp core enum xdp_rss_hash_type with IPSEC hash type.
Fixes:
|
||
Jesper Dangaard Brouer
|
0cd917a4a8 |
xdp: rss hash types representation
The RSS hash type specifies what portion of packet data NIC hardware used
when calculating the RSS hash value. The RSS types are focused on Internet
traffic protocols at OSI layers L3 and L4. L2 traffic (e.g. ARP) often gets
a hash value of zero and no RSS type. For L3 the focus is on IPv4 vs. IPv6,
and for L4 primarily on TCP vs. UDP, though some hardware supports SCTP.
Hardware RSS types are encoded differently for each hardware NIC. Most
hardware represents the RSS hash type as a number. Determining L3 vs. L4 often
requires a mapping table, as there often isn't a pattern or sorting
according to OSI layer.
The patch introduces an XDP RSS hash type (enum xdp_rss_hash_type) that
contains both bits for the L3/L4 types and combinations to be used by
drivers for their mapping tables. The enum xdp_rss_type_bits gets exposed
to BPF via BTF, and it is up to the BPF programmer to match using these
defines.
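To make the "bits plus combinations" idea concrete, here is a small C sketch. All names (`sk_rss_type`, `SK_RSS_*`, `sk_hw_to_rss`, `sk_rss_from_hw()`) and all bit positions and hardware codes are invented for illustration and are not taken from the patch; only the general layout follows the description above.

```c
#include <stddef.h>

/* Hypothetical layout: names, bit positions and values are illustrative. */
enum sk_rss_type {
	/* Individual bits for L3/L4 types. */
	SK_RSS_L3_IPV4 = 1 << 0,
	SK_RSS_L3_IPV6 = 1 << 1,
	SK_RSS_L4      = 1 << 2,	/* set whenever an L4 header was hashed */
	SK_RSS_L4_TCP  = 1 << 3,
	SK_RSS_L4_UDP  = 1 << 4,

	/* Combinations meant for driver mapping tables. */
	SK_RSS_TYPE_NONE        = 0,
	SK_RSS_TYPE_L3_IPV4     = SK_RSS_L3_IPV4,
	SK_RSS_TYPE_L3_IPV6     = SK_RSS_L3_IPV6,
	SK_RSS_TYPE_L4_IPV4_TCP = SK_RSS_L3_IPV4 | SK_RSS_L4 | SK_RSS_L4_TCP,
	SK_RSS_TYPE_L4_IPV6_UDP = SK_RSS_L3_IPV6 | SK_RSS_L4 | SK_RSS_L4_UDP,
};

/*
 * Made-up hardware codes 0..4. They are not sorted by layer, which is
 * why a lookup table is used instead of arithmetic on the code.
 */
static const enum sk_rss_type sk_hw_to_rss[] = {
	SK_RSS_TYPE_NONE,		/* hw code 0 */
	SK_RSS_TYPE_L4_IPV4_TCP,	/* hw code 1 */
	SK_RSS_TYPE_L3_IPV6,		/* hw code 2 */
	SK_RSS_TYPE_L3_IPV4,		/* hw code 3 */
	SK_RSS_TYPE_L4_IPV6_UDP,	/* hw code 4 */
};

/* Translate a hardware code; unknown codes map to "no RSS type". */
static enum sk_rss_type sk_rss_from_hw(unsigned int hw_code)
{
	size_t n = sizeof(sk_hw_to_rss) / sizeof(sk_hw_to_rss[0]);

	return hw_code < n ? sk_hw_to_rss[hw_code] : SK_RSS_TYPE_NONE;
}

/* Consumers match on individual bits rather than on whole values. */
static int sk_rss_is_l4(enum sk_rss_type t)
{
	return (t & SK_RSS_L4) != 0;
}
```

Keeping a generic L4 bit next to the per-protocol bits lets a consumer ask "was this an L4 hash?" without enumerating every protocol, which is the kind of matching the text leaves to the BPF programmer.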
This proposal changes the kfunc API bpf_xdp_metadata_rx_hash() by adding
a pointer value argument for providing the RSS hash type.
Change the signature for all xmo_rx_hash calls in drivers to make it compile.
The RSS type implementations for each driver come as separate patches.
Fixes:
|
||
Jesper Dangaard Brouer
|
e8163b98d9 |
selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
The tool xdp_hw_metadata can be used by driver developers
implementing XDP-hints metadata kfuncs.
Remove all bpf_printk calls, as the tool already transfers all the
XDP-hints related information via metadata area to AF_XDP
userspace process.
Add counters for providing remaining information about failure and
skipped packet events.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132891533.340624.7313781245316405141.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
Daniel Vetter
|
fffb0b52d5 |
fbcon: set_con2fb_map needs to set con2fb_map!
I got really badly confused in |
||
Daniel Vetter
|
edf79dd217 |
fbcon: Fix error paths in set_con2fb_map
This is a regression introduced in |
||
Liang Chen
|
0646dc31ca |
skbuff: Fix a race between coalescing and releasing SKBs
Commit |
||
Roman Gushchin
|
e8b7445355 |
net: macb: fix a memory corruption in extended buffer descriptor mode
For quite some time we were chasing a bug which looked like a sudden
permanent failure of networking and mmc on some of our devices.
The bug was very sensitive to any software changes and even more to
any kernel debug options.
Finally we got a setup where the problem was reproducible with
CONFIG_DMA_API_DEBUG=y and it revealed the issue with the rx dma:
[ 16.992082] ------------[ cut here ]------------
[ 16.996779] DMA-API: macb ff0b0000.ethernet: device driver tries to free DMA memory it has not allocated [device address=0x0000000875e3e244] [size=1536 bytes]
[ 17.011049] WARNING: CPU: 0 PID: 85 at kernel/dma/debug.c:1011 check_unmap+0x6a0/0x900
[ 17.018977] Modules linked in: xxxxx
[ 17.038823] CPU: 0 PID: 85 Comm: irq/55-8000f000 Not tainted 5.4.0 #28
[ 17.045345] Hardware name: xxxxx
[ 17.049528] pstate: 60000005 (nZCv daif -PAN -UAO)
[ 17.054322] pc : check_unmap+0x6a0/0x900
[ 17.058243] lr : check_unmap+0x6a0/0x900
[ 17.062163] sp : ffffffc010003c40
[ 17.065470] x29: ffffffc010003c40 x28: 000000004000c03c
[ 17.070783] x27: ffffffc010da7048 x26: ffffff8878e38800
[ 17.076095] x25: ffffff8879d22810 x24: ffffffc010003cc8
[ 17.081407] x23: 0000000000000000 x22: ffffffc010a08750
[ 17.086719] x21: ffffff8878e3c7c0 x20: ffffffc010acb000
[ 17.092032] x19: 0000000875e3e244 x18: 0000000000000010
[ 17.097343] x17: 0000000000000000 x16: 0000000000000000
[ 17.102647] x15: ffffff8879e4a988 x14: 0720072007200720
[ 17.107959] x13: 0720072007200720 x12: 0720072007200720
[ 17.113261] x11: 0720072007200720 x10: 0720072007200720
[ 17.118565] x9 : 0720072007200720 x8 : 000000000000022d
[ 17.123869] x7 : 0000000000000015 x6 : 0000000000000098
[ 17.129173] x5 : 0000000000000000 x4 : 0000000000000000
[ 17.134475] x3 : 00000000ffffffff x2 : ffffffc010a1d370
[ 17.139778] x1 : b420c9d75d27bb00 x0 : 0000000000000000
[ 17.145082] Call trace:
[ 17.147524] check_unmap+0x6a0/0x900
[ 17.151091] debug_dma_unmap_page+0x88/0x90
[ 17.155266] gem_rx+0x114/0x2f0
[ 17.158396] macb_poll+0x58/0x100
[ 17.161705] net_rx_action+0x118/0x400
[ 17.165445] __do_softirq+0x138/0x36c
[ 17.169100] irq_exit+0x98/0xc0
[ 17.172234] __handle_domain_irq+0x64/0xc0
[ 17.176320] gic_handle_irq+0x5c/0xc0
[ 17.179974] el1_irq+0xb8/0x140
[ 17.183109] xiic_process+0x5c/0xe30
[ 17.186677] irq_thread_fn+0x28/0x90
[ 17.190244] irq_thread+0x208/0x2a0
[ 17.193724] kthread+0x130/0x140
[ 17.196945] ret_from_fork+0x10/0x20
[ 17.200510] ---[ end trace 7240980785f81d6f ]---
[ 237.021490] ------------[ cut here ]------------
[ 237.026129] DMA-API: exceeded 7 overlapping mappings of cacheline 0x0000000021d79e7b
[ 237.033886] WARNING: CPU: 0 PID: 0 at kernel/dma/debug.c:499 add_dma_entry+0x214/0x240
[ 237.041802] Modules linked in: xxxxx
[ 237.061637] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 5.4.0 #28
[ 237.068941] Hardware name: xxxxx
[ 237.073116] pstate: 80000085 (Nzcv daIf -PAN -UAO)
[ 237.077900] pc : add_dma_entry+0x214/0x240
[ 237.081986] lr : add_dma_entry+0x214/0x240
[ 237.086072] sp : ffffffc010003c30
[ 237.089379] x29: ffffffc010003c30 x28: ffffff8878a0be00
[ 237.094683] x27: 0000000000000180 x26: ffffff8878e387c0
[ 237.099987] x25: 0000000000000002 x24: 0000000000000000
[ 237.105290] x23: 000000000000003b x22: ffffffc010a0fa00
[ 237.110594] x21: 0000000021d79e7b x20: ffffffc010abe600
[ 237.115897] x19: 00000000ffffffef x18: 0000000000000010
[ 237.121201] x17: 0000000000000000 x16: 0000000000000000
[ 237.126504] x15: ffffffc010a0fdc8 x14: 0720072007200720
[ 237.131807] x13: 0720072007200720 x12: 0720072007200720
[ 237.137111] x11: 0720072007200720 x10: 0720072007200720
[ 237.142415] x9 : 0720072007200720 x8 : 0000000000000259
[ 237.147718] x7 : 0000000000000001 x6 : 0000000000000000
[ 237.153022] x5 : ffffffc010003a20 x4 : 0000000000000001
[ 237.158325] x3 : 0000000000000006 x2 : 0000000000000007
[ 237.163628] x1 : 8ac721b3a7dc1c00 x0 : 0000000000000000
[ 237.168932] Call trace:
[ 237.171373] add_dma_entry+0x214/0x240
[ 237.175115] debug_dma_map_page+0xf8/0x120
[ 237.179203] gem_rx_refill+0x190/0x280
[ 237.182942] gem_rx+0x224/0x2f0
[ 237.186075] macb_poll+0x58/0x100
[ 237.189384] net_rx_action+0x118/0x400
[ 237.193125] __do_softirq+0x138/0x36c
[ 237.196780] irq_exit+0x98/0xc0
[ 237.199914] __handle_domain_irq+0x64/0xc0
[ 237.204000] gic_handle_irq+0x5c/0xc0
[ 237.207654] el1_irq+0xb8/0x140
[ 237.210789] arch_cpu_idle+0x40/0x200
[ 237.214444] default_idle_call+0x18/0x30
[ 237.218359] do_idle+0x200/0x280
[ 237.221578] cpu_startup_entry+0x20/0x30
[ 237.225493] rest_init+0xe4/0xf0
[ 237.228713] arch_call_rest_init+0xc/0x14
[ 237.232714] start_kernel+0x47c/0x4a8
[ 237.236367] ---[ end trace 7240980785f81d70 ]---
Lars was fast to find an explanation: according to the datasheet
bit 2 of the rx buffer descriptor entry has a different meaning in the
extended mode:
Address [2] of beginning of buffer, or
in extended buffer descriptor mode (DMA configuration register [28] = 1),
indicates a valid timestamp in the buffer descriptor entry.
The macb driver didn't mask this bit while getting an address and it
eventually caused a memory corruption and a dma failure.
The problem is resolved by explicitly clearing the problematic bit
if hw timestamping is used.
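The masking step can be illustrated with a short C sketch. It is not the driver's code: the function name and macros are hypothetical, and the assumption that bits 0 and 1 of the descriptor word carry flags while bits [31:2] carry the buffer address is made here for illustration. Only the dual meaning of bit 2 comes from the datasheet excerpt above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bits 0 and 1 are assumed to be flag bits. */
#define SK_RX_ADDR_MASK	(~0x3u)		/* address bits [31:2] */
#define SK_RX_TS_VALID	(1u << 2)	/* timestamp-valid flag in extended mode */

/*
 * Extract the buffer address from the first word of an rx descriptor.
 * In extended descriptor mode bit 2 is a timestamp-valid flag, not an
 * address bit, so it has to be cleared before the value is used as an
 * address.
 */
static uint32_t sk_rx_buf_addr(uint32_t word0, bool ext_desc_mode)
{
	uint32_t addr = word0 & SK_RX_ADDR_MASK;

	if (ext_desc_mode)
		addr &= ~SK_RX_TS_VALID;
	return addr;
}
```

With `ext_desc_mode` set, a descriptor word of `0x75e3e245` yields `0x75e3e240` rather than `0x75e3e244`. The device address in the first warning above, `0x0000000875e3e244`, has bit 2 set, which is at least consistent with the explanation.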
Fixes:
|
||
Xin Long
|
3a0385be13 |
selftests: add the missing CONFIG_IP_SCTP in net config
The selftest sctp_vrf needs CONFIG_IP_SCTP set in config
when building the kernel, so add it.
Fixes:
|
||
Eric Dumazet
|
1c5950fc6f |
udp6: fix potential access to stale information
lena wang reported an issue caused by udpv6_sendmsg()
mangling msg->msg_name and msg->msg_namelen, which
are later read from ____sys_sendmsg():
	/*
	 * If this is sendmmsg() and sending to current destination address was
	 * successful, remember it.
	 */
	if (used_address && err >= 0) {
		used_address->name_len = msg_sys->msg_namelen;
		if (msg_sys->msg_name)
			memcpy(&used_address->name, msg_sys->msg_name,
			       used_address->name_len);
	}
udpv6_sendmsg() wants to pretend the remote address family
is AF_INET in order to call udp_sendmsg().
A fix would be to modify the address in-place, instead
of using a local variable, but this could have other side effects.
Instead, restore initial values before we return from udpv6_sendmsg().
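The save-and-restore approach can be sketched in isolation as follows. This is a simplified illustration, not the kernel change itself: `struct sk_msg`, `sk_inner_send()` and `sk_send_mapped()` are hypothetical stand-ins for `struct msghdr`, `udp_sendmsg()` and the relevant branch of `udpv6_sendmsg()`.

```c
/* Hypothetical stand-in for the two msghdr fields involved. */
struct sk_msg {
	void *name;
	int namelen;
};

/* Records which address the inner send path saw; illustration only. */
static const void *sk_seen_name;

/* Stand-in for the IPv4 send path that expects an IPv4 address. */
static int sk_inner_send(const struct sk_msg *msg)
{
	sk_seen_name = msg->name;
	return 0;
}

/*
 * Temporarily point the message at an IPv4 address for the inner call,
 * then put the caller's values back so later readers of the message
 * never observe a pointer to this function's local data.
 */
static int sk_send_mapped(struct sk_msg *msg, void *v4_addr, int v4_len)
{
	void *saved_name = msg->name;
	int saved_len = msg->namelen;
	int err;

	msg->name = v4_addr;
	msg->namelen = v4_len;
	err = sk_inner_send(msg);

	/* Restore the initial values before returning. */
	msg->name = saved_name;
	msg->namelen = saved_len;
	return err;
}
```

The design choice is that the caller-visible structure is identical before and after the call, so code that reads it afterwards, like the sendmmsg() snippet quoted above, sees the original address and length.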
Fixes:
|
||
Aaron Conole
|
306dc21361 |
selftests: openvswitch: adjust datapath NL message declaration
The netlink message for creating a new datapath takes an array
of ports for the PID creation. This shouldn't cause much of an issue,
but correct it for future cases where we need to decode
datapath information that could include the per-cpu PID map.
Fixes:
|
||
Jakub Kicinski
|
ecfcc6fbeb |
Merge branch 'mptcp-more-fixes-for-6-3'
Matthieu Baerts says:
====================
mptcp: more fixes for 6.3
Patch 1 avoids scheduling the MPTCP worker on a closed socket on some
edge cases. It fixes issues that can be visible from v5.11.
Patch 2 makes sure the MPTCP worker doesn't try to manipulate
disconnected sockets. This is also a fix for an issue that can be
visible from v5.11.
Patch 3 fixes a NULL pointer dereference when MPTCP FastOpen is used
and an early fallback is done. A fix for v6.2.
Patch 4 improves the stability of the userspace PM selftest for a
subtest added in v6.2.
====================
Link: https://lore.kernel.org/r/20230411-upstream-net-20230411-mptcp-fixes-v1-0-ca540f3ef986@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
||
Matthieu Baerts
|
711ae788cb |
selftests: mptcp: userspace pm: uniform verify events
Simply adding a "sleep" before checking something is usually not a good
idea because the chosen delay may be too short or too long. It is
better to wait for events with a timeout.
In this selftest, 'sleep 0.5' is used more than 40 times. It is always
used before calling a 'verify_*' function, except for
verify_listener_events, which was added later.
In the end, using all these 'sleep 0.5' calls seems to work: the slow CIs
have not complained so far. Also, because it doesn't take too much time, we
can just add two more 'sleep 0.5' calls to make uniform what is done before
calling a 'verify_*' function. For the same reasons, we can also delay a bigger
refactoring that replaces all these 'sleep 0.5' calls with functions waiting
for events instead of waiting for a fixed time and hoping for the best.
Fixes:
|
||
Paolo Abeni
|
c0ff6f6da6 |
mptcp: fix NULL pointer dereference on fastopen early fallback
In case of early fallback to TCP, subflow_syn_recv_sock() deletes
the subflow context before returning the newly allocated sock to
the caller.
The fastopen path does not cope with the above, as it unconditionally
dereferences the subflow context.
Fixes:
|