When finding the socket to report an error on, if the invoking packet
is using Segment Routing, the IPv6 destination address is that of an
intermediate router, not the end destination. Extract the ultimate
destination address from the segment address.
This change allows traceroute to function in the presence of Segment
Routing.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC8754 says:
ICMP error packets generated within the SR domain are sent to source
nodes within the SR domain. The invoking packet in the ICMP error
message may contain an SRH. Since the destination address of a packet
with an SRH changes as each segment is processed, it may not be the
destination used by the socket or application that generated the
invoking packet.
For the source of an invoking packet to process the ICMP error
message, the ultimate destination address of the IPv6 header may be
required. The following logic is used to determine the destination
address for use by protocol-error handlers.
* Walk all extension headers of the invoking IPv6 packet to the
routing extension header preceding the upper-layer header.
- If routing header is type 4 Segment Routing Header (SRH)
o The SID at Segment List[0] may be used as the destination
address of the invoking packet.
Mangle the skb so the network header points to the invoking packet
inside the ICMP packet. The seg6 helpers can then be used on the skb
to find any segment routing headers. If found, mark this fact in the
IPv6 control block of the skb, and store the offset into the packet of
the SRH. Then restore the skb back to its old state.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An ICMP error message can contain in its message body part of an IPv6
packet which invoked the error. Such a packet might contain a segment
router header. Export get_srh() so the ICMP code can make use of it.
Since his changes the scope of the function from local to global, add
the seg6_ prefix to keep the namespace clean. And move it into seg6.c
so it is always available, not just when IPV6_SEG6_LWTUNNEL is
enabled.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To give drivers the originating device information for optimized
connection tracking offload, fill in act ct extension with
ifindex from skb.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Driver offloading ct tuples can use the information of which devices
received the packets that created the offloaded connections, to
more efficiently offload them only to the relevant device.
Add new act_ct nf conntrack extension, which is used to store the skb
devices before offloading the connection, and then fill in the tuple
iifindex so drivers can get the device via metadata dissector match.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- bump version strings, by Simon Wunderlich
- allow netlink usage in unprivileged containers, by Linus Lüssing
- remove unneeded variable, by Minghao Chi
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmHTLygWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoUg2D/0Rmb2xC9R/ztyc0OaVyXVr4II0
hyqeHgrWiBFqMOP8AApM2vLAzMyFOcUDwek7KURsLQXi5/wDeyQyg80S7z8voqZA
opCWQVuVuGS29fs6f4OdBWoCdIjlHzoIcwgaIjxbyxaW9jh36zjarLodVqJQXlyH
2LIZD28/neji+sH0QqZz40HlV+08bw4RlTM+TjYUM1XLoX0uZJLXu2jgpWw64Ig4
lw5RbakLX1seywNWMy96w5DhjfQaaNGjaPGf4WHztnMpAsUXfwVl6tOq3UgIKJUK
f/qRwE7DXBKMqY8dKPcULbAjrOUStrU+stUAqMEbjviBGAfc0Vx5c09WXe4ZLHPJ
ZejIdUU8bz19yyvUwz6UXGyba3LujKeDxRsDUfkRvHe4BelBXFuyUC16UW0S7RoT
r0BojmJnxOZ1+i9uVRzS7K88aggoRLCNzSdCmLbXCHNP6DDR88PWv87RGAWZ8vd2
NCM7SI9NSMG0ujwMTQ0dvKwlyabQLlRyO7cWYPHH6c2ms0PlSSvjmCjp+rb6FgLs
iRmVmAT0nEg8ak1Uv6vr3PvEJHIxTtiOmpr3ywBk9XWqeIA01BAVj/48ahc5ADQH
w3imur7TdKAEcb/4k/xz2e44PDKIHsxQdQH6SA6KWxt6YPbHcozS+cD5qau1Mtb0
ExMV//CNa3RFrS6QSw==
=iGTA
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-pullrequest-20220103' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- allow netlink usage in unprivileged containers, by Linus Lüssing
- remove unneeded variable, by Minghao Chi
* tag 'batadv-next-pullrequest-20220103' of git://git.open-mesh.org/linux-merge:
batman-adv: remove unneeded variable in batadv_nc_init
batman-adv: allow netlink usage in unprivileged containers
batman-adv: Start new development cycle
====================
Link: https://lore.kernel.org/r/20220103171722.1126109-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As Nicolas noted, if gateway validation fails walking the multipath
attribute the code should jump to the cleanup to free previously
allocated memory.
Fixes: 1ff15a710a ("ipv6: Check attribute length for RTA_GATEWAY when deleting multipath route")
Signed-off-by: David Ahern <dsahern@kernel.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20220103170555.94638-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ip6_route_multipath_del loop continues processing the multipath
attribute even if delete of a nexthop path fails. For consistency,
do the same if the gateway attribute is invalid.
Fixes: 1ff15a710a ("ipv6: Check attribute length for RTA_GATEWAY when deleting multipath route")
Signed-off-by: David Ahern <dsahern@kernel.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20220103171911.94739-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add comments for both smc_link_sendable() and smc_link_usable()
to help better distinguish and use them.
No function changes.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same fix in commit 5ec7d18d18 ("sctp: use call_rcu to free endpoint")
is also needed for dumping one asoc and sock after the lookup.
Fixes: 86fdb3448c ("sctp: ensure ep is not destroyed before doing the dump")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removed spaces and added a tab that was causing an error on checkpatch
Signed-off-by: Hamish MacDonald <elusivenode@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add neighbour source flag in mctp_neigh_remove(...) to allow removal of
only static neighbours.
This should be a no-op change and might be useful later when mctp can
have MCTP_NEIGH_DISCOVER neighbours.
Signed-off-by: Gagan Kumar <gagan1kumar.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
v3:
- Report 'backlog' (bytes) instead of 'qlen' (number of packets)
v2:
- Fix sparse warning (use rcu_dereference)
This patch adds support for the queue depth in IOAM trace data fields.
The draft [1] says the following:
The "queue depth" field is a 4-octet unsigned integer field. This
field indicates the current length of the egress interface queue of
the interface from where the packet is forwarded out. The queue
depth is expressed as the current amount of memory buffers used by
the queue (a packet could consume one or more memory buffers,
depending on its size).
An existing function (i.e., qdisc_qstats_qlen_backlog) is used to
retrieve the current queue length without reinventing the wheel.
Note: it was tested and qlen is increasing when an artificial delay is
added on the egress with tc.
[1] https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-data#section-5.4.2.7
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pointer link is being re-assigned the same value that it was
initialized with in the previous declaration statement. The
re-assignment is redundant and can be removed.
Fixes: 387707fdf4 ("net/smc: convert static link ID to dynamic references")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements TCP ULP for SMC, helps applications to replace TCP with
SMC protocol in place. And we use it to implement transparent
replacement.
This replaces original TCP sockets with SMC, reuse TCP as clcsock when
calling setsockopt with TCP_ULP option, and without any overhead.
To replace TCP sockets with SMC, there are two approaches:
- use setsockopt() syscall with TCP_ULP option, if error, it would
fallback to TCP.
- use BPF prog with types BPF_CGROUP_INET_SOCK_CREATE or others to
replace transparently. BPF hooks some points in create socket, bind
and others, users can inject their BPF logics without modifying their
applications, and choose which connections should be replaced with SMC
by calling setsockopt() in BPF prog, based on rules, such as TCP tuples,
PID, cgroup, etc...
BPF doesn't support calling setsockopt with TCP_ULP now, I will send the
patches after this accepted.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This prints net namespace ID, helps us to distinguish different net
namespaces when using tracepoints.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds net namespace ID to the kernel log, net_cookie is unique in
the whole system. It is useful in container environment.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds net namespace ID to diag of linkgroup, helps us to distinguish
different namespaces, and net_cookie is unique in the whole system.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, rdma device supports exclusive net namespace isolation,
however linkgroup doesn't know and support ibdev net namespace.
Applications in the containers don't want to share the nics if we
enabled rdma exclusive mode. Every net namespaces should have their own
linkgroups.
This patch introduce a new field net for linkgroup, which is standing
for the ibdev net namespace in the linkgroup. The net in linkgroup is
initialized with the net namespace of link's ibdev. It compares the net
of linkgroup and sock or ibdev before choose it, if no matched, create
new one in current net namespace. If rdma net namespace exclusive mode
is not enabled, it behaves as before.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The addition of routable multicast TX handling introduced a
bug/regression for packets with a link-local multicast destination:
These packets would be sent to all batman-adv nodes with a multicast
router and to all batman-adv nodes with an old version without multicast
router detection.
This even disregards the batman-adv multicast fanout setting, which can
potentially lead to an unwanted, high number of unicast transmissions or
even congestion.
Fixing this by avoiding to send link-local multicast packets to nodes in
the multicast router list.
Fixes: 11d458c1cb ("batman-adv: mcast: apply optimizations for routable packets, too")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Daniel Borkmann says:
====================
pull-request: bpf 2021-12-31
We've added 2 non-merge commits during the last 14 day(s) which contain
a total of 2 files changed, 3 insertions(+), 3 deletions(-).
The main changes are:
1) Revert of an earlier attempt to fix xsk's poll() behavior where it
turned out that the fix for a rare problem made it much worse in
general, from Magnus Karlsson. (Fyi, Magnus mentioned that a proper
fix is coming early next year, so the revert is mainly to avoid
slipping the behavior into 5.16.)
2) Minor misc spell fix in BPF selftests, from Colin Ian King.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, selftests: Fix spelling mistake "tained" -> "tainted"
Revert "xsk: Do not sleep in poll() when need_wakeup set"
====================
Link: https://lore.kernel.org/r/20211231160050.16105-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-12-30
The following pull-request contains BPF updates for your *net-next* tree.
We've added 72 non-merge commits during the last 20 day(s) which contain
a total of 223 files changed, 3510 insertions(+), 1591 deletions(-).
The main changes are:
1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii.
2) Beautify and de-verbose verifier logs, from Christy.
3) Composable verifier types, from Hao.
4) bpf_strncmp helper, from Hou.
5) bpf.h header dependency cleanup, from Jakub.
6) get_func_[arg|ret|arg_cnt] helpers, from Jiri.
7) Sleepable local storage, from KP.
8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
lwtunnel_valid_encap_type_attr is used to validate encap attributes
within a multipath route. Add length validation checking to the type.
lwtunnel_valid_encap_type_attr is called converting attributes to
fib{6,}_config struct which means it is used before fib_get_nhs,
ip6_route_multipath_add, and ip6_route_multipath_del - other
locations that use rtnh_ok and then nla_get_u16 on RTA_ENCAP_TYPE
attribute.
Fixes: 9ed59592e3 ("lwtunnel: fix autoload of lwt modules")
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure RTA_GATEWAY for IPv6 multipath route has enough bytes to hold
an IPv6 address.
Fixes: 6b9ea5a64e ("ipv6: fix multipath route replace error recovery")
Signed-off-by: David Ahern <dsahern@kernel.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit referenced in the Fixes tag used nla_memcpy for RTA_GATEWAY as
does the current nla_get_in6_addr. nla_memcpy protects against accessing
memory greater than what is in the attribute, but there is no check
requiring the attribute to have an IPv6 address. Add it.
Fixes: 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: David Ahern <dsahern@kernel.org>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure RTA_FLOW is at least 4B before using.
Fixes: 4e902c5741 ("[IPv4]: FIB configuration using struct fib_config")
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported uninit-value:
============================================================
BUG: KMSAN: uninit-value in fib_get_nhs+0xac4/0x1f80
net/ipv4/fib_semantics.c:708
fib_get_nhs+0xac4/0x1f80 net/ipv4/fib_semantics.c:708
fib_create_info+0x2411/0x4870 net/ipv4/fib_semantics.c:1453
fib_table_insert+0x45c/0x3a10 net/ipv4/fib_trie.c:1224
inet_rtm_newroute+0x289/0x420 net/ipv4/fib_frontend.c:886
Add helper to validate RTA_GATEWAY length before using the attribute.
Fixes: 4e902c5741 ("[IPv4]: FIB configuration using struct fib_config")
Reported-by: syzbot+d4b9a2851cc3ce998741@syzkaller.appspotmail.com
Signed-off-by: David Ahern <dsahern@kernel.org>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
commit 077cdda764 ("net/mlx5e: TC, Fix memory leak with rules with internal port")
commit 31108d142f ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
commit 4390c6edc0 ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
https://lore.kernel.org/all/20211229065352.30178-1-saeed@kernel.org/
net/smc/smc_wr.c
commit 49dc9013e3 ("net/smc: Use the bitmap API when applicable")
commit 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
bitmap_zero()/memset() is removed by the fix
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using the bitmap API is less verbose than hand writing them.
It also improves the semantic.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The assignment here will be overwritten, so it should be deleted
The clang_analyzer complains as follows:
net/ethtool/netlink.c:
Value stored to 'ret' is never read
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Other maps like hashmaps are already available to sleepable programs.
Sleepable BPF programs run under trace RCU. Allow task, sk and inode
storage to be used from sleepable programs. This allows sleepable and
non-sleepable programs to provide shareable annotations on kernel
objects.
Sleepable programs run in trace RCU where as non-sleepable programs run
in a normal RCU critical section i.e. __bpf_prog_enter{_sleepable}
and __bpf_prog_exit{_sleepable}) (rcu_read_lock or rcu_read_lock_trace).
In order to make the local storage maps accessible to both sleepable
and non-sleepable programs, one needs to call both
call_rcu_tasks_trace and call_rcu to wait for both trace and classical
RCU grace periods to expire before freeing memory.
Paul's work on call_rcu_tasks_trace allows us to have per CPU queueing
for call_rcu_tasks_trace. This behaviour can be achieved by setting
rcupdate.rcu_task_enqueue_lim=<num_cpus> boot parameter.
In light of these new performance changes and to keep the local storage
code simple, avoid adding a new flag for sleepable maps / local storage
to select the RCU synchronization (trace / classical).
Also, update the dereferencing of the pointers to use
rcu_derference_check (with either the trace or normal RCU locks held)
with a common bpf_rcu_lock_held helper method.
Signed-off-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211224152916.1550677-2-kpsingh@kernel.org
As we can see from the comment of the nla_put() that it could return
-EMSGSIZE if the tailroom of the skb is insufficient.
Therefore, it should be better to check the return value of the
nla_put_u32 and return the error code if error accurs.
Also, there are many other functions have the same problem, and if this
patch is correct, I will commit a new version to fix all.
Fixes: 955dc68cb9 ("net/ncsi: Add generic netlink family")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Link: https://lore.kernel.org/r/20211229032118.1706294-1-jiasheng@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to first check if the context is a vlan one, then we need to
check the global bridge multicast vlan snooping flag, and finally the
vlan's multicast flag, otherwise we will unnecessarily enable vlan mcast
processing (e.g. querier timers).
Fixes: 7b54aaaf53 ("net: bridge: multicast: add vlan state initialization and control")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20211228153142.536969-1-nikolay@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A real world panic issue was found as follow in Linux 5.4.
BUG: unable to handle page fault for address: ffffde49a863de28
PGD 7e6fe62067 P4D 7e6fe62067 PUD 7e6fe63067 PMD f51e064067 PTE 0
RIP: 0010:tw_timer_handler+0x20/0x40
Call Trace:
<IRQ>
call_timer_fn+0x2b/0x120
run_timer_softirq+0x1ef/0x450
__do_softirq+0x10d/0x2b8
irq_exit+0xc7/0xd0
smp_apic_timer_interrupt+0x68/0x120
apic_timer_interrupt+0xf/0x20
This issue was also reported since 2017 in the thread [1],
unfortunately, the issue was still can be reproduced after fixing
DCCP.
The ipv4_mib_exit_net is called before tcp_sk_exit_batch when a net
namespace is destroyed since tcp_sk_ops is registered befrore
ipv4_mib_ops, which means tcp_sk_ops is in the front of ipv4_mib_ops
in the list of pernet_list. There will be a use-after-free on
net->mib.net_statistics in tw_timer_handler after ipv4_mib_exit_net
if there are some inflight time-wait timers.
This bug is not introduced by commit f2bf415cfe ("mib: add net to
NET_ADD_STATS_BH") since the net_statistics is a global variable
instead of dynamic allocation and freeing. Actually, commit
61a7e26028 ("mib: put net statistics on struct net") introduces
the bug since it put net statistics on struct net and free it when
net namespace is destroyed.
Moving init_ipv4_mibs() to the front of tcp_init() to fix this bug
and replace pr_crit() with panic() since continuing is meaningless
when init_ipv4_mibs() fails.
[1] https://groups.google.com/g/syzkaller/c/p1tn-_Kc6l4/m/smuL_FMAAgAJ?pli=1
Fixes: 61a7e26028 ("mib: put net statistics on struct net")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20211228104145.9426-1-songmuchun@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Add support for Foxconn MT7922A
- Add support for Realtek RTL8852AE
- Rework HCI event handling to use skb_pull_data
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmHMzl8ZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKSpcD/0dP8yVmlPztQwV/O/DjIHc
rVCplr+pzdOcuW4qXDsKhDt9y5E6KgIhIqnksHV3fM58yY1L6sm9n0Qpihys0mIS
GHQ7VtqCE8WMpifkCTqdKV34Guy9BP0GGQpWTh7jNTeiwHT1t5pjA6ajYE3qaSS/
z4opVFz/eGqinPLPxjP7GBP4VqaLG0QMNo3pSz0hi+E/krIwnjT+VoKNKteDkHf9
klmHrGF2OYtXC8VezvquFgmUgoN+yL/63Ci1DlkF3mRUEc6jMHFMs7BiQov2eENt
Hq9E1ye5Hlf5IedtJ80QdkLvW/JgE1jZKATB7x7E09j9QBP2gxJRP9z6sSJiVIVm
53MRRYaA9jOPCfaVPKGMp+4NQMa7F2gahoI1k7LVS0ksc0RGC/nixtdiZWSjwU6E
cYiQOk5CD3hK+lPSYBvjMOnm+h4w4FZ5W70g5VKKnAGCUBvERSwh2qmUdpG4j1vq
4oeg64DsLP5n7MWbHn7zbtviiJVbMg+a0D4GlExlxu06OOx7yra3s+LbcHoZbG4B
T3ivhx9oX+ho7fG3rjBPrENoZB62n48x/mzxsLyFHt/m3uSgsXVSCVIm9dwwsdBM
P2OzX/dSx/3AhnH0wNPtENAoU4/ETuMOKuFCoU9pL0cjJbQ3M9FipoZakzF/B1se
18+x0pq2o4PT9W1JUPMfvA==
=AVll
-----END PGP SIGNATURE-----
Merge tag 'for-net-next-2021-12-29' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- Add support for Foxconn MT7922A
- Add support for Realtek RTL8852AE
- Rework HCI event handling to use skb_pull_data
* tag 'for-net-next-2021-12-29' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (62 commits)
Bluetooth: MGMT: Fix spelling mistake "simultanous" -> "simultaneous"
Bluetooth: vhci: Set HCI_QUIRK_VALID_LE_STATES
Bluetooth: MGMT: Fix LE simultaneous roles UUID if not supported
Bluetooth: hci_sync: Add check simultaneous roles support
Bluetooth: hci_sync: Wait for proper events when connecting LE
Bluetooth: hci_sync: Add support for waiting specific LE subevents
Bluetooth: hci_sync: Add hci_le_create_conn_sync
Bluetooth: hci_event: Use skb_pull_data when processing inquiry results
Bluetooth: hci_sync: Push sync command cancellation to workqueue
Bluetooth: hci_qca: Stop IBS timer during BT OFF
Bluetooth: btusb: Add support for Foxconn MT7922A
Bluetooth: btintel: Add missing quirks and msft ext for legacy bootloader
Bluetooth: btusb: Add two more Bluetooth parts for WCN6855
Bluetooth: L2CAP: Fix using wrong mode
Bluetooth: hci_sync: Fix not always pausing advertising when necessary
Bluetooth: mgmt: Make use of mgmt_send_event_skb in MGMT_EV_DEVICE_CONNECTED
Bluetooth: mgmt: Make use of mgmt_send_event_skb in MGMT_EV_DEVICE_FOUND
Bluetooth: mgmt: Introduce mgmt_alloc_skb and mgmt_send_event_skb
Bluetooth: btusb: Return error code when getting patch status failed
Bluetooth: btusb: Handle download_firmware failure cases
...
====================
Link: https://lore.kernel.org/r/20211229211258.2290966-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As reported[1] if startup query interval is set too low in combination with
large number of startup queries and we have multiple bridges or even a
single bridge with multiple querier vlans configured we can crash the
machine. Add a 1 second minimum which must be enforced by overwriting the
value if set lower (i.e. without returning an error) to avoid breaking
user-space. If that happens a log message is emitted to let the admin know
that the startup interval has been set to the minimum. It doesn't make
sense to make the startup interval lower than the normal query interval
so use the same value of 1 second. The issue has been present since these
intervals could be user-controlled.
[1] https://lore.kernel.org/netdev/e8b9ce41-57b9-b6e2-a46a-ff9c791cf0ba@gmail.com/
Fixes: d902eee43f ("bridge: Add multicast count/interval sysfs entries")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As reported[1] if query interval is set too low and we have multiple
bridges or even a single bridge with multiple querier vlans configured
we can crash the machine. Add a 1 second minimum which must be enforced
by overwriting the value if set lower (i.e. without returning an error) to
avoid breaking user-space. If that happens a log message is emitted to let
the administrator know that the interval has been set to the minimum.
The issue has been present since these intervals could be user-controlled.
[1] https://lore.kernel.org/netdev/e8b9ce41-57b9-b6e2-a46a-ff9c791cf0ba@gmail.com/
Fixes: d902eee43f ("bridge: Add multicast count/interval sysfs entries")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a check that the user-provided option is at least as long as the
number of bytes we intend to read. Before this patch we would blindly
read sizeof(int) bytes even in cases where the user passed
optlen<sizeof(int), which would potentially read garbage or fault.
Discovered by new tests in https://github.com/google/gvisor/pull/6957 .
The original get_user call predates history in the git repo.
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20211229200947.2862255-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit initialises the xskb's free_list_node when the xskb is
allocated. This prevents a potential false negative returned from a call
to list_empty for that node, such as the one introduced in commit
199d983bc0 ("xsk: Fix crash on double free in buffer pool")
In my environment this issue caused packets to not be received by
the xdpsock application if the traffic was running prior to application
launch. This happened when the first batch of packets failed the xskmap
lookup and XDP_PASS was returned from the bpf program. This action is
handled in the i40e zc driver (and others) by allocating an skbuff,
freeing the xdp_buff and adding the associated xskb to the
xsk_buff_pool's free_list if it hadn't been added already. Without this
fix, the xskb is not added to the free_list because the check to determine
if it was added already returns an invalid positive result. Later, this
caused allocation errors in the driver and the failure to receive packets.
Fixes: 199d983bc0 ("xsk: Fix crash on double free in buffer pool")
Fixes: 2b43470add ("xsk: Introduce AF_XDP buffer allocation API")
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20211220155250.2746-1-ciara.loftus@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sock.h is pretty heavily used (5k objects rebuilt on x86 after
it's touched). We can drop the include of filter.h from it and
add a forward declaration of struct sk_filter instead.
This decreases the number of rebuilt objects when bpf.h
is touched from ~5k to ~1k.
There's a lot of missing includes this was masking. Primarily
in networking tho, this time.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
Some NVMEM devices have text based cells. In such cases MAC is stored in
a XX:XX:XX:XX:XX:XX format. Use mac_pton() to parse such data and
support those NVMEM cells. This is required to support e.g. a very
popular U-Boot and its environment variables.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Variable expectlen is being assigned a value that is never read, the
assignment occurs before a return statement. The assignment is
redundant and can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A crash occurs when smc_cdc_tx_handler() tries to access smc_sock
but smc_release() has already freed it.
[ 4570.695099] BUG: unable to handle page fault for address: 000000002eae9e88
[ 4570.696048] #PF: supervisor write access in kernel mode
[ 4570.696728] #PF: error_code(0x0002) - not-present page
[ 4570.697401] PGD 0 P4D 0
[ 4570.697716] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 4570.698228] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc4+ #111
[ 4570.699013] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/0
[ 4570.699933] RIP: 0010:_raw_spin_lock+0x1a/0x30
<...>
[ 4570.711446] Call Trace:
[ 4570.711746] <IRQ>
[ 4570.711992] smc_cdc_tx_handler+0x41/0xc0
[ 4570.712470] smc_wr_tx_tasklet_fn+0x213/0x560
[ 4570.712981] ? smc_cdc_tx_dismisser+0x10/0x10
[ 4570.713489] tasklet_action_common.isra.17+0x66/0x140
[ 4570.714083] __do_softirq+0x123/0x2f4
[ 4570.714521] irq_exit_rcu+0xc4/0xf0
[ 4570.714934] common_interrupt+0xba/0xe0
Though smc_cdc_tx_handler() checked the existence of smc connection,
smc_release() may have already dismissed and released the smc socket
before smc_cdc_tx_handler() further visits it.
smc_cdc_tx_handler() |smc_release()
if (!conn) |
|
|smc_cdc_tx_dismiss_slots()
| smc_cdc_tx_dismisser()
|
|sock_put(&smc->sk) <- last sock_put,
| smc_sock freed
bh_lock_sock(&smc->sk) (panic) |
To make sure we won't receive any CDC messages after we free the
smc_sock, add a refcount on the smc_connection for inflight CDC
message(posted to the QP but haven't received related CQE), and
don't release the smc_connection until all the inflight CDC messages
haven been done, for both success or failed ones.
Using refcount on CDC messages brings another problem: when the link
is going to be destroyed, smcr_link_clear() will reset the QP, which
then remove all the pending CQEs related to the QP in the CQ. To make
sure all the CQEs will always come back so the refcount on the
smc_connection can always reach 0, smc_ib_modify_qp_reset() was replaced
by smc_ib_modify_qp_error().
And remove the timeout in smc_wr_tx_wait_no_pending_sends() since we
need to wait for all pending WQEs done, or we may encounter use-after-
free when handling CQEs.
For IB device removal routine, we need to wait for all the QPs on that
device been destroyed before we can destroy CQs on the device, or
the refcount on smc_connection won't reach 0 and smc_sock cannot be
released.
Fixes: 5f08318f61 ("smc: connection data control (CDC)")
Reported-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We found smc_llc_send_link_delete_all() sometimes wait
for 2s timeout when testing with RDMA link up/down.
It is possible when a smc_link is in ACTIVATING state,
the underlaying QP is still in RESET or RTR state, which
cannot send any messages out.
smc_llc_send_link_delete_all() use smc_link_usable() to
checks whether the link is usable, if the QP is still in
RESET or RTR state, but the smc_link is in ACTIVATING, this
LLC message will always fail without any CQE entering the
CQ, and we will always wait 2s before timeout.
Since we cannot send any messages through the QP before
the QP enter RTS. I add a wrapper smc_link_sendable()
which checks the state of QP along with the link state.
And replace smc_link_usable() with smc_link_sendable()
in all LLC & CDC message sending routine.
Fixes: 5f08318f61 ("smc: connection data control (CDC)")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In smc_wr_tx_send_wait() the completion on index specified by
pend->idx is initialized and after smc_wr_tx_send() was called the wait
for completion starts. pend->idx is used to get the correct index for
the wait, but the pend structure could already be cleared in
smc_wr_tx_process_cqe().
Introduce pnd_idx to hold and use a local copy of the correct index.
Fixes: 09c61d24f9 ("net/smc: wait for departure of an IB message")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In compat mode SIOC{G,S}IFBR ioctls were only supporting
BRCTL_GET_VERSION returning an artificially version to spur userland
tool to use SIOCDEVPRIVATE instead. But some userland tools ignore that
and use SIOC{G,S}IFBR unconditionally as seen with busybox's brctl.
Example of non working 32-bit brctl with CONFIG_COMPAT=y:
$ brctl show
brctl: SIOCGIFBR: Invalid argument
Example of fixed 32-bit brctl with CONFIG_COMPAT=y:
$ brctl show
bridge name bridge id STP enabled interfaces
br0
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "__ip6_tnl_parm" struct was left uninitialized causing an invalid
load of random data when the "__ip6_tnl_parm" struct was used elsewhere.
As an example, in the function "ip6_tnl_xmit_ctl()", it tries to access
the "collect_md" member. With "__ip6_tnl_parm" being uninitialized and
containing random data, the UBSAN detected that "collect_md" held a
non-boolean value.
The UBSAN issue is as follows:
===============================================================
UBSAN: invalid-load in net/ipv6/ip6_tunnel.c:1025:14
load of value 30 is not a valid value for type '_Bool'
CPU: 1 PID: 228 Comm: kworker/1:3 Not tainted 5.16.0-rc4+ #8
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Workqueue: ipv6_addrconf addrconf_dad_work
Call Trace:
<TASK>
dump_stack_lvl+0x44/0x57
ubsan_epilogue+0x5/0x40
__ubsan_handle_load_invalid_value+0x66/0x70
? __cpuhp_setup_state+0x1d3/0x210
ip6_tnl_xmit_ctl.cold.52+0x2c/0x6f [ip6_tunnel]
vti6_tnl_xmit+0x79c/0x1e96 [ip6_vti]
? lock_is_held_type+0xd9/0x130
? vti6_rcv+0x100/0x100 [ip6_vti]
? lock_is_held_type+0xd9/0x130
? rcu_read_lock_bh_held+0xc0/0xc0
? lock_acquired+0x262/0xb10
dev_hard_start_xmit+0x1e6/0x820
__dev_queue_xmit+0x2079/0x3340
? mark_lock.part.52+0xf7/0x1050
? netdev_core_pick_tx+0x290/0x290
? kvm_clock_read+0x14/0x30
? kvm_sched_clock_read+0x5/0x10
? sched_clock_cpu+0x15/0x200
? find_held_lock+0x3a/0x1c0
? lock_release+0x42f/0xc90
? lock_downgrade+0x6b0/0x6b0
? mark_held_locks+0xb7/0x120
? neigh_connected_output+0x31f/0x470
? lockdep_hardirqs_on+0x79/0x100
? neigh_connected_output+0x31f/0x470
? ip6_finish_output2+0x9b0/0x1d90
? rcu_read_lock_bh_held+0x62/0xc0
? ip6_finish_output2+0x9b0/0x1d90
ip6_finish_output2+0x9b0/0x1d90
? ip6_append_data+0x330/0x330
? ip6_mtu+0x166/0x370
? __ip6_finish_output+0x1ad/0xfb0
? nf_hook_slow+0xa6/0x170
ip6_output+0x1fb/0x710
? nf_hook.constprop.32+0x317/0x430
? ip6_finish_output+0x180/0x180
? __ip6_finish_output+0xfb0/0xfb0
? lock_is_held_type+0xd9/0x130
ndisc_send_skb+0xb33/0x1590
? __sk_mem_raise_allocated+0x11cf/0x1560
? dst_output+0x4a0/0x4a0
? ndisc_send_rs+0x432/0x610
addrconf_dad_completed+0x30c/0xbb0
? addrconf_rs_timer+0x650/0x650
? addrconf_dad_work+0x73c/0x10e0
addrconf_dad_work+0x73c/0x10e0
? addrconf_dad_completed+0xbb0/0xbb0
? rcu_read_lock_sched_held+0xaf/0xe0
? rcu_read_lock_bh_held+0xc0/0xc0
process_one_work+0x97b/0x1740
? pwq_dec_nr_in_flight+0x270/0x270
worker_thread+0x87/0xbf0
? process_one_work+0x1740/0x1740
kthread+0x3ac/0x490
? set_kthread_struct+0x100/0x100
ret_from_fork+0x22/0x30
</TASK>
===============================================================
The solution is to initialize "__ip6_tnl_parm" struct to zeros in the
"vti6_siocdevprivate()" function.
Signed-off-by: William Zhao <wizhao@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to delay the endpoint free by calling call_rcu() to fix
another use-after-free issue in sctp_sock_dump():
BUG: KASAN: use-after-free in __lock_acquire+0x36d9/0x4c20
Call Trace:
__lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218
lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
_raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
spin_lock_bh include/linux/spinlock.h:334 [inline]
__lock_sock+0x203/0x350 net/core/sock.c:2253
lock_sock_nested+0xfe/0x120 net/core/sock.c:2774
lock_sock include/net/sock.h:1492 [inline]
sctp_sock_dump+0x122/0xb20 net/sctp/diag.c:324
sctp_for_each_transport+0x2b5/0x370 net/sctp/socket.c:5091
sctp_diag_dump+0x3ac/0x660 net/sctp/diag.c:527
__inet_diag_dump+0xa8/0x140 net/ipv4/inet_diag.c:1049
inet_diag_dump+0x9b/0x110 net/ipv4/inet_diag.c:1065
netlink_dump+0x606/0x1080 net/netlink/af_netlink.c:2244
__netlink_dump_start+0x59a/0x7c0 net/netlink/af_netlink.c:2352
netlink_dump_start include/linux/netlink.h:216 [inline]
inet_diag_handler_cmd+0x2ce/0x3f0 net/ipv4/inet_diag.c:1170
__sock_diag_cmd net/core/sock_diag.c:232 [inline]
sock_diag_rcv_msg+0x31d/0x410 net/core/sock_diag.c:263
netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2477
sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:274
This issue occurs when asoc is peeled off and the old sk is freed after
getting it by asoc->base.sk and before calling lock_sock(sk).
To prevent the sk free, as a holder of the sk, ep should be alive when
calling lock_sock(). This patch uses call_rcu() and moves sock_put and
ep free into sctp_endpoint_destroy_rcu(), so that it's safe to try to
hold the ep under rcu_read_lock in sctp_transport_traverse_process().
If sctp_endpoint_hold() returns true, it means this ep is still alive
and we have held it and can continue to dump it; If it returns false,
it means this ep is dead and can be freed after rcu_read_unlock, and
we should skip it.
In sctp_sock_dump(), after locking the sk, if this ep is different from
tsp->asoc->ep, it means during this dumping, this asoc was peeled off
before calling lock_sock(), and the sk should be skipped; If this ep is
the same with tsp->asoc->ep, it means no peeloff happens on this asoc,
and due to lock_sock, no peeloff will happen either until release_sock.
Note that delaying endpoint free won't delay the port release, as the
port release happens in sctp_endpoint_destroy() before calling call_rcu().
Also, freeing endpoint by call_rcu() makes it safe to access the sk by
asoc->base.sk in sctp_assocs_seq_show() and sctp_rcv().
Thanks Jones to bring this issue up.
v1->v2:
- improve the changelog.
- add kfree(ep) into sctp_endpoint_destroy_rcu(), as Jakub noticed.
Reported-by: syzbot+9276d76e83e3bcde6c99@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Fixes: d25adbeb0c ("sctp: fix an use-after-free issue in sctp_sock_dump")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pointer lt is being assigned a value and then later
updated but that value is never read. The pointer is
redundant and can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The max number of UDP gso segments is intended to cap to
UDP_MAX_SEGMENTS, this is checked in udp_send_skb().
skb->len contains network and transport header len here, we should use
only data len instead.
This is the ipv6 counterpart to the below referenced commit,
which missed the ipv6 change
Fixes: 158390e456 ("udp: using datalen to cap max gso segments")
Signed-off-by: Coco Li <lixiaoyan@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20211223222441.2975883-1-lixiaoyan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a spelling mistake in a bt_dev_info message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Commit 561d835281 ("bridge: use ndo_siocdevprivate") changed the
source and destination arguments of copy_{to,from}_user in bridge's
old_deviceless() from args[1] to uarg breaking SIOC{G,S}IFBR ioctls.
Commit cbd7ad29a5 ("net: bridge: fix ioctl old_deviceless bridge
argument") fixed only BRCTL_{ADD,DEL}_BRIDGES commands leaving
BRCTL_GET_BRIDGES one untouched.
The fixes BRCTL_GET_BRIDGES as well and has been tested with busybox's
brctl.
Example of broken brctl:
$ brctl show
bridge name bridge id STP enabled interfaces
brctl: can't get bridge name for index 0: No such device or address
Example of fixed brctl:
$ brctl show
bridge name bridge id STP enabled interfaces
br0 8000.000000000000 no
Fixes: 561d835281 ("bridge: use ndo_siocdevprivate")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/all/20211223153139.7661-2-repk@triplefau.lt/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For Ocelot switches, the CPU injected frames have an injection header
where it can specify the QoS class of the packet and the DSA tag, now it
uses the SKB priority to set that. If a traffic class to priority
mapping is configured on the netdevice (with mqprio for example ...), it
won't be considered for CPU injected headers. This patch make the QoS
class aligned to the priority to traffic class mapping if it exists.
Fixes: 8dce89aa5f ("net: dsa: ocelot: add tagger for Ocelot/Felix switches")
Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Signed-off-by: Marouen Ghodhbane <marouen.ghodhbane@nxp.com>
Link: https://lore.kernel.org/r/20211223072211.33130-1-xiaoliang.yang_1@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Struct sctp_ep_common is included in both asoc and ep, but hlist_node
and hashent are only needed by ep after asoc_hashtable was dropped by
Commit b5eff71283 ("sctp: drop the old assoc hashtable of sctp").
So it is better to move hlist_node and hashent from sctp_ep_common to
sctp_endpoint, and it saves some space for each asoc.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel generates mapping change message, XFRM_MSG_MAPPING,
when a source port chage is detected on a input state with UDP
encapsulation set. Kernel generates a message for each IPsec packet
with new source port. For a high speed flow per packet mapping change
message can be excessive, and can overload the user space listener.
Introduce rate limiting for XFRM_MSG_MAPPING message to the user space.
The rate limiting is configurable via netlink, when adding a new SA or
updating it. Use the new attribute XFRMA_MTIMER_THRESH in seconds.
v1->v2 change:
update xfrm_sa_len()
v2->v3 changes:
use u32 insted unsigned long to reduce size of struct xfrm_state
fix xfrm_ompat size Reported-by: kernel test robot <lkp@intel.com>
accept XFRM_MSG_MAPPING only when XFRMA_ENCAP is present
Co-developed-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
SA use_time was only updated once, for the first packet.
with this fix update the use_time for every packet.
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
If destination port is above 32k and source port below 16k
assume this might cause 'port shadowing' where a 'new' inbound
connection matches an existing one, e.g.
inbound X:41234 -> Y:53 matches existing conntrack entry
Z:53 -> X:4123, where Z got natted to X.
In this case, new packet is natted to Z:53 which is likely
unwanted.
We avoid the rewrite for connections that originate from local host:
port-shadowing is only possible with forwarded connections.
Also adjust test case.
v3: no need to call tuple_force_port_remap if already in random mode (Phil)
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Phil Sutter <phil@nwl.cc>
Acked-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows to identify flows that originate from local machine
in a followup patch.
It would be possible to make this a ->status bit instead.
For now I did not do that yet because I don't have a use-case for
exposing this info to userspace.
If one comes up the toggle can be replaced with a status bit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Chain stats are updated from the Netfilter hook path which already run
under rcu read-size lock section.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If controller/driver don't support LE simultaneous roles its UUID shall
be omitted when responding to MGMT_OP_READ_EXP_FEATURES_INFO.
This also rework the support introducing HCI_LE_SIMULTANEOUS_ROLES flag
so it can be detected when userspace wants to use or not.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This attempts to check if the controller can act as both central and
peripheral simultaneously and in case it does skip suspending
advertising or in case of directed advertising don't fail if scanning.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When using HCI_OP_LE_CREATE_CONN wait for HCI_EV_LE_CONN_COMPLETE before
completing it and for HCI_OP_LE_EXT_CREATE_CONN wait for
HCI_EV_LE_ENHANCED_CONN_COMPLETE before resuming advertising.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This adds support for waiting for specific LE subevents instead of
command status which may only indicate that the commands is in progress
and a different event is used to complete the operation.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This adds hci_le_create_conn_sync and make hci_le_connect use it instead
of queueing multiple commands which may conflict with the likes of
hci_update_passive_scan which uses hci_cmd_sync_queue.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This makes each result entry to be checked using skb_pull_data instead
of acessing them by index.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
syzbot reported that hci_cmd_sync_cancel may sleep from the wrong
context. To avoid this, create a new work item that pushes the relevant
parts into a different context.
Note that we keep the old implementation with the name
__hci_cmd_sync_cancel as the sleeping behaviour is desired in some
cases.
Reported-and-tested-by: syzbot+485cc00ea7cf41dfdbf1@syzkaller.appspotmail.com
Fixes: c97a747efc ("Bluetooth: btusb: Cancel sync commands for certain URB errors")
Signed-off-by: Benjamin Berg <bberg@redhat.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Add new device generic parameter to determine the size of the
asynchronous control events EQ.
For example, to reduce event EQ size to 64, execute:
$ devlink dev param set pci/0000:06:00.0 \
name event_eq_size value 64 cmode driverinit
$ devlink dev reload pci/0000:06:00.0
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add new device generic parameter to determine the size of the
I/O completion EQs.
For example, to reduce I/O EQ size to 64, execute:
$ devlink dev param set pci/0000:06:00.0 \
name io_eq_size value 64 cmode driverinit
$ devlink dev reload pci/0000:06:00.0
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
If user has a set to use SOCK_STREAM the socket would default to
L2CAP_MODE_ERTM which later needs to be adjusted if the destination
address is LE which doesn't support such mode.
Fixes: 15f02b9105 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
hci_pause_advertising_sync shall always pause advertising until
hci_resume_advertising_sync but instance 0x00 doesn't count
in adv_instance_cnt so it was causing it to be skipped.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This makes use of mgmt_alloc_skb to build MGMT_EV_DEVICE_CONNECTED so
the data is copied directly to skb that is then sent using
mgmt_send_event_skb eliminating the necessity of intermediary buffers.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This makes use of mgmt_alloc_skb to build MGMT_EV_DEVICE_FOUND so the
data is copied directly to skb that is then sent using
mgmt_send_event_skb eliminating the necessity of intermediary buffers.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
* ndo_fill_forward_path support in mac80211, to let
drivers use it
* association comeback notification for userspace,
to be able to react more sensibly to long delays
* support for background radar detection hardware
in some chipsets
* SA Query Procedures offload on the AP side
* more logging if we find problems with HT/VHT/HE
* various cleanups and minor fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmHBuIoACgkQB8qZga/f
l8SDNQ//bWl1fnVTzXcva16NGXNtQc8ufOdDfEHsusTA0qP1EfCDfhiMmRZ+jUQH
Xdg7F3Yube0fgij1sEgpcoVOFm5wr7p861nljR8m71t9FI832gfd+qdCJicNxGGI
B3zEhHCkcZ4yBhT35+cKG/H3WBysI8RO65dC6NVlzCyY1iM9TVkHBtbEKrdNljcM
cKKWRp/fk7lCRVqLtunUd5kJauwJxjwHOm4GTH5BajbT/06m91GLoj/tZEjr9rQL
aSsBa1nR0/LcMyYbbQYIxLikTZnkzILIJGLakb7k5ZJ2W4/hUv0Zn6LUCyMDM1mK
7+Bt6qvB3Wz/TwjKYDm2qOniaD4IDVOtEpVPaXGau8c5Cj6rjnJ/cgF3ydBk4+xB
5xngZBCk6Y4+epg9V7EWfqmV0vVqlWqfUfARwPulLWA1X15mVVBmcrafGEaLvGrC
mvkq0n0XZzf+ObrILK7yjafOdLC4ATCj8j6RW85mH4yU+PqKrx3gOCrWn3Zm+6BN
n6y7vs5x6zEitqjap4zsiVxqJf3jtAVcdVy7k52VF2BBpF8xoyrIMYZw5CNUG2Jv
aTmW5aE8X9mQ2VT88JewZst0IX4jjfK/B8wOj24tokC2mXRdM5uKTOWK7uTFQJfM
lLFcRYzo6n6epHrA5oBN4SnQ3/QpZNJOEsRxyROXemDxnQ9de+w=
=u1jf
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-net-next-2021-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
This time we have:
* ndo_fill_forward_path support in mac80211, to let drivers use it
* association comeback notification for userspace, to be able
to react more sensibly to long delays
* support for background radar detection hardware in some chipsets
* SA Query Procedures offload on the AP side
* more logging if we find problems with HT/VHT/HE
* various cleanups and minor fixes
Conflicts:
net/wireless/reg.c:
e08ebd6d7b ("cfg80211: Acquire wiphy mutex on regulatory work")
701fdfe348 ("cfg80211: Enable regulatory enforcement checks for drivers supporting mesh iface")
https://lore.kernel.org/r/20211221111950.57ecc6a7@canb.auug.org.au
drivers/net/wireless/ath/ath10k/wmi.c:
7f599aeccb ("cfg80211: Use the HE operation IE to determine a 6GHz BSS channel")
3bf2537ec2 ("ath10k: drop beacon and probe response which leak from other channel")
https://lore.kernel.org/r/20211221115004.1cd6b262@canb.auug.org.au
* tag 'mac80211-next-for-net-next-2021-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next: (32 commits)
cfg80211: Enable regulatory enforcement checks for drivers supporting mesh iface
rfkill: allow to get the software rfkill state
cfg80211: refactor cfg80211_get_ies_channel_number()
nl82011: clarify interface combinations wrt. channels
nl80211: Add support to offload SA Query procedures for AP SME device
nl80211: Add support to set AP settings flags with single attribute
mac80211: add more HT/VHT/HE state logging
cfg80211: Use the HE operation IE to determine a 6GHz BSS channel
cfg80211: rename offchannel_chain structs to background_chain to avoid confusion with ETSI standard
mac80211: Notify cfg80211 about association comeback
cfg80211: Add support for notifying association comeback
mac80211: introduce channel switch disconnect function
cfg80211: Fix order of enum nl80211_band_iftype_attr documentation
cfg80211: simplify cfg80211_chandef_valid()
mac80211: Remove a couple of obsolete TODO
mac80211: fix FEC flag in radio tap header
mac80211: use coarse boottime for airtime fairness code
ieee80211: change HE nominal packet padding value defines
cfg80211: use ieee80211_bss_get_elem() instead of _get_ie()
mac80211: Use memset_after() to clear tx status
...
====================
Link: https://lore.kernel.org/r/20211221112532.28708-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix following coccicheck warnings:
./net/sched/cls_api.c:3333:17-18: WARNING opportunity for min()
./net/sched/cls_api.c:3389:17-18: WARNING opportunity for min()
./net/sched/cls_api.c:3427:17-18: WARNING opportunity for min()
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This ioctl() implicitly assumed that the socket was already bound to
a valid local socket name, i.e. Phonet object. If the socket was not
bound, two separate problems would occur:
1) We'd send an pipe enablement request with an invalid source object.
2) Later socket calls could BUG on the socket unexpectedly being
connected yet not bound to a valid object.
Reported-by: syzbot+2dc91e7fc3dea88b1e8a@syzkaller.appspotmail.com
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently cfg80211 checks for invalid channels whenever there is a
regulatory update and stops the active interfaces if it is operating on
an unsupported channel in the new regulatory domain.
This is done based on a regulatory flag REGULATORY_IGNORE_STALE_KICKOFF
set during wiphy registration which disables this enforcement when
unsupported interface modes are supported by driver.
Add support to enable this enforcement when Mesh Point interface type
is advertised by drivers.
Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Link: https://lore.kernel.org/r/1638409120-28997-1-git-send-email-quic_srirrama@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
iwlwifi needs to be able to differentiate between the
software rfkill state and the hardware rfkill state.
The reason for this is that iwlwifi needs to notify any
change in the software rfkill state even when it doesn't
own the device (which means even when the hardware rfkill
is asserted).
In order to be able to know the software rfkill when the
host does not own the device, iwlwifi needs to be able to
ask the state of the software rfkill ignoring the state
of the hardware rfkill.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://lore.kernel.org/r/20211219195124.125689-1-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In previous method each AP settings flag is represented by a top-level
flag attribute and conversion to enum cfg80211_ap_settings_flags had to
be done before sending them to driver. This commit is to make it easier
to define new AP settings flags and sending them to driver.
This commit also deprecate sending of
%NL80211_ATTR_EXTERNAL_AUTH_SUPPORT in %NL80211_CMD_START_AP. But to
maintain backwards compatibility checks for
%NL80211_ATTR_EXTERNAL_AUTH_SUPPORT in %NL80211_CMD_START_AP when
%NL80211_ATTR_AP_SETTINGS_FLAGS not present in %NL80211_CMD_START_AP.
Signed-off-by: Veerendranath Jakkam <vjakkam@codeaurora.org>
Link: https://lore.kernel.org/r/1637911519-21306-1-git-send-email-vjakkam@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A non-collocated AP whose primary channel is not a PSC channel
may transmit a duplicated beacon on the corresponding PSC channel
in which it would indicate its true primary channel.
Use this inforamtion contained in the HE operation IE to determine
the primary channel of the AP.
In case of invalid infomration ignore it and use the channel
the frame was received on.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211202143322.71eb2176e54e.I130f678e4aa390973ab39d838bbfe7b2d54bff8e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ETSI standard defines "Offchannel CAC" as:
"Off-Channel CAC is performed by a number of non-continuous checks
spread over a period in time. This period, which is required to
determine the presence of radar signals, is defined as the Off-Channel
CAC Time..
Minimum Off-Channel CAC Time 6 minutes and Maximum Off-Channel CAC Time
4 hours..".
mac80211 implementation refers to a dedicated hw chain used for continuous
radar monitoring. Rename offchannel_* references to background_* in
order to avoid confusion with ETSI standard.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/4204cc1d648d76b44557981713231e030a3bd991.1638190762.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Thought the underline driver MLME can handle association temporal
rejection with comeback, it is still useful to notify this to
user space, as user space might want to handle the temporal
rejection differently. For example, in case the comeback time
is too long, user space can deauthenticate immediately and try
to associate with a different AP.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211129152938.2467809e8cb3.I45574185b582666bc78eef0c29a4c36b478e5382@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Introduce a disconnect function that can be used when a
channel switch error occurs. The channel switch can request to
block the tx, and so, we need to make sure we do not send a deauth
frame in this case.
Signed-off-by: Nathan Errera <nathan.errera@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211129152938.cd2a615a0702.I9edb14785586344af17644b610ab5be109dcef00@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There are a lot of duplicate checks in this function to
check the delta between the control channel and CF1.
With the addition of 320 MHz, this will become even more.
Simplify the code so that the common checks are done
only once for multiple bandwidths.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211129152938.2d0240b07f11.I759e8e990f5386ba2b56ffb2488a8d4e16e22c1b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In mac80211, while building radiotap header
IEEE80211_RADIOTAP_MCS_HAVE_FEC flag is missing when LDPC enabled
from driver, hence LDPC is not updated properly in radiotap header.
Fix that by adding HAVE_FEC flag while building radiotap header.
Signed-off-by: P Praneesh <quic_ppranees@quicinc.com>
Link: https://lore.kernel.org/r/1638294648-844-2-git-send-email-quic_ppranees@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The time values used by the airtime fairness code only need to be accurate
enough to cover station activity detection.
Using ktime_get_coarse_boottime_ns instead of ktime_get_boottime_ns will
drop the accuracy down to jiffies intervals, but at the same time saves
a lot of CPU cycles in a hot path
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20211217114258.14619-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmG/rsoeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGLW8H/0sfRUxZNJ+DgO6W
0LpyIx1DF7S2x1wOQGl7VZsjdy+dtuOW4BtEoqhuQWgQ2sr0zbavbeyZYTXD1+d6
I1h5C6JMD138JkoqZNdeyWwsalAFtVItH+lzszUGKTxueTbKCJTkGMerBZVu/gVY
U3QYZHEkUH6o/2fINYfSMhEgyJpjVEUoaztB2drsC0tNUGeDjTmrhlRrWIKZ6Kn+
oAiMqsomSBF09TZ++GEqMYDTm39CwNDPJzKGyyOx86ydcg4ErWvjp1jfAUX0cvFG
dso2K5mDzTewdF8bW0A1KvsQaliMFpxgZ5CTblYQ6HvyltjRj+0B8207VzHQXBKV
YFZ8y9A=
=mfug
-----END PGP SIGNATURE-----
Merge 5.16-rc6 into tty-next
We need the tty/serial fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add process to validate flags of filter and actions when adding
a tc filter.
We need to prevent adding filter with flags conflicts with its actions.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add reoffload process to update hw_count when driver
is inserted or removed.
We will delete the action if it is with skip_sw flag and
not offloaded to any hardware in reoffload process.
When reoffloading actions, we still offload the actions
that are added independent of filters.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Save full action flags and return user flags when return flags to
user space.
Save full action flags to distinguish if the action is created
independent from classifier.
We made this change mainly for further patch to reoffload tc actions.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When collecting stats for actions update them using both
hardware and software counters.
Stats update process should not run in context of preempt_disable.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename exts stats update functions with hw for readability.
We make this change also to update stats from hw for an action
when it is offloaded to hw as a single action.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We add skip_hw and skip_sw for user to control if offload the action
to hardware.
We also add in_hw_count for user to indicate if the action is offloaded
to any hardware.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use flow_indr_dev_register/flow_indr_dev_setup_offload to
offload tc action.
We need to call tc_cleanup_flow_action to clean up tc action entry since
in tc_setup_action, some actions may hold dev refcnt, especially the mirror
action.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new ops to tc_action_ops for flow action setup.
Refactor function tc_setup_flow_action to use this new ops.
We make this change to facilitate to add standalone action module.
We will also use this ops to offload action independent of filter
in following patch.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To improves readability, we rename offload functions with offload instead
of flow.
The term flow is related to exact matches, so we rename these functions
with offload.
We make this change to facilitate single action offload functions naming.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add index to flow_action_entry structure and delete index from police and
gate child structure.
We make this change to offload tc action for driver to identify a tc
action.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fill flags to action structure to allow user control if
the action should be offloaded to hardware or not.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some helper functions may modify its arguments, for example,
bpf_d_path, bpf_get_stack etc. Previously, their argument types
were marked as ARG_PTR_TO_MEM, which is compatible with read-only
mem types, such as PTR_TO_RDONLY_BUF. Therefore it's legitimate,
but technically incorrect, to modify a read-only memory by passing
it into one of such helper functions.
This patch tags the bpf_args compatible with immutable memory with
MEM_RDONLY flag. The arguments that don't have this flag will be
only compatible with mutable memory types, preventing the helper
from modifying a read-only memory. The bpf_args that have
MEM_RDONLY are compatible with both mutable memory and immutable
memory.
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211217003152.48334-9-haoluo@google.com
This patch introduce a flag MEM_RDONLY to tag a reg value
pointing to read-only memory. It makes the following changes:
1. PTR_TO_RDWR_BUF -> PTR_TO_BUF
2. PTR_TO_RDONLY_BUF -> PTR_TO_BUF | MEM_RDONLY
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211217003152.48334-6-haoluo@google.com
We have introduced a new type to make bpf_reg composable, by
allocating bits in the type to represent flags.
One of the flags is PTR_MAYBE_NULL which indicates a pointer
may be NULL. This patch switches the qualified reg_types to
use this flag. The reg_types changed in this patch include:
1. PTR_TO_MAP_VALUE_OR_NULL
2. PTR_TO_SOCKET_OR_NULL
3. PTR_TO_SOCK_COMMON_OR_NULL
4. PTR_TO_TCP_SOCK_OR_NULL
5. PTR_TO_BTF_ID_OR_NULL
6. PTR_TO_MEM_OR_NULL
7. PTR_TO_RDONLY_BUF_OR_NULL
8. PTR_TO_RDWR_BUF_OR_NULL
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211217003152.48334-5-haoluo@google.com
The xdp_rxq_info_unreg() called by xdp_rxq_info_reg() is meaningless when
dev is NULL, so move the if dev statements to the first.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing cleanup routine implementation is not well synchronized
with the syscall routine. When a device is detaching, below race could
occur.
static int ax25_sendmsg(...) {
...
lock_sock()
ax25 = sk_to_ax25(sk);
if (ax25->ax25_dev == NULL) // CHECK
...
ax25_queue_xmit(skb, ax25->ax25_dev->dev); // USE
...
}
static void ax25_kill_by_device(...) {
...
if (s->ax25_dev == ax25_dev) {
s->ax25_dev = NULL;
...
}
Other syscall functions like ax25_getsockopt, ax25_getname,
ax25_info_show also suffer from similar races. To fix them, this patch
introduce lock_sock() into ax25_kill_by_device in order to guarantee
that the nullify action in cleanup routine cannot proceed when another
socket request is pending.
Signed-off-by: Hanjie Wu <nagi@zju.edu.cn>
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
entry->addr.id is u8 with a range from 0 to 255 and MAX_ADDR_ID is 255.
We should drop both false expressions of (entry->addr.id > MAX_ADDR_ID).
We should also remove the obsolete parentheses in the first if branch.
Use U8_MAX for MAX_ADDR_ID and add a comment to show the link to
mptcp_addr_info.id as suggested by Mr. Matthieu Baerts.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The MPTCP packet scheduler has sub-optimal behavior with asymmetric
subflows: if the faster subflow-level cwin is closed, the packet
scheduler can enqueue "too much" data on a slower subflow.
When all the data on the faster subflow is acked, if the mptcp-level
cwin is closed, and link utilization becomes suboptimal.
The solution is implementing blest-like[1] HoL-blocking estimation,
transmitting only on the subflow with the shorter estimated time to
flush the queued memory. If such subflows cwin is closed, we wait
even if other subflows are available.
This is quite simpler than the original blest implementation, as we
leverage the pacing rate provided by the TCP socket. To get a more
accurate estimation for the subflow linger-time, we maintain a
per-subflow weighted average of such info.
Additionally drop magic numbers usage in favor of newly defined
macros and use more meaningful names for status variable.
[1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/137
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zone id is not restored if we passed ct and ct rejected the connection,
as there is no ct info on the skb.
Save the zone from tc skb cb to tc skb extension and pass it on to
ovs, use that info to restore the zone id for invalid connections.
Fixes: d29334c15d ("net/sched: act_api: fix miss set post_ct for ovs after do conntrack in act_ct")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If ct rejects a flow, it removes the conntrack info from the skb.
act_ct sets the post_ct variable so the dissector will see this case
as an +tracked +invalid state, but the zone id is lost with the
conntrack info.
To restore the zone id on such cases, set the last executed zone,
via the tc control block, when passing ct, and read it back in the
dissector if there is no ct info on the skb (invalid connection).
Fixes: 7baf2429a1 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
BPF layer extends the qdisc control block via struct bpf_skb_data_end
and because of that there is no more room to add variables to the
qdisc layer control block without going over the skb->cb size.
Extend the qdisc control block with a tc control block,
and move all tc related variables to there as a pre-step for
extending the tc control block with additional members.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit bd0687c18e.
This patch causes a Tx only workload to go to sleep even when it does
not have to, leading to misserable performance in skb mode. It fixed
one rare problem but created a much worse one, so this need to be
reverted while I try to craft a proper solution to the original
problem.
Fixes: bd0687c18e ("xsk: Do not sleep in poll() when need_wakeup set")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211217145646.26449-1-magnus.karlsson@gmail.com
Add a new API "mhi_prepare_for_transfer_autoqueue" for using with client
drivers like QRTR to request MHI core to autoqueue buffers for the DL
channel along with starting both UL and DL channels.
So far, the "auto_queue" flag specified by the controller drivers in
channel definition served this purpose but this will be removed at some
point in future.
Cc: netdev@vger.kernel.org
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Co-developed-by: Loic Poulain <loic.poulain@linaro.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Link: https://lore.kernel.org/r/20211216081227.237749-9-manivannan.sadhasivam@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix UAF in set catch-all element, from Eric Dumazet.
2) Fix MAC mangling for multicast/loopback traffic in nfnetlink_queue
and nfnetlink_log, from Ignacy Gawędzki.
3) Remove expired entries from ctnetlink dump path regardless the tuple
direction, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm ineterface does not allow xfrm if_id = 0
fail to create or update xfrm state and policy.
With this commit:
ip xfrm policy add src 192.0.2.1 dst 192.0.2.2 dir out if_id 0
RTNETLINK answers: Invalid argument
ip xfrm state add src 192.0.2.1 dst 192.0.2.2 proto esp spi 1 \
reqid 1 mode tunnel aead 'rfc4106(gcm(aes))' \
0x1111111111111111111111111111111111111111 96 if_id 0
RTNETLINK answers: Invalid argument
v1->v2 change:
- add Fixes: tag
Fixes: 9f8550e4bd ("xfrm: fix disable_xfrm sysctl when used on xfrm interfaces")
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
xfrm interface if_id = 0 would cause xfrm policy lookup errors since
Commit 9f8550e4bd.
Now explicitly fail to create an xfrm interface when if_id = 0
With this commit:
ip link add ipsec0 type xfrm dev lo if_id 0
Error: if_id must be non zero.
v1->v2 change:
- add Fixes: tag
Fixes: 9f8550e4bd ("xfrm: fix disable_xfrm sysctl when used on xfrm interfaces")
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Current release - regressions:
- dpaa2-eth: fix buffer overrun when reporting ethtool statistics
Current release - new code bugs:
- bpf: fix incorrect state pruning for <8B spill/fill
- iavf:
- add missing unlocks in iavf_watchdog_task()
- do not override the adapter state in the watchdog task (again)
- mlxsw: spectrum_router: consolidate MAC profiles when possible
Previous releases - regressions:
- mac80211, fix:
- rate control, avoid driver crash for retransmitted frames
- regression in SSN handling of addba tx
- a memory leak where sta_info is not freed
- marking TX-during-stop for TX in in_reconfig, prevent stall
- cfg80211: acquire wiphy mutex on regulatory work
- wifi drivers: fix build regressions and LED config dependency
- virtio_net: fix rx_drops stat for small pkts
- dsa: mv88e6xxx: unforce speed & duplex in mac_link_down()
Previous releases - always broken:
- bpf, fix:
- kernel address leakage in atomic fetch
- kernel address leakage in atomic cmpxchg's r0 aux reg
- signed bounds propagation after mov32
- extable fixup offset
- extable address check
- mac80211:
- fix the size used for building probe request
- send ADDBA requests using the tid/queue of the aggregation
session
- agg-tx: don't schedule_and_wake_txq() under sta->lock,
avoid deadlocks
- validate extended element ID is present
- mptcp:
- never allow the PM to close a listener subflow (null-defer)
- clear 'kern' flag from fallback sockets, prevent crash
- fix deadlock in __mptcp_push_pending()
- inet_diag: fix kernel-infoleak for UDP sockets
- xsk: do not sleep in poll() when need_wakeup set
- smc: avoid very long waits in smc_release()
- sch_ets: don't remove idle classes from the round-robin list
- netdevsim:
- zero-initialize memory for bpf map's value, prevent info leak
- don't let user space overwrite read only (max) ethtool parms
- ixgbe: set X550 MDIO speed before talking to PHY
- stmmac:
- fix null-deref in flower deletion w/ VLAN prio Rx steering
- dwmac-rk: fix oob read in rk_gmac_setup
- ice: time stamping fixes
- systemport: add global locking for descriptor life cycle
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmG7rdUACgkQMUZtbf5S
IrtRvw//etsgeg2+zxe+fBSbe7ZihcCB4yzWUoRDdNzPrLNLsnWxKT1wYblDcZft
b1f/SpTy9ycfg+fspn2qET8gzydn4m9xHkjmlQPzmXB9tdIDF6mECFTAXYlar1hQ
RQIijpfZYyrZeGdgHpsyq72YC4dpNdbZrxmQFVdpMr3cK8P2N0Dn32bBVa//+jb+
LCv3Uw9C0yNbqhtRIiukkWIE20+/pXtKm0uErDVmvonqFMWPo6mYD0C2PwC20PwR
Kv5ok6jH+44fCSwDoLChbB+Wes0AtrIQdUvUwXGXaF3MDfZl+24oLkX5xJl3EHWT
90Mh0k0NhRORgBZ3NItwK7OliohrRHCYxlAXPjg1Dicxl+kxl0wPlva8v64eAA+u
ZhwXwaQpCrZNdKoxHJw9kQ/CmbggtxcWkVolbZp3TzDjYY1E7qxuwg51YMhGmGT1
FPjradYGvHKi+thizJiEdiZaMKRc8bpaL0hbpROxFQvfjNwFOwREQhtnXYP3W5Kd
lK88fWaH86dxqL+ABvbrMnSZKuNlSL8R/CROWpZuF+vyLRXaxhAvYRrL79bgmkKq
zvImnh1mFovdyKGJhibFMdy92X14z8FzoyX3VQuFcl9EB+2NQXnNZ6abDLJlufZX
A0jQ5r46Ce/yyaXXmS61PrP7Pf5sxhs/69fqAIDQfSSzpyUKHd4=
=VIbd
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from mac80211, wifi, bpf.
Relatively large batches of fixes from BPF and the WiFi stack, calm in
general networking.
Current release - regressions:
- dpaa2-eth: fix buffer overrun when reporting ethtool statistics
Current release - new code bugs:
- bpf: fix incorrect state pruning for <8B spill/fill
- iavf:
- add missing unlocks in iavf_watchdog_task()
- do not override the adapter state in the watchdog task (again)
- mlxsw: spectrum_router: consolidate MAC profiles when possible
Previous releases - regressions:
- mac80211 fixes:
- rate control, avoid driver crash for retransmitted frames
- regression in SSN handling of addba tx
- a memory leak where sta_info is not freed
- marking TX-during-stop for TX in in_reconfig, prevent stall
- cfg80211: acquire wiphy mutex on regulatory work
- wifi drivers: fix build regressions and LED config dependency
- virtio_net: fix rx_drops stat for small pkts
- dsa: mv88e6xxx: unforce speed & duplex in mac_link_down()
Previous releases - always broken:
- bpf fixes:
- kernel address leakage in atomic fetch
- kernel address leakage in atomic cmpxchg's r0 aux reg
- signed bounds propagation after mov32
- extable fixup offset
- extable address check
- mac80211:
- fix the size used for building probe request
- send ADDBA requests using the tid/queue of the aggregation
session
- agg-tx: don't schedule_and_wake_txq() under sta->lock, avoid
deadlocks
- validate extended element ID is present
- mptcp:
- never allow the PM to close a listener subflow (null-defer)
- clear 'kern' flag from fallback sockets, prevent crash
- fix deadlock in __mptcp_push_pending()
- inet_diag: fix kernel-infoleak for UDP sockets
- xsk: do not sleep in poll() when need_wakeup set
- smc: avoid very long waits in smc_release()
- sch_ets: don't remove idle classes from the round-robin list
- netdevsim:
- zero-initialize memory for bpf map's value, prevent info leak
- don't let user space overwrite read only (max) ethtool parms
- ixgbe: set X550 MDIO speed before talking to PHY
- stmmac:
- fix null-deref in flower deletion w/ VLAN prio Rx steering
- dwmac-rk: fix oob read in rk_gmac_setup
- ice: time stamping fixes
- systemport: add global locking for descriptor life cycle"
* tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (89 commits)
bpf, selftests: Fix racing issue in btf_skc_cls_ingress test
selftest/bpf: Add a test that reads various addresses.
bpf: Fix extable address check.
bpf: Fix extable fixup offset.
bpf, selftests: Add test case trying to taint map value pointer
bpf: Make 32->64 bounds propagation slightly more robust
bpf: Fix signed bounds propagation after mov32
sit: do not call ipip6_dev_free() from sit_init_net()
net: systemport: Add global locking for descriptor lifecycle
net/smc: Prevent smc_release() from long blocking
net: Fix double 0x prefix print in SKB dump
virtio_net: fix rx_drops stat for small pkts
dsa: mv88e6xxx: fix debug print for SPEED_UNFORCED
sfc_ef100: potential dereference of null pointer
net: stmmac: dwmac-rk: fix oob read in rk_gmac_setup
net: usb: lan78xx: add Allied Telesis AT29M2-AF
net/packet: rx_owner_map depends on pg_vec
netdevsim: Zero-initialize memory for new map's value in function nsim_bpf_map_alloc
dpaa2-eth: fix ethtool statistics
ixgbe: set X550 MDIO speed before talking to PHY
...
We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency,
make sure those who actually need more than the definition of
struct cgroup_bpf include bpf-cgroup.h explicitly.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20211216025538.1649516-3-kuba@kernel.org
Daniel Borkmann says:
====================
pull-request: bpf 2021-12-16
We've added 15 non-merge commits during the last 7 day(s) which contain
a total of 12 files changed, 434 insertions(+), 30 deletions(-).
The main changes are:
1) Fix incorrect verifier state pruning behavior for <8B register spill/fill,
from Paul Chaignon.
2) Fix x86-64 JIT's extable handling for fentry/fexit when return pointer
is an ERR_PTR(), from Alexei Starovoitov.
3) Fix 3 different possibilities that BPF verifier missed where unprivileged
could leak kernel addresses, from Daniel Borkmann.
4) Fix xsk's poll behavior under need_wakeup flag, from Magnus Karlsson.
5) Fix an oob-write in test_verifier due to a missed MAX_NR_MAPS bump,
from Kumar Kartikeya Dwivedi.
6) Fix a race in test_btf_skc_cls_ingress selftest, from Martin KaFai Lau.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, selftests: Fix racing issue in btf_skc_cls_ingress test
selftest/bpf: Add a test that reads various addresses.
bpf: Fix extable address check.
bpf: Fix extable fixup offset.
bpf, selftests: Add test case trying to taint map value pointer
bpf: Make 32->64 bounds propagation slightly more robust
bpf: Fix signed bounds propagation after mov32
bpf, selftests: Update test case for atomic cmpxchg on r0 with pointer
bpf: Fix kernel address leakage in atomic cmpxchg's r0 aux reg
bpf, selftests: Add test case for atomic fetch on spilled pointer
bpf: Fix kernel address leakage in atomic fetch
selftests/bpf: Fix OOB write in test_verifier
xsk: Do not sleep in poll() when need_wakeup set
selftests/bpf: Tests for state pruning with u32 spill/fill
bpf: Fix incorrect state pruning for <8B spill/fill
====================
Link: https://lore.kernel.org/r/20211216210005.13815-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In nginx/wrk benchmark, there's a hung problem with high probability
on case likes that: (client will last several minutes to exit)
server: smc_run nginx
client: smc_run wrk -c 10000 -t 1 http://server
Client hangs with the following backtrace:
0 [ffffa7ce8Of3bbf8] __schedule at ffffffff9f9eOd5f
1 [ffffa7ce8Of3bc88] schedule at ffffffff9f9eløe6
2 [ffffa7ce8Of3bcaO] schedule_timeout at ffffffff9f9e3f3c
3 [ffffa7ce8Of3bd2O] wait_for_common at ffffffff9f9el9de
4 [ffffa7ce8Of3bd8O] __flush_work at ffffffff9fOfeOl3
5 [ffffa7ce8øf3bdfO] smc_release at ffffffffcO697d24 [smc]
6 [ffffa7ce8Of3be2O] __sock_release at ffffffff9f8O2e2d
7 [ffffa7ce8Of3be4ø] sock_close at ffffffff9f8ø2ebl
8 [ffffa7ce8øf3be48] __fput at ffffffff9f334f93
9 [ffffa7ce8Of3be78] task_work_run at ffffffff9flOlff5
10 [ffffa7ce8Of3beaO] do_exit at ffffffff9fOe5Ol2
11 [ffffa7ce8Of3bflO] do_group_exit at ffffffff9fOe592a
12 [ffffa7ce8Of3bf38] __x64_sys_exit_group at ffffffff9fOe5994
13 [ffffa7ce8Of3bf4O] do_syscall_64 at ffffffff9f9d4373
14 [ffffa7ce8Of3bfsO] entry_SYSCALL_64_after_hwframe at ffffffff9fa0007c
This issue dues to flush_work(), which is used to wait for
smc_connect_work() to finish in smc_release(). Once lots of
smc_connect_work() was pending or all executing work dangling,
smc_release() has to block until one worker comes to free, which
is equivalent to wait another smc_connnect_work() to finish.
In order to fix this, There are two changes:
1. For those idle smc_connect_work(), cancel it from the workqueue; for
executing smc_connect_work(), waiting for it to finish. For that
purpose, replace flush_work() with cancel_work_sync().
2. Since smc_connect() hold a reference for passive closing, if
smc_connect_work() has been cancelled, release the reference.
Fixes: 24ac3a08e6 ("net/smc: rebuild nonblocking connect")
Reported-by: Tony Lu <tonylu@linux.alibaba.com>
Tested-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/1639571361-101128-1-git-send-email-alibuda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that there is only one fib nla_policy there is no need to
keep the macro around. Place it where its used.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The attributes are identical in all implementations so move the ipv4 one
into the core and remove the per-family nla policies.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When dumping conntrack table to userspace via ctnetlink, check if the ct has
already expired before doing any of the 'skip' checks.
This expires dead entries faster.
/proc handler also removes outdated entries first.
Reported-by: Vitaly Zuevsky <vzuevsky@ns1.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If compiled with CONFIG_NET_NS_REFCNT_TRACKER=y,
using put_net_track() in iterate_cleanup_work()
and netns_tracker_alloc() in nf_nat_masq_schedule()
might help us finding netns refcount imbalances.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If compiled with CONFIG_NET_NS_REFCNT_TRACKER=y,
using put_net_track() in nfulnl_instance_free_rcu()
and get_net_track() in instance_create()
might help us finding netns refcount imbalances.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When printing netdev features %pNF already takes care of the 0x prefix,
remove the explicit one.
Fixes: 6413139dfc ("skbuff: increase verbosity when dumping skb data")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packet sockets may switch ring versions. Avoid misinterpreting state
between versions, whose fields share a union. rx_owner_map is only
allocated with a packet ring (pg_vec) and both are swapped together.
If pg_vec is NULL, meaning no packet ring was allocated, then neither
was rx_owner_map. And the field may be old state from a tpacket_v3.
Fixes: 61fad6816f ("net/packet: tpacket_rcv: avoid a producer race condition")
Reported-by: Syzbot <syzbot+1ac0994a0a0c55151121@syzkaller.appspotmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211215143937.106178-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, mostly
rather small housekeeping patches:
1) Remove unused variable in IPVS, from GuoYong Zheng.
2) Use memset_after in conntrack, from Kees Cook.
3) Remove leftover function in nfnetlink_queue, from Florian Westphal.
4) Remove redundant test on bool in conntrack, from Bernard Zhao.
5) egress support for nft_fwd, from Lukas Wunner.
6) Make pppoe work for br_netfilter, from Florian Westphal.
7) Remove unused variable in conntrack resize routine, from luo penghao.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
netfilter: conntrack: Remove useless assignment statements
netfilter: bridge: add support for pppoe filtering
netfilter: nft_fwd_netdev: Support egress hook
netfilter: ctnetlink: remove useless type conversion to bool
netfilter: nf_queue: remove leftover synchronize_rcu
netfilter: conntrack: Use memset_startat() to zero struct nf_conn
ipvs: remove unused variable for ip_vs_new_dest
====================
Link: https://lore.kernel.org/r/20211215234911.170741-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The old_size assignment here will not be used anymore
The clang_analyzer complains as follows:
Value stored to 'old_size' is never read
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In commit 5648b5e116 ("netfilter: nfnetlink_queue: fix OOB when mac
header was cleared"), the test for non-empty MAC header introduced in
commit 2c38de4c1f ("netfilter: fix looped (broad|multi)cast's MAC
handling") has been replaced with a test for a set MAC header.
This breaks the case when the MAC header has been reset (using
skb_reset_mac_header), as is the case with looped-back multicast
packets. As a result, the packets ending up in NFQUEUE get a bogus
hwaddr interpreted from the first bytes of the IP header.
This patch adds a test for a non-empty MAC header in addition to the
test for a set MAC header. The same two tests are also implemented in
nfnetlink_log.c, where the initial code of commit 2c38de4c1f
("netfilter: fix looped (broad|multi)cast's MAC handling") has not been
touched, but where supposedly the same situation may happen.
Fixes: 5648b5e116 ("netfilter: nfnetlink_queue: fix OOB when mac header was cleared")
Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 0976b888a1 ("ethtool: fix null-ptr-deref on ref tracker")
made the write to req_info.dev conditional, but as Eric points out
in a different follow up the structure is often allocated on the
stack and not kzalloc()'d so seems safer to always write the dev,
in case it's garbage on input.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most notable changes are in af_packet, tipc ones are trivial.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems I missed that most ethnl_parse_header_dev_get() callers
declare an on-stack struct ethnl_req_info, and that they simply call
dev_put(req_info.dev) when about to return.
Add ethnl_parse_header_dev_put() helper to properly untrack
reference taken by ethnl_parse_header_dev_get().
Fixes: e4b8954074 ("netlink: add net device refcount tracker to struct ethnl_req_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mptcp ULP extension relies on sk->sk_sock_kern being set correctly:
It prevents setsockopt(fd, IPPROTO_TCP, TCP_ULP, "mptcp", 6); from
working for plain tcp sockets (any userspace-exposed socket).
But in case of fallback, accept() can return a plain tcp sk.
In such case, sk is still tagged as 'kernel' and setsockopt will work.
This will crash the kernel, The subflow extension has a NULL ctx->conn
mptcp socket:
BUG: KASAN: null-ptr-deref in subflow_data_ready+0x181/0x2b0
Call Trace:
tcp_data_ready+0xf8/0x370
[..]
Fixes: cf7da0d66c ("mptcp: Create SUBFLOW socket for incoming connections")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP_ULP setsockopt cannot be used for mptcp because its already
used internally to plumb subflow (tcp) sockets to the mptcp layer.
syzbot managed to trigger a crash for mptcp connections that are
in fallback mode:
KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
CPU: 1 PID: 1083 Comm: syz-executor.3 Not tainted 5.16.0-rc2-syzkaller #0
RIP: 0010:tls_build_proto net/tls/tls_main.c:776 [inline]
[..]
__tcp_set_ulp net/ipv4/tcp_ulp.c:139 [inline]
tcp_set_ulp+0x428/0x4c0 net/ipv4/tcp_ulp.c:160
do_tcp_setsockopt+0x455/0x37c0 net/ipv4/tcp.c:3391
mptcp_setsockopt+0x1b47/0x2400 net/mptcp/sockopt.c:638
Remove support for TCP_ULP setsockopt.
Fixes: d9e4c12918 ("mptcp: only admit explicitly supported sockopt")
Reported-by: syzbot+1fd9b69cde42967d1add@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Do not sleep in poll() when the need_wakeup flag is set. When this
flag is set, the application needs to explicitly wake up the driver
with a syscall (poll, recvmsg, sendmsg, etc.) to guarantee that Rx
and/or Tx processing will be processed promptly. But the current code
in poll(), sleeps first then wakes up the driver. This means that no
driver processing will occur (baring any interrupts) until the timeout
has expired.
Fix this by checking the need_wakeup flag first and if set, wake the
driver and return to the application. Only if need_wakeup is not set
should the process sleep if there is a timeout set in the poll() call.
Fixes: 77cd0d7b3f ("xsk: add support for need_wakeup flag in AF_XDP rings")
Reported-by: Keith Wiles <keith.wiles@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20211214102607.7677-1-magnus.karlsson@gmail.com
__rds_conn_create() did not release conn->c_path when loop_trans != 0 and
trans->t_prefer_loopback != 0 and is_outgoing == 0.
Fixes: aced3ce57c ("RDS tcp loopback connection can hang")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use min() in order to make code cleaner. Issue found by coccinelle.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* fix a station info memory leak on insert collisions
* a rate control fix for retransmissions
* two aggregation setup fixes
* reload current regdomain when reloading database
* a locking fix in regulatory work
* a probe request allocation size fix in mac80211
* apply TCP vs. aggregation (sk pacing) on mesh
* fix ordering of channel context update vs. station
state
* set up skb->dev for mesh forwarding properly
* track QoS data frames only for admission control to
avoid out-of-bounds read (found by syzbot)
* validate extended element ID vs. existing data to
avoid out-of-bounds read (found by syzbot)
* fix locking in mac80211 aggregation TX setup
* fix traffic stall after HW restart when TXQs are used
* fix ordering of reconfig/restart after HW restart
* fix interface type for extended aggregation capability
lookup
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmG4dWoACgkQB8qZga/f
l8Sorw//TtgV3lufyv4i0aYKzARw/V7vhDkt+FdPNbTKb9p4jkKkfwMyQttmqJrt
6yjjFZEiY7Z2ZzZ16k+EkNFmpV41DCmJTqRLg/1D+FsDlIIOWBsOifqUdnudLWXR
WboegggjMZHaty6ebpA93exh5K6c058JY5YJRpP+I/0RDtOSEiv5gI1akxnRQsz4
/r9RsHy9+2zzyQC37/QBfljHPXAtF7UYRENUqPbc7wMFtX1I0/iWEVIXiHu5NC+g
bVjZSEooChjpfHamle9paamfWIqE6yNFcwlTBH2vyuAupPAFJEIioCwLQjU4ACzx
8MZFlV8LZNRF1MP3owdOBPG+puPSRGgLa/3vvDybRTKWxJ2VZFoCXiEDVF7W8mJj
GGCFkp268unSnLvRBXFRy3tw2vAaMcfeHx6aAUNKF8FQGGy9efoQfNVPNConbhld
NlPkMa0I/kxLOBX8uJHtLekNb49EsWAs7NUGnUucXlxEAvp3jdcqguksbYmVukQT
WHwlnIV929chOrT6zO3mSzCZW7ZeS7qAmVINzkKHP/7XGIzBbsRq5K4nie9Inoc+
EV9E+F+a1Drm8VksKvVGqbqmvhh+uZrBEKo+4Auaz0+/PZyo6uBQZHn2G3tGl42f
iNUDEaNtkVqFA8q8bl6jEdj2lM1YtLtaP9kDKkg6IH4ZWv8g+Ps=
=4LtY
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-net-2021-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A fairly large number of fixes this time:
* fix a station info memory leak on insert collisions
* a rate control fix for retransmissions
* two aggregation setup fixes
* reload current regdomain when reloading database
* a locking fix in regulatory work
* a probe request allocation size fix in mac80211
* apply TCP vs. aggregation (sk pacing) on mesh
* fix ordering of channel context update vs. station
state
* set up skb->dev for mesh forwarding properly
* track QoS data frames only for admission control to
avoid out-of-bounds read (found by syzbot)
* validate extended element ID vs. existing data to
avoid out-of-bounds read (found by syzbot)
* fix locking in mac80211 aggregation TX setup
* fix traffic stall after HW restart when TXQs are used
* fix ordering of reconfig/restart after HW restart
* fix interface type for extended aggregation capability
lookup
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
On the NXP Bluebox 3 board which uses a multi-switch setup with sja1105,
the mechanism through which the tagger connects to the switch tree is
broken, due to improper DSA code design. At the time when tag_ops->connect()
is called in dsa_port_parse_cpu(), DSA hasn't finished "touching" all
the ports, so it doesn't know how large the tree is and how many ports
it has. It has just seen the first CPU port by this time. As a result,
this function will call the tagger's ->connect method too early, and the
tagger will connect only to the first switch from the tree.
This could be perhaps addressed a bit more simply by just moving the
tag_ops->connect(dst) call a bit later (for example in dsa_tree_setup),
but there is already a design inconsistency at present: on the switch
side, the notification is on a per-switch basis, but on the tagger side,
it is on a per-tree basis. Furthermore, the persistent storage itself is
per switch (ds->tagger_data). And the tagger connect and disconnect
procedures (at least the ones that exist currently) could see a fair bit
of simplification if they didn't have to iterate through the switches of
a tree.
To fix the issue, this change transforms tag_ops->connect(dst) into
tag_ops->connect(ds) and moves it somewhere where we already iterate
over all switches of a tree. That is in dsa_switch_setup_tag_protocol(),
which is a good placement because we already have there the connection
call to the switch side of things.
As for the dsa_tree_bind_tag_proto() method (called from the code path
that changes the tag protocol), things are a bit more complicated
because we receive the tree as argument, yet when we unwind on errors,
it would be nice to not call tag_ops->disconnect(ds) where we didn't
previously call tag_ops->connect(ds). We didn't have this problem before
because the tag_ops connection operations passed the entire dst before,
and this is more fine grained now. To solve the error rewind case using
the new API, we have to create yet one more cross-chip notifier for
disconnection, and stay connected with the old tag protocol to all the
switches in the tree until we've succeeded to connect with the new one
as well. So if something fails half way, the whole tree is still
connected to the old tagger. But there may still be leaks if the tagger
fails to connect to the 2nd out of 3 switches in a tree: somebody needs
to tell the tagger to disconnect from the first switch. Nothing comes
for free, and this was previously handled privately by the tagging
protocol driver before, but now we need to emit a disconnect cross-chip
notifier for that, because DSA has to take care of the unwind path. We
assume that the tagging protocol has connected to a switch if it has set
ds->tagger_data to something, otherwise we avoid calling its
disconnection method in the error rewind path.
The rest of the changes are in the tagging protocol drivers, and have to
do with the replacement of dst with ds. The iteration is removed and the
error unwind path is simplified, as mentioned above.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The method was meant to zeroize ds->tagger_data but got the wrong
pointer. Fix this.
Fixes: c79e84866d ("net: dsa: tag_sja1105: convert to tagger-owned data")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev can be a NULL here, not all requests set require_dev.
Fixes: e4b8954074 ("netlink: add net device refcount tracker to struct ethnl_req_info")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the order of arguments and make qdisc_is_running() appear first.
This is more readable for the general case.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to return EOPNOTSUPP for the unsupported mpls action type when
setup the flow action.
In the original implement, we will return 0 for the unsupported mpls
action type, actually we do not setup it and the following actions
to the flow action entry.
Fixes: 9838b20a7f ("net: sched: take rtnl lock in tc_setup_flow_action()")
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 94dd016ae5 ("bond: pass get_ts_info and SIOC[SG]HWTSTAMP
ioctl to active device") the user could get bond active interface's
PHC index directly. But when there is a failover, the bond active
interface will change, thus the PHC index is also changed. This may
break the user's program if they did not update the PHC timely.
This patch adds a new hwtstamp_config flag HWTSTAMP_FLAG_BONDED_PHC_INDEX.
When the user wants to get the bond active interface's PHC, they need to
add this flag and be aware the PHC index may be changed.
With the new flag. All flag checks in current drivers are removed. Only
the checking in net_hwtstamp_validate() is kept.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we reconfigure, the driver might do some things to complete
the reconfiguration. It's strange and could be broken in some
cases because we restart other works (e.g. remain-on-channel and
TX) before this happens, yet only start queues later.
Change this to do the reconfig complete when reconfiguration is
actually complete, not when we've already started doing other
things again.
For iwlwifi, this should fix a race where the reconfig can race
with TX, for ath10k and ath11k that also use this it won't make
a difference because they just start queues there, and mac80211
also stopped the queues and will restart them later as before.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211129152938.cab99f22fe19.Iefe494687f15fd85f77c1b989d1149c8efdfdc36@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Mark TXQs as having seen transmit while they were stopped if
we bail out of drv_wake_tx_queue() due to reconfig, so that
the queue wake after this will make them catch up. This is
particularly necessary for when TXQs are used for management
packets since those TXQs won't see a lot of traffic that'd
make them catch up later.
Cc: stable@vger.kernel.org
Fixes: 4856bfd230 ("mac80211: do not call driver wake_tx_queue op during reconfig")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211129152938.4573a221c0e1.I0d1d5daea3089be3fc0dccc92991b0f8c5677f0c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently channel context is updated only after station got an update about
new assoc state, this results in station using the old channel context.
Fix this by moving the update channel context before updating station,
enabling the driver to immediately use the updated channel context in
the new assoc state.
Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211129152938.1c80c17ffd8a.I94ae31378b363c1182cfdca46c4b7e7165cff984@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We should be doing the HE capabilities lookup based on the full
interface type so if P2P doesn't have HE but client has it doesn't
get confused. Fix that.
Fixes: 2ab4587675 ("mac80211: add support for the ADDBA extension element")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211129152938.010fc1d61137.If3a468145f29d670cb00a693bed559d8290ba693@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The function cfg80211_reg_can_beacon_relax() expects wiphy
mutex to be held when it is being called. However, when
reg_leave_invalid_chans() is called the mutex is not held.
Fix it by acquiring the lock before calling the function.
Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211202152831.527686cda037.I40ad9372a47cbad53b4aae7b5a6ccc0dc3fddf8b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we call ieee80211_agg_start_txq(), that will in turn call
schedule_and_wake_txq(). Called from ieee80211_stop_tx_ba_cb()
this is done under sta->lock, which leads to certain circular
lock dependencies, as reported by Chris Murphy:
https://lore.kernel.org/r/CAJCQCtSXJ5qA4bqSPY=oLRMbv-irihVvP7A2uGutEbXQVkoNaw@mail.gmail.com
In general, ieee80211_agg_start_txq() is usually not called
with sta->lock held, only in this one place. But it's always
called with sta->ampdu_mlme.mtx held, and that's therefore
clearly sufficient.
Change ieee80211_stop_tx_ba_cb() to also call it without the
sta->lock held, by factoring it out of ieee80211_remove_tid_tx()
(which is only called in this one place).
This breaks the locking chain and makes it less likely that
we'll have similar locking chain problems in the future.
Fixes: ba8c3d6f16 ("mac80211: add an intermediate software queue implementation")
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20211202152554.f519884c8784.I555fef8e67d93fff3d9a304886c4a9f8b322e591@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This removes the previously unused reload flag, which was introduced in
1eda919126.
The request is handled as NL80211_REGDOM_SET_BY_CORE, which is parsed
unconditionally.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Fixes: 1eda919126 ("nl80211: reset regdom when reloading regdb")
Link: https://lore.kernel.org/all/YaZuKYM5bfWe2Urn@archlinux-ax161/
Signed-off-by: Finn Behrens <me@kloenk.de>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/YadvTolO8rQcNCd/@gimli.kloenk.dev
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sending them out on a different queue can cause a race condition where a
number of packets in the queue may be discarded by the receiver, because
the ADDBA request is sent too early.
This affects any driver with software A-MPDU setup which does not allocate
packet seqno in hardware on tx, regardless of whether iTXQ is used or not.
The only driver I've seen that explicitly deals with this issue internally
is mwl8k.
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20211202124533.80388-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Misc bugfixes.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmGxGLEPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpPWsIALBdp+6ZQEzQXOpqiHGqDenaXZkUkcdzjvgC
2jtbxHu9+iZOmzLgtrKK6T9ky7gewucqhax80lvN+5wzZK/siVtmidJCi+f8MihN
g2ePnOk5LC3KAetCNBMYlRlrLQuH8raLuec3CroJLIfmZEx2rPNbp/CmXfj7BBlc
XcEjP3pvsJehVY7Em/EbTOQhln/fOzHXjimqM7vx+CJQ7Vuvwwm60UvlpwJhPaVk
c8n4rYrddYQaLPpQvedhO3xGQ1fHSx94iA4QmMRB9Q6oeNPJ+I9eQsm+na8cAnga
WePmcJqIiB9NnE67otr8Z2nOboX9x79DgVxCNzwuclH6qoVzJeM=
=jIiu
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Misc virtio and vdpa bugfixes"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa: Consider device id larger than 31
virtio/vsock: fix the transport to work with VMADDR_CID_ANY
virtio_ring: Fix querying of maximum DMA mapping size for virtio device
virtio: always enter drivers/virtio/
vduse: check that offset is within bounds in get_config()
vdpa: check that offsets are within bounds
vduse: fix memory corruption in vduse_dev_ioctl()
In non trivial scenarios, the action id alone is not sufficient to
identify the program causing the warning. Before the previous patch,
the generated stack-trace pointed out at least the involved device
driver.
Let's additionally include the program name and id, and the relevant
device name.
If the user needs additional infos, he can fetch them via a kernel
probe, leveraging the arguments added here.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com
The WARN_ONCE() in bpf_warn_invalid_xdp_action() can be triggered by
any bugged program, and even attaching a correct program to a NIC
not supporting the given action.
The resulting splat, beyond polluting the logs, fouls automated tools:
e.g. a syzkaller reproducers using an XDP program returning an
unsupported action will never pass validation.
Replace the WARN_ONCE with a less intrusive pr_warn_once().
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/016ceec56e4817ebb2a9e35ce794d5c917df572c.1638189075.git.pabeni@redhat.com
I'm about to add more information to the server-side SUNRPC
tracepoints, so I'm going to offset the increased trace log
consumption by getting rid of some tracepoints that fire frequently
but don't offer much value.
trace_svc_xprt_received() was useful for debugging, perhaps, but
is not generally informative.
trace_svc_handle_xprt() reports largely the same information as
trace_svc_xdr_recvfrom().
As a clean-up, rename trace_svc_xprt_do_enqueue() to match
svc_xprt_dequeue().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_set_num_threads() does everything that lockd_start_svc() does, except
set sv_maxconn. It also (when passed 0) finds the threads and
stops them with kthread_stop().
So move the setting for sv_maxconn, and use svc_set_num_thread()
We now don't need nlmsvc_task.
Now that we use svc_set_num_threads() it makes sense to set svo_module.
This request that the thread exists with module_put_and_exit().
Also fix the documentation for svo_module to make this explicit.
svc_prepare_thread is now only used where it is defined, so it can be
made static.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently 'pooled' services hold a reference on the pool_map, and
'unpooled' services do not.
svc_destroy() uses the presence of ->svo_function (via
svc_serv_is_pooled()) to determine if the reference should be dropped.
There is no direct correlation between being pooled and the use of
svo_function, though in practice, lockd is the only non-pooled service,
and the only one not to use svo_function.
This is untidy and would cause problems if we changed lockd to use
svc_set_num_threads(), which requires the use of ->svo_function.
So change the test for "is the service pooled" to "is sv_nrpools > 1".
This means that when svc_pool_map_get() returns 1, it must NOT take a
reference to the pool.
We discard svc_serv_is_pooled(), and test sv_nrpools directly.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
These definitions are not used outside of svc.c, and there is no
evidence that they ever have been. So move them into svc.c
and make the declarations 'static'.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The ->svo_setup callback serves no purpose. It is always called from
within the same module that chooses which callback is needed. So
discard it and call the relevant function directly.
Now that svc_set_num_threads() is no longer used remove it and rename
svc_set_num_threads_sync() to remove the "_sync" suffix.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Using sv_lock means we don't need to hold the service mutex over these
updates.
In particular, svc_exit_thread() no longer requires synchronisation, so
threads can exit asynchronously.
Note that we could use an atomic_t, but as there are many more read
sites than writes, that would add unnecessary noise to the code.
Some reads are already racy, and there is no need for them to not be.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The use of sv_nrthreads as a general refcount results in clumsy code, as
is seen by various comments needed to explain the situation.
This patch introduces a 'struct kref' and uses that for reference
counting, leaving sv_nrthreads to be a pure count of threads. The kref
is managed particularly in svc_get() and svc_put(), and also nfsd_put();
svc_destroy() now takes a pointer to the embedded kref, rather than to
the serv.
nfsd allows the svc_serv to exist with ->sv_nrhtreads being zero. This
happens when a transport is created before the first thread is started.
To support this, a 'keep_active' flag is introduced which holds a ref on
the svc_serv. This is set when any listening socket is successfully
added (unless there are running threads), and cleared when the number of
threads is set. So when the last thread exits, the nfs_serv will be
destroyed.
The use of 'keep_active' replaces previous code which checked if there
were any permanent sockets.
We no longer clear ->rq_server when nfsd() exits. This was done
to prevent svc_exit_thread() from calling svc_destroy().
Instead we take an extra reference to the svc_serv to prevent
svc_destroy() from being called.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_destroy() is poorly named - it doesn't necessarily destroy the svc,
it might just reduce the ref count.
nfsd_destroy() is poorly named for the same reason.
This patch:
- removes the refcount functionality from svc_destroy(), moving it to
a new svc_put(). Almost all previous callers of svc_destroy() now
call svc_put().
- renames nfsd_destroy() to nfsd_put() and improves the code, using
the new svc_destroy() rather than svc_put()
- removes a few comments that explain the important for balanced
get/put calls. This should be obvious.
The only non-trivial part of this is that svc_destroy() would call
svc_sock_update() on a non-final decrement. It can no longer do that,
and svc_put() isn't really a good place of it. This call is now made
from svc_exit_thread() which seems like a good place. This makes the
call *before* sv_nrthreads is decremented rather than after. This
is not particularly important as the call just sets a flag which
causes sv_nrthreads set be checked later. A subsequent patch will
improve the ordering.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Update module_put_and_exit to call kthread_exit instead of do_exit.
Change the name to reflect this change in functionality. All of the
users of module_put_and_exit are causing the current kthread to exit
so this change makes it clear what is happening. There is no
functional change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The root-lock is dropped before dev_hard_start_xmit() is invoked and after
setting the __QDISC___STATE_RUNNING bit. If the Qdisc owner is preempted
by another sender/task with a higher priority then this new sender won't
be able to submit packets to the NIC directly instead they will be
enqueued into the Qdisc. The NIC will remain idle until the Qdisc owner
is scheduled again and finishes the job.
By serializing every task on the ->busylock then the task will be
preempted by a sender only after the Qdisc has no owner.
Always serialize on the busylock on PREEMPT_RT.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables the "/proc/sys/net/unix/max_dgram_qlen" sysctl to be
exposed to non-init user namespaces. max_dgram_qlen is used as the default
"sk_max_ack_backlog" value for when a unix socket is created.
Currently, when a networking namespace is initialized, its unix sysctls
are exposed only if the user namespace that "owns" it is the init user
namespace. If there is an non-init user namespace that "owns" a networking
namespace (for example, in the case after we call clone() with both
CLONE_NEWUSER and CLONE_NEWNET set), the sysctls are hidden from view
and not configurable.
Exposing the unix sysctl is safe because any changes made to it will be
limited in scope to the networking namespace the non-init user namespace
"owns" and has privileges over (changes won't affect any other net
namespace). There is also no possibility of a non-privileged user namespace
messing up the net namespace sysctls it shares with its parent user namespace.
When a new user namespace is created without unsharing the network namespace
(eg calling clone() with CLONE_NEWUSER), the new user namespace shares its
parent's network namespace. Write access is protected by the mode set
in the sysctl's ctl_table (and enforced by procfs). Here in the case of
"max_dgram_qlen", 0644 is set; only the user owner has write access.
v1 -> v2:
* Add more detail to commit message, specify the
"/proc/sys/net/unix/max_dgram_qlen" sysctl in commit message.
Signed-off-by: Joanne Koong <joannekoong@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When key_exchange is disabled, there is no reason to accept MSG_CRYPTO
msgs if it doesn't send MSG_CRYPTO msgs.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sja1105 driver messes with the tagging protocol's state when PTP RX
timestamping is enabled/disabled. This is fundamentally necessary
because the tagger needs to know what to do when it receives a PTP
packet. If RX timestamping is enabled, then a metadata follow-up frame
is expected, and this holds the (partial) timestamp. So the tagger plays
hide-and-seek with the network stack until it also gets the metadata
frame, and then presents a single packet, the timestamped PTP packet.
But when RX timestamping isn't enabled, there is no metadata frame
expected, so the hide-and-seek game must be turned off and the packet
must be delivered right away to the network stack.
Considering this, we create a pseudo isolation by devising two tagger
methods callable by the switch: one to get the RX timestamping state,
and one to set it. Since we can't export symbols between the tagger and
the switch driver, these methods are exposed through function pointers.
After this change, the public portion of the sja1105_tagger_data
contains only function pointers.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 6d709cadfd.
The above change was done to avoid calling symbols exported by the
switch driver from the tagging protocol driver.
With the tagger-owned storage model, we have a new option on our hands,
and that is for the switch driver to provide a data consumer handler in
the form of a function pointer inside the ->connect_tag_protocol()
method. Having a function pointer avoids the problems of the exported
symbols approach.
By creating a handler for metadata frames holding TX timestamps on
SJA1110, we are able to eliminate an skb queue from the tagger data, and
replace it with a simple, and stateless, function pointer. This skb
queue is now handled exclusively by sja1105_ptp.c, which makes the code
easier to follow, as it used to be before the reverted patch.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>