linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-18 18:11:56 +00:00

Author	SHA1	Message	Date
Trond Myklebust	edffb84cc8	NFSoRDmA Client updates for Linux 5.11 Cleanups and improvements: - Remove use of raw kernel memory addresses in tracepoints - Replace dprintk() call sites in ERR_CHUNK path - Trace unmap sync calls - Optimize MR DMA-unmapping -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl/SlTcACgkQ18tUv7Cl QOsc+xAA0qmDLbdZShFAiip/jgFvHkoIJzVcil5++xiRY77xOjR2rgkgBSv4hFYm XWBidP/eQm3n5r0S+LG0zudcHnaKBonZ0UV2j32PelMEnvn3H9qlSJYneEm9xv2O K1koNcbunJH8JRsLxbStdMnOnNwCmB+HzIjnWr87OgXkYucqDBBxt3RMyZVD46QU GypnItAtLns4Oacsw0TyPAPAxstjZ71YRlTMhtfyIYEJfvxUVRYRKvs7nPei9eCa 0uCzoKb28F7CWy+S9so7wSjfdDu4G+FmAfMvJbQF4irhgZBvzyzj4jq4hCnv377S XF6Eygm0Z4BSWNoAnRjgLx3Bo4ps4qZi7iAj8cUAksEKQ3QI2BWTWR3e07YwLfnm iwhWougP76zbtC3lNA+4D2yaYwA1LtN4IYp19KmS/0LyRe5EBry39ccE/eE508JA BGnCohawI9ya7WT3xTne9lyxNGhjm/jNDKt0ax0ze4hwIqSGekqhUzxQ0bl4suQz ZHP2gUEad07jZyy8gCcpEvvxA3fFW241vaG/hN34Dy47eHpAQDQIMk7611YRQXH/ jCBHoym1dHXc8eNa+gTmrjXdhVdgHo4zI0ppeaQvA2Ss0nPp8N8H2nJV/as536Dp on0b1zAF4o5uzGlEzHu+FuFGqUPE9My9/dNDqw3o0Cncu7adffk= =R45y -----END PGP SIGNATURE----- Merge tag 'nfs-rdma-for-5.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next NFSoRDmA Client updates for Linux 5.11 Cleanups and improvements: - Remove use of raw kernel memory addresses in tracepoints - Replace dprintk() call sites in ERR_CHUNK path - Trace unmap sync calls - Optimize MR DMA-unmapping Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-15 20:08:41 -05:00
Linus Torvalds	d635a69dd4	Networking updates for 5.11 Core: - support "prefer busy polling" NAPI operation mode, where we defer softirq for some time expecting applications to periodically busy poll - AF_XDP: improve efficiency by more batching and hindering the adjacency cache prefetcher - af_packet: make packet_fanout.arr size configurable up to 64K - tcp: optimize TCP zero copy receive in presence of partial or unaligned reads making zero copy a performance win for much smaller messages - XDP: add bulk APIs for returning / freeing frames - sched: support fragmenting IP packets as they come out of conntrack - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs BPF: - BPF switch from crude rlimit-based to memcg-based memory accounting - BPF type format information for kernel modules and related tracing enhancements - BPF implement task local storage for BPF LSM - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage Protocols: - mptcp: improve multiple xmit streams support, memory accounting and many smaller improvements - TLS: support CHACHA20-POLY1305 cipher - seg6: add support for SRv6 End.DT4/DT6 behavior - sctp: Implement RFC 6951: UDP Encapsulation of SCTP - ppp_generic: add ability to bridge channels directly - bridge: Connectivity Fault Management (CFM) support as is defined in IEEE 802.1Q section 12.14. Drivers: - mlx5: make use of the new auxiliary bus to organize the driver internals - mlx5: more accurate port TX timestamping support - mlxsw: - improve the efficiency of offloaded next hop updates by using the new nexthop object API - support blackhole nexthops - support IEEE 802.1ad (Q-in-Q) bridging - rtw88: major bluetooth co-existance improvements - iwlwifi: support new 6 GHz frequency band - ath11k: Fast Initial Link Setup (FILS) - mt7915: dual band concurrent (DBDC) support - net: ipa: add basic support for IPA v4.5 Refactor: - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior - phy: add support for shared interrupts; get rid of multiple driver APIs and have the drivers write a full IRQ handler, slight growth of driver code should be compensated by the simpler API which also allows shared IRQs - add common code for handling netdev per-cpu counters - move TX packet re-allocation from Ethernet switch tag drivers to a central place - improve efficiency and rename nla_strlcpy - number of W=1 warning cleanups as we now catch those in a patchwork build bot Old code removal: - wan: delete the DLCI / SDLA drivers - wimax: move to staging - wifi: remove old WDS wifi bridging support Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/YXmUACgkQMUZtbf5S IrvSQBAAgOrt4EFopEvVqlTHZbqI45IEqgtXS+YWmlgnjZCgshyMj8q1yK1zzane qYxr/NNJ9kV3FdtaynmmHPgEEEfR5kJ/D3B2BsxYDkaDDrD0vbNsBGw+L+/Gbhxl N/5l/9FjLyLY1D+EErknuwR5XGuQ6BSDVaKQMhYOiK2hgdnAAI4hszo8Chf6wdD0 XDBslQ7vpD/05r+eMj0IkS5dSAoGOIFXUxhJ5dqrDbRHiKsIyWqA3PLbYemfAhxI s2XckjfmSgGE3FKL8PSFu+EcfHbJQQjLcULJUnqgVcdwEEtRuE9ggEi52nZRXMWM 4e8sQJAR9Fx7pZy0G1xfS149j6iPU5LjRlU9TNSpVABz14Vvvo3gEL6gyIdsz+xh hMN7UBdp0FEaP028CXoIYpaBesvQqj0BSndmee8qsYAtN6j+QKcM2AOSr7JN1uMH C/86EDoGAATiEQIVWJvnX5MPmlAoblyLA+RuVhmxkIBx2InGXkFmWqRkXT5l4jtk LVl8/TArR4alSQqLXictXCjYlCm9j5N4zFFtEVasSYi7/ZoPfgRNWT+lJ2R8Y+Zv +htzGaFuyj6RJTVeFQMrkl3whAtBamo2a0kwg45NnxmmXcspN6kJX1WOIy82+MhD Yht7uplSs7MGKA78q/CDU0XBeGjpABUvmplUQBIfrR/jKLW2730= =GXs1 -----END PGP SIGNATURE----- Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - support "prefer busy polling" NAPI operation mode, where we defer softirq for some time expecting applications to periodically busy poll - AF_XDP: improve efficiency by more batching and hindering the adjacency cache prefetcher - af_packet: make packet_fanout.arr size configurable up to 64K - tcp: optimize TCP zero copy receive in presence of partial or unaligned reads making zero copy a performance win for much smaller messages - XDP: add bulk APIs for returning / freeing frames - sched: support fragmenting IP packets as they come out of conntrack - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs BPF: - BPF switch from crude rlimit-based to memcg-based memory accounting - BPF type format information for kernel modules and related tracing enhancements - BPF implement task local storage for BPF LSM - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage Protocols: - mptcp: improve multiple xmit streams support, memory accounting and many smaller improvements - TLS: support CHACHA20-POLY1305 cipher - seg6: add support for SRv6 End.DT4/DT6 behavior - sctp: Implement RFC 6951: UDP Encapsulation of SCTP - ppp_generic: add ability to bridge channels directly - bridge: Connectivity Fault Management (CFM) support as is defined in IEEE 802.1Q section 12.14. Drivers: - mlx5: make use of the new auxiliary bus to organize the driver internals - mlx5: more accurate port TX timestamping support - mlxsw: - improve the efficiency of offloaded next hop updates by using the new nexthop object API - support blackhole nexthops - support IEEE 802.1ad (Q-in-Q) bridging - rtw88: major bluetooth co-existance improvements - iwlwifi: support new 6 GHz frequency band - ath11k: Fast Initial Link Setup (FILS) - mt7915: dual band concurrent (DBDC) support - net: ipa: add basic support for IPA v4.5 Refactor: - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior - phy: add support for shared interrupts; get rid of multiple driver APIs and have the drivers write a full IRQ handler, slight growth of driver code should be compensated by the simpler API which also allows shared IRQs - add common code for handling netdev per-cpu counters - move TX packet re-allocation from Ethernet switch tag drivers to a central place - improve efficiency and rename nla_strlcpy - number of W=1 warning cleanups as we now catch those in a patchwork build bot Old code removal: - wan: delete the DLCI / SDLA drivers - wimax: move to staging - wifi: remove old WDS wifi bridging support" * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits) net: hns3: fix expression that is currently always true net: fix proc_fs init handling in af_packet and tls nfc: pn533: convert comma to semicolon af_vsock: Assign the vsock transport considering the vsock address flags af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path vsock_addr: Check for supported flag values vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag vm_sockets: Add flags field in the vsock address data structure net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context nfc: s3fwrn5: Release the nfc firmware net: vxget: clean up sparse warnings mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3 mlxsw: spectrum_router_xm: Introduce basic XM cache flushing mlxsw: reg: Add Router LPM Cache Enable Register mlxsw: reg: Add Router LPM Cache ML Delete Register mlxsw: spectrum_router_xm: Implement L-value tracking for M-index mlxsw: reg: Add XM Router M Table Register ...	2020-12-15 13:22:29 -08:00
Yonatan Linik	a268e0f245	net: fix proc_fs init handling in af_packet and tls proc_fs was used, in af_packet, without a surrounding #ifdef, although there is no hard dependency on proc_fs. That caused the initialization of the af_packet module to fail when CONFIG_PROC_FS=n. Specifically, proc_create_net() was used in af_packet.c, and when it fails, packet_net_init() returns -ENOMEM. It will always fail when the kernel is compiled without proc_fs, because, proc_create_net() for example always returns NULL. The calling order that starts in af_packet.c is as follows: packet_init() register_pernet_subsys() register_pernet_operations() __register_pernet_operations() ops_init() ops->init() (packet_net_ops.init=packet_net_init()) proc_create_net() It worked in the past because register_pernet_subsys()'s return value wasn't checked before this Commit `36096f2f4f` ("packet: Fix error path in packet_init."). It always returned an error, but was not checked before, so everything was working even when CONFIG_PROC_FS=n. The fix here is simply to add the necessary #ifdef. This also fixes a similar error in tls_proc.c, that was found by Jakub Kicinski. Fixes: `d26b698dd3` ("net/tls: add skeleton of MIB statistics") Fixes: `36096f2f4f` ("packet: Fix error path in packet_init") Signed-off-by: Yonatan Linik <yonatanlinik@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 19:39:30 -08:00
Andra Paraschiv	7f816984f4	af_vsock: Assign the vsock transport considering the vsock address flags The vsock flags field can be set in the connect path (user space app) and the (listen) receive path (kernel space logic). When the vsock transport is assigned, the remote CID is used to distinguish between types of connection. Use the vsock flags value (in addition to the CID) from the remote address to decide which vsock transport to assign. For the sibling VMs use case, all the vsock packets need to be forwarded to the host, so always assign the guest->host transport if the VMADDR_FLAG_TO_HOST flag is set. For the other use cases, the vsock transport assignment logic is not changed. Changelog v3 -> v4 * Update the "remote_flags" local variable type to reflect the change of the "svm_flags" field to be 1 byte in size. v2 -> v3 * Update bitwise check logic to not compare result to the flag value. v1 -> v2 * Use bitwise operator to check the vsock flag. * Use the updated "VMADDR_FLAG_TO_HOST" flag naming. * Merge the checks for the g2h transport assignment in one "if" block. Signed-off-by: Andra Paraschiv <andraprs@amazon.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 19:33:39 -08:00
Andra Paraschiv	1b5f2ab98e	af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path The vsock flags can be set during the connect() setup logic, when initializing the vsock address data structure variable. Then the vsock transport is assigned, also considering this flags field. The vsock transport is also assigned on the (listen) receive path. The flags field needs to be set considering the use case. Set the value of the vsock flags of the remote address to the one targeted for packets forwarding to the host, if the following conditions are met: * The source CID of the packet is higher than VMADDR_CID_HOST. * The destination CID of the packet is higher than VMADDR_CID_HOST. Changelog v3 -> v4 * No changes. v2 -> v3 * No changes. v1 -> v2 * Set the vsock flag on the receive path in the vsock transport assignment logic. * Use bitwise operator for the vsock flag setup. * Use the updated "VMADDR_FLAG_TO_HOST" flag naming. Signed-off-by: Andra Paraschiv <andraprs@amazon.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 19:33:39 -08:00
Andra Paraschiv	cada7ccd9d	vsock_addr: Check for supported flag values Check if the provided flags value from the vsock address data structure includes the supported flags in the corresponding kernel version. The first byte of the "svm_zero" field is used as "svm_flags", so add the flags check instead. Changelog v3 -> v4 * New patch in v4. Signed-off-by: Andra Paraschiv <andraprs@amazon.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 19:33:39 -08:00
Tariq Toukan	ae0b04b238	net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled With NETIF_F_HW_TLS_TX packets are encrypted in HW. This cannot be logically done when HW_CSUM offload is off. Fixes: `2342a8512a` ("net: Add TLS TX offload features") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Link: https://lore.kernel.org/r/20201213143929.26253-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 19:31:36 -08:00
Alexander Duyck	c31b70c996	tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG message in the case of IPv6 or a fragmentation request in the case of IPv4. This results in the socket stalling for a second or more as it does not respond to the message by retransmitting the SYN frame. Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that makes use of the entire MSS. In the case of fastopen it does, and an additional complication is that the retransmit queue doesn't contain the original frames. As a result when tcp_simple_retransmit is called and walks the list of frames in the queue it may not mark the frames as lost because both the SYN and the data packet each individually are smaller than the MSS size after the adjustment. This results in the socket being stalled until the retransmit timer kicks in and forces the SYN frame out again without the data attached. In order to resolve this we can reduce the MSS the packets are compared to in tcp_simple_retransmit to -1 for cases where we are still in the TCP_SYN_SENT state for a fastopen socket. Doing this we will mark all of the packets related to the fastopen SYN as lost. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/160780498125.3272.15437756269539236825.stgit@localhost.localdomain Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 19:29:55 -08:00
Vasily Averin	54970a2fbb	net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet syzbot reproduces BUG_ON in skb_checksum_help(): tun creates (bogus) skb with huge partial-checksummed area and small ip packet inside. Then ip_rcv trims the skb based on size of internal ip packet, after that csum offset points beyond of trimmed skb. Then checksum_tg() called via netfilter hook triggers BUG_ON: offset = skb_checksum_start_offset(skb); BUG_ON(offset >= skb_headlen(skb)); To work around the problem this patch forces pskb_trim_rcsum_slow() to return -EINVAL in described scenario. It allows its callers to drop such kind of packets. Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0 Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 18:41:01 -08:00
Linus Torvalds	adb35e8dc9	Scheduler updates: - migrate_disable/enable() support which originates from the RT tree and is now a prerequisite for the new preemptible kmap_local() API which aims to replace kmap_atomic(). - A fair amount of topology and NUMA related improvements - Improvements for the frequency invariant calculations - Enhanced robustness for the global CPU priority tracking and decision making - The usual small fixes and enhancements all over the place -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2 Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI /QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq 3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ Gsq0rJW5iiu/OQ== =Oz1V -----END PGP SIGNATURE----- Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Thomas Gleixner: - migrate_disable/enable() support which originates from the RT tree and is now a prerequisite for the new preemptible kmap_local() API which aims to replace kmap_atomic(). - A fair amount of topology and NUMA related improvements - Improvements for the frequency invariant calculations - Enhanced robustness for the global CPU priority tracking and decision making - The usual small fixes and enhancements all over the place * tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits) sched/fair: Trivial correction of the newidle_balance() comment sched/fair: Clear SMT siblings after determining the core is not idle sched: Fix kernel-doc markup x86: Print ratio freq_max/freq_base used in frequency invariance calculations x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC x86, sched: Calculate frequency invariance for AMD systems irq_work: Optimize irq_work_single() smp: Cleanup smp_call_function*() irq_work: Cleanup sched: Limit the amount of NUMA imbalance that can exist at fork time sched/numa: Allow a floating imbalance between NUMA nodes sched: Avoid unnecessary calculation of load imbalance at clone time sched/numa: Rename nr_running and break out the magic number sched: Make migrate_disable/enable() independent of RT sched/topology: Condition EAS enablement on FIE support arm64: Rebuild sched domains on invariance status changes sched/topology,schedutil: Wrap sched domains rebuild sched/uclamp: Allow to reset a task uclamp constraint value sched/core: Fix typos in comments Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug ...	2020-12-14 18:29:11 -08:00
Wang Hai	989a1db06e	net: bridge: Fix a warning when del bridge sysfs I got a warining report: br_sysfs_addbr: can't create group bridge4/bridge ------------[ cut here ]------------ sysfs group 'bridge' not found for kobject 'bridge4' WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group fs/sysfs/group.c:279 [inline] WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group+0x153/0x1b0 fs/sysfs/group.c:270 Modules linked in: iptable_nat ... Call Trace: br_dev_delete+0x112/0x190 net/bridge/br_if.c:384 br_dev_newlink net/bridge/br_netlink.c:1381 [inline] br_dev_newlink+0xdb/0x100 net/bridge/br_netlink.c:1362 __rtnl_newlink+0xe11/0x13f0 net/core/rtnetlink.c:3441 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500 rtnetlink_rcv_msg+0x385/0x980 net/core/rtnetlink.c:5562 netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x793/0xc80 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0x139/0x170 net/socket.c:671 ____sys_sendmsg+0x658/0x7d0 net/socket.c:2353 ___sys_sendmsg+0xf8/0x170 net/socket.c:2407 __sys_sendmsg+0xd3/0x190 net/socket.c:2440 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 In br_device_event(), if the bridge sysfs fails to be added, br_device_event() should return error. This can prevent warining when removing bridge sysfs that do not exist. Fixes: `bb900b27a2` ("bridge: allow creating bridge devices with netlink") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Tested-by: Nikolay Aleksandrov <nikolay@nvidia.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20201211122921.40386-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 18:27:49 -08:00
Paolo Abeni	15e6ca974b	mptcp: let MPTCP create max size skbs Currently the xmit path of the MPTCP protocol creates smaller- than-max-size skbs, which is suboptimal for the performances. There are a few things to improve: - when coalescing to an existing skb, must clear the PUSH flag - tcp_build_frag() expect the available space as an argument. When coalescing is enable MPTCP already subtracted the to-be-coalesced skb len. We must increment said argument accordingly. Before: ./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM [...] 131072 16384 16384 30.00 24414.86 After: ./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM [...] 131072 16384 16384 30.05 28357.69 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 17:30:06 -08:00
Paolo Abeni	1bc7327b5f	mptcp: pm: simplify select_local_address() There is no need to unconditionally acquire the join list lock, we can simply splice the join list into the subflow list and traverse only the latter. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 17:30:06 -08:00
Florian Westphal	50c504a20a	mptcp: parse and act on incoming FASTCLOSE option parse the MPTCP FASTCLOSE subtype. If provided key matches the local one, schedule the work queue to close (with tcp reset) all subflows. The MPTCP socket moves to closed state immediately. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 17:30:06 -08:00
Florian Westphal	049fe386d3	tcp: parse mptcp options contained in reset packets Because TCP-level resets only affect the subflow, there is a MPTCP option to indicate that the MPTCP-level connection should be closed immediately without a mptcp-level fin exchange. This is the 'MPTCP fast close option'. It can be carried on ack segments or TCP resets. In the latter case, its needed to parse mptcp options also for reset packets so that MPTCP can act accordingly. Next patch will add receive side fastclose support in MPTCP. Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 17:30:06 -08:00
Florian Westphal	ab82e996a1	mptcp: hold mptcp socket before calling tcp_done When processing options from tcp reset path its possible that tcp_done(ssk) drops the last reference on the mptcp socket which results in use-after-free. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 17:30:06 -08:00
Geliang Tang	ba34c3de71	mptcp: use MPTCPOPT_HMAC_LEN macro Use the macro MPTCPOPT_HMAC_LEN instead of a constant in struct mptcp_options_received. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 17:30:06 -08:00
Geliang Tang	141694df65	mptcp: remove address when netlink flushes addrs When the PM netlink flushes the addresses, invoke the remove address function mptcp_nl_remove_subflow_and_signal_addr to remove the addresses and the subflows. Since this function should not be invoked under lock, move __flush_addrs out of the pernet->lock. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 17:30:06 -08:00
Nicolas Rybowski	3764b0c565	mptcp: attach subflow socket to parent cgroup It has been observed that the kernel sockets created for the subflows (except the first one) are not in the same cgroup as their parents. That's because the additional subflows are created by kernel workers. This is a problem with eBPF programs attached to the parent's cgroup won't be executed for the children. But also with any other features of CGroup linked to a sk. This patch fixes this behaviour. As the subflow sockets are created by the kernel, we can't use 'mem_cgroup_sk_alloc' because of the current context being the one of the kworker. This is why we have to do low level memcg manipulation, if required. Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Suggested-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 17:30:06 -08:00
Eelco Chaudron	09d6217254	net: openvswitch: fix TTL decrement exception action execution Currently, the exception actions are not processed correctly as the wrong dataset is passed. This change fixes this, including the misleading comment. In addition, a check was added to make sure we work on an IPv4 packet, and not just assume if it's not IPv6 it's IPv4. This was all tested using OVS with patch, https://patchwork.ozlabs.org/project/openvswitch/list/?series=21639, applied and sending packets with a TTL of 1 (and 0), both with IPv4 and IPv6. Fixes: `69929d4c49` ("net: openvswitch: fix TTL decrement action netlink message format") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/160733569860.3007.12938188180387116741.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 17:18:25 -08:00
Linus Torvalds	f9b4240b07	fixes-v5.11 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCX9daOgAKCRCRxhvAZXjc ohPkAQChXUB2BAjtIzXlCkZoDBbzHHblm5DZ37oy/4xYFmAcEwEA5sw6dQqyGHnF GEP9def51HvXLpBV2BzNUGggo1SoGgQ= =w/cO -----END PGP SIGNATURE----- Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull misc fixes from Christian Brauner: "This contains several fixes which felt worth being combined into a single branch: - Use put_nsproxy() instead of open-coding it switch_task_namespaces() - Kirill's work to unify lifecycle management for all namespaces. The lifetime counters are used identically for all namespaces types. Namespaces may of course have additional unrelated counters and these are not altered. This work allows us to unify the type of the counters and reduces maintenance cost by moving the counter in one place and indicating that basic lifetime management is identical for all namespaces. - Peilin's fix adding three byte padding to Dmitry's PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak. - Two smal patches to convert from the /* fall through / comment annotation to the fallthrough keyword annotation which I had taken into my branch and into -next before `df561f6688` ("treewide: Use fallthrough pseudo-keyword") made it upstream which fixed this tree-wide. Since I didn't want to invalidate all testing for other commits I didn't rebase and kept them" tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: nsproxy: use put_nsproxy() in switch_task_namespaces() sys: Convert to the new fallthrough notation signal: Convert to the new fallthrough notation time: Use generic ns_common::count cgroup: Use generic ns_common::count mnt: Use generic ns_common::count user: Use generic ns_common::count pid: Use generic ns_common::count ipc: Use generic ns_common::count uts: Use generic ns_common::count net: Use generic ns_common::count ns: Add a common refcount into ns_common ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()	2020-12-14 16:40:27 -08:00
Jakub Kicinski	7bca5021a4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next 1) Missing dependencies in NFT_BRIDGE_REJECT, from Randy Dunlap. 2) Use atomic_inc_return() instead of atomic_add_return() in IPVS, from Yejune Deng. 3) Simplify check for overquota in xt_nfacct, from Kaixu Xia. 4) Move nfnl_acct_list away from struct net, from Miao Wang. 5) Pass actual sk in reject actions, from Jan Engelhardt. 6) Add timeout and protoinfo to ctnetlink destroy events, from Florian Westphal. 7) Four patches to generalize set infrastructure to support for multiple expressions per set element. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: netfilter: nftables: netlink support for several set element expressions netfilter: nftables: generalize set extension to support for several expressions netfilter: nftables: move nft_expr before nft_set netfilter: nftables: generalize set expressions support netfilter: ctnetlink: add timeout and protoinfo to destroy events netfilter: use actual socket sk for REJECT action netfilter: nfnl_acct: remove data from struct net netfilter: Remove unnecessary conversion to bool ipvs: replace atomic_add_return() netfilter: nft_reject_bridge: fix build errors due to code movement ==================== Link: https://lore.kernel.org/r/20201212230513.3465-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 15:43:21 -08:00
Jakub Kicinski	a6b5e026e6	Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2020-12-14 1) Expose bpf_sk_storage_() helpers to iterator programs, from Florent Revest. 2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua. 3) Support for finding BTF based kernel attach targets through libbpf's bpf_program__set_attach_target() API, from Andrii Nakryiko. 4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song. 5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet. 6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo. 7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel. 8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman. 9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner, KP Singh, Jiri Olsa and various others. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits) selftests/bpf: Add a test for ptr_to_map_value on stack for helper access bpf: Permits pointers on stack for helper calls libbpf: Expose libbpf ring_buffer epoll_fd selftests/bpf: Add set_attach_target() API selftest for module target libbpf: Support modules in bpf_program__set_attach_target() API selftests/bpf: Silence ima_setup.sh when not running in verbose mode. selftests/bpf: Drop the need for LLVM's llc selftests/bpf: fix bpf_testmod.ko recompilation logic samples/bpf: Fix possible hang in xdpsock with multiple threads selftests/bpf: Make selftest compilation work on clang 11 selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore selftests/bpf: Drop tcp-{client,server}.py from Makefile selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV selftests/bpf: Xsk selftests - DRV POLL, NOPOLL selftests/bpf: Xsk selftests - SKB POLL, NOPOLL selftests/bpf: Xsk selftests framework bpf: Only provide bpf_sock_from_file with CONFIG_NET bpf: Return -ENOTSUPP when attaching to non-kernel BTF xsk: Validate socket state in xsk_recvmsg, prior touching socket members ... ==================== Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-14 15:34:36 -08:00
Ilya Dryomov	2f0df6cfa3	libceph: drop ceph_auth_{create,update}_authorizer() Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	ce287162d9	libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1 This shouldn't cause any functional changes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	cd1a677cad	libceph, ceph: implement msgr2.1 protocol (crc and secure modes) Implement msgr2.1 wire protocol, available since nautilus 14.2.11 and octopus 15.2.5. msgr2.0 wire protocol is not implemented -- it has several security, integrity and robustness issues and therefore considered deprecated. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	00498b9941	libceph: introduce connection modes and ms_mode option msgr2 supports two connection modes: crc (plain) and secure (on-wire encryption). Connection mode is picked by server based on input from client. Introduce ms_mode option: ms_mode=legacy - msgr1 (default) ms_mode=crc - crc mode, if denied fail ms_mode=secure - secure mode, if denied fail ms_mode=prefer-crc - crc mode, if denied agree to secure mode ms_mode=prefer-secure - secure mode, if denied agree to crc mode ms_mode affects all connections, we don't separate connections to mons like it's done in userspace with ms_client_mode vs ms_mon_client_mode. For now the default is legacy, to be flipped to prefer-crc after some time. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	313771e80f	libceph, rbd: ignore addr->type while comparing in some cases For libceph, this ensures that libceph instance sharing (share option) continues to work. For rbd, this avoids blocklisting alive lock owners (locker addr is always LEGACY, while watcher addr is ANY in nautilus). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	a5cbd5fc22	libceph, ceph: get and handle cluster maps with addrvecs In preparation for msgr2, make the cluster send us maps with addrvecs including both LEGACY and MSGR2 addrs instead of a single LEGACY addr. This means advertising support for SERVER_NAUTILUS and also some older features: SERVER_MIMIC, MONENC and MONNAMES. MONNAMES and MONENC are actually pre-argonaut, we just never updated ceph_monmap_decode() for them. Decoding is unconditional, see commit `23c625ce30` ("libceph: assume argonaut on the server side"). SERVER_MIMIC doesn't bear any meaning for the kernel client. Since ceph_decode_entity_addrvec() is guarded by encoding version checks (and in msgr2 case it is guarded implicitly by the fact that server is speaking msgr2), we assume MSG_ADDR2 for it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	8921f25116	libceph: factor out finish_auth() In preparation for msgr2, factor out finish_auth() so it is suitable for both existing MAuth message based authentication and upcoming msgr2 authentication exchange. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	c1c0ce78f4	libceph: drop ac->ops->name field Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	59711f9ec2	libceph: amend cephx init_protocol() and build_request() In msgr2, initial authentication happens with an exchange of msgr2 control frames -- MAuth message and struct ceph_mon_request_header aren't used. Make that optional. Stop reporting cephx protocol as "x". Use "cephx" instead. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	285ea34fc8	libceph, ceph: incorporate nautilus cephx changes - request service tickets together with auth ticket. Currently we get auth ticket via CEPHX_GET_AUTH_SESSION_KEY op and then request service tickets via CEPHX_GET_PRINCIPAL_SESSION_KEY op in a separate message. Since nautilus, desired service tickets are shared togther with auth ticket in CEPHX_GET_AUTH_SESSION_KEY reply. - propagate session key and connection secret, if any. In preparation for msgr2, update handle_reply() and verify_authorizer_reply() auth ops to propagate session key and connection secret. Since nautilus, if secure mode is negotiated, connection secret is shared either in CEPHX_GET_AUTH_SESSION_KEY reply (for mons) or in a final authorizer reply (for osds and mdses). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	6610fff278	libceph: safer en/decoding of cephx requests and replies Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	f79e25b087	libceph: more insight into ticket expiry and invalidation Make it clear that "need" is a union of "missing" and "have, but up for renewal" and dout when the ticket goes missing due to expiry or invalidation by client. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	a56dd9bf47	libceph: move msgr1 protocol specific fields to its own struct A couple whitespace fixups, no functional changes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	2f713615dd	libceph: move msgr1 protocol implementation to its own file A pure move, no other changes. Note that ceph_tcp_recv{msg,page}() and ceph_tcp_send{msg,page}() helpers are also moved. msgr2 will bring its own, more efficient, variants based on iov_iter. Switching msgr1 to them was considered but decided against to avoid subtle regressions. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:50 +01:00
Ilya Dryomov	566050e17e	libceph: separate msgr1 protocol implementation In preparation for msgr2, define internal messenger <-> protocol interface (as opposed to external messenger <-> client interface, which is struct ceph_connection_operations) consisting of try_read(), try_write(), revoke(), revoke_incoming(), opened(), reset_session() and reset_protocol() ops. The semantics are exactly the same as they are now. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	6503e0b69c	libceph: export remaining protocol independent infrastructure In preparation for msgr2, make all protocol independent functions in messenger.c global. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	699921d9e6	libceph: export zero_page In preparation for msgr2, make zero_page global. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	3fefd43e74	libceph: rename and export con->flags bits In preparation for msgr2, move the defines to the header file. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	6d7f62bfb5	libceph: rename and export con->state states In preparation for msgr2, rename msgr1 specific states and move the defines to the header file. Also drop state transition comments. They don't cover all possible transitions (e.g. NEGOTIATING -> STANDBY, etc) and currently do more harm than good. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	30be780a87	libceph: make con->state an int unsigned long is a leftover from when con->state used to be a set of bits managed with set_bit(), clear_bit(), etc. Save a bit of memory. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	2f68738037	libceph: don't export ceph_messenger_{init_fini}() to modules Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	fd1a154cad	libceph: make sure our addr->port is zero and addr->nonce is non-zero Our messenger instance addr->port is normally zero -- anything else is nonsensical because as a client we connect to multiple servers and don't listen on any port. However, a user can supply an arbitrary addr:port via ip option and the port is currently preserved. Zero it. Conversely, make sure our addr->nonce is non-zero. A zero nonce is special: in combination with a zero port, it is used to blocklist the entire ip. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	771294fe07	libceph: factor out ceph_con_get_out_msg() Move the logic of grabbing the next message from the queue into its own function. Like ceph_con_in_msg_alloc(), this is protocol independent. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	fc4c128e15	libceph: change ceph_con_in_msg_alloc() to take hdr ceph_con_in_msg_alloc() is protocol independent, but con->in_hdr (and struct ceph_msg_header in general) is msgr1 specific. While the struct is deeply ingrained inside and outside the messenger, con->in_hdr field can be separated. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	8ee8abf797	libceph: change ceph_msg_data_cursor_init() to take cursor Make it possible to have local cursors and embed them outside struct ceph_msg. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	0247192809	libceph: handle discarding acked and requeued messages separately Make it easier to follow and remove dependency on msgr1 specific CEPH_MSGR_TAG_SEQ. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	5cd8da3a1c	libceph: drop msg->ack_stamp field It is set in process_ack() but never used. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	d3c1248cac	libceph: remove redundant session reset log message Stick with pr_info message because session reset isn't an error most of the time. When it is (i.e. if the server denies the reconnect attempt), we get a bunch of other pr_err messages. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:49 +01:00
Ilya Dryomov	a3da057bbd	libceph: clear con->peer_global_seq on RESETSESSION con->peer_global_seq is part of session state. Clear it when the server tells us to reset, not just in ceph_con_close(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Ilya Dryomov	5963c3d01c	libceph: rename reset_connection() to ceph_con_reset_session() With just session reset bits left, rename appropriately. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Ilya Dryomov	3596f4c124	libceph: split protocol reset bits out of reset_connection() Move protocol reset bits into ceph_con_reset_protocol(), leaving just session reset bits. Note that con->out_skip is now reset on faults. This fixes a crash in the case of a stateful session getting a fault while in the middle of revoking a message. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Ilya Dryomov	90b6561a05	libceph: don't call reset_connection() on version/feature mismatches A fault due to a version mismatch or a feature set mismatch used to be treated differently from other faults: the connection would get closed without trying to reconnect and there was a ->bad_proto() connection op for notifying about that. This changed a long time ago, see commits `6384bb8b8e` ("libceph: kill bad_proto ceph connection op") and `0fa6ebc600` ("libceph: fix protocol feature mismatch failure path"). Nowadays these aren't any different from other faults (i.e. we try to reconnect even though the mismatch won't resolve until the server is replaced). reset_connection() calls there are rather confusing because reset_connection() resets a session together an individual instance of the protocol. This is cleaned up in the next patch. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Ilya Dryomov	418af5b3bf	libceph: lower exponential backoff delay The current setting allows the backoff to climb up to 5 minutes. This is too high -- it becomes hard to tell whether the client is stuck on something or just in backoff. In userspace, ms_max_backoff is defaulted to 15 seconds. Let's do the same. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Ilya Dryomov	b77f8f0e4f	libceph: include middle_len in process_message() dout Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Linus Torvalds	9e4b0d55d8	Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Add speed testing on 1420-byte blocks for networking Algorithms: - Improve performance of chacha on ARM for network packets - Improve performance of aegis128 on ARM for network packets Drivers: - Add support for Keem Bay OCS AES/SM4 - Add support for QAT 4xxx devices - Enable crypto-engine retry mechanism in caam - Enable support for crypto engine on sdm845 in qce - Add HiSilicon PRNG driver support" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (161 commits) crypto: qat - add capability detection logic in qat_4xxx crypto: qat - add AES-XTS support for QAT GEN4 devices crypto: qat - add AES-CTR support for QAT GEN4 devices crypto: atmel-i2c - select CONFIG_BITREVERSE crypto: hisilicon/trng - replace atomic_add_return() crypto: keembay - Add support for Keem Bay OCS AES/SM4 dt-bindings: Add Keem Bay OCS AES bindings crypto: aegis128 - avoid spurious references crypto_aegis128_update_simd crypto: seed - remove trailing semicolon in macro definition crypto: x86/poly1305 - Use TEST %reg,%reg instead of CMP $0,%reg crypto: x86/sha512 - Use TEST %reg,%reg instead of CMP $0,%reg crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg crypto: cpt - Fix sparse warnings in cptpf hwrng: ks-sa - Add dependency on IOMEM and OF crypto: lib/blake2s - Move selftest prototype into header file crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata crypto: ecdh - avoid unaligned accesses in ecdh_set_secret() crypto: ccree - rework cache parameters handling crypto: cavium - Use dma_set_mask_and_coherent to simplify code crypto: marvell/octeontx - Use dma_set_mask_and_coherent to simplify code ...	2020-12-14 12:18:19 -08:00
Trond Myklebust	5802f7c2a6	SUNRPC: When expanding the buffer, we may need grow the sparse pages If we're shifting the page data to the right, and this happens to be a sparse page array, then we may need to allocate new pages in order to receive the data. Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Trond Myklebust	f8d0e60f10	SUNRPC: Cleanup - constify a number of xdr_buf helpers There are a number of xdr helpers for struct xdr_buf that do not change the structure itself. Mark those as taking const pointers for documentation purposes. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Trond Myklebust	5a5f1c2c2c	SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field Move the setting of the xdr_stream 'nwords' field into the helpers that reset the xdr_stream cursor. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Trond Myklebust	e43ac22b83	SUNRPC: _copy_to/from_pages() now check for zero length Clean up callers of _copy_to/from_pages() that still check for a zero length. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Trond Myklebust	6707fbd7d3	SUNRPC: Cleanup xdr_shrink_bufhead() Clean up xdr_shrink_bufhead() to use the new helpers instead of doing its own thing. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Trond Myklebust	c4f2f591f0	SUNRPC: Fix xdr_expand_hole() We do want to try to grow the buffer if possible, but if that attempt fails, we still want to move the data and truncate the XDR message. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Trond Myklebust	9a20f6f4e6	SUNRPC: Fixes for xdr_align_data() The main use case right now for xdr_align_data() is to shift the page data to the left, and in practice shrink the total XDR data buffer. This patch ensures that we fix up the accounting for the buffer length as we shift that data around. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Trond Myklebust	c54e959b36	SUNRPC: _shift_data_left/right_pages should check the shift length Exit early if the shift is zero. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Chuck Lever	15261b9126	xprtrdma: Fix XDRBUF_SPARSE_PAGES support Olga K. observed that rpcrdma_marsh_req() allocates sparse pages only when it has determined that a Reply chunk is necessary. There are plenty of cases where no Reply chunk is needed, but the XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in rpcrdma_inline_fixup() when it tries to copy parts of the received Reply into a missing page. To avoid crashing, handle sparse page allocation up front. Until XATTR support was added, this issue did not appear often because the only SPARSE_PAGES consumer always expected a reply large enough to always require a Reply chunk. Reported-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Dan Aloni	ac9645c873	sunrpc: fix xs_read_xdr_buf for partial pages receive When receiving pages data, return value 'ret' when positive includes `buf->page_base`, so we should subtract that before it is used for changing `offset` and comparing against `want`. This was discovered on the very rare cases where the server returned a chunk of bytes that when added to the already received amount of bytes for the pages happened to match the current `recv.len`, for example on this case: buf->page_base : 258356 actually received from socket: 1740 ret : 260096 want : 260096 In this case neither of the two 'if ... goto out' trigger, and we continue to tail parsing. Worth to mention that the ensuing EMSGSIZE from the continued execution of `xs_read_xdr_buf` may be observed by an application due to 4 superfluous bytes being added to the pages data. Fixes: `277e4ab7d5` ("SUNRPC: Simplify TCP receive code by switching to using iterators") Signed-off-by: Dan Aloni <dan@kernelim.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:46:55 -05:00
Xie He	13458ffe0a	net: x25: Remove unimplemented X.25-over-LLC code stubs According to the X.25 documentation, there was a plan to implement X.25-over-802.2-LLC. It never finished but left various code stubs in the X.25 code. At this time it is unlikely that it would ever finish so it may be better to remove those code stubs. Also change the documentation to make it clear that this is not a ongoing plan anymore. Change words like "will" to "could", "would", etc. Cc: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Xie He <xie.he.0141@gmail.com> Link: https://lore.kernel.org/r/20201209033346.83742-1-xie.he.0141@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-12 17:15:33 -08:00
SeongJae Park	0b9b241406	inet: frags: batch fqdir destroy works On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls make the number of active slab objects including 'sock_inode_cache' type rapidly and continuously increase. As a result, memory pressure occurs. In more detail, I made an artificial reproducer that resembles the workload that we found the problem and reproduce the problem faster. It merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop. It takes about 2 minutes. On 40 CPU cores / 70GB DRAM machine, the available memory continuously reduced in a fast speed (about 120MB per second, 15GB in total within the 2 minutes). Note that the issue don't reproduce on every machine. On my 6 CPU cores machine, the problem didn't reproduce. 'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the relevant memory objects. They are asynchronously invoked by the work queues and internally use 'rcu_barrier()' to ensure safe destructions. 'cleanup_net()' works in a batched maneer in a single thread worker, while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the 'system_wq'. Therefore, 'fqdir_work_fn()' called frequently under the workload and made the contention for 'rcu_barrier()' high. In more detail, the global mutex, 'rcu_state.barrier_mutex' became the bottleneck. This commit avoids such contention by doing the 'rcu_barrier()' and subsequent lightweight works in a batched manner, as similar to that of 'cleanup_net()'. The fqdir hashtable destruction, which is done before the 'rcu_barrier()', is still allowed to run in parallel for fast processing, but this commit makes it to use a dedicated work queue instead of the 'system_wq', to make sure that the number of threads is bounded. Signed-off-by: SeongJae Park <sjpark@amazon.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201211112405.31158-1-sjpark@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-12 15:08:54 -08:00
Jakub Kicinski	e2437ac2f5	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2020-12-12 Just one patch this time: 1) Redact the SA keys with kernel lockdown confidentiality. If enabled, no secret keys are sent to uuserspace. From Antony Antony. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: redact SA secret with lockdown confidentiality ==================== Link: https://lore.kernel.org/r/20201212085737.2101294-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-12 12:28:42 -08:00
Pablo Neira Ayuso	48b0ae046e	netfilter: nftables: netlink support for several set element expressions This patch adds three new netlink attributes to encapsulate a list of expressions per set elements: - NFTA_SET_EXPRESSIONS: this attribute provides the set definition in terms of expressions. New set elements get attached the list of expressions that is specified by this new netlink attribute. - NFTA_SET_ELEM_EXPRESSIONS: this attribute allows users to restore (or initialize) the stateful information of set elements when adding an element to the set. - NFTA_DYNSET_EXPRESSIONS: this attribute specifies the list of expressions that the set element gets when it is inserted from the packet path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-12-12 19:20:52 +01:00
Pablo Neira Ayuso	563125a73a	netfilter: nftables: generalize set extension to support for several expressions This patch replaces NFT_SET_EXPR by NFT_SET_EXT_EXPRESSIONS. This new extension allows to attach several expressions to one set element (not only one single expression as NFT_SET_EXPR provides). This patch prepares for support for several expressions per set element in the netlink userspace API. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-12-12 19:20:24 +01:00
Jakub Kicinski	00f7763a26	A new set of wireless changes: * validate key indices for key deletion * more preamble support in mac80211 * various 6 GHz scan fixes/improvements * a common SAR power limitations API * various small fixes & code improvements -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl/TfYAACgkQB8qZga/f l8ShjQ/9Hd6KjvA7keATtdjR7rDHo7H2nBKV/LukpuHsiTRrXTVOAfkcUTOb2hfR 7SJMzsUXdJGivbwm4lkx5TrIgiJm1hfW3zG0PFOs/bIuXs/KICrb+kLgQWiRIUfa RIinf8BGPH3GgcCHDcWFUrnfnBVchrPUx2wIHoCQbCzLHIhB6q6x8jEJA67+smpv 57tDfUhm6pf6OYOqVN8HYlo0uRAIn1ImneplDelCmCI1dzlneEkMhqZuBXqWpD/I C5vU+MjoOsJiW1XkmYOMe6VKQ/Bve06GUWs830S7aROOEfByv+ptlR9IjqvHvPIm UI9NivfXQiZr6S7yD1m2xV7a14UMCIzYarwaM/I/NHAWF/Y4vzHFVzQjLfVKqMCV dxxsWN+Yg7Gx3T5Fj3NNiQgnPF9ASVqgMrlC59ga+4If0y60V7dOSFuo9HF7AWgP NIWKVI3He7Mb5TciM+BX5YQWkJiCSZXs427WLO6p0bp3kAgS6N6BThUraGCogXVF 1BT/y5G3QzZwg02vL3lxgWglXoH/e63UYCPt0r+i5c83Z+n4YnFSUZyRSViy9Elj DkCgdxmP0OtM+FaHxLdYm+FL4GXaGWQVNORIDP0ViSrstPgSxhWIgVj/pKNcKC7g bJI/IXm7eQkW5SXOafmhVV0TmvDOt/mM46E0CeWcPTqIhetk3lM= =+qd6 -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-net-next-2020-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== A new set of wireless changes: * validate key indices for key deletion * more preamble support in mac80211 * various 6 GHz scan fixes/improvements * a common SAR power limitations API * various small fixes & code improvements * tag 'mac80211-next-for-net-next-2020-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next: (35 commits) mac80211: add ieee80211_set_sar_specs nl80211: add common API to configure SAR power limitations mac80211: fix a mistake check for rx_stats update mac80211: mlme: save ssid info to ieee80211_bss_conf while assoc mac80211: Update rate control on channel change mac80211: don't filter out beacons once we start CSA mac80211: Fix calculation of minimal channel width mac80211: ignore country element TX power on 6 GHz mac80211: use bitfield helpers for BA session action frames mac80211: support Rx timestamp calculation for all preamble types mac80211: don't set set TDLS STA bandwidth wider than possible mac80211: support driver-based disconnect with reconnect hint cfg80211: support immediate reconnect request hint mac80211: use struct assignment for he_obss_pd cfg80211: remove struct ieee80211_he_bss_color nl80211: validate key indexes for cfg80211_registered_device cfg80211: include block-tx flag in channel switch started event mac80211: disallow band-switch during CSA ieee80211: update reduced neighbor report TBTT info length cfg80211: Save the regulatory domain when setting custom regulatory ... ==================== Link: https://lore.kernel.org/r/20201211142552.209018-1-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-12 10:07:56 -08:00
Pablo Neira Ayuso	8cfd9b0f85	netfilter: nftables: generalize set expressions support Currently, the set infrastucture allows for one single expressions per element. This patch extends the existing infrastructure to allow for up to two expressions. This is not updating the netlink API yet, this is coming as an initial preparation patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-12-12 11:44:42 +01:00
Florian Westphal	86d21fc747	netfilter: ctnetlink: add timeout and protoinfo to destroy events DESTROY events do not include the remaining timeout. Add the timeout if the entry was removed explicitly. This can happen when a conntrack gets deleted prematurely, e.g. due to a tcp reset, module removal, netdev notifier (nat/masquerade device went down), ctnetlink and so on. Add the protocol state too for the destroy message to check for abnormal state on connection termination. Joint work with Pablo. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-12-12 11:44:42 +01:00
Jakub Kicinski	46d5e62dd3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net xdp_return_frame_bulk() needs to pass a xdp_buff to __xdp_return(). strlcpy got converted to strscpy but here it makes no functional difference, so just keep the right code. Conflicts: net/netfilter/nf_tables_api.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-11 22:29:38 -08:00
Carl Huang	c534e093d8	mac80211: add ieee80211_set_sar_specs This change registers ieee80211_set_sar_specs to mac80211_config_ops, so cfg80211 can call it. Signed-off-by: Carl Huang <cjhuang@codeaurora.org> Reviewed-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Abhishek Kumar <kuabhs@chromium.org> Link: https://lore.kernel.org/r/20201203103728.3034-3-cjhuang@codeaurora.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:39:59 +01:00
Carl Huang	6bdb68cef7	nl80211: add common API to configure SAR power limitations NL80211_CMD_SET_SAR_SPECS is added to configure SAR from user space. NL80211_ATTR_SAR_SPEC is used to pass the SAR power specification when used with NL80211_CMD_SET_SAR_SPECS. Wireless driver needs to register SAR type, supported frequency ranges to wiphy, so user space can query it. The index in frequency range is used to specify which sub band the power limitation applies to. The SAR type is for compatibility, so later other SAR mechanism can be implemented without breaking the user space SAR applications. Normal process is user space queries the SAR capability, and gets the index of supported frequency ranges and associates the power limitation with this index and sends to kernel. Here is an example of message send to kernel: 8c 00 00 00 08 00 01 00 00 00 00 00 38 00 2b 81 08 00 01 00 00 00 00 00 2c 00 02 80 14 00 00 80 08 00 02 00 00 00 00 00 08 00 01 00 38 00 00 00 14 00 01 80 08 00 02 00 01 00 00 00 08 00 01 00 48 00 00 00 NL80211_CMD_SET_SAR_SPECS: 0x8c NL80211_ATTR_WIPHY: 0x01(phy idx is 0) NL80211_ATTR_SAR_SPEC: 0x812b (NLA_NESTED) NL80211_SAR_ATTR_TYPE: 0x00 (NL80211_SAR_TYPE_POWER) NL80211_SAR_ATTR_SPECS: 0x8002 (NLA_NESTED) freq range 0 power: 0x38 in 0.25dbm unit (14dbm) freq range 1 power: 0x48 in 0.25dbm unit (18dbm) Signed-off-by: Carl Huang <cjhuang@codeaurora.org> Reviewed-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Abhishek Kumar <kuabhs@chromium.org> Link: https://lore.kernel.org/r/20201203103728.3034-2-cjhuang@codeaurora.org [minor edits, NLA parse cleanups] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:38:54 +01:00
Wen Gong	f879ac8ed6	mac80211: fix a mistake check for rx_stats update It should be !is_multicast_ether_addr() in ieee80211_rx_h_sta_process() for the rx_stats update, below commit remove the !, this patch is to change it back. It lead the rx rate "iw wlan0 station dump" become invalid for some scenario when IEEE80211_HW_USES_RSS is set. Fixes: `09a740ce35` ("mac80211: receive and process S1G beacons") Signed-off-by: Wen Gong <wgong@codeaurora.org> Link: https://lore.kernel.org/r/1607483189-3891-1-git-send-email-wgong@codeaurora.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:06 +01:00
Wen Gong	b0140fda62	mac80211: mlme: save ssid info to ieee80211_bss_conf while assoc The ssid info of ieee80211_bss_conf is filled in ieee80211_start_ap() for AP mode. For STATION mode, it is empty, save the info from struct ieee80211_mgd_assoc_data, the struct ieee80211_mgd_assoc_data will be freed after assoc, so the ssid info of ieee80211_mgd_assoc_data can not access after assoc, save ssid info to ieee80211_bss_conf, then ssid info can be still access after assoc. Signed-off-by: Wen Gong <wgong@codeaurora.org> Link: https://lore.kernel.org/r/1607312195-3583-2-git-send-email-wgong@codeaurora.org [reset on disassoc] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Ilan Peer	44b72ca816	mac80211: Update rate control on channel change A channel change or a channel bandwidth change can impact the rate control logic. However, the rate control logic was not updated before/after such a change, which might result in unexpected behavior. Fix this by updating the stations rate control logic when the corresponding channel context changes. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.600d967fe3c9.I48305f25cfcc9c032c77c51396e9e9b882748a86@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Emmanuel Grumbach	189a164d0f	mac80211: don't filter out beacons once we start CSA I hit a bug in which we started a CSA with an action frame, but the AP changed its mind and didn't change the beacon. The CSA wasn't cancelled and we lost the connection. The beacons were ignored because they never changed: they never contained any CSA IE. Because they never changed, the CRC of the beacon didn't change either which made us ignore the beacons instead of processing them. Now what happens is: 1) beacon has CRC X and it is valid. No CSA IE in the beacon 2) as long as beacon's CRC X, don't process their IEs 3) rx action frame with CSA 4) invalidate the beacon's CRC 5) rx beacon, CRC is still X, but now it is invalid 6) process the beacon, detect there is no CSA IE 7) abort CSA Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.83470b8407e6.I739b907598001362744692744be15335436b8351@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Ilan Peer	bbf31e88df	mac80211: Fix calculation of minimal channel width When calculating the minimal channel width for channel context, the current operation Rx channel width of a station was used and not the overall channel width capability of the station, i.e., both for Tx and Rx. Fix ieee80211_get_sta_bw() to use the maximal channel width the station is capable. While at it make the function static. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.4387040b99a0.I74bcf19238f75a5960c4098b10e355123d933281@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Johannes Berg	2dedfe1dbd	mac80211: ignore country element TX power on 6 GHz Updates to the 802.11ax draft are coming that deprecate the country element in favour of the transmit power envelope element, and make the maximum transmit power level field in the triplets reserved, so if we parse them we'd use 0 dBm transmit power. Follow suit and completely ignore the element on 6 GHz for purposes of determining TX power. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.9abf9f6b4f88.Icb6e52af586edcc74f1f0360e8f6fc9ef2bfe8f5@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Johannes Berg	db8ebd06cc	mac80211: use bitfield helpers for BA session action frames Use the appropriate bitfield helpers for encoding and decoding the capability field in the BA session action frames instead of open-coding the shifts/masks. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.0c46e5097cc0.I06e75706770c40b9ba1cabd1f8a78ab7a05c5b73@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Avraham Stern	da3882331a	mac80211: support Rx timestamp calculation for all preamble types Add support for calculating the Rx timestamp for HE frames. Since now all frame types are supported, allow setting the Rx timestamp regardless of the frame type. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.4786559af475.Ia54486bb0a12e5351f9d5c60ef6fcda7c9e7141c@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Johannes Berg	f65607cdbc	mac80211: don't set set TDLS STA bandwidth wider than possible When we set up a TDLS station, we set sta->sta.bandwidth solely based on the capabilities, because the "what's the current bandwidth" check is bypassed and only applied for other types of stations. This leads to the unfortunate scenario that the sta->sta.bandwidth is 160 MHz if both stations support it, but we never actually configure this bandwidth unless the AP is already using 160 MHz; even for wider bandwidth support we only go up to 80 MHz (at least right now.) For iwlwifi, this can also lead to firmware asserts, telling us that we've configured the TX rates for a higher bandwidth than is actually available due to the PHY configuration. For non-TDLS, we check against the interface's requested bandwidth, but we explicitly skip this check for TDLS to cope with the wider BW case. Change this to (a) still limit to the TDLS peer's own chandef, which gets factored into the overall PHY configuration we request from the driver, and (b) limit it to when the TDLS peer is authorized, because it's only factored into the channel context in this case. Fixes: `504871e602` ("mac80211: fix bandwidth computation for TDLS peers") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.fcc7d29c4590.I11f77e9e25ddf871a3c8d5604650c763e2c5887a@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Johannes Berg	3f8a39ff28	mac80211: support driver-based disconnect with reconnect hint Support the driver indicating that a disconnection needs to be performed, and pass through the reconnect hint in this case. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.5c8dab7a22a0.I58459fdf6968b16c90cab9c574f0f04ca22b0c79@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Johannes Berg	3bb02143ff	cfg80211: support immediate reconnect request hint There are cases where it's necessary to disconnect, but an immediate reconnection is desired. Support a hint to userspace that this is the case, by including a new attribute in the deauth or disassoc event. Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.58d33941fb9d.I0e7168c205c7949529c8e3b86f3c9b12c01a7017@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Johannes Berg	a5a55032ea	mac80211: use struct assignment for he_obss_pd Use a struct assignment here, which is clearer than the memcpy() and type-safe as well. Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.2ab3aad7d5fc.Iaca4ee6db651b7de17e4351f4be7973ff8600186@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:05 +01:00
Johannes Berg	539a36ba2f	cfg80211: remove struct ieee80211_he_bss_color We don't really use this struct, we're now using struct cfg80211_he_bss_color instead. Change the one place in mac80211 that's using the old name to use struct assignment instead of memcpy() and thus remove the wrong sizeof while at it. Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201206145305.f6698d97ae4e.Iba2dffcb79c4ab80bde7407609806010b55edfdf@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:04 +01:00
Anant Thazhemadam	2d9463083c	nl80211: validate key indexes for cfg80211_registered_device syzbot discovered a bug in which an OOB access was being made because an unsuitable key_idx value was wrongly considered to be acceptable while deleting a key in nl80211_del_key(). Since we don't know the cipher at the time of deletion, if cfg80211_validate_key_settings() were to be called directly in nl80211_del_key(), even valid keys would be wrongly determined invalid, and deletion wouldn't occur correctly. For this reason, a new function - cfg80211_valid_key_idx(), has been created, to determine if the key_idx value provided is valid or not. cfg80211_valid_key_idx() is directly called in 2 places - nl80211_del_key(), and cfg80211_validate_key_settings(). Reported-by: syzbot+49d4cab497c2142ee170@syzkaller.appspotmail.com Tested-by: syzbot+49d4cab497c2142ee170@syzkaller.appspotmail.com Suggested-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Link: https://lore.kernel.org/r/20201204215825.129879-1-anant.thazhemadam@gmail.com Cc: stable@vger.kernel.org [also disallow IGTK key IDs if no IGTK cipher is supported] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 13:20:04 +01:00
Johannes Berg	669b84134a	cfg80211: include block-tx flag in channel switch started event In the NL80211_CMD_CH_SWITCH_STARTED_NOTIFY event, include the NL80211_ATTR_CH_SWITCH_BLOCK_TX flag attribute if block-tx was requested by the AP. Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.8953ef22cc64.Ifee9cab337a4369938545920ba5590559e91327a@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:59:37 +01:00
Johannes Berg	3660944a37	mac80211: disallow band-switch during CSA If the AP advertises a band switch during CSA, we will not have the right information to continue working with it, since it will likely (have to) change its capabilities and we don't track any capability changes at all. Additionally, we store e.g. supported rates per band, and that information would become invalid. Since this is a fringe scenario, just disconnect explicitly. Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.0e2327107c06.I461adb07704e056b054a4a7c29b80c95a9f56637@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:59:10 +01:00
Ilan Peer	beee246951	cfg80211: Save the regulatory domain when setting custom regulatory When custom regulatory was set, only the channels setting was updated, but the regulatory domain was not saved. Fix it by saving it. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.290fa5c5568a.Ic5732aa64de6ee97ae3578bd5779fc723ba489d1@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:57:24 +01:00
Avraham Stern	c837cbad40	nl80211: always accept scan request with the duration set Accept a scan request with the duration set even if the driver does not support setting the scan dwell. The duration can be used as a hint to the driver, but the driver may use its internal logic for setting the scan dwell. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.9491a12f9226.Ia9c5b24fcefc5ce5592537507243391633a27e5f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:57:11 +01:00
Ilan Peer	b45a19dd7e	cfg80211: Update TSF and TSF BSSID for multi BSS When a new BSS entry is created based on multi BSS IE, the TSF and the TSF BSSID were not updated. Fix it. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.8377d5063827.I6f2011b6017c2ad507c61a3f1ca03b7177a46e32@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:57:02 +01:00
Ayala Beker	d590a125ee	cfg80211: scan PSC channels in case of scan with wildcard SSID In case of scan request with wildcard SSID, or in case of more than one SSID in scan request, need to scan PSC channels even though all the co-located APs found during the legacy bands scan indicated that all the APs in their ESS are co-located, as we might find different networks on the PSC channels. Signed-off-by: Ayala Beker <ayala.beker@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.736415a9ca5d.If5b3578ae85e11a707a5da07e66ba85928ba702c@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:55:16 +01:00
Ilan Peer	3598ae87fe	mac80211: Skip entries with SAE H2E only membership selector When parsing supported rates IE. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.8228e2be791e.I626c93241fef66bc71aa0cb9719aba1b11232cf1@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:54:58 +01:00
Ilan Peer	d6587602c5	cfg80211: Parse SAE H2E only membership selector This extends the support for drivers that rebuild IEs in the FW (same as with HT/VHT/HE). Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.4012647275f3.I1a93ae71c57ef0b6f58f99d47fce919d19d65ff0@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:54:41 +01:00
Johannes Berg	4271d4bde0	mac80211: support MIC error/replay detected counters driver update Support the driver incrementing MIC error and replay detected counters when having detected a bad frame, if it drops it directly instead of relying on mac80211 to do the checks. These are then exposed to userspace, though currently only in some cases and in debugfs. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.fb59be9c6de8.Ife2260887366f585afadd78c983ebea93d2bb54b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:53:48 +01:00
Shaul Triebitz	081e1e7ece	mac80211: he: remove non-bss-conf fields from bss_conf ack_enabled and multi_sta_back_32bit are station capabilities and should not be in the bss_conf structure. Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20201129172929.69a7f7753444.I405c4b5245145e24577512c477f19131d4036489@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:52:24 +01:00
Tom Rix	84674ef4d6	mac80211: remove trailing semicolon in macro definitions The macro uses should have (and already have) the semicolon. Signed-off-by: Tom Rix <trix@redhat.com> Link: https://lore.kernel.org/r/20201127193842.2876355-1-trix@redhat.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:51:55 +01:00
Gustavo A. R. Silva	d7832c7187	nl80211: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/fe5afd456a1244751177e53359d3dd149a63a873.1605896060.git.gustavoars@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:51:33 +01:00
Gustavo A. R. Silva	aaaee2d68a	mac80211: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple break statements instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/1a9c4e8248e76e1361edbe2471a68773d87f0b67.1605896060.git.gustavoars@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:51:04 +01:00
Gustavo A. R. Silva	01c9c0ab35	cfg80211: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/ed94a115106fa9c6df94d09b2a6c5791c618c4f2.1605896059.git.gustavoars@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:50:52 +01:00
Sami Tolvanen	32fc4a9ad5	cfg80211: fix callback type mismatches in wext-compat Instead of casting callback functions to type iw_handler, which trips indirect call checking with Clang's Control-Flow Integrity (CFI), add stub functions with the correct function type for the callbacks. Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201117205902.405316-1-samitolvanen@google.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:50:27 +01:00
Colin Ian King	c7ed0e683d	net: wireless: make a const array static, makes object smaller Don't populate the const array bws on the stack but instead it static. Makes the object code smaller by 80 bytes: Before: text data bss dec hex filename 85694 16865 1216 103775 1955f ./net/wireless/reg.o After: text data bss dec hex filename 85518 16961 1216 103695 1950f ./net/wireless/reg.o (gcc version 10.2.0) Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20201116181636.362729-1-colin.king@canonical.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:50:02 +01:00
Lev Stipakov	36ec144f04	net: mac80211: use core API for updating TX/RX stats Commits `d3fd65484c` ("net: core: add dev_sw_netstats_tx_add") `451b05f413` ("net: netdevice.h: sw_netstats_rx_add helper) have added API to update net device per-cpu TX/RX stats. Use core API instead of ieee80211_tx/rx_stats(). Signed-off-by: Lev Stipakov <lev@openvpn.net> Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/20201113214623.144663-1-lev@openvpn.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:47:26 +01:00
Emmanuel Grumbach	14486c8261	rfkill: add a reason to the HW rfkill state The WLAN device may exist yet not be usable. This can happen when the WLAN device is controllable by both the host and some platform internal component. We need some arbritration that is vendor specific, but when the device is not available for the host, we need to reflect this state towards the user space. Add a reason field to the rfkill object (and event) so that userspace can know why the device is in rfkill: because some other platform component currently owns the device, or because the actual hw rfkill signal is asserted. Capable userspace can now determine the reason for the rfkill and possibly do some negotiation on a side band channel using a proprietary protocol to gain ownership on the device in case the device is owned by some other component. When the host gains ownership on the device, the kernel can remove the RFKILL_HARD_BLOCK_NOT_OWNER reason and the hw rfkill state will be off. Then, the userspace can bring the device up and start normal operation. The rfkill_event structure is enlarged to include the additional byte, it is now 9 bytes long. Old user space will ask to read only 8 bytes so that the kernel can know not to feed them with more data. When the user space writes 8 bytes, new kernels will just read what is present in the file descriptor. This new byte is read only from the userspace standpoint anyway. If a new user space uses an old kernel, it'll ask to read 9 bytes but will get only 8, and it'll know that it didn't get the new state. When it'll write 9 bytes, the kernel will again ignore this new byte which is read only from the userspace standpoint. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Link: https://lore.kernel.org/r/20201104134641.28816-1-emmanuel.grumbach@intel.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2020-12-11 12:47:17 +01:00
David S. Miller	d9838b1d39	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Alexei Starovoitov says: ==================== pull-request: bpf 2020-12-10 The following pull-request contains BPF updates for your net tree. We've added 21 non-merge commits during the last 12 day(s) which contain a total of 21 files changed, 163 insertions(+), 88 deletions(-). The main changes are: 1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei. 2) Fix ring_buffer__poll() return value, from Andrii. 3) Fix race in lwt_bpf, from Cong. 4) Fix test_offload, from Toke. 5) Various xsk fixes. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-10 14:29:30 -08:00
Jakub Kicinski	51e13685bd	rtnetlink: RCU-annotate both dimensions of rtnl_msg_handlers We use rcu_assign_pointer to assign both the table and the entries, but the entries are not marked as __rcu. This generates sparse warnings. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-10 13:35:59 -08:00
Arjun Roy	e0fecb289a	tcp: correctly handle increased zerocopy args struct size A prior patch increased the size of struct tcp_zerocopy_receive but did not update do_tcp_getsockopt() handling to properly account for this. This patch simply reintroduces content erroneously cut from the referenced prior patch that handles the new struct size. Fixes: `18fb76ed53` ("net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.") Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-10 13:10:32 -08:00
Oliver Hartkopp	921ca574cd	can: isotp: add SF_BROADCAST support for functional addressing When CAN_ISOTP_SF_BROADCAST is set in the CAN_ISOTP_OPTS flags the CAN_ISOTP socket is switched into functional addressing mode, where only single frame (SF) protocol data units can be send on the specified CAN interface and the given tp.tx_id after bind(). In opposite to normal and extended addressing this socket does not register a CAN-ID for reception which would be needed for a 1-to-1 ISOTP connection with a segmented bi-directional data transfer. Sending SFs on this socket is therefore a TX-only 'broadcast' operation. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Thomas Wagner <thwa1@web.de> Link: https://lore.kernel.org/r/20201206144731.4609-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2020-12-10 09:31:40 +01:00
Guillaume Nault	7fdd375e38	net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower TCA_FLOWER_KEY_MPLS_OPT_LSE_LABEL is a u32 attribute (MPLS label is 20 bits long). Fixes the following bug: $ tc filter add dev ethX ingress protocol mpls_uc \ flower mpls lse depth 2 label 256 \ action drop $ tc filter show dev ethX ingress filter protocol mpls_uc pref 49152 flower chain 0 filter protocol mpls_uc pref 49152 flower chain 0 handle 0x1 eth_type 8847 mpls lse depth 2 label 0 <-- invalid label 0, should be 256 ... Fixes: `61aec25a6d` ("cls_flower: Support filtering on multiple MPLS Label Stack Entries") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 20:39:38 -08:00
Xie He	6b21c0bb3a	net: x25: Fix handling of Restart Request and Restart Confirmation 1. When the x25 module gets loaded, layer 2 may already be running and connected. In this case, although we are in X25_LINK_STATE_0, we still need to handle the Restart Request received, rather than ignore it. 2. When we are in X25_LINK_STATE_2, we have already sent a Restart Request and is waiting for the Restart Confirmation with t20timer. t20timer will restart itself repeatedly forever so it will always be there, as long as we are in State 2. So we don't need to check x25_t20timer_pending again. Fixes: `d023b2b9cc` ("net/x25: fix restart request/confirm handling") Cc: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Xie He <xie.he.0141@gmail.com> Acked-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:34:25 -08:00
Paolo Abeni	d7b1bfd083	mptcp: be careful on subflows shutdown When the workqueue disposes of the msk, the subflows can still receive some data from the peer after __mptcp_close_ssk() completes. The above could trigger a race between the msk receive path and the msk destruction. Acquiring the mptcp_data_lock() in __mptcp_destroy_sock() will not save the day: the rx path could be reached even after msk destruction completes. Instead use the subflow 'disposable' flag to prevent entering the msk receive path after __mptcp_close_ssk(). Fixes: `e16163b6e2` ("mptcp: refactor shutdown and close") Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:31:58 -08:00
Paolo Abeni	0597d0f8e0	mptcp: plug subflow context memory leak When a MPTCP listener socket is closed with unaccepted children pending, the ULP release callback will be invoked, but nobody will call into __mptcp_close_ssk() on the corresponding subflow. As a consequence, at ULP release time, the 'disposable' flag will be cleared and the subflow context memory will be leaked. This change addresses the issue always freeing the context if the subflow is still in the accept queue at ULP release time. Additionally, this fixes an incorrect code reference in the related comment. Note: this fix leverages the changes introduced by the previous commit. Fixes: `e16163b6e2` ("mptcp: refactor shutdown and close") Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:31:58 -08:00
Paolo Abeni	5b950ff433	mptcp: link MPC subflow into msk only after accept Christoph reported the following splat: WARNING: CPU: 0 PID: 4615 at net/ipv4/inet_connection_sock.c:1031 inet_csk_listen_stop+0x8e8/0xad0 net/ipv4/inet_connection_sock.c:1031 Modules linked in: CPU: 0 PID: 4615 Comm: syz-executor.4 Not tainted 5.9.0 #37 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:inet_csk_listen_stop+0x8e8/0xad0 net/ipv4/inet_connection_sock.c:1031 Code: 03 00 00 00 e8 79 b2 3d ff e9 ad f9 ff ff e8 1f 76 ba fe be 02 00 00 00 4c 89 f7 e8 62 b2 3d ff e9 14 f9 ff ff e8 08 76 ba fe <0f> 0b e9 97 f8 ff ff e8 fc 75 ba fe be 03 00 00 00 4c 89 f7 e8 3f RSP: 0018:ffffc900037f7948 EFLAGS: 00010293 RAX: ffff88810a349c80 RBX: ffff888114ee1b00 RCX: ffffffff827b14cd RDX: 0000000000000000 RSI: ffffffff827b1c38 RDI: 0000000000000005 RBP: ffff88810a2a8000 R08: ffff88810a349c80 R09: fffff520006fef1f R10: 0000000000000003 R11: fffff520006fef1e R12: ffff888114ee2d00 R13: dffffc0000000000 R14: 0000000000000001 R15: ffff888114ee1d68 FS: 00007f2ac1945700(0000) GS:ffff88811b400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd44798bc0 CR3: 0000000109810002 CR4: 0000000000170ef0 Call Trace: __tcp_close+0xd86/0x1110 net/ipv4/tcp.c:2433 __mptcp_close_ssk+0x256/0x430 net/mptcp/protocol.c:1761 __mptcp_destroy_sock+0x49b/0x770 net/mptcp/protocol.c:2127 mptcp_close+0x62d/0x910 net/mptcp/protocol.c:2184 inet_release+0xe9/0x1f0 net/ipv4/af_inet.c:434 __sock_release+0xd2/0x280 net/socket.c:596 sock_close+0x15/0x20 net/socket.c:1277 __fput+0x276/0x960 fs/file_table.c:281 task_work_run+0x109/0x1d0 kernel/task_work.c:151 get_signal+0xe8f/0x1d40 kernel/signal.c:2561 arch_do_signal+0x88/0x1b60 arch/x86/kernel/signal.c:811 exit_to_user_mode_loop kernel/entry/common.c:161 [inline] exit_to_user_mode_prepare+0x9b/0xf0 kernel/entry/common.c:191 syscall_exit_to_user_mode+0x22/0x150 kernel/entry/common.c:266 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f2ac1254469 Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48 RSP: 002b:00007f2ac1944dc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffbf RBX: 000000000069bf00 RCX: 00007f2ac1254469 RDX: 0000000000000000 RSI: 0000000000008982 RDI: 0000000000000003 RBP: 000000000069bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000069bf0c R13: 00007ffeb53f178f R14: 00000000004668b0 R15: 0000000000000003 After commit `0397c6d85f` ("mptcp: keep unaccepted MPC subflow into join list"), the msk's workqueue and/or PM can touch the MPC subflow - and acquire its socket lock - even if it's still unaccepted. If the above event races with the relevant listener socket close, we can end-up with the above splat. This change addresses the issue delaying the MPC socket insertion in conn_list at accept time - that is, partially reverting the blamed commit. We must additionally ensure that mptcp_pm_fully_established() happens after accept() time, or the PM will not be able to handle properly such event - conn_list could be empty otherwise. In the receive path, we check the subflow list node to ensure it is out of the listener queue. Be sure client subflows do not match transiently such condition moving them into the join list earlier at creation time. Since we now have multiple mptcp_pm_fully_established() call sites from different code-paths, said helper can now race with itself. Use an additional PM status bit to avoid multiple notifications. Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/103 Fixes: `0397c6d85f` ("mptcp: keep unaccepted MPC subflow into join list"), Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:31:58 -08:00
Geliang Tang	432d9e74d8	mptcp: use the variable sk instead of open-coding Since the local variable sk has been defined, use it instead of open-coding. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:16 -08:00
Geliang Tang	13ad9f01a2	mptcp: rename add_addr_signal and mptcp_add_addr_status Since the RM_ADDR signal had been reused with add_addr_signal, it's not suitable to call it add_addr_signal or mptcp_add_addr_status. So this patch renamed add_addr_signal to addr_signal, and renamed mptcp_add_addr_status to mptcp_addr_signal_status. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:16 -08:00
Geliang Tang	42842a425a	mptcp: drop rm_addr_signal flag This patch reused add_addr_signal for the RM_ADDR announcing signal, by defining a new ADD_ADDR status named MPTCP_RM_ADDR_SIGNAL. Then the flag rm_addr_signal in PM could be dropped. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:15 -08:00
Geliang Tang	90a4aea8b6	mptcp: print out port and ahmac when receiving ADD_ADDR This patch printed out more debugging information for the ADD_ADDR suboption parsing on the incoming path. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:15 -08:00
Geliang Tang	0f5c9e3f07	mptcp: add port parameter for mptcp_pm_announce_addr This patch added a new parameter 'port' for mptcp_pm_announce_addr. If this parameter is true, we set the MPTCP_ADD_ADDR_PORT bit of the add_addr_signal. That means the announced address is added with a port number. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:15 -08:00
Geliang Tang	fbe0f87ac7	mptcp: send out dedicated packet for ADD_ADDR using port The process is similar to that of the ADD_ADDR IPv6, this patch also sent out a pure ack for the ADD_ADDR using port. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:15 -08:00
Geliang Tang	4a2777a834	mptcp: add the outgoing ADD_ADDR port support This patch added a new add_addr_signal type named MPTCP_ADD_ADDR_PORT, to identify it is an address with port to be added. It also added a new parameter 'port' for both mptcp_add_addr_len and mptcp_pm_add_addr_signal. In mptcp_established_options_add_addr, we check whether the announced address is added with port. If it is, we put this port number to mptcp_out_options's port field. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:15 -08:00
Geliang Tang	2ec72faec8	mptcp: use adding up size to get ADD_ADDR length This patch uses adding up size to get the ADD_ADDR suboption length rather than returning the ADD_ADDR size constants. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:15 -08:00
Geliang Tang	22fb85ffae	mptcp: add port support for ADD_ADDR suboption writing In rfc8684, the length of ADD_ADDR suboption with IPv4 address and port is 18 octets, but mptcp_write_options is 32-bit aligned, so we need to pad it to 20 octets. All the other port related option lengths need to be added up 2 octets similarly. This patch added a new field 'port' in mptcp_out_options. When this field is set with a port number, we need to add up 4 octets for the ADD_ADDR suboption, and put the port number into the suboption. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:15 -08:00
Geliang Tang	e1ef683222	mptcp: unify ADD_ADDR and ADD_ADDR6 suboptions writing The length of ADD_ADDR6 is 12 octets longer than ADD_ADDR. That's the only difference between them. This patch dropped the duplicate code between ADD_ADDR and ADD_ADDR6 suboptions writing, and unify them into one. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:15 -08:00
Geliang Tang	6eb3d1e350	mptcp: unify ADD_ADDR and echo suboptions writing There are two differences between ADD_ADDR suboption and ADD_ADDR echo suboption: The length of the former is 8 octets longer than the length of the latter. The former's echo-flag is 0, and latter's echo-flag is 1. This patch added two local variables, len and echo, to unify ADD_ADDR and ADD_ADDR echo suboptions writing. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 19:02:15 -08:00
David S. Miller	b7e4ba9a91	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Switch to RCU in x_tables to fix possible NULL pointer dereference, from Subash Abhinov Kasiviswanathan. 2) Fix netlink dump of dynset timeouts later than 23 days. 3) Add comment for the indirect serialization of the nft commit mutex with rtnl_mutex. 4) Remove bogus check for confirmed conntrack when matching on the conntrack ID, from Brett Mastbergen. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 18:55:46 -08:00
Zheng Yongjun	a319aedde4	net: rxrpc: convert comma to semicolon Replace a comma between expression statements by a semicolon. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 16:23:07 -08:00
Neal Cardwell	299bcb55ec	tcp: fix cwnd-limited bug for TSO deferral where we send nothing When cwnd is not a multiple of the TSO skb size of NMSS, we can get into persistent scenarios where we have the following sequence: (1) ACK for full-sized skb of NMSS arrives -> tcp_write_xmit() transmit full-sized skb with N*MSS -> move pacing release time forward -> exit tcp_write_xmit() because pacing time is in the future (2) TSQ callback or TCP internal pacing timer fires -> try to transmit next skb, but TSO deferral finds remainder of available cwnd is not big enough to trigger an immediate send now, so we defer sending until the next ACK. (3) repeat... So we can get into a case where we never mark ourselves as cwnd-limited for many seconds at a time, even with bulk/infinite-backlog senders, because: o In case (1) above, every time in tcp_write_xmit() we have enough cwnd to send a full-sized skb, we are not fully using the cwnd (because cwnd is not a multiple of the TSO skb size). So every time we send data, we are not cwnd limited, and so in the cwnd-limited tracking code in tcp_cwnd_validate() we mark ourselves as not cwnd-limited. o In case (2) above, every time in tcp_write_xmit() that we try to transmit the "remainder" of the cwnd but defer, we set the local variable is_cwnd_limited to true, but we do not send any packets, so sent_pkts is zero, so we don't call the cwnd-limited logic to update tp->is_cwnd_limited. Fixes: `ca8a226343` ("tcp: make cwnd-limited checks measurement-based, and gentler") Reported-by: Ingemar Johansson <ingemar.s.johansson@ericsson.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-09 16:15:54 -08:00
Chris Mi	5137d30365	net: flow_offload: Fix memory leak for indirect flow block The offending commit introduces a cleanup callback that is invoked when the driver module is removed to clean up the tunnel device flow block. But it returns on the first iteration of the for loop. The remaining indirect flow blocks will never be freed. Fixes: `1fac52da59` ("net: flow_offload: consolidate indirect flow_block infrastructure") CC: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com>	2020-12-09 16:08:33 -08:00
Wei Wang	8ef44b6fe4	tcp: Retain ECT bits for tos reflection For DCTCP, we have to retain the ECT bits set by the congestion control algorithm on the socket when reflecting syn TOS in syn-ack, in order to make ECN work properly. Fixes: `ac8f1710c1` ("tcp: reflect tos value received in SYN to the socket") Reported-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Wei Wang <weiwan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-09 16:08:23 -08:00
Michal Kubecek	a770bf5156	ethtool: fix stack overflow in ethnl_parse_bitset() Syzbot reported a stack overflow in bitmap_from_arr32() called from ethnl_parse_bitset() when bitset from netlink message is longer than target bitmap length. While ethnl_compact_sanity_checks() makes sure that trailing part is all zeros (i.e. the request does not try to touch bits kernel does not recognize), we also need to cap change_bits to nbits so that we don't try to write past the prepared bitmaps. Fixes: `88db6d1e4f` ("ethtool: add ethnl_parse_bitset() helper") Reported-by: syzbot+9d39fa49d4df294aab93@syzkaller.appspotmail.com Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Link: https://lore.kernel.org/r/3487ee3a98e14cd526f55b6caaa959d2dcbcad9f.1607465316.git.mkubecek@suse.cz Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-09 15:50:38 -08:00
Pablo Neira Ayuso	102e2c0723	net: sched: incorrect Kconfig dependencies on Netfilter modules - NET_ACT_CONNMARK and NET_ACT_CTINFO only require conntrack support. - NET_ACT_IPT only requires NETFILTER_XTABLES symbols, not IP_NF_IPTABLES. After this patch, NET_ACT_IPT becomes consistent with NET_EMATCH_IPT. NET_ACT_IPT dependency on IP_NF_IPTABLES predates Linux-2.6.12-rc2 (initial git repository build). Fixes: `22a5dc0e5e` ("net: sched: Introduce connmark action") Fixes: `24ec483cec` ("net: sched: Introduce act_ctinfo action") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Link: https://lore.kernel.org/r/20201208204707.11268-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-09 15:49:29 -08:00
Oliver Hartkopp	323a391a22	can: isotp: isotp_setsockopt(): block setsockopt on bound sockets The isotp socket can be widely configured in its behaviour regarding addressing types, fill-ups, receive pattern tests and link layer length. Usually all these settings need to be fixed before bind() and can not be changed afterwards. This patch adds a check to enforce the common usage pattern. Fixes: `e057dd3fc2` ("can: add ISO 15765-2:2016 transport protocol") Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Tested-by: Thomas Wagner <thwa1@web.de> Link: https://lore.kernel.org/r/20201203140604.25488-2-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Link: https://lore.kernel.org/r/20201204133508.742120-3-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-09 08:44:15 -08:00
Toke Høiland-Jørgensen	998f172962	xdp: Remove the xdp_attachment_flags_ok() callback Since commit `7f0a838254` ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device"), the XDP program attachment info is now maintained in the core code. This interacts badly with the xdp_attachment_flags_ok() check that prevents unloading an XDP program with different load flags than it was loaded with. In practice, two kinds of failures are seen: - An XDP program loaded without specifying a mode (and which then ends up in driver mode) cannot be unloaded if the program mode is specified on unload. - The dev_xdp_uninstall() hook always calls the driver callback with the mode set to the type of the program but an empty flags argument, which means the flags_ok() check prevents the program from being removed, leading to bpf prog reference leaks. The original reason this check was added was to avoid ambiguity when multiple programs were loaded. With the way the checks are done in the core now, this is quite simple to enforce in the core code, so let's add a check there and get rid of the xdp_attachment_flags_ok() callback entirely. Fixes: `7f0a838254` ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk	2020-12-09 16:27:42 +01:00
Chuck Lever	5e54dafbe0	SUNRPC: Remove XDRBUF_SPARSE_PAGES flag in gss_proxy upcall There's no need to defer allocation of pages for the receive buffer. - This upcall is quite infrequent - gssp_alloc_receive_pages() can allocate the pages with GFP_KERNEL, unlike the transport - gssp_alloc_receive_pages() knows exactly how many pages are needed Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Olga Kornievskaia <kolga@netapp.com>	2020-12-09 09:38:34 -05:00
Roberto Bergantinos Corpas	4b5cff7ed8	sunrpc: clean-up cache downcall We can simplify code around cache_downcall unifying memory allocations using kvmalloc. This has the benefit of getting rid of cache_slow_downcall (and queue_io_mutex), and also matches userland allocation size and limits. Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-12-09 09:38:34 -05:00
Brett Mastbergen	2d94b20b95	netfilter: nft_ct: Remove confirmation check for NFT_CT_ID Since commit `656c8e9cc1` ("netfilter: conntrack: Use consistent ct id hash calculation") the ct id will not change from initialization to confirmation. Removing the confirmation check allows for things like adding an element to a 'typeof ct id' set in prerouting upon reception of the first packet of a new connection, and then being able to reference that set consistently both before and after the connection is confirmed. Fixes: `656c8e9cc1` ("netfilter: conntrack: Use consistent ct id hash calculation") Signed-off-by: Brett Mastbergen <brett.mastbergen@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-12-09 10:31:58 +01:00
Florent Revest	b60da4955f	bpf: Only provide bpf_sock_from_file with CONFIG_NET This moves the bpf_sock_from_file definition into net/core/filter.c which only gets compiled with CONFIG_NET and also moves the helper proto usage next to other tracing helpers that are conditional on CONFIG_NET. This avoids ld: kernel/trace/bpf_trace.o: in function `bpf_sock_from_file': bpf_trace.c:(.text+0xe23): undefined reference to `sock_from_file' When compiling a kernel with BPF and without NET. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Florent Revest <revest@chromium.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/bpf/20201208173623.1136863-1-revest@chromium.org	2020-12-08 18:23:36 -08:00
Eric Dumazet	72d05c00d7	tcp: select sane initial rcvq_space.space for big MSS Before commit `a337531b94` ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS. This is no longer the case, and Hazem Mohamed Abuelfotoh reported that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames. Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd for upper limit of rcvq_space.space computation, while it can select later a smaller value for tp->rcv_ssthresh and tp->window_clamp. ss -temoi on receiver would show : skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596 This means that TCP can not increase its window in tcp_grow_window(), and that DRS can never kick. Fix this by making sure that rcvq_space.space is not bigger than number of bytes that can be held in TCP receive queue. People unable/unwilling to change their kernel can work around this issue by selecting a bigger tcp_rmem[1] value as in : echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem Based on an initial report and patch from Hazem Mohamed Abuelfotoh https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/ Fixes: `a337531b94` ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Fixes: `041a14d267` ("tcp: start receiver buffer autotuning sooner") Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-08 16:27:48 -08:00
Zheng Yongjun	5e359044c1	net: openvswitch: conntrack: simplify the return expression of ovs_ct_limit_get_default_limit() Simplify the return expression. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Reviewed-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-08 16:22:54 -08:00
Zheng Yongjun	8daa76a52d	net: core: devlink: simplify the return expression of devlink_nl_cmd_trap_set_doit() Simplify the return expression. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-08 16:22:54 -08:00
Zheng Yongjun	9faad250ce	net: ipv6: rpl_iptunnel: simplify the return expression of rpl_do_srh() Simplify the return expression. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-08 16:22:54 -08:00
Zheng Yongjun	57b0637d00	net/sched: cls_u32: simplify the return expression of u32_reoffload_knode() Simplify the return expression. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-08 16:22:53 -08:00
Colin Ian King	8354bcbebd	net: sched: fix spelling mistake in Kconfig "trys" -> "tries" There is a spelling mistake in the Kconfig help text. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-08 16:01:56 -08:00
David S. Miller	e1be4b5990	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2020-12-07 Here's the main bluetooth-next pull request for the 5.11 kernel. - Updated Bluetooth entries in MAINTAINERS to include Luiz von Dentz - Added support for Realtek 8822CE and 8852A devices - Added support for MediaTek MT7615E device - Improved workarounds for fake CSR devices - Fix Bluetooth qualification test case L2CAP/COS/CFD/BV-14-C - Fixes for LL Privacy support - Enforce 16 byte encryption key size for FIPS security level - Added new mgmt commands for extended advertising support - Multiple other smaller fixes & improvements Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-08 15:58:49 -08:00
Julian Wiedmann	97f8841e04	net/af_iucv: use DECLARE_SOCKADDR to cast from sockaddr This gets us compile-time size checking. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-08 15:56:53 -08:00
Cengiz Can	0398ba9e5a	net: tipc: prevent possible null deref of link `tipc_node_apply_property` does a null check on a `tipc_link_entry` pointer but also accesses the same pointer out of the null check block. This triggers a warning on Coverity Static Analyzer because we're implying that `e->link` can BE null. Move "Update MTU for node link entry" line into if block to make sure that we're not in a state that `e->link` is null. Signed-off-by: Cengiz Can <cengiz@kernel.wtf> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-08 15:53:41 -08:00
Pablo Neira Ayuso	42f1c27120	netfilter: nftables: comment indirect serialization of commit_mutex with rtnl_mutex Add an explicit comment in the code to describe the indirect serialization of the holders of the commit_mutex with the rtnl_mutex. Commit `90d2723c6d` ("netfilter: nf_tables: do not hold reference on netdevice from preparation phase") already describes this, but a comment in this case is better for reference. Reported-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-12-08 21:53:48 +01:00
Pablo Neira Ayuso	917d80d376	netfilter: nft_dynset: fix timeouts later than 23 days Use nf_msecs_to_jiffies64 and nf_jiffies64_to_msecs as provided by `8e1102d5a1` ("netfilter: nf_tables: support timeouts larger than 23 days"), otherwise ruleset listing breaks. Fixes: `a8b1e36d0d` ("netfilter: nft_dynset: fix element timeout for HZ != 1000") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-12-08 20:42:11 +01:00
Rasmus Villemoes	bdc40a3f4b	net: dsa: print the MTU value that could not be set These warnings become somewhat more informative when they include the MTU value that could not be set and not just the errno. Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Link: https://lore.kernel.org/r/20201205133944.10182-1-rasmus.villemoes@prevas.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-08 11:24:07 -08:00
Björn Töpel	3546b9b8ec	xsk: Validate socket state in xsk_recvmsg, prior touching socket members In AF_XDP the socket state needs to be checked, prior touching the members of the socket. This was not the case for the recvmsg implementation. Fix that by moving the xsk_is_bound() call. Fixes: `45a8668184` ("xsk: Add support for recvmsg()") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201207082008.132263-1-bjorn.topel@gmail.com	2020-12-08 17:11:58 +01:00
Subash Abhinov Kasiviswanathan	cc00bcaa58	netfilter: x_tables: Switch synchronization to RCU When running concurrent iptables rules replacement with data, the per CPU sequence count is checked after the assignment of the new information. The sequence count is used to synchronize with the packet path without the use of any explicit locking. If there are any packets in the packet path using the table information, the sequence count is incremented to an odd value and is incremented to an even after the packet process completion. The new table value assignment is followed by a write memory barrier so every CPU should see the latest value. If the packet path has started with the old table information, the sequence counter will be odd and the iptables replacement will wait till the sequence count is even prior to freeing the old table info. However, this assumes that the new table information assignment and the memory barrier is actually executed prior to the counter check in the replacement thread. If CPU decides to execute the assignment later as there is no user of the table information prior to the sequence check, the packet path in another CPU may use the old table information. The replacement thread would then free the table information under it leading to a use after free in the packet processing context- Unable to handle kernel NULL pointer dereference at virtual address 000000000000008e pc : ip6t_do_table+0x5d0/0x89c lr : ip6t_do_table+0x5b8/0x89c ip6t_do_table+0x5d0/0x89c ip6table_filter_hook+0x24/0x30 nf_hook_slow+0x84/0x120 ip6_input+0x74/0xe0 ip6_rcv_finish+0x7c/0x128 ipv6_rcv+0xac/0xe4 __netif_receive_skb+0x84/0x17c process_backlog+0x15c/0x1b8 napi_poll+0x88/0x284 net_rx_action+0xbc/0x23c __do_softirq+0x20c/0x48c This could be fixed by forcing instruction order after the new table information assignment or by switching to RCU for the synchronization. Fixes: `80055dab5d` ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore") Reported-by: Sean Tranchetti <stranche@codeaurora.org> Reported-by: kernel test robot <lkp@intel.com> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-12-08 12:57:39 +01:00
Jakub Kicinski	819f56bad1	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2020-12-07 1) Sysbot reported fixes for the new 64/32 bit compat layer. From Dmitry Safonov. 2) Fix a memory leak in xfrm_user_policy that was introduced by adding the 64/32 bit compat layer. From Yu Kuai. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: net: xfrm: fix memory leak in xfrm_user_policy() xfrm/compat: Don't allocate memory with __GFP_ZERO xfrm/compat: memset(0) 64-bit padding at right place xfrm/compat: Translate by copying XFRMA_UNSPEC attribute ==================== Link: https://lore.kernel.org/r/20201207093937.2874932-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-07 18:29:54 -08:00
Jianguo Wu	f55628b3e7	mptcp: print new line in mptcp_seq_show() if mptcp isn't in use When do cat /proc/net/netstat, the output isn't append with a new line, it looks like this: [root@localhost ~]# cat /proc/net/netstat ... MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0[root@localhost ~]# This is because in mptcp_seq_show(), if mptcp isn't in use, net->mib.mptcp_statistics is NULL, so it just puts all 0 after "MPTcpExt:", and return, forgot the '\n'. After this patch: [root@localhost ~]# cat /proc/net/netstat ... MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0 [root@localhost ~]# Fixes: `fc518953bc` ("mptcp: add and use MIB counter infrastructure") Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn> Acked-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/142e2fd9-58d9-bb13-fb75-951cccc2331e@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-07 17:45:29 -08:00
Joseph Huang	851d0a73c9	bridge: Fix a deadlock when enabling multicast snooping When enabling multicast snooping, bridge module deadlocks on multicast_lock if 1) IPv6 is enabled, and 2) there is an existing querier on the same L2 network. The deadlock was caused by the following sequence: While holding the lock, br_multicast_open calls br_multicast_join_snoopers, which eventually causes IP stack to (attempt to) send out a Listener Report (in igmp6_join_group). Since the destination Ethernet address is a multicast address, br_dev_xmit feeds the packet back to the bridge via br_multicast_rcv, which in turn calls br_multicast_add_group, which then deadlocks on multicast_lock. The fix is to move the call br_multicast_join_snoopers outside of the critical section. This works since br_multicast_join_snoopers only deals with IP and does not modify any multicast data structures of the bridge, so there's no need to hold the lock. Steps to reproduce: 1. sysctl net.ipv6.conf.all.force_mld_version=1 2. have another querier 3. ip link set dev bridge type bridge mcast_snooping 0 && \ ip link set dev bridge type bridge mcast_snooping 1 < deadlock > A typical call trace looks like the following: [ 936.251495] _raw_spin_lock+0x5c/0x68 [ 936.255221] br_multicast_add_group+0x40/0x170 [bridge] [ 936.260491] br_multicast_rcv+0x7ac/0xe30 [bridge] [ 936.265322] br_dev_xmit+0x140/0x368 [bridge] [ 936.269689] dev_hard_start_xmit+0x94/0x158 [ 936.273876] __dev_queue_xmit+0x5ac/0x7f8 [ 936.277890] dev_queue_xmit+0x10/0x18 [ 936.281563] neigh_resolve_output+0xec/0x198 [ 936.285845] ip6_finish_output2+0x240/0x710 [ 936.290039] __ip6_finish_output+0x130/0x170 [ 936.294318] ip6_output+0x6c/0x1c8 [ 936.297731] NF_HOOK.constprop.0+0xd8/0xe8 [ 936.301834] igmp6_send+0x358/0x558 [ 936.305326] igmp6_join_group.part.0+0x30/0xf0 [ 936.309774] igmp6_group_added+0xfc/0x110 [ 936.313787] __ipv6_dev_mc_inc+0x1a4/0x290 [ 936.317885] ipv6_dev_mc_inc+0x10/0x18 [ 936.321677] br_multicast_open+0xbc/0x110 [bridge] [ 936.326506] br_multicast_toggle+0xec/0x140 [bridge] Fixes: `4effd28c12` ("bridge: join all-snoopers multicast address") Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20201204235628.50653-1-Joseph.Huang@garmin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-07 17:14:43 -08:00
Cong Wang	e3366884b3	lwt_bpf: Replace preempt_disable() with migrate_disable() migrate_disable() is just a wrapper for preempt_disable() in non-RT kernel. It is safe to replace it, and RT kernel will benefit. Note that it is introduced since Feb 2020. Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201205075946.497763-2-xiyou.wangcong@gmail.com	2020-12-07 11:53:40 -08:00
Dongdong Wang	d9054a1ff5	lwt: Disable BH too in run_lwt_bpf() The per-cpu bpf_redirect_info is shared among all skb_do_redirect() and BPF redirect helpers. Callers on RX path are all in BH context, disabling preemption is not sufficient to prevent BH interruption. In production, we observed strange packet drops because of the race condition between LWT xmit and TC ingress, and we verified this issue is fixed after we disable BH. Although this bug was technically introduced from the beginning, that is commit `3a0af8fd61` ("bpf: BPF for lightweight tunnel infrastructure"), at that time call_rcu() had to be call_rcu_bh() to match the RCU context. So this patch may not work well before RCU flavor consolidation has been completed around v5.0. Update the comments above the code too, as call_rcu() is now BH friendly. Signed-off-by: Dongdong Wang <wangdongdong.6@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/bpf/20201205075946.497763-1-xiyou.wangcong@gmail.com	2020-12-07 11:53:39 -08:00
Marcel Holtmann	e6ed8b78ea	Bluetooth: Increment management interface revision Increment the mgmt revision due to the recently added new commands. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:02:00 +02:00
Abhishek Pandit-Subedi	dce0a4be80	Bluetooth: Set missing suspend task bits When suspending, mark SUSPEND_SCAN_ENABLE and SUSPEND_SCAN_DISABLE tasks correctly when either classic or le scanning is modified. Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Signed-off-by: Howard Chung <howardchung@google.com> Reviewed-by: Alain Michaud <alainm@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:01:46 +02:00
Daniel Winkler	4d9b952857	Bluetooth: Change MGMT security info CMD to be more generic For advertising, we wish to know the LE tx power capabilities of the controller in userspace, so this patch edits the Security Info MGMT command to be more generic, such that other various controller capabilities can be included in the EIR data. This change also includes the LE min and max tx power into this newly-named command. The change was tested by manually verifying that the MGMT command returns the tx power range as expected in userspace. Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org> Signed-off-by: Daniel Winkler <danielwinkler@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:01:42 +02:00
Daniel Winkler	7c395ea521	Bluetooth: Query LE tx power on startup Queries tx power via HCI_LE_Read_Transmit_Power command when the hci device is initialized, and stores resulting min/max LE power in hdev struct. If command isn't available (< BT5 support), min/max values both default to HCI_TX_POWER_INVALID. This patch is manually verified by ensuring BT5 devices correctly query and receive controller tx power range. Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org> Signed-off-by: Daniel Winkler <danielwinkler@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:01:38 +02:00
Daniel Winkler	9bf9f4b630	Bluetooth: Use intervals and tx power from mgmt cmds This patch takes the min/max intervals and tx power optionally provided in mgmt interface, stores them in the advertisement struct, and uses them when configuring the hci requests. While tx power is not used if extended advertising is unavailable, software rotation will use the min and max advertising intervals specified by the client. This change is validated manually by ensuring the min/max intervals are propagated to the controller on both hatch (extended advertising) and kukui (no extended advertising) chromebooks, and that tx power is propagated correctly on hatch. These tests are performed with multiple advertisements simultaneously. Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org> Signed-off-by: Daniel Winkler <danielwinkler@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:01:33 +02:00
Daniel Winkler	1241057283	Bluetooth: Break add adv into two mgmt commands This patch adds support for the new advertising add interface, with the first command setting advertising parameters and the second to set advertising data. The set parameters command allows the caller to leave some fields "unset", with a params bitfield defining which params were purposefully set. Unset parameters will be given defaults when calling hci_add_adv_instance. The data passed to the param mgmt command is allowed to be flexible, so in the future if bluetoothd passes a larger structure with new params, the mgmt command will ignore the unknown members at the end. This change has been validated on both hatch (extended advertising) and kukui (no extended advertising) chromebooks running bluetoothd that support this new interface. I ran the following manual tests: - Set several (3) advertisements using modified test_advertisement.py - For each, validate correct data and parameters in btmon trace - Verified both for software rotation and extended adv Automatic test suite also run, testing many (25) scenarios of single and multi-advertising for data/parameter correctness. Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org> Signed-off-by: Daniel Winkler <danielwinkler@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:01:28 +02:00
Daniel Winkler	31aab5c22e	Bluetooth: Add helper to set adv data We wish to handle advertising data separately from advertising parameters in our new MGMT requests. This change adds a helper that allows the advertising data and scan response to be updated for an existing advertising instance. Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org> Signed-off-by: Daniel Winkler <danielwinkler@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:01:25 +02:00
Howard Chung	80af16a3e4	Bluetooth: Add toggle to switch off interleave scan This patch add a configurable parameter to switch off the interleave scan feature. Signed-off-by: Howard Chung <howardchung@google.com> Reviewed-by: Alain Michaud <alainm@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:01:00 +02:00
Howard Chung	3bc615fa93	Bluetooth: Refactor read default sys config for various types Refactor read default system configuration function so that it's capable of returning different types than u16 Signed-off-by: Howard Chung <howardchung@google.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:00:56 +02:00
Howard Chung	422bb17f8a	Bluetooth: Handle active scan case This patch adds code to handle the active scan during interleave scan. The interleave scan will be canceled when users start active scan, and it will be restarted after active scan stopped. Signed-off-by: Howard Chung <howardchung@google.com> Reviewed-by: Alain Michaud <alainm@chromium.org> Reviewed-by: Manish Mandlik <mmandlik@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:00:52 +02:00
Howard Chung	36afe87ac1	Bluetooth: Handle system suspend resume case This patch adds code to handle the system suspension during interleave scan. The interleave scan will be canceled when the system is going to sleep, and will be restarted after waking up. Signed-off-by: Howard Chung <howardchung@google.com> Reviewed-by: Alain Michaud <alainm@chromium.org> Reviewed-by: Manish Mandlik <mmandlik@chromium.org> Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Reviewed-by: Miao-chen Chou <mcchou@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:00:47 +02:00
Howard Chung	c4f1f40816	Bluetooth: Interleave with allowlist scan This patch implements the interleaving between allowlist scan and no-filter scan. It'll be used to save power when at least one monitor is registered and at least one pending connection or one device to be scanned for. The durations of the allowlist scan and the no-filter scan are controlled by MGMT command: Set Default System Configuration. The default values are set randomly for now. Signed-off-by: Howard Chung <howardchung@google.com> Reviewed-by: Alain Michaud <alainm@chromium.org> Reviewed-by: Manish Mandlik <mmandlik@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:00:43 +02:00
Edward Vear	a31489d2a3	Bluetooth: Fix attempting to set RPA timeout when unsupported During controller initialization, an LE Set RPA Timeout command is sent to the controller if supported. However, the value checked to determine if the command is supported is incorrect. Page 1921 of the Bluetooth Core Spec v5.2 shows that bit 2 of octet 35 of the Supported_Commands field corresponds to the LE Set RPA Timeout command, but currently bit 6 of octet 35 is checked. This patch checks the correct value instead. This issue led to the error seen in the following btmon output during initialization of an adapter (rtl8761b) and prevented initialization from completing. < HCI Command: LE Set Resolvable Private Address Timeout (0x08\|0x002e) plen 2 Timeout: 900 seconds > HCI Event: Command Complete (0x0e) plen 4 LE Set Resolvable Private Address Timeout (0x08\|0x002e) ncmd 2 Status: Unsupported Remote Feature / Unsupported LMP Feature (0x1a) = Close Index: 00:E0:4C:6B:E5:03 The error did not appear when running with this patch. Signed-off-by: Edward Vear <edwardvear@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:00:38 +02:00
Luiz Augusto von Dentz	aeeae47d34	Bluetooth: Rename get_adv_instance_scan_rsp This renames get_adv_instance_scan_rsp to adv_instance_is_scannable and make it return a bool since it was not actually properly return the size of the scan response as one could expect. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:00:33 +02:00
Luiz Augusto von Dentz	a76a0d3650	Bluetooth: Fix not sending Set Extended Scan Response Current code is actually failing on the following tests of mgmt-tester because get_adv_instance_scan_rsp_len did not account for flags that cause scan response data to be included resulting in non-scannable instance when in fact it should be scannable. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:00:29 +02:00
Jimmy Wahlberg	5b8ec15d02	Bluetooth: Fix for Bluetooth SIG test L2CAP/COS/CFD/BV-14-C This test case is meant to verify that multiple unknown options is included in the response. BLUETOOTH CORE SPECIFICATION Version 5.2 \| Vol 3, Part A page 1057 'On an unknown option failure (Result=0x0003), the option(s) that contain anoption type field that is not understood by the recipient of the L2CAP_CONFIGURATION_REQ packet shall be included in the L2CAP_CONFIGURATION_RSP packet unless they are hints.' Before this patch: > ACL Data RX: Handle 11 flags 0x02 dlen 24 L2CAP: Configure Request (0x04) ident 18 len 16 Destination CID: 64 Flags: 0x0000 Option: Unknown (0x10) [mandatory] 10 00 11 02 11 00 12 02 12 00 < ACL Data TX: Handle 11 flags 0x00 dlen 17 L2CAP: Configure Response (0x05) ident 18 len 9 Source CID: 64 Flags: 0x0000 Result: Failure - unknown options (0x0003) Option: Unknown (0x10) [mandatory] 12 After this patch: > ACL Data RX: Handle 11 flags 0x02 dlen 24 L2CAP: Configure Request (0x04) ident 5 len 16 Destination CID: 64 Flags: 0x0000 Option: Unknown (0x10) [mandatory] 10 00 11 02 11 00 12 02 12 00 < ACL Data TX: Handle 11 flags 0x00 dlen 23 L2CAP: Configure Response (0x05) ident 5 len 15 Source CID: 64 Flags: 0x0000 Result: Failure - unknown options (0x0003) Option: Unknown (0x10) [mandatory] 10 11 01 11 12 01 12 Signed-off-by: Jimmy Wahlberg <jimmywa@spotify.com> Reviewed-by: Luiz Augusto Von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:00:23 +02:00
Wei Yongjun	f6b8c6b554	Bluetooth: sco: Fix crash when using BT_SNDMTU/BT_RCVMTU option This commit add the invalid check for connected socket, without it will causes the following crash due to sco_pi(sk)->conn being NULL: KASAN: null-ptr-deref in range [0x0000000000000050-0x0000000000000057] CPU: 3 PID: 4284 Comm: test_sco Not tainted 5.10.0-rc3+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014 RIP: 0010:sco_sock_getsockopt+0x45d/0x8e0 Code: 48 c1 ea 03 80 3c 02 00 0f 85 ca 03 00 00 49 8b 9d f8 04 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e b5 03 00 00 8b 43 50 48 8b 0c RSP: 0018:ffff88801bb17d88 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff83a4ecdf RDX: 000000000000000a RSI: ffffc90002fce000 RDI: 0000000000000050 RBP: 1ffff11003762fb4 R08: 0000000000000001 R09: ffff88810e1008c0 R10: ffffffffbd695dcf R11: fffffbfff7ad2bb9 R12: 0000000000000000 R13: ffff888018ff1000 R14: dffffc0000000000 R15: 000000000000000d FS: 00007fb4f76c1700(0000) GS:ffff88811af80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005555e3b7a938 CR3: 00000001117be001 CR4: 0000000000770ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ? sco_skb_put_cmsg+0x80/0x80 ? sco_skb_put_cmsg+0x80/0x80 __sys_getsockopt+0x12a/0x220 ? __ia32_sys_setsockopt+0x150/0x150 ? syscall_enter_from_user_mode+0x18/0x50 ? rcu_read_lock_bh_held+0xb0/0xb0 __x64_sys_getsockopt+0xba/0x150 ? syscall_enter_from_user_mode+0x1d/0x50 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: `0fc1a726f8` ("Bluetooth: sco: new getsockopt options BT_SNDMTU/BT_RCVMTU") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Luiz Augusto Von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 17:00:15 +02:00
Reo Shiseki	353021588c	Bluetooth: fix typo in struct name Signed-off-by: Reo Shiseki <reoshiseki@gmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2020-12-07 16:51:22 +02:00
Xin Long	10c678bd0a	udp: fix the proto value passed to ip_protocol_deliver_rcu for the segments Guillaume noticed that: for segments udp_queue_rcv_one_skb() returns the proto, and it should pass "ret" unmodified to ip_protocol_deliver_rcu(). Otherwize, with a negtive value passed, it will underflow inet_protos. This can be reproduced with IPIP FOU: # ip fou add port 5555 ipproto 4 # ethtool -K eth1 rx-gro-list on Fixes: `cf329aa42b` ("udp: cope with UDP GRO packet misdirection") Reported-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-07 00:32:11 -08:00
Jakub Kicinski	78d6bb584d	This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - update include for min/max helpers, by Sven Eckelmann - add infrastructure and netlink functions for routing algo selection, by Sven Eckelmann (2 patches) - drop deprecated debugfs and sysfs support and obsoleted functionality, by Sven Eckelmann (3 patches) - drop unused include in fragmentation.c, by Simon Wunderlich -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl/KWIkWHHN3QHNpbW9u d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoTrTEACUImdOCWv4+NnEQfChQv6Y3i18 gJABXoOkWLfFpGBUlw/uYzFKpMEWZ0orHig9gucC+rmjNc8veWwAOugJoTPTKQJZ /4yndhM0x39vWex03rdDmyqzCEh1V1Q9VcdEuD6XbJDaK5F4jDu3NQVneOijIkN+ 5PzhlvtUlfe8csykOCOoC9Y5wy82fEhcEvuSq+Z6dU3Cb3EGHtEUtZ4orDkpnnml 7XEcn5C5+OFGlz/ikiszKumTtNK+dmGluOxoyfAzEjQHK7PoTorcXFS2YUoSWeqQ gmYZ56RBqEHjo4eqcaEgcqq5v8cTPCEMCB8UQjAffxrhloRKHRhQOysG1+OnzGA8 IQ2ARHLQCVPVraXF2ixE0D3BvjKmtMmcvZOCXwhCHDajn9jFKAh0+hnInDyv6Fp1 7eUfpHACL9EQDxKWXeQg37X2mk3hHJ+4zgZOYidahVeKbiiexe2heaHTYAbr9rIf 8hvtlgMg4AnwL3IxadrKwsbJ5t7TEPLTInf47hPvpRg7SmthDTcgso4VDwmWgB8W Tlug8/NoXXDCmDhXUpvyi9+idHe0J8xvHl/2xGC7aSsPAbuhqOuefMKr36YhXJTY vBA5Ih5ppylJ8Dzwa0TbonvQbOAinA8YTa6izKxY4e+xTPB2jz/WT2vciEZv2+ig vNIPFrLZ7OFMohCDfQ== =tNKF -----END PGP SIGNATURE----- Merge tag 'batadv-next-pullrequest-20201204' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - update include for min/max helpers, by Sven Eckelmann - add infrastructure and netlink functions for routing algo selection, by Sven Eckelmann (2 patches) - drop deprecated debugfs and sysfs support and obsoleted functionality, by Sven Eckelmann (3 patches) - drop unused include in fragmentation.c, by Simon Wunderlich * tag 'batadv-next-pullrequest-20201204' of git://git.open-mesh.org/linux-merge: batman-adv: Drop unused soft-interface.h include in fragmentation.c batman-adv: Drop legacy code for auto deleting mesh interfaces batman-adv: Drop deprecated debugfs support batman-adv: Drop deprecated sysfs support batman-adv: Allow selection of routing algorithm over rtnetlink batman-adv: Prepare infrastructure for newlink settings batman-adv: Add new include for min/max helpers batman-adv: Start new development cycle ==================== Link: https://lore.kernel.org/r/20201204154631.21063-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-05 15:08:06 -08:00
Bongsu Jeon	bcd684aace	net/nfc/nci: Support NCI 2.x initial sequence implement the NCI 2.x initial sequence to support NCI 2.x NFCC. Since NCI 2.0, CORE_RESET and CORE_INIT sequence have been changed. If NFCEE supports NCI 2.x, then NCI 2.x initial sequence will work. In NCI 1.0, Initial sequence and payloads are as below: (DH) (NFCC) \| -- CORE_RESET_CMD --> \| \| <-- CORE_RESET_RSP -- \| \| -- CORE_INIT_CMD --> \| \| <-- CORE_INIT_RSP -- \| CORE_RESET_RSP payloads are Status, NCI version, Configuration Status. CORE_INIT_CMD payloads are empty. CORE_INIT_RSP payloads are Status, NFCC Features, Number of Supported RF Interfaces, Supported RF Interface, Max Logical Connections, Max Routing table Size, Max Control Packet Payload Size, Max Size for Large Parameters, Manufacturer ID, Manufacturer Specific Information. In NCI 2.0, Initial Sequence and Parameters are as below: (DH) (NFCC) \| -- CORE_RESET_CMD --> \| \| <-- CORE_RESET_RSP -- \| \| <-- CORE_RESET_NTF -- \| \| -- CORE_INIT_CMD --> \| \| <-- CORE_INIT_RSP -- \| CORE_RESET_RSP payloads are Status. CORE_RESET_NTF payloads are Reset Trigger, Configuration Status, NCI Version, Manufacturer ID, Manufacturer Specific Information Length, Manufacturer Specific Information. CORE_INIT_CMD payloads are Feature1, Feature2. CORE_INIT_RSP payloads are Status, NFCC Features, Max Logical Connections, Max Routing Table Size, Max Control Packet Payload Size, Max Data Packet Payload Size of the Static HCI Connection, Number of Credits of the Static HCI Connection, Max NFC-V RF Frame Size, Number of Supported RF Interfaces, Supported RF Interfaces. Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com> Link: https://lore.kernel.org/r/20201202223147.3472-1-bongsu.jeon@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 17:47:35 -08:00
Hoang Le	43fcd906d9	tipc: support 128bit node identity for peer removing We add the support to remove a specific node down with 128bit node identifier, as an alternative to legacy 32-bit node address. example: $tipc peer remove identiy <1001002\|16777777> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Link: https://lore.kernel.org/r/20201203035045.4564-1-hoang.h.le@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 17:40:27 -08:00
Eric Dumazet	905b2032fa	mac80211: mesh: fix mesh_pathtbl_init() error path If tbl_mpp can not be allocated, we call mesh_table_free(tbl_path) while tbl_path rhashtable has not yet been initialized, which causes panics. Simply factorize the rhashtable_init() call into mesh_table_alloc() WARNING: CPU: 1 PID: 8474 at kernel/workqueue.c:3040 __flush_work kernel/workqueue.c:3040 [inline] WARNING: CPU: 1 PID: 8474 at kernel/workqueue.c:3040 __cancel_work_timer+0x514/0x540 kernel/workqueue.c:3136 Modules linked in: CPU: 1 PID: 8474 Comm: syz-executor663 Not tainted 5.10.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__flush_work kernel/workqueue.c:3040 [inline] RIP: 0010:__cancel_work_timer+0x514/0x540 kernel/workqueue.c:3136 Code: 5d c3 e8 bf ae 29 00 0f 0b e9 f0 fd ff ff e8 b3 ae 29 00 0f 0b 43 80 3c 3e 00 0f 85 31 ff ff ff e9 34 ff ff ff e8 9c ae 29 00 <0f> 0b e9 dc fe ff ff 89 e9 80 e1 07 80 c1 03 38 c1 0f 8c 7d fd ff RSP: 0018:ffffc9000165f5a0 EFLAGS: 00010293 RAX: ffffffff814b7064 RBX: 0000000000000001 RCX: ffff888021c80000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff888024039ca0 R08: dffffc0000000000 R09: fffffbfff1dd3e64 R10: fffffbfff1dd3e64 R11: 0000000000000000 R12: 1ffff920002cbebd R13: ffff888024039c88 R14: 1ffff11004807391 R15: dffffc0000000000 FS: 0000000001347880(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000140 CR3: 000000002cc0a000 CR4: 00000000001506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: rhashtable_free_and_destroy+0x25/0x9c0 lib/rhashtable.c:1137 mesh_table_free net/mac80211/mesh_pathtbl.c:69 [inline] mesh_pathtbl_init+0x287/0x2e0 net/mac80211/mesh_pathtbl.c:785 ieee80211_mesh_init_sdata+0x2ee/0x530 net/mac80211/mesh.c:1591 ieee80211_setup_sdata+0x733/0xc40 net/mac80211/iface.c:1569 ieee80211_if_add+0xd5c/0x1cd0 net/mac80211/iface.c:1987 ieee80211_add_iface+0x59/0x130 net/mac80211/cfg.c:125 rdev_add_virtual_intf net/wireless/rdev-ops.h:45 [inline] nl80211_new_interface+0x563/0xb40 net/wireless/nl80211.c:3855 genl_family_rcv_msg_doit net/netlink/genetlink.c:739 [inline] genl_family_rcv_msg net/netlink/genetlink.c:783 [inline] genl_rcv_msg+0xe4e/0x1280 net/netlink/genetlink.c:800 netlink_rcv_skb+0x190/0x3a0 net/netlink/af_netlink.c:2494 genl_rcv+0x24/0x40 net/netlink/genetlink.c:811 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x780/0x930 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x9a8/0xd40 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg net/socket.c:671 [inline] ____sys_sendmsg+0x519/0x800 net/socket.c:2353 ___sys_sendmsg net/socket.c:2407 [inline] __sys_sendmsg+0x2b1/0x360 net/socket.c:2440 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: `60854fd945` ("mac80211: mesh: convert path table to rhashtable") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Link: https://lore.kernel.org/r/20201204162428.2583119-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 17:34:25 -08:00
Wang Hai	bb2da7651a	openvswitch: fix error return code in validate_and_copy_dec_ttl() Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Changing 'return start' to 'return action_start' can fix this bug. Fixes: `69929d4c49` ("net: openvswitch: fix TTL decrement action netlink message format") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20201204114314.1596-1-wanghai38@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 15:43:14 -08:00
Zhang Changzhong	ee4f52a8de	net: bridge: vlan: fix error return code in __vlan_add() Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: `f8ed289fab` ("bridge: vlan: use br_vlan_(get\|put)_master to deal with refcounts") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/1607071737-33875-1-git-send-email-zhangchangzhong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 15:41:06 -08:00
Zhang Changzhong	b410f04eb5	ipv4: fix error return code in rtm_to_fib_config() Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: `d15662682d` ("ipv4: Allow ipv6 gateway with ipv4 routes") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/1607071695-33740-1-git-send-email-zhangchangzhong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 15:38:16 -08:00
Davide Caratti	4eef8b1f36	net/sched: fq_pie: initialize timer earlier in fq_pie_init() with the following tdc testcase: 83be: (qdisc, fq_pie) Create FQ-PIE with invalid number of flows as fq_pie_init() fails, fq_pie_destroy() is called to clean up. Since the timer is not yet initialized, it's possible to observe a splat like this: INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 0 PID: 975 Comm: tc Not tainted 5.10.0-rc4+ #298 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 Call Trace: dump_stack+0x99/0xcb register_lock_class+0x12dd/0x1750 __lock_acquire+0xfe/0x3970 lock_acquire+0x1c8/0x7f0 del_timer_sync+0x49/0xd0 fq_pie_destroy+0x3f/0x80 [sch_fq_pie] qdisc_create+0x916/0x1160 tc_modify_qdisc+0x3c4/0x1630 rtnetlink_rcv_msg+0x346/0x8e0 netlink_unicast+0x439/0x630 netlink_sendmsg+0x719/0xbf0 sock_sendmsg+0xe2/0x110 ____sys_sendmsg+0x5ba/0x890 ___sys_sendmsg+0xe9/0x160 __sys_sendmsg+0xd3/0x170 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 [...] ODEBUG: assert_init not available (active state 0) object type: timer_list hint: 0x0 WARNING: CPU: 0 PID: 975 at lib/debugobjects.c:508 debug_print_object+0x162/0x210 [...] Call Trace: debug_object_assert_init+0x268/0x380 try_to_del_timer_sync+0x6a/0x100 del_timer_sync+0x9e/0xd0 fq_pie_destroy+0x3f/0x80 [sch_fq_pie] qdisc_create+0x916/0x1160 tc_modify_qdisc+0x3c4/0x1630 rtnetlink_rcv_msg+0x346/0x8e0 netlink_rcv_skb+0x120/0x380 netlink_unicast+0x439/0x630 netlink_sendmsg+0x719/0xbf0 sock_sendmsg+0xe2/0x110 ____sys_sendmsg+0x5ba/0x890 ___sys_sendmsg+0xe9/0x160 __sys_sendmsg+0xd3/0x170 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 fix it moving timer_setup() before any failure, like it was done on 'red' with former commit `608b4adab1` ("net_sched: initialize timer earlier in red_init()"). Fixes: `ec97ecf1eb` ("net: sched: add Flow Queue PIE packet scheduler") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/2e78e01c504c633ebdff18d041833cf2e079a3a4.1607020450.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 14:15:01 -08:00
Arjun Roy	94ab9eb9b2	net-zerocopy: Defer vm zap unless actually needed. Zapping pages is required only if we are calling vm_insert_page into a region where pages had previously been mapped. Receive zerocopy allows reusing such regions, and hitherto called zap_page_range() before calling vm_insert_page() in that range. zap_page_range() can also be triggered from userspace with madvise(MADV_DONTNEED). If userspace is configured to call this before reusing a segment, or if there was nothing mapped at this virtual address to begin with, we can avoid calling zap_page_range() under the socket lock. That said, if userspace does not do that, then we are still responsible for calling zap_page_range(). This patch adds a flag that the user can use to hint to the kernel that a zap is not required. If the flag is not set, or if an older user application does not have a flags field at all, then the kernel calls zap_page_range as before. Also, if the flag is set but a zap is still required, the kernel performs that zap as necessary. Thus incorrectly indicating that a zap can be avoided does not change the correctness of operation. It also increases the batchsize for vm_insert_pages and prefetches the page struct for the batch since we're about to bump the refcount. An alternative mechanism could be to not have a flag, assume by default a zap is not needed, and fall back to zapping if needed. However, this would harm performance for older applications for which a zap is necessary, and thus we implement it with an explicit flag so newer applications can opt in. When using RPC-style traffic with medium sized (tens of KB) RPCs, this change yields an efficency improvement of about 30% for QPS/CPU usage. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:53 -08:00
Arjun Roy	0c3936d32f	net-zerocopy: Set zerocopy hint when data is copied Set zerocopy hint, event when falling back to copy, so that the pending data can be efficiently received using zerocopy when possible. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:53 -08:00
Arjun Roy	f21a3c4803	net-zerocopy: Introduce short-circuit small reads. Sometimes, we may call tcp receive zerocopy when inq is 0, or inq < PAGE_SIZE, or inq is generally small enough that it is cheaper to copy rather than remap pages. In these cases, we may want to either return early (inq=0) or attempt to use the provided copy buffer to simply copy the received data. This allows us to save both system call overhead and the latency of acquiring mmap_sem in read mode for cases where it would be useless to do so. This patchset enables this behaviour by: 1. Returning quickly if inq is 0. 2. Attempting to perform a regular copy if a hybrid copybuffer is provided and it is large enough to absorb all available bytes. 3. Return quickly if no such buffer was provided and there are less than PAGE_SIZE bytes available. For small RPC ping-pong workloads, normally we would have 1 getsockopt(), 1 recvmsg() and 1 sendmsg() call per RPC. With this change, we remove the recvmsg() call entirely, reducing the syscall overhead by about 33%. In testing with small (hundreds of bytes) RPC traffic, this yields a syscall reduction of about 33% and an efficiency gain of about 3-5% when defined as QPS/CPU Util. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:53 -08:00
Arjun Roy	936ced4157	net-zerocopy: Fast return if inq < PAGE_SIZE Sometimes, we may call tcp receive zerocopy when inq is 0, or inq < PAGE_SIZE, in which case we cannot remap pages. In this case, simply return the appropriate hint for regular copying without taking mmap_sem. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:53 -08:00
Arjun Roy	98917cf0d6	net-zerocopy: Refactor frag-is-remappable test. Refactor frag-is-remappable test for tcp receive zerocopy. This is part of a patch set that introduces short-circuited hybrid copies for small receive operations, which results in roughly 33% fewer syscalls for small RPC scenarios. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:52 -08:00
Arjun Roy	7fba5309ef	net-zerocopy: Refactor skb frag fast-forward op. Refactor skb frag fast-forwarding for tcp receive zerocopy. This is part of a patch set that introduces short-circuited hybrid copies for small receive operations, which results in roughly 33% fewer syscalls for small RPC scenarios. skb_advance_to_frag(), given a skb and an offset into the skb, iterates from the first frag for the skb until we're at the frag specified by the offset. Assuming the offset provided refers to how many bytes in the skb are already read, the returned frag points to the next frag we may read from, while offset_frag is set to the number of bytes from this frag that we have already read. If frag is not null and offset_frag is equal to 0, then we may be able to map this frag's page into the process address space with vm_insert_page(). However, if offset_frag is not equal to 0, then we cannot do so. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:52 -08:00
Arjun Roy	2cd8116184	net-tcp: Introduce tcp_recvmsg_locked(). Refactor tcp_recvmsg() by splitting it into locked and unlocked portions. Callers already holding the socket lock and not using ERRQUEUE/cmsg/busy polling can simply call tcp_recvmsg_locked(). This is in preparation for a short-circuit copy performed by TCP receive zerocopy for small (< PAGE_SIZE, or otherwise requested by the user) reads. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:52 -08:00
Arjun Roy	18fb76ed53	net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy. When TCP receive zerocopy does not successfully map the entire requested space, it outputs a 'hint' that the caller should recvmsg(). Augment zerocopy to accept a user buffer that it tries to copy this hint into - if it is possible to copy the entire hint, it will do so. This elides a recvmsg() call for received traffic that isn't exactly page-aligned in size. This was tested with RPC-style traffic of arbitrary sizes. Normally, each received message required at least one getsockopt() call, and one recvmsg() call for the remaining unaligned data. With this change, almost all of the recvmsg() calls are eliminated, leading to a savings of about 25%-50% in number of system calls for RPC-style workloads. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-04 13:40:52 -08:00
Florent Revest	a50a85e40c	bpf: Expose bpf_sk_storage_* to iterator programs Iterators are currently used to expose kernel information to userspace over fast procfs-like files but iterators could also be used to manipulate local storage. For example, the task_file iterator could be used to initialize a socket local storage with associations between processes and sockets or to selectively delete local storage values. Signed-off-by: Florent Revest <revest@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: KP Singh <kpsingh@google.com> Link: https://lore.kernel.org/bpf/20201204113609.1850150-3-revest@google.com	2020-12-04 22:32:40 +01:00
Florent Revest	dba4a9256b	net: Remove the err argument from sock_from_file Currently, the sock_from_file prototype takes an "err" pointer that is either not set or set to -ENOTSOCK IFF the returned socket is NULL. This makes the error redundant and it is ignored by a few callers. This patch simplifies the API by letting callers deduce the error based on whether the returned socket is NULL or not. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Florent Revest <revest@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: KP Singh <kpsingh@google.com> Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com	2020-12-04 22:32:40 +01:00

... 2 3 4 5 6 ...

63624 Commits