Commit Graph

69486 Commits

Author SHA1 Message Date
Eric Dumazet
3a5f238f2b ip6_tunnel: fix possible NULL deref in ip6_tnl_xmit
Make sure to test that skb has a dst attached to it.

general protection fault, probably for non-canonical address 0xdffffc0000000011: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
CPU: 0 PID: 32650 Comm: syz-executor.4 Not tainted 5.17.0-rc2-next-20220204-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:ip6_tnl_xmit+0x2140/0x35f0 net/ipv6/ip6_tunnel.c:1127
Code: 4d 85 f6 0f 85 c5 04 00 00 e8 9c b0 66 f9 48 83 e3 fe 48 b8 00 00 00 00 00 fc ff df 48 8d bb 88 00 00 00 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 07 7f 05 e8 11 25 b2 f9 44 0f b6 b3 88 00 00
RSP: 0018:ffffc900141b7310 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc9000c77a000
RDX: 0000000000000011 RSI: ffffffff8811f854 RDI: 0000000000000088
RBP: ffffc900141b7480 R08: 0000000000000000 R09: 0000000000000008
R10: ffffffff8811f846 R11: 0000000000000008 R12: ffffc900141b7548
R13: ffff8880297c6000 R14: 0000000000000000 R15: ffff8880351c8dc0
FS:  00007f9827ba2700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b31322000 CR3: 0000000033a70000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ipxip6_tnl_xmit net/ipv6/ip6_tunnel.c:1386 [inline]
 ip6_tnl_start_xmit+0x71e/0x1830 net/ipv6/ip6_tunnel.c:1435
 __netdev_start_xmit include/linux/netdevice.h:4683 [inline]
 netdev_start_xmit include/linux/netdevice.h:4697 [inline]
 xmit_one net/core/dev.c:3473 [inline]
 dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3489
 __dev_queue_xmit+0x2a24/0x3760 net/core/dev.c:4116
 packet_snd net/packet/af_packet.c:3057 [inline]
 packet_sendmsg+0x2265/0x5460 net/packet/af_packet.c:3084
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:725
 sock_write_iter+0x289/0x3c0 net/socket.c:1061
 call_write_iter include/linux/fs.h:2075 [inline]
 do_iter_readv_writev+0x47a/0x750 fs/read_write.c:726
 do_iter_write+0x188/0x710 fs/read_write.c:852
 vfs_writev+0x1aa/0x630 fs/read_write.c:925
 do_writev+0x27f/0x300 fs/read_write.c:968
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f9828c2d059

Fixes: c1f55c5e04 ("ip6_tunnel: allow routing IPv4 traffic in NBMA mode")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Qing Deng <i@moy.cat>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-09 11:56:19 +00:00
Duoming Zhou
7ec02f5ac8 ax25: fix NPD bug in ax25_disconnect
The ax25_disconnect() in ax25_kill_by_device() is not
protected by any locks, thus there is a race condition
between ax25_disconnect() and ax25_destroy_socket().
when ax25->sk is assigned as NULL by ax25_destroy_socket(),
a NULL pointer dereference bug will occur if site (1) or (2)
dereferences ax25->sk.

ax25_kill_by_device()                | ax25_release()
  ax25_disconnect()                  |   ax25_destroy_socket()
    ...                              |
    if(ax25->sk != NULL)             |     ...
      ...                            |     ax25->sk = NULL;
      bh_lock_sock(ax25->sk); //(1)  |     ...
      ...                            |
      bh_unlock_sock(ax25->sk); //(2)|

This patch moves ax25_disconnect() into lock_sock(), which can
synchronize with ax25_destroy_socket() in ax25_release().

Fail log:
===============================================================
BUG: kernel NULL pointer dereference, address: 0000000000000088
...
RIP: 0010:_raw_spin_lock+0x7e/0xd0
...
Call Trace:
ax25_disconnect+0xf6/0x220
ax25_device_event+0x187/0x250
raw_notifier_call_chain+0x5e/0x70
dev_close_many+0x17d/0x230
rollback_registered_many+0x1f1/0x950
unregister_netdevice_queue+0x133/0x200
unregister_netdev+0x13/0x20
...

Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-09 11:55:02 +00:00
Florian Westphal
5948ed297e netfilter: ctnetlink: use dump structure instead of raw args
netlink_dump structure has a union of 'long args[6]' and a context
buffer as scratch space.

Convert ctnetlink to use a structure, its easier to read than the
raw 'args' usage which comes with no type checks and no readable names.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09 12:07:16 +01:00
Nicolas Dichtel
98eee88b8d nfqueue: enable to set skb->priority
This is a follow up of the previous patch that enables to get
skb->priority. It's now posssible to set it also.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09 12:04:03 +01:00
Pablo Neira Ayuso
23f68d4629 netfilter: nft_cmp: optimize comparison for 16-bytes
Allow up to 16-byte comparisons with a new cmp fast version. Use two
64-bit words and calculate the mask representing the bits to be
compared. Make sure the comparison is 64-bit aligned and avoid
out-of-bound memory access on registers.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09 12:00:28 +01:00
Florian Westphal
7afa38831a netfilter: cttimeout: use option structure
Instead of two exported functions, export a single option structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09 11:56:06 +01:00
Florian Westphal
8dd8678e42 netfilter: ecache: don't use nf_conn spinlock
For updating eache missed value we can use cmpxchg.
This also avoids need to disable BH.

kernel robot reported build failure on v1 because not all arches support
cmpxchg for u16, so extend this to u32.

This doesn't increase struct size, existing padding is used.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09 11:44:03 +01:00
Eric Dumazet
75063c9294 netfilter: xt_socket: fix a typo in socket_mt_destroy()
Calling nf_defrag_ipv4_disable() instead of nf_defrag_ipv6_disable()
was probably not the intent.

I found this by code inspection, while chasing a possible issue in TPROXY.

Fixes: de8c12110a ("netfilter: disable defrag once its no longer needed")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-09 11:06:59 +01:00
Leon Romanovsky
7c76ecd9c9 xfrm: enforce validity of offload input flags
struct xfrm_user_offload has flags variable that received user input,
but kernel didn't check if valid bits were provided. It caused a situation
where not sanitized input was forwarded directly to the drivers.

For example, XFRM_OFFLOAD_IPV6 define that was exposed, was used by
strongswan, but not implemented in the kernel at all.

As a solution, check and sanitize input flags to forward
XFRM_OFFLOAD_INBOUND to the drivers.

Fixes: d77e38e612 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-02-09 09:00:40 +01:00
Oliver Hartkopp
8375dfac4f can: isotp: fix error path in isotp_sendmsg() to unlock wait queue
Commit 43a08c3bda ("can: isotp: isotp_sendmsg(): fix TX buffer concurrent
access in isotp_sendmsg()") introduced a new locking scheme that may render
the userspace application in a locking state when an error is detected.
This issue shows up under high load on simultaneously running isotp channels
with identical configuration which is against the ISO specification and
therefore breaks any reasonable PDU communication anyway.

Fixes: 43a08c3bda ("can: isotp: isotp_sendmsg(): fix TX buffer concurrent access in isotp_sendmsg()")
Link: https://lore.kernel.org/all/20220209073601.25728-1-socketcan@hartkopp.net
Cc: stable@vger.kernel.org
Cc: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-02-09 08:47:47 +01:00
Oliver Hartkopp
7c759040c1 can: isotp: fix potential CAN frame reception race in isotp_rcv()
When receiving a CAN frame the current code logic does not consider
concurrently receiving processes which do not show up in real world
usage.

Ziyang Xuan writes:

The following syz problem is one of the scenarios. so->rx.len is
changed by isotp_rcv_ff() during isotp_rcv_cf(), so->rx.len equals
0 before alloc_skb() and equals 4096 after alloc_skb(). That will
trigger skb_over_panic() in skb_put().

=======================================================
CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 5.16.0-rc8-syzkaller #0
RIP: 0010:skb_panic+0x16c/0x16e net/core/skbuff.c:113
Call Trace:
 <TASK>
 skb_over_panic net/core/skbuff.c:118 [inline]
 skb_put.cold+0x24/0x24 net/core/skbuff.c:1990
 isotp_rcv_cf net/can/isotp.c:570 [inline]
 isotp_rcv+0xa38/0x1e30 net/can/isotp.c:668
 deliver net/can/af_can.c:574 [inline]
 can_rcv_filter+0x445/0x8d0 net/can/af_can.c:635
 can_receive+0x31d/0x580 net/can/af_can.c:665
 can_rcv+0x120/0x1c0 net/can/af_can.c:696
 __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5465
 __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5579

Therefore we make sure the state changes and data structures stay
consistent at CAN frame reception time by adding a spin_lock in
isotp_rcv(). This fixes the issue reported by syzkaller but does not
affect real world operation.

Fixes: e057dd3fc2 ("can: add ISO 15765-2:2016 transport protocol")
Link: https://lore.kernel.org/linux-can/d7e69278-d741-c706-65e1-e87623d9a8e8@huawei.com/T/
Link: https://lore.kernel.org/all/20220208200026.13783-1-socketcan@hartkopp.net
Cc: stable@vger.kernel.org
Reported-by: syzbot+4c63f36709a642f801c5@syzkaller.appspotmail.com
Reported-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-02-09 08:47:25 +01:00
Eric Dumazet
5611a00697 ipmr,ip6mr: acquire RTNL before calling ip[6]mr_free_table() on failure path
ip[6]mr_free_table() can only be called under RTNL lock.

RTNL: assertion failed at net/core/dev.c (10367)
WARNING: CPU: 1 PID: 5890 at net/core/dev.c:10367 unregister_netdevice_many+0x1246/0x1850 net/core/dev.c:10367
Modules linked in:
CPU: 1 PID: 5890 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller-11627-g422ee58dc0ef #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:unregister_netdevice_many+0x1246/0x1850 net/core/dev.c:10367
Code: 0f 85 9b ee ff ff e8 69 07 4b fa ba 7f 28 00 00 48 c7 c6 00 90 ae 8a 48 c7 c7 40 90 ae 8a c6 05 6d b1 51 06 01 e8 8c 90 d8 01 <0f> 0b e9 70 ee ff ff e8 3e 07 4b fa 4c 89 e7 e8 86 2a 59 fa e9 ee
RSP: 0018:ffffc900046ff6e0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff888050f51d00 RSI: ffffffff815fa008 RDI: fffff520008dfece
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff815f3d6e R11: 0000000000000000 R12: 00000000fffffff4
R13: dffffc0000000000 R14: ffffc900046ff750 R15: ffff88807b7dc000
FS:  00007f4ab736e700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fee0b4f8990 CR3: 000000001e7d2000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 mroute_clean_tables+0x244/0xb40 net/ipv6/ip6mr.c:1509
 ip6mr_free_table net/ipv6/ip6mr.c:389 [inline]
 ip6mr_rules_init net/ipv6/ip6mr.c:246 [inline]
 ip6mr_net_init net/ipv6/ip6mr.c:1306 [inline]
 ip6mr_net_init+0x3f0/0x4e0 net/ipv6/ip6mr.c:1298
 ops_init+0xaf/0x470 net/core/net_namespace.c:140
 setup_net+0x54f/0xbb0 net/core/net_namespace.c:331
 copy_net_ns+0x318/0x760 net/core/net_namespace.c:475
 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
 copy_namespaces+0x391/0x450 kernel/nsproxy.c:178
 copy_process+0x2e0c/0x7300 kernel/fork.c:2167
 kernel_clone+0xe7/0xab0 kernel/fork.c:2555
 __do_sys_clone+0xc8/0x110 kernel/fork.c:2672
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4ab89f9059
Code: Unable to access opcode bytes at RIP 0x7f4ab89f902f.
RSP: 002b:00007f4ab736e118 EFLAGS: 00000206 ORIG_RAX: 0000000000000038
RAX: ffffffffffffffda RBX: 00007f4ab8b0bf60 RCX: 00007f4ab89f9059
RDX: 0000000020000280 RSI: 0000000020000270 RDI: 0000000040200000
RBP: 00007f4ab8a5308d R08: 0000000020000300 R09: 0000000020000300
R10: 00000000200002c0 R11: 0000000000000206 R12: 0000000000000000
R13: 00007ffc3977cc1f R14: 00007f4ab736e300 R15: 0000000000022000
 </TASK>

Fixes: f243e5a785 ("ipmr,ip6mr: call ip6mr_free_table() on failure path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <cong.wang@bytedance.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20220208053451.2885398-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:49:52 -08:00
Eric Dumazet
ee403248fa net: remove default_device_exit()
For some reason default_device_ops kept two exit method:

1) default_device_exit() is called for each netns being dismantled in
a cleanup_net() round. This acquires rtnl for each invocation.

2) default_device_exit_batch() is called once with the list of all netns
int the batch, allowing for a single rtnl invocation.

Get rid of the .exit() method to handle the logic from
default_device_exit_batch(), to decrease the number of rtnl acquisition
to one.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:41:35 -08:00
Eric Dumazet
ef0de6696c can: gw: switch cangw_pernet_exit() to batch mode
cleanup_net() is competing with other rtnl users.

Avoiding to acquire rtnl for each netns before calling
cgw_remove_all_jobs() gives chance for cleanup_net()
to progress much faster, holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:41:35 -08:00
Eric Dumazet
696e595f70 ipmr: introduce ipmr_net_exit_batch()
cleanup_net() is competing with other rtnl users.

Avoiding to acquire rtnl for each netns before calling
ipmr_rules_exit() gives chance for cleanup_net()
to progress much faster, holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:41:34 -08:00
Eric Dumazet
e2f736b753 ip6mr: introduce ip6mr_net_exit_batch()
cleanup_net() is competing with other rtnl users.

Avoiding to acquire rtnl for each netns before calling
ip6mr_rules_exit() gives chance for cleanup_net()
to progress much faster, holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:41:34 -08:00
Eric Dumazet
ea3e91666d ipv6: change fib6_rules_net_exit() to batch mode
cleanup_net() is competing with other rtnl users.

fib6_rules_net_exit() seems a good candidate for exit_batch(),
as this gives chance for cleanup_net() to progress much faster,
holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:41:34 -08:00
Eric Dumazet
1c69576461 ipv4: add fib_net_exit_batch()
cleanup_net() is competing with other rtnl users.

Instead of acquiring rtnl at each fib_net_exit() invocation,
add fib_net_exit_batch() so that rtnl is acquired once.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:41:33 -08:00
Eric Dumazet
fea7b20132 nexthop: change nexthop_net_exit() to nexthop_net_exit_batch()
cleanup_net() is competing with other rtnl users.

nexthop_net_exit() seems a good candidate for exit_batch(),
as this gives chance for cleanup_net() to progress much faster,
holding rtnl a bit longer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:41:33 -08:00
Eric Dumazet
e66d117222 ipv6/addrconf: switch to per netns inet6_addr_lst hash table
IPv6 does not scale very well with the number of IPv6 addresses.
It uses a global (shared by all netns) hash table with 256 buckets.

Some functions like addrconf_verify_rtnl() and addrconf_ifdown()
have to iterate all addresses in the hash table.

I have seen addrconf_verify_rtnl() holding the cpu for 10ms or more.

Switch to the per netns hashtable (and spinlock) added
in prior patches.

This considerably speeds up netns dismantle times on hosts
with thousands of netns. This also has an impact
on regular (fast path) IPv6 processing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:41:32 -08:00
Eric Dumazet
8805d13ff1 ipv6/addrconf: use one delayed work per netns
Next step for using per netns inet6_addr_lst
is to have per netns work item to ultimately
call addrconf_verify_rtnl() and addrconf_verify()
with a new 'struct net*' argument.

Everything is still using the global inet6_addr_lst[] table.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:41:32 -08:00
Eric Dumazet
21a216a8fc ipv6/addrconf: allocate a per netns hash table
Add a per netns hash table and a dedicated spinlock,
first step to get rid of the global inet6_addr_lst[] one.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:41:32 -08:00
Eric Dumazet
b2309a71c1 net: add dev->dev_registered_tracker
Convert one dev_hold()/dev_put() pair in register_netdevice()
and unregister_netdevice_many() to dev_hold_track()
and dev_put_track().

This would allow to detect a rogue dev_put() a bit earlier.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08 20:23:20 -08:00
Linus Torvalds
e6251ab455 Merge tag 'nfs-for-5.17-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
 "Stable Fixes:

   - Fix initialization of nfs_client cl_flags

  Other Fixes:

   - Fix performance issues with uncached readdir calls

   - Fix potential pointer dereferences in rpcrdma_ep_create

   - Fix nfs4_proc_get_locations() kernel-doc comment

   - Fix locking during sunrpc sysfs reads

   - Update my email address in the MAINTAINERS file to my new
     kernel.org email"

* tag 'nfs-for-5.17-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: lock against ->sock changing during sysfs read
  MAINTAINERS: Update my email address
  NFS: Fix nfs4_proc_get_locations() kernel-doc comment
  xprtrdma: fix pointer derefs in error cases of rpcrdma_ep_create
  NFS: Fix initialisation of nfs_client cl_flags field
  NFS: Avoid duplicate uncached readdir calls on eof
  NFS: Don't skip directory entries when doing uncached readdir
  NFS: Don't overfill uncached readdir pages
2022-02-08 12:03:07 -08:00
NeilBrown
b49ea673e1 SUNRPC: lock against ->sock changing during sysfs read
->sock can be set to NULL asynchronously unless ->recv_mutex is held.
So it is important to hold that mutex.  Otherwise a sysfs read can
trigger an oops.
Commit 17f09d3f61 ("SUNRPC: Check if the xprt is connected before
handling sysfs reads") appears to attempt to fix this problem, but it
only narrows the race window.

Fixes: 17f09d3f61 ("SUNRPC: Check if the xprt is connected before handling sysfs reads")
Fixes: a8482488a7 ("SUNRPC query transport's source port")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-02-08 09:14:26 -05:00
Dan Aloni
a9c10b5b3b xprtrdma: fix pointer derefs in error cases of rpcrdma_ep_create
If there are failures then we must not leave the non-NULL pointers with
the error value, otherwise `rpcrdma_ep_destroy` gets confused and tries
free them, resulting in an Oops.

Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-02-08 09:14:26 -05:00
Guillaume Nault
32ccf11079 ipv4: Use dscp_t in struct fib_alias
Use the new dscp_t type to replace the fa_tos field of fib_alias. This
ensures ECN bits are ignored and makes the field compatible with the
fc_dscp field of struct fib_config.

Converting old *tos variables and fields to dscp_t allows sparse to
flag incorrect uses of DSCP and ECN bits. This patch is entirely about
type annotation and shouldn't change any existing behaviour.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-07 20:12:46 -08:00
Guillaume Nault
f55fbb6afb ipv4: Reject routes specifying ECN bits in rtm_tos
Use the new dscp_t type to replace the fc_tos field of fib_config, to
ensure IPv4 routes aren't influenced by ECN bits when configured with
non-zero rtm_tos.

Before this patch, IPv4 routes specifying an rtm_tos with some of the
ECN bits set were accepted. However they wouldn't work (never match) as
IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB
lookup (although a few buggy code paths don't).

After this patch, IPv4 routes specifying an rtm_tos with any ECN bit
set is rejected.

Note: IPv6 routes ignore rtm_tos altogether, any rtm_tos is accepted,
but treated as if it were 0.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-07 20:12:46 -08:00
Guillaume Nault
563f8e97e0 ipv4: Stop taking ECN bits into account in fib4-rules
Use the new dscp_t type to replace the tos field of struct fib4_rule,
so that fib4-rules consistently ignore ECN bits.

Before this patch, fib4-rules did accept rules with the high order ECN
bit set (but not the low order one). Also, it relied on its callers
masking the ECN bits of ->flowi4_tos to prevent those from influencing
the result. This was brittle and a few call paths still do the lookup
without masking the ECN bits first.

After this patch fib4-rules only compare the DSCP bits. ECN can't
influence the result anymore, even if the caller didn't mask these
bits. Also, fib4-rules now must have both ECN bits cleared or they will
be rejected.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-07 20:12:46 -08:00
Guillaume Nault
a410a0cf98 ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules
Define a dscp_t type and its appropriate helpers that ensure ECN bits
are not taken into account when handling DSCP.

Use this new type to replace the tclass field of struct fib6_rule, so
that fib6-rules don't get influenced by ECN bits anymore.

Before this patch, fib6-rules didn't make any distinction between the
DSCP and ECN bits. Therefore, rules specifying a DSCP (tos or dsfield
options in iproute2) stopped working as soon a packets had at least one
of its ECN bits set (as a work around one could create four rules for
each DSCP value to match, one for each possible ECN value).

After this patch fib6-rules only compare the DSCP bits. ECN doesn't
influence the result anymore. Also, fib6-rules now must have the ECN
bits cleared or they will be rejected.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-07 20:12:45 -08:00
Stanislav Fomichev
5d1e9f437d bpf: test_run: Fix overflow in bpf_test_finish frags parsing
This place also uses signed min_t and passes this singed int to
copy_to_user (which accepts unsigned argument). I don't think
there is an issue, but let's be consistent.

Fixes: 7855e0db15 ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220204235849.14658-2-sdf@google.com
2022-02-07 18:26:13 -08:00
Stanislav Fomichev
9d63b59d1e bpf: test_run: Fix overflow in xdp frags parsing
When kattr->test.data_size_in > INT_MAX, signed min_t will assign
negative value to data_len. This negative value then gets passed
over to copy_from_user where it is converted to (big) unsigned.

Use unsigned min_t to avoid this overflow.

usercopy: Kernel memory overwrite attempt detected to wrapped address
(offset 0, size 18446612140539162846)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102!
invalid opcode: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 3781 Comm: syz-executor226 Not tainted 4.15.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:usercopy_abort+0xbd/0xbf mm/usercopy.c:102
RSP: 0018:ffff8801e9703a38 EFLAGS: 00010286
RAX: 000000000000006c RBX: ffffffff84fc7040 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff816560a2 RDI: ffffed003d2e0739
RBP: ffff8801e9703a90 R08: 000000000000006c R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff84fc73a0
R13: ffffffff84fc7180 R14: ffffffff84fc7040 R15: ffffffff84fc7040
FS:  00007f54e0bec300(0000) GS:ffff8801f6600000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000280 CR3: 00000001e90ea000 CR4: 00000000003426f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 check_bogus_address mm/usercopy.c:155 [inline]
 __check_object_size mm/usercopy.c:263 [inline]
 __check_object_size.cold+0x8c/0xad mm/usercopy.c:253
 check_object_size include/linux/thread_info.h:112 [inline]
 check_copy_size include/linux/thread_info.h:143 [inline]
 copy_from_user include/linux/uaccess.h:142 [inline]
 bpf_prog_test_run_xdp+0xe57/0x1240 net/bpf/test_run.c:989
 bpf_prog_test_run kernel/bpf/syscall.c:3377 [inline]
 __sys_bpf+0xdf2/0x4a50 kernel/bpf/syscall.c:4679
 SYSC_bpf kernel/bpf/syscall.c:4765 [inline]
 SyS_bpf+0x26/0x50 kernel/bpf/syscall.c:4763
 do_syscall_64+0x21a/0x3e0 arch/x86/entry/common.c:305
 entry_SYSCALL_64_after_hwframe+0x46/0xbb

Fixes: 1c19499825 ("bpf: introduce frags support to bpf_prog_test_run_xdp()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220204235849.14658-1-sdf@google.com
2022-02-07 18:26:13 -08:00
Eric Dumazet
7d9b1b578d ip6mr: fix use-after-free in ip6mr_sk_done()
Apparently addrconf_exit_net() is called before igmp6_net_exit()
and ndisc_net_exit() at netns dismantle time:

 net_namespace: call ip6table_mangle_net_exit()
 net_namespace: call ip6_tables_net_exit()
 net_namespace: call ipv6_sysctl_net_exit()
 net_namespace: call ioam6_net_exit()
 net_namespace: call seg6_net_exit()
 net_namespace: call ping_v6_proc_exit_net()
 net_namespace: call tcpv6_net_exit()
 ip6mr_sk_done sk=ffffa354c78a74c0
 net_namespace: call ipv6_frags_exit_net()
 net_namespace: call addrconf_exit_net()
 net_namespace: call ip6addrlbl_net_exit()
 net_namespace: call ip6_flowlabel_net_exit()
 net_namespace: call ip6_route_net_exit_late()
 net_namespace: call fib6_rules_net_exit()
 net_namespace: call xfrm6_net_exit()
 net_namespace: call fib6_net_exit()
 net_namespace: call ip6_route_net_exit()
 net_namespace: call ipv6_inetpeer_exit()
 net_namespace: call if6_proc_net_exit()
 net_namespace: call ipv6_proc_exit_net()
 net_namespace: call udplite6_proc_exit_net()
 net_namespace: call raw6_exit_net()
 net_namespace: call igmp6_net_exit()
 ip6mr_sk_done sk=ffffa35472b2a180
 ip6mr_sk_done sk=ffffa354c78a7980
 net_namespace: call ndisc_net_exit()
 ip6mr_sk_done sk=ffffa35472b2ab00
 net_namespace: call ip6mr_net_exit()
 net_namespace: call inet6_net_exit()

This was fine because ip6mr_sk_done() would not reach the point decreasing
net->ipv6.devconf_all->mc_forwarding until my patch in ip6mr_sk_done().

To fix this without changing struct pernet_operations ordering,
we can clear net->ipv6.devconf_dflt and net->ipv6.devconf_all
when they are freed from addrconf_exit_net()

BUG: KASAN: use-after-free in instrument_atomic_read include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
BUG: KASAN: use-after-free in ip6mr_sk_done+0x11b/0x410 net/ipv6/ip6mr.c:1578
Read of size 4 at addr ffff88801ff08688 by task kworker/u4:4/963

CPU: 0 PID: 963 Comm: kworker/u4:4 Not tainted 5.17.0-rc2-syzkaller-00650-g5a8fb33e5305 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: netns cleanup_net
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
 __kasan_report mm/kasan/report.c:442 [inline]
 kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
 check_region_inline mm/kasan/generic.c:183 [inline]
 kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
 instrument_atomic_read include/linux/instrumented.h:71 [inline]
 atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
 ip6mr_sk_done+0x11b/0x410 net/ipv6/ip6mr.c:1578
 rawv6_close+0x58/0x80 net/ipv6/raw.c:1201
 inet_release+0x12e/0x280 net/ipv4/af_inet.c:428
 inet6_release+0x4c/0x70 net/ipv6/af_inet6.c:478
 __sock_release net/socket.c:650 [inline]
 sock_release+0x87/0x1b0 net/socket.c:678
 inet_ctl_sock_destroy include/net/inet_common.h:65 [inline]
 igmp6_net_exit+0x6b/0x170 net/ipv6/mcast.c:3173
 ops_exit_list+0xb0/0x170 net/core/net_namespace.c:168
 cleanup_net+0x4ea/0xb00 net/core/net_namespace.c:600
 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
 worker_thread+0x657/0x1110 kernel/workqueue.c:2454
 kthread+0x2e9/0x3a0 kernel/kthread.c:377
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
 </TASK>

Fixes: f2f2325ec7 ("ip6mr: ip6mr_sk_done() can exit early in common cases")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-07 12:07:45 +00:00
Tom Rix
0812beb705 caif: cleanup double word in comment
Replace the second 'so' with 'free'.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-07 12:06:54 +00:00
Eric Dumazet
94fdd7c02a net/smc: use GFP_ATOMIC allocation in smc_pnet_add_eth()
My last patch moved the netdev_tracker_alloc() call to a section
protected by a write_lock().

I should have replaced GFP_KERNEL with GFP_ATOMIC to avoid the infamous:

BUG: sleeping function called from invalid context at include/linux/sched/mm.h:256

Fixes: 28f9222138 ("net/smc: fix ref_tracker issue in smc_pnet_add()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-07 12:02:49 +00:00
Menglong Dong
08d4c0370c net: udp: use kfree_skb_reason() in __udp_queue_rcv_skb()
Replace kfree_skb() with kfree_skb_reason() in __udp_queue_rcv_skb().
Following new drop reasons are introduced:

SKB_DROP_REASON_SOCKET_RCVBUFF
SKB_DROP_REASON_PROTO_MEM

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-07 11:18:49 +00:00
Menglong Dong
1379a92d38 net: udp: use kfree_skb_reason() in udp_queue_rcv_one_skb()
Replace kfree_skb() with kfree_skb_reason() in udp_queue_rcv_one_skb().

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-07 11:18:49 +00:00
Menglong Dong
10580c4791 net: ipv4: use kfree_skb_reason() in ip_protocol_deliver_rcu()
Replace kfree_skb() with kfree_skb_reason() in ip_protocol_deliver_rcu().
Following new drop reasons are introduced:

SKB_DROP_REASON_XFRM_POLICY
SKB_DROP_REASON_IP_NOPROTO

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-07 11:18:49 +00:00
Menglong Dong
c1f166d1f7 net: ipv4: use kfree_skb_reason() in ip_rcv_finish_core()
Replace kfree_skb() with kfree_skb_reason() in ip_rcv_finish_core(),
following drop reasons are introduced:

SKB_DROP_REASON_IP_RPFILTER
SKB_DROP_REASON_UNICAST_IN_L2_MULTICAST

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-07 11:18:49 +00:00
Menglong Dong
33cba42985 net: ipv4: use kfree_skb_reason() in ip_rcv_core()
Replace kfree_skb() with kfree_skb_reason() in ip_rcv_core(). Three new
drop reasons are introduced:

SKB_DROP_REASON_OTHERHOST
SKB_DROP_REASON_IP_CSUM
SKB_DROP_REASON_IP_INHDR

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-07 11:18:49 +00:00
Menglong Dong
2df3041ba3 net: netfilter: use kfree_drop_reason() for NF_DROP
Replace kfree_skb() with kfree_skb_reason() in nf_hook_slow() when
skb is dropped by reason of NF_DROP. Following new drop reasons
are introduced:

SKB_DROP_REASON_NETFILTER_DROP

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-07 11:18:49 +00:00
Eric Dumazet
28f9222138 net/smc: fix ref_tracker issue in smc_pnet_add()
I added the netdev_tracker_alloc() right after ndev was
stored into the newly allocated object:

  new_pe->ndev = ndev;
  if (ndev)
      netdev_tracker_alloc(ndev, &new_pe->dev_tracker, GFP_KERNEL);

But I missed that later, we could end up freeing new_pe,
then calling dev_put(ndev) to release the reference on ndev.

The new_pe->dev_tracker would not be freed.

To solve this issue, move the netdev_tracker_alloc() call to
the point we know for sure new_pe will be kept.

syzbot report (on net-next tree, but the bug is present in net tree)
WARNING: CPU: 0 PID: 6019 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
Modules linked in:
CPU: 0 PID: 6019 Comm: syz-executor.3 Not tainted 5.17.0-rc2-syzkaller-00650-g5a8fb33e5305 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
Code: 1d f4 70 a0 09 31 ff 89 de e8 4d bc 99 fd 84 db 75 e0 e8 64 b8 99 fd 48 c7 c7 20 0c 06 8a c6 05 d4 70 a0 09 01 e8 9e 4e 28 05 <0f> 0b eb c4 e8 48 b8 99 fd 0f b6 1d c3 70 a0 09 31 ff 89 de e8 18
RSP: 0018:ffffc900043b7400 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000040000 RSI: ffffffff815fb318 RDI: fffff52000876e72
RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff815f507e R11: 0000000000000000 R12: 1ffff92000876e85
R13: 0000000000000000 R14: ffff88805c1c6600 R15: 0000000000000000
FS:  00007f1ef6feb700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2d02b000 CR3: 00000000223f4000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __refcount_dec include/linux/refcount.h:344 [inline]
 refcount_dec include/linux/refcount.h:359 [inline]
 ref_tracker_free+0x53f/0x6c0 lib/ref_tracker.c:119
 netdev_tracker_free include/linux/netdevice.h:3867 [inline]
 dev_put_track include/linux/netdevice.h:3884 [inline]
 dev_put_track include/linux/netdevice.h:3880 [inline]
 dev_put include/linux/netdevice.h:3910 [inline]
 smc_pnet_add_eth net/smc/smc_pnet.c:399 [inline]
 smc_pnet_enter net/smc/smc_pnet.c:493 [inline]
 smc_pnet_add+0x5fc/0x15f0 net/smc/smc_pnet.c:556
 genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
 genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
 genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
 netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:725
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2413
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: b60645248a ("net/smc: add net device tracker to struct smc_pnetentry")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-06 11:08:03 +00:00
Eric Dumazet
9c1be1935f net: initialize init_net earlier
While testing a patch that will follow later
("net: add netns refcount tracker to struct nsproxy")
I found that devtmpfs_init() was called before init_net
was initialized.

This is a bug, because devtmpfs_setup() calls
ksys_unshare(CLONE_NEWNS);

This has the effect of increasing init_net refcount,
which will be later overwritten to 1, as part of setup_net(&init_net)

We had too many prior patches [1] trying to work around the root cause.

Really, make sure init_net is in BSS section, and that net_ns_init()
is called earlier at boot time.

Note that another patch ("vfs: add netns refcount tracker
to struct fs_context") also will need net_ns_init() being called
before vfs_caches_init()

As a bonus, this patch saves around 4KB in .data section.

[1]

f8c46cb390 ("netns: do not call pernet ops for not yet set up init_net namespace")
b5082df801 ("net: Initialise init_net.count to 1")
734b65417b ("net: Statically initialize init_net.dev_base_head")

v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-06 11:04:29 +00:00
Juhee Kang
4acc45db71 net: hsr: use hlist_head instead of list_head for mac addresses
Currently, HSR manages mac addresses of known HSR nodes by using list_head.
It takes a lot of time when there are a lot of registered nodes due to
finding specific mac address nodes by using linear search. We can be
reducing the time by using hlist. Thus, this patch moves list_head to
hlist_head for mac addresses and this allows for further improvement of
network performance.

    Condition: registered 10,000 known HSR nodes
    Before:
    # iperf3 -c 192.168.10.1 -i 1 -t 10
    Connecting to host 192.168.10.1, port 5201
    [  5] local 192.168.10.2 port 59442 connected to 192.168.10.1 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.49   sec  3.75 MBytes  21.1 Mbits/sec    0    158 KBytes
    [  5]   1.49-2.05   sec  1.25 MBytes  18.7 Mbits/sec    0    166 KBytes
    [  5]   2.05-3.06   sec  2.44 MBytes  20.3 Mbits/sec   56   16.9 KBytes
    [  5]   3.06-4.08   sec  1.43 MBytes  11.7 Mbits/sec   11   38.0 KBytes
    [  5]   4.08-5.00   sec   951 KBytes  8.49 Mbits/sec    0   56.3 KBytes

    After:
    # iperf3 -c 192.168.10.1 -i 1 -t 10
    Connecting to host 192.168.10.1, port 5201
    [  5] local 192.168.10.2 port 36460 connected to 192.168.10.1 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.00   sec  7.39 MBytes  62.0 Mbits/sec    3    130 KBytes
    [  5]   1.00-2.00   sec  5.06 MBytes  42.4 Mbits/sec   16    113 KBytes
    [  5]   2.00-3.00   sec  8.58 MBytes  72.0 Mbits/sec   42   94.3 KBytes
    [  5]   3.00-4.00   sec  7.44 MBytes  62.4 Mbits/sec    2    131 KBytes
    [  5]   4.00-5.07   sec  8.13 MBytes  63.5 Mbits/sec   38   92.9 KBytes

Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-06 10:55:52 +00:00
Eric Dumazet
5a8fb33e53 skmsg: convert struct sk_msg_sg::copy to a bitmap
We have plans for increasing MAX_SKB_FRAGS, but sk_msg_sg::copy
is currently an unsigned long, limiting MAX_SKB_FRAGS to 30 on 32bit arches.

Convert it to a bitmap, as Jakub suggested.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-05 15:34:47 +00:00
Eric Dumazet
4c6c11ea0f net: refine dev_put()/dev_hold() debugging
We are still chasing some syzbot reports where we think a rogue dev_put()
is called with no corresponding prior dev_hold().
Unfortunately it eats a reference on dev->dev_refcnt taken by innocent
dev_hold_track(), meaning that the refcount saturation splat comes
too late to be useful.

Make sure that 'not tracked' dev_put() and dev_hold() better use
CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure:

Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free()
to be called with a NULL @trackerp parameter, and to use a separate refcount
only to detect too many put() even in the following case:

dev_hold_track(dev, tracker_1, GFP_ATOMIC);
 dev_hold(dev);
 dev_put(dev);
 dev_put(dev); // Should complain loudly here.
dev_put_track(dev, tracker_1); // instead of here

Add clarification about netdev_tracker_alloc() role.

v2: I replaced the dev_put() in linkwatch_do_dev()
    with __dev_put() because callers called netdev_tracker_free().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-05 15:22:45 +00:00
Eric Dumazet
f2f2325ec7 ip6mr: ip6mr_sk_done() can exit early in common cases
In many cases, ip6mr_sk_done() is called while no ipmr socket
has been registered.

This removes 4 rtnl acquisitions per netns dismantle,
with following callers:

igmp6_net_exit(), tcpv6_net_exit(), ndisc_net_exit()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-05 15:20:34 +00:00
Eric Dumazet
145c7a7938 ipv6: make mc_forwarding atomic
This fixes minor data-races in ip6_mc_input() and
batadv_mcast_mla_rtr_flags_softif_get_ipv6()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-05 15:20:34 +00:00
Paolo Abeni
de5a1f3ce4 net: gro: minor optimization for dev_gro_receive()
While inspecting some perf report, I noticed that the compiler
emits suboptimal code for the napi CB initialization, fetching
and storing multiple times the memory for flags bitfield.
This is with gcc 10.3.1, but I observed the same with older compiler
versions.

We can help the compiler to do a nicer work clearing several
fields at once using an u32 alias. The generated code is quite
smaller, with the same number of conditional.

Before:
objdump -t net/core/gro.o | grep " F .text"
0000000000000bb0 l     F .text	0000000000000357 dev_gro_receive

After:
0000000000000bb0 l     F .text	000000000000033c dev_gro_receive

v1  -> v2:
 - use struct_group (Alexander and Alex)

RFC -> v1:
 - use __struct_group to delimit the zeroed area (Alexander)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-05 15:13:52 +00:00
Paolo Abeni
7881453e4a net: gro: avoid re-computing truesize twice on recycle
After commit 5e10da5385 ("skbuff: allow 'slow_gro' for skb
carring sock reference") and commit af352460b4 ("net: fix GRO
skb truesize update") the truesize of the skb with stolen head is
properly updated by the GRO engine, we don't need anymore resetting
it at recycle time.

v1 -> v2:
 - clarify the commit message (Alexander)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-05 15:13:52 +00:00