Commit Graph

57846 Commits

Author SHA1 Message Date
YueHaibing
45d5cb137c net/sched: cbs: Fix error path of cbs_module_init
If register_qdisc fails, we should unregister
netdevice notifier.

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: e0a7683d30 ("net/sched: cbs: fix port_rate miscalculation")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-23 11:32:48 -07:00
Li RongQing
b272a0ad73 netns: restore ops before calling ops_exit_list
ops has been iterated to first element when call pre_exit, and
it needs to restore from save_ops, not save ops to save_ops

Fixes: d7d99872c1 ("netns: add pre_exit method to struct pernet_operations")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22 16:55:36 -07:00
Xin Long
c492d4c74d tipc: change to use register_pernet_device
This patch is to fix a dst defcnt leak, which can be reproduced by doing:

  # ip net a c; ip net a s; modprobe tipc
  # ip net e s ip l a n eth1 type veth peer n eth1 netns c
  # ip net e c ip l s lo up; ip net e c ip l s eth1 up
  # ip net e s ip l s lo up; ip net e s ip l s eth1 up
  # ip net e c ip a a 1.1.1.2/8 dev eth1
  # ip net e s ip a a 1.1.1.1/8 dev eth1
  # ip net e c tipc b e m udp n u1 localip 1.1.1.2
  # ip net e s tipc b e m udp n u1 localip 1.1.1.1
  # ip net d c; ip net d s; rmmod tipc

and it will get stuck and keep logging the error:

  unregister_netdevice: waiting for lo to become free. Usage count = 1

The cause is that a dst is held by the udp sock's sk_rx_dst set on udp rx
path with udp_early_demux == 1, and this dst (eventually holding lo dev)
can't be released as bearer's removal in tipc pernet .exit happens after
lo dev's removal, default_device pernet .exit.

 "There are two distinct types of pernet_operations recognized: subsys and
  device.  At creation all subsys init functions are called before device
  init functions, and at destruction all device exit functions are called
  before subsys exit function."

So by calling register_pernet_device instead to register tipc_net_ops, the
pernet .exit() will be invoked earlier than loopback dev's removal when a
netns is being destroyed, as fou/gue does.

Note that vxlan and geneve udp tunnels don't have this issue, as the udp
sock is released in their device ndo_stop().

This fix is also necessary for tipc dst_cache, which will hold dsts on tx
path and I will introduce in my next patch.

Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22 16:51:47 -07:00
Ido Schimmel
9eee3b4913 ipv6: Error when route does not have any valid nexthops
When user space sends invalid information in RTA_MULTIPATH, the nexthop
list in ip6_route_multipath_add() is empty and 'rt_notif' is set to
NULL.

The code that emits the in-kernel notifications does not check for this
condition, which results in a NULL pointer dereference [1].

Fix this by bailing earlier in the function if the parsed nexthop list
is empty. This is consistent with the corresponding IPv4 code.

v2:
* Check if parsed nexthop list is empty and bail with extack set

[1]
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9190 Comm: syz-executor149 Not tainted 5.2.0-rc5+ #38
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:call_fib6_multipath_entry_notifiers+0xd1/0x1a0
net/ipv6/ip6_fib.c:396
Code: 8b b5 30 ff ff ff 48 c7 85 68 ff ff ff 00 00 00 00 48 c7 85 70 ff ff
ff 00 00 00 00 89 45 88 4c 89 e0 48 c1 e8 03 4c 89 65 80 <42> 80 3c 28 00
0f 85 9a 00 00 00 48 b8 00 00 00 00 00 fc ff df 4d
RSP: 0018:ffff88809788f2c0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 1ffff11012f11e59 RCX: 00000000ffffffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88809788f390 R08: ffff88809788f8c0 R09: 000000000000000c
R10: ffff88809788f5d8 R11: ffff88809788f527 R12: 0000000000000000
R13: dffffc0000000000 R14: ffff88809788f8c0 R15: ffffffff89541d80
FS:  000055555632c880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000080 CR3: 000000009ba7c000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  ip6_route_multipath_add+0xc55/0x1490 net/ipv6/route.c:5094
  inet6_rtm_newroute+0xed/0x180 net/ipv6/route.c:5208
  rtnetlink_rcv_msg+0x463/0xb00 net/core/rtnetlink.c:5219
  netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
  rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:646 [inline]
  sock_sendmsg+0xd7/0x130 net/socket.c:665
  ___sys_sendmsg+0x803/0x920 net/socket.c:2286
  __sys_sendmsg+0x105/0x1d0 net/socket.c:2324
  __do_sys_sendmsg net/socket.c:2333 [inline]
  __se_sys_sendmsg net/socket.c:2331 [inline]
  __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2331
  do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4401f9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc09fd0028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401f9
RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a80
R13: 0000000000401b10 R14: 0000000000000000 R15: 0000000000000000

Reported-by: syzbot+382566d339d52cd1a204@syzkaller.appspotmail.com
Fixes: ebee3cad83 ("ipv6: Add IPv6 multipath notifications for add / replace")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22 16:50:35 -07:00
Ard Biesheuvel
438ac88009 net: fastopen: robustness and endianness fixes for SipHash
Some changes to the TCP fastopen code to make it more robust
against future changes in the choice of key/cookie size, etc.

- Instead of keeping the SipHash key in an untyped u8[] buffer
  and casting it to the right type upon use, use the correct
  type directly. This ensures that the key will appear at the
  correct alignment if we ever change the way these data
  structures are allocated. (Currently, they are only allocated
  via kmalloc so they always appear at the correct alignment)

- Use DIV_ROUND_UP when sizing the u64[] array to hold the
  cookie, so it is always of sufficient size, even if
  TCP_FASTOPEN_COOKIE_MAX is no longer a multiple of 8.

- Drop the 'len' parameter from the tcp_fastopen_reset_cipher()
  function, which is no longer used.

- Add endian swabbing when setting the keys and calculating the hash,
  to ensure that cookie values are the same for a given key and
  source/destination address pair regardless of the endianness of
  the server.

Note that none of these are functional changes wrt the current
state of the code, with the exception of the swabbing, which only
affects big endian systems.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22 16:30:37 -07:00
Marcel Holtmann
693cd8ce3f Bluetooth: Fix regression with minimum encryption key size alignment
When trying to align the minimum encryption key size requirement for
Bluetooth connections, it turns out doing this in a central location in
the HCI connection handling code is not possible.

Original Bluetooth version up to 2.0 used a security model where the
L2CAP service would enforce authentication and encryption.  Starting
with Bluetooth 2.1 and Secure Simple Pairing that model has changed into
that the connection initiator is responsible for providing an encrypted
ACL link before any L2CAP communication can happen.

Now connecting Bluetooth 2.1 or later devices with Bluetooth 2.0 and
before devices are causing a regression.  The encryption key size check
needs to be moved out of the HCI connection handling into the L2CAP
channel setup.

To achieve this, the current check inside hci_conn_security() has been
moved into l2cap_check_enc_key_size() helper function and then called
from four decisions point inside L2CAP to cover all combinations of
Secure Simple Pairing enabled devices and device using legacy pairing
and legacy service security model.

Fixes: d5bb334a8e ("Bluetooth: Align minimum encryption key size for LE and BR/EDR connections")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203643
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-22 09:07:39 -07:00
David S. Miller
92ad6325cb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor SPDX change conflict.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22 08:59:24 -04:00
Linus Torvalds
c356dc4b54 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix leak of unqueued fragments in ipv6 nf_defrag, from Guillaume
    Nault.

 2) Don't access the DDM interface unless the transceiver implements it
    in bnx2x, from Mauro S. M. Rodrigues.

 3) Don't double fetch 'len' from userspace in sock_getsockopt(), from
    JingYi Hou.

 4) Sign extension overflow in lio_core, from Colin Ian King.

 5) Various netem bug fixes wrt. corrupted packets from Jakub Kicinski.

 6) Fix epollout hang in hvsock, from Sunil Muthuswamy.

 7) Fix regression in default fib6_type, from David Ahern.

 8) Handle memory limits in tcp_fragment more appropriately, from Eric
    Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (24 commits)
  tcp: refine memory limit test in tcp_fragment()
  inet: clear num_timeout reqsk_alloc()
  net: mvpp2: debugfs: Add pmap to fs dump
  ipv6: Default fib6_type to RTN_UNICAST when not set
  net: hns3: Fix inconsistent indenting
  net/af_iucv: always register net_device notifier
  net/af_iucv: build proper skbs for HiperTransport
  net/af_iucv: remove GFP_DMA restriction for HiperTransport
  net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6185_g1_vtu_loadpurge()
  hvsock: fix epollout hang from race condition
  net/udp_gso: Allow TX timestamp with UDP GSO
  net: netem: fix use after free and double free with packet corruption
  net: netem: fix backlog accounting for corrupted GSO frames
  net: lio_core: fix potential sign-extension overflow on large shift
  tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb
  ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL
  ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL
  tun: wake up waitqueues after IFF_UP is set
  net: remove duplicate fetch in sock_getsockopt
  tipc: fix issues with early FAILOVER_MSG from peer
  ...
2019-06-21 22:23:35 -07:00
Eric Dumazet
b6653b3629 tcp: refine memory limit test in tcp_fragment()
tcp_fragment() might be called for skbs in the write queue.

Memory limits might have been exceeded because tcp_sendmsg() only
checks limits at full skb (64KB) boundaries.

Therefore, we need to make sure tcp_fragment() wont punish applications
that might have setup very low SO_SNDBUF values.

Fixes: f070ef2ac6 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-21 20:58:42 -04:00
Linus Torvalds
c036f7dabc More NFS client fixes for Linux 5.2
Bugfixes:
 - SUNRPC: Fix a credential refcount leak
 - Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
 - SUNRPC: Fix xps refcount imbalance on the error path
 - NFS4: Only set creation opendata if O_CREAT
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl0NMHQACgkQ18tUv7Cl
 QOujAw//bL4p0ADvCSglqO0ceZcHpHuQVhj10f+tyXfkq3R7WDQmP2qok/+6uZjk
 rYxWh6nCqQqTBSmstf4ouLMjdQTk+zDQf38zSSWGPW0U6rCX4kl/yOyskBriYqnf
 W4U5+hOCvQ8prKdkcjbhYqdvz6qT9bhc5X7kHgtD66CtvyzUDraYw04Mojzodl91
 CPV97rDZbiHBAgZPBFKF+qoTXEL7hQlHUREcR/DZBLV3qrMBsraog1T1OpWRGev0
 OAxVAwZyXFcWDFm7mFcItMA4WcUnZbL77gPNtvZSfgYPxHsytHfH8KOmEG7zP5Yy
 +blko41nvNR2UVOPQ/zTvWj9pkuHQlacDUrlYdgnOmHuGhufj1/xx4Z5C2dkTTnp
 ufjcVXJOqZE7lWyA4IOWapc7gLM3q6I8sUR9w5nLc1meYphNAqHZgK0fTgfbMoo7
 JweUBcgGmNvkMpkX561HBe16ENKvcgQQp666VsnTTI/BPZ/BBdlayKGBbxA3j22F
 znF4gwznvQ0jVxtlNsibzSwM8GQbc6UM5fGPM+atlPJEzban0waQaIbCW347DViZ
 fTXP2NQmvH1x+YNgx6RTRwVBWWT02u/ijQHf7+NvwWLWTKb9OXwQVjlnWHu/E/Qi
 MpNXxtBTT6y+DG09VNV9CYwmAsULnlRQWe9RNBfMz+4sjLCyIjg=
 =+z30
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.2-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull more NFS client fixes from Anna Schumaker:
 "These are mostly refcounting issues that people have found recently.
  The revert fixes a suspend recovery performance issue.

   - SUNRPC: Fix a credential refcount leak

   - Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"

   - SUNRPC: Fix xps refcount imbalance on the error path

   - NFS4: Only set creation opendata if O_CREAT"

* tag 'nfs-for-5.2-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Fix a credential refcount leak
  Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
  net :sunrpc :clnt :Fix xps refcount imbalance on the error path
  NFS4: Only set creation opendata if O_CREAT
2019-06-21 13:45:41 -07:00
Trond Myklebust
19d55046cd SUNRPC: Fix a credential refcount leak
All callers of __rpc_clone_client() pass in a value for args->cred,
meaning that the credential gets assigned and referenced in
the call to rpc_new_client().

Reported-by: Ido Schimmel <idosch@idosch.org>
Fixes: 79caa5fad4 ("SUNRPC: Cache cred of process creating the rpc_client")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-06-21 14:45:09 -04:00
Anna Schumaker
502980e84e Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
Jon Hunter reports:
  "I have been noticing intermittent failures with a system suspend test on
   some of our machines that have a NFS mounted root file-system. Bisecting
   this issue points to your commit 431235818b ("SUNRPC: Declare RPC
   timers as TIMER_DEFERRABLE") and reverting this on top of v5.2-rc3 does
   appear to resolve the problem.

   The cause of the suspend failure appears to be a long delay observed
   sometimes when resuming from suspend, and this is causing our test to
   timeout."

This reverts commit 431235818b.

Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-06-21 14:43:42 -04:00
Lin Yi
b962261484 net :sunrpc :clnt :Fix xps refcount imbalance on the error path
rpc_clnt_add_xprt take a reference to struct rpc_xprt_switch, but forget
to release it before return, may lead to a memory leak.

Signed-off-by: Lin Yi <teroincn@163.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-06-21 14:43:35 -04:00
Linus Torvalds
c884d8ac7f SPDX update for 5.2-rc6
Another round of SPDX updates for 5.2-rc6
 
 Here is what I am guessing is going to be the last "big" SPDX update for
 5.2.  It contains all of the remaining GPLv2 and GPLv2+ updates that
 were "easy" to determine by pattern matching.  The ones after this are
 going to be a bit more difficult and the people on the spdx list will be
 discussing them on a case-by-case basis now.
 
 Another 5000+ files are fixed up, so our overall totals are:
 	Files checked:            64545
 	Files with SPDX:          45529
 
 Compared to the 5.1 kernel which was:
 	Files checked:            63848
 	Files with SPDX:          22576
 This is a huge improvement.
 
 Also, we deleted another 20000 lines of boilerplate license crud, always
 nice to see in a diffstat.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXQyQYA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymnGQCghETUBotn1p3hTjY56VEs6dGzpHMAnRT0m+lv
 kbsjBGEJpLbMRB2krnaU
 =RMcT
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx

Pull still more SPDX updates from Greg KH:
 "Another round of SPDX updates for 5.2-rc6

  Here is what I am guessing is going to be the last "big" SPDX update
  for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates
  that were "easy" to determine by pattern matching. The ones after this
  are going to be a bit more difficult and the people on the spdx list
  will be discussing them on a case-by-case basis now.

  Another 5000+ files are fixed up, so our overall totals are:
	Files checked:            64545
	Files with SPDX:          45529

  Compared to the 5.1 kernel which was:
	Files checked:            63848
	Files with SPDX:          22576

  This is a huge improvement.

  Also, we deleted another 20000 lines of boilerplate license crud,
  always nice to see in a diffstat"

* tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485
  ...
2019-06-21 09:58:42 -07:00
Stephen Suryaputra
dbb5281a1f netfilter: nf_tables: add support for matching IPv4 options
This is the kernel change for the overall changes with this description:
Add capability to have rules matching IPv4 options. This is developed
mainly to support dropping of IP packets with loose and/or strict source
route route options.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-21 18:35:51 +02:00
Julian Anastasov
cf47a0b882 ipvs: defer hook registration to avoid leaks
syzkaller reports for memory leak when registering hooks [1]

As we moved the nf_unregister_net_hooks() call into
__ip_vs_dev_cleanup(), defer the nf_register_net_hooks()
call, so that hooks are allocated and freed from same
pernet_operations (ipvs_core_dev_ops).

[1]
BUG: memory leak
unreferenced object 0xffff88810acd8a80 (size 96):
 comm "syz-executor073", pid 7254, jiffies 4294950560 (age 22.250s)
 hex dump (first 32 bytes):
   02 00 00 00 00 00 00 00 50 8b bb 82 ff ff ff ff  ........P.......
   00 00 00 00 00 00 00 00 00 77 bb 82 ff ff ff ff  .........w......
 backtrace:
   [<0000000013db61f1>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
   [<0000000013db61f1>] slab_post_alloc_hook mm/slab.h:439 [inline]
   [<0000000013db61f1>] slab_alloc_node mm/slab.c:3269 [inline]
   [<0000000013db61f1>] kmem_cache_alloc_node_trace+0x15b/0x2a0 mm/slab.c:3597
   [<000000001a27307d>] __do_kmalloc_node mm/slab.c:3619 [inline]
   [<000000001a27307d>] __kmalloc_node+0x38/0x50 mm/slab.c:3627
   [<0000000025054add>] kmalloc_node include/linux/slab.h:590 [inline]
   [<0000000025054add>] kvmalloc_node+0x4a/0xd0 mm/util.c:431
   [<0000000050d1bc00>] kvmalloc include/linux/mm.h:637 [inline]
   [<0000000050d1bc00>] kvzalloc include/linux/mm.h:645 [inline]
   [<0000000050d1bc00>] allocate_hook_entries_size+0x3b/0x60 net/netfilter/core.c:61
   [<00000000e8abe142>] nf_hook_entries_grow+0xae/0x270 net/netfilter/core.c:128
   [<000000004b94797c>] __nf_register_net_hook+0x9a/0x170 net/netfilter/core.c:337
   [<00000000d1545cbc>] nf_register_net_hook+0x34/0xc0 net/netfilter/core.c:464
   [<00000000876c9b55>] nf_register_net_hooks+0x53/0xc0 net/netfilter/core.c:480
   [<000000002ea868e0>] __ip_vs_init+0xe8/0x170 net/netfilter/ipvs/ip_vs_core.c:2280
   [<000000002eb2d451>] ops_init+0x4c/0x140 net/core/net_namespace.c:130
   [<000000000284ec48>] setup_net+0xde/0x230 net/core/net_namespace.c:316
   [<00000000a70600fa>] copy_net_ns+0xf0/0x1e0 net/core/net_namespace.c:439
   [<00000000ff26c15e>] create_new_namespaces+0x141/0x2a0 kernel/nsproxy.c:107
   [<00000000b103dc79>] copy_namespaces+0xa1/0xe0 kernel/nsproxy.c:165
   [<000000007cc008a2>] copy_process.part.0+0x11fd/0x2150 kernel/fork.c:2035
   [<00000000c344af7c>] copy_process kernel/fork.c:1800 [inline]
   [<00000000c344af7c>] _do_fork+0x121/0x4f0 kernel/fork.c:2369

Reported-by: syzbot+722da59ccb264bc19910@syzkaller.appspotmail.com
Fixes: 719c7d563c ("ipvs: Fix use-after-free in ip_vs_in")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-21 17:44:01 +02:00
Fernando Fernandez Mancera
f76c7bfca4 netfilter: synproxy: fix manual bump of the reference counter
This operation is handled by nf_synproxy_ipv4_init() now.

Fixes: d7f9b2f18e ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-21 17:30:23 +02:00
wenxu
2909946235 netfilter: bridge: Fix non-untagged fragment packet
ip netns exec ns1 ip a a dev eth0 10.0.0.7/24
ip netns exec ns2 ip link a link eth0 name vlan type vlan id 200
ip netns exec ns2 ip a a dev vlan 10.0.0.8/24

ip l add dev br0 type bridge vlan_filtering 1
brctl addif br0 veth1
brctl addif br0 veth2

bridge vlan add dev veth1 vid 200 pvid untagged
bridge vlan add dev veth2 vid 200

A two fragment packet sent from ns2 contains the vlan tag 200.  In the
bridge conntrack, this packet will defrag to one skb with fraglist.
When the packet is forwarded to ns1 through veth1, the first skb vlan
tag will be cleared by the "untagged" flags. But the vlan tag in the
second skb is still tagged, so the second fragment ends up with tag 200
to ns1. So if the first fragment packet doesn't contain the vlan tag,
all of the remain should not contain vlan tag.

Fixes: 3c171f496e ("netfilter: bridge: add connection tracking system")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-21 17:21:12 +02:00
Greg Kroah-Hartman
6987738dfa 6lowpan: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Because we don't care if debugfs works or not, this trickles back a bit
so we can clean things up by making some functions return void instead
of an error value that is never going to fail.

Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-wpan@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-21 15:46:32 +02:00
Arnd Bergmann
597179b0ba ipsec: select crypto ciphers for xfrm_algo
kernelci.org reports failed builds on arc because of what looks
like an old missed 'select' statement:

net/xfrm/xfrm_algo.o: In function `xfrm_probe_algs':
xfrm_algo.c:(.text+0x1e8): undefined reference to `crypto_has_ahash'

I don't see this in randconfig builds on other architectures, but
it's fairly clear we want to select the hash code for it, like we
do for all its other users. As Herbert points out, CRYPTO_BLKCIPHER
is also required even though it has not popped up in build tests.

Fixes: 17bc197022 ("ipsec: Use skcipher and ahash when probing algorithms")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-21 14:48:31 +02:00
Ming Lei
4635873c56 scsi: lib/sg_pool.c: improve APIs for allocating sg pool
sg_alloc_table_chained() currently allows the caller to provide one
preallocated SGL and returns if the requested number isn't bigger than
size of that SGL. This is used to inline an SGL for an IO request.

However, scattergather code only allows that size of the 1st preallocated
SGL to be SG_CHUNK_SIZE(128). This means a substantial amount of memory
(4KB) is claimed for the SGL for each IO request. If the I/O is small, it
would be prudent to allocate a smaller SGL.

Introduce an extra parameter to sg_alloc_table_chained() and
sg_free_table_chained() for specifying size of the preallocated SGL.

Both __sg_free_table() and __sg_alloc_table() assume that each SGL has the
same size except for the last one.  Change the code to allow both functions
to accept a variable size for the 1st preallocated SGL.

[mkp: attempted to clarify commit desc]

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: netdev@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-06-20 15:21:33 -04:00
Amir Goldstein
a35d632c72 rpc_pipefs: call fsnotify_{unlink,rmdir}() hooks
This will allow generating fsnotify delete events after the
fsnotify_nameremove() hook is removed from d_delete().

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-06-20 14:46:04 +02:00
Christian Brauner
7e6daf50e1 netfilter: bridge: prevent UAF in brnf_exit_net()
Prevent a UAF in brnf_exit_net().

When unregister_net_sysctl_table() is called the ctl_hdr pointer will
obviously be freed and so accessing it righter after is invalid. Fix
this by stashing a pointer to the table we want to free before we
unregister the sysctl header.

Note that syzkaller falsely chased this down to the drm tree so the
Fixes tag that syzkaller requested would be wrong. This commit uses a
different but the correct Fixes tag.

/* Splat */

BUG: KASAN: use-after-free in br_netfilter_sysctl_exit_net
net/bridge/br_netfilter_hooks.c:1121 [inline]
BUG: KASAN: use-after-free in brnf_exit_net+0x38c/0x3a0
net/bridge/br_netfilter_hooks.c:1141
Read of size 8 at addr ffff8880a4078d60 by task kworker/u4:4/8749

CPU: 0 PID: 8749 Comm: kworker/u4:4 Not tainted 5.2.0-rc5-next-20190618 #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
01/01/2011
Workqueue: netns cleanup_net
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351
 __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 br_netfilter_sysctl_exit_net net/bridge/br_netfilter_hooks.c:1121 [inline]
 brnf_exit_net+0x38c/0x3a0 net/bridge/br_netfilter_hooks.c:1141
 ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154
 cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

Allocated by task 11374:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
 __do_kmalloc mm/slab.c:3645 [inline]
 __kmalloc+0x15c/0x740 mm/slab.c:3654
 kmalloc include/linux/slab.h:552 [inline]
 kzalloc include/linux/slab.h:743 [inline]
 __register_sysctl_table+0xc7/0xef0 fs/proc/proc_sysctl.c:1327
 register_net_sysctl+0x29/0x30 net/sysctl_net.c:121
 br_netfilter_sysctl_init_net net/bridge/br_netfilter_hooks.c:1105 [inline]
 brnf_init_net+0x379/0x6a0 net/bridge/br_netfilter_hooks.c:1126
 ops_init+0xb3/0x410 net/core/net_namespace.c:130
 setup_net+0x2d3/0x740 net/core/net_namespace.c:316
 copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:103
 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:202
 ksys_unshare+0x444/0x980 kernel/fork.c:2822
 __do_sys_unshare kernel/fork.c:2890 [inline]
 __se_sys_unshare kernel/fork.c:2888 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:2888
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3417 [inline]
 kfree+0x10a/0x2c0 mm/slab.c:3746
 __rcu_reclaim kernel/rcu/rcu.h:215 [inline]
 rcu_do_batch kernel/rcu/tree.c:2092 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline]
 rcu_core+0xcc7/0x1500 kernel/rcu/tree.c:2291
 __do_softirq+0x25c/0x94c kernel/softirq.c:292

The buggy address belongs to the object at ffff8880a4078d40
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 32 bytes inside of
 512-byte region [ffff8880a4078d40, ffff8880a4078f40)
The buggy address belongs to the page:
page:ffffea0002901e00 refcount:1 mapcount:0 mapping:ffff8880aa400a80
index:0xffff8880a40785c0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea0001d636c8 ffffea0001b07308 ffff8880aa400a80
raw: ffff8880a40785c0 ffff8880a40780c0 0000000100000004 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8880a4078c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880a4078c80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> ffff8880a4078d00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                       ^
 ffff8880a4078d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880a4078e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Reported-by: syzbot+43a3fa52c0d9c5c94f41@syzkaller.appspotmail.com
Fixes: 22567590b2 ("netfilter: bridge: namespace bridge netfilter sysctls")
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20 12:23:33 +02:00
Pablo Neira Ayuso
22f2efd337 netfilter: synproxy: use nf_cookie_v6_check() from core
This helper function is never used and it is intended to avoid a direct
dependency with the ipv6 module.

Fixes: d7f9b2f18e ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20 12:04:17 +02:00
Arnd Bergmann
8527fa6cc6 netfilter: synproxy: fix building syncookie calls
When either CONFIG_IPV6 or CONFIG_SYN_COOKIES are disabled, the kernel
fails to build:

include/linux/netfilter_ipv6.h:180:9: error: implicit declaration of function '__cookie_v6_init_sequence'
      [-Werror,-Wimplicit-function-declaration]
        return __cookie_v6_init_sequence(iph, th, mssp);
include/linux/netfilter_ipv6.h:194:9: error: implicit declaration of function '__cookie_v6_check'
      [-Werror,-Wimplicit-function-declaration]
        return __cookie_v6_check(iph, th, cookie);
net/ipv6/netfilter.c:237:26: error: use of undeclared identifier '__cookie_v6_init_sequence'; did you mean 'cookie_init_sequence'?
net/ipv6/netfilter.c:238:21: error: use of undeclared identifier '__cookie_v6_check'; did you mean '__cookie_v4_check'?

Fix the IS_ENABLED() checks to match the function declaration
and definitions for these.

Fixes: 3006a5224f ("netfilter: synproxy: remove module dependency on IPv6 SYNPROXY")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-20 11:59:36 +02:00
Ard Biesheuvel
4be297016c net/lib80211: move TKIP handling to ARC4 library code
The crypto API abstraction is not very useful for invoking ciphers
directly, especially in the case of arc4, which only has a generic
implementation in C. So let's invoke the library code directly.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-20 14:18:33 +08:00
Ard Biesheuvel
af1f3d327d net/lib80211: move WEP handling to ARC4 library code
The crypto API abstraction is not very useful for invoking ciphers
directly, especially in the case of arc4, which only has a generic
implementation in C. So let's invoke the library code directly.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-20 14:18:33 +08:00
Ard Biesheuvel
5fdb373570 net/mac80211: move WEP handling to ARC4 library interface
The WEP code in the mac80211 subsystem currently uses the crypto
API to access the arc4 (RC4) cipher, which is overly complicated,
and doesn't really have an upside in this particular case, since
ciphers are always synchronous and therefore always implemented in
software. Given that we have no accelerated software implementations
either, it is much more straightforward to invoke a generic library
interface directly.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-20 14:18:33 +08:00
David S. Miller
dca73a65a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-06-19

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) new SO_REUSEPORT_DETACH_BPF setsocktopt, from Martin.

2) BTF based map definition, from Andrii.

3) support bpf_map_lookup_elem for xskmap, from Jonathan.

4) bounded loops and scalar precision logic in the verifier, from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-20 00:06:27 -04:00
Eric Dumazet
85f9aa7565 inet: clear num_timeout reqsk_alloc()
KMSAN caught uninit-value in tcp_create_openreq_child() [1]
This is caused by a recent change, combined by the fact
that TCP cleared num_timeout, num_retrans and sk fields only
when a request socket was about to be queued.

Under syncookie mode, a temporary request socket is used,
and req->num_timeout could contain garbage.

Lets clear these three fields sooner, there is really no
point trying to defer this and risk other bugs.

[1]

BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x191/0x1f0 lib/dump_stack.c:113
 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611
 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304
 tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
 tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152
 tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209
 cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252
 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
 tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
 ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
 dst_input include/net/dst.h:439 [inline]
 ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
 __netif_receive_skb_one_core net/core/dev.c:4981 [inline]
 __netif_receive_skb net/core/dev.c:5095 [inline]
 process_backlog+0x721/0x1410 net/core/dev.c:5906
 napi_poll net/core/dev.c:6329 [inline]
 net_rx_action+0x738/0x1940 net/core/dev.c:6395
 __do_softirq+0x4ad/0x858 kernel/softirq.c:293
 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
 </IRQ>
 do_softirq kernel/softirq.c:338 [inline]
 __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
 ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
 dst_output include/net/dst.h:433 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
 tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
 __sock_release net/socket.c:601 [inline]
 sock_close+0x156/0x490 net/socket.c:1273
 __fput+0x4c9/0xba0 fs/file_table.c:280
 ____fput+0x37/0x40 fs/file_table.c:313
 task_work_run+0x22e/0x2a0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:185 [inline]
 exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
 prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
 entry_SYSCALL_64_after_hwframe+0x63/0xe7
RIP: 0033:0x401d50
Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00
RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50
RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c
R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0
R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline]
 kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160
 kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177
 kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781
 reqsk_alloc include/net/request_sock.h:84 [inline]
 inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384
 cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173
 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
 tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
 ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
 dst_input include/net/dst.h:439 [inline]
 ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
 __netif_receive_skb_one_core net/core/dev.c:4981 [inline]
 __netif_receive_skb net/core/dev.c:5095 [inline]
 process_backlog+0x721/0x1410 net/core/dev.c:5906
 napi_poll net/core/dev.c:6329 [inline]
 net_rx_action+0x738/0x1940 net/core/dev.c:6395
 __do_softirq+0x4ad/0x858 kernel/softirq.c:293
 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
 do_softirq kernel/softirq.c:338 [inline]
 __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
 ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
 dst_output include/net/dst.h:433 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
 tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
 __sock_release net/socket.c:601 [inline]
 sock_close+0x156/0x490 net/socket.c:1273
 __fput+0x4c9/0xba0 fs/file_table.c:280
 ____fput+0x37/0x40 fs/file_table.c:313
 task_work_run+0x22e/0x2a0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:185 [inline]
 exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
 prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
 entry_SYSCALL_64_after_hwframe+0x63/0xe7

Fixes: 336c39a031 ("tcp: undo init congestion window on false SYNACK timeout")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 17:46:57 -04:00
Matthias Kaehlcke
25cec75689 net/ipv4: fib_trie: Avoid cryptic ternary expressions
empty_child_inc/dec() use the ternary operator for conditional
operations. The conditions involve the post/pre in/decrement
operator and the operation is only performed when the condition
is *not* true. This is hard to parse for humans, use a regular
'if' construct instead and perform the in/decrement separately.

This also fixes two warnings that are emitted about the value
of the ternary expression being unused, when building the kernel
with clang + "kbuild: Remove unnecessary -Wno-unused-value"
(https://lore.kernel.org/patchwork/patch/1089869/):

CC      net/ipv4/fib_trie.o
net/ipv4/fib_trie.c:351:2: error: expression result unused [-Werror,-Wunused-value]
        ++tn_info(n)->empty_children ? : ++tn_info(n)->full_children;

Fixes: 95f60ea3e9 ("fib_trie: Add collapse() and should_collapse() to resize")
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 17:29:38 -04:00
David Ahern
c7036d97ac ipv6: Default fib6_type to RTN_UNICAST when not set
A user reported that routes are getting installed with type 0 (RTN_UNSPEC)
where before the routes were RTN_UNICAST. One example is from accel-ppp
which apparently still uses the ioctl interface and does not set
rtmsg_type. Another is the netlink interface where ipv6 does not require
rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the
ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior.

Fixes: e8478e80e5 ("net/ipv6: Save route type in rt6_info")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 17:14:05 -04:00
Chuck Lever
1e091c3bbf svcrdma: Ignore source port when computing DRC hash
The DRC appears to be effectively empty after an RPC/RDMA transport
reconnect. The problem is that each connection uses a different
source port, which defeats the DRC hash.

Clients always have to disconnect before they send retransmissions
to reset the connection's credit accounting, thus every retransmit
on NFS/RDMA will miss the DRC.

An NFS/RDMA client's IP source port is meaningless for RDMA
transports. The transport layer typically sets the source port value
on the connection to a random ephemeral port. The server already
ignores it for the "secure port" check. See commit 16e4d93f6d
("NFSD: Ignore client's source port on RDMA transports").

The Linux NFS server's DRC resolves XID collisions from the same
source IP address by using the checksum of the first 200 bytes of
the RPC call header.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-06-19 16:40:40 -04:00
Julian Wiedmann
06996c1d40 net/af_iucv: always register net_device notifier
Even when running as VM guest (ie pr_iucv != NULL), af_iucv can still
open HiperTransport-based connections. For robust operation these
connections require the af_iucv_netdev_notifier, so register it
unconditionally.

Also handle any error that register_netdevice_notifier() returns.

Fixes: 9fbd87d413 ("af_iucv: handle netdev events")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 16:26:33 -04:00
Julian Wiedmann
238965b71b net/af_iucv: build proper skbs for HiperTransport
The HiperSockets-based transport path in af_iucv is still too closely
entangled with qeth.
With commit a647a02512 ("s390/qeth: speed-up L3 IQD xmit"), the
relevant xmit code in qeth has begun to use skb_cow_head(). So to avoid
unnecessary skb head expansions, af_iucv must learn to
1) respect dev->needed_headroom when allocating skbs, and
2) drop the header reference before cloning the skb.

While at it, also stop hard-coding the LL-header creation stage and just
use the appropriate helper.

Fixes: a647a02512 ("s390/qeth: speed-up L3 IQD xmit")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 16:26:33 -04:00
Julian Wiedmann
fdbf632691 net/af_iucv: remove GFP_DMA restriction for HiperTransport
af_iucv sockets over z/VM IUCV require that their skbs are allocated
in DMA memory. This restriction doesn't apply to connections over
HiperSockets. So only set this limit for z/VM IUCV sockets, thereby
increasing the likelihood that the large (and linear!) allocations for
HiperTransport messages succeed.

Fixes: 3881ac441f ("af_iucv: add HiperSockets transport")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 16:26:33 -04:00
Laura Garcia Liebana
79ebb5bb4e netfilter: nf_tables: enable set expiration time for set elements
Currently, the expiration of every element in a set or map
is a read-only parameter generated at kernel side.

This change will permit to set a certain expiration date
per element that will be required, for example, during
stateful replication among several nodes.

This patch handles the NFTA_SET_ELEM_EXPIRATION in order
to configure the expiration parameter per element, or
will use the timeout in the case that the expiration
is not set.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-19 17:48:36 +02:00
Stéphane Veyret
2f0513d487 netfilter: nft_ct: fix null pointer in ct expectations support
nf_ct_helper_ext_add may return null, which must then be checked.

Fixes: 857b46027d ("netfilter: nft_ct: add ct expectations support")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Stéphane Veyret <sveyret@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-19 17:48:35 +02:00
Colin Ian King
72c5e11854 netfilter: synproxy: ensure zero is returned on non-error return path
Currently functions nf_synproxy_{ipc4|ipv6}_init return an uninitialized
garbage value in variable ret on a successful return.  Fix this by
returning zero on success.

Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: d7f9b2f18e ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-19 17:48:34 +02:00
Eric Dumazet
d5dd88794a inet: fix various use-after-free in defrags units
syzbot reported another issue caused by my recent patches. [1]

The issue here is that fqdir_exit() is initiating a work queue
and immediately returns. A bit later cleanup_net() was able
to free the MIB (percpu data) and the whole struct net was freed,
but we had active frag timers that fired and triggered use-after-free.

We need to make sure that timers can catch fqdir->dead being set,
to bailout.

Since RCU is used for the reader side, this means
we want to respect an RCU grace period between these operations :

1) qfdir->dead = 1;

2) netns dismantle (freeing of various data structure)

This patch uses new new (struct pernet_operations)->pre_exit
infrastructure to ensures a full RCU grace period
happens between fqdir_pre_exit() and fqdir_exit()

This also means we can use a regular work queue, we no
longer need rcu_work.

Tested:

$ time for i in {1..1000}; do unshare -n /bin/false;done

real	0m2.585s
user	0m0.160s
sys	0m2.214s

[1]

BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860

CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
 call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
 expire_timers kernel/time/timer.c:1366 [inline]
 __run_timers kernel/time/timer.c:1685 [inline]
 __run_timers kernel/time/timer.c:1653 [inline]
 run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
 __do_softirq+0x25c/0x94c kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:374 [inline]
 irq_exit+0x180/0x1d0 kernel/softirq.c:414
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
 </IRQ>
RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035
Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 <48> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c
RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000
RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398
RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d
R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380
R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000
 tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087
 tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline]
 tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734
 tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
 security_file_ioctl+0x77/0xc0 security/security.c:1370
 ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
 __do_sys_ioctl fs/ioctl.c:720 [inline]
 __se_sys_ioctl fs/ioctl.c:718 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4592c9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9
RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4
R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff

Allocated by task 9047:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
 slab_post_alloc_hook mm/slab.h:437 [inline]
 slab_alloc mm/slab.c:3326 [inline]
 kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488
 kmem_cache_zalloc include/linux/slab.h:732 [inline]
 net_alloc net/core/net_namespace.c:386 [inline]
 copy_net_ns+0xed/0x340 net/core/net_namespace.c:426
 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
 ksys_unshare+0x440/0x980 kernel/fork.c:2692
 __do_sys_unshare kernel/fork.c:2760 [inline]
 __se_sys_unshare kernel/fork.c:2758 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 2541:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kmem_cache_free+0x86/0x260 mm/slab.c:3698
 net_free net/core/net_namespace.c:402 [inline]
 net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409
 net_drop_ns net/core/net_namespace.c:408 [inline]
 cleanup_net+0x538/0x960 net/core/net_namespace.c:571
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

The buggy address belongs to the object at ffff88808b9fe100
 which belongs to the cache net_namespace of size 6784
The buggy address is located 560 bytes inside of
 6784-byte region [ffff88808b9fe100, ffff88808b9ffb80)
The buggy address belongs to the page:
page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0
raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                     ^
 ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 3c8fc87820 ("inet: frags: rework rhashtable dismantle")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:37:47 -04:00
Eric Dumazet
d7d99872c1 netns: add pre_exit method to struct pernet_operations
Current struct pernet_operations exit() handlers are highly
discouraged to call synchronize_rcu().

There are cases where we need them, and exit_batch() does
not help the common case where a single netns is dismantled.

This patch leverages the existing synchronize_rcu() call
in cleanup_net()

Calling optional ->pre_exit() method before ->exit() or
->exit_batch() allows to benefit from a single synchronize_rcu()
call.

Note that the synchronize_rcu() calls added in this patch
are only in error paths or slow paths.

Tested:

$ time for i in {1..1000}; do unshare -n /bin/false;done

real	0m2.612s
user	0m0.171s
sys	0m2.216s

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:37:47 -04:00
Jesper Dangaard Brouer
f71fec47c2 page_pool: make sure struct device is stable
For DMA mapping use-case the page_pool keeps a pointer
to the struct device, which is used in DMA map/unmap calls.

For our in-flight handling, we also need to make sure that
the struct device have not disappeared.  This is assured
via using get_device/put_device API.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reported-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Jesper Dangaard Brouer
32c28f7e41 page_pool: add tracepoints for page_pool with details need by XDP
The xdp tracepoints for mem id disconnect don't carry information about, why
it was not safe_to_remove.  The tracepoint page_pool:page_pool_inflight in
this patch can be used for extract this info for further debugging.

This patchset also adds tracepoint for the pages_state_* release/hold
transitions, including a pointer to the page.  This can be used for stats
about in-flight pages, or used to debug page leakage via keeping track of
page pointer and combining this with kprobe for __put_page().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Jesper Dangaard Brouer
f033b688c1 xdp: add tracepoints for XDP mem
These tracepoints make it easier to troubleshoot XDP mem id disconnect.

The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is
placed at the last stable place for the pointer to struct xdp_mem_allocator,
just before it's scheduled for RCU removal. It also extract info on
'safe_to_remove' and 'force'.

Detailed info about in-flight pages is not available at this layer. The next
patch will added tracepoints needed at the page_pool layer for this.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Jesper Dangaard Brouer
d956a048cd xdp: force mem allocator removal and periodic warning
If bugs exists or are introduced later e.g. by drivers misusing the API,
then we want to warn about the issue, such that developer notice. This patch
will generate a bit of noise in form of periodic pr_warn every 30 seconds.

It is not nice to have this stall warning running forever. Thus, this patch
will (after 120 attempts) force disconnect the mem id (from the rhashtable)
and free the page_pool object. This will cause fallback to the put_page() as
before, which only potentially leak DMA-mappings, if objects are really
stuck for this long. In that unlikely case, a WARN_ONCE should show us the
call stack.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Jesper Dangaard Brouer
99c07c43c4 xdp: tracking page_pool resources and safe removal
This patch is needed before we can allow drivers to use page_pool for
DMA-mappings. Today with page_pool and XDP return API, it is possible to
remove the page_pool object (from rhashtable), while there are still
in-flight packet-pages. This is safely handled via RCU and failed lookups in
__xdp_return() fallback to call put_page(), when page_pool object is gone.
In-case page is still DMA mapped, this will result in page note getting
correctly DMA unmapped.

To solve this, the page_pool is extended with tracking in-flight pages. And
XDP disconnect system queries page_pool and waits, via workqueue, for all
in-flight pages to be returned.

To avoid killing performance when tracking in-flight pages, the implement
use two (unsigned) counters, that in placed on different cache-lines, and
can be used to deduct in-flight packets. This is done by mapping the
unsigned "sequence" counters onto signed Two's complement arithmetic
operations. This is e.g. used by kernel's time_after macros, described in
kernel commit 1ba3aab303 and 5a581b367b, and also explained in RFC1982.

The trick is these two incrementing counters only need to be read and
compared, when checking if it's safe to free the page_pool structure. Which
will only happen when driver have disconnected RX/alloc side. Thus, on a
non-fast-path.

It is chosen that page_pool tracking is also enabled for the non-DMA
use-case, as this can be used for statistics later.

After this patch, using page_pool requires more strict resource "release",
e.g. via page_pool_release_page() that was introduced in this patchset, and
previous patches implement/fix this more strict requirement.

Drivers no-longer call page_pool_destroy(). Drivers already call
xdp_rxq_info_unreg() which call xdp_rxq_info_unreg_mem_model(), which will
attempt to disconnect the mem id, and if attempt fails schedule the
disconnect for later via delayed workqueue.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Jesper Dangaard Brouer
e54cfd7e17 page_pool: introduce page_pool_free and use in mlx5
In case driver fails to register the page_pool with XDP return API (via
xdp_rxq_info_reg_mem_model()), then the driver can free the page_pool
resources more directly than calling page_pool_destroy(), which does a
unnecessarily RCU free procedure.

This patch is preparing for removing page_pool_destroy(), from driver
invocation.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Jesper Dangaard Brouer
6bf071bf09 xdp: page_pool related fix to cpumap
When converting an xdp_frame into an SKB, and sending this into the network
stack, then the underlying XDP memory model need to release associated
resources, because the network stack don't have callbacks for XDP memory
models.  The only memory model that needs this is page_pool, when a driver
use the DMA-mapping feature.

Introduce page_pool_release_page(), which basically does the same as
page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model
interface for calling it, if the memory model match MEM_TYPE_PAGE_POOL, to
save the function call overhead for others. Have cpumap call
xdp_release_frame() before xdp_scrub_frame().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Jesper Dangaard Brouer
516a7593fd xdp: fix leak of IDA cyclic id if rhashtable_insert_slow fails
Fix error handling case, where inserting ID with rhashtable_insert_slow
fails in xdp_rxq_info_reg_mem_model, which leads to never releasing the IDA
ID, as the lookup in xdp_rxq_info_unreg_mem_model fails and thus
ida_simple_remove() is never called.

Fix by releasing ID via ida_simple_remove(), and mark xdp_rxq->mem.id with
zero, which is already checked in xdp_rxq_info_unreg_mem_model().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Ilias Apalodimas
a25d50bfe6 net: page_pool: add helper function to unmap dma addresses
On a previous patch dma addr was stored in 'struct page'.
Use that to unmap DMA addresses used by network drivers

Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Thomas Gleixner
7f904d7e1f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505
Based on 1 normalized pattern(s):

  gplv2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 58 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081207.556988620@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:11:22 +02:00
Thomas Gleixner
21042e4142 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation see readme and copying for
  more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 9 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081207.060259192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:56 +02:00
Thomas Gleixner
d2912cb15b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:55 +02:00
Thomas Gleixner
f361c4dc5e treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 484
Based on 1 normalized pattern(s):

  this source code is licensed under general public license version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 5 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081204.871734026@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:52 +02:00
Thomas Gleixner
7a338472f2 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482
Based on 1 normalized pattern(s):

  this work is licensed under the terms of the gnu gpl version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 48 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:52 +02:00
Thomas Gleixner
2504ba9f59 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 235
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license this
  program is distributed in the hope that it will be useful but
  without any warranty without even the implied warranty of
  merchantability or fitness for a particular purpose see the gnu
  general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 53 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.904365654@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:07 +02:00
Thomas Gleixner
caab277b1d treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 234
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not see http www gnu org
  licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 503 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.811534538@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:07 +02:00
Thomas Gleixner
40b0b3f8fb treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 230
Based on 2 normalized pattern(s):

  this source code is licensed under the gnu general public license
  version 2 see the file copying for more details

  this source code is licensed under general public license version 2
  see

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 52 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.449021192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:06 +02:00
Jiri Pirko
9558a83aee net: flow_offload: implement support for meta key
Implement support for previously added flow dissector meta key.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 10:09:22 -04:00
Jiri Pirko
8212ed777f net: sched: cls_flower: use flow_dissector for ingress ifindex
Use previously introduced infra to obtain and store ingress ifindex
instead doing it locally.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 10:09:22 -04:00
Jiri Pirko
82828b88f0 flow_dissector: add support for ingress ifindex dissection
Add new key meta that contains ingress ifindex value and add a function
to dissect this from skb. The key and function is prepared to cover
other potential skb metadata values dissection.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 10:09:21 -04:00
Doug Ledford
12dbc04db0 Merge remote-tracking branch 'mlx5-next/mlx5-next' into HEAD
Take mlx5-next so we can take a dependent two patch series next.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-18 22:44:36 -04:00
David S. Miller
d470e720ef Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Module autoload for masquerade and redirection does not work.

2) Leak in unqueued packets in nf_ct_frag6_queue(). Ignore duplicated
   fragments, pretend they are placed into the queue. Patches from
   Guillaume Nault.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 21:43:40 -04:00
Sunil Muthuswamy
cb359b6041 hvsock: fix epollout hang from race condition
Currently, hvsock can enter into a state where epoll_wait on EPOLLOUT will
not return even when the hvsock socket is writable, under some race
condition. This can happen under the following sequence:
- fd = socket(hvsocket)
- fd_out = dup(fd)
- fd_in = dup(fd)
- start a writer thread that writes data to fd_out with a combination of
  epoll_wait(fd_out, EPOLLOUT) and
- start a reader thread that reads data from fd_in with a combination of
  epoll_wait(fd_in, EPOLLIN)
- On the host, there are two threads that are reading/writing data to the
  hvsocket

stack:
hvs_stream_has_space
hvs_notify_poll_out
vsock_poll
sock_poll
ep_poll

Race condition:
check for epollout from ep_poll():
	assume no writable space in the socket
	hvs_stream_has_space() returns 0
check for epollin from ep_poll():
	assume socket has some free space < HVS_PKT_LEN(HVS_SEND_BUF_SIZE)
	hvs_stream_has_space() will clear the channel pending send size
	host will not notify the guest because the pending send size has
		been cleared and so the hvsocket will never mark the
		socket writable

Now, the EPOLLOUT will never return even if the socket write buffer is
empty.

The fix is to set the pending size to the default size and never change it.
This way the host will always notify the guest whenever the writable space
is bigger than the pending size. The host is already optimized to *only*
notify the guest when the pending size threshold boundary is crossed and
not everytime.

This change also reduces the cpu usage somewhat since hv_stream_has_space()
is in the hotpath of send:
vsock_stream_sendmsg()->hv_stream_has_space()
Earlier hv_stream_has_space was setting/clearing the pending size on every
call.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 21:41:12 -04:00
Fred Klassen
76e21533a4 net/udp_gso: Allow TX timestamp with UDP GSO
Fixes an issue where TX Timestamps are not arriving on the error queue
when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
This can be illustrated with an updated updgso_bench_tx program which
includes the '-T' option to test for this condition. It also introduces
the '-P' option which will call poll() before reading the error queue.

    ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
    poll timeout
    udp tx:      0 MB/s        1 calls/s      1 msg/s

The "poll timeout" message above indicates that TX timestamp never
arrived.

This patch preserves tx_flags for the first UDP GSO segment. Only the
first segment is timestamped, even though in some cases there may be
benefital in timestamping both the first and last segment.

Factors in deciding on first segment timestamp only:

- Timestamping both first and last segmented is not feasible. Hardware
can only have one outstanding TS request at a time.

- Timestamping last segment may under report network latency of the
previous segments. Even though the doorbell is suppressed, the ring
producer counter has been incremented.

- Timestamping the first segment has the upside in that it reports
timestamps from the application's view, e.g. RTT.

- Timestamping the first segment has the downside that it may
underreport tx host network latency. It appears that we have to pick
one or the other. And possibly follow-up with a config flag to choose
behavior.

v2: Remove tests as noted by Willem de Bruijn <willemb@google.com>
    Moving tests from net to net-next

v3: Update only relevant tx_flag bits as per
    Willem de Bruijn <willemb@google.com>

v4: Update comments and commit message as per
    Willem de Bruijn <willemb@google.com>

Fixes: ee80d1ebe5 ("udp: add udp gso")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 21:38:07 -04:00
Jakub Kicinski
3e14c383de net: netem: fix use after free and double free with packet corruption
Brendan reports that the use of netem's packet corruption capability
leads to strange crashes.  This seems to be caused by
commit d66280b12b ("net: netem: use a list in addition to rbtree")
which uses skb->next pointer to construct a fast-path queue of
in-order skbs.

Packet corruption code has to invoke skb_gso_segment() in case
of skbs in need of GSO.  skb_gso_segment() returns a list of
skbs.  If next pointers of the skbs on that list do not get cleared
fast path list may point to freed skbs or skbs which are also on
the RB tree.

Let's say skb gets segmented into 3 frames:

A -> B -> C

A gets hooked to the t_head t_tail list by tfifo_enqueue(), but it's
next pointer didn't get cleared so we have:

h t
|/
A -> B -> C

Now if B and C get also get enqueued successfully all is fine, because
tfifo_enqueue() will overwrite the list in order.  IOW:

Enqueue B:

h    t
|    |
A -> B    C

Enqueue C:

h         t
|         |
A -> B -> C

But if B and C get reordered we may end up with:

h t            RB tree
|/                |
A -> B -> C       B
                   \
                    C

Or if they get dropped just:

h t
|/
A -> B -> C

where A and B are already freed.

To reproduce either limit has to be set low to cause freeing of
segs or reorders have to happen (due to delay jitter).

Note that we only have to mark the first segment as not on the
list, "finish_segs" handling of other frags already does that.

Another caveat is that qdisc_drop_all() still has to free all
segments correctly in case of drop of first segment, therefore
we re-link segs before calling it.

v2:
 - re-link before drop, v1 was leaking non-first segs if limit
   was hit at the first seg
 - better commit message which lead to discovering the above :)

Reported-by: Brendan Galloway <brendan.galloway@netronome.com>
Fixes: d66280b12b ("net: netem: use a list in addition to rbtree")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 21:30:38 -04:00
Jakub Kicinski
177b800746 net: netem: fix backlog accounting for corrupted GSO frames
When GSO frame has to be corrupted netem uses skb_gso_segment()
to produce the list of frames, and re-enqueues the segments one
by one.  The backlog length has to be adjusted to account for
new frames.

The current calculation is incorrect, leading to wrong backlog
lengths in the parent qdisc (both bytes and packets), and
incorrect packet backlog count in netem itself.

Parent backlog goes negative, netem's packet backlog counts
all non-first segments twice (thus remaining non-zero even
after qdisc is emptied).

Move the variables used to count the adjustment into local
scope to make 100% sure they aren't used at any stage in
backports.

Fixes: 6071bd1aa1 ("netem: Segment GSO packets on enqueue")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 21:30:38 -04:00
Xin Long
c3bcde0266 tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb
udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device
to count packets on dev->tstats, a perpcu variable. However, TIPC is using
udp tunnel with no tunnel device, and pass the lower dev, like veth device
that only initializes dev->lstats(a perpcu variable) when creating it.

Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as
a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each
pointer points to a bigger struct than lstats, so when tstats->tx_bytes is
increased, other percpu variable's members could be overwritten.

syzbot has reported quite a few crashes due to fib_nh_common percpu member
'nhc_pcpu_rth_output' overwritten, call traces are like:

  BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190
  net/ipv4/route.c:1556
    rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556
    __mkroute_output net/ipv4/route.c:2332 [inline]
    ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564
    ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393
    __ip_route_output_key include/net/route.h:125 [inline]
    ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651
    ip_route_output_key include/net/route.h:135 [inline]
  ...

or:

  kasan: GPF could be caused by NULL-ptr deref or user memory access
  RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168
    <IRQ>
    rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline]
    free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217
    __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
    rcu_do_batch kernel/rcu/tree.c:2437 [inline]
    invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline]
    rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697
  ...

The issue exists since tunnel stats update is moved to iptunnel_xmit by
Commit 039f50629b ("ip_tunnel: Move stats update to iptunnel_xmit()"),
and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb
so that the packets counting won't happen on dev->tstats.

Reported-by: syzbot+9d4c12bfd45a58738d0a@syzkaller.appspotmail.com
Reported-by: syzbot+a9e23ea2aa21044c2798@syzkaller.appspotmail.com
Reported-by: syzbot+c4c4b2bb358bb936ad7e@syzkaller.appspotmail.com
Reported-by: syzbot+0290d2290a607e035ba1@syzkaller.appspotmail.com
Reported-by: syzbot+a43d8d4e7e8a7a9e149e@syzkaller.appspotmail.com
Reported-by: syzbot+a47c5f4c6c00fc1ed16e@syzkaller.appspotmail.com
Fixes: 039f50629b ("ip_tunnel: Move stats update to iptunnel_xmit()")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 20:48:45 -04:00
Xin Long
5684abf702 ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL
iptunnel_xmit() works as a common function, also used by a udp tunnel
which doesn't have to have a tunnel device, like how TIPC works with
udp media.

In these cases, we should allow not to count pkts on dev's tstats, so
that udp tunnel can work with no tunnel device safely.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 20:48:45 -04:00
Denis Kirjanov
75345f888f ipoib: show VF broadcast address
in IPoIB case we can't see a VF broadcast address for but
can see for PF

Before:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
    link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    vf 0 MAC 14:80:00:00:66:fe, spoof checking off, link-state disable,
trust off, query_rss off
...

After:
11: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 256
    link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    vf 0     link/infiniband
80:00:00:66:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a4:3e:7c brd
00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff, spoof
checking off, link-state disable, trust off, query_rss off

v1->v2: add the IFLA_VF_BROADCAST constant
v2->v3: put IFLA_VF_BROADCAST at the end
to avoid KABI breakage and set NLA_REJECT
dev_setlink

Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 10:41:28 -07:00
JingYi Hou
d0bae4a0e3 net: remove duplicate fetch in sock_getsockopt
In sock_getsockopt(), 'optlen' is fetched the first time from userspace.
'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is
fetched the second time from userspace.

If change it between two fetches may cause security problems or unexpected
behaivor, and there is no reason to fetch it a second time.

To fix this, we need to remove the second fetch.

Signed-off-by: JingYi Hou <houjingyi647@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 10:04:16 -07:00
Tuong Lien
d0f84d0856 tipc: fix issues with early FAILOVER_MSG from peer
It appears that a FAILOVER_MSG can come from peer even when the failure
link is resetting (i.e. just after the 'node_write_unlock()'...). This
means the failover procedure on the node has not been started yet.
The situation is as follows:

         node1                                node2
  linkb          linka                  linka        linkb
    |              |                      |            |
    |              |                      x failure    |
    |              |                  RESETTING        |
    |              |                      |            |
    |              x failure            RESET          |
    |          RESETTING             FAILINGOVER       |
    |              |   (FAILOVER_MSG)     |            |
    |<-------------------------------------------------|
    | *FAILINGOVER |                      |            |
    |              | (dummy FAILOVER_MSG) |            |
    |------------------------------------------------->|
    |            RESET                    |            | FAILOVER_END
    |         FAILINGOVER               RESET          |
    .              .                      .            .
    .              .                      .            .
    .              .                      .            .

Once this happens, the link failover procedure will be triggered
wrongly on the receiving node since the node isn't in FAILINGOVER state
but then another link failover will be carried out.
The consequences are:

1) A peer might get stuck in FAILINGOVER state because the 'sync_point'
was set, reset and set incorrectly, the criteria to end the failover
would not be met, it could keep waiting for a message that has already
received.

2) The early FAILOVER_MSG(s) could be queued in the link failover
deferdq but would be purged or not pulled out because the 'drop_point'
was not set correctly.

3) The early FAILOVER_MSG(s) could be dropped too.

4) The dummy FAILOVER_MSG could make the peer leaving FAILINGOVER state
shortly, but later on it would be restarted.

The same situation can also happen when the link is in PEER_RESET state
and a FAILOVER_MSG arrives.

The commit resolves the issues by forcing the link down immediately, so
the failover procedure will be started normally (which is the same as
when receiving a FAILOVER_MSG and the link is in up state).

Also, the function "tipc_node_link_failover()" is toughen to avoid such
a situation from happening.

Acked-by: Jon Maloy <jon.maloy@ericsson.se>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 10:03:37 -07:00
Ido Schimmel
d5382fef70 ipv6: Stop sending in-kernel notifications for each nexthop
Both listeners - mlxsw and netdevsim - of IPv6 FIB notifications are now
ready to handle IPv6 multipath notifications.

Therefore, stop ignoring such notifications in both drivers and stop
sending notification for each added / deleted nexthop.

v2:
* Remove 'multipath_rt' from 'struct fib6_entry_notifier_info'

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 09:45:37 -07:00
Ido Schimmel
2881fd61b6 ipv6: Add IPv6 multipath notification for route delete
If all the nexthops of a multipath route are being deleted, send one
notification for the entire route, instead of one per-nexthop.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 09:45:36 -07:00
Ido Schimmel
ebee3cad83 ipv6: Add IPv6 multipath notifications for add / replace
Emit a notification when a multipath routes is added or replace.

Note that unlike the replace notifications sent from fib6_add_rt2node(),
it is possible we are sending a 'FIB_EVENT_ENTRY_REPLACE' when a route
was merely added and not replaced.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 09:45:36 -07:00
Ido Schimmel
d4b96c7b51 ipv6: Extend notifier info for multipath routes
Extend the IPv6 FIB notifier info with number of sibling routes being
notified.

This will later allow listeners to process one notification for a
multipath routes instead of N, where N is the number of nexthops.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 09:45:36 -07:00
David S. Miller
13091aa305 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Honestly all the conflicts were simple overlapping changes,
nothing really interesting to report.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 20:20:36 -07:00
Florian Westphal
40008e9211 net: ipv4: remove erroneous advancement of list pointer
Causes crash when lifetime expires on an adress as garbage is
dereferenced soon after.

This used to look like this:

 for (ifap = &ifa->ifa_dev->ifa_list;
      *ifap != NULL; ifap = &(*ifap)->ifa_next) {
          if (*ifap == ifa) ...

but this was changed to:

struct in_ifaddr *tmp;

ifap = &ifa->ifa_dev->ifa_list;
tmp = rtnl_dereference(*ifap);
while (tmp) {
   tmp = rtnl_dereference(tmp->ifa_next); // Bogus
   if (rtnl_dereference(*ifap) == ifa) {
     ...
   ifap = &tmp->ifa_next;		// Can be NULL
   tmp = rtnl_dereference(*ifap);	// Dereference
   }
}

Remove the bogus assigment/list entry skip.

Fixes: 2638eb8b50 ("net: ipv4: provide __rcu annotation for ifa_list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 16:27:42 -07:00
Linus Torvalds
da0f382029 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of bug fixes here:

   1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.

   2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
      Crispin.

   3) Use after free in psock backlog workqueue, from John Fastabend.

   4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
      Salem.

   5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.

   6) Network header needs to be set for packet redirect in nfp, from
      John Hurley.

   7) Fix udp zerocopy refcnt, from Willem de Bruijn.

   8) Don't assume linear buffers in vxlan and geneve error handlers,
      from Stefano Brivio.

   9) Fix TOS matching in mlxsw, from Jiri Pirko.

  10) More SCTP cookie memory leak fixes, from Neil Horman.

  11) Fix VLAN filtering in rtl8366, from Linus Walluij.

  12) Various TCP SACK payload size and fragmentation memory limit fixes
      from Eric Dumazet.

  13) Use after free in pneigh_get_next(), also from Eric Dumazet.

  14) LAPB control block leak fix from Jeremy Sowden"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
  lapb: fixed leak of control-blocks.
  tipc: purge deferredq list for each grp member in tipc_group_delete
  ax25: fix inconsistent lock state in ax25_destroy_timer
  neigh: fix use-after-free read in pneigh_get_next
  tcp: fix compile error if !CONFIG_SYSCTL
  hv_sock: Suppress bogus "may be used uninitialized" warnings
  be2net: Fix number of Rx queues used for flow hashing
  net: handle 802.1P vlan 0 packets properly
  tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  tcp: add tcp_min_snd_mss sysctl
  tcp: tcp_fragment() should apply sane memory limits
  tcp: limit payload size of sacked skbs
  Revert "net: phylink: set the autoneg state in phylink_phy_change"
  bpf: fix nested bpf tracepoints with per-cpu data
  bpf: Fix out of bounds memory access in bpf_sk_storage
  vsock/virtio: set SOCK_DONE on peer shutdown
  net: dsa: rtl8366: Fix up VLAN filtering
  net: phylink: set the autoneg state in phylink_phy_change
  net: add high_order_alloc_disable sysctl/static key
  tcp: add tcp_tx_skb_cache sysctl
  ...
2019-06-17 15:55:34 -07:00
Anton Protopopov
56f0f84e69 bpf: fix the check that forwarding is enabled in bpf_ipv6_fib_lookup
The bpf_ipv6_fib_lookup function should return BPF_FIB_LKUP_RET_FWD_DISABLED
when forwarding is disabled for the input device.  However instead of checking
if forwarding is enabled on the input device, it checked the global
net->ipv6.devconf_all->forwarding flag.  Change it to behave as expected.

Fixes: 87f5fc7e48 ("bpf: Provide helper to do forwarding lookups in kernel FIB table")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-17 23:18:50 +02:00
Jiri Pirko
f517f2716c net: sched: cls_matchall: allow to delete filter
Currently user is unable to delete the filter. See following example:
$ tc filter add dev ens16np1 ingress pref 1 handle 1 matchall action drop
$ tc filter show dev ens16np1 ingress
filter protocol all pref 1 matchall chain 0
filter protocol all pref 1 matchall chain 0 handle 0x1
  in_hw
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1

$ tc filter del dev ens16np1 ingress pref 1 handle 1 matchall action drop
RTNETLINK answers: Operation not supported

Implement tcf_proto_ops->delete() op and allow user to delete the filter.

Reported-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 14:05:32 -07:00
Kevin Darbyshire-Bryant
c197d63627 net: sched: act_ctinfo: fix policy validation
Fix nla_policy definition by specifying an exact length type attribute
to CTINFO action paraneter block structure.  Without this change,
netlink parsing will fail validation and the action will not be
instantiated.

8cb081746c ("netlink: make validation more configurable for future")
introduced much stricter checking to attributes being passed via
netlink.  Existing actions were updated to use less restrictive
deprecated versions of nla_parse_nested.

As a new module, act_ctinfo should be designed to use the strict
checking model otherwise, well, what was the point of implementing it.

Confession time: Until very recently, development of this module has
been done on 'net-next' tree to 'clean compile' level with run-time
testing on backports to 4.14 & 4.19 kernels under openwrt.  This is how
I managed to miss the run-time impacts of the new strict
nla_parse_nested function.  I hopefully have learned something from this
(glances toward laptop running a net-next kernel)

There is however a still outstanding implication on iproute2 user space
in that it needs to be told to pass nested netlink messages with the
nested attribute actually set.  So even with this kernel fix to do
things correctly you still cannot instantiate a new 'strict'
nla_parse_nested based action such as act_ctinfo with iproute2's tc.

Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 14:00:30 -07:00
Kevin Darbyshire-Bryant
a658c2e49f net: sched: act_ctinfo: fix action creation
Use correct return value on action creation: ACT_P_CREATED.

The use of incorrect return value could result in a situation where the
system thought a ctinfo module was listening but actually wasn't
instantiated correctly leading to an OOPS in tcf_generic_walker().

Confession time: Until very recently, development of this module has
been done on 'net-next' tree to 'clean compile' level with run-time
testing on backports to 4.14 & 4.19 kernels under openwrt.  During the
back & forward porting during development & testing, the critical
ACT_P_CREATED return code got missed despite being in the 4.14 & 4.19
backports.  I have now gone through the init functions, using act_csum
as reference with a fine toothed comb.  Bonus, no more OOPSes.  I
managed to also miss this issue till now due to the new strict
nla_parse_nested function failing validation before action creation.

As an inexperienced developer I've learned that
copy/pasting/backporting/forward porting code correctly is hard.  If I
ever get to a developer conference I shall don the cone of shame.

Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 14:00:30 -07:00
Ard Biesheuvel
c681edae33 net: ipv4: move tcp_fastopen server side code to SipHash library
Using a bare block cipher in non-crypto code is almost always a bad idea,
not only for security reasons (and we've seen some examples of this in
the kernel in the past), but also for performance reasons.

In the TCP fastopen case, we call into the bare AES block cipher one or
two times (depending on whether the connection is IPv4 or IPv6). On most
systems, this results in a call chain such as

  crypto_cipher_encrypt_one(ctx, dst, src)
    crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...);
      aesni_encrypt
        kernel_fpu_begin();
        aesni_enc(ctx, dst, src); // asm routine
        kernel_fpu_end();

It is highly unlikely that the use of special AES instructions has a
benefit in this case, especially since we are doing the above twice
for IPv6 connections, instead of using a transform which can process
the entire input in one go.

We could switch to the cbcmac(aes) shash, which would at least get
rid of the duplicated overhead in *some* cases (i.e., today, only
arm64 has an accelerated implementation of cbcmac(aes), while x86 will
end up using the generic cbcmac template wrapping the AES-NI cipher,
which basically ends up doing exactly the above). However, in the given
context, it makes more sense to use a light-weight MAC algorithm that
is more suitable for the purpose at hand, such as SipHash.

Since the output size of SipHash already matches our chosen value for
TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input
sizes, this greatly simplifies the code as well.

NOTE: Server farms backing a single server IP for load balancing purposes
      and sharing a single fastopen key will be adversely affected by
      this change unless all systems in the pool receive their kernel
      upgrades at the same time.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 13:56:26 -07:00
Tuong Lien
6a6b5c8bff tipc: include retrans failure detection for unicast
In patch series, commit 9195948fbf ("tipc: improve TIPC throughput by
Gap ACK blocks"), as for simplicity, the repeated retransmit failures'
detection in the function - "tipc_link_retrans()" was kept there for
broadcast retransmissions only.

This commit now reapplies this feature for link unicast retransmissions
that has been done via the function - "tipc_link_advance_transmq()".

Also, the "tipc_link_retrans()" is renamed to "tipc_link_bc_retrans()"
as it is used only for broadcast.

Acked-by: Jon Maloy <jon.maloy@ericsson.se>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 13:27:32 -07:00
David S. Miller
4fddbf8a99 Merge branch 'tcp-fixes'
Eric Dumazet says:

====================
tcp: make sack processing more robust

Jonathan Looney brought to our attention multiple problems
in TCP stack at the sender side.

SACK processing can be abused by malicious peers to either
cause overflows, or increase of memory usage.

First two patches fix the immediate problems.

Since the malicious peers abuse senders by advertizing a very
small MSS in their SYN or SYNACK packet, the last two
patches add a new sysctl so that admins can chose a higher
limit for MSS clamping.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 10:39:56 -07:00
Fernando Fernandez Mancera
d7f9b2f18e netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY
Add common functions into nf_synproxy_core.c to prepare for nftables support.
The prototypes of the functions used by {ipt, ip6t}_SYNPROXY are in the new
file nf_synproxy.h

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 17:12:55 +02:00
Fernando Fernandez Mancera
3006a5224f netfilter: synproxy: remove module dependency on IPv6 SYNPROXY
This is a prerequisite for the infrastructure module NETFILTER_SYNPROXY.
The new module is needed to avoid duplicated code for the SYNPROXY
nftables support.

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 17:12:09 +02:00
Pablo Neira Ayuso
667ec21ebf Merge branch 'master' of git://blackhole.kfki.hu/nf-next
Jozsef Kadlecsik says:

====================
ipset patches for nf-next

- Remove useless memset() calls, nla_parse_nested/nla_parse
  erase the tb array properly, from Florent Fourcot.
- Merge the uadd and udel functions, the code is nicer
  this way, also from Florent Fourcot.
- Add a missing check for the return value of a
  nla_parse[_deprecated] call, from Aditya Pakki.
- Add the last missing check for the return value
  of nla_parse[_deprecated] call.
- Fix error path and release the references properly
  in set_target_v3_checkentry().
- Fix memory accounting which is reported to userspace
  for hash types on resize, from Stefano Brivio.
- Update my email address to kadlec@netfilter.org.
  The patch covers all places in the source tree where
  my kadlec@blackhole.kfki.hu address could be found.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 16:37:24 +02:00
Christian Brauner
22567590b2 netfilter: bridge: namespace bridge netfilter sysctls
Currently, the /proc/sys/net/bridge folder is only created in the initial
network namespace. This patch ensures that the /proc/sys/net/bridge folder
is available in each network namespace if the module is loaded and
disappears from all network namespaces when the module is unloaded.

In doing so the patch makes the sysctls:

bridge-nf-call-arptables
bridge-nf-call-ip6tables
bridge-nf-call-iptables
bridge-nf-filter-pppoe-tagged
bridge-nf-filter-vlan-tagged
bridge-nf-pass-vlan-input-dev

apply per network namespace. This unblocks some use-cases where users would
like to e.g. not do bridge filtering for bridges in a specific network
namespace while doing so for bridges located in another network namespace.

The netfilter rules are afaict already per network namespace so it should
be safe for users to specify whether bridge devices inside a network
namespace are supposed to go through iptables et al. or not. Also, this can
already be done per-bridge by setting an option for each individual bridge
via Netlink. It should also be possible to do this for all bridges in a
network namespace via sysctls.

Cc: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 16:36:31 +02:00
Christian Brauner
ff6d090d0d netfilter: bridge: port sysctls to use brnf_net
This ports the sysctls to use struct brnf_net.

With this patch we make it possible to namespace the br_netfilter module in
the following patch.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 16:36:30 +02:00
Pablo Neira Ayuso
9911c1139f netfilter: xt_owner: bail out with EINVAL in case of unsupported flags
Reject flags that are not supported with EINVAL.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 16:36:30 +02:00
Florian Westphal
87e389b4c2 netfilter: conntrack: small conntrack lookup optimization
____nf_conntrack_find() performs checks on the conntrack objects in
this order:

1. if (nf_ct_is_expired(ct))

This fetches ct->timeout, in third cache line.

The hnnode that is used to store the list pointers resides in the first
(origin) or second (reply tuple) cache lines.

This test rarely passes, but its necessary to reap obsolete entries.

2. if (nf_ct_is_dying(ct))

This fetches ct->status, also in third cache line.

The test is useless, and can be removed:
  Consider:
     cpu0                                           cpu1
    ct = ____nf_conntrack_find()
    atomic_inc_not_zero(ct) -> ok
    nf_ct_key_equal -> ok
    is_dying -> DYING bit not set, ok
                                                    set_bit(ct, DYING);
						    ... unhash ... etc.
    return ct
    -> returning a ct with dying bit set, despite
    having a test for it.

This (unlikely) case is fine - refcount prevents ct from getting free'd.

3. if (nf_ct_key_equal(h, tuple, zone, net))

nf_ct_key_equal checks in following order:

1. Tuple equal (first or second cacheline)
2. Zone equal (third cacheline)
3. confirmed bit set (->status, third cacheline)
4. net namespace match (third cacheline).

Swapping "timeout" and "cpu" places timeout in the first cacheline.
This has two advantages:

1. For a conntrack that won't even match the original tuple,
   we will now only fetch the first and maybe the second cacheline
   instead of always accessing the 3rd one as well.

2.  in case of TCP ct->timeout changes frequently because we
    reduce/increase it when there are packets outstanding in the network.

The first cacheline contains both the reference count and the ct spinlock,
i.e. moving timeout there avoids writes to 3rd cacheline.

The restart sequence in __nf_conntrack_find() is removed, if we found a
candidate, but then fail to increment the refcount or discover the tuple
has changed (object recycling), just pretend we did not find an entry.

A second lookup won't find anything until another CPU adds a new conntrack
with identical tuple into the hash table, which is very unlikely.

We have the confirmation-time checks (when we hold hash lock) that deal
with identical entries and even perform clash resolution in some cases.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 16:35:30 +02:00
Stéphane Veyret
857b46027d netfilter: nft_ct: add ct expectations support
This patch allows to add, list and delete expectations via nft objref
infrastructure and assigning these expectations via nft rule.

This allows manual port triggering when no helper is defined to manage a
specific protocol. For example, if I have an online game which protocol
is based on initial connection to TCP port 9753 of the server, and where
the server opens a connection to port 9876, I can set rules as follow:

table ip filter {
    ct expectation mygame {
        protocol udp;
        dport 9876;
        timeout 2m;
        size 1;
    }

    chain input {
        type filter hook input priority 0; policy drop;
        tcp dport 9753 ct expectation set "mygame";
    }

    chain output {
        type filter hook output priority 0; policy drop;
        udp dport 9876 ct status expected accept;
    }
}

Signed-off-by: Stéphane Veyret <sveyret@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 16:35:20 +02:00
Nicolas Dichtel
b8d6d00797 xfrm: fix sa selector validation
After commit b38ff4075a, the following command does not work anymore:
$ ip xfrm state add src 10.125.0.2 dst 10.125.0.1 proto esp spi 34 reqid 1 \
  mode tunnel enc 'cbc(aes)' 0xb0abdba8b782ad9d364ec81e3a7d82a1 auth-trunc \
  'hmac(sha1)' 0xe26609ebd00acb6a4d51fca13e49ea78a72c73e6 96 flag align4

In fact, the selector is not mandatory, allow the user to provide an empty
selector.

Fixes: b38ff4075a ("xfrm: Fix xfrm sel prefix length validation")
CC: Anirudh Gupta <anirudh.gupta@sophos.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-17 11:22:35 +02:00
Jeremy Sowden
6be8e297f9 lapb: fixed leak of control-blocks.
lapb_register calls lapb_create_cb, which initializes the control-
block's ref-count to one, and __lapb_insert_cb, which increments it when
adding the new block to the list of blocks.

lapb_unregister calls __lapb_remove_cb, which decrements the ref-count
when removing control-block from the list of blocks, and calls lapb_put
itself to decrement the ref-count before returning.

However, lapb_unregister also calls __lapb_devtostruct to look up the
right control-block for the given net_device, and __lapb_devtostruct
also bumps the ref-count, which means that when lapb_unregister returns
the ref-count is still 1 and the control-block is leaked.

Call lapb_put after __lapb_devtostruct to fix leak.

Reported-by: syzbot+afb980676c836b4a0afa@syzkaller.appspotmail.com
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 20:44:20 -07:00
Xin Long
5cf02612b3 tipc: purge deferredq list for each grp member in tipc_group_delete
Syzbot reported a memleak caused by grp members' deferredq list not
purged when the grp is be deleted.

The issue occurs when more(msg_grp_bc_seqno(hdr), m->bc_rcv_nxt) in
tipc_group_filter_msg() and the skb will stay in deferredq.

So fix it by calling __skb_queue_purge for each member's deferredq
in tipc_group_delete() when a tipc sk leaves the grp.

Fixes: b87a5ea31c ("tipc: guarantee group unicast doesn't bypass group broadcast")
Reported-by: syzbot+78fbe679c8ca8d264a8d@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 20:42:05 -07:00
Jeremy Sowden
4201c9260a lapb: moved export of lapb_register.
The EXPORT_SYMBOL for lapb_register was next to a different function.
Moved it to the right place.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 14:25:01 -07:00
Eric Dumazet
d4d5d8e83c ax25: fix inconsistent lock state in ax25_destroy_timer
Before thread in process context uses bh_lock_sock()
we must disable bh.

sysbot reported :

WARNING: inconsistent lock state
5.2.0-rc3+ #32 Not tainted

inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
blkid/26581 [HC0[0]:SC1[1]:HE1:SE0] takes:
00000000e0da85ee (slock-AF_AX25){+.?.}, at: spin_lock include/linux/spinlock.h:338 [inline]
00000000e0da85ee (slock-AF_AX25){+.?.}, at: ax25_destroy_timer+0x53/0xc0 net/ax25/af_ax25.c:275
{SOFTIRQ-ON-W} state was registered at:
  lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4303
  __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
  _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
  spin_lock include/linux/spinlock.h:338 [inline]
  ax25_rt_autobind+0x3ca/0x720 net/ax25/ax25_route.c:429
  ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1221
  __sys_connect+0x264/0x330 net/socket.c:1834
  __do_sys_connect net/socket.c:1845 [inline]
  __se_sys_connect net/socket.c:1842 [inline]
  __x64_sys_connect+0x73/0xb0 net/socket.c:1842
  do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
irq event stamp: 2272
hardirqs last  enabled at (2272): [<ffffffff810065f3>] trace_hardirqs_on_thunk+0x1a/0x1c
hardirqs last disabled at (2271): [<ffffffff8100660f>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (1522): [<ffffffff87400654>] __do_softirq+0x654/0x94c kernel/softirq.c:320
softirqs last disabled at (2267): [<ffffffff81449010>] invoke_softirq kernel/softirq.c:374 [inline]
softirqs last disabled at (2267): [<ffffffff81449010>] irq_exit+0x180/0x1d0 kernel/softirq.c:414

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(slock-AF_AX25);
  <Interrupt>
    lock(slock-AF_AX25);

 *** DEADLOCK ***

1 lock held by blkid/26581:
 #0: 0000000010fd154d ((&ax25->dtimer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:175 [inline]
 #0: 0000000010fd154d ((&ax25->dtimer)){+.-.}, at: call_timer_fn+0xe0/0x720 kernel/time/timer.c:1312

stack backtrace:
CPU: 1 PID: 26581 Comm: blkid Not tainted 5.2.0-rc3+ #32
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_usage_bug.cold+0x393/0x4a2 kernel/locking/lockdep.c:2935
 valid_state kernel/locking/lockdep.c:2948 [inline]
 mark_lock_irq kernel/locking/lockdep.c:3138 [inline]
 mark_lock+0xd46/0x1370 kernel/locking/lockdep.c:3513
 mark_irqflags kernel/locking/lockdep.c:3391 [inline]
 __lock_acquire+0x159f/0x5490 kernel/locking/lockdep.c:3745
 lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4303
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:338 [inline]
 ax25_destroy_timer+0x53/0xc0 net/ax25/af_ax25.c:275
 call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
 expire_timers kernel/time/timer.c:1366 [inline]
 __run_timers kernel/time/timer.c:1685 [inline]
 __run_timers kernel/time/timer.c:1653 [inline]
 run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
 __do_softirq+0x25c/0x94c kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:374 [inline]
 irq_exit+0x180/0x1d0 kernel/softirq.c:414
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
 </IRQ>
RIP: 0033:0x7f858d5c3232
Code: 8b 61 08 48 8b 84 24 d8 00 00 00 4c 89 44 24 28 48 8b ac 24 d0 00 00 00 4c 8b b4 24 e8 00 00 00 48 89 7c 24 68 48 89 4c 24 78 <48> 89 44 24 58 8b 84 24 e0 00 00 00 89 84 24 84 00 00 00 8b 84 24
RSP: 002b:00007ffcaf0cf5c0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
RAX: 00007f858d7d27a8 RBX: 00007f858d7d8820 RCX: 00007f858d3940d8
RDX: 00007ffcaf0cf798 RSI: 00000000f5e616f3 RDI: 00007f858d394fee
RBP: 0000000000000000 R08: 00007ffcaf0cf780 R09: 00007f858d7db480
R10: 0000000000000000 R11: 0000000009691a75 R12: 0000000000000005
R13: 00000000f5e616f3 R14: 0000000000000000 R15: 00007ffcaf0cf798

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 14:22:37 -07:00
Eric Dumazet
f3e92cb8e2 neigh: fix use-after-free read in pneigh_get_next
Nine years ago, I added RCU handling to neighbours, not pneighbours.
(pneigh are not commonly used)

Unfortunately I missed that /proc dump operations would use a
common entry and exit point : neigh_seq_start() and neigh_seq_stop()

We need to read_lock(tbl->lock) or risk use-after-free while
iterating the pneigh structures.

We might later convert pneigh to RCU and revert this patch.

sysbot reported :

BUG: KASAN: use-after-free in pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158
Read of size 8 at addr ffff888097f2a700 by task syz-executor.0/9825

CPU: 1 PID: 9825 Comm: syz-executor.0 Not tainted 5.2.0-rc4+ #32
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158
 neigh_seq_next+0xdb/0x210 net/core/neighbour.c:3240
 seq_read+0x9cf/0x1110 fs/seq_file.c:258
 proc_reg_read+0x1fc/0x2c0 fs/proc/inode.c:221
 do_loop_readv_writev fs/read_write.c:714 [inline]
 do_loop_readv_writev fs/read_write.c:701 [inline]
 do_iter_read+0x4a4/0x660 fs/read_write.c:935
 vfs_readv+0xf0/0x160 fs/read_write.c:997
 kernel_readv fs/splice.c:359 [inline]
 default_file_splice_read+0x475/0x890 fs/splice.c:414
 do_splice_to+0x127/0x180 fs/splice.c:877
 splice_direct_to_actor+0x2d2/0x970 fs/splice.c:954
 do_splice_direct+0x1da/0x2a0 fs/splice.c:1063
 do_sendfile+0x597/0xd00 fs/read_write.c:1464
 __do_sys_sendfile64 fs/read_write.c:1525 [inline]
 __se_sys_sendfile64 fs/read_write.c:1511 [inline]
 __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4592c9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f4aab51dc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000004592c9
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000080000000 R11: 0000000000000246 R12: 00007f4aab51e6d4
R13: 00000000004c689d R14: 00000000004db828 R15: 00000000ffffffff

Allocated by task 9827:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
 __do_kmalloc mm/slab.c:3660 [inline]
 __kmalloc+0x15c/0x740 mm/slab.c:3669
 kmalloc include/linux/slab.h:552 [inline]
 pneigh_lookup+0x19c/0x4a0 net/core/neighbour.c:731
 arp_req_set_public net/ipv4/arp.c:1010 [inline]
 arp_req_set+0x613/0x720 net/ipv4/arp.c:1026
 arp_ioctl+0x652/0x7f0 net/ipv4/arp.c:1226
 inet_ioctl+0x2a0/0x340 net/ipv4/af_inet.c:926
 sock_do_ioctl+0xd8/0x2f0 net/socket.c:1043
 sock_ioctl+0x3ed/0x780 net/socket.c:1194
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:509 [inline]
 do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
 __do_sys_ioctl fs/ioctl.c:720 [inline]
 __se_sys_ioctl fs/ioctl.c:718 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 9824:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kfree+0xcf/0x220 mm/slab.c:3755
 pneigh_ifdown_and_unlock net/core/neighbour.c:812 [inline]
 __neigh_ifdown+0x236/0x2f0 net/core/neighbour.c:356
 neigh_ifdown+0x20/0x30 net/core/neighbour.c:372
 arp_ifdown+0x1d/0x21 net/ipv4/arp.c:1274
 inetdev_destroy net/ipv4/devinet.c:319 [inline]
 inetdev_event+0xa14/0x11f0 net/ipv4/devinet.c:1544
 notifier_call_chain+0xc2/0x230 kernel/notifier.c:95
 __raw_notifier_call_chain kernel/notifier.c:396 [inline]
 raw_notifier_call_chain+0x2e/0x40 kernel/notifier.c:403
 call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1749
 call_netdevice_notifiers_extack net/core/dev.c:1761 [inline]
 call_netdevice_notifiers net/core/dev.c:1775 [inline]
 rollback_registered_many+0x9b9/0xfc0 net/core/dev.c:8178
 rollback_registered+0x109/0x1d0 net/core/dev.c:8220
 unregister_netdevice_queue net/core/dev.c:9267 [inline]
 unregister_netdevice_queue+0x1ee/0x2c0 net/core/dev.c:9260
 unregister_netdevice include/linux/netdevice.h:2631 [inline]
 __tun_detach+0xd8a/0x1040 drivers/net/tun.c:724
 tun_detach drivers/net/tun.c:741 [inline]
 tun_chr_close+0xe0/0x180 drivers/net/tun.c:3451
 __fput+0x2ff/0x890 fs/file_table.c:280
 ____fput+0x16/0x20 fs/file_table.c:313
 task_work_run+0x145/0x1c0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:185 [inline]
 exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:168
 prepare_exit_to_usermode arch/x86/entry/common.c:199 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:279 [inline]
 do_syscall_64+0x58e/0x680 arch/x86/entry/common.c:304
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888097f2a700
 which belongs to the cache kmalloc-64 of size 64
The buggy address is located 0 bytes inside of
 64-byte region [ffff888097f2a700, ffff888097f2a740)
The buggy address belongs to the page:
page:ffffea00025fca80 refcount:1 mapcount:0 mapping:ffff8880aa400340 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea000250d548 ffffea00025726c8 ffff8880aa400340
raw: 0000000000000000 ffff888097f2a000 0000000100000020 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888097f2a600: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
 ffff888097f2a680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888097f2a700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                   ^
 ffff888097f2a780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff888097f2a800: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc

Fixes: 767e97e1e0 ("neigh: RCU conversion of struct neighbour")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 14:15:58 -07:00
Eric Dumazet
2e05fcae83 tcp: fix compile error if !CONFIG_SYSCTL
tcp_tx_skb_cache_key and tcp_rx_skb_cache_key must be available
even if CONFIG_SYSCTL is not set.

Fixes: 0b7d7f6b22 ("tcp: add tcp_tx_skb_cache sysctl")
Fixes: ede61ca474 ("tcp: add tcp_rx_skb_cache sysctl")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 14:15:07 -07:00
Dexuan Cui
d424a2afd7 hv_sock: Suppress bogus "may be used uninitialized" warnings
gcc 8.2.0 may report these bogus warnings under some condition:

warning: ‘vnew’ may be used uninitialized in this function
warning: ‘hvs_new’ may be used uninitialized in this function

Actually, the 2 pointers are only initialized and used if the variable
"conn_from_host" is true. The code is not buggy here.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 14:00:51 -07:00
Govindarajulu Varadarajan
36b2f61a42 net: handle 802.1P vlan 0 packets properly
When stack receives pkt: [802.1P vlan 0][802.1AD vlan 100][IPv4],
vlan_do_receive() returns false if it does not find vlan_dev. Later
__netif_receive_skb_core() fails to find packet type handler for
skb->protocol 801.1AD and drops the packet.

801.1P header with vlan id 0 should be handled as untagged packets.
This patch fixes it by checking if vlan_id is 0 and processes next vlan
header.

Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-16 13:45:14 -07:00
Leon Romanovsky
98fdbea550 net/mlx5: Declare more strictly devlink encap mode
Devlink has UAPI declaration for encap mode, so there is no
need to be loose on the data get/set by drivers.

Update call sites to use enum devlink_eswitch_encap_mode
instead of plain u8.

Suggested-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
2019-06-16 15:40:03 +03:00
Eric Dumazet
967c05aee4 tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
If mtu probing is enabled tcp_mtu_probing() could very well end up
with a too small MSS.

Use the new sysctl tcp_min_snd_mss to make sure MSS search
is performed in an acceptable range.

CVE-2019-11479 -- tcp mss hardcoded to 48

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:47:31 -07:00
Eric Dumazet
5f3e2bf008 tcp: add tcp_min_snd_mss sysctl
Some TCP peers announce a very small MSS option in their SYN and/or
SYN/ACK messages.

This forces the stack to send packets with a very high network/cpu
overhead.

Linux has enforced a minimal value of 48. Since this value includes
the size of TCP options, and that the options can consume up to 40
bytes, this means that each segment can include only 8 bytes of payload.

In some cases, it can be useful to increase the minimal value
to a saner value.

We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility
reasons.

Note that TCP_MAXSEG socket option enforces a minimal value
of (TCP_MIN_MSS). David Miller increased this minimal value
in commit c39508d6f1 ("tcp: Make TCP_MAXSEG minimum more correct.")
from 64 to 88.

We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.

CVE-2019-11479 -- tcp mss hardcoded to 48

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:47:31 -07:00
Eric Dumazet
f070ef2ac6 tcp: tcp_fragment() should apply sane memory limits
Jonathan Looney reported that a malicious peer can force a sender
to fragment its retransmit queue into tiny skbs, inflating memory
usage and/or overflow 32bit counters.

TCP allows an application to queue up to sk_sndbuf bytes,
so we need to give some allowance for non malicious splitting
of retransmit queue.

A new SNMP counter is added to monitor how many times TCP
did not allow to split an skb if the allowance was exceeded.

Note that this counter might increase in the case applications
use SO_SNDBUF socket option to lower sk_sndbuf.

CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the
	socket is already using more than half the allowed space

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:47:31 -07:00
Eric Dumazet
3b4929f65b tcp: limit payload size of sacked skbs
Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :

	BUG_ON(tcp_skb_pcount(skb) < pcount);

This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48

An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.

This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.

Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.

CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs

Fixes: 832d11c5cd ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:47:31 -07:00
David S. Miller
1eb4169c1e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-06-15

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) fix stack layout of JITed x64 bpf code, from Alexei.

2) fix out of bounds memory access in bpf_sk_storage, from Arthur.

3) fix lpm trie walk, from Jonathan.

4) fix nested bpf_perf_event_output, from Matt.

5) and several other fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:19:47 -07:00
Arthur Fabre
85749218e3 bpf: Fix out of bounds memory access in bpf_sk_storage
bpf_sk_storage maps use multiple spin locks to reduce contention.
The number of locks to use is determined by the number of possible CPUs.
With only 1 possible CPU, bucket_log == 0, and 2^0 = 1 locks are used.

When updating elements, the correct lock is determined with hash_ptr().
Calling hash_ptr() with 0 bits is undefined behavior, as it does:

x >> (64 - bits)

Using the value results in an out of bounds memory access.
In my case, this manifested itself as a page fault when raw_spin_lock_bh()
is called later, when running the self tests:

./tools/testing/selftests/bpf/test_verifier 773 775
[   16.366342] BUG: unable to handle page fault for address: ffff8fe7a66f93f8

Force the minimum number of locks to two.

Signed-off-by: Arthur Fabre <afabre@cloudflare.com>
Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-15 14:37:56 -07:00
Jiri Pirko
a51486266c net: sched: remove NET_CLS_IND config option
This config option makes only couple of lines optional.
Two small helpers and an int in couple of cls structs.

Remove the config option and always compile this in.
This saves the user from unexpected surprises when he adds
a filter with ingress device match which is silently ignored
in case the config option is not set.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 14:06:13 -07:00
Stephen Barber
42f5cda5ea vsock/virtio: set SOCK_DONE on peer shutdown
Set the SOCK_DONE flag to match the TCP_CLOSING state when a peer has
shut down and there is nothing left to read.

This fixes the following bug:
1) Peer sends SHUTDOWN(RDWR).
2) Socket enters TCP_CLOSING but SOCK_DONE is not set.
3) read() returns -ENOTCONN until close() is called, then returns 0.

Signed-off-by: Stephen Barber <smbarber@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 14:01:09 -07:00
Vivien Didelot
79b139f4bc net: dsa: use switchdev handle helpers
Get rid of the dsa_slave_switchdev_port_{attr_set,obj}_event functions
in favor of the switchdev_handle_port_{attr_set,obj_add,obj_del}
helpers which recurse into the lower devices of the target interface.

This has the benefit of being aware of the operations made on the
bridge device itself, where orig_dev is the bridge, and dev is the
slave. This can be used later to configure the hardware switches.

Only VLAN and (port) MDB objects not directly targeting the slave
device are unsupported at the moment, so skip this case in their
respective case statements.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:20:07 -07:00
Vivien Didelot
f3b78049d4 net: dsa: make dsa_slave_dev_check use const
The switchdev handle helpers make use of a device checking helper
requiring a const net_device. Make dsa_slave_dev_check compliant
to this.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:20:07 -07:00
Vivien Didelot
14aafbd822 net: dsa: do not check orig_dev in vlan del
The current DSA code handling switchdev objects does not recurse into
the lower devices thus is never called with an orig_dev member being
a bridge device, hence remove this useless check.

At the same time, remove the comments about the callers, which is
unlikely to be updated if the code changes and thus will be confusing.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:20:07 -07:00
Eric Dumazet
ce27ec6064 net: add high_order_alloc_disable sysctl/static key
>From linux-3.7, (commit 5640f76858 "net: use a per task frag
allocator") TCP sendmsg() has preferred using order-3 allocations.

While it gives good results for most cases, we had reports
that heavy uses of TCP over loopback were hitting a spinlock
contention in page allocations/freeing.

This commits adds a sysctl so that admins can opt-in
for order-0 allocations. Hopefully mm layer might optimize
order-3 allocations in the future since it could give us
a nice boost  (see 8 lines of following benchmark)

The following benchmark shows a win when more than 8 TCP_STREAM
threads are running (56 x86 cores server in my tests)

for thr in {1..30}
do
 sysctl -wq net.core.high_order_alloc_disable=0
 T0=`./super_netperf $thr -H 127.0.0.1 -l 15`
 sysctl -wq net.core.high_order_alloc_disable=1
 T1=`./super_netperf $thr -H 127.0.0.1 -l 15`
 echo $thr:$T0:$T1
done

1: 49979: 37267
2: 98745: 76286
3: 141088: 110051
4: 177414: 144772
5: 197587: 173563
6: 215377: 208448
7: 241061: 234087
8: 267155: 263373
9: 295069: 297402
10: 312393: 335213
11: 340462: 368778
12: 371366: 403954
13: 412344: 443713
14: 426617: 473580
15: 474418: 507861
16: 503261: 538539
17: 522331: 563096
18: 532409: 567084
19: 550824: 605240
20: 525493: 641988
21: 564574: 665843
22: 567349: 690868
23: 583846: 710917
24: 588715: 736306
25: 603212: 763494
26: 604083: 792654
27: 602241: 796450
28: 604291: 797993
29: 611610: 833249
30: 577356: 841062

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:28 -07:00
Eric Dumazet
0b7d7f6b22 tcp: add tcp_tx_skb_cache sysctl
Feng Tang reported a performance regression after introduction
of per TCP socket tx/rx caches, for TCP over loopback (netperf)

There is high chance the regression is caused by a change on
how well the 32 KB per-thread page (current->task_frag) can
be recycled, and lack of pcp caches for order-3 pages.

I could not reproduce the regression myself, cpus all being
spinning on the mm spinlocks for page allocs/freeing, regardless
of enabling or disabling the per tcp socket caches.

It seems best to disable the feature by default, and let
admins enabling it.

MM layer either needs to provide scalable order-3 pages
allocations, or could attempt a trylock on zone->lock if
the caller only attempts to get a high-order page and is
able to fallback to order-0 ones in case of pressure.

Tests run on a 56 cores host (112 hyper threads)

-	35.49%	netperf 		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
   - 35.49% queued_spin_lock_slowpath
	  - 18.18% get_page_from_freelist
		 - __alloc_pages_nodemask
			- 18.18% alloc_pages_current
				 skb_page_frag_refill
				 sk_page_frag_refill
				 tcp_sendmsg_locked
				 tcp_sendmsg
				 inet_sendmsg
				 sock_sendmsg
				 __sys_sendto
				 __x64_sys_sendto
				 do_syscall_64
				 entry_SYSCALL_64_after_hwframe
				 __libc_send
	  + 17.31% __free_pages_ok
+	31.43%	swapper 		 [kernel.vmlinux]	  [k] intel_idle
+	 9.12%	netperf 		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
+	 6.53%	netserver		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
+	 0.69%	netserver		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
+	 0.68%	netperf 		 [kernel.vmlinux]	  [k] skb_release_data
+	 0.52%	netperf 		 [kernel.vmlinux]	  [k] tcp_sendmsg_locked
	 0.46%	netperf 		 [kernel.vmlinux]	  [k] _raw_spin_lock_irqsave

Fixes: 472c2e07ee ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:28 -07:00
Eric Dumazet
ede61ca474 tcp: add tcp_rx_skb_cache sysctl
Instead of relying on rps_needed, it is safer to use a separate
static key, since we do not want to enable TCP rx_skb_cache
by default. This feature can cause huge increase of memory
usage on hosts with millions of sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:28 -07:00
Tim Beale
f48d2ccee1 udp: Remove unused variable/function (exact_dif)
This was originally passed through to the VRF logic in compute_score().
But that logic has now been replaced by udp_sk_bound_dev_eq() and so
this code is no longer used or needed.

Signed-off-by: Tim Beale <timbeale@catalyst.net.nz>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:52:07 -07:00
Tim Beale
735453730a udp: Remove unused parameter (exact_dif)
Originally this was used by the VRF logic in compute_score(), but that
was later replaced by udp_sk_bound_dev_eq() and the parameter became
unused.

Note this change adds an 'unused variable' compiler warning that will be
removed in the next patch (I've split the removal in two to make review
slightly easier).

Signed-off-by: Tim Beale <timbeale@catalyst.net.nz>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:52:07 -07:00
Eric Dumazet
d6fb396cfa ipv4: tcp: fix ACK/RST sent with a transmit delay
If we want to set a EDT time for the skb we want to send
via ip_send_unicast_reply(), we have to pass a new parameter
and initialize ipc.sockc.transmit_time with it.

This fixes the EDT time for ACK/RST packets sent on behalf of
a TIME_WAIT socket.

Fixes: a842fe1425 ("tcp: add optional per socket transmit delay")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:51:12 -07:00
Li RongQing
3e18943333 net: remove empty netlink_tap_exit_net
Pointer members of an object with static storage duration, if not
explicitly initialized, will be initialized to a NULL pointer. The
net namespace API checks if this pointer is not NULL before using it,
it are safe to remove the function.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:50:33 -07:00
David S. Miller
b4a6d9abeb mlx5-updates-2019-06-13
Mlx5 devlink health fw reporters and sw reset support
 
 This series provides mlx5 firmware reset support and firmware devlink health
 reporters.
 
 1) Add initial mlx5 kernel documentation and include devlink health reporters
 
 2) Add CR-Space access and FW Crdump snapshot support via devlink region_snapshot
 
 3) Issue software reset upon FW asserts
 
 4) Add fw and fw_fatal devlink heath reporters to follow fw errors indication by
 dump and recover procedures and enable trigger these functionality by user.
 
 4.1) fw reporter:
 The fw reporter implements diagnose and dump callbacks.
 It follows symptoms of fw error such as fw syndrome by triggering
 fw core dump and storing it and any other fw trace into the dump buffer.
 The fw reporter diagnose command can be triggered any time by the user to check
 current fw status.
 
 4.2) fw_fatal repoter:
 The fw_fatal reporter implements dump and recover callbacks.
 It follows fatal errors indications by CR-space dump and recover flow.
 The CR-space dump uses vsc interface which is valid even if the FW command
 interface is not functional, which is the case in most FW fatal errors. The
 CR-space dump is stored as a memory region snapshot to ease read by address.
 The recover function runs recover flow which reloads the driver and triggers fw
 reset if needed.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl0CsLgACgkQSD+KveBX
 +j7mFwf+MYvIbUO4mXyoZIezci1UCzt1vNAkUYPceE94O9fK68ItrwtwrstgIqqS
 58Tgx//MXxPpe9k9NIWjeS3i8sjcb8fDoqkjOCj7KAchv0IhSUvYFRpBrUK+yTOW
 NIIXZzuCgIoR9a/hVlT/lhG+dm4MX2L5dWFtORLxMoO+ff3yiy4nNf9+Zdt0H7LT
 YCELWnKeIQCvdzJAxX7OyTh3eOfc/h7o1nOsU4VugBHxKxx4T+9A26d+cZeZH5Ox
 3ikTCc01ivVHqcLydAy96HQu0MENSNYNpmyDxWum3oJGFFu6hBQTM2ueRmVWZfwH
 DRu+hhxONZROxxtpmP/ULmwYcLnBHg==
 =VhXt
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2019-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2019-06-13

Mlx5 devlink health fw reporters and sw reset support

This series provides mlx5 firmware reset support and firmware devlink health
reporters.

1) Add initial mlx5 kernel documentation and include devlink health reporters

2) Add CR-Space access and FW Crdump snapshot support via devlink region_snapshot

3) Issue software reset upon FW asserts

4) Add fw and fw_fatal devlink heath reporters to follow fw errors indication by
dump and recover procedures and enable trigger these functionality by user.

4.1) fw reporter:
The fw reporter implements diagnose and dump callbacks.
It follows symptoms of fw error such as fw syndrome by triggering
fw core dump and storing it and any other fw trace into the dump buffer.
The fw reporter diagnose command can be triggered any time by the user to check
current fw status.

4.2) fw_fatal repoter:
The fw_fatal reporter implements dump and recover callbacks.
It follows fatal errors indications by CR-space dump and recover flow.
The CR-space dump uses vsc interface which is valid even if the FW command
interface is not functional, which is the case in most FW fatal errors. The
CR-space dump is stored as a memory region snapshot to ease read by address.
The recover function runs recover flow which reloads the driver and triggers fw
reset if needed.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:44:29 -07:00
Stephen Suryaputra
363887a2cd ipv4: Support multipath hashing on inner IP pkts for GRE tunnel
Multipath hash policy value of 0 isn't distributing since the outer IP
dest and src aren't varied eventhough the inner ones are. Since the flow
is on the inner ones in the case of tunneled traffic, hashing on them is
desired.

This is done mainly for IP over GRE, hence only tested for that. But
anything else supported by flow dissection should work.

v2: Use skb_flow_dissect_flow_keys() directly so that other tunneling
    can be supported through flow dissection (per Nikolay Aleksandrov).
v3: Remove accidental inclusion of ports in the hash keys and clarify
    the documentation (Nikolay Alexandrov).
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:42:35 -07:00
Vlad Buslov
87f373921c net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops
To remove rtnl lock dependency in tc filter update API when using clsact
Qdisc, set QDISC_CLASS_OPS_DOIT_UNLOCKED flag in clsact Qdisc_class_ops.

Clsact Qdisc ops don't require any modifications to be used without rtnl
lock on tc filter update path. Implementation never changes its q->block
and only releases it when Qdisc is being destroyed. This means it is enough
for RTM_{NEWTFILTER|DELTFILTER|GETTFILTER} message handlers to hold clsact
Qdisc reference while using it without relying on rtnl lock protection.
Unlocked Qdisc ops support is already implemented in filter update path by
unlocked cls API patch set.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:32:56 -07:00
Willem de Bruijn
7b58139f98 tcp: use static_branch_deferred_inc for clean_acked_data_enabled
Deferred static key clean_acked_data_enabled uses the deferred
variants of dec and flush. Do the same for inc.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:31:48 -07:00
Vlad Buslov
99815f5031 net: sched: flower: don't call synchronize_rcu() on mask creation
Current flower mask creating code assumes that temporary mask that is used
when inserting new filter is stack allocated. To prevent race condition
with data patch synchronize_rcu() is called every time fl_create_new_mask()
replaces temporary stack allocated mask. As reported by Jiri, this
increases runtime of creating 20000 flower classifiers from 4 seconds to
163 seconds. However, this design is no longer necessary since temporary
mask was converted to be dynamically allocated by commit 2cddd20147
("net/sched: cls_flower: allocate mask dynamically in fl_change()").

Remove synchronize_rcu() calls from mask creation code. Instead, refactor
fl_change() to always deallocate temporary mask with rcu grace period.

Fixes: 195c234d15 ("net: sched: flower: handle concurrent mask insertion")
Reported-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:29:57 -07:00
Neil Horman
ce950f1050 sctp: Free cookie before we memdup a new one
Based on comments from Xin, even after fixes for our recent syzbot
report of cookie memory leaks, its possible to get a resend of an INIT
chunk which would lead to us leaking cookie memory.

To ensure that we don't leak cookie memory, free any previously
allocated cookie first.

Change notes
v1->v2
update subsystem tag in subject (davem)
repeat kfree check for peer_random and peer_hmacs (xin)

v2->v3
net->sctp
also free peer_chunks

v3->v4
fix subject tags

v4->v5
remove cut line

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com
CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
CC: Xin Long <lucien.xin@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:27:24 -07:00
Sunil Muthuswamy
d5afa82c97 vsock: correct removal of socket from the list
The current vsock code for removal of socket from the list is both
subject to race and inefficient. It takes the lock, checks whether
the socket is in the list, drops the lock and if the socket was on the
list, deletes it from the list. This is subject to race because as soon
as the lock is dropped once it is checked for presence, that condition
cannot be relied upon for any decision. It is also inefficient because
if the socket is present in the list, it takes the lock twice.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:20:20 -07:00
Eric Dumazet
9bb6cd6524 net/packet: introduce packet_rcv_try_clear_pressure() helper
There are two places where we want to clear the pressure
if possible, add a helper to make it more obvious.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 18:52:14 -07:00
Eric Dumazet
3a2bb84e4a net/packet: remove locking from packet_rcv_has_room()
__packet_rcv_has_room() can now be run without lock being held.

po->pressure is only a non persistent hint, we can mark
all read/write accesses with READ_ONCE()/WRITE_ONCE()
to document the fact that the field could be written
without any synchronization.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 18:52:14 -07:00
Eric Dumazet
2c51c6273c net/packet: implement shortcut in tpacket_rcv()
tpacket_rcv() can be hit under DDOS quite hard, since
it will always grab a socket spinlock, to eventually find
there is no room for an additional packet.

Using tcpdump [1] on a busy host can lead to catastrophic consequences,
because of all cpus spinning on a contended spinlock.

This replicates a similar strategy used in packet_rcv()

[1] Also some applications mistakenly use af_packet socket
bound to ETH_P_ALL only to send packets.
Receive queue is never drained and immediately full.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 18:52:14 -07:00
Eric Dumazet
8e8e2951e3 net/packet: make tp_drops atomic
Under DDOS, we want to be able to increment tp_drops without
touching the spinlock. This will help readers to drain
the receive queue slightly faster :/

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 18:52:14 -07:00
Eric Dumazet
0338a14523 net/packet: constify __packet_rcv_has_room()
Goal is use the helper without lock being held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 18:52:14 -07:00
Eric Dumazet
dcf70cefaf net/packet: constify prb_lookup_block() and __tpacket_v3_has_room()
Goal is to be able to use __tpacket_v3_has_room() without holding
a lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 18:52:14 -07:00
Eric Dumazet
d4b5bd98fa net/packet: constify packet_lookup_frame() and __tpacket_has_room()
Goal is to be able to use __tpacket_has_room() without holding a lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 18:52:14 -07:00
Eric Dumazet
96f657e6cf net/packet: constify __packet_get_status() argument
struct packet_sock  is only read.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 18:52:14 -07:00
Stanislav Fomichev
1314ef5611 bpf: export bpf_sock for BPF_PROG_TYPE_SOCK_OPS prog type
And let it use bpf_sk_storage_{get,delete} helpers to access socket
storage. Kernel context (struct bpf_sock_ops_kern) already has sk
member, so I just expose it to the BPF hooks. I use
PTR_TO_SOCKET_OR_NULL and return NULL in !is_fullsock case.

I also export bpf_tcp_sock to make it possible to access tcp socket stats.

Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-15 01:21:57 +02:00
Stanislav Fomichev
fb85c4a730 bpf: export bpf_sock for BPF_PROG_TYPE_CGROUP_SOCK_ADDR prog type
And let it use bpf_sk_storage_{get,delete} helpers to access socket
storage. Kernel context (struct bpf_sock_addr_kern) already has sk
member, so I just expose it to the BPF hooks. Using PTR_TO_SOCKET
instead of PTR_TO_SOCK_COMMON should be safe because the hook is
called on bind/connect.

Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-15 01:21:56 +02:00
Martin KaFai Lau
99f3a064bc bpf: net: Add SO_DETACH_REUSEPORT_BPF
There is SO_ATTACH_REUSEPORT_[CE]BPF but there is no DETACH.
This patch adds SO_DETACH_REUSEPORT_BPF sockopt.  The same
sockopt can be used to undo both SO_ATTACH_REUSEPORT_[CE]BPF.

reseport_detach_prog() is added and it is mostly a mirror
of the existing reuseport_attach_prog().  The differences are,
it does not call reuseport_alloc() and returns -ENOENT when
there is no old prog.

Cc: Craig Gallek <kraig@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-15 01:21:19 +02:00
Mauro Carvalho Chehab
151f4e2bdc docs: power: convert docs to ReST and rename to *.rst
Convert the PM documents to ReST, in order to allow them to
build with Sphinx.

The conversion is actually:
  - add blank lines and indentation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
2019-06-14 16:08:36 -05:00
Mauro Carvalho Chehab
cd238effef docs: kbuild: convert docs to ReST and rename to *.rst
The kbuild documentation clearly shows that the documents
there are written at different times: some use markdown,
some use their own peculiar logic to split sections.

Convert everything to ReST without affecting too much
the author's style and avoiding adding uneeded markups.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-14 14:21:21 -06:00
David S. Miller
d96ec97511 Many changes all over:
* HE (802.11ax) work continues
  * WPA3 offloads
  * work on extended key ID handling continues
  * fixes to honour AP supported rates with auth/assoc frames
  * nl80211 netlink policy improvements to fix some issues
    with strict validation on new commands with old attrs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl0Dq/sACgkQB8qZga/f
 l8RqFg/+MBcuqvW2xTy5o5Lbw7Drx5ROgFT2ZRAO6PTeboQ43NOBiXt2dEhDbp+w
 mHChImF85px3SFMBSvuf97zlScNV6+VJraDDjoZFixt/gIZ/XsdURo5i4IGmUbfj
 +LY1oPm7suC5Cold+yPicHTukFpeU7cSwceslFsecqiN5unlzIxf6gY9H7OL7WGT
 s0Wis0x3y2m9mMi4cvQfHkFzplcTc5SBgPLyLQtHUNx1eySEZ+AymlNVmbGrRWr9
 vaCU5W9+Wz0N6lEB/UI5y6fZzj5mhkcimGck1Os7dFeC7KWjntjT9iKIkFHWehxi
 QfLcK6pGjLpPpMTQtOEfl34ZGnOyO8N9GmOLaaUaBeaZItabYJwfgbdr7NxiJvta
 1cyqXek+D2G7WOa0aIrWhmwswKGBa3nIBqS/ZP/SEWLEzU1Cn0NiAD5Ba016TC4C
 D+1BBXIdpQDoZCgfd6KkGs2Ynf/8N3OwHW+EwjpAu3IARTQzb6tMWSvkAuAgJt1F
 dBD7NqdFhWXFfxqf9NpB8bkmpyNKM4Km6eO2HKpCg/5suKqYJ1Xj9EeQin1B+QsE
 Jntj69hQ6Kj2gKBPy+RnCBFbxMNuFhpc1kmUOGj9U9aAcOntV0woVOyFGsbRmFo3
 MI8aVU/gjQDCcHHD5xtJGHa11uIefXq1r2H7Um3sxKYeBsqFjP4=
 =j+Um
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2019-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Many changes all over:
 * HE (802.11ax) work continues
 * WPA3 offloads
 * work on extended key ID handling continues
 * fixes to honour AP supported rates with auth/assoc frames
 * nl80211 netlink policy improvements to fix some issues
   with strict validation on new commands with old attrs
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 11:27:26 -07:00
David S. Miller
2a2af5e6e6 Various fixes, all over:
* a few memory leaks
  * fixes for management frame protection security
    and A2/A3 confusion (affecting TDLS as well)
  * build fix for certificates
  * etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl0DpdQACgkQB8qZga/f
 l8RvRhAAnJGBRb73LMCUGQdgv8IXXVX8jwRwIwLT9FeJc9Pg9I7o/d4UXvH0ss2D
 qIPWC7CYuI8LUuyu8RXiO3iFKKbWaWDI9cQj9jKXTRWSTUGgSs1zgS3yEcJPJY/V
 q74g3MjK9yYE7UUbhI/ud5yrKEc6XXAWgaGKZzuNYS/SR6vpmy/v+jH8SLKjIS48
 iXQUAQJn/TgIynjfm/d8GNLr5TN5i4uqRD6trdSeWaKIVK/3Q8GO4C6DvqLJuClJ
 n7XTUG0Xbzs4U+k5abtTsRIz6Mh5nHiqCPS/ueeQuLASJzVeXg2mfNGzsbJdLh0c
 J65kbvBqeG0/AD5uybl8VmUgcW/mSDevM6g1pOVbHDrPcg1dyzQBAihKRaoAkM0f
 9YpzWxkQSt9loE1Md9Fn0knhesttt/2wc72Rs/jEeDftj1NP7nt3fnHF2xufHHdb
 JYjsgcLX3rmIrRSvn4yup8kPWmaCI0dvPDbfSQEH9PrthhQVCtHuiFwAmu9LO0o4
 CQ0RuiYFKYOuigabVn32w3S57jKBo/ie09Nnw/sJIsXDiaPLGyxp9L+5wbf8Dhnd
 OeqlhYZD26oTx/gz0lxXjX19ZfWZEN2rpXTCPq7FVMgWMjGJCgQok4WfTfB39hc+
 lPJOa4YM6G1rRppZQfUUmayPXYLw1VJTioNSf8TMLW7opf2aT1o=
 =AHi5
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2019-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Various fixes, all over:
 * a few memory leaks
 * fixes for management frame protection security
   and A2/A3 confusion (affecting TDLS as well)
 * build fix for certificates
 * etc.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 09:36:51 -07:00
Kevin Darbyshire-Bryant
733f0766c3 sched: act_ctinfo: use extack error reporting
Use extack error reporting mechanism in addition to returning -EINVAL

NL_SET_ERR_* code shamelessy copy/paste/adjusted from act_pedit &
sch_cake and used as reference as to what I should have done in the
first place.

Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 08:54:27 -07:00
Young Xiao
385097a367 nfc: Ensure presence of required attributes in the deactivate_target handler
Check that the NFC_ATTR_TARGET_INDEX attributes (in addition to
NFC_ATTR_DEVICE_INDEX) are provided by the netlink client prior to
accessing them. This prevents potential unhandled NULL pointer dereference
exceptions which can be triggered by malicious user-mode programs,
if they omit one or both of these attributes.

Signed-off-by: Young Xiao <92siuyang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 08:48:08 -07:00
Greg Kroah-Hartman
3adcfa4478 l2tp: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Also, there is no need to store the individual debugfs file name, just
remove the whole directory all at once, saving a local variable.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guillaume Nault <g.nault@alphalink.fr>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 08:45:05 -07:00
James Prestwood
ddb754aa31 mac80211: notify offchannel expire on mgmt_tx
When the offchannel TX wait time expires, send the appropriate event.

Signed-off-by: James Prestwood <james.prestwood@linux.intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 16:08:28 +02:00
James Prestwood
1c38c7f220 nl80211: send event when CMD_FRAME duration expires
cfg80211_remain_on_channel_expired is used to notify userspace when
the remain on channel duration expired by sending an event. There is
no such equivalent to CMD_FRAME, where if offchannel and a duration
is provided, the card will go offchannel for that duration. Currently
there is no way for userspace to tell when that duration expired
apart from setting an independent timeout. This timeout is quite
erroneous as the kernel may not immediately send out the frame
because of scheduling or work queue delays. In testing, it was found
this timeout had to be quite large to accomidate any potential delays.

A better solution is to have the kernel send an event when this
duration has expired. There is already NL80211_CMD_FRAME_WAIT_CANCEL
which can be used to cancel a NL80211_CMD_FRAME offchannel. Using this
command matches perfectly to how NL80211_CMD_CANCEL_REMAIN_ON_CHANNEL
works, where its both used to cancel and notify if the duration has
expired.

Signed-off-by: James Prestwood <james.prestwood@linux.intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 16:08:28 +02:00
Greg Kroah-Hartman
5a7bb7ce26 mac80211: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 16:08:28 +02:00
Avraham Stern
b658420253 cfg80211: report measurement start TSF correctly
Instead of reporting the AP's TSF, host time was reported. Fix it.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 15:46:33 +02:00
Eric Biggers
4f488fbca2 cfg80211: fix memory leak of wiphy device name
In wiphy_new_nm(), if an error occurs after dev_set_name() and
device_initialize() have already been called, it's necessary to call
put_device() (via wiphy_free()) to avoid a memory leak.

Reported-by: syzbot+7fddca22578bc67c3fe4@syzkaller.appspotmail.com
Fixes: 1f87f7d3a3 ("cfg80211: add rfkill support")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 15:46:33 +02:00
Mordechay Goodstein
1a473d6092 cfg80211: util: fix bit count off by one
The bits of Rx MCS Map in VHT capability were enumerated
with index transform - index i -> (i + 1) bit => nss i. BUG!
while it should be -   index i -> (i + 1) bit => (i + 1) nss.

The bug was exposed in commit a53b2a0b12 ("iwlwifi: mvm: implement VHT
extended NSS support in rs.c"), where iwlwifi started using the
function.

Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Fixes: b0aa75f0b1 ("ieee80211: add new VHT capability fields/parsing")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 15:46:33 +02:00
Naftali Goldstein
f8891461a2 mac80211: do not start any work during reconfigure flow
It is not a good idea to try to perform any work (e.g. send an auth
frame) during reconfigure flow.

Prevent this from happening, and at the end of the reconfigure flow
requeue all the works.

Signed-off-by: Naftali Goldstein <naftali.goldstein@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 15:46:33 +02:00
Luca Coelho
ebb3ca3b44 cfg80211: use BIT_ULL in cfg80211_parse_mbssid_data()
The seen_indices variable is u64 and in other parts of the code we
assume mbssid_index_ie[2] can be up to 45, so we should use the 64-bit
versions of BIT, namely, BIT_ULL().

Reported-by: Dan Carpented <dan.carpenter@oracle.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 15:46:33 +02:00
Yibo Zhao
5635723401 mac80211: only warn once on chanctx_conf being NULL
In multiple SSID cases, it takes time to prepare every AP interface
to be ready in initializing phase. If a sta already knows everything it
needs to join one of the APs and sends authentication to the AP which
is not fully prepared at this point of time, AP's channel context
could be NULL. As a result, warning message occurs.

Even worse, if the AP is under attack via tools such as MDK3 and massive
authentication requests are received in a very short time, console will
be hung due to kernel warning messages.

WARN_ON_ONCE() could be a better way for indicating warning messages
without duplicate messages to flood the console.

Johannes: We still need to address the underlying problem, but we
          don't really have a good handle on it yet. Suppress the
          worst side-effects for now.

Signed-off-by: Zhi Chen <zhichen@codeaurora.org>
Signed-off-by: Yibo Zhao <yiboz@codeaurora.org>
[johannes: add note, change subject]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 15:46:33 +02:00
Johannes Berg
588f7d39b3 mac80211: drop robust management frames from unknown TA
When receiving a robust management frame, drop it if we don't have
rx->sta since then we don't have a security association and thus
couldn't possibly validate the frame.

Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 15:46:33 +02:00
Johannes Berg
163a7cdd1f mac80211: extend __rate_control_send_low warning
This appears to happen occasionally, and if it does we
really want even more information than we have now.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:18:05 +02:00
Johannes Berg
583a7a34c5 mac80211: fill low rate even for HAS_RATE_CONTROL
If HW advertises it has rate control, we skip all of the
rate control assignments, but sometimes the data we have
here is useful, especially so that we don't have to do
the lookups again on which rates are configured and are
supported.

So do the low rate assignment anyway to help out drivers
that might need it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:17:55 +02:00
Johannes Berg
bd718fc11d mac80211: use STA info in rate_control_send_low()
Even if we have a station, we currently call rate_control_send_low()
with the NULL station unless further rate control (driver, minstrel)
has been initialized.

Change this so we can use more information about the station to use
a better rate. For example, when we associate with an AP, we will
now use the lowest rate it advertised as supported (that we can)
rather than the lowest mandatory rate. This aligns our behaviour
with most other 802.11 implementations.

To make this possible, we need to also ensure that we have non-zero
rates at all times, so in case we really have *nothing* pre-fill
the supp_rates bitmap with the very lowest mandatory bitmap (11b
and 11a on 2.4 and 5 GHz respectively).

Additionally, hostapd appears to be giving us an empty supported
rates bitmap (it can and should do better, since the STA must have
supported for at least the basic rates in the BSS), so ignore any
such bitmaps that would actually zero out the supp_rates, and in
that case just keep the pre-filled mandatory rates.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:17:37 +02:00
Johannes Berg
1e87fec9fa mac80211: call rate_control_send_low() internally
There's no rate control algorithm that *doesn't* want to call
it internally, and calling it internally will let us modify
its behaviour in the future.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:17:37 +02:00
Ilan Peer
4770c8f902 cfg80211: Add a function to iterate all BSS entries
Add a function that iterates over the BSS entries associated with a
given wiphy and calls a callback for each iterated BSS. This can be
used by drivers in various ways, e.g., to evaluate some property for
all the BSSs in the medium.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:16:02 +02:00
John Crispin
a0de1ca383 mac80211: allow turning TWT responder support on and off via netlink
Allow the userland daemon to en/disable TWT support for an AP.

Signed-off-by: Shashidhar Lakkavalli <slakkavalli@datto.com>
Signed-off-by: John Crispin <john@phrozen.org>
[simplify parsing code]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:14:10 +02:00
John Crispin
c9d3245e03 mac80211: dynamically enable the TWT requester support on STA interfaces
Turn TWT for STA interfaces when they associate and/or receive a
beacon where the twt_responder bit has changed.

Signed-off-by: Shashidhar Lakkavalli <slakkavalli@datto.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:12:13 +02:00
Johannes Berg
901bb98918 nl80211: require and validate vendor command policy
Require that each vendor command give a policy of its sub-attributes
in NL80211_ATTR_VENDOR_DATA, and then (stricly) check the contents,
including the NLA_F_NESTED flag that we couldn't check on the outer
layer because there we don't know yet.

It is possible to use VENDOR_CMD_RAW_DATA for raw data, but then no
nested data can be given (NLA_F_NESTED flag must be clear) and the
data is just passed as is to the command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:12:01 +02:00
Chung-Hsien Hsu
26f7044e95 nl80211: add support for SAE authentication offload
Let drivers advertise support for station-mode SAE authentication
offload with a new NL80211_EXT_FEATURE_SAE_OFFLOAD flag.

Signed-off-by: Chung-Hsien Hsu <stanley.hsu@cypress.com>
Signed-off-by: Chi-Hsien Lin <chi-hsien.lin@cypress.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:07:49 +02:00
Chung-Hsien Hsu
cc3e14c21a nl80211: add WPA3 definition for SAE authentication
Add definition of WPA version 3 for SAE authentication.

Signed-off-by: Chung-Hsien Hsu <stanley.hsu@cypress.com>
Signed-off-by: Chi-Hsien Lin <chi-hsien.lin@cypress.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:07:34 +02:00
Chung-Hsien Hsu
f4d75993d8 nl80211: add NL80211_ATTR_IFINDEX to port authorized event
Add NL80211_ATTR_IFINDEX attribute to port authorized event to indicate
the operating interface of the device. Also put NL80211_ATTR_WIPHY
attribute in it to be consistent with the other MLME notifications.

Signed-off-by: Chung-Hsien Hsu <stanley.hsu@cypress.com>
Signed-off-by: Chi-Hsien Lin <chi-hsien.lin@cypress.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:07:17 +02:00
Alexander Wetzel
90cc4bd611 mac80211: AMPDU handling for Extended Key ID
IEEE 802.11 - 2016 forbids mixing MPDUs with different keyIDs in one
A-MPDU. Drivers supporting A-MPDUs and Extended Key ID must actively
enforce that requirement due to the available two unicast keyIDs.

Allow driver to signal mac80211 that they will not check the keyID in
MPDUs when aggregating them and that they expect mac80211 to stop Tx
aggregation when rekeying a connection using Extended Key ID.

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:05:35 +02:00
Li RongQing
8928aa6a00 xfrm: remove empty xfrmi_init_net
Pointer members of an object with static storage duration, if not
explicitly initialized, will be initialized to a NULL pointer. The
net namespace API checks if this pointer is not NULL before using it,
it are safe to remove the function.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-14 08:31:32 +02:00
Vladimir Oltean
6358f38c06 net: dsa: tag_sja1105: Select CONFIG_PACKING
The packing facility is needed to decode Ethernet meta frames containing
source port and RX timestamping information.

The DSA driver selects CONFIG_PACKING, but the tagger did not, and since
taggers can be now compiled as modules independently from the drivers
themselves, this is an issue now, as CONFIG_PACKING is disabled by
default on all architectures.

Fixes: e53e18a6fe ("net: dsa: sja1105: Receive and decode meta frames")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-13 22:42:19 -07:00
Aya Levin
e44ef4e451 devlink: Hang reporter's dump method on a dumpit cb
The devlink health reporter provides a dump method on an error. Dump
may contain a large amount of data, in this case doit cb isn't sufficient.
This is because the user side is blocking and doesn't allow draining of
the socket until the socket runs out of buffers. Using dumpit cb
is the correct way to go.
Please note that thankfully the dump op is not yet implemented in any
driver and therefore this change is not breaking userspace.

Fixes: 35455e23e6 ("devlink: Add health dump {get,clear} commands")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-13 13:23:16 -07:00
Eric Dumazet
a842fe1425 tcp: add optional per socket transmit delay
Adding delays to TCP flows is crucial for studying behavior
of TCP stacks, including congestion control modules.

Linux offers netem module, but it has unpractical constraints :
- Need root access to change qdisc
- Hard to setup on egress if combined with non trivial qdisc like FQ
- Single delay for all flows.

EDT (Earliest Departure Time) adoption in TCP stack allows us
to enable a per socket delay at a very small cost.

Networking tools can now establish thousands of flows, each of them
with a different delay, simulating real world conditions.

This requires FQ packet scheduler or a EDT-enabled NIC.

This patchs adds TCP_TX_DELAY socket option, to set a delay in
usec units.

  unsigned int tx_delay = 10000; /* 10 msec */

  setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));

Note that FQ packet scheduler limits might need some tweaking :

man tc-fq

PARAMETERS
   limit
       Hard  limit  on  the  real  queue  size. When this limit is
       reached, new packets are dropped. If the value is  lowered,
       packets  are  dropped so that the new limit is met. Default
       is 10000 packets.

   flow_limit
       Hard limit on the maximum  number  of  packets  queued  per
       flow.  Default value is 100.

Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc,
so packets would be dropped if any of the previous limit is hit.

Use of a jump label makes this support runtime-free, for hosts
never using the option.

Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch : we need to account that skbs artificially delayed
wont stop us providind more skbs to feed the pipe (netem uses
skb_orphan_partial() for this purpose, but FQ can not use this trick)

Because of that, using big delays might very well trigger
old bugs in TSO auto defer logic and/or sndbuf limited detection.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 13:05:43 -07:00
John Fastabend
648ee6cea7 net: tls, correctly account for copied bytes with multiple sk_msgs
tls_sw_do_sendpage needs to return the total number of bytes sent
regardless of how many sk_msgs are allocated. Unfortunately, copied
(the value we return up the stack) is zero'd before each new sk_msg
is allocated so we only return the copied size of the last sk_msg used.

The caller (splice, etc.) of sendpage will then believe only part
of its data was sent and send the missing chunks again. However,
because the data actually was sent the receiver will get multiple
copies of the same data.

To reproduce this do multiple sendfile calls with a length close to
the max record size. This will in turn call splice/sendpage, sendpage
may use multiple sk_msg in this case and then returns the incorrect
number of bytes. This will cause splice to resend creating duplicate
data on the receiver. Andre created a C program that can easily
generate this case so we will push a similar selftest for this to
bpf-next shortly.

The fix is to _not_ zero the copied field so that the total sent
bytes is returned.

Reported-by: Steinar H. Gunderson <steinar+kernel@gunderson.no>
Reported-by: Andre Tomt <andre@tomt.net>
Tested-by: Andre Tomt <andre@tomt.net>
Fixes: d829e9c411 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 11:04:35 -07:00
Florian Fainelli
2131fba531 net: dsa: Deal with non-existing PHY/fixed-link
We need to specifically deal with phylink_of_phy_connect() returning
-ENODEV, because this can happen when a CPU/DSA port does connect
neither to a PHY, nor has a fixed-link property. This is a valid use
case that is permitted by the binding and indicates to the switch:
auto-configure port with maximum capabilities.

Fixes: 0e27921816 ("net: dsa: Use PHYLINK for the CPU/DSA ports")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 11:01:20 -07:00
Stephen Suryaputra
e1ae5c2ea4 vrf: Increment Icmp6InMsgs on the original netdev
Get the ingress interface and increment ICMP counters based on that
instead of skb->dev when the the dev is a VRF device.

This is a follow up on the following message:
https://www.spinics.net/lists/netdev/msg560268.html

v2: Avoid changing skb->dev since it has unintended effect for local
    delivery (David Ahern).
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 11:00:11 -07:00
Maxime Chevallier
f0d2ca1531 net: ethtool: Allow matching on vlan DEI bit
Using ethtool, users can specify a classification action matching on the
full vlan tag, which includes the DEI bit (also previously called CFI).

However, when converting the ethool_flow_spec to a flow_rule, we use
dissector keys to represent the matching patterns.

Since the vlan dissector key doesn't include the DEI bit, this
information was silently discarded when translating the ethtool
flow spec in to a flow_rule.

This commit adds the DEI bit into the vlan dissector key, and allows
propagating the information to the driver when parsing the ethtool flow
spec.

Fixes: eca4205f9e ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator")
Reported-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 10:09:56 -07:00
Matteo Croce
ec66854c83 mpls: fix af_mpls dependencies for real
Randy reported that selecting MPLS_ROUTING without PROC_FS breaks
the build, because since commit c1a9d65954 ("mpls: fix af_mpls
dependencies"), MPLS_ROUTING selects PROC_SYSCTL, but Kconfig's select
doesn't recursively handle dependencies.
Change the select into a dependency.

Fixes: c1a9d65954 ("mpls: fix af_mpls dependencies")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 09:42:34 -07:00
Vlad Buslov
7a096d579e net: sched: ingress: set 'unlocked' flag for Qdisc ops
To remove rtnl lock dependency in tc filter update API when using ingress
Qdisc, set QDISC_CLASS_OPS_DOIT_UNLOCKED flag in ingress Qdisc_class_ops.

Ingress Qdisc ops don't require any modifications to be used without rtnl
lock on tc filter update path. Ingress implementation never changes its
q->block and only releases it when Qdisc is being destroyed. This means it
is enough for RTM_{NEWTFILTER|DELTFILTER|GETTFILTER} message handlers to
hold ingress Qdisc reference while using it without relying on rtnl lock
protection. Unlocked Qdisc ops support is already implemented in filter
update path by unlocked cls API patch set.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 09:28:05 -07:00
Ilya Maximets
01d76b5317 xdp: check device pointer before clearing
We should not call 'ndo_bpf()' or 'dev_put()' with NULL argument.

Fixes: c9b47cc1fa ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-12 16:41:47 +02:00
Martin KaFai Lau
f12dd75959 bpf: net: Set sk_bpf_storage back to NULL for cloned sk
The cloned sk should not carry its parent-listener's sk_bpf_storage.
This patch fixes it by setting it back to NULL.

Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-12 16:38:20 +02:00
Florian Westphal
1be451d993 xfrm: fix bogus WARN_ON with ipv6
net/xfrm/xfrm_input.c:378:17: warning: this statement may fall through [-Wimplicit-fallthrough=]
skb->protocol = htons(ETH_P_IPV6);

... the fallthrough then causes a bogus WARN_ON().

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 4c203b0454 ("xfrm: remove eth_proto value from xfrm_state_afinfo")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-12 11:01:56 +02:00
Jakub Kicinski
5018007409 net/tls: add kernel-driven resync mechanism for TX
TLS offload drivers keep track of TCP seq numbers to make sure
the packets are fed into the HW in order.

When packets get dropped on the way through the stack, the driver
will get out of sync and have to use fallback encryption, but unless
TCP seq number is resynced it will never match the packets correctly
(or even worse - use incorrect record sequence number after TCP seq
wraps).

Existing drivers (mlx5) feed the entire record on every out-of-order
event, allowing FW/HW to always be in sync.

This patch adds an alternative, more akin to the RX resync.  When
driver sees a frame which is past its expected sequence number the
stream must have gotten out of order (if the sequence number is
smaller than expected its likely a retransmission which doesn't
require resync).  Driver will ask the stack to perform TX sync
before it submits the next full record, and fall back to software
crypto until stack has performed the sync.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:27 -07:00
Jakub Kicinski
eeb2efaf36 net/tls: generalize the resync callback
Currently only RX direction is ever resynced, however, TX may
also get out of sequence if packets get dropped on the way to
the driver.  Rename the resync callback and add a direction
parameter.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:27 -07:00
Jakub Kicinski
f953d33ba1 net/tls: add kernel-driven TLS RX resync
TLS offload device may lose sync with the TCP stream if packets
arrive out of order.  Drivers can currently request a resync at
a specific TCP sequence number.  When a record is found starting
at that sequence number kernel will inform the device of the
corresponding record number.

This requires the device to constantly scan the stream for a
known pattern (constant bytes of the header) after sync is lost.

This patch adds an alternative approach which is entirely under
the control of the kernel.  Kernel tracks records it had to fully
decrypt, even though TLS socket is in TLS_HW mode.  If multiple
records did not have any decrypted parts - it's a pretty strong
indication that the device is out of sync.

We choose the min number of fully encrypted records to be 2,
which should hopefully be more than will get retransmitted at
a time.

After kernel decides the device is out of sync it schedules a
resync request.  If the TCP socket is empty the resync gets
performed immediately.  If socket is not empty we leave the
record parser to resync when next record comes.

Before resync in message parser we peek at the TCP socket and
don't attempt the sync if the socket already has some of the
next record queued.

On resync failure (encrypted data continues to flow in) we
retry with exponential backoff, up to once every 128 records
(with a 16k record thats at most once every 2M of data).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
Jakub Kicinski
fe58a5a02c net/tls: rename handle_device_resync()
handle_device_resync() doesn't describe the function very well.
The function checks if resync should be issued upon parsing of
a new record.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
Jakub Kicinski
89fec474fa net/tls: pass record number as a byte array
TLS offload code casts record number to a u64.  The buffer
should be aligned to 8 bytes, but its actually a __be64, and
the rest of the TLS code treats it as big int.  Make the
offload callbacks take a byte array, drivers can make the
choice to do the ugly cast if they want to.

Prepare for copying the record number onto the stack by
defining a constant for max size of the byte array.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
Jakub Kicinski
4967373959 net/tls: simplify seq calculation in handle_device_resync()
We subtract "TLS_HEADER_SIZE - 1" from req_seq, then if they
match we add the same constant to seq.  Just add it to seq,
and we don't have to touch req_seq.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
Mao Wenan
46088059aa packet: remove unused variable 'status' in __packet_lookup_frame_in_block
The variable 'status' in  __packet_lookup_frame_in_block() is never used since
introduction in commit f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer
implementation."), we can remove it.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:11:11 -07:00
Taehee Yoo
f7a8fb1ff1 net: openvswitch: remove unnecessary ASSERT_OVSL in ovs_vport_del()
ASSERT_OVSL() in ovs_vport_del() is unnecessary because
ovs_vport_del() is only called by ovs_dp_detach_port() and
ovs_dp_detach_port() calls ASSERT_OVSL() too.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 11:56:33 -07:00
Taehee Yoo
abf9979f7e net: netlink: make netlink_walk_start() void return type
netlink_walk_start() needed to return an error code because of
rhashtable_walk_init(). but that was converted to rhashtable_walk_enter()
and it is a void type function. so now netlink_walk_start() doesn't need
any return value.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 11:55:48 -07:00
Taehee Yoo
309b66970e net: openvswitch: do not free vport if register_netdevice() is failed.
In order to create an internal vport, internal_dev_create() is used and
that calls register_netdevice() internally.
If register_netdevice() fails, it calls dev->priv_destructor() to free
private data of netdev. actually, a private data of this is a vport.

Hence internal_dev_create() should not free and use a vport after failure
of register_netdevice().

Test command
    ovs-dpctl add-dp bonding_masters

Splat looks like:
[ 1035.667767] kasan: GPF could be caused by NULL-ptr deref or user memory access
[ 1035.675958] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 1035.676916] CPU: 1 PID: 1028 Comm: ovs-vswitchd Tainted: G    B             5.2.0-rc3+ #240
[ 1035.676916] RIP: 0010:internal_dev_create+0x2e5/0x4e0 [openvswitch]
[ 1035.676916] Code: 48 c1 ea 03 80 3c 02 00 0f 85 9f 01 00 00 4c 8b 23 48 b8 00 00 00 00 00 fc ff df 49 8d bc 24 60 05 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 86 01 00 00 49 8b bc 24 60 05 00 00 e8 e4 68 f4
[ 1035.713720] RSP: 0018:ffff88810dcb7578 EFLAGS: 00010206
[ 1035.713720] RAX: dffffc0000000000 RBX: ffff88810d13fe08 RCX: ffffffff84297704
[ 1035.713720] RDX: 00000000000000ac RSI: 0000000000000000 RDI: 0000000000000560
[ 1035.713720] RBP: 00000000ffffffef R08: fffffbfff0d3b881 R09: fffffbfff0d3b881
[ 1035.713720] R10: 0000000000000001 R11: fffffbfff0d3b880 R12: 0000000000000000
[ 1035.768776] R13: 0000607ee460b900 R14: ffff88810dcb7690 R15: ffff88810dcb7698
[ 1035.777709] FS:  00007f02095fc980(0000) GS:ffff88811b400000(0000) knlGS:0000000000000000
[ 1035.777709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1035.777709] CR2: 00007ffdf01d2f28 CR3: 0000000108258000 CR4: 00000000001006e0
[ 1035.777709] Call Trace:
[ 1035.777709]  ovs_vport_add+0x267/0x4f0 [openvswitch]
[ 1035.777709]  new_vport+0x15/0x1e0 [openvswitch]
[ 1035.777709]  ovs_vport_cmd_new+0x567/0xd10 [openvswitch]
[ 1035.777709]  ? ovs_dp_cmd_dump+0x490/0x490 [openvswitch]
[ 1035.777709]  ? __kmalloc+0x131/0x2e0
[ 1035.777709]  ? genl_family_rcv_msg+0xa54/0x1030
[ 1035.777709]  genl_family_rcv_msg+0x63a/0x1030
[ 1035.777709]  ? genl_unregister_family+0x630/0x630
[ 1035.841681]  ? debug_show_all_locks+0x2d0/0x2d0
[ ... ]

Fixes: cf124db566 ("net: Fix inconsistent teardown and release of private netdev state.")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 11:54:01 -07:00
Willem de Bruijn
522924b583 net: correct udp zerocopy refcnt also when zerocopy only on append
The below patch fixes an incorrect zerocopy refcnt increment when
appending with MSG_MORE to an existing zerocopy udp skb.

  send(.., MSG_ZEROCOPY | MSG_MORE);	// refcnt 1
  send(.., MSG_ZEROCOPY | MSG_MORE);	// refcnt still 1 (bar frags)

But it missed that zerocopy need not be passed at the first send. The
right test whether the uarg is newly allocated and thus has extra
refcnt 1 is not !skb, but !skb_zcopy.

  send(.., MSG_MORE);			// <no uarg>
  send(.., MSG_ZEROCOPY);		// refcnt 1

Fixes: 100f6d8e09 ("net: correct zerocopy refcnt with udp MSG_MORE")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 11:40:54 -07:00
Jonathan Lemon
fada7fdc83 bpf: Allow bpf_map_lookup_elem() on an xskmap
Currently, the AF_XDP code uses a separate map in order to
determine if an xsk is bound to a queue.  Instead of doing this,
have bpf_map_lookup_elem() return a xdp_sock.

Rearrange some xdp_sock members to eliminate structure holes.

Remove selftest - will be added back in later patch.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-10 23:31:26 -07:00
David Ahern
7bf4796dd0 nexthops: add support for replace
Add support for atomically upating a nexthop config.

When updating a nexthop, walk the lists of associated fib entries and
verify the new config is valid. Replace is done by swapping nh_info
for single nexthops - new config is applied to old nexthop struct, and
old config is moved to new nexthop struct. For nexthop groups the same
applies but for nh_group. In addition for groups the nh_parent reference
needs to be updated. The old config is released by calling __remove_nexthop
on the 'new' nexthop which now has the old config. This is done to avoid
messing around with the list_heads that track which fib entries are
using the nexthop.

After the swap of config data, bump the sequence counters for FIB entries
to invalidate any dst entries and send notifications to userspace. The
notifications include the new nexthop spec as well as any fib entries
using the updated nexthop struct.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:57 -07:00
David Ahern
5b98324ebe ipv6: Allow routes to use nexthop objects
Add support for RTA_NH_ID attribute to allow a user to specify a
nexthop id to use with a route. fc_nh_id is added to fib6_config to
hold the value passed in the RTA_NH_ID attribute. If a nexthop id
is given, the gateway, device, encap and multipath attributes can
not be set.

Update ip6_route_del to check metric and protocol before nexthop
specs. If fc_nh_id is set, then it must match the id in the route
entry. Since IPv6 allows delete of a cached entry (an exception),
add ip6_del_cached_rt_nh to cycle through all of the fib6_nh in
a fib entry if it is using a nexthop.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:57 -07:00
David Ahern
6c48ea5fe6 ipv4: Optimization for fib_info lookup with nexthops
Be optimistic about re-using a fib_info when nexthop id is given and
the route does not use metrics. Avoids a memory allocation which in
most cases is expected to be freed anyways.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:57 -07:00
David Ahern
493ced1ac4 ipv4: Allow routes to use nexthop objects
Add support for RTA_NH_ID attribute to allow a user to specify a
nexthop id to use with a route. fc_nh_id is added to fib_config to
hold the value passed in the RTA_NH_ID attribute. If a nexthop id
is given, the gateway, device, encap and multipath attributes can
not be set.

Update fib_nh_match to check ids on a route delete.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
2d44234bf9 ipv6: Handle all fib6_nh in a nexthop in mtu updates
Use nexthop_for_each_fib6_nh to call fib6_nh_mtu_change for each
fib6_nh in a nexthop for rt6_mtu_change_route. For __ip6_rt_update_pmtu,
we need to find the nexthop that correlates to the device and gateway
in the rt6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
49d5b8ef20 ipv6: Handle all fib6_nh in a nexthop in rt6_do_redirect
Use nexthop_for_each_fib6_nh and fib6_nh_find_match to find the
fib6_nh in a nexthop that correlates to the device and gateway
in the rt6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
c55c89884d ipv6: Handle all fib6_nh in a nexthop in __ip6_route_redirect
Add a hook in __ip6_route_redirect to handle a nexthop struct in a
fib6_info. Use nexthop_for_each_fib6_nh and fib6_nh_redirect_match
to call ip6_redirect_nh_match for each fib6_nh looking for a match.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
e659ba31d8 ipv6: Handle all fib6_nh in a nexthop in exception handling
Add a hook in rt6_flush_exceptions, rt6_remove_exception_rt,
rt6_update_exception_stamp_rt, and rt6_age_exceptions to handle
nexthop struct in a fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
2c170e0753 ipv6: Handle all fib6_nh in a nexthop in fib6_info_uses_dev
Add a hook in fib6_info_uses_dev to handle nexthop struct in a fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
a1b7a1f06c ipv6: Handle all fib6_nh in a nexthop in rt6_nlmsg_size
Add a hook in rt6_nlmsg_size to handle nexthop struct in a fib6_info.
rt6_nh_nlmsg_size is used to sum the space needed for all nexthops in
the fib entry.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
17a5984eee ipv6: Handle all fib6_nh in a nexthop in __find_rr_leaf
Add a hook in __find_rr_leaf to handle nexthop struct in a fib6_info.
nexthop_for_each_fib6_nh is used to walk each fib6_nh in a nexthop and
call find_match. On a match, use the fib6_nh saved in the callback arg
to setup fib6_result.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
962b680383 ipv6: Handle all fib6_nh in a nexthop in rt6_device_match
Add a hook in rt6_device_match to handle nexthop struct in a fib6_info.
The new rt6_nh_dev_match uses nexthop_for_each_fib6_nh to walk each
fib6_nh in a nexthop and call __rt6_device_match. On match,
rt6_nh_dev_match returns the fib6_nh and rt6_device_match uses it to
setup fib6_result.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
2ab75bfb17 ipv6: Handle all fib6_nh in a nexthop in fib6_drop_pcpu_from
Use nexthop_for_each_fib6_nh to walk all fib6_nh in a nexthop when
dropping 'from' reference in pcpu routes.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
f88c9aa12f nexthops: Add ipv6 helper to walk all fib6_nh in a nexthop struct
IPv6 has traditionally had a single fib6_nh per fib6_info. With
nexthops we can have multiple fib6_nh associated with a fib6_info.
Add a nexthop helper to invoke a callback for each fib6_nh in a
'struct nexthop'. If the callback returns non-0, the loop is
stopped and the return value passed to the caller.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
YueHaibing
948622f984 tcp: Make tcp_fastopen_alloc_ctx static
Fix sparse warning:

net/ipv4/tcp_fastopen.c:75:29: warning:
 symbol 'tcp_fastopen_alloc_ctx' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:38:50 -07:00
Jozsef Kadlecsik
fe03d47456 Update my email address
It's better to use my kadlec@netfilter.org email address in
the source code. I might not be able to use
kadlec@blackhole.kfki.hu in the future.

Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 13:00:24 +02:00
Stefano Brivio
11921796f4 ipset: Fix memory accounting for hash types on resize
If a fresh array block is allocated during resize, the current in-memory
set size should be increased by the size of the block, not replaced by it.

Before the fix, adding entries to a hash set type, leading to a table
resize, caused an inconsistent memory size to be reported. This becomes
more obvious when swapping sets with similar sizes:

  # cat hash_ip_size.sh
  #!/bin/sh
  FAIL_RETRIES=10

  tries=0
  while [ ${tries} -lt ${FAIL_RETRIES} ]; do
  	ipset create t1 hash:ip
  	for i in `seq 1 4345`; do
  		ipset add t1 1.2.$((i / 255)).$((i % 255))
  	done
  	t1_init="$(ipset list t1|sed -n 's/Size in memory: \(.*\)/\1/p')"

  	ipset create t2 hash:ip
  	for i in `seq 1 4360`; do
  		ipset add t2 1.2.$((i / 255)).$((i % 255))
  	done
  	t2_init="$(ipset list t2|sed -n 's/Size in memory: \(.*\)/\1/p')"

  	ipset swap t1 t2
  	t1_swap="$(ipset list t1|sed -n 's/Size in memory: \(.*\)/\1/p')"
  	t2_swap="$(ipset list t2|sed -n 's/Size in memory: \(.*\)/\1/p')"

  	ipset destroy t1
  	ipset destroy t2
  	tries=$((tries + 1))

  	if [ ${t1_init} -lt 10000 ] || [ ${t2_init} -lt 10000 ]; then
  		echo "FAIL after ${tries} tries:"
  		echo "T1 size ${t1_init}, after swap ${t1_swap}"
  		echo "T2 size ${t2_init}, after swap ${t2_swap}"
  		exit 1
  	fi
  done
  echo "PASS"
  # echo -n 'func hash_ip4_resize +p' > /sys/kernel/debug/dynamic_debug/control
  # ./hash_ip_size.sh
  [ 2035.018673] attempt to resize set t1 from 10 to 11, t 00000000fe6551fa
  [ 2035.078583] set t1 resized from 10 (00000000fe6551fa) to 11 (00000000172a0163)
  [ 2035.080353] Table destroy by resize 00000000fe6551fa
  FAIL after 4 tries:
  T1 size 9064, after swap 71128
  T2 size 71128, after swap 9064

Reported-by: NOYB <JunkYardMail1@Frontier.com>
Fixes: 9e41f26a50 ("netfilter: ipset: Count non-static extension memory for userspace")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:59:23 +02:00
Jozsef Kadlecsik
b1732e1638 netfilter: ipset: Fix error path in set_target_v3_checkentry()
Fix error path and release the references properly.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:58:26 +02:00
Jozsef Kadlecsik
13c6ba1f85 netfilter: ipset: Fix the last missing check of nla_parse_deprecated()
In dump_init() the outdated comment was incorrect and we had a missing
validation check of nla_parse_deprecated().

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:50:00 +02:00
Aditya Pakki
f4f5748bfe netfilter: ipset: fix a missing check of nla_parse
When nla_parse fails, we should not use the results (the first
argument). The fix checks if it fails, and if so, returns its error code
upstream.

Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:47:37 +02:00
Florent Fourcot
f0cb839084 netfilter: ipset: merge uadd and udel functions
Both functions are using exactly the same code, except the command value
passed to call_ad function.

Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:42:56 +02:00
Florent Fourcot
24c509b2e2 netfilter: ipset: remove useless memset() calls
One of the memset call is buggy: it does not erase full array, but only pointer size.
Moreover, after a check, first step of nla_parse_nested/nla_parse is to
erase tb array as well. We can remove both calls safely.

Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-10 12:29:31 +02:00
Eric Dumazet
c67b85558f ipv6: tcp: send consistent autoflowlabel in TIME_WAIT state
In case autoflowlabel is in action, skb_get_hash_flowi6()
derives a non zero skb->hash to the flowlabel.

If skb->hash is zero, a flow dissection is performed.

Since all TCP skbs sent from ESTABLISH state inherit their
skb->hash from sk->sk_txhash, we better keep a copy
of sk->sk_txhash into the TIME_WAIT socket.

After this patch, ACK or RST packets sent on behalf of
a TIME_WAIT socket have the flowlabel that was previously
used by the flow.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 20:10:19 -07:00
Yuchung Cheng
fcc2202a9d tcp: fix undo spurious SYNACK in passive Fast Open
Commit 794200d662 ("tcp: undo cwnd on Fast Open spurious SYNACK
retransmit") may cause tcp_fastretrans_alert() to warn about pending
retransmission in Open state. This is triggered when the Fast Open
server both sends data and has spurious SYNACK retransmission during
the handshake, and the data packets were lost or reordered.

The root cause is a bit complicated:

(1) Upon receiving SYN-data: a full socket is created with
    snd_una = ISN + 1 by tcp_create_openreq_child()

(2) On SYNACK timeout the server/sender enters CA_Loss state.

(3) Upon receiving the final ACK to complete the handshake, sender
    does not mark FLAG_SND_UNA_ADVANCED since (1)

    Sender then calls tcp_process_loss since state is CA_loss by (2)

(4) tcp_process_loss() does not invoke undo operations but instead
    mark REXMIT_LOST to force retransmission

(5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It
    changes state to CA_Open but has positive tp->retrans_out

(6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert()

The step that goes wrong is (4) where the undo operation should
have been invoked because the ACK successfully acknowledged the
SYN sequence. This fixes that by specifically checking undo
when the SYN-ACK sequence is acknowledged. Then after
tcp_process_loss() the state would be further adjusted based
in tcp_fastretrans_alert() to avoid triggering the warning in (6).

Fixes: 794200d662 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 20:04:11 -07:00
Matteo Croce
c1a9d65954 mpls: fix af_mpls dependencies
MPLS routing code relies on sysctl to work, so let it select PROC_SYSCTL.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:57:24 -07:00
Hariprasad Kelam
de47c5d8e1 af_key: make use of BUG_ON macro
fix below warnings reported by coccicheck

net/key/af_key.c:932:2-5: WARNING: Use BUG_ON instead of if condition
followed by BUG.
net/key/af_key.c:948:2-5: WARNING: Use BUG_ON instead of if condition
followed by BUG.

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:56:42 -07:00
David S. Miller
62f42a114b linux-can-fixes-for-5.2-20190607
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEmvEkXzgOfc881GuFWsYho5HknSAFAlz60fwTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRBaxiGjkeSdILI8B/41j6UxwMSYMetM/Vw2AyqPt0z671Gu
 mAVzDK3PZF5WIzD5JsDlUhwN1dqvfvAZeSS1tQwQ18ZpQLkRSxlAKhkLLlxBVKpJ
 0uDwYuFWwXF8DbdlRwZq+Db0GGAXW5+8WtA4gp9GTiIP6kTgmqnnaZvWPnQzQmKJ
 RCY8YAzF52jXa4hBfLsXiG7NoO6cZHQJMFKLsQEpHV+GMqTcAYJHzBP6YVVXC8PG
 495TREmTq5lvRKc2U3QS6//yPWleFYr5ViW/psEcBPCGdruyJI8LjDRaqnGWRBJX
 vPyEVSFHtQZu49Vue3g4jbjhVr8OWTV/MsFgE/zK5Ahe+2G5969pID3n
 =iCgZ
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-5.2-20190607' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2019-06-07

this is a pull reqeust of 9 patches for net/master.

The first patch is by Alexander Dahl and removes a duplicate menu entry from
the Kconfig. The next patch by Joakim Zhang fixes the timeout in the flexcan
driver when setting small bit rates. Anssi Hannula's patch for the xilinx_can
driver fixes the bittiming_const for CAN FD core. The two patches by Sean
Nyekjaer bring mcp25625 to the existing mcp251x driver. The patch by Eugen
Hristev implements an errata for the m_can driver. YueHaibing's patch fixes the
error handling ing can_init(). The patch by Fabio Estevam for the flexcan
driver removes an unneeded registration message during flexcan_probe(). And the
last patch is by Willem de Bruijn and adds the missing purging the  socket
error queue on sock destruct.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:44:01 -07:00
Eric Dumazet
392096736a ipv6: tcp: fix potential NULL deref in tcp_v6_send_reset()
syzbot found a crash in tcp_v6_send_reset() caused by my latest
change.

Problem is that if an skb has been queued to socket prequeue,
skb_dst(skb)->dev can not anymore point to the device.

Fortunately in this case the socket pointer is not NULL.

A similar issue has been fixed in commit 0f85feae6b ("tcp: fix
more NULL deref after prequeue changes"), I should have known better.

Fixes: 323a53c412 ("ipv6: tcp: enable flowlabel reflection in some RST packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:42:23 -07:00
Sebastian Andrzej Siewior
6dcdd884e2 net: hwbm: Make the hwbm_pool lock a mutex
Based on review, `lock' is only acquired in hwbm_pool_add() which is
invoked via ->probe(), ->resume() and ->ndo_change_mtu(). Based on this
the lock can become a mutex and there is no need to disable interrupts
during the procedure.
Now that the lock is a mutex, hwbm_pool_add() no longer invokes
hwbm_pool_refill() in an atomic context so we can pass GFP_KERNEL to
hwbm_pool_refill() and remove the `gfp' argument from hwbm_pool_add().

Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:40:10 -07:00
Sebastian Andrzej Siewior
92dcabd7a0 net: Don't disable interrupts in __netdev_alloc_skb()
__netdev_alloc_skb() can be used from any context and is used by NAPI
and non-NAPI drivers. Non-NAPI drivers use it in interrupt context and
NAPI drivers use it during initial allocation (->ndo_open() or
->ndo_change_mtu()). Some NAPI drivers share the same function for the
initial allocation and the allocation in their NAPI callback.

The interrupts are disabled in order to ensure locked access from every
context to `netdev_alloc_cache'.

Let __netdev_alloc_skb() check if interrupts are disabled. If they are, use
`netdev_alloc_cache'. Otherwise disable BH and use `napi_alloc_cache.page'.
The IRQ check is cheaper compared to disabling & enabling interrupts and
memory allocation with disabled interrupts does not work on -RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:40:09 -07:00
Sebastian Andrzej Siewior
7ba7aeabba net: Don't disable interrupts in napi_alloc_frag()
netdev_alloc_frag() can be used from any context and is used by NAPI
and non-NAPI drivers. Non-NAPI drivers use it in interrupt context
and NAPI drivers use it during initial allocation (->ndo_open() or
->ndo_change_mtu()). Some NAPI drivers share the same function for the
initial allocation and the allocation in their NAPI callback.

The interrupts are disabled in order to ensure locked access from every
context to `netdev_alloc_cache'.

Let netdev_alloc_frag() check if interrupts are disabled. If they are,
use `netdev_alloc_cache' otherwise disable BH and invoke
__napi_alloc_frag() for the allocation. The IRQ check is cheaper
compared to disabling & enabling interrupts and memory allocation with
disabled interrupts does not work on -RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:40:09 -07:00
George Wilkie
2f3f7d1fa0 mpls: fix warning with multi-label encap
If you configure a route with multiple labels, e.g.
  ip route add 10.10.3.0/24 encap mpls 16/100 via 10.10.2.2 dev ens4
A warning is logged:
  kernel: [  130.561819] netlink: 'ip': attribute type 1 has an invalid
  length.

This happens because mpls_iptunnel_policy has set the type of
MPLS_IPTUNNEL_DST to fixed size NLA_U32.
Change it to a minimum size.
nla_get_labels() does the remaining validation.

Fixes: e3e4712ec0 ("mpls: ip tunnel support")
Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 13:26:34 -07:00
Eric Dumazet
65a3c497c0 ipv6: flowlabel: fl6_sock_lookup() must use atomic_inc_not_zero
Before taking a refcount, make sure the object is not already
scheduled for deletion.

Same fix is needed in ipv6_flowlabel_opt()

Fixes: 18367681a1 ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 13:07:14 -07:00
Enrico Weigelt
c3fee640bc net: ipv4: fib_semantics: fix uninitialized variable
fix an uninitialized variable:

  CC      net/ipv4/fib_semantics.o
net/ipv4/fib_semantics.c: In function 'fib_check_nh_v4_gw':
net/ipv4/fib_semantics.c:1027:12: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized]
   if (!tbl || err) {
            ^~

Signed-off-by: Enrico Weigelt <info@metux.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 12:47:30 -07:00
Vladimir Oltean
f3097be21b net: dsa: sja1105: Add a state machine for RX timestamping
Meta frame reception relies on the hardware keeping its promise that it
will send no other traffic towards the CPU port between a link-local
frame and a meta frame.  Otherwise there is no other way to associate
the meta frame with the link-local frame it's holding a timestamp of.
The receive function is made stateful, and buffers a timestampable frame
until its meta frame arrives, then merges the two, drops the meta and
releases the link-local frame up the stack.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:40 -07:00
Vladimir Oltean
e53e18a6fe net: dsa: sja1105: Receive and decode meta frames
This adds support in the tagger for understanding the source port and
switch id of meta frames.  Their timestamp is also extracted but not
used yet - this needs to be done in a state machine that modifies the
previously received timestampable frame - will be added in a follow-up
patch.

Also take the opportunity to:
- Remove a comment in sja1105_filter made obsolete by e8d67fa569
  ("net: dsa: sja1105: Don't store frame type in skb->cb")
- Reorder the checks in sja1105_filter to optimize for the most likely
  scenario first: regular traffic.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:40 -07:00
Vladimir Oltean
79fa706139 net: dsa: sja1105: Make sja1105_is_link_local not match meta frames
Although meta frames are configured to be sent at SJA1105_META_DMAC
(01-80-C2-00-00-0E) which is a multicast MAC address that would also be
trapped by the switch to the CPU, were it to receive it on a front-panel
port, meta frames are conceptually not link-local frames, they only
carry their RX timestamps.

The choice of sending meta frames at a multicast DMAC is a pragmatic
one, to avoid installing an extra entry to the DSA master port's
multicast MAC filter.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:40 -07:00
Vladimir Oltean
d3f9b90bf1 net: dsa: sja1105: Build a minimal understanding of meta frames
Meta frames are sent on the CPU port by the switch if RX timestamping is
enabled. They contain a partial timestamp of the previous frame.

They are Ethernet frames with the Ethernet header constructed out of:

- SJA1105_META_DMAC
- SJA1105_META_SMAC
- ETH_P_SJA1105_META

The Ethernet payload will be decoded in a follow-up patch.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:40 -07:00
Vladimir Oltean
42824463d3 net: dsa: sja1105: Limit use of incl_srcpt to bridge+vlan mode
The incl_srcpt setting makes the switch mangle the destination MACs of
multicast frames trapped to the CPU - a primitive tagging mechanism that
works even when we cannot use the 802.1Q software features.

The downside is that the two multicast MAC addresses that the switch
traps for L2 PTP (01-80-C2-00-00-0E and 01-1B-19-00-00-00) quickly turn
into a lot more, as the switch encodes the source port and switch id
into bytes 3 and 4 of the MAC. The resulting range of MAC addresses
would need to be installed manually into the DSA master port's multicast
MAC filter, and even then, most devices might not have a large enough
MAC filtering table.

As a result, only limit use of incl_srcpt to when it's strictly
necessary: when under a VLAN filtering bridge.  This fixes PTP in
non-bridged mode (standalone ports). Otherwise, PTP frames, as well as
metadata follow-up frames holding RX timestamps won't be received
because they will be blocked by the master port's MAC filter.
Linuxptp doesn't help, because it only requests the addition of the
unmodified PTP MACs to the multicast filter.
This issue is not seen in bridged mode because the master port is put in
promiscuous mode when the slave ports are enslaved to a bridge.
Therefore, there is no downside to having the incl_srcpt mechanism
active there.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:40 -07:00
Vladimir Oltean
d461933638 net: dsa: tag_8021q: Create helper function for removing VLAN header
This removes the existing implementation from tag_sja1105, which was
partially incorrect (it was not changing the MAC header offset, thereby
leaving it to point 4 bytes earlier than it should have).

This overwrites the VLAN tag by moving the Ethernet source and
destination MACs 4 bytes to the right. Then skb->data (assumed to be
pointing immediately after the EtherType) is temporarily pushed to the
beginning of the new Ethernet header, the new Ethernet header offset and
length are recorded, then skb->data is moved back to where it was.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:39 -07:00
Vladimir Oltean
5e3f847a02 net: dsa: Add teardown callback for drivers
This is helpful for e.g. draining per-driver (not per-port) tagger
queues.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:39 -07:00
Vladimir Oltean
146d442c23 net: dsa: Keep a pointer to the skb clone for TX timestamping
For drivers that use deferred_xmit for PTP frames (such as sja1105),
there is no need to perform matching between PTP frames and their egress
timestamps, since the sending process can be serialized.

In that case, it makes sense to have the pointer to the skb clone that
DSA made directly in the skb->cb. It will be used for pushing the egress
timestamp back in the application socket's error queue.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:39 -07:00
Linus Torvalds
9331b6740f SPDX update for 5.2-rc4
Another round of SPDX header file fixes for 5.2-rc4
 
 These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being
 added, based on the text in the files.  We are slowly chipping away at
 the 700+ different ways people tried to write the license text.  All of
 these were reviewed on the spdx mailing list by a number of different
 people.
 
 We now have over 60% of the kernel files covered with SPDX tags:
 	$ ./scripts/spdxcheck.py -v 2>&1 | grep Files
 	Files checked:            64533
 	Files with SPDX:          40392
 	Files with errors:            0
 
 I think the majority of the "easy" fixups are now done, it's now the
 start of the longer-tail of crazy variants to wade through.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXPuGTg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykBvQCg2SG+HmDH+tlwKLT/q7jZcLMPQigAoMpt9Uuy
 sxVEiFZo8ZU9v1IoRb1I
 =qU++
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull yet more SPDX updates from Greg KH:
 "Another round of SPDX header file fixes for 5.2-rc4

  These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being
  added, based on the text in the files. We are slowly chipping away at
  the 700+ different ways people tried to write the license text. All of
  these were reviewed on the spdx mailing list by a number of different
  people.

  We now have over 60% of the kernel files covered with SPDX tags:
	$ ./scripts/spdxcheck.py -v 2>&1 | grep Files
	Files checked:            64533
	Files with SPDX:          40392
	Files with errors:            0

  I think the majority of the "easy" fixups are now done, it's now the
  start of the longer-tail of crazy variants to wade through"

* tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (159 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 450
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 449
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 448
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 446
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 445
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 444
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 443
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 442
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 440
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 438
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 437
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 435
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 434
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 433
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 431
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 429
  ...
2019-06-08 12:52:42 -07:00
David S. Miller
38e406f600 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-06-07

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix several bugs in riscv64 JIT code emission which forgot to clear high
   32-bits for alu32 ops, from Björn and Luke with selftests covering all
   relevant BPF alu ops from Björn and Jiong.

2) Two fixes for UDP BPF reuseport that avoid calling the program in case of
   __udp6_lib_err and UDP GRO which broke reuseport_select_sock() assumption
   that skb->data is pointing to transport header, from Martin.

3) Two fixes for BPF sockmap: a use-after-free from sleep in psock's backlog
   workqueue, and a missing restore of sk_write_space when psock gets dropped,
   from Jakub and John.

4) Fix unconnected UDP sendmsg hook API which is insufficient as-is since it
   breaks standard applications like DNS if reverse NAT is not performed upon
   receive, from Daniel.

5) Fix an out-of-bounds read in __bpf_skc_lookup which in case of AF_INET6
   fails to verify that the length of the tuple is long enough, from Lorenz.

6) Fix libbpf's libbpf__probe_raw_btf to return an fd instead of 0/1 (for
   {un,}successful probe) as that is expected to be propagated as an fd to
   load_sk_storage_btf() and thus closing the wrong descriptor otherwise,
   from Michal.

7) Fix bpftool's JSON output for the case when a lookup fails, from Krzesimir.

8) Minor misc fixes in docs, samples and selftests, from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07 14:46:47 -07:00
Willem de Bruijn
fd704bd5ee can: purge socket error queue on sock destruct
CAN supports software tx timestamps as of the below commit. Purge
any queued timestamp packets on socket destroy.

Fixes: 51f31cabe3 ("ip: support for TX timestamps on UDP and RAW sockets")
Reported-by: syzbot+a90604060cb40f5bdd16@syzkaller.appspotmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07 23:03:54 +02:00
YueHaibing
c5a3aed1cd can: af_can: Fix error path of can_init()
This patch add error path for can_init() to avoid possible crash if some
error occurs.

Fixes: 0d66548a10 ("[CAN]: Add PF_CAN core module")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-06-07 23:03:54 +02:00
David S. Miller
a6cdeeb16b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07 11:00:14 -07:00
Linus Torvalds
1e1d926369 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.

 2) Read SFP eeprom in max 16 byte increments to avoid problems with
    some SFP modules, from Russell King.

 3) Fix UDP socket lookup wrt. VRF, from Tim Beale.

 4) Handle route invalidation properly in s390 qeth driver, from Julian
    Wiedmann.

 5) Memory leak on unload in RDS, from Zhu Yanjun.

 6) sctp_process_init leak, from Neil HOrman.

 7) Fix fib_rules rule insertion semantic change that broke Android,
    from Hangbin Liu.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
  pktgen: do not sleep with the thread lock held.
  net: mvpp2: Use strscpy to handle stat strings
  net: rds: fix memory leak in rds_ib_flush_mr_pool
  ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
  ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
  Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
  net: aquantia: fix wol configuration not applied sometimes
  ethtool: fix potential userspace buffer overflow
  Fix memory leak in sctp_process_init
  net: rds: fix memory leak when unload rds_rdma
  ipv6: fix the check before getting the cookie in rt6_get_cookie
  ipv4: not do cache for local delivery if bc_forwarding is enabled
  s390/qeth: handle error when updating TX queue count
  s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
  s390/qeth: check dst entry before use
  s390/qeth: handle limited IPv4 broadcast in L3 TX path
  net: fix indirect calls helpers for ptype list hooks.
  net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
  udp: only choose unbound UDP socket for multicast when not in a VRF
  net/tls: replace the sleeping lock around RX resync with a bit lock
  ...
2019-06-07 09:29:14 -07:00
Guillaume Nault
8a3dca6325 netfilter: ipv6: nf_defrag: accept duplicate fragments again
When fixing the skb leak introduced by the conversion to rbtree, I
forgot about the special case of duplicate fragments. The condition
under the 'insert_error' label isn't effective anymore as
nf_ct_frg6_gather() doesn't override the returned value anymore. So
duplicate fragments now get NF_DROP verdict.

To accept duplicate fragments again, handle them specially as soon as
inet_frag_queue_insert() reports them. Return -EINPROGRESS which will
translate to NF_STOLEN verdict, like any accepted fragment. However,
such packets don't carry any new information and aren't queued, so we
just drop them immediately.

Fixes: a0d56cb911 ("netfilter: ipv6: nf_defrag: fix leakage of unqueued fragments")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-07 14:49:01 +02:00
Roman Gushchin
4ecabd55c9 bpf: allow CGROUP_SKB programs to use bpf_skb_cgroup_id() helper
Currently bpf_skb_cgroup_id() is not supported for CGROUP_SKB
programs. An attempt to load such a program generates an error
like this:

    libbpf:
    0: (b7) r6 = 0
    ...
    9: (85) call bpf_skb_cgroup_id#79
    unknown func bpf_skb_cgroup_id#79

There are no particular reasons for denying it, and we have some
use cases where it might be useful.

So let's add it to the list of allowed helpers.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-07 01:56:52 +02:00
Daniel Borkmann
983695fa67 bpf: fix unconnected udp hooks
Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
to applications as also stated in original motivation in 7828f20e37 ("Merge
branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
two hooks into Cilium to enable host based load-balancing with Kubernetes,
I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
typically sets up DNS as a service and is thus subject to load-balancing.

Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into with a simple
example:

  # cat /etc/resolv.conf
  nameserver 147.75.207.207
  nameserver 147.75.207.208

For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in msg_name
address checks:

  # nslookup 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]
  ;; connection timed out; no servers could be reached

  # dig 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached

For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:

  # nslookup 1.1.1.1 8.8.8.8
  1.1.1.1.in-addr.arpa	name = one.one.one.one.

In order to fix this and thus act more transparent to the application,
this needs reverse translation on recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. From BPF side, this basically tracks the
service tuple plus socket cookie in an LRU map where the reverse NAT can
then be retrieved via map value as one example. Side-note: the BPF cgroups
static key should be converted to a per-hook static key in future.

Same example after this fix:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

Lookups work fine now:

  # nslookup 1.1.1.1
  1.1.1.1.in-addr.arpa    name = one.one.one.one.

  Authoritative answers can be found from:

  # dig 1.1.1.1

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
  ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ;; QUESTION SECTION:
  ;1.1.1.1.                       IN      A

  ;; AUTHORITY SECTION:
  .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400

  ;; Query time: 17 msec
  ;; SERVER: 147.75.207.207#53(147.75.207.207)
  ;; WHEN: Tue May 21 12:59:38 UTC 2019
  ;; MSG SIZE  rcvd: 111

And from an actual packet level it shows that we're using the back end
server when talking via 147.75.207.20{7,8} front end:

  # tcpdump -i any udp
  [...]
  12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  [...]

In order to be flexible and to have same semantics as in sendmsg BPF
programs, we only allow return codes in [1,1] range. In the sendmsg case
the program is called if msg->msg_name is present which can be the case
in both, connected and unconnected UDP.

The former only relies on the sockaddr_in{,6} passed via connect(2) if
passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
way to call into the BPF program whenever a non-NULL msg->msg_name was
passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d2 ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06 16:53:12 -07:00
Dirk van der Merwe
b9727d7f95 net/tls: export TLS per skb encryption
While offloading TLS connections, drivers need to handle the case where
out of order packets need to be transmitted.

Other drivers obtain the entire TLS record for the specific skb to
provide as context to hardware for encryption. However, other designs
may also want to keep the hardware state intact and perform the
out of order encryption entirely on the host.

To achieve this, export the already existing software encryption
fallback path so drivers could access this.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 14:13:40 -07:00
Linus Torvalds
459aa077a2 NFS client fixes for Linux 5.2
Stable bugfixes:
 - SUNRPC: Fix regression in umount of a secure mount
 - SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
 - NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter
 - NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled
 
 Other bugfixes:
 - xprtrdma: Use struct_size() in kzalloc()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlz4KzoACgkQ18tUv7Cl
 QOt7OhAAvG+DVZ6V5+q4zvabKgoievlL56Ys4SaAp3+OlxC6VaiyQUDs/6U9C/xH
 dmVbGYWdFXjqJE1JPXxmu0jOdRiZcnhIq+hiHNOK0qZOBCnE5zzZ1r1tdNY0GHQ2
 JOkREqsXsaeUWuO0pCY7JOmzd5aU1XLhg1/8+9Z7gNwamMfwLkEqi7FGtXi+xsGz
 gQVxMJlHsV2F21IKdKS0TJrcqr2okya/MnOQRbbMC2RT/MYNxDrhAPBJ1Shcx3HB
 NlccAn4jhIL0bCPRvFPib6KrO01U0Ye/KECN8j2qHRT4QS2s0dsnnQ6f2tEs9mJ8
 cRTVh1uniF6ZuDxSr6KIIN3mKA9DX2SK83H16ahAaRBLM8dwF+4MIr6gDdtJvsVw
 nY0YDpnAaKFypuCPBV/jFu7fk97hul4ntymJGVeFdlqu/HtWs1Z1iM93DDVJbKr8
 a3AND6woOQ2asvySPo+X66PKt79gofga4C+ZDuMfJax8+K9imqIyforgLrAmd/yL
 sGAlLzenf6fmOB5C1bPTtrFFbs6XiHXMidDGwmm1kOZIDuN+O2TTwc24gyXx0IyJ
 OhmjDn2CKmzS2WVPhetRgurzkdTigJu4PebC421qWSFhxlf/NfghW+rpM9su/hwv
 /r9+bpdjZ8YD5FUJvxsX4NZLr+SWbNTzX/ARNdRFsGr0NLpG/50=
 =rrp/
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.2-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "These are mostly stable bugfixes found during testing, many during the
  recent NFS bake-a-thon.

  Stable bugfixes:
   - SUNRPC: Fix regression in umount of a secure mount
   - SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
   - NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter
   - NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled

  Other bugfixes:
   - xprtrdma: Use struct_size() in kzalloc()"

* tag 'nfs-for-5.2-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled
  NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter
  SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
  SUNRPC fix regression in umount of a secure mount
  xprtrdma: Use struct_size() in kzalloc()
2019-06-06 12:19:37 -07:00
Paolo Abeni
720f1de402 pktgen: do not sleep with the thread lock held.
Currently, the process issuing a "start" command on the pktgen procfs
interface, acquires the pktgen thread lock and never release it, until
all pktgen threads are completed. The above can blocks indefinitely any
other pktgen command and any (even unrelated) netdevice removal - as
the pktgen netdev notifier acquires the same lock.

The issue is demonstrated by the following script, reported by Matteo:

ip -b - <<'EOF'
	link add type dummy
	link add type veth
	link set dummy0 up
EOF
modprobe pktgen
echo reset >/proc/net/pktgen/pgctrl
{
	echo rem_device_all
	echo add_device dummy0
} >/proc/net/pktgen/kpktgend_0
echo count 0 >/proc/net/pktgen/dummy0
echo start >/proc/net/pktgen/pgctrl &
sleep 1
rmmod veth

Fix the above releasing the thread lock around the sleep call.

Additionally we must prevent racing with forcefull rmmod - as the
thread lock no more protects from them. Instead, acquire a self-reference
before waiting for any thread. As a side effect, running

rmmod pktgen

while some thread is running now fails with "module in use" error,
before this patch such command hanged indefinitely.

Note: the issue predates the commit reported in the fixes tag, but
this fix can't be applied before the mentioned commit.

v1 -> v2:
 - no need to check for thread existence after flipping the lock,
   pktgen threads are freed only at net exit time
 -

Fixes: 6146e6a43b ("[PKTGEN]: Removes thread_{un,}lock() macros.")
Reported-and-tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 11:31:35 -07:00
Colin Ian King
4daa95af7f ipv6: fix spelling mistake: "wtih" -> "with"
There is a spelling mistake in a NL_SET_ERR_MSG message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 10:37:28 -07:00
Zhu Yanjun
85cb928787 net: rds: fix memory leak in rds_ib_flush_mr_pool
When the following tests last for several hours, the problem will occur.

Server:
    rds-stress -r 1.1.1.16 -D 1M
Client:
    rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30

The following will occur.

"
Starting up....
tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu
%
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
"
>From vmcore, we can find that clean_list is NULL.

>From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker.
Then rds_ib_mr_pool_flush_worker calls
"
 rds_ib_flush_mr_pool(pool, 0, NULL);
"
Then in function
"
int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
                         int free_all, struct rds_ib_mr **ibmr_ret)
"
ibmr_ret is NULL.

In the source code,
"
...
list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
if (ibmr_ret)
        *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);

/* more than one entry in llist nodes */
if (clean_nodes->next)
        llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list);
...
"
When ibmr_ret is NULL, llist_entry is not executed. clean_nodes->next
instead of clean_nodes is added in clean_list.
So clean_nodes is discarded. It can not be used again.
The workqueue is executed periodically. So more and more clean_nodes are
discarded. Finally the clean_list is NULL.
Then this problem will occur.

Fixes: 1bc144b625 ("net, rds, Replace xlist in net/rds/xlist.h with llist")
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 10:32:16 -07:00
Olivier Matz
b9aa52c4cb ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
The following code returns EFAULT (Bad address):

  s = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
  setsockopt(s, SOL_IPV6, IPV6_HDRINCL, 1);
  sendto(ipv6_icmp6_packet, addr);   /* returns -1, errno = EFAULT */

The IPv4 equivalent code works. A workaround is to use IPPROTO_RAW
instead of IPPROTO_ICMPV6.

The failure happens because 2 bytes are eaten from the msghdr by
rawv6_probe_proto_opt() starting from commit 19e3c66b52 ("ipv6
equivalent of "ipv4: Avoid reading user iov twice after
raw_probe_proto_opt""), but at that time it was not a problem because
IPV6_HDRINCL was not yet introduced.

Only eat these 2 bytes if hdrincl == 0.

Fixes: 715f504b11 ("ipv6: add IPV6_HDRINCL option for raw sockets")
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 10:29:21 -07:00
Olivier Matz
59e3e4b526 ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
As it was done in commit 8f659a03a0 ("net: ipv4: fix for a race
condition in raw_sendmsg") and commit 20b50d7997 ("net: ipv4: emulate
READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()") for ipv4, copy the
value of inet->hdrincl in a local variable, to avoid introducing a race
condition in the next commit.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 10:29:21 -07:00
wenxu
16e6427c88 netfilter: ipv6: Fix undefined symbol nf_ct_frag6_gather
CONFIG_NETFILTER=m and CONFIG_NF_DEFRAG_IPV6 is not set

ERROR: "nf_ct_frag6_gather" [net/ipv6/ipv6.ko] undefined!

Fixes: c9bb6165a1 ("netfilter: nf_conntrack_bridge: fix CONFIG_IPV6=y")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-06 11:52:59 +02:00
Florian Westphal
4f518e802c xfrm: remove type and offload_type map from xfrm_state_afinfo
Only a handful of xfrm_types exist, no need to have 512 pointers for them.

Reduces size of afinfo struct from 4k to 120 bytes on 64bit platforms.

Also, the unregister function doesn't need to return an error, no single
caller does anything useful with it.

Just place a WARN_ON() where needed instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-06 08:34:50 +02:00
Florian Westphal
4c203b0454 xfrm: remove eth_proto value from xfrm_state_afinfo
xfrm_prepare_input needs to lookup the state afinfo backend again to fetch
the address family ethernet protocol value.

There are only two address families, so a switch statement is simpler.
While at it, use u8 for family and proto and remove the owner member --
its not used anywhere.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-06 08:34:50 +02:00
Florian Westphal
3aaf3915a3 xfrm: remove state and template sort indirections from xfrm_state_afinfo
No module dependency, placing this in xfrm_state.c avoids need for
an indirection.

This also removes the state spinlock -- I don't see why we would need
to hold it during sorting.

This in turn allows to remove the 'net' argument passed to
xfrm_tmpl_sort.  Last, remove the EXPORT_SYMBOL, there are no modular
callers.

For the CONFIG_IPV6=m case, vmlinux size increase is about 300 byte.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-06 08:34:50 +02:00
Eric Dumazet
50a8accf10 ipv6: tcp: send consistent flowlabel in TIME_WAIT state
After commit 1d13a96c74 ("ipv6: tcp: fix flowlabel value in ACK
messages"), we stored in tw_flowlabel the flowlabel, in the
case ACK packets needed to be sent on behalf of a TIME_WAIT socket.

We can use the same field so that RST packets sent from
TIME_WAIT state also use a consistent flowlabel.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 19:05:01 -07:00
Eric Dumazet
323a53c412 ipv6: tcp: enable flowlabel reflection in some RST packets
When RST packets are sent because no socket could be found,
it makes sense to use flowlabel_reflect sysctl to decide
if a reflection of the flowlabel is requested.

This extends commit 22b6722bfa ("ipv6: Add sysctl for per
namespace flow label reflection"), for some TCP RST packets.

In order to provide full control of this new feature,
flowlabel_reflect becomes a bitmask.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 19:05:01 -07:00
Zhiqiang Liu
8b5e07d7ee inet_connection_sock: remove unused parameter of reqsk_queue_unlink func
small cleanup: "struct request_sock_queue *queue" parameter of reqsk_queue_unlink
func is never used in the func, so we can remove it.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 18:48:49 -07:00
Hangbin Liu
4970b42d5c Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
This reverts commit e9919a24d3.

Nathan reported the new behaviour breaks Android, as Android just add
new rules and delete old ones.

If we return 0 without adding dup rules, Android will remove the new
added rules and causing system to soft-reboot.

Fixes: e9919a24d3 ("fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Yaro Slav <yaro330@gmail.com>
Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 17:54:46 -07:00
Vivien Didelot
0ee4e76937 ethtool: fix potential userspace buffer overflow
ethtool_get_regs() allocates a buffer of size ops->get_regs_len(),
and pass it to the kernel driver via ops->get_regs() for filling.

There is no restriction about what the kernel drivers can or cannot do
with the open ethtool_regs structure. They usually set regs->version
and ignore regs->len or set it to the same size as ops->get_regs_len().

But if userspace allocates a smaller buffer for the registers dump,
we would cause a userspace buffer overflow in the final copy_to_user()
call, which uses the regs.len value potentially reset by the driver.

To fix this, make this case obvious and store regs.len before calling
ops->get_regs(), to only copy as much data as requested by userspace,
up to the value returned by ops->get_regs_len().

While at it, remove the redundant check for non-null regbuf.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 17:15:27 -07:00
Neil Horman
0a8dd9f67c Fix memory leak in sctp_process_init
syzbot found the following leak in sctp_process_init
BUG: memory leak
unreferenced object 0xffff88810ef68400 (size 1024):
  comm "syz-executor273", pid 7046, jiffies 4294945598 (age 28.770s)
  hex dump (first 32 bytes):
    1d de 28 8d de 0b 1b e3 b5 c2 f9 68 fd 1a 97 25  ..(........h...%
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000a02cebbd>] kmemleak_alloc_recursive include/linux/kmemleak.h:55
[inline]
    [<00000000a02cebbd>] slab_post_alloc_hook mm/slab.h:439 [inline]
    [<00000000a02cebbd>] slab_alloc mm/slab.c:3326 [inline]
    [<00000000a02cebbd>] __do_kmalloc mm/slab.c:3658 [inline]
    [<00000000a02cebbd>] __kmalloc_track_caller+0x15d/0x2c0 mm/slab.c:3675
    [<000000009e6245e6>] kmemdup+0x27/0x60 mm/util.c:119
    [<00000000dfdc5d2d>] kmemdup include/linux/string.h:432 [inline]
    [<00000000dfdc5d2d>] sctp_process_init+0xa7e/0xc20
net/sctp/sm_make_chunk.c:2437
    [<00000000b58b62f8>] sctp_cmd_process_init net/sctp/sm_sideeffect.c:682
[inline]
    [<00000000b58b62f8>] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1384
[inline]
    [<00000000b58b62f8>] sctp_side_effects net/sctp/sm_sideeffect.c:1194
[inline]
    [<00000000b58b62f8>] sctp_do_sm+0xbdc/0x1d60 net/sctp/sm_sideeffect.c:1165
    [<0000000044e11f96>] sctp_assoc_bh_rcv+0x13c/0x200
net/sctp/associola.c:1074
    [<00000000ec43804d>] sctp_inq_push+0x7f/0xb0 net/sctp/inqueue.c:95
    [<00000000726aa954>] sctp_backlog_rcv+0x5e/0x2a0 net/sctp/input.c:354
    [<00000000d9e249a8>] sk_backlog_rcv include/net/sock.h:950 [inline]
    [<00000000d9e249a8>] __release_sock+0xab/0x110 net/core/sock.c:2418
    [<00000000acae44fa>] release_sock+0x37/0xd0 net/core/sock.c:2934
    [<00000000963cc9ae>] sctp_sendmsg+0x2c0/0x990 net/sctp/socket.c:2122
    [<00000000a7fc7565>] inet_sendmsg+0x64/0x120 net/ipv4/af_inet.c:802
    [<00000000b732cbd3>] sock_sendmsg_nosec net/socket.c:652 [inline]
    [<00000000b732cbd3>] sock_sendmsg+0x54/0x70 net/socket.c:671
    [<00000000274c57ab>] ___sys_sendmsg+0x393/0x3c0 net/socket.c:2292
    [<000000008252aedb>] __sys_sendmsg+0x80/0xf0 net/socket.c:2330
    [<00000000f7bf23d1>] __do_sys_sendmsg net/socket.c:2339 [inline]
    [<00000000f7bf23d1>] __se_sys_sendmsg net/socket.c:2337 [inline]
    [<00000000f7bf23d1>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2337
    [<00000000a8b4131f>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:3

The problem was that the peer.cookie value points to an skb allocated
area on the first pass through this function, at which point it is
overwritten with a heap allocated value, but in certain cases, where a
COOKIE_ECHO chunk is included in the packet, a second pass through
sctp_process_init is made, where the cookie value is re-allocated,
leaking the first allocation.

Fix is to always allocate the cookie value, and free it when we are done
using it.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com
CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 17:11:47 -07:00
Zhu Yanjun
b50e058746 net: rds: fix memory leak when unload rds_rdma
When KASAN is enabled, after several rds connections are
created, then "rmmod rds_rdma" is run. The following will
appear.

"
BUG rds_ib_incoming (Not tainted): Objects remaining
in rds_ib_incoming on __kmem_cache_shutdown()

Call Trace:
 dump_stack+0x71/0xab
 slab_err+0xad/0xd0
 __kmem_cache_shutdown+0x17d/0x370
 shutdown_cache+0x17/0x130
 kmem_cache_destroy+0x1df/0x210
 rds_ib_recv_exit+0x11/0x20 [rds_rdma]
 rds_ib_exit+0x7a/0x90 [rds_rdma]
 __x64_sys_delete_module+0x224/0x2c0
 ? __ia32_sys_delete_module+0x2c0/0x2c0
 do_syscall_64+0x73/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
"
This is rds connection memory leak. The root cause is:
When "rmmod rds_rdma" is run, rds_ib_remove_one will call
rds_ib_dev_shutdown to drop the rds connections.
rds_ib_dev_shutdown will call rds_conn_drop to drop rds
connections as below.
"
rds_conn_path_drop(&conn->c_path[0], false);
"
In the above, destroy is set to false.
void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy)
{
        atomic_set(&cp->cp_state, RDS_CONN_ERROR);

        rcu_read_lock();
        if (!destroy && rds_destroy_pending(cp->cp_conn)) {
                rcu_read_unlock();
                return;
        }
        queue_work(rds_wq, &cp->cp_down_w);
        rcu_read_unlock();
}
In the above function, destroy is set to false. rds_destroy_pending
is called. This does not move rds connections to ib_nodev_conns.
So destroy is set to true to move rds connections to ib_nodev_conns.
In rds_ib_unregister_client, flush_workqueue is called to make rds_wq
finsh shutdown rds connections. The function rds_ib_destroy_nodev_conns
is called to shutdown rds connections finally.
Then rds_ib_recv_exit is called to destroy slab.

void rds_ib_recv_exit(void)
{
        kmem_cache_destroy(rds_ib_incoming_slab);
        kmem_cache_destroy(rds_ib_frag_slab);
}
The above slab memory leak will not occur again.

>From tests,
256 rds connections
[root@ca-dev14 ~]# time rmmod rds_rdma

real    0m16.522s
user    0m0.000s
sys     0m8.152s
512 rds connections
[root@ca-dev14 ~]# time rmmod rds_rdma

real    0m32.054s
user    0m0.000s
sys     0m15.568s

To rmmod rds_rdma with 256 rds connections, about 16 seconds are needed.
And with 512 rds connections, about 32 seconds are needed.
>From ftrace, when one rds connection is destroyed,

"
 19)               |  rds_conn_destroy [rds]() {
 19)   7.782 us    |    rds_conn_path_drop [rds]();
 15)               |  rds_shutdown_worker [rds]() {
 15)               |    rds_conn_shutdown [rds]() {
 15)   1.651 us    |      rds_send_path_reset [rds]();
 15)   7.195 us    |    }
 15) + 11.434 us   |  }
 19)   2.285 us    |    rds_cong_remove_conn [rds]();
 19) * 24062.76 us |  }
"
So if many rds connections will be destroyed, this function
rds_ib_destroy_nodev_conns uses most of time.

Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 17:08:14 -07:00
Zhu Yanjun
fe3475af3b net: rds: add per rds connection cache statistics
The variable cache_allocs is to indicate how many frags (KiB) are in one
rds connection frag cache.
The command "rds-info -Iv" will output the rds connection cache
statistics as below:
"
RDS IB Connections:
      LocalAddr RemoteAddr Tos SL  LocalDev            RemoteDev
      1.1.1.14 1.1.1.14   58 255  fe80::2:c903🅰️7a31 fe80::2:c903🅰️7a31
      send_wr=256, recv_wr=1024, send_sge=8, rdma_mr_max=4096,
      rdma_mr_size=257, cache_allocs=12
"
This means that there are about 12KiB frag in this rds connection frag
cache.
Since rds.h in rds-tools is not related with the kernel rds.h, the change
in kernel rds.h does not affect rds-tools.
rds-info in rds-tools 2.0.5 and 2.0.6 is tested with this commit. It works
well.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 17:07:06 -07:00
Xin Long
0a90478b93 ipv4: not do cache for local delivery if bc_forwarding is enabled
With the topo:

    h1 ---| rp1            |
          |     route  rp3 |--- h3 (192.168.200.1)
    h2 ---| rp2            |

If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after
doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on
h2, and the packets can still be forwared.

This issue was caused by the input route cache. It should only do
the cache for either bc forwarding or local delivery. Otherwise,
local delivery can use the route cache for bc forwarding of other
interfaces.

This patch is to fix it by not doing cache for local delivery if
all.bc_forwarding is enabled.

Note that we don't fix it by checking route cache local flag after
rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the
common route code shouldn't be touched for bc_forwarding.

Fixes: 5cbf777cfd ("route: add support for directed broadcast forwarding")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 16:59:21 -07:00
Enrico Weigelt
26f8113cc7 net: ipv6: drop unneeded likely() call around IS_ERR()
IS_ERR() already calls unlikely(), so this extra unlikely() call
around IS_ERR() is not needed.

Signed-off-by: Enrico Weigelt <info@metux.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 16:57:23 -07:00
Enrico Weigelt
88e235b80c net: ipv4: drop unneeded likely() call around IS_ERR()
IS_ERR() already calls unlikely(), so this extra unlikely() call
around IS_ERR() is not needed.

Signed-off-by: Enrico Weigelt <info@metux.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 16:57:23 -07:00
Enrico Weigelt
b90f5aa4d6 net: openvswitch: drop unneeded likely() call around IS_ERR()
IS_ERR() already calls unlikely(), so this extra likely() call
around the !IS_ERR() is not needed.

Signed-off-by: Enrico Weigelt <info@metux.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 16:57:23 -07:00
Enrico Weigelt
4546e44ca2 net: socket: drop unneeded likely() call around IS_ERR()
IS_ERR() already calls unlikely(), so this extra likely() call
around the !IS_ERR() is not needed.

Signed-off-by: Enrico Weigelt <info@metux.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 16:57:23 -07:00
Thomas Gleixner
b886d83c5b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation version 2 of the license

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 315 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:17 +02:00
Thomas Gleixner
28c61a66ab treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432
Based on 1 normalized pattern(s):

  this file is gplv2 as found in copying

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.657082701@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:16 +02:00
Thomas Gleixner
75a6faf617 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 422
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 101 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190113.822954939@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:15 +02:00
Thomas Gleixner
46fe777164 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 399
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not see http www gnu org
  licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 33 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531081038.745679586@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:12 +02:00
Thomas Gleixner
e46bd7099d treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 389
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license v2 as published
  by the free software foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 2 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531081037.837563564@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:11 +02:00
Thomas Gleixner
a10e763b87 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 372
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 135 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531081036.435762997@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:10 +02:00
Thomas Gleixner
ddc64d0ac9 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 363
Based on 1 normalized pattern(s):

  released under terms in gpl version 2 see copying

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 5 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531081035.689962394@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:09 +02:00
Thomas Gleixner
b7c320ff08 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 343
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of version 2 of the gnu general public license as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not see http www gnu org
  licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000437.338011816@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:07 +02:00
Thomas Gleixner
2b27bdcc20 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 336
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 51 franklin st fifth floor boston ma 02110
  1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 246 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.674189849@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:07 +02:00
Thomas Gleixner
a61127c213 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 335
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 51 franklin st fifth floor boston ma 02110
  1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 111 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.567572064@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:06 +02:00
Thomas Gleixner
685a6bf848 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 321
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation version 2 and no later version this
  program is distributed in the hope that it will be useful but
  without any warranty without even the implied warranty of
  merchantability or fitness for a particular purpose see the gnu
  general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 33 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000435.345978407@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:05 +02:00
Thomas Gleixner
5b497af42f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 295
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of version 2 of the gnu general public license as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 64 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:38 +02:00
Thomas Gleixner
2025cf9e19 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 263 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.208660670@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:37 +02:00
Thomas Gleixner
97fb5e8d9b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 284
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 and
  only version 2 as published by the free software foundation this
  program is distributed in the hope that it will be useful but
  without any warranty without even the implied warranty of
  merchantability or fitness for a particular purpose see the gnu
  general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 294 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141900.825281744@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:37 +02:00
Thomas Gleixner
0e9facca1b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 281
Based on 1 normalized pattern(s):

  this program can be redistributed or modified under the terms of the
  gnu general public license version 2 as published by the free
  software foundation this program is distributed without any warranty
  or implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license version 2 for more
  details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141900.551133917@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:36 +02:00
Thomas Gleixner
c942299924 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 269
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of version 2 of the gnu general public license as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 51 franklin street fifth floor boston ma
  02110 1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 21 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141334.228102212@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:30:29 +02:00
Thomas Gleixner
04672fe6d6 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 268
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 51 franklin street fifth floor boston ma
  02110 1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 46 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141334.135501091@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:30:29 +02:00
Florian Westphal
e46817472a xfrm: remove init_flags indirection from xfrm_state_afinfo
There is only one implementation of this function; just call it directly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-05 13:16:30 +02:00
Florian Westphal
5c1b9ab3ec xfrm: remove init_temprop indirection from xfrm_state_afinfo
same as previous patch: just place this in the caller, no need to
have an indirection for a structure initialization.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-05 13:16:30 +02:00
Florian Westphal
bac9593515 xfrm: remove init_tempsel indirection from xfrm_state_afinfo
Simple initialization, handle it in the caller.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-05 13:16:30 +02:00
Paolo Abeni
fdf71426e7 net: fix indirect calls helpers for ptype list hooks.
As Eric noted, the current wrapper for ptype func hook inside
__netif_receive_skb_list_ptype() has no chance of avoiding the indirect
call: we enter such code path only for protocols other than ipv4 and
ipv6.

Instead we can wrap the list_func invocation.

v1 -> v2:
 - use the correct fix tag

Fixes: f5737cbadb ("net: use indirect calls helpers for ptype hook")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 20:16:22 -07:00
David Ahern
f88d8ea67f ipv6: Plumb support for nexthop object in a fib6_info
Add struct nexthop and nh_list list_head to fib6_info. nh_list is the
fib6_info side of the nexthop <-> fib_info relationship. Since a fib6_info
referencing a nexthop object can not have 'sibling' entries (the old way
of doing multipath routes), the nh_list is a union with fib6_siblings.

Add f6i_list list_head to 'struct nexthop' to track fib6_info entries
using a nexthop instance. Update __remove_nexthop_fib to walk f6_list
and delete fib entries using the nexthop.

Add a few nexthop helpers for use when a nexthop is added to fib6_info:
- nexthop_fib6_nh - return first fib6_nh in a nexthop object
- fib6_info_nh_dev moved to nexthop.h and updated to use nexthop_fib6_nh
  if the fib6_info references a nexthop object
- nexthop_path_fib6_result - similar to ipv4, select a path within a
  multipath nexthop object. If the nexthop is a blackhole, set
  fib6_result type to RTN_BLACKHOLE, and set the REJECT flag

Update the fib6_info references to check for nh and take a different path
as needed:
- rt6_qualify_for_ecmp - if a fib entry uses a nexthop object it can NOT
  be coalesced with other fib entries into a multipath route
- rt6_duplicate_nexthop - use nexthop_cmp if either fib6_info references
  a nexthop
- addrconf (host routes), RA's and info entries (anything configured via
  ndisc) does not use nexthop objects
- fib6_info_destroy_rcu - put reference to nexthop object
- fib6_purge_rt - drop fib6_info from f6i_list
- fib6_select_path - update to use the new nexthop_path_fib6_result when
  fib entry uses a nexthop object
- rt6_device_match - update to catch use of nexthop object as a blackhole
  and set fib6_type and flags.
- ip6_route_info_create - don't add space for fib6_nh if fib entry is
  going to reference a nexthop object, take a reference to nexthop object,
  disallow use of source routing
- rt6_nlmsg_size - add space for RTA_NH_ID
- add rt6_fill_node_nexthop to add nexthop data on a dump

As with ipv4, most of the changes push existing code into the else branch
of whether the fib entry uses a nexthop object.

Update the nexthop code to walk f6i_list on a nexthop deleted to remove
fib entries referencing it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 19:26:50 -07:00
David Ahern
4c7e8084fd ipv4: Plumb support for nexthop object in a fib_info
Add 'struct nexthop' and nh_list list_head to fib_info. nh_list is the
fib_info side of the nexthop <-> fib_info relationship.

Add fi_list list_head to 'struct nexthop' to track fib_info entries
using a nexthop instance. Add __remove_nexthop_fib and add it to
__remove_nexthop to walk the new list_head and mark those fib entries
as dead when the nexthop is deleted.

Add a few nexthop helpers for use when a nexthop is added to fib_info:
- nexthop_cmp to determine if 2 nexthops are the same
- nexthop_path_fib_result to select a path for a multipath
  'struct nexthop'
- nexthop_fib_nhc to select a specific fib_nh_common within a
  multipath 'struct nexthop'

Update existing fib_info_nhc to use nexthop_fib_nhc if a fib_info uses
a 'struct nexthop', and mark fib_info_nh as only used for the non-nexthop
case.

Update the fib_info functions to check for fi->nh and take a different
path as needed:
- free_fib_info_rcu - put the nexthop object reference
- fib_release_info - remove the fib_info from the nexthop's fi_list
- nh_comp - use nexthop_cmp when either fib_info references a nexthop
  object
- fib_info_hashfn - use the nexthop id for the hashing vs the oif of
  each fib_nh in a fib_info
- fib_nlmsg_size - add space for the RTA_NH_ID attribute
- fib_create_info - verify nexthop reference can be taken, verify
  nexthop spec is valid for fib entry, and add fib_info to fi_list for
  a nexthop
- fib_select_multipath - use the new nexthop_path_fib_result to select a
  path when nexthop objects are used
- fib_table_lookup - if the 'struct nexthop' is a blackhole nexthop, treat
  it the same as a fib entry using 'blackhole'

The bulk of the changes are in fib_semantics.c and most of that is
moving the existing change_nexthops into an else branch.

Update the nexthop code to walk fi_list on a nexthop deleted to remove
fib entries referencing it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 19:26:49 -07:00
David Ahern
dcb1ecb50e ipv4: Prepare for fib6_nh from a nexthop object
Convert more IPv4 code to use fib_nh_common over fib_nh to enable routes
to use a fib6_nh based nexthop. In the end, only code not using a
nexthop object in a fib_info should directly access fib_nh in a fib_info
without checking the famiy and going through fib_nh_common. Those
functions will be marked when it is not directly evident.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 19:26:49 -07:00
David Ahern
5481d73f81 ipv4: Use accessors for fib_info nexthop data
Use helpers to access fib_nh and fib_nhs fields of a fib_info. Drop the
fib_dev macro which is an alias for the first nexthop. Replacements:

  fi->fib_dev    --> fib_info_nh(fi, 0)->fib_nh_dev
  fi->fib_nh     --> fib_info_nh(fi, 0)
  fi->fib_nh[i]  --> fib_info_nh(fi, i)
  fi->fib_nhs    --> fib_info_num_path(fi)

where fib_info_nh(fi, i) returns fi->fib_nh[nhsel] and fib_info_num_path
returns fi->fib_nhs.

Move the existing fib_info_nhc to nexthop.h and define the new ones
there. A later patch adds a check if a fib_info uses a nexthop object,
and defining the helpers in nexthop.h avoid circular header
dependencies.

After this all remaining open coded references to fi->fib_nhs and
fi->fib_nh are in:
- fib_create_info and helpers used to lookup an existing fib_info
  entry, and
- the netdev event functions fib_sync_down_dev and fib_sync_up.

The latter two will not be reused for nexthops, and the fib_create_info
will be updated to handle a nexthop in a fib_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 19:26:49 -07:00
Tim Beale
82ba25c6de udp: only choose unbound UDP socket for multicast when not in a VRF
By default, packets received in another VRF should not be passed to an
unbound socket in the default VRF. This patch updates the IPv4 UDP
multicast logic to match the unicast VRF logic (in compute_score()),
as well as the IPv6 mcast logic (in __udp_v6_is_mcast_sock()).

The particular case I noticed was DHCP discover packets going
to the 255.255.255.255 address, which are handled by
__udp4_lib_mcast_deliver(). The previous code meant that running
multiple different DHCP server or relay agent instances across VRFs
did not work correctly - any server/relay agent in the default VRF
received DHCP discover packets for all other VRFs.

Fixes: 6da5b0f027 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
Signed-off-by: Tim Beale <timbeale@catalyst.net.nz>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 18:34:03 -07:00
David Ahern
7dd73168e2 ipv6: Always allocate pcpu memory in a fib6_nh
A recent commit had an unintended side effect with reject routes:
rt6i_pcpu is expected to always be initialized for all fib6_info except
the null entry. The commit mentioned below skips it for reject routes
and ends up leaking references to the loopback device. For example,

    ip netns add foo
    ip -netns foo li set lo up
    ip -netns foo -6 ro add blackhole 2001:db8:1::1
    ip netns exec foo ping6 2001:db8:1::1
    ip netns del foo

ends up spewing:
    unregister_netdevice: waiting for lo to become free. Usage count = 3

The fib_nh_common_init is not needed for reject routes (no ipv4 caching
or encaps), so move the alloc_percpu_gfp after it and adjust the goto label.

Fixes: f40b6ae2b6 ("ipv6: Move pcpu cached routes to fib6_nh")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:54:59 -07:00
Ariel Levkovich
8b6912a501 net: vlan: Inherit MPLS features from parent device
During the creation of the VLAN interface net device,
the various device features and offloads are being set based
on the parent device's features.
The code initiates the basic, vlan and encapsulation features
but doesn't address the MPLS features set and they remain blank.
As a result, all device offloads that have significant performance
effect are disabled for MPLS traffic going via this VLAN device such
as checksumming and TSO.

This patch makes sure that MPLS features are also set for the
VLAN device based on the parent which will allow HW offloads of
checksumming and TSO to be performed on MPLS tagged packets.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:49:38 -07:00
Jakub Kicinski
fb0f886fa2 net/tls: don't pass version to tls_advance_record_sn()
All callers pass prot->version as the last parameter
of tls_advance_record_sn(), yet tls_advance_record_sn()
itself needs a pointer to prot.  Pass prot from callers.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski
9cd81988cc net/tls: use version from prot
ctx->prot holds the same information as per-direction contexts.
Almost all code gets TLS version from this structure, convert
the last two stragglers, this way we can improve the cache
utilization by moving the per-direction data into cold cache lines.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski
1fe275d434 net/tls: don't re-check msg decrypted status in tls_device_decrypted()
tls_device_decrypted() is only called from decrypt_skb_update(),
when ctx->decrypted == false, there is no need to re-check the bit.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski
b9d8fec927 net/tls: don't look for decrypted frames on non-offloaded sockets
If the RX config of a TLS socket is SW, there is no point iterating
over the fragments and checking if frame is decrypted.  It will
always be fully encrypted.  Note that in fully encrypted case
the function doesn't actually touch any offload-related state,
so it's safe to call for TLS_SW, today.  Soon we will introduce
code which can only be called for offloaded contexts.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski
87b11e0638 net/tls: remove false positive warning
It's possible that TCP stack will decide to retransmit a packet
right when that packet's data gets acked, especially in presence
of packet reordering.  This means that packets may be in flight,
even though tls_device code has already freed their record state.
Make fill_sg_in() and in turn tls_sw_fallback() not generate a
warning in that case, and quietly proceed to drop such frames.

Make the exit path from tls_sw_fallback() drop monitor friendly,
for users to be able to troubleshoot dropped retransmissions.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski
aeb11ff0dc net/tls: check return values from skb_copy_bits() and skb_store_bits()
In light of recent bugs, we should make a better effort of
checking return values.  In theory none of the functions should
fail today.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski
da29e4b466 net/tls: fully initialize the msg wrapper skb
If strparser gets cornered into starting a new message from
an sk_buff which already has frags, it will allocate a new
skb to become the "wrapper" around the fragments of the
message.

This new skb does not inherit any metadata fields.  In case
of TLS offload this may lead to unnecessarily re-encrypting
the message, as skb->decrypted is not set for the wrapper skb.

Try to be conservative and copy all fields of old skb
strparser's user may reasonably need.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Florian Westphal
d3e6e285ff net: ipv4: fix rcu lockdep splat due to wrong annotation
syzbot triggered following splat when strict netlink
validation is enabled:

net/ipv4/devinet.c:1766 suspicious rcu_dereference_check() usage!

This occurs because we hold RTNL mutex, but no rcu read lock.
The second call site holds both, so just switch to the _rtnl variant.

Reported-by: syzbot+bad6e32808a3a97b1515@syzkaller.appspotmail.com
Fixes: 2638eb8b50 ("net: ipv4: provide __rcu annotation for ifa_list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:24:10 -07:00
Jiri Pirko
191ed2024d devlink: allow driver to update progress of flash update
Introduce a function to be called from drivers during flash. It sends
notification to userspace about flash update progress.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:21:40 -07:00
Jakub Kicinski
e52972c11d net/tls: replace the sleeping lock around RX resync with a bit lock
Commit 38030d7cb7 ("net/tls: avoid NULL-deref on resync during device removal")
tried to fix a potential NULL-dereference by taking the
context rwsem.  Unfortunately the RX resync may get called
from soft IRQ, so we can't use the rwsem to protect from
the device disappearing.  Because we are guaranteed there
can be only one resync at a time (it's called from strparser)
use a bit to indicate resync is busy and make device
removal wait for the bit to get cleared.

Note that there is a leftover "flags" field in struct
tls_context already.

Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 13:34:37 -07:00
Jakub Kicinski
27393f8c6e Revert "net/tls: avoid NULL-deref on resync during device removal"
This reverts commit 38030d7cb7.
Unfortunately the RX resync may get called from soft IRQ,
so we can't take the rwsem to protect from the device
disappearing.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 13:34:37 -07:00
Guillaume Nault
a0d56cb911 netfilter: ipv6: nf_defrag: fix leakage of unqueued fragments
With commit 997dd96471 ("net: IP6 defrag: use rbtrees in
nf_conntrack_reasm.c"), nf_ct_frag6_reasm() is now called from
nf_ct_frag6_queue(). With this change, nf_ct_frag6_queue() can fail
after the skb has been added to the fragment queue and
nf_ct_frag6_gather() was adapted to handle this case.

But nf_ct_frag6_queue() can still fail before the fragment has been
queued. nf_ct_frag6_gather() can't handle this case anymore, because it
has no way to know if nf_ct_frag6_queue() queued the fragment before
failing. If it didn't, the skb is lost as the error code is overwritten
with -EINPROGRESS.

Fix this by setting -EINPROGRESS directly in nf_ct_frag6_queue(), so
that nf_ct_frag6_gather() can propagate the error as is.

Fixes: 997dd96471 ("net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-04 15:25:51 +02:00
Eric Dumazet
b703414675 net: fix use-after-free in kfree_skb_list
syzbot reported nasty use-after-free [1]

Lets remove frag_list field from structs ip_fraglist_iter
and ip6_fraglist_iter. This seens not needed anyway.

[1] :
BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
 ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
 dst_output include/net/dst.h:433 [inline]
 ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
 rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:671
 ___sys_sendmsg+0x803/0x920 net/socket.c:2292
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
 __do_sys_sendmsg net/socket.c:2339 [inline]
 __se_sys_sendmsg net/socket.c:2337 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x44add9
Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

Allocated by task 8947:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
 slab_post_alloc_hook mm/slab.h:437 [inline]
 slab_alloc_node mm/slab.c:3269 [inline]
 kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
 alloc_skb include/linux/skbuff.h:1058 [inline]
 __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
 ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
 rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:671
 ___sys_sendmsg+0x803/0x920 net/socket.c:2292
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
 __do_sys_sendmsg net/socket.c:2339 [inline]
 __se_sys_sendmsg net/socket.c:2337 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8947:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kmem_cache_free+0x86/0x260 mm/slab.c:3698
 kfree_skbmem net/core/skbuff.c:625 [inline]
 kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
 __kfree_skb net/core/skbuff.c:682 [inline]
 kfree_skb net/core/skbuff.c:699 [inline]
 kfree_skb+0xf0/0x390 net/core/skbuff.c:693
 kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
 __dev_xmit_skb net/core/dev.c:3551 [inline]
 __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
 dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
 ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
 dst_output include/net/dst.h:433 [inline]
 ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
 rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:671
 ___sys_sendmsg+0x803/0x920 net/socket.c:2292
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
 __do_sys_sendmsg net/socket.c:2339 [inline]
 __se_sys_sendmsg net/socket.c:2337 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888085a3cbc0
 which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 0 bytes inside of
 224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
The buggy address belongs to the page:
page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
>ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                           ^
 ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 0feca6190f ("net: ipv6: add skbuff fraglist splitter")
Fixes: c8b17be0b7 ("net: ipv4: add skbuff fraglist splitter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03 15:18:00 -07:00
Eric Dumazet
5472c3c6a5 tcp: use this_cpu_read(*X) instead of *this_cpu_ptr(X)
this_cpu_read(*X) is slightly faster than *this_cpu_ptr(X)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03 15:09:34 -07:00
Eric Dumazet
046386ca0c ipv4: icmp: use this_cpu_read() in icmp_sk()
this_cpu_read(*X) is faster than *this_cpu_ptr(X)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03 15:08:53 -07:00
Eric Dumazet
c353071ad0 ipv6: use this_cpu_read() in rt6_get_pcpu_route()
this_cpu_read(*X) is faster than *this_cpu_ptr(X)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03 15:06:36 -07:00
Eric Dumazet
2789c14d19 ipv6: icmp: use this_cpu_read() in icmpv6_sk()
In general, this_cpu_read(*X) is faster than *this_cpu_ptr(X)

Also remove the inline attibute, totally useless.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03 14:57:41 -07:00
Stanislav Fomichev
1cc26450a8 flow_dissector: remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag
This flag is not used by any caller, remove it.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03 14:56:35 -07:00
Martin KaFai Lau
257a525fe2 bpf: udp: Avoid calling reuseport's bpf_prog from udp_gro
When the commit a6024562ff ("udp: Add GRO functions to UDP socket")
added udp[46]_lib_lookup_skb to the udp_gro code path, it broke
the reuseport_select_sock() assumption that skb->data is pointing
to the transport header.

This patch follows an earlier __udp6_lib_err() fix by
passing a NULL skb to avoid calling the reuseport's bpf_prog.

Fixes: a6024562ff ("udp: Add GRO functions to UDP socket")
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-03 13:38:48 -07:00
Martin KaFai Lau
4ac30c4b36 bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err
__udp6_lib_err() may be called when handling icmpv6 message. For example,
the icmpv6 toobig(type=2).  __udp6_lib_lookup() is then called
which may call reuseport_select_sock().  reuseport_select_sock() will
call into a bpf_prog (if there is one).

reuseport_select_sock() is expecting the skb->data pointing to the
transport header (udphdr in this case).  For example, run_bpf_filter()
is pulling the transport header.

However, in the __udp6_lib_err() path, the skb->data is pointing to the
ipv6hdr instead of the udphdr.

One option is to pull and push the ipv6hdr in __udp6_lib_err().
Instead of doing this, this patch follows how the original
commit 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
was done in IPv4, which has passed a NULL skb pointer to
reuseport_select_sock().

Fixes: 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
Cc: Craig Gallek <kraig@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-03 13:38:48 -07:00
Heiner Kallweit
db4bad0737 net: ethernet: improve eth_platform_get_mac_address
pci_device_to_OF_node(to_pci_dev(dev)) is the same as dev->of_node,
so we can simplify the code. In addition add an empty line before
the return statement.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02 18:11:57 -07:00
Willem de Bruijn
afa0925c6f packet: unconditionally free po->rollover
Rollover used to use a complex RCU mechanism for assignment, which had
a race condition. The below patch fixed the bug and greatly simplified
the logic.

The feature depends on fanout, but the state is private to the socket.
Fanout_release returns f only when the last member leaves and the
fanout struct is to be freed.

Destroy rollover unconditionally, regardless of fanout state.

Fixes: 57f015f5ec ("packet: fix crash in fanout_demux_rollover()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Diagnosed-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02 18:10:14 -07:00
Florian Westphal
2638eb8b50 net: ipv4: provide __rcu annotation for ifa_list
ifa_list is protected by rcu, yet code doesn't reflect this.

Add the __rcu annotations and fix up all places that are now reported by
sparse.

I've done this in the same commit to not add intermediate patches that
result in new warnings.

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02 18:08:36 -07:00
Florian Westphal
cd5a411dba net: use new in_dev_ifa iterators
Use in_dev_for_each_ifa_rcu/rtnl instead.
This prevents sparse warnings once proper __rcu annotations are added.

Signed-off-by: Florian Westphal <fw@strlen.de>

t di# Last commands done (6 commands done):

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02 18:06:26 -07:00
Florian Westphal
b8d1957236 netfilter: use in_dev_for_each_ifa_rcu
Netfilter hooks are always running under rcu read lock, use
the new iterator macro so sparse won't complain once we add
proper __rcu annotations.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02 18:06:26 -07:00
Florian Westphal
d519e8708b devinet: use in_dev_for_each_ifa_rcu in more places
This also replaces spots that used for_primary_ifa().

for_primary_ifa() aborts the loop on the first secondary address seen.

Replace it with either the rcu or rtnl variant of in_dev_for_each_ifa(),
but two places will now also consider secondary addresses too:
inet_addr_onlink() and inet_ifa_byprefix().

I do not understand why they should ignore secondary addresses.

Why would a secondary address not be considered 'on link'?
When matching a prefix, why ignore a matching secondary address?

Other places get converted as well, but gain "->flags & SECONDARY" check.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02 18:06:26 -07:00
Florian Westphal
ef11db3310 net: inetdevice: provide replacement iterators for in_ifaddr walk
The ifa_list is protected either by rcu or rtnl lock, but the
current iterators do not account for this.

This adds two iterators as replacement, a later patch in
the series will update them with the needed rcu/rtnl_dereference calls.

Its not done in this patch yet to avoid sparse warnings -- the fields
lack the proper __rcu annotation.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02 18:06:26 -07:00
Sven Eckelmann
9e6b5648bb batman-adv: Fix duplicated OGMs on NETDEV_UP
The state of slave interfaces are handled differently depending on whether
the interface is up or not. All active interfaces (IFF_UP) will transmit
OGMs. But for B.A.T.M.A.N. IV, also non-active interfaces are scheduling
(low TTL) OGMs on active interfaces. The code which setups and schedules
the OGMs must therefore already be called when the interfaces gets added as
slave interface and the transmit function must then check whether it has to
send out the OGM or not on the specific slave interface.

But the commit f0d97253fb ("batman-adv: remove ogm_emit and ogm_schedule
API calls") moved the setup code from the enable function to the activate
function. The latter is called either when the added slave was already up
when batadv_hardif_enable_interface processed the new interface or when a
NETDEV_UP event was received for this slave interfac. As result, each
NETDEV_UP would schedule a new OGM worker for the interface and thus OGMs
would be send a lot more than expected.

Fixes: f0d97253fb ("batman-adv: remove ogm_emit and ogm_schedule API calls")
Reported-by: Linus Lüssing <linus.luessing@c0d3.blue>
Tested-by: Linus Lüssing <linus.luessing@c0d3.blue>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-06-02 13:33:48 +02:00
David S. Miller
c1e9e01d42 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset container Netfilter/IPVS update for net-next:

1) Add UDP tunnel support for ICMP errors in IPVS.

Julian Anastasov says:

This patchset is a followup to the commit that adds UDP/GUE tunnel:
"ipvs: allow tunneling with gue encapsulation".

What we do is to put tunnel real servers in hash table (patch 1),
add function to lookup tunnels (patch 2) and use it to strip the
embedded tunnel headers from ICMP errors (patch 3).

2) Extend xt_owner to match for supplementary groups, from
   Lukasz Pawelczyk.

3) Remove unused oif field in flow_offload_tuple object, from
   Taehee Yoo.

4) Release basechain counters from workqueue to skip synchronize_rcu()
   call. From Florian Westphal.

5) Replace skb_make_writable() by skb_ensure_writable(). Patchset
   from Florian Westphal.

6) Checksum support for gue encapsulation in IPVS, from Jacky Hu.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-01 16:21:19 -07:00
David S. Miller
0462eaacee Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-05-31

The following pull-request contains BPF updates for your *net-next* tree.

Lots of exciting new features in the first PR of this developement cycle!
The main changes are:

1) misc verifier improvements, from Alexei.

2) bpftool can now convert btf to valid C, from Andrii.

3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs.
   This feature greatly improves BPF speed on 32-bit architectures. From Jiong.

4) cgroups will now auto-detach bpf programs. This fixes issue of thousands
   bpf programs got stuck in dying cgroups. From Roman.

5) new bpf_send_signal() helper, from Yonghong.

6) cgroup inet skb programs can signal CN to the stack, from Lawrence.

7) miscellaneous cleanups, from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31 21:21:18 -07:00
Roman Gushchin
c85d69135a bpf: move memory size checks to bpf_map_charge_init()
Most bpf map types doing similar checks and bytes to pages
conversion during memory allocation and charging.

Let's unify these checks by moving them into bpf_map_charge_init().

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:52:56 -07:00
Roman Gushchin
b936ca643a bpf: rework memlock-based memory accounting for maps
In order to unify the existing memlock charging code with the
memcg-based memory accounting, which will be added later, let's
rework the current scheme.

Currently the following design is used:
  1) .alloc() callback optionally checks if the allocation will likely
     succeed using bpf_map_precharge_memlock()
  2) .alloc() performs actual allocations
  3) .alloc() callback calculates map cost and sets map.memory.pages
  4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
     and performs actual charging; in case of failure the map is
     destroyed
  <map is in use>
  1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
     performs uncharge and releases the user
  2) .map_free() callback releases the memory

The scheme can be simplified and made more robust:
  1) .alloc() calculates map cost and calls bpf_map_charge_init()
  2) bpf_map_charge_init() sets map.memory.user and performs actual
    charge
  3) .alloc() performs actual allocations
  <map is in use>
  1) .map_free() callback releases the memory
  2) bpf_map_charge_finish() performs uncharge and releases the user

The new scheme also allows to reuse bpf_map_charge_init()/finish()
functions for memcg-based accounting. Because charges are performed
before actual allocations and uncharges after freeing the memory,
no bogus memory pressure can be created.

In cases when the map structure is not available (e.g. it's not
created yet, or is already destroyed), on-stack bpf_map_memory
structure is used. The charge can be transferred with the
bpf_map_charge_move() function.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:52:56 -07:00
Roman Gushchin
3539b96e04 bpf: group memory related fields in struct bpf_map_memory
Group "user" and "pages" fields of bpf_map into the bpf_map_memory
structure. Later it can be extended with "memcg" and other related
information.

The main reason for a such change (beside cosmetics) is to pass
bpf_map_memory structure to charging functions before the actual
allocation of bpf_map.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:52:56 -07:00
Roman Gushchin
d50836cda6 bpf: add memlock precharge for socket local storage
Socket local storage maps lack the memlock precharge check,
which is performed before the memory allocation for
most other bpf map types.

Let's add it in order to unify all map types.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:52:56 -07:00
brakmo
956fe21908 bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls
Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
congestion notifications from the BPF programs.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:41:29 -07:00
Colin Ian King
6f43e52528 nexthop: remove redundant assignment to err
The variable err is initialized with a value that is never read
and err is reassigned a few statements later. This initialization
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31 14:33:52 -07:00
Jens Axboe
87e5e6dab6 uio: make import_iovec()/compat_import_iovec() return bytes on success
Currently these functions return < 0 on error, and 0 for success.
Change that so that we return < 0 on error, but number of bytes
for success.

Some callers already treat the return value that way, others need a
slight tweak.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-31 15:30:03 -06:00
Vladimir Oltean
e8d67fa569 net: dsa: sja1105: Don't store frame type in skb->cb
Due to a confusion I thought that eth_type_trans() was called by the
network stack whereas it can actually be called by network drivers to
figure out the skb protocol and next packet_type handlers.

In light of the above, it is not safe to store the frame type from the
DSA tagger's .filter callback (first entry point on RX path), since GRO
is yet to be invoked on the received traffic.  Hence it is very likely
that the skb->cb will actually get overwritten between eth_type_trans()
and the actual DSA packet_type handler.

Of course, what this patch fixes is the actual overwriting of the
SJA1105_SKB_CB(skb)->type field from the GRO layer, which made all
frames be seen as SJA1105_FRAME_TYPE_NORMAL (0).

Fixes: 227d07a07e ("net: dsa: sja1105: Add support for traffic through standalone ports")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31 14:27:27 -07:00
David S. Miller
b4b12b0d2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The phylink conflict was between a bug fix by Russell King
to make sure we have a consistent PHY interface mode, and
a change in net-next to pull some code in phylink_resolve()
into the helper functions phylink_mac_link_{up,down}()

On the dp83867 side it's mostly overlapping changes, with
the 'net' side removing a condition that was supposed to
trigger for RGMII but because of how it was coded never
actually could trigger.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31 10:49:43 -07:00
Pablo Neira Ayuso
c9bb6165a1 netfilter: nf_conntrack_bridge: fix CONFIG_IPV6=y
This patch fixes a few problems with CONFIG_IPV6=y and
CONFIG_NF_CONNTRACK_BRIDGE=m:

In file included from net/netfilter/utils.c:5:
include/linux/netfilter_ipv6.h: In function 'nf_ipv6_br_defrag':
include/linux/netfilter_ipv6.h:110:9: error: implicit declaration of function 'nf_ct_frag6_gather'; did you mean 'nf_ct_attach'? [-Werror=implicit-function-declaration]

And these too:

net/ipv6/netfilter.c:242:2: error: unknown field 'br_defrag' specified in initializer
net/ipv6/netfilter.c:243:2: error: unknown field 'br_fragment' specified in initializer

This patch includes an original chunk from wenxu.

Fixes: 764dd163ac ("netfilter: nf_conntrack_bridge: add support for IPv6")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Yuehaibing <yuehaibing@huawei.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31 09:45:26 -07:00
Jacky Hu
29930e314d ipvs: add checksum support for gue encapsulation
Add checksum support for gue encapsulation with the tun_flags parameter,
which could be one of the values below:
IP_VS_TUNNEL_ENCAP_FLAG_NOCSUM
IP_VS_TUNNEL_ENCAP_FLAG_CSUM
IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM

Signed-off-by: Jacky Hu <hengqing.hu@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:23:52 +02:00
Pablo Neira Ayuso
5142967ab5 netfilter: nf_tables: fix module autoload with inet family
Use MODULE_ALIAS_NFT_EXPR() to make happy the inet family with nat.

Fixes: 63ce3940f3 ("netfilter: nft_redir: add inet support")
Fixes: 071657d2c3 ("netfilter: nft_masq: add inet support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:20:49 +02:00
Florian Westphal
2cf6bffc49 netfilter: replace skb_make_writable with skb_ensure_writable
This converts all remaining users and then removes skb_make_writable.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:48 +02:00
Florian Westphal
fb2eb1c131 netfilter: tcpmss, optstrip: prefer skb_ensure_writable
This also changes optstrip to only make the tcp header writeable
rather than the entire packet.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:48 +02:00
Florian Westphal
8e03707f11 netfilter: xt_HL: prefer skb_ensure_writable
Also, make the argument to be only the needed size of the header
we're altering, no need to pull in the full packet into linear area.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:47 +02:00
Florian Westphal
7418ee4c88 netfilter: nf_tables: prefer skb_ensure_writable
.. so skb_make_writable can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:46 +02:00
Florian Westphal
3862c6a91a netfilter: ipv4: prefer skb_ensure_writable
.. so skb_make_writable can be removed soon.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:46 +02:00
Florian Westphal
86f0453854 netfilter: conntrack, nat: prefer skb_ensure_writable
like previous patches -- convert conntrack to use the core helper.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:45 +02:00
Florian Westphal
ec0974df35 netfilter: ipvs: prefer skb_ensure_writable
It does the same thing, use it instead so we can remove skb_make_writable.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:44 +02:00
Florian Westphal
c1a8311679 netfilter: bridge: convert skb_make_writable to skb_ensure_writable
Back in the day, skb_ensure_writable did not exist.  By now, both functions
have the same precondition:

I. skb_make_writable will test in this order:
  1. wlen > skb->len -> error
  2. if not cloned and wlen <= headlen -> OK
  3. If cloned and wlen bytes of clone writeable -> OK

After those checks, skb is either not cloned but needs to pull from
nonlinear area, or writing to head would also alter data of another clone.

In both cases skb_make_writable will then call __pskb_pull_tail, which will
kmalloc a new memory area to use for skb->head.

IOW, after successful skb_make_writable call, the requested length is in
linear area and can be modified, even if skb was cloned.

II. skb_ensure_writable will do this instead:
   1. call pskb_may_pull.  This handles case 1 above.
      After this, wlen is in linear area, but skb might be cloned.
   2. return if skb is not cloned
   3. return if wlen byte of clone are writeable.
   4. fully copy the skb.

So post-conditions are the same:
*len bytes are writeable in linear area without altering any payload data
of a clone, all header pointers might have been changed.

Only differences are that skb_ensure_writable is in the core, whereas
skb_make_writable lives in netfilter core and the inverted return value.
skb_make_writable returns 0 on error, whereas skb_ensure_writable returns
negative value.

For the normal cases performance is similar:
A. skb is not cloned and in linear area:
   pskb_may_pull is inline helper, so neither function copies.
B. skb is cloned, write is in linear area and clone is writeable:
   both funcions return with step 3.

This series removes skb_make_writable from the kernel.

While at it, pass the needed value instead, its less confusing that way:
There is no special-handling of "0-length" argument in either
skb_make_writable or skb_ensure_writable.

bridge already makes sure ethernet header is in linear area, only purpose
of the make_writable() is is to copy skb->head in case of cloned skbs.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:43 +02:00
Florian Westphal
53315ac660 netfilter: nf_tables: free base chain counters from worker
No need to use synchronize_rcu() here, just swap the two pointers
and have the release occur from work queue after commit has completed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:43 +02:00
Taehee Yoo
5e2ad02e90 netfilter: nf_flow_table: remove unnecessary variable in flow_offload_tuple
The oifidx in the struct flow_offload_tuple is not used anymore.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:42 +02:00
Lukasz Pawelczyk
ea6cc2fd8a netfilter: xt_owner: Add supplementary groups option
The XT_OWNER_SUPPL_GROUPS flag causes GIDs specified with XT_OWNER_GID
to be also checked in the supplementary groups of a process.

f_cred->group_info cannot be modified during its lifetime and f_cred
holds a reference to it so it's safe to use.

Signed-off-by: Lukasz Pawelczyk <l.pawelczyk@samsung.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:41 +02:00
Julian Anastasov
508f744c0d ipvs: strip udp tunnel headers from icmp errors
Recognize UDP tunnels in received ICMP errors and
properly strip the tunnel headers. GUE is what we
have for now.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 17:48:16 +02:00
Julian Anastasov
2aa3c9f48b ipvs: add function to find tunnels
Add ip_vs_find_tunnel() to match tunnel headers
by family, address and optional port. Use it to
properly find the tunnel real server used in
received ICMP errors.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 17:48:09 +02:00
Julian Anastasov
1da40ab6ca ipvs: allow rs_table to contain different real server types
Before now rs_table was used only for NAT real servers.
Change it to allow TUN real severs from different types,
possibly hashed with different port key.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 17:47:57 +02:00
Linus Torvalds
2f4c533499 SPDX update for 5.2-rc3, round 1
Here is another set of reviewed patches that adds SPDX tags to different
 kernel files, based on a set of rules that are being used to parse the
 comments to try to determine that the license of the file is
 "GPL-2.0-or-later" or "GPL-2.0-only".  Only the "obvious" versions of
 these matches are included here, a number of "non-obvious" variants of
 text have been found but those have been postponed for later review and
 analysis.
 
 There is also a patch in here to add the proper SPDX header to a bunch
 of Kbuild files that we have missed in the past due to new files being
 added and forgetting that Kbuild uses two different file names for
 Makefiles.  This issue was reported by the Kbuild maintainer.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXPCHLg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykxyACgql6ktH+Tv8Ho1747kKPiFca1Jq0AoK5HORXI
 yB0DSTXYNjMtH41ypnsZ
 =x2f8
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull yet more SPDX updates from Greg KH:
 "Here is another set of reviewed patches that adds SPDX tags to
  different kernel files, based on a set of rules that are being used to
  parse the comments to try to determine that the license of the file is
  "GPL-2.0-or-later" or "GPL-2.0-only". Only the "obvious" versions of
  these matches are included here, a number of "non-obvious" variants of
  text have been found but those have been postponed for later review
  and analysis.

  There is also a patch in here to add the proper SPDX header to a bunch
  of Kbuild files that we have missed in the past due to new files being
  added and forgetting that Kbuild uses two different file names for
  Makefiles. This issue was reported by the Kbuild maintainer.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers"

* tag 'spdx-5.2-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (82 commits)
  treewide: Add SPDX license identifier - Kbuild
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 225
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 224
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 223
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 222
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 221
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 220
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 218
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 217
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 216
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 215
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 214
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 213
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 211
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 210
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 209
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 207
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 206
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 203
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 201
  ...
2019-05-31 08:34:32 -07:00
Linus Torvalds
036e343109 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix OOPS during nf_tables rule dump, from Florian Westphal.

 2) Use after free in ip_vs_in, from Yue Haibing.

 3) Fix various kTLS bugs (NULL deref during device removal resync,
    netdev notification ignoring, etc.) From Jakub Kicinski.

 4) Fix ipv6 redirects with VRF, from David Ahern.

 5) Memory leak fix in igmpv3_del_delrec(), from Eric Dumazet.

 6) Missing memory allocation failure check in ip6_ra_control(), from
    Gen Zhang. And likewise fix ip_ra_control().

 7) TX clean budget logic error in aquantia, from Igor Russkikh.

 8) SKB leak in llc_build_and_send_ui_pkt(), from Eric Dumazet.

 9) Double frees in mlx5, from Parav Pandit.

10) Fix lost MAC address in r8169 during PCI D3, from Heiner Kallweit.

11) Fix botched register access in mvpp2, from Antoine Tenart.

12) Use after free in napi_gro_frags(), from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (89 commits)
  net: correct zerocopy refcnt with udp MSG_MORE
  ethtool: Check for vlan etype or vlan tci when parsing flow_rule
  net: don't clear sock->sk early to avoid trouble in strparser
  net-gro: fix use-after-free read in napi_gro_frags()
  net: dsa: tag_8021q: Create a stable binary format
  net: dsa: tag_8021q: Change order of rx_vid setup
  net: mvpp2: fix bad MVPP2_TXQ_SCHED_TOKEN_CNTR_REG queue value
  ipv4: tcp_input: fix stack out of bounds when parsing TCP options.
  mlxsw: spectrum: Prevent force of 56G
  mlxsw: spectrum_acl: Avoid warning after identical rules insertion
  net: dsa: mv88e6xxx: fix handling of upper half of STATS_TYPE_PORT
  r8169: fix MAC address being lost in PCI D3
  net: core: support XDP generic on stacked devices.
  netvsc: unshare skb in VF rx handler
  udp: Avoid post-GRO UDP checksum recalculation
  net: phy: dp83867: Set up RGMII TX delay
  net: phy: dp83867: do not call config_init twice
  net: phy: dp83867: increase SGMII autoneg timer duration
  net: phy: dp83867: fix speed 10 in sgmii mode
  net: phy: marvell10g: report if the PHY fails to boot firmware
  ...
2019-05-30 21:11:22 -07:00
Willem de Bruijn
100f6d8e09 net: correct zerocopy refcnt with udp MSG_MORE
TCP zerocopy takes a uarg reference for every skb, plus one for the
tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero
as it builds, sends and frees skbs inside its inner loop.

UDP and RAW zerocopy do not send inside the inner loop so do not need
the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit
52900d22288ed ("udp: elide zerocopy operation in hot path") introduced
extra_uref to pass the initial reference taken in sock_zerocopy_alloc
to the first generated skb.

But, sock_zerocopy_realloc takes this extra reference at the start of
every call. With MSG_MORE, no new skb may be generated to attach the
extra_uref to, so refcnt is incorrectly 2 with only one skb.

Do not take the extra ref if uarg && !tcp, which implies MSG_MORE.
Update extra_uref accordingly.

This conditional assignment triggers a false positive may be used
uninitialized warning, so have to initialize extra_uref at define.

Changes v1->v2: fix typo in Fixes SHA1

Fixes: 52900d2228 ("udp: elide zerocopy operation in hot path")
Reported-by: syzbot <syzkaller@googlegroups.com>
Diagnosed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 15:54:04 -07:00
Kevin 'ldir' Darbyshire-Bryant
84a32edec4 net: sched: act_ctinfo: minor size optimisation
Since the new parameter block is initialised to 0 by kzmalloc we don't
need to mask & clear unused operational mode bits, they are already
unset.

Drop the pointless code.

Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 15:10:22 -07:00
Maxime Chevallier
b73484b2fc ethtool: Check for vlan etype or vlan tci when parsing flow_rule
When parsing an ethtool flow spec to build a flow_rule, the code checks
if both the vlan etype and the vlan tci are specified by the user to add
a FLOW_DISSECTOR_KEY_VLAN match.

However, when the user only specified a vlan etype or a vlan tci, this
check silently ignores these parameters.

For example, the following rule :

ethtool -N eth0 flow-type udp4 vlan 0x0010 action -1 loc 0

will result in no error being issued, but the equivalent rule will be
created and passed to the NIC driver :

ethtool -N eth0 flow-type udp4 action -1 loc 0

In the end, neither the NIC driver using the rule nor the end user have
a way to know that these keys were dropped along the way, or that
incorrect parameters were entered.

This kind of check should be left to either the driver, or the ethtool
flow spec layer.

This commit makes so that ethtool parameters are forwarded as-is to the
NIC driver.

Since none of the users of ethtool_rx_flow_rule_create are using the
VLAN dissector, I don't think this qualifies as a regression.

Fixes: eca4205f9e ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Acked-by: Pablo Neira Ayuso <pablo@gnumonks.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 15:04:55 -07:00
Ioana Ciornei
e70c7aad7a net: dsa: Add error path handling in dsa_tree_setup()
In case a call to dsa_tree_setup() fails, an attempt to cleanup is made
by calling dsa_tree_remove_switch(), which should take care of
removing/unregistering any resources previously allocated. This does not
happen because it is conditioned by dst->setup being true, which is set
only after _all_ setup steps were performed successfully.

This is especially interesting when the internal MDIO bus is registered
but afterwards, a port setup fails and the mdiobus_unregister() is never
called. This leads to a BUG_ON() complaining about the fact that it's
trying to free an MDIO bus that's still registered.

Add proper error handling in all functions branching from
dsa_tree_setup().

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:58:31 -07:00
Jakub Kicinski
2b81f8161d net: don't clear sock->sk early to avoid trouble in strparser
af_inet sets sock->sk to NULL which trips strparser over:

BUG: kernel NULL pointer dereference, address: 0000000000000012
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.2.0-rc1-00139-g14629453a6d3 #21
RIP: 0010:tcp_peek_len+0x10/0x60
RSP: 0018:ffffc02e41c54b98 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9cf924c4e030 RCX: 0000000000000051
RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff9cf97128f480
RBP: ffff9cf9365e0300 R08: ffff9cf94fe7d2c0 R09: 0000000000000000
R10: 000000000000036b R11: ffff9cf939735e00 R12: ffff9cf91ad9ae40
R13: ffff9cf924c4e000 R14: ffff9cf9a8fcbaae R15: 0000000000000020
FS: 0000000000000000(0000) GS:ffff9cf9af7c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000012 CR3: 000000013920a003 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
 <IRQ>
 strp_data_ready+0x48/0x90
 tls_data_ready+0x22/0xd0 [tls]
 tcp_rcv_established+0x569/0x620
 tcp_v4_do_rcv+0x127/0x1e0
 tcp_v4_rcv+0xad7/0xbf0
 ip_protocol_deliver_rcu+0x2c/0x1c0
 ip_local_deliver_finish+0x41/0x50
 ip_local_deliver+0x6b/0xe0
 ? ip_protocol_deliver_rcu+0x1c0/0x1c0
 ip_rcv+0x52/0xd0
 ? ip_rcv_finish_core.isra.20+0x380/0x380
 __netif_receive_skb_one_core+0x7e/0x90
 netif_receive_skb_internal+0x42/0xf0
 napi_gro_receive+0xed/0x150
 nfp_net_poll+0x7a2/0xd30 [nfp]
 ? kmem_cache_free_bulk+0x286/0x310
 net_rx_action+0x149/0x3b0
 __do_softirq+0xe3/0x30a
 ? handle_irq_event_percpu+0x6a/0x80
 irq_exit+0xe8/0xf0
 do_IRQ+0x85/0xd0
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0010:cpuidle_enter_state+0xbc/0x450

To avoid this issue set sock->sk after sk_prot->close.
My grepping and testing did not discover any code which
would depend on the current behaviour.

Fixes: c46234ebb4 ("tls: RX path for ktls")
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:54:17 -07:00
Eric Dumazet
a4270d6795 net-gro: fix use-after-free read in napi_gro_frags()
If a network driver provides to napi_gro_frags() an
skb with a page fragment of exactly 14 bytes, the call
to gro_pull_from_frag0() will 'consume' the fragment
by calling skb_frag_unref(skb, 0), and the page might
be freed and reused.

Reading eth->h_proto at the end of napi_frags_skb() might
read mangled data, or crash under specific debugging features.

BUG: KASAN: use-after-free in napi_frags_skb net/core/dev.c:5833 [inline]
BUG: KASAN: use-after-free in napi_gro_frags+0xc6f/0xd10 net/core/dev.c:5841
Read of size 2 at addr ffff88809366840c by task syz-executor599/8957

CPU: 1 PID: 8957 Comm: syz-executor599 Not tainted 5.2.0-rc1+ #32
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load_n_noabort+0xf/0x20 mm/kasan/generic_report.c:142
 napi_frags_skb net/core/dev.c:5833 [inline]
 napi_gro_frags+0xc6f/0xd10 net/core/dev.c:5841
 tun_get_user+0x2f3c/0x3ff0 drivers/net/tun.c:1991
 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2037
 call_write_iter include/linux/fs.h:1872 [inline]
 do_iter_readv_writev+0x5f8/0x8f0 fs/read_write.c:693
 do_iter_write fs/read_write.c:970 [inline]
 do_iter_write+0x184/0x610 fs/read_write.c:951
 vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015
 do_writev+0x15b/0x330 fs/read_write.c:1058

Fixes: a50e233c50 ("net-gro: restore frag0 optimization")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:52:14 -07:00
Vladimir Oltean
0471dd429c net: dsa: tag_8021q: Create a stable binary format
Tools like tcpdump need to be able to decode the significance of fake
VLAN headers that DSA uses to separate switch ports.

But currently these have no global significance - they are simply an
ordered list of DSA_MAX_SWITCHES x DSA_MAX_PORTS numbers ending at 4095.

The reason why this is submitted as a fix is that the existing mapping
of VIDs should not enter into a stable kernel, so we can pretend that
only the new format exists. This way tcpdump won't need to try to make
something out of the VLAN tags on 5.2 kernels.

Fixes: f9bbe4477c ("net: dsa: Optional VLAN-based port separation for switches without tagging")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:47:14 -07:00
Ioana Ciornei
d34d2baa91 net: dsa: tag_8021q: Change order of rx_vid setup
The 802.1Q tagging performs an unbalanced setup in terms of RX VIDs on
the CPU port. For the ingress path of a 802.1Q switch to work, the RX
VID of a port needs to be seen as tagged egress on the CPU port.

While configuring the other front-panel ports to be part of this VID,
for bridge scenarios, the untagged flag is applied even on the CPU port
in dsa_switch_vlan_add.  This happens because DSA applies the same flags
on the CPU port as on the (bridge-controlled) slave ports, and the
effect in this case is that the CPU port tagged settings get deleted.

Instead of fixing DSA by introducing a way to control VLAN flags on the
CPU port (and hence stop inheriting from the slave ports) - a hard,
perhaps intractable problem - avoid this situation by moving the setup
part of the RX VID on the CPU port after all the other front-panel ports
have been added to the VID.

Fixes: f9bbe4477c ("net: dsa: Optional VLAN-based port separation for switches without tagging")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:47:14 -07:00
Matteo Croce
c3e933a5b8 sctp: deduplicate identical skb_checksum_ops
The same skb_checksum_ops struct is defined twice in two different places,
leading to code duplication. Declare it as a global variable into a common
header instead of allocating it on the stack on each function call.
bloat-o-meter reports a slight code shrink.

add/remove: 1/1 grow/shrink: 0/10 up/down: 128/-1282 (-1154)
Function                                     old     new   delta
sctp_csum_ops                                  -     128    +128
crc32c_csum_ops                               16       -     -16
sctp_rcv                                    6616    6583     -33
sctp_packet_pack                            4542    4504     -38
nf_conntrack_sctp_packet                    4980    4926     -54
execute_masked_set_action                   6453    6389     -64
tcf_csum_sctp                                575     428    -147
sctp_gso_segment                            1292    1126    -166
sctp_csum_check                              579     412    -167
sctp_snat_handler                            957     772    -185
sctp_dnat_handler                           1321    1132    -189
l4proto_manip_pkt                           2536    2313    -223
Total: Before=359297613, After=359296459, chg -0.00%

Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:35:44 -07:00
Matteo Croce
2544af0344 net: avoid indirect calls in L4 checksum calculation
Commit 283c16a2df ("indirect call wrappers: helpers to speed-up
indirect calls of builtin") introduces some macros to avoid doing
indirect calls.

Use these helpers to remove two indirect calls in the L4 checksum
calculation for devices which don't have hardware support for it.

As a test I generate packets with pktgen out to a dummy interface
with HW checksumming disabled, to have the checksum calculated in
every sent packet.
The packet rate measured with an i7-6700K CPU and a single pktgen
thread raised from 6143 to 6608 Kpps, an increase by 7.5%

Suggested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:34:12 -07:00
Pablo Neira Ayuso
af9573be67 netfilter: nf_conntrack_bridge: register inet conntrack for bridge
This patch enables IPv4 and IPv6 conntrack from the bridge to deal with
local traffic. Hence, packets that are passed up to the local input path
are confirmed later on from the {ipv4,ipv6}_confirm() hooks.

For packets leaving the IP stack (ie. output path), fragmentation occurs
after the inet postrouting hook. Therefore, the bridge local out and
postrouting bridge hooks see fragments with conntrack objects, which is
inconsistent. In this case, we could defragment again from the bridge
output hook, but this is expensive. The recommended filtering spot for
outgoing locally generated traffic leaving through the bridge interface
is to use the classic IPv4/IPv6 output hook, which comes earlier.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:18 -07:00
Pablo Neira Ayuso
764dd163ac netfilter: nf_conntrack_bridge: add support for IPv6
br_defrag() and br_fragment() indirections are added in case that IPv6
support comes as a module, to avoid pulling innecessary dependencies in.

The new fraglist iterator and fragment transformer APIs are used to
implement the refragmentation code.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:18 -07:00
Pablo Neira Ayuso
3c171f496e netfilter: bridge: add connection tracking system
This patch adds basic connection tracking support for the bridge,
including initial IPv4 support.

This patch register two hooks to deal with the bridge forwarding path,
one from the bridge prerouting hook to call nf_conntrack_in(); and
another from the bridge postrouting hook to confirm the entry.

The conntrack bridge prerouting hook defragments packets before passing
them to nf_conntrack_in() to look up for an existing entry, otherwise a
new entry is allocated and it is attached to the skbuff. The conntrack
bridge postrouting hook confirms new conntrack entries, ie. if this is
the first packet seen, then it adds the entry to the hashtable and (if
needed) it refragments the skbuff into the original fragments, leaving
the geometry as is if possible. Exceptions are linearized skbuffs, eg.
skbuffs that are passed up to nfqueue and conntrack helpers, as well as
cloned skbuff for the local delivery (eg. tcpdump), also in case of
bridge port flooding (cloned skbuff too).

The packet defragmentation is done through the ip_defrag() call.  This
forces us to save the bridge control buffer, reset the IP control buffer
area and then restore it after call. This function also bumps the IP
fragmentation statistics, it would be probably desiderable to have
independent statistics for the bridge defragmentation/refragmentation.
The maximum fragment length is stored in the control buffer and it is
used to refragment the skbuff from the postrouting path.

The new fraglist splitter and fragment transformer APIs are used to
implement the bridge refragmentation code. The br_ip_fragment() function
drops the packet in case the maximum fragment size seen is larger than
the output port MTU.

This patchset follows the principle that conntrack should not drop
packets, so users can do it through policy via invalid state matching.

Like br_netfilter, there is no refragmentation for packets that are
passed up for local delivery, ie. prerouting -> input path. There are
calls to nf_reset() already in several spots in the stack since time ago
already, eg. af_packet, that show that skbuff fraglist handling from the
netif_rx path is supported already.

The helpers are called from the postrouting hook, before confirmation,
from there we may see packet floods to bridge ports. Then, although
unlikely, this may result in exercising the helpers many times for each
clone. It would be good to explore how to pass all the packets in a list
to the conntrack hook to do this handle only once for this case.

Thanks to Florian Westphal for handing me over an initial patchset
version to add support for conntrack bridge.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:18 -07:00
Pablo Neira Ayuso
d035f19f59 netfilter: nf_conntrack: allow to register bridge support
This patch adds infrastructure to register and to unregister bridge
support for the conntrack module via nf_ct_bridge_register() and
nf_ct_bridge_unregister().

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:18 -07:00
Pablo Neira Ayuso
19c3401a91 net: ipv4: place control buffer handling away from fragmentation iterators
Deal with the IPCB() area away from the iterators.

The bridge codebase has its own control buffer layout, move specific
IP control buffer handling into the IPv4 codepath.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:18 -07:00
Pablo Neira Ayuso
8a6a1f1764 net: ipv6: split skbuff into fragments transformer
This patch exposes a new API to refragment a skbuff. This allows you to
split either a linear skbuff or to force the refragmentation of an
existing fraglist using a different mtu. The API consists of:

* ip6_frag_init(), that initializes the internal state of the transformer.
* ip6_frag_next(), that allows you to fetch the next fragment. This function
  internally allocates the skbuff that represents the fragment, it pushes
  the IPv6 header, and it also copies the payload for each fragment.

The ip6_frag_state object stores the internal state of the splitter.

This code has been extracted from ip6_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:17 -07:00
Pablo Neira Ayuso
065ff79f88 net: ipv4: split skbuff into fragments transformer
This patch exposes a new API to refragment a skbuff. This allows you to
split either a linear skbuff or to force the refragmentation of an
existing fraglist using a different mtu. The API consists of:

* ip_frag_init(), that initializes the internal state of the transformer.
* ip_frag_next(), that allows you to fetch the next fragment. This function
  internally allocates the skbuff that represents the fragment, it pushes
  the IPv4 header, and it also copies the payload for each fragment.

The ip_frag_state object stores the internal state of the splitter.

This code has been extracted from ip_do_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:17 -07:00
Pablo Neira Ayuso
0feca6190f net: ipv6: add skbuff fraglist splitter
This patch adds the skbuff fraglist split iterator. This API provides an
iterator to transform the fraglist into single skbuff objects, it
consists of:

* ip6_fraglist_init(), that initializes the internal state of the
  fraglist iterator.
* ip6_fraglist_prepare(), that restores the IPv6 header on the fragment.
* ip6_fraglist_next(), that retrieves the fragment from the fraglist and
  updates the internal state of the iterator to point to the next
  fragment in the fraglist.

The ip6_fraglist_iter object stores the internal state of the iterator.

This code has been extracted from ip6_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:17 -07:00
Pablo Neira Ayuso
c8b17be0b7 net: ipv4: add skbuff fraglist splitter
This patch adds the skbuff fraglist splitter. This API provides an
iterator to transform the fraglist into single skbuff objects, it
consists of:

* ip_fraglist_init(), that initializes the internal state of the
  fraglist splitter.
* ip_fraglist_prepare(), that restores the IPv4 header on the
  fragments.
* ip_fraglist_next(), that retrieves the fragment from the fraglist and
  it updates the internal state of the splitter to point to the next
  fragment skbuff in the fraglist.

The ip_fraglist_iter object stores the internal state of the iterator.

This code has been extracted from ip_do_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:17 -07:00
Jason Baron
aa1236cdfa tcp: add support for optional TFO backup key to net.ipv4.tcp_fastopen_key
Add the ability to add a backup TFO key as:

# echo "x-x-x-x,x-x-x-x" > /proc/sys/net/ipv4/tcp_fastopen_key

The key before the comma acks as the primary TFO key and the key after the
comma is the backup TFO key. This change is intended to be backwards
compatible since if only one key is set, userspace will simply read back
that single key as follows:

# echo "x-x-x-x" > /proc/sys/net/ipv4/tcp_fastopen_key
# cat /proc/sys/net/ipv4/tcp_fastopen_key
x-x-x-x

Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 13:41:26 -07:00
Jason Baron
0f1ce02368 tcp: add support to TCP_FASTOPEN_KEY for optional backup key
Add support for get/set of an optional backup key via TCP_FASTOPEN_KEY, in
addition to the current 'primary' key. The primary key is used to encrypt
and decrypt TFO cookies, while the backup is only used to decrypt TFO
cookies. The backup key is used to maximize successful TFO connections when
TFO keys are rotated.

Currently, TCP_FASTOPEN_KEY allows a single 16-byte primary key to be set.
This patch now allows a 32-byte value to be set, where the first 16 bytes
are used as the primary key and the second 16 bytes are used for the backup
key. Similarly, for getsockopt(), we can receive a 32-byte value as output
if requested. If a 16-byte value is used to set the primary key via
TCP_FASTOPEN_KEY, then any previously set backup key will be removed.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 13:41:26 -07:00
Jason Baron
9092a76d3c tcp: add backup TFO key infrastructure
We would like to be able to rotate TFO keys while minimizing the number of
client cookies that are rejected. Currently, we have only one key which can
be used to generate and validate cookies, thus if we simply replace this
key clients can easily have cookies rejected upon rotation.

We propose having the ability to have both a primary key and a backup key.
The primary key is used to generate as well as to validate cookies.
The backup is only used to validate cookies. Thus, keys can be rotated as:

1) generate new key
2) add new key as the backup key
3) swap the primary and backup key, thus setting the new key as the primary

We don't simply set the new key as the primary key and move the old key to
the backup slot because the ip may be behind a load balancer and we further
allow for the fact that all machines behind the load balancer will not be
updated simultaneously.

We make use of this infrastructure in subsequent patches.

Suggested-by: Igor Lubashev <ilubashe@akamai.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 13:41:26 -07:00
Christoph Paasch
483642e5ea tcp: introduce __tcp_fastopen_cookie_gen_cipher()
Restructure __tcp_fastopen_cookie_gen() to take a 'struct crypto_cipher'
argument and rename it as __tcp_fastopen_cookie_gen_cipher(). Subsequent
patches will provide different ciphers based on which key is being used for
the cookie generation.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 13:41:26 -07:00
Young Xiao
9609dad263 ipv4: tcp_input: fix stack out of bounds when parsing TCP options.
The TCP option parsing routines in tcp_parse_options function could
read one byte out of the buffer of the TCP options.

1         while (length > 0) {
2                 int opcode = *ptr++;
3                 int opsize;
4
5                 switch (opcode) {
6                 case TCPOPT_EOL:
7                         return;
8                 case TCPOPT_NOP:        /* Ref: RFC 793 section 3.1 */
9                         length--;
10                        continue;
11                default:
12                        opsize = *ptr++; //out of bound access

If length = 1, then there is an access in line2.
And another access is occurred in line 12.
This would lead to out-of-bound access.

Therefore, in the patch we check that the available data length is
larger enough to pase both TCP option code and size.

Signed-off-by: Young Xiao <92siuyang@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 12:32:47 -07:00
Trond Myklebust
7987b694ad SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
The addition of rpc_check_timeout() to call_decode causes an Oops
when the RPCSEC_GSS credential is rejected.
The reason is that rpc_decode_header() will call xprt_release() in
order to free task->tk_rqstp, which is needed by rpc_check_timeout()
to check whether or not we should exit due to a soft timeout.

The fix is to move the call to xprt_release() into call_decode() so
we can perform it after rpc_check_timeout().

Reported-by: Olga Kornievskaia <olga.kornievskaia@gmail.com>
Reported-by: Nick Bowler <nbowler@draconx.ca>
Fixes: cea57789e4 ("SUNRPC: Clean up")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-30 15:29:41 -04:00
Olga Kornievskaia
ec6017d903 SUNRPC fix regression in umount of a secure mount
If call_status returns ENOTCONN, we need to re-establish the connection
state after. Otherwise the client goes into an infinite loop of call_encode,
call_transmit, call_status (ENOTCONN), call_encode.

Fixes: c8485e4d63 ("SUNRPC: Handle ECONNREFUSED correctly in xprt_transmit()")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # v2.6.29+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-30 15:15:42 -04:00
Herbert Xu
32707c4dfa inet: frags: Remove unnecessary smp_store_release/READ_ONCE
The smp_store_release call in fqdir_exit cannot protect the setting
of fqdir->dead as claimed because its memory barrier is only
guaranteed to be one-way and the barrier precedes the setting of
fqdir->dead.

IOW it doesn't provide any barriers between fq->dir and the following
hash table destruction.

In fact, the code is safe anyway because call_rcu does provide both
the memory barrier as well as a guarantee that when the destruction
work starts executing all RCU readers will see the updated value for
fqdir->dead.

Therefore this patch removes the unnecessary smp_store_release call
as well as the corresponding READ_ONCE on the read-side in order to
not confuse future readers of this code.  Comments have been added
in their places.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 11:51:29 -07:00
Thomas Gleixner
25763b3c86 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 206
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of version 2 of the gnu general public license as
  published by the free software foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 107 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528171438.615055994@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:29:53 -07:00
Thomas Gleixner
9952f6918d treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 201
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not see http www gnu org
  licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 228 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528171438.107155473@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:29:52 -07:00
Thomas Gleixner
af873fcece treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 194
Based on 1 normalized pattern(s):

  license terms gnu general public license gpl version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 161 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528170027.447718015@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:29:22 -07:00
Thomas Gleixner
1f32761322 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 188
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to free software
  foundation 51 franklin street fifth floor boston ma 02111 1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 27 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528170026.981318839@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:29:21 -07:00
Thomas Gleixner
84a14ae8c4 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 178
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 24 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528170026.162703968@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:29:19 -07:00
Thomas Gleixner
1802d0beec treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 174
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 655 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070034.575739538@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:41 -07:00
Thomas Gleixner
935912c538 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 164
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation version 2 of the license this program
  is distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 51 franklin street fifth floor boston ma
  02110 1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 12 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.745497013@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:38 -07:00
Thomas Gleixner
c942fddf87 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157
Based on 3 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version [author] [kishon] [vijay] [abraham]
  [i] [kishon]@[ti] [com] this program is distributed in the hope that
  it will be useful but without any warranty without even the implied
  warranty of merchantability or fitness for a particular purpose see
  the gnu general public license for more details

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version [author] [graeme] [gregory]
  [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
  [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
  [hk] [hemahk]@[ti] [com] this program is distributed in the hope
  that it will be useful but without any warranty without even the
  implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1105 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:37 -07:00
Thomas Gleixner
de6cc6515a treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 153
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 or at your option any
  later version this program is distributed in the hope that it will
  be useful but without any warranty without even the implied warranty
  of merchantability or fitness for a particular purpose see the gnu
  general public license for more details you should have received a
  copy of the gnu general public license along with this program if
  not write to the free software foundation inc 675 mass ave cambridge
  ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 77 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.837555891@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Thomas Gleixner
2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Thomas Gleixner
7931287d47 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 132
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify it
  under the terms of the gnu general public license as published by the
  free software foundation either version 2 of the license or at your
  option any later version this program is distributed in the hope that it
  will be useful but without any warranty without even the implied warranty
  of merchantability or fitness for a particular purpose see the gnu
  general public license for more details you should have received a copy
  of the gnu general public license along with this program if not write to
  the free software foundation inc 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 4 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190524100843.499675784@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:25:14 -07:00
Stephen Hemminger
458bf2f224 net: core: support XDP generic on stacked devices.
When a device is stacked like (team, bonding, failsafe or netvsc) the
XDP generic program for the parent device was not called.

Move the call to XDP generic inside __netif_receive_skb_core where
it can be done multiple times for stacked case.

Fixes: d445516966 ("net: xdp: support xdp generic on virtual devices")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 11:12:21 -07:00
Ioana Ciornei
0e27921816 net: dsa: Use PHYLINK for the CPU/DSA ports
For DSA switches that do not have an .adjust_link callback, aka those
who transitioned totally to the PHYLINK-compliant API, use PHYLINK to
drive the CPU/DSA ports.

The PHYLIB usage and .adjust_link are kept but deprecated, and users are
asked to transition from it.  The reason why we can't do anything for
them is because PHYLINK does not wrap the fixed-link state behind a
phydev object, so we cannot wrap .phylink_mac_config into .adjust_link
unless we fabricate a phy_device structure.

For these ports, the newly introduced PHYLINK_DEV operation type is
used and the dsa_switch device structure is passed to PHYLINK for
printing purposes.  The handling of the PHYLINK_NETDEV and PHYLINK_DEV
PHYLINK instances is common from the perspective of the driver.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-29 21:48:53 -07:00
Ioana Ciornei
77373d49de net: dsa: Move the phylink driver calls into port.c
In order to have a common handling of PHYLINK for the slave and non-user
ports, the DSA core glue logic (between PHYLINK and the driver) must use
an API that does not rely on a struct net_device.

These will also be called by the CPU-port-handling code in a further
patch.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-29 21:48:53 -07:00
Ioana Ciornei
44cc27e43f net: phylink: Add struct phylink_config to PHYLINK API
The phylink_config structure will encapsulate a pointer to a struct
device and the operation type requested for this instance of PHYLINK.
This patch does not make any functional changes, it just transitions the
PHYLINK internals and all its users to the new API.

A pointer to a phylink_config structure will be passed to
phylink_create() instead of the net_device directly. Also, the same
phylink_config pointer will be passed back to all phylink_mac_ops
callbacks instead of the net_device. Using this mechanism, a PHYLINK
user can get the original net_device using a structure such as
'to_net_dev(config->dev)' or directly the structure containing the
phylink_config using a container_of call.

At the moment, only the PHYLINK_NETDEV is defined as a valid operation
type for PHYLINK. In this mode, a valid reference to a struct device
linked to the original net_device should be passed to PHYLINK through
the phylink_config structure.

This API changes is mainly driven by the necessity of adding a new
operation type in PHYLINK that disconnects the phy_device from the
net_device and also works when the net_device is lacking.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-29 21:48:53 -07:00
Kevin 'ldir' Darbyshire-Bryant
24ec483cec net: sched: Introduce act_ctinfo action
ctinfo is a new tc filter action module.  It is designed to restore
information contained in firewall conntrack marks to other packet fields
and is typically used on packet ingress paths.  At present it has two
independent sub-functions or operating modes, DSCP restoration mode &
skb mark restoration mode.

The DSCP restore mode:

This mode copies DSCP values that have been placed in the firewall
conntrack mark back into the IPv4/v6 diffserv fields of relevant
packets.

The DSCP restoration is intended for use and has been found useful for
restoring ingress classifications based on egress classifications across
links that bleach or otherwise change DSCP, typically home ISP Internet
links.  Restoring DSCP on ingress on the WAN link allows qdiscs such as
but by no means limited to CAKE to shape inbound packets according to
policies that are easier to set & mark on egress.

Ingress classification is traditionally a challenging task since
iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT
lookups, hence are unable to see internal IPv4 addresses as used on the
typical home masquerading gateway.  Thus marking the connection in some
manner on egress for later restoration of classification on ingress is
easier to implement.

Parameters related to DSCP restore mode:

dscpmask - a 32 bit mask of 6 contiguous bits and indicate bits of the
conntrack mark field contain the DSCP value to be restored.

statemask - a 32 bit mask of (usually) 1 bit length, outside the area
specified by dscpmask.  This represents a conditional operation flag
whereby the DSCP is only restored if the flag is set.  This is useful to
implement a 'one shot' iptables based classification where the
'complicated' iptables rules are only run once to classify the
connection on initial (egress) packet and subsequent packets are all
marked/restored with the same DSCP.  A mask of zero disables the
conditional behaviour ie. the conntrack mark DSCP bits are always
restored to the ip diffserv field (assuming the conntrack entry is found
& the skb is an ipv4/ipv6 type)

e.g. dscpmask 0xfc000000 statemask 0x01000000

|----0xFC----conntrack mark----000000---|
| Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
| DSCP       | unused | flag  |unused   |
|-----------------------0x01---000000---|
      |                   |
      |                   |
      ---|             Conditional flag
         v             only restore if set
|-ip diffserv-|
| 6 bits      |
|-------------|

The skb mark restore mode (cpmark):

This mode copies the firewall conntrack mark to the skb's mark field.
It is completely the functional equivalent of the existing act_connmark
action with the additional feature of being able to apply a mask to the
restored value.

Parameters related to skb mark restore mode:

mask - a 32 bit mask applied to the firewall conntrack mark to mask out
bits unwanted for restoration.  This can be useful where the conntrack
mark is being used for different purposes by different applications.  If
not specified and by default the whole mark field is copied (i.e.
default mask of 0xffffffff)

e.g. mask 0x00ffffff to mask out the top 8 bits being used by the
aforementioned DSCP restore mode.

|----0x00----conntrack mark----ffffff---|
| Bits 31-24 |                          |
| DSCP & flag|      some value here     |
|---------------------------------------|
			|
			|
			v
|------------skb mark-------------------|
|            |                          |
|  zeroed    |                          |
|---------------------------------------|

Overall parameters:

zone - conntrack zone

control - action related control (reclassify | pipe | drop | continue |
ok | goto chain <CHAIN_INDEX>)

Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-29 21:43:54 -07:00
Johannes Berg
180aa422ef nl80211: fill all policy .type entries
For old commands, it's fine to have .type = NLA_UNSPEC and it
behaves the same as NLA_MIN_LEN. However, for new commands with
strict validation this is no longer true, and for policy export
to userspace these are also ignored.

Fix up the remaining ones that don't have a type.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-29 11:12:11 +02:00
Pradeep Kumar Chitrapu
0112fa557c mac80211: free peer keys before vif down in mesh
freeing peer keys after vif down is resulting in peer key uninstall
to fail due to interface lookup failure. so fix that.

Signed-off-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-29 11:10:15 +02:00
David Ahern
430a049190 nexthop: Add support for nexthop groups
Allow the creation of nexthop groups which reference other nexthop
objects to create multipath routes:

                      +--------------+
   +------------+   +--------------+ |
   | nh  nh_grp --->| nh_grp_entry |-+
   +------------+   +---------|----+
     ^                |       |    +------------+
     +----------------+       +--->| nh, weight |
        nh_parent                  +------------+

A group entry points to a nexthop with a weight for that hop within the
group. The nexthop has a list_head, grp_list, for tracking which groups
it is a member of and the group entry has a reference back to the parent.
The grp_list is used when a nexthop is deleted - to efficiently remove
it from groups using it.

If a nexthop group spec is given, no other attributes can be set. Each
nexthop id in a group spec must already exist.

Similar to single nexthops, the specification of a nexthop group can be
updated so that data is managed with rcu locking.

Add path selection function to account for multiple paths and add
ipv{4,6}_good_nh helpers to know that if a neighbor entry exists it is
in a good state.

Update NETDEV event handling to rebalance multipath nexthop groups if
a nexthop is deleted due to a link event (down or unregister).

When a nexthop is removed any groups using it are updated. Groups using a
nexthop a tracked via a grp_list.

Nexthop dumps can be limited to groups only by adding NHA_GROUPS to the
request.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 21:37:30 -07:00
David Ahern
b513bd035f nexthop: Add support for lwt encaps
Add support for NHA_ENCAP and NHA_ENCAP_TYPE. Leverages the existing code
for lwtunnel within fib_nh_common, so the only change needed is handling
the attributes in the nexthop code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 21:37:30 -07:00
David Ahern
53010f991a nexthop: Add support for IPv6 gateways
Handle IPv6 gateway in a nexthop spec. If nh_family is set to AF_INET6,
NHA_GATEWAY is expected to be an IPv6 address. Add ipv6 option to gw in
nh_config to hold the address, add fib6_nh to nh_info to leverage the
ipv6 initialization and cleanup code. Update nh_fill_node to dump the v6
address.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 21:37:30 -07:00
David Ahern
597cfe4fc3 nexthop: Add support for IPv4 nexthops
Add support for IPv4 nexthops. If nh_family is set to AF_INET, then
NHA_GATEWAY is expected to be an IPv4 address.

Register for netdev events to be notified of admin up/down changes as
well as deletes. A hash table is used to track nexthop per devices to
quickly convert device events to the affected nexthops.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 21:37:30 -07:00
David Ahern
ab84be7e54 net: Initial nexthop code
Barebones start point for nexthops. Implementation for RTM commands,
notifications, management of rbtree for holding nexthops by id, and
kernel side data structures for nexthops and nexthop config.

Nexthops are maintained in an rbtree sorted by id. Similar to routes,
nexthops are configured per namespace using netns_nexthop struct added
to struct net.

Nexthop notifications are sent when a nexthop is added or deleted,
but NOT if the delete is due to a device event or network namespace
teardown (which also involves device events). Applications are
expected to use the device down event to flush nexthops and any
routes used by the nexthops.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 21:37:30 -07:00
Eric Dumazet
8fb44d60d4 llc: fix skb leak in llc_build_and_send_ui_pkt()
If llc_mac_hdr_init() returns an error, we must drop the skb
since no llc_build_and_send_ui_pkt() caller will take care of this.

BUG: memory leak
unreferenced object 0xffff8881202b6800 (size 2048):
  comm "syz-executor907", pid 7074, jiffies 4294943781 (age 8.590s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    1a 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
  backtrace:
    [<00000000e25b5abe>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
    [<00000000e25b5abe>] slab_post_alloc_hook mm/slab.h:439 [inline]
    [<00000000e25b5abe>] slab_alloc mm/slab.c:3326 [inline]
    [<00000000e25b5abe>] __do_kmalloc mm/slab.c:3658 [inline]
    [<00000000e25b5abe>] __kmalloc+0x161/0x2c0 mm/slab.c:3669
    [<00000000a1ae188a>] kmalloc include/linux/slab.h:552 [inline]
    [<00000000a1ae188a>] sk_prot_alloc+0xd6/0x170 net/core/sock.c:1608
    [<00000000ded25bbe>] sk_alloc+0x35/0x2f0 net/core/sock.c:1662
    [<000000002ecae075>] llc_sk_alloc+0x35/0x170 net/llc/llc_conn.c:950
    [<00000000551f7c47>] llc_ui_create+0x7b/0x140 net/llc/af_llc.c:173
    [<0000000029027f0e>] __sock_create+0x164/0x250 net/socket.c:1430
    [<000000008bdec225>] sock_create net/socket.c:1481 [inline]
    [<000000008bdec225>] __sys_socket+0x69/0x110 net/socket.c:1523
    [<00000000b6439228>] __do_sys_socket net/socket.c:1532 [inline]
    [<00000000b6439228>] __se_sys_socket net/socket.c:1530 [inline]
    [<00000000b6439228>] __x64_sys_socket+0x1e/0x30 net/socket.c:1530
    [<00000000cec820c1>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    [<000000000c32554f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff88811d750d00 (size 224):
  comm "syz-executor907", pid 7074, jiffies 4294943781 (age 8.600s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 f0 0c 24 81 88 ff ff 00 68 2b 20 81 88 ff ff  ...$.....h+ ....
  backtrace:
    [<0000000053026172>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
    [<0000000053026172>] slab_post_alloc_hook mm/slab.h:439 [inline]
    [<0000000053026172>] slab_alloc_node mm/slab.c:3269 [inline]
    [<0000000053026172>] kmem_cache_alloc_node+0x153/0x2a0 mm/slab.c:3579
    [<00000000fa8f3c30>] __alloc_skb+0x6e/0x210 net/core/skbuff.c:198
    [<00000000d96fdafb>] alloc_skb include/linux/skbuff.h:1058 [inline]
    [<00000000d96fdafb>] alloc_skb_with_frags+0x5f/0x250 net/core/skbuff.c:5327
    [<000000000a34a2e7>] sock_alloc_send_pskb+0x269/0x2a0 net/core/sock.c:2225
    [<00000000ee39999b>] sock_alloc_send_skb+0x32/0x40 net/core/sock.c:2242
    [<00000000e034d810>] llc_ui_sendmsg+0x10a/0x540 net/llc/af_llc.c:933
    [<00000000c0bc8445>] sock_sendmsg_nosec net/socket.c:652 [inline]
    [<00000000c0bc8445>] sock_sendmsg+0x54/0x70 net/socket.c:671
    [<000000003b687167>] __sys_sendto+0x148/0x1f0 net/socket.c:1964
    [<00000000922d78d9>] __do_sys_sendto net/socket.c:1976 [inline]
    [<00000000922d78d9>] __se_sys_sendto net/socket.c:1972 [inline]
    [<00000000922d78d9>] __x64_sys_sendto+0x2a/0x30 net/socket.c:1972
    [<00000000cec820c1>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    [<000000000c32554f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:25:23 -07:00
Eric Dumazet
dc93f46bc4 inet: frags: fix use-after-free read in inet_frag_destroy_rcu
As caught by syzbot [1], the rcu grace period that is respected
before fqdir_rwork_fn() proceeds and frees fqdir is not enough
to prevent inet_frag_destroy_rcu() being run after the freeing.

We need a proper rcu_barrier() synchronization to replace
the one we had in inet_frags_fini()

We also have to fix a potential problem at module removal :
inet_frags_fini() needs to make sure that all queued work queues
(fqdir_rwork_fn) have completed, otherwise we might
call kmem_cache_destroy() too soon and get another use-after-free.

[1]
BUG: KASAN: use-after-free in inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
Read of size 8 at addr ffff88806ed47a18 by task swapper/1/0

CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.2.0-rc1+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
 __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
 rcu_do_batch kernel/rcu/tree.c:2092 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline]
 rcu_core+0xba5/0x1500 kernel/rcu/tree.c:2291
 __do_softirq+0x25c/0x94c kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:374 [inline]
 irq_exit+0x180/0x1d0 kernel/softirq.c:414
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
 </IRQ>
RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
Code: ff ff 48 89 df e8 f2 95 8c fa eb 82 e9 07 00 00 00 0f 00 2d e4 45 4b 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d d4 45 4b 00 fb f4 <c3> 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 8e 18 42 fa e8 99
RSP: 0018:ffff8880a98e7d78 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff1164e11 RBX: ffff8880a98d4340 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffff8880a98d4bbc
RBP: ffff8880a98e7da8 R08: ffff8880a98d4340 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff88b27078 R14: 0000000000000001 R15: 0000000000000000
 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
 default_idle_call+0x36/0x90 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x377/0x560 kernel/sched/idle.c:263
 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:354
 start_secondary+0x34e/0x4c0 arch/x86/kernel/smpboot.c:267
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243

Allocated by task 8877:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
 kmem_cache_alloc_trace+0x151/0x750 mm/slab.c:3555
 kmalloc include/linux/slab.h:547 [inline]
 kzalloc include/linux/slab.h:742 [inline]
 fqdir_init include/net/inet_frag.h:115 [inline]
 ipv6_frags_init_net+0x48/0x460 net/ipv6/reassembly.c:513
 ops_init+0xb3/0x410 net/core/net_namespace.c:130
 setup_net+0x2d3/0x740 net/core/net_namespace.c:316
 copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
 ksys_unshare+0x440/0x980 kernel/fork.c:2692
 __do_sys_unshare kernel/fork.c:2760 [inline]
 __se_sys_unshare kernel/fork.c:2758 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 17:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kfree+0xcf/0x220 mm/slab.c:3755
 fqdir_rwork_fn+0x33/0x40 net/ipv4/inet_fragment.c:154
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

The buggy address belongs to the object at ffff88806ed47a00
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 24 bytes inside of
 512-byte region [ffff88806ed47a00, ffff88806ed47c00)
The buggy address belongs to the page:
page:ffffea0001bb51c0 refcount:1 mapcount:0 mapping:ffff8880aa400940 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea000282a788 ffffea0001bb53c8 ffff8880aa400940
raw: 0000000000000000 ffff88806ed47000 0000000100000006 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88806ed47900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88806ed47980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88806ed47a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
 ffff88806ed47a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88806ed47b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 3c8fc87820 ("inet: frags: rework rhashtable dismantle")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:22:15 -07:00
Eric Dumazet
ae7352d384 inet: frags: call inet_frags_fini() after unregister_pernet_subsys()
Both IPv6 and 6lowpan are calling inet_frags_fini() too soon.

inet_frags_fini() is dismantling a kmem_cache, that might be needed
later when unregister_pernet_subsys() eventually has to remove
frags queues from hash tables and free them.

This fixes potential use-after-free, and is a prereq for the following patch.

Fixes: d4ad4d22e7 ("inet: frags: use kmem_cache for inet_frag_queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:22:15 -07:00
Eric Dumazet
6b73d19711 inet: frags: uninline fqdir_init()
fqdir_init() is not fast path and is getting bigger.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:22:15 -07:00
Gustavo A. R. Silva
66d4218f99 xprtrdma: Use struct_size() in kzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-28 09:28:49 -04:00
Johannes Berg
1a28ed2136 nl80211: fill all policy .type entries
For old commands, it's fine to have .type = NLA_UNSPEC and it
behaves the same as NLA_MIN_LEN. However, for new commands with
strict validation this is no longer true, and for policy export
to userspace these are also ignored.

Fix up the remaining ones that don't have a type.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-28 14:18:07 +02:00
Thomas Pedersen
551842446e mac80211: mesh: fix RCU warning
ifmsh->csa is an RCU-protected pointer. The writer context
in ieee80211_mesh_finish_csa() is already mutually
exclusive with wdev->sdata.mtx, but the RCU checker did
not know this. Use rcu_dereference_protected() to avoid a
warning.

fixes the following warning:

[   12.519089] =============================
[   12.520042] WARNING: suspicious RCU usage
[   12.520652] 5.1.0-rc7-wt+ #16 Tainted: G        W
[   12.521409] -----------------------------
[   12.521972] net/mac80211/mesh.c:1223 suspicious rcu_dereference_check() usage!
[   12.522928] other info that might help us debug this:
[   12.523984] rcu_scheduler_active = 2, debug_locks = 1
[   12.524855] 5 locks held by kworker/u8:2/152:
[   12.525438]  #0: 00000000057be08c ((wq_completion)phy0){+.+.}, at: process_one_work+0x1a2/0x620
[   12.526607]  #1: 0000000059c6b07a ((work_completion)(&sdata->csa_finalize_work)){+.+.}, at: process_one_work+0x1a2/0x620
[   12.528001]  #2: 00000000f184ba7d (&wdev->mtx){+.+.}, at: ieee80211_csa_finalize_work+0x2f/0x90
[   12.529116]  #3: 00000000831a1f54 (&local->mtx){+.+.}, at: ieee80211_csa_finalize_work+0x47/0x90
[   12.530233]  #4: 00000000fd06f988 (&local->chanctx_mtx){+.+.}, at: ieee80211_csa_finalize_work+0x51/0x90

Signed-off-by: Thomas Pedersen <thomas@eero.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-28 09:47:10 +02:00
Andy Strohman
f77bf4863d nl80211: fix station_info pertid memory leak
When dumping stations, memory allocated for station_info's
pertid member will leak if the nl80211 header cannot be added to
the sk_buff due to insufficient tail room.

I noticed this leak in the kmalloc-2048 cache.

Cc: stable@vger.kernel.org
Fixes: 8689c051a2 ("cfg80211: dynamically allocate per-tid stats for station info")
Signed-off-by: Andy Strohman <andy@uplevelsystems.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-28 09:47:10 +02:00
Chaitanya Tata
a3ce17d149 cfg80211: Handle bss expiry during connection
If the BSS is expired during connection, the connect result will
trigger a kernel warning. Ideally cfg80211 should hold the BSS
before the connection is attempted, but as the BSSID is not known
in case of auth/assoc MLME offload (connect op) it doesn't.

For those drivers without the connect op cfg80211 holds down the
reference so it wil not be removed from list.

Fix this by removing the warning and silently adding the BSS back to
the bss list which is return by the driver (with proper BSSID set) or
in case the BSS is already added use that.

The requirements for drivers are documented in the API's.

Signed-off-by: Chaitanya Tata <chaitanya.tata@bluwireless.co.uk>
[formatting fixes, keep old timestamp]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-28 09:35:39 +02:00
Jouni Malinen
a71fd9dac2 mac80211: Do not use stack memory with scatterlist for GMAC
ieee80211_aes_gmac() uses the mic argument directly in sg_set_buf() and
that does not allow use of stack memory (e.g., BUG_ON() is hit in
sg_set_buf() with CONFIG_DEBUG_SG). BIP GMAC TX side is fine for this
since it can use the skb data buffer, but the RX side was using a stack
variable for deriving the local MIC value to compare against the
received one.

Fix this by allocating heap memory for the mic buffer.

This was found with hwsim test case ap_cipher_bip_gmac_128 hitting that
BUG_ON() and kernel panic.

Cc: stable@vger.kernel.org
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-28 09:22:12 +02:00
Jeremy Sowden
7c80eb1c7e af_key: fix leaks in key_pol_get_resp and dump_sp.
In both functions, if pfkey_xfrm_policy2msg failed we leaked the newly
allocated sk_buff.  Free it on error.

Fixes: 55569ce256 ("Fix conversion between IPSEC_MODE_xxx and XFRM_MODE_xxx.")
Reported-by: syzbot+4f0529365f7f2208d9f0@syzkaller.appspotmail.com
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-05-28 09:20:52 +02:00
Anirudh Gupta
b38ff4075a xfrm: Fix xfrm sel prefix length validation
Family of src/dst can be different from family of selector src/dst.
Use xfrm selector family to validate address prefix length,
while verifying new sa from userspace.

Validated patch with this command:
ip xfrm state add src 1.1.6.1 dst 1.1.6.2 proto esp spi 4260196 \
reqid 20004 mode tunnel aead "rfc4106(gcm(aes))" \
0x1111016400000000000000000000000044440001 128 \
sel src 1011:1:4::2/128 sel dst 1021:1:4::2/128 dev Port5

Fixes: 07bf790895 ("xfrm: Validate address prefix lengths in the xfrm selector.")
Signed-off-by: Anirudh Gupta <anirudh.gupta@sophos.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-05-28 09:16:30 +02:00
Eric W. Biederman
1dfd1711de signal/bpfilter: Fix bpfilter_kernl to use send_sig not force_sig
The locking in force_sig_info is not prepared to deal with
a task that exits or execs (as sighand may change).  As force_sig
is only built to handle synchronous exceptions.

Further the function force_sig_info changes the signal state if the
signal is ignored, or blocked or if SIGNAL_UNKILLABLE will prevent the
delivery of the signal.  The signal SIGKILL can not be ignored and can
not be blocked and SIGNAL_UNKILLABLE won't prevent it from being
delivered.

So using force_sig rather than send_sig for SIGKILL is pointless.

Because it won't impact the sending of the signal and and because
using force_sig is wrong, replace force_sig with send_sig.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Fixes: d2ba09c17a ("net: add skeleton of bpfilter kernel module")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
Colin Ian King
df80152265 ipv4: remove redundant assignment to n
The pointer n is being assigned a value however this value is
never read in the code block and the end of the code block
continues to the next loop iteration. Clean up the code by
removing the redundant assignment.

Fixes: 1bff1a0c9b ("ipv4: Add function to send route updates")
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 22:11:58 -07:00
Jakub Kicinski
04b25a5411 net/tls: fix no wakeup on partial reads
When tls_sw_recvmsg() partially copies a record it pops that
record from ctx->recv_pkt and places it on rx_list.

Next iteration of tls_sw_recvmsg() reads from rx_list via
process_rx_list() before it enters the decryption loop.
If there is no more records to be read tls_wait_data()
will put the process on the wait queue and got to sleep.
This is incorrect, because some data was already copied
in process_rx_list().

In case of RPC connections process may never get woken up,
because peer also simply blocks in read().

I think this may also fix a similar issue when BPF is at
play, because after __tcp_bpf_recvmsg() returns some data
we subtract it from len and use continue to restart the
loop, but len could have just reached 0, so again we'd
sleep unnecessarily. That's added by:
commit d3b18ad31f ("tls: add bpf support to sk_msg handling")

Fixes: 692d7b5d1f ("tls: Fix recvmsg() to be able to peek across multiple records")
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Tested-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 21:47:13 -07:00
Jakub Kicinski
46a1695960 net/tls: fix lowat calculation if some data came from previous record
If some of the data came from the previous record, i.e. from
the rx_list it had already been decrypted, so it's not counted
towards the "decrypted" variable, but the "copied" variable.
Take that into account when checking lowat.

When calculating lowat target we need to pass the original len.
E.g. if lowat is at 80, len is 100 and we had 30 bytes on rx_list
target would currently be incorrectly calculated as 70, even though
we only need 50 more bytes to make up the 80.

Fixes: 692d7b5d1f ("tls: Fix recvmsg() to be able to peek across multiple records")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Tested-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 21:47:12 -07:00
Eric Dumazet
3c8fc87820 inet: frags: rework rhashtable dismantle
syszbot found an interesting use-after-free [1] happening
while IPv4 fragment rhashtable was destroyed at netns dismantle.

While no insertions can possibly happen at the time a dismantling
netns is destroying this rhashtable, timers can still fire and
attempt to remove elements from this rhashtable.

This is forbidden, since rhashtable_free_and_destroy() has
no synchronization against concurrent inserts and deletes.

Add a new fqdir->dead flag so that timers do not attempt
a rhashtable_remove_fast() operation.

We also have to respect an RCU grace period before starting
the rhashtable_free_and_destroy() from process context,
thus we use rcu_work infrastructure.

This is a refinement of a prior rough attempt to fix this bug :
https://marc.info/?l=linux-netdev&m=153845936820900&w=2

Since the rhashtable cleanup is now deferred to a work queue,
netns dismantles should be slightly faster.

[1]
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:194 [inline]
BUG: KASAN: use-after-free in rhashtable_last_table+0x162/0x180 lib/rhashtable.c:212
Read of size 8 at addr ffff8880a6497b70 by task kworker/0:0/5

CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.2.0-rc1+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events rht_deferred_worker
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 __read_once_size include/linux/compiler.h:194 [inline]
 rhashtable_last_table+0x162/0x180 lib/rhashtable.c:212
 rht_deferred_worker+0x111/0x2030 lib/rhashtable.c:411
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

Allocated by task 32687:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
 __do_kmalloc_node mm/slab.c:3620 [inline]
 __kmalloc_node+0x4e/0x70 mm/slab.c:3627
 kmalloc_node include/linux/slab.h:590 [inline]
 kvmalloc_node+0x68/0x100 mm/util.c:431
 kvmalloc include/linux/mm.h:637 [inline]
 kvzalloc include/linux/mm.h:645 [inline]
 bucket_table_alloc+0x90/0x480 lib/rhashtable.c:178
 rhashtable_init+0x3f4/0x7b0 lib/rhashtable.c:1057
 inet_frags_init_net include/net/inet_frag.h:109 [inline]
 ipv4_frags_init_net+0x182/0x410 net/ipv4/ip_fragment.c:683
 ops_init+0xb3/0x410 net/core/net_namespace.c:130
 setup_net+0x2d3/0x740 net/core/net_namespace.c:316
 copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
 ksys_unshare+0x440/0x980 kernel/fork.c:2692
 __do_sys_unshare kernel/fork.c:2760 [inline]
 __se_sys_unshare kernel/fork.c:2758 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 7:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kfree+0xcf/0x220 mm/slab.c:3755
 kvfree+0x61/0x70 mm/util.c:460
 bucket_table_free+0x69/0x150 lib/rhashtable.c:108
 rhashtable_free_and_destroy+0x165/0x8b0 lib/rhashtable.c:1155
 inet_frags_exit_net+0x3d/0x50 net/ipv4/inet_fragment.c:152
 ipv4_frags_exit_net+0x73/0x90 net/ipv4/ip_fragment.c:695
 ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154
 cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

The buggy address belongs to the object at ffff8880a6497b40
 which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 48 bytes inside of
 1024-byte region [ffff8880a6497b40, ffff8880a6497f40)
The buggy address belongs to the page:
page:ffffea0002992580 refcount:1 mapcount:0 mapping:ffff8880aa400ac0 index:0xffff8880a64964c0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea0002916e88 ffffea000218fe08 ffff8880aa400ac0
raw: ffff8880a64964c0 ffff8880a6496040 0000000100000005 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8880a6497a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880a6497a80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff8880a6497b00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                             ^
 ffff8880a6497b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880a6497c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 648700f76b ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
4907abc605 net: dynamically allocate fqdir structures
Following patch will add rcu grace period before fqdir
rhashtable destruction, so we need to dynamically allocate
fqdir structures to not force expensive synchronize_rcu() calls
in netns dismantle path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
a39aca678a net: add a net pointer to struct fqdir
fqdir will soon be dynamically allocated.

We need to reach the struct net pointer from fqdir,
so add it, and replace the various container_of() constructs
by direct access to the new field.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
9cce45f22c net: rename inet_frags_init_net() to fdir_init()
And pass an extra parameter, since we will soon
dynamically allocate fqdir structures.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
d2dfd43598 ieee820154: 6lowpan: no longer reference init_net in lowpan_frags_ns_ctl_table
(struct net *)->ieee802154_lowpan.fqdir will soon be a pointer, so make
sure lowpan_frags_ns_ctl_table[] does not reference init_net.

lowpan_frags_ns_sysctl_register() can perform the needed initialization
for all netns.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
3bb13dd4ca netfilter: ipv6: nf_defrag: no longer reference init_net in nf_ct_frag6_sysctl_table
(struct net *)->nf_frag.fqdir will soon be a pointer, so make
sure nf_ct_frag6_sysctl_table[] does not reference init_net.

nf_ct_frag6_sysctl_register() can perform the needed initialization
for all netns.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
8668d0e2bf ipv6: no longer reference init_net in ip6_frags_ns_ctl_table[]
(struct net *)->ipv6.fqdir will soon be a pointer, so make
sure ip6_frags_ns_ctl_table[] does not reference init_net.

ip6_frags_ns_ctl_register() can perform the needed initialization
for all netns.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
8dfdb31335 ipv4: no longer reference init_net in ip4_frags_ns_ctl_table[]
(struct net *)->ipv4.fqdir will soon be a pointer, so make
sure ip4_frags_ns_ctl_table[] does not reference init_net.

ip4_frags_ns_ctl_register() can perform the needed initialization
for all netns.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
803fdd9968 net: rename struct fqdir fields
Rename the @frags fields from structs netns_ipv4, netns_ipv6,
netns_nf_frag and netns_ieee802154_lowpan to @fqdir

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
89fb900514 net: rename inet_frags_exit_net() to fqdir_exit()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:04 -07:00
Eric Dumazet
6ce3b4dcee inet: rename netns_frags to fqdir
1) struct netns_frags is renamed to struct fqdir
  This structure is really holding many frag queues in a hash table.

2) (struct inet_frag_queue)->net field is renamed to fqdir
  since net is generally associated to a 'struct net' pointer
  in networking stack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:04 -07:00
David Howells
fba9be4970 vfs: Convert sockfs to use the new mount API
Convert the sockfs filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 18:06:12 -04:00
David Howells
b9662f3103 vfs: Convert rpc_pipefs to use the new mount API
Convert the rpc_pipefs filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna.schumaker@netapp.com>
cc: "J. Bruce Fields" <bfields@fieldses.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-nfs@vger.kernel.org
2019-05-25 17:59:56 -04:00
Al Viro
1f58bb18f6 mount_pseudo(): drop 'name' argument, switch to d_make_root()
Once upon a time we used to set ->d_name of e.g. pipefs root
so that d_path() on pipes would work.  These days it's
completely pointless - dentries of pipes are not even connected
to pipefs root.  However, mount_pseudo() had set the root
dentry name (passed as the second argument) and callers
kept inventing names to pass to it.  Including those that
didn't *have* any non-root dentries to start with...

All of that had been pointless for about 8 years now; it's
time to get rid of that cargo-culting...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 17:59:24 -04:00
Gen Zhang
425aa0e1d0 ip_sockglue: Fix missing-check bug in ip_ra_control()
In function ip_ra_control(), the pointer new_ra is allocated a memory
space via kmalloc(). And it is used in the following codes. However,
when  there is a memory allocation error, kmalloc() fails. Thus null
pointer dereference may happen. And it will cause the kernel to crash.
Therefore, we should check the return value and handle the error.

Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-25 11:00:50 -07:00
Gen Zhang
95baa60a0d ipv6_sockglue: Fix a missing-check bug in ip6_ra_control()
In function ip6_ra_control(), the pointer new_ra is allocated a memory
space via kmalloc(). And it is used in the following codes. However,
when there is a memory allocation error, kmalloc() fails. Thus null
pointer dereference may happen. And it will cause the kernel to crash.
Therefore, we should check the return value and handle the error.

Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-25 10:59:45 -07:00
Gustavo A. R. Silva
6dca9360a9 flow_offload: use struct_size() in kzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
   int stuff;
   struct boo entry[];
};

instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-25 10:57:58 -07:00
Linus Lüssing
6bc4544021 batman-adv: mcast: shorten multicast tt/tvlv worker spinlock section
It is not necessary to hold the mla_lock spinlock during the whole
multicast tt/tvlv worker callback. Just holding it during the checks and
updates of the bat_priv stored multicast flags and mla_list is enough.

Therefore this patch splits batadv_mcast_mla_tvlv_update() in two:
batadv_mcast_mla_flags_get() at the beginning of the worker to gather
and calculate the new multicast flags, which does not need any locking
as it neither reads from nor writes to bat_priv->mcast.

And batadv_mcast_mla_flags_update() at the end of the worker which
commits the newly calculated flags and lists to bat_priv->mcast and
therefore needs the lock.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-05-25 12:59:54 +02:00
Sven Eckelmann
68a600de9a batman-adv: Use includes instead of fwdecls
While it can be slightly beneficial for the build performance to use
forward declarations instead of includes, the handling of them together
with changes in the included headers makes it unnecessary complicated and
fragile. Just replace them with actual includes since some parts (hwmon,
..) of the kernel even request avoidance of forward declarations and net/
is mostly not using them in *.c file.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-05-25 12:59:53 +02:00
Sven Eckelmann
47d4522dd5 batman-adv: Add missing include for atomic functions
main.h is using atomic_add_unless and log.h atomic_read. The main
header linux/atomic.h should be included for these files.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-05-25 12:59:53 +02:00
Sven Eckelmann
e192875298 batman-adv: Fix includes for *_MAX constants
The commit 54d50897d5 ("linux/kernel.h: split *_MAX and *_MIN macros into
<linux/limits.h>") moved the U32_MAX/INT_MAX/ULONG_MAX from linux/kernel.h
to linux/limits.h. Adjust the includes accordingly.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-05-25 12:59:53 +02:00
Linus Torvalds
86c2f5d653 SPDX update for 5.2-rc2, round 2
Here is another set of reviewed patches that adds SPDX tags to different
 kernel files, based on a set of rules that are being used to parse the
 comments to try to determine that the license of the file is
 "GPL-2.0-or-later".  Only the "obvious" versions of these matches are
 included here, a number of "non-obvious" variants of text have been
 found but those have been postponed for later review and analysis.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOgmlw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yk4rACfRqxGOGVLR/t6E9dDzOZRAdEz/mYAoJLZmziY
 0YlSSSPtP5HI6JDh65Ng
 =HXQb
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pule more SPDX updates from Greg KH:
 "Here is another set of reviewed patches that adds SPDX tags to
  different kernel files, based on a set of rules that are being used to
  parse the comments to try to determine that the license of the file is
  "GPL-2.0-or-later".

  Only the "obvious" versions of these matches are included here, a
  number of "non-obvious" variants of text have been found but those
  have been postponed for later review and analysis.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers"

* tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (85 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 125
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 123
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 122
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 121
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 120
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 119
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 116
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 113
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 112
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 111
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 110
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 106
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 105
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 103
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 102
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 101
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98
  ...
2019-05-24 14:31:58 -07:00
John Fastabend
bd95e678e0 bpf: sockmap, fix use after free from sleep in psock backlog workqueue
Backlog work for psock (sk_psock_backlog) might sleep while waiting
for memory to free up when sending packets. However, while sleeping
the socket may be closed and removed from the map by the user space
side.

This breaks an assumption in sk_stream_wait_memory, which expects the
wait queue to be still there when it wakes up resulting in a
use-after-free shown below. To fix his mark sendmsg as MSG_DONTWAIT
to avoid the sleep altogether. We already set the flag for the
sendpage case but we missed the case were sendmsg is used.
Sockmap is currently the only user of skb_send_sock_locked() so only
the sockmap paths should be impacted.

==================================================================
BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110

CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
Workqueue: events sk_psock_backlog
Call Trace:
 print_address_description+0x6e/0x2b0
 ? remove_wait_queue+0x31/0x70
 kasan_report+0xfd/0x177
 ? remove_wait_queue+0x31/0x70
 ? remove_wait_queue+0x31/0x70
 remove_wait_queue+0x31/0x70
 sk_stream_wait_memory+0x4dd/0x5f0
 ? sk_stream_wait_close+0x1b0/0x1b0
 ? wait_woken+0xc0/0xc0
 ? tcp_current_mss+0xc5/0x110
 tcp_sendmsg_locked+0x634/0x15d0
 ? tcp_set_state+0x2e0/0x2e0
 ? __kasan_slab_free+0x1d1/0x230
 ? kmem_cache_free+0x70/0x140
 ? sk_psock_backlog+0x40c/0x4b0
 ? process_one_work+0x40b/0x660
 ? worker_thread+0x82/0x680
 ? kthread+0x1b9/0x1e0
 ? ret_from_fork+0x1f/0x30
 ? check_preempt_curr+0xaf/0x130
 ? iov_iter_kvec+0x5f/0x70
 ? kernel_sendmsg_locked+0xa0/0xe0
 skb_send_sock_locked+0x273/0x3c0
 ? skb_splice_bits+0x180/0x180
 ? start_thread+0xe0/0xe0
 ? update_min_vruntime.constprop.27+0x88/0xc0
 sk_psock_backlog+0xb3/0x4b0
 ? strscpy+0xbf/0x1e0
 process_one_work+0x40b/0x660
 worker_thread+0x82/0x680
 ? process_one_work+0x660/0x660
 kthread+0x1b9/0x1e0
 ? __kthread_create_on_node+0x250/0x250
 ret_from_fork+0x1f/0x30

Fixes: 20bf50de30 ("skbuff: Function to send an skbuf on a socket")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-24 23:18:42 +02:00
Vlad Buslov
4097e9d250 net: sched: don't use tc_action->order during action dump
Function tcf_action_dump() relies on tc_action->order field when starting
nested nla to send action data to userspace. This approach breaks in
several cases:

- When multiple filters point to same shared action, tc_action->order field
  is overwritten each time it is attached to filter. This causes filter
  dump to output action with incorrect attribute for all filters that have
  the action in different position (different order) from the last set
  tc_action->order value.

- When action data is displayed using tc action API (RTM_GETACTION), action
  order is overwritten by tca_action_gd() according to its position in
  resulting array of nl attributes, which will break filter dump for all
  filters attached to that shared action that expect it to have different
  order value.

Don't rely on tc_action->order when dumping actions. Set nla according to
action position in resulting array of actions instead.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 13:27:52 -07:00
David Ahern
0fa6efc547 ipv6: Refactor ip6_route_del for cached routes
Move the removal of cached routes to a helper, ip6_del_cached_rt, that
can be invoked per nexthop. Rename the existig ip6_del_cached_rt to
__ip6_del_cached_rt since it is called by ip6_del_cached_rt.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 13:26:44 -07:00
David Ahern
1cf844c747 ipv6: Make fib6_nh optional at the end of fib6_info
Move fib6_nh to the end of fib6_info and make it an array of
size 0. Pass a flag to fib6_info_alloc indicating if the
allocation needs to add space for a fib6_nh.

The current code path always has a fib6_nh allocated with a
fib6_info; with nexthop objects they will be separate.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 13:26:44 -07:00
David Ahern
cc5c073a69 ipv6: Move exception bucket to fib6_nh
Similar to the pcpu routes exceptions are really per nexthop, so move
rt6i_exception_bucket from fib6_info to fib6_nh.

To avoid additional increases to the size of fib6_nh for a 1-bit flag,
use the lowest bit in the allocated memory pointer for the flushed flag.
Add helpers for retrieving the bucket pointer to mask off the flag.

The cleanup of the exception bucket is moved to fib6_nh_release.

fib6_nh_flush_exceptions can now be called from 2 contexts:
1. deleting a fib entry
2. deleting a fib6_nh

For 1., fib6_nh_flush_exceptions is called for a specific fib6_info that
is getting deleted. All exceptions in the cache using the entry are
deleted. For 2, the fib6_nh itself is getting destroyed so
fib6_nh_flush_exceptions is called for a NULL fib6_info which means
flush all entries.

The pmtu.sh selftest exercises the affected code paths - from creating
exceptions to cleaning them up on device delete. All tests pass without
any rcu locking or memleak warnings.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 13:26:44 -07:00
David Ahern
c0b220cf7d ipv6: Refactor exception functions
Before moving exception bucket from fib6_info to fib6_nh, refactor
rt6_flush_exceptions, rt6_remove_exception_rt, rt6_mtu_change_route,
and rt6_update_exception_stamp_rt. In all 3 cases, move the primary
logic into a new helper that starts with fib6_nh_. The latter 3
functions still take a fib6_info; this will be changed to fib6_nh
in the next patch.

In the case of rt6_mtu_change_route, move the fib6_metric_locked
out as a standalone check - no need to call the new function if
the fib entry has the mtu locked. Also, add fib6_info to
rt6_mtu_change_arg as a way of passing the fib entry to the new
helper.

No functional change intended. The goal here is to make the next
patch easier to review by moving existing lookup logic for each to
new helpers.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 13:26:44 -07:00
David Ahern
7d88d8b557 ipv6: Refactor fib6_drop_pcpu_from
Move the existing pcpu walk in fib6_drop_pcpu_from to a new
helper, __fib6_drop_pcpu_from, that can be invoked per fib6_nh with a
reference to the from entries that need to be evicted. If the passed
in 'from' is non-NULL then only entries associated with that fib6_info
are removed (e.g., case where fib entry is deleted); if the 'from' is
NULL are entries are flushed (e.g., fib6_nh is deleted).

For fib6_info entries with builtin fib6_nh (ie., current code) there
is no change in behavior.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 13:26:44 -07:00
David Ahern
f40b6ae2b6 ipv6: Move pcpu cached routes to fib6_nh
rt6_info are specific instances of a fib entry and are tied to a
device and gateway - ie., a nexthop. Before nexthop objects, IPv6 fib
entries have separate fib6_info for each nexthop in a multipath route,
so the location of the pcpu cache in the fib6_info struct worked.
However, with nexthop objects a fib6_info can point to a set of nexthops
(yet another alignment of ipv6 with ipv4). Accordingly, the pcpu
cache needs to be moved to the fib6_nh struct so the cached entries
are local to the nexthop specification used to create the rt6_info.

Initialization and free of the pcpu entries moved to fib6_nh_init and
fib6_nh_release.

Change in location only, from fib6_info down to fib6_nh; no other
functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 13:26:44 -07:00
Thomas Gleixner
47505b8bcf treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104
Based on 1 normalized pattern(s):

  this sctp implementation is free software you can redistribute it
  and or modify it under the terms of the gnu general public license
  as published by the free software foundation either version 2 or at
  your option any later version this sctp implementation is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with gnu cc see the file copying if not see
  http www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 42 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091649.683323110@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:39:00 +02:00
Thomas Gleixner
8e8caf97b2 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 103
Based on 1 normalized pattern(s):

  the sctp implementation is free software you can redistribute it and
  or modify it under the terms of the gnu general public license as
  published by the free software foundation either version 2 or at
  your option any later version the sctp implementation is distributed
  in the hope that it will be useful but without any warranty without
  even the implied warranty of merchantability or fitness for a
  particular purpose see the gnu general public license for more
  details you should have received a copy of the gnu general public
  license along with gnu cc see the file copying if not see http www
  gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091649.592169384@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:39:00 +02:00
Thomas Gleixner
f17b7eec80 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 93
Based on 1 normalized pattern(s):

  this code is free software you can redistribute it and or modify it
  under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not see http www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520075212.233647300@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:37:53 +02:00
Thomas Gleixner
e97150df89 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 77
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation or any later at your
  option

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 5 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520075210.769496418@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:37:51 +02:00
Thomas Gleixner
1b402f7ebf treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 72
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 or
  any later at your option as published by the free software
  foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520071859.749329557@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:36:47 +02:00
Thomas Gleixner
a85036f66f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 62
Based on 1 normalized pattern(s):

  released under the gpl version 2 or later

and 1 additional normalized pattern(s):

  this program is free software you can redistribute it and or
  modify it under the terms of the gnu general public license
  as published by the free software foundation either version
  2 of the license or at your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520071858.828691433@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:36:46 +02:00
Thomas Gleixner
74ba9207e1 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 61
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not write to the free software foundation inc
  675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 441 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520071858.739733335@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:36:45 +02:00
Thomas Gleixner
e14ca97836 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 53
Based on 1 normalized pattern(s):

  this code may be copied under the gpl v 2 or at your option any
  later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520071858.029737698@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:36:42 +02:00
Thomas Gleixner
ee5d8f4d89 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 41
Based on 1 normalized pattern(s):

  this module is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 18 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.008906948@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:27:12 +02:00
Thomas Gleixner
b4d0d230cc treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public licence as published by
  the free software foundation either version 2 of the licence or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 114 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170857.552531963@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:27:11 +02:00
Manikanta Pubbisetty
33d915d9e8 {nl,mac}80211: allow 4addr AP operation on crypto controlled devices
As per the current design, in the case of sw crypto controlled devices,
it is the device which advertises the support for AP/VLAN iftype based
on it's ability to tranmsit packets encrypted in software
(In VLAN functionality, group traffic generated for a specific
VLAN group is always encrypted in software). Commit db3bdcb9c3
("mac80211: allow AP_VLAN operation on crypto controlled devices")
has introduced this change.

Since 4addr AP operation also uses AP/VLAN iftype, this conditional
way of advertising AP/VLAN support has broken 4addr AP mode operation on
crypto controlled devices which do not support VLAN functionality.

In the case of ath10k driver, not all firmwares have support for VLAN
functionality but all can support 4addr AP operation. Because AP/VLAN
support is not advertised for these devices, 4addr AP operations are
also blocked.

Fix this by allowing 4addr operation on devices which do not support
AP/VLAN iftype but can support 4addr AP operation (decision is based on
the wiphy flag WIPHY_FLAG_4ADDR_AP).

Cc: stable@vger.kernel.org
Fixes: db3bdcb9c3 ("mac80211: allow AP_VLAN operation on crypto controlled devices")
Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-24 11:26:54 +02:00
John Crispin
25d16d124a mac80211: fix rate reporting inside cfg80211_calculate_bitrate_he()
The reported rate is not scaled down correctly. After applying this patch,
the function will behave just like the v/ht equivalents.

Signed-off-by: Shashidhar Lakkavalli <slakkavalli@datto.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-24 11:26:44 +02:00
YueHaibing
818e9dfa2c mac80211: remove set but not used variable 'old'
Fixes gcc '-Wunused-but-set-variable' warning:

net/mac80211/key.c: In function 'ieee80211_set_tx_key':
net/mac80211/key.c:271:24: warning:
 variable 'old' set but not used [-Wunused-but-set-variable]

It is not used since introduction in
commit 96fc6efb9a ("mac80211: IEEE 802.11 Extended Key ID support")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-24 11:26:44 +02:00
Yu Wang
79c92ca42b mac80211: handle deauthentication/disassociation from TDLS peer
When receiving a deauthentication/disassociation frame from a TDLS
peer, a station should not disconnect the current AP, but only
disable the current TDLS link if it's enabled.

Without this change, a TDLS issue can be reproduced by following the
steps as below:

1. STA-1 and STA-2 are connected to AP, bidirection traffic is running
   between STA-1 and STA-2.
2. Set up TDLS link between STA-1 and STA-2, stay for a while, then
   teardown TDLS link.
3. Repeat step #2 and monitor the connection between STA and AP.

During the test, one STA may send a deauthentication/disassociation
frame to another, after TDLS teardown, with reason code 6/7, which
means: Class 2/3 frame received from nonassociated STA.

On receive this frame, the receiver STA will disconnect the current
AP and then reconnect. It's not a expected behavior, purpose of this
frame should be disabling the TDLS link, not the link with AP.

Cc: stable@vger.kernel.org
Signed-off-by: Yu Wang <yyuwang@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-24 11:26:44 +02:00
David S. Miller
71e15f76f8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree:

1) Fix crash when dumping rules after conversion to RCU,
   from Florian Westphal.

2) Fix incorrect hook reinjection from nf_queue in case NF_REPEAT,
   from Jagdish Motwani.

3) Fix check for route existence in fib extension, from Phil Sutter.

4) Fix use after free in ip_vs_in() hook, from YueHaibing.

5) Check for veth existence from netfilter selftests,
   from Jeffrin Jose T.

6) Checksum corruption in UDP NAT helpers due to typo,
   from Florian Westphal.

7) Pass up packets to classic forwarding path regardless of
   IPv4 DF bit, patch for the flowtable infrastructure from Florian.

8) Set liberal TCP tracking for flows that are placed in the
   flowtable, in case they need to go back to classic forwarding path,
   also from Florian.

9) Don't add flow with sequence adjustment to flowtable, from Florian.

10) Skip IPv4 options from IPv6 datapath in flowtable, from Florian.

11) Add selftest for the flowtable infrastructure, from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-23 14:45:36 -07:00
Andreas Oetken
d2daa127ed hsr: fix don't prune the master node from the node_db
Don't prune the master node in the hsr_prune_nodes function.
Neither time_in[HSR_PT_SLAVE_A] nor time_in[HSR_PT_SLAVE_B]
will ever be updated by hsr_register_frame_in for the master port.
Thus, the master node will be repeatedly pruned leading to
repeated packet loss.
This bug never appeared because the hsr_prune_nodes function
was only called once. Since commit 5150b45fd3
("net: hsr: Fix node prune function for forget time expiry") this issue
is fixed unveiling the issue described above.

Fixes: 5150b45fd3 ("net: hsr: Fix node prune function for forget time expiry")
Signed-off-by: Andreas Oetken <andreas.oetken@siemens.com>
Tested-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-23 09:29:44 -07:00
Jiri Pirko
136bf27fc0 devlink: add warning in case driver does not set port type
Prevent misbehavior of drivers who would not set port type for longer
period of time. Drivers should always set port type. Do WARN if that
happens.

Note that it is perfectly fine to temporarily not have the type set,
during initialization and port type change.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-23 09:18:43 -07:00
Eric Dumazet
903869bd10 ipv4/igmp: fix build error if !CONFIG_IP_MULTICAST
ip_sf_list_clear_all() needs to be defined even if !CONFIG_IP_MULTICAST

Fixes: 3580d04aa6 ("ipv4/igmp: fix another memory leak in igmpv3_del_delrec()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 22:08:06 -07:00
Eric Dumazet
3580d04aa6 ipv4/igmp: fix another memory leak in igmpv3_del_delrec()
syzbot reported memory leaks [1] that I have back tracked to
a missing cleanup from igmpv3_del_delrec() when
(im->sfmode != MCAST_INCLUDE)

Add ip_sf_list_clear_all() and kfree_pmc() helpers to explicitely
handle the cleanups before freeing.

[1]

BUG: memory leak
unreferenced object 0xffff888123e32b00 (size 64):
  comm "softirq", pid 0, jiffies 4294942968 (age 8.010s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 e0 00 00 01 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000006105011b>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
    [<000000006105011b>] slab_post_alloc_hook mm/slab.h:439 [inline]
    [<000000006105011b>] slab_alloc mm/slab.c:3326 [inline]
    [<000000006105011b>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
    [<000000004bba8073>] kmalloc include/linux/slab.h:547 [inline]
    [<000000004bba8073>] kzalloc include/linux/slab.h:742 [inline]
    [<000000004bba8073>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
    [<000000004bba8073>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
    [<00000000a46a65a0>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
    [<000000005956ca89>] do_ip_setsockopt.isra.0+0x1795/0x1930 net/ipv4/ip_sockglue.c:957
    [<00000000848e2d2f>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
    [<00000000b9db185c>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
    [<000000003028e438>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130
    [<0000000015b65589>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
    [<00000000ac198ef0>] __do_sys_setsockopt net/socket.c:2089 [inline]
    [<00000000ac198ef0>] __se_sys_setsockopt net/socket.c:2086 [inline]
    [<00000000ac198ef0>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
    [<000000000a770437>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    [<00000000d3adb93b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 9c8bb163ae ("igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hangbin Liu <liuhangbin@gmail.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 18:04:17 -07:00
Sunil Muthuswamy
14a1eaa882 hv_sock: perf: loop in send() to maximize bandwidth
Currently, the hv_sock send() iterates once over the buffer, puts data into
the VMBUS channel and returns. It doesn't maximize on the case when there
is a simultaneous reader draining data from the channel. In such a case,
the send() can maximize the bandwidth (and consequently minimize the cpu
cycles) by iterating until the channel is found to be full.

Perf data:
Total Data Transfer: 10GB/iteration
Single threaded reader/writer, Linux hvsocket writer with Windows hvsocket
reader
Packet size: 64KB
CPU sys time was captured using the 'time' command for the writer to send
10GB of data.
'Send Buffer Loop' is with the patch applied.
The values below are over 10 iterations.

|--------------------------------------------------------|
|        |        Current        |   Send Buffer Loop    |
|--------------------------------------------------------|
|        | Throughput | CPU sys  | Throughput | CPU sys  |
|        | (MB/s)     | time (s) | (MB/s)     | time (s) |
|--------------------------------------------------------|
| Min    |     407    |   7.048  |    401     |  5.958   |
|--------------------------------------------------------|
| Max    |     455    |   7.563  |    542     |  6.993   |
|--------------------------------------------------------|
| Avg    |     440    |   7.411  |    451     |  6.639   |
|--------------------------------------------------------|
| Median |     446    |   7.417  |    447     |  6.761   |
|--------------------------------------------------------|

Observation:
1. The avg throughput doesn't really change much with this change for this
scenario. This is most probably because the bottleneck on throughput is
somewhere else.
2. The average system (or kernel) cpu time goes down by 10%+ with this
change, for the same amount of data transfer.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 18:00:36 -07:00
Sunil Muthuswamy
ac383f58f3 hv_sock: perf: Allow the socket buffer size options to influence the actual socket buffers
Currently, the hv_sock buffer size is static and can't scale to the
bandwidth requirements of the application. This change allows the
applications to influence the socket buffer sizes using the SO_SNDBUF and
the SO_RCVBUF socket options.

Few interesting points to note:
1. Since the VMBUS does not allow a resize operation of the ring size, the
socket buffer size option should be set prior to establishing the
connection for it to take effect.
2. Setting the socket option comes with the cost of that much memory being
reserved/allocated by the kernel, for the lifetime of the connection.

Perf data:
Total Data Transfer: 1GB
Single threaded reader/writer
Results below are summarized over 10 iterations.

Linux hvsocket writer + Windows hvsocket reader:
|---------------------------------------------------------------------------------------------|
|Packet size ->   |      128B       |       1KB       |       4KB       |        64KB         |
|---------------------------------------------------------------------------------------------|
|SO_SNDBUF size | |                 Throughput in MB/s (min/max/avg/median):                  |
|               v |                                                                           |
|---------------------------------------------------------------------------------------------|
|      Default    | 109/118/114/116 | 636/774/701/700 | 435/507/480/476 |   410/491/462/470   |
|      16KB       | 110/116/112/111 | 575/705/662/671 | 749/900/854/869 |   592/824/692/676   |
|      32KB       | 108/120/115/115 | 703/823/767/772 | 718/878/850/866 | 1593/2124/2000/2085 |
|      64KB       | 108/119/114/114 | 592/732/683/688 | 805/934/903/911 | 1784/1943/1862/1843 |
|---------------------------------------------------------------------------------------------|

Windows hvsocket writer + Linux hvsocket reader:
|---------------------------------------------------------------------------------------------|
|Packet size ->   |     128B    |      1KB        |          4KB        |        64KB         |
|---------------------------------------------------------------------------------------------|
|SO_RCVBUF size | |               Throughput in MB/s (min/max/avg/median):                    |
|               v |                                                                           |
|---------------------------------------------------------------------------------------------|
|      Default    | 69/82/75/73 | 313/343/333/336 |   418/477/446/445   |   659/701/676/678   |
|      16KB       | 69/83/76/77 | 350/401/375/382 |   506/548/517/516   |   602/624/615/615   |
|      32KB       | 62/83/73/73 | 471/529/496/494 |   830/1046/935/939  | 944/1180/1070/1100  |
|      64KB       | 64/70/68/69 | 467/533/501/497 | 1260/1590/1430/1431 | 1605/1819/1670/1660 |
|---------------------------------------------------------------------------------------------|

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 18:00:14 -07:00
David Ahern
31680ac265 ipv6: Fix redirect with VRF
IPv6 redirect is broken for VRF. __ip6_route_redirect walks the FIB
entries looking for an exact match on ifindex. With VRF the flowi6_oif
is updated by l3mdev_update_flow to the l3mdev index and the
FLOWI_FLAG_SKIP_NH_OIF set in the flags to tell the lookup to skip the
device match. For redirects the device match is requires so use that
flag to know when the oif needs to be reset to the skb device index.

Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:58:21 -07:00
David Ahern
fc651001d2 neighbor: Add tracepoint to __neigh_create
Add tracepoint to __neigh_create to enable debugging of new entries.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:50:24 -07:00
David Ahern
75425657fe net: Set strict_start_type for routes and rules
New userspace on an older kernel can send unknown and unsupported
attributes resulting in an incompelete config which is almost
always wrong for routing (few exceptions are passthrough settings
like the protocol that installed the route).

Set strict_start_type in the policies for IPv4 and IPv6 routes and
rules to detect new, unsupported attributes and fail the route add.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:50:24 -07:00
David Ahern
06c77c3e67 ipv4: Rename and export nh_update_mtu
Rename nh_update_mtu to fib_nhc_update_mtu and export for use by the
nexthop code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:44 -07:00
David Ahern
c3669486b5 ipv4: export fib_info_update_nh_saddr
Add scope as input argument versus relying on fib_info reference in
fib_nh, and export fib_info_update_nh_saddr.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:44 -07:00
David Ahern
9bd8366792 ipv4: export fib_flush
As nexthops are deleted, fib entries referencing it are marked dead.
Export fib_flush so those entries can be removed in a timely manner.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:44 -07:00
David Ahern
ac1fab2d13 ipv4: export fib_check_nh
Change fib_check_nh to take net, table and scope as input arguments
over struct fib_config and export for use by nexthop code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:44 -07:00
David Ahern
1bff1a0c9b ipv4: Add function to send route updates
Add fib_info_notify_update to walk the fib and send RTM_NEWROUTE
notifications with NLM_F_REPLACE set for entries linked to a fib_info
that have nh_updated flag set. This helper will be used by the nexthop
code to notify userspace of routes that are impacted when a nexthop
config is updated via replace. The new function and its helper are
similar to how fib_flush and fib_table_flush work for address delete
and link down events.

This notification is needed for legacy apps that do not understand
the new nexthop object. Apps that are nexthop aware can use the
RTA_NH_ID attribute in the route notification to just ignore it.

In the future this should be wrapped in a sysctl to allow OS'es that
are fully updated to avoid the notificaton storm.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:44 -07:00
David Ahern
19a3b7eea4 ipv6: export function to send route updates
Add fib6_rt_update to send RTM_NEWROUTE with NLM_F_REPLACE set. This
helper will be used by the nexthop code to notify userspace of routes
that are impacted when a nexthop config is updated via replace.

This notification is needed for legacy apps that do not understand
the new nexthop object. Apps that are nexthop aware can use the
RTA_NH_ID attribute in the route notification to just ignore it.

In the future this should be wrapped in a sysctl to allow OS'es that
are fully updated to avoid the notificaton storm.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:43 -07:00
David Ahern
cdaa16a4f7 ipv6: Add hook to bump sernum for a route to stubs
Add hook to ipv6 stub to bump the sernum up to the root node for a
route. This is needed by the nexthop code when a nexthop config changes.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:43 -07:00
David Ahern
68a9b13d92 ipv6: Add delete route hook to stubs
Add ip6_del_rt to the IPv6 stub. The hook is needed by the nexthop
code to remove entries linked to a nexthop that is getting deleted.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:43 -07:00
Jakub Kicinski
c3f4a6c39c net/tls: don't ignore netdev notifications if no TLS features
On device surprise removal path (the notifier) we can't
bail just because the features are disabled.  They may
have been enabled during the lifetime of the device.
This bug leads to leaking netdev references and
use-after-frees if there are active connections while
device features are cleared.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 12:21:44 -07:00
Jakub Kicinski
3686637e50 net/tls: fix state removal with feature flags off
TLS offload drivers shouldn't (and currently don't) block
the TLS offload feature changes based on whether there are
active offloaded connections or not.

This seems to be a good idea, because we want the admin to
be able to disable the TLS offload at any time, and there
is no clean way of disabling it for active connections
(TX side is quite problematic).  So if features are cleared
existing connections will stay offloaded until they close,
and new connections will not attempt offload to a given
device.

However, the offload state removal handling is currently
broken if feature flags get cleared while there are
active TLS offloads.

RX side will completely bail from cleanup, even on normal
remove path, leaving device state dangling, potentially
causing issues when the 5-tuple is reused.  It will also
fail to release the netdev reference.

Remove the RX-side warning message, in next release cycle
it should be printed when features are disabled, rather
than when connection dies, but for that we need a more
efficient method of finding connection of a given netdev
(a'la BPF offload code).

Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 12:21:44 -07:00
Jakub Kicinski
38030d7cb7 net/tls: avoid NULL-deref on resync during device removal
When netdev with active kTLS sockets in unregistered
notifier callback walks the offloaded sockets and
cleans up offload state.  RX data may still be processed,
however, and if resync was requested prior to device
removal we would hit a NULL pointer dereference on
ctx->netdev use.

Make sure resync is under the device offload lock
and NULL-check the netdev pointer.

This should be safe, because the pointer is set to
NULL either in the netdev notifier (under said lock)
or when socket is completely dead and no resync can
happen.

The other access to ctx->netdev in tls_validate_xmit_skb()
does not dereference the pointer, it just checks it against
other device pointer, so it should be pretty safe (perhaps
we can add a READ_ONCE/WRITE_ONCE there, if paranoid).

Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 12:21:44 -07:00
Maxim Mikityanskiy
7dc2bccab0 Validate required parameters in inet6_validate_link_af
inet6_set_link_af requires that at least one of IFLA_INET6_TOKEN or
IFLA_INET6_ADDR_GET_MODE is passed. If none of them is passed, it
returns -EINVAL, which may cause do_setlink() to fail in the middle of
processing other commands and give the following warning message:

  A link change request failed with some changes committed already.
  Interface eth0 may have been left with an inconsistent configuration,
  please check.

Check the presence of at least one of them in inet6_validate_link_af to
detect invalid parameters at an early stage, before do_setlink does
anything. Also validate the address generation mode at an early stage.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 12:07:25 -07:00
Felipe Gasper
cae9910e73 net: Add UNIX_DIAG_UID to Netlink UNIX socket diagnostics.
This adds the ability for Netlink to report a socket's UID along with the
other UNIX diagnostic information that is already available. This will
allow diagnostic tools greater insight into which users control which
socket.

To test this, do the following as a non-root user:

    unshare -U -r bash
    nc -l -U user.socket.$$ &

.. and verify from within that same session that Netlink UNIX socket
diagnostics report the socket's UID as 0. Also verify that Netlink UNIX
socket diagnostics report the socket's UID as the user's UID from an
unprivileged process in a different session. Verify the same from
a root process.

Signed-off-by: Felipe Gasper <felipe@felipegasper.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 10:36:35 -07:00
Linus Torvalds
f75b6f303b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Clear up some recent tipc regressions because of registration
    ordering. Fix from Junwei Hu.

 2) tipc's TLV_SET() can read past the end of the supplied buffer during
    the copy. From Chris Packham.

 3) ptp example program doesn't match the kernel, from Richard Cochran.

 4) Outgoing message type fix in qrtr, from Bjorn Andersson.

 5) Flow control regression in stmmac, from Tan Tee Min.

 6) Fix inband autonegotiation in phylink, from Russell King.

 7) Fix sk_bound_dev_if handling in rawv6_bind(), from Mike Manning.

 8) Fix usbnet crash after disconnect, from Kloetzke Jan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
  usbnet: fix kernel crash after disconnect
  selftests: fib_rule_tests: use pre-defined DEV_ADDR
  net-next: net: Fix typos in ip-sysctl.txt
  ipv6: Consider sk_bound_dev_if when binding a raw socket to an address
  net: phylink: ensure inband AN works correctly
  usbnet: ipheth: fix racing condition
  net: stmmac: dma channel control register need to be init first
  net: stmmac: fix ethtool flow control not able to get/set
  net: qrtr: Fix message type of outgoing packets
  networking: : fix typos in code comments
  ptp: Fix example program to match kernel.
  fddi: fix typos in code comments
  selftests: fib_rule_tests: enable forwarding before ipv4 from/iif test
  selftests: fib_rule_tests: fix local IPv4 address typo
  tipc: Avoid copying bytes beyond the supplied data
  2/2] net: xilinx_emaclite: use readx_poll_timeout() in mdio wait function
  1/2] net: axienet: use readx_poll_timeout() in mdio wait function
  vlan: Mark expected switch fall-through
  macvlan: Mark expected switch fall-through
  net/mlx4_en: ethtool, Remove unsupported SFP EEPROM high pages query
  ...
2019-05-22 08:28:16 -07:00
Florian Westphal
69aeb53858 netfilter: nft_flow_offload: IPCB is only valid for ipv4 family
Guard this with a check vs. ipv4, IPCB isn't valid in ipv6 case.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-22 10:56:06 +02:00
Florian Westphal
91a9048f23 netfilter: nft_flow_offload: don't offload when sequence numbers need adjustment
We can't deal with tcp sequence number rewrite in flow_offload.
While at it, simplify helper check, we only need to know if the extension
is present, we don't need the helper data.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-22 10:51:49 +02:00
Florian Westphal
8437a6209f netfilter: nft_flow_offload: set liberal tracking mode for tcp
Without it, whenever a packet has to be pushed up the stack (e.g. because
of mtu mismatch), then conntrack will flag packets as invalid, which in
turn breaks NAT.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-22 10:51:49 +02:00
Florian Westphal
e75b3e1c9b netfilter: nf_flow_table: ignore DF bit setting
Its irrelevant if the DF bit is set or not, we must pass packet to
stack in either case.

If the DF bit is set, we must pass it to stack so the appropriate
ICMP error can be generated.

If the DF is not set, we must pass it to stack for fragmentation.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-22 10:51:49 +02:00
Jeremy Sowden
17f78dd1bd batman-adv: fix for leaked TVLV handler.
A handler for BATADV_TVLV_ROAM was being registered when the
translation-table was initialized, but not unregistered when the
translation-table was freed.  Unregister it.

Fixes: 122edaa059 ("batman-adv: tvlv - convert roaming adv packet to use tvlv unicast packets")
Reported-by: syzbot+d454a826e670502484b8@syzkaller.appspotmail.com
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-05-21 23:01:24 +02:00
Simon Wunderlich
eef98918c8 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-05-21 23:00:54 +02:00
Mike Manning
72f7cfab6f ipv6: Consider sk_bound_dev_if when binding a raw socket to an address
IPv6 does not consider if the socket is bound to a device when binding
to an address. The result is that a socket can be bound to eth0 and
then bound to the address of eth1. If the device is a VRF, the result
is that a socket can only be bound to an address in the default VRF.

Resolve by considering the device if sk_bound_dev_if is set.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-21 13:12:39 -07:00
Linus Torvalds
2c1212de6f SPDX update for 5.2-rc2, round 1
Here are series of patches that add SPDX tags to different kernel files,
 based on two different things:
   - SPDX entries are added to a bunch of files that we missed a year ago
     that do not have any license information at all.
 
     These were either missed because the tool saw the MODULE_LICENSE()
     tag, or some EXPORT_SYMBOL tags, and got confused and thought the
     file had a real license, or the files have been added since the last
     big sweep, or they were Makefile/Kconfig files, which we didn't
     touch last time.
 
   - Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
     tools can determine the license text in the file itself.  Where this
     happens, the license text is removed, in order to cut down on the
     700+ different ways we have in the kernel today, in a quest to get
     rid of all of these.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 The reason for these "large" patches is if we were to continue to
 progress at the current rate of change in the kernel, adding license
 tags to individual files in different subsystems, we would be finished
 in about 10 years at the earliest.
 
 There will be more series of these types of patches coming over the next
 few weeks as the tools and reviewers crunch through the more "odd"
 variants of how to say "GPLv2" that developers have come up with over
 the years, combined with other fun oddities (GPL + a BSD disclaimer?)
 that are being unearthed, with the goal for the whole kernel to be
 cleaned up.
 
 These diffstats are not small, 3840 files are touched, over 10k lines
 removed in just 24 patches.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOP8uw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynmGQCgy3evqzleuOITDpuWaxewFdHqiJYAnA7KRw4H
 1KwtfRnMtG6dk/XaS7H7
 =O9lH
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull SPDX update from Greg KH:
 "Here is a series of patches that add SPDX tags to different kernel
  files, based on two different things:

   - SPDX entries are added to a bunch of files that we missed a year
     ago that do not have any license information at all.

     These were either missed because the tool saw the MODULE_LICENSE()
     tag, or some EXPORT_SYMBOL tags, and got confused and thought the
     file had a real license, or the files have been added since the
     last big sweep, or they were Makefile/Kconfig files, which we
     didn't touch last time.

   - Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
     tools can determine the license text in the file itself. Where this
     happens, the license text is removed, in order to cut down on the
     700+ different ways we have in the kernel today, in a quest to get
     rid of all of these.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers.

  The reason for these "large" patches is if we were to continue to
  progress at the current rate of change in the kernel, adding license
  tags to individual files in different subsystems, we would be finished
  in about 10 years at the earliest.

  There will be more series of these types of patches coming over the
  next few weeks as the tools and reviewers crunch through the more
  "odd" variants of how to say "GPLv2" that developers have come up with
  over the years, combined with other fun oddities (GPL + a BSD
  disclaimer?) that are being unearthed, with the goal for the whole
  kernel to be cleaned up.

  These diffstats are not small, 3840 files are touched, over 10k lines
  removed in just 24 patches"

* tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (24 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 23
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 22
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 21
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 20
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 19
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 18
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 17
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 15
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 14
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 12
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 11
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 10
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 9
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 7
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 5
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 4
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 3
  ...
2019-05-21 12:33:38 -07:00
Leon Romanovsky
eaa1ca9cf9 rds: Don't check return value from destroy CQ
There is no value in checking ib_destroy_cq() result and skipping to clear
struct ic fields. This connection needs to be reinitialized anyway.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21 15:50:53 -03:00
Florian Westphal
6bac76db1d netfilter: nat: fix udp checksum corruption
Due to copy&paste error nf_nat_mangle_udp_packet passes IPPROTO_TCP,
resulting in incorrect udp checksum when payload had to be mangled.

Fixes: dac3fe7259 ("netfilter: nat: remove csum_recalc hook")
Reported-by: Marc Haber <mh+netdev@zugschlus.de>
Tested-by: Marc Haber <mh+netdev@zugschlus.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-21 20:20:40 +02:00
YueHaibing
719c7d563c ipvs: Fix use-after-free in ip_vs_in
BUG: KASAN: use-after-free in ip_vs_in.part.29+0xe8/0xd20 [ip_vs]
Read of size 4 at addr ffff8881e9b26e2c by task sshd/5603

CPU: 0 PID: 5603 Comm: sshd Not tainted 4.19.39+ #30
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
 dump_stack+0x71/0xab
 print_address_description+0x6a/0x270
 kasan_report+0x179/0x2c0
 ip_vs_in.part.29+0xe8/0xd20 [ip_vs]
 ip_vs_in+0xd8/0x170 [ip_vs]
 nf_hook_slow+0x5f/0xe0
 __ip_local_out+0x1d5/0x250
 ip_local_out+0x19/0x60
 __tcp_transmit_skb+0xba1/0x14f0
 tcp_write_xmit+0x41f/0x1ed0
 ? _copy_from_iter_full+0xca/0x340
 __tcp_push_pending_frames+0x52/0x140
 tcp_sendmsg_locked+0x787/0x1600
 ? tcp_sendpage+0x60/0x60
 ? inet_sk_set_state+0xb0/0xb0
 tcp_sendmsg+0x27/0x40
 sock_sendmsg+0x6d/0x80
 sock_write_iter+0x121/0x1c0
 ? sock_sendmsg+0x80/0x80
 __vfs_write+0x23e/0x370
 vfs_write+0xe7/0x230
 ksys_write+0xa1/0x120
 ? __ia32_sys_read+0x50/0x50
 ? __audit_syscall_exit+0x3ce/0x450
 do_syscall_64+0x73/0x200
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7ff6f6147c60
Code: 73 01 c3 48 8b 0d 28 12 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 5d 73 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83
RSP: 002b:00007ffd772ead18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000034 RCX: 00007ff6f6147c60
RDX: 0000000000000034 RSI: 000055df30a31270 RDI: 0000000000000003
RBP: 000055df30a31270 R08: 0000000000000000 R09: 0000000000000000
R10: 00007ffd772ead70 R11: 0000000000000246 R12: 00007ffd772ead74
R13: 00007ffd772eae20 R14: 00007ffd772eae24 R15: 000055df2f12ddc0

Allocated by task 6052:
 kasan_kmalloc+0xa0/0xd0
 __kmalloc+0x10a/0x220
 ops_init+0x97/0x190
 register_pernet_operations+0x1ac/0x360
 register_pernet_subsys+0x24/0x40
 0xffffffffc0ea016d
 do_one_initcall+0x8b/0x253
 do_init_module+0xe3/0x335
 load_module+0x2fc0/0x3890
 __do_sys_finit_module+0x192/0x1c0
 do_syscall_64+0x73/0x200
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 6067:
 __kasan_slab_free+0x130/0x180
 kfree+0x90/0x1a0
 ops_free_list.part.7+0xa6/0xc0
 unregister_pernet_operations+0x18b/0x1f0
 unregister_pernet_subsys+0x1d/0x30
 ip_vs_cleanup+0x1d/0xd2f [ip_vs]
 __x64_sys_delete_module+0x20c/0x300
 do_syscall_64+0x73/0x200
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The buggy address belongs to the object at ffff8881e9b26600 which belongs to the cache kmalloc-4096 of size 4096
The buggy address is located 2092 bytes inside of 4096-byte region [ffff8881e9b26600, ffff8881e9b27600)
The buggy address belongs to the page:
page:ffffea0007a6c800 count:1 mapcount:0 mapping:ffff888107c0e600 index:0x0 compound_mapcount: 0
flags: 0x17ffffc0008100(slab|head)
raw: 0017ffffc0008100 dead000000000100 dead000000000200 ffff888107c0e600
raw: 0000000000000000 0000000080070007 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

while unregistering ipvs module, ops_free_list calls
__ip_vs_cleanup, then nf_unregister_net_hooks be called to
do remove nf hook entries. It need a RCU period to finish,
however net->ipvs is set to NULL immediately, which will
trigger NULL pointer dereference when a packet is hooked
and handled by ip_vs_in where net->ipvs is dereferenced.

Another scene is ops_free_list call ops_free to free the
net_generic directly while __ip_vs_cleanup finished, then
calling ip_vs_in will triggers use-after-free.

This patch moves nf_unregister_net_hooks from __ip_vs_cleanup()
to __ip_vs_dev_cleanup(),  where rcu_barrier() is called by
unregister_pernet_device -> unregister_pernet_operations,
that will do the needed grace period.

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: efe4160618 ("ipvs: convert to use pernet nf_hook api")
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-21 18:31:09 +02:00
Martin KaFai Lau
f7355a6c04 bpf: Check sk_fullsock() before returning from bpf_sk_lookup()
The BPF_FUNC_sk_lookup_xxx helpers return RET_PTR_TO_SOCKET_OR_NULL.
Meaning a fullsock ptr and its fullsock's fields in bpf_sock can be
accessed, e.g. type, protocol, mark and priority.
Some new helper, like bpf_sk_storage_get(), also expects
ARG_PTR_TO_SOCKET is a fullsock.

bpf_sk_lookup() currently calls sk_to_full_sk() before returning.
However, the ptr returned from sk_to_full_sk() is not guaranteed
to be a fullsock.  For example, it cannot get a fullsock if sk
is in TCP_TIME_WAIT.

This patch checks for sk_fullsock() before returning. If it is not
a fullsock, sock_gen_put() is called if needed and then returns NULL.

Fixes: 6acc9b432e ("bpf: Add helper to retrieve socket in BPF")
Cc: Joe Stringer <joe@isovalent.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Joe Stringer <joe@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-21 16:33:30 +02:00
Lorenz Bauer
9b28ae243e bpf: fix out-of-bounds read in __bpf_skc_lookup
__bpf_skc_lookup takes a socket tuple and the length of the
tuple as an argument. Based on the length, it decides which
address family to pass to the helper function sk_lookup.

In case of AF_INET6, it fails to verify that the length
of the tuple is long enough. sk_lookup may therefore access
data past the end of the tuple.

Fixes: 6acc9b432e ("bpf: Add helper to retrieve socket in BPF")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-21 16:25:09 +02:00
Phil Sutter
e633508a95 netfilter: nft_fib: Fix existence check support
NFTA_FIB_F_PRESENT flag was not always honored since eval functions did
not call nft_fib_store_result in all cases.

Given that in all callsites there is a struct net_device pointer
available which holds the interface data to be stored in destination
register, simplify nft_fib_store_result() to just accept that pointer
instead of the nft_pktinfo pointer and interface index. This also
allows to drop the index to interface lookup previously needed to get
the name associated with given index.

Fixes: 055c4b34b9 ("netfilter: nft_fib: Support existence check")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-21 16:10:38 +02:00
Jagdish Motwani
946c0d8e6e netfilter: nf_queue: fix reinject verdict handling
This patch fixes netfilter hook traversal when there are more than 1 hooks
returning NF_QUEUE verdict. When the first queue reinjects the packet,
'nf_reinject' starts traversing hooks with a proper hook_index. However,
if it again receives a NF_QUEUE verdict (by some other netfilter hook), it
queues the packet with a wrong hook_index. So, when the second queue
reinjects the packet, it re-executes hooks in between.

Fixes: 960632ece6 ("netfilter: convert hook list to an array")
Signed-off-by: Jagdish Motwani <jagdish.motwani@sophos.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-21 16:10:30 +02:00
Thomas Gleixner
fd9871f70c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or any
  later version this program is distributed in the hope that it will
  be useful but without any warranty without even the implied warranty
  of merchantability or fitness for a particular purpose see the gnu
  general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 50 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154042.917228456@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:52:39 +02:00
Thomas Gleixner
1ccea77e2a treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not see http www gnu org licenses

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details [based]
  [from] [clk] [highbank] [c] you should have received a copy of the
  gnu general public license along with this program if not see http
  www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 355 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.837383322@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:28:45 +02:00
Thomas Gleixner
ef65e43121 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 11
Based on 1 normalized pattern(s):

  this program is free software you can distribute it and or modify it
  under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.622608495@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:28:45 +02:00
Thomas Gleixner
3a63cbb8db treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 3
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 or
  later as published by the free software foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 9 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154040.848507137@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:28:40 +02:00
Thomas Gleixner
1621633323 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 1
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not write to the free software foundation inc
  51 franklin street fifth floor boston ma 02110 1301 usa

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option [no]_[pad]_[ctrl] any later version this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 51 franklin street fifth floor boston ma
  02110 1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 176 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154040.652910950@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:28:39 +02:00
Thomas Gleixner
ec8f24b7fa treewide: Add SPDX license identifier - Makefile/Kconfig
Add SPDX license identifiers to all Make/Kconfig files which:

 - Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:46 +02:00
Thomas Gleixner
09c434b8a0 treewide: Add SPDX license identifier for more missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have MODULE_LICENCE("GPL*") inside which was used in the initial
   scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Thomas Gleixner
457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Bjorn Andersson
7036e62145 net: qrtr: Fix message type of outgoing packets
QRTR packets has a message type in the header, which is repeated in the
control header. For control packets we therefor copy the type from
beginning of the outgoing payload and use that as message type.

For non-control messages an endianness fix introduced in v5.2-rc1 caused the
type to be 0, rather than QRTR_TYPE_DATA, causing all messages to be dropped by
the receiver. Fix this by converting and using qrtr_type, which will remain
QRTR_TYPE_DATA for non-control messages.

Fixes: 8f5e24514c ("net: qrtr: use protocol endiannes variable")
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-20 20:50:31 -04:00
Gustavo A. R. Silva
fa2c52be71 vlan: Mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.

This patch fixes the following warning:

net/8021q/vlan_dev.c: In function ‘vlan_dev_ioctl’:
net/8021q/vlan_dev.c:374:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (!net_eq(dev_net(dev), &init_net))
      ^
net/8021q/vlan_dev.c:376:2: note: here
  case SIOCGMIIPHY:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-20 11:38:55 -07:00
Junwei Hu
526f5b851a tipc: fix modprobe tipc failed after switch order of device registration
Error message printed:
modprobe: ERROR: could not insert 'tipc': Address family not
supported by protocol.
when modprobe tipc after the following patch: switch order of
device registration, commit 7e27e8d613
("tipc: switch order of device registration to fix a crash")

Because sock_create_kern(net, AF_TIPC, ...) called by
tipc_topsrv_create_listener() in the initialization process
of tipc_init_net(), so tipc_socket_init() must be execute before that.
Meanwhile, tipc_net_id need to be initialized when sock_create()
called, and tipc_socket_init() is no need to be called for each namespace.

I add a variable tipc_topsrv_net_ops, and split the
register_pernet_subsys() of tipc into two parts, and split
tipc_socket_init() with initialization of pernet params.

By the way, I fixed resources rollback error when tipc_bcast_init()
failed in tipc_init_net().

Fixes: 7e27e8d613 ("tipc: switch order of device registration to fix a crash")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Reported-by: Wang Wang <wangwang2@huawei.com>
Reported-by: syzbot+1e8114b61079bfe9cbc5@syzkaller.appspotmail.com
Reviewed-by: Kang Zhou <zhoukang7@huawei.com>
Reviewed-by: Suanming Mou <mousuanming@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-20 10:45:43 -07:00
Florian Westphal
2c82c7e724 netfilter: nf_tables: fix oops during rule dump
We can oops in nf_tables_fill_rule_info().

Its not possible to fetch previous element in rcu-protected lists
when deletions are not prevented somehow: list_del_rcu poisons
the ->prev pointer value.

Before rcu-conversion this was safe as dump operations did hold
nfnetlink mutex.

Pass previous rule as argument, obtained by keeping a pointer to
the previous rule during traversal.

Fixes: d9adf22a29 ("netfilter: nf_tables: use call_rcu in netlink dumps")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-20 19:45:23 +02:00
Linus Torvalds
78e0365184 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:1) Use after free in __dev_map_entry_free(), from Eric Dumazet.

 1) Use after free in __dev_map_entry_free(), from Eric Dumazet.

 2) Fix TCP retransmission timestamps on passive Fast Open, from Yuchung
    Cheng.

 3) Orphan NFC, we'll take the patches directly into my tree. From
    Johannes Berg.

 4) We can't recycle cloned TCP skbs, from Eric Dumazet.

 5) Some flow dissector bpf test fixes, from Stanislav Fomichev.

 6) Fix RCU marking and warnings in rhashtable, from Herbert Xu.

 7) Fix some potential fib6 leaks, from Eric Dumazet.

 8) Fix a _decode_session4 uninitialized memory read bug fix that got
    lost in a merge. From Florian Westphal.

 9) Fix ipv6 source address routing wrt. exception route entries, from
    Wei Wang.

10) The netdev_xmit_more() conversion was not done %100 properly in mlx5
    driver, fix from Tariq Toukan.

11) Clean up botched merge on netfilter kselftest, from Florian
    Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (74 commits)
  of_net: fix of_get_mac_address retval if compiled without CONFIG_OF
  net: fix kernel-doc warnings for socket.c
  net: Treat sock->sk_drops as an unsigned int when printing
  kselftests: netfilter: fix leftover net/net-next merge conflict
  mlxsw: core: Prevent reading unsupported slave address from SFP EEPROM
  mlxsw: core: Prevent QSFP module initialization for old hardware
  vsock/virtio: Initialize core virtio vsock before registering the driver
  net/mlx5e: Fix possible modify header actions memory leak
  net/mlx5e: Fix no rewrite fields with the same match
  net/mlx5e: Additional check for flow destination comparison
  net/mlx5e: Add missing ethtool driver info for representors
  net/mlx5e: Fix number of vports for ingress ACL configuration
  net/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled
  net/mlx5e: Fix wrong xmit_more application
  net/mlx5: Fix peer pf disable hca command
  net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index
  net/mlx5: Add meaningful return codes to status_to_err function
  net/mlx5: Imply MLXFW in mlx5_core
  Revert "tipc: fix modprobe tipc failed after switch order of device registration"
  vsock/virtio: free packets during the socket release
  ...
2019-05-20 08:21:07 -07:00
Randy Dunlap
85806af0c6 net: fix kernel-doc warnings for socket.c
Fix kernel-doc warnings by moving the kernel-doc notation to be
immediately above the functions that it describes.

Fixes these warnings for sock_sendmsg() and sock_recvmsg():

../net/socket.c:658: warning: Excess function parameter 'sock' description in 'INDIRECT_CALLABLE_DECLARE'
../net/socket.c:658: warning: Excess function parameter 'msg' description in 'INDIRECT_CALLABLE_DECLARE'
../net/socket.c:889: warning: Excess function parameter 'sock' description in 'INDIRECT_CALLABLE_DECLARE'
../net/socket.c:889: warning: Excess function parameter 'msg' description in 'INDIRECT_CALLABLE_DECLARE'
../net/socket.c:889: warning: Excess function parameter 'flags' description in 'INDIRECT_CALLABLE_DECLARE'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-19 10:33:22 -07:00
Patrick Talbert
ea9a03791a net: Treat sock->sk_drops as an unsigned int when printing
Currently, procfs socket stats format sk_drops as a signed int (%d). For large
values this will cause a negative number to be printed.

We know the drop count can never be a negative so change the format specifier to
%u.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-19 10:31:10 -07:00
Jorge E. Moreira
ba95e5dfd3 vsock/virtio: Initialize core virtio vsock before registering the driver
Avoid a race in which static variables in net/vmw_vsock/af_vsock.c are
accessed (while handling interrupts) before they are initialized.

[    4.201410] BUG: unable to handle kernel paging request at ffffffffffffffe8
[    4.207829] IP: vsock_addr_equals_addr+0x3/0x20
[    4.211379] PGD 28210067 P4D 28210067 PUD 28212067 PMD 0
[    4.211379] Oops: 0000 [#1] PREEMPT SMP PTI
[    4.211379] Modules linked in:
[    4.211379] CPU: 1 PID: 30 Comm: kworker/1:1 Not tainted 4.14.106-419297-gd7e28cc1f241 #1
[    4.211379] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[    4.211379] Workqueue: virtio_vsock virtio_transport_rx_work
[    4.211379] task: ffffa3273d175280 task.stack: ffffaea1800e8000
[    4.211379] RIP: 0010:vsock_addr_equals_addr+0x3/0x20
[    4.211379] RSP: 0000:ffffaea1800ebd28 EFLAGS: 00010286
[    4.211379] RAX: 0000000000000002 RBX: 0000000000000000 RCX: ffffffffb94e42f0
[    4.211379] RDX: 0000000000000400 RSI: ffffffffffffffe0 RDI: ffffaea1800ebdd0
[    4.211379] RBP: ffffaea1800ebd58 R08: 0000000000000001 R09: 0000000000000001
[    4.211379] R10: 0000000000000000 R11: ffffffffb89d5d60 R12: ffffaea1800ebdd0
[    4.211379] R13: 00000000828cbfbf R14: 0000000000000000 R15: ffffaea1800ebdc0
[    4.211379] FS:  0000000000000000(0000) GS:ffffa3273fd00000(0000) knlGS:0000000000000000
[    4.211379] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    4.211379] CR2: ffffffffffffffe8 CR3: 000000002820e001 CR4: 00000000001606e0
[    4.211379] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    4.211379] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    4.211379] Call Trace:
[    4.211379]  ? vsock_find_connected_socket+0x6c/0xe0
[    4.211379]  virtio_transport_recv_pkt+0x15f/0x740
[    4.211379]  ? detach_buf+0x1b5/0x210
[    4.211379]  virtio_transport_rx_work+0xb7/0x140
[    4.211379]  process_one_work+0x1ef/0x480
[    4.211379]  worker_thread+0x312/0x460
[    4.211379]  kthread+0x132/0x140
[    4.211379]  ? process_one_work+0x480/0x480
[    4.211379]  ? kthread_destroy_worker+0xd0/0xd0
[    4.211379]  ret_from_fork+0x35/0x40
[    4.211379] Code: c7 47 08 00 00 00 00 66 c7 07 28 00 c7 47 08 ff ff ff ff c7 47 04 ff ff ff ff c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 8b 47 08 <3b> 46 08 75 0a 8b 47 04 3b 46 04 0f 94 c0 c3 31 c0 c3 90 66 2e
[    4.211379] RIP: vsock_addr_equals_addr+0x3/0x20 RSP: ffffaea1800ebd28
[    4.211379] CR2: ffffffffffffffe8
[    4.211379] ---[ end trace f31cc4a2e6df3689 ]---
[    4.211379] Kernel panic - not syncing: Fatal exception in interrupt
[    4.211379] Kernel Offset: 0x37000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    4.211379] Rebooting in 5 seconds..

Fixes: 22b5c0b63f ("vsock/virtio: fix kernel panic after device hot-unplug")
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Cc: kernel-team@android.com
Cc: stable@vger.kernel.org [4.9+]
Signed-off-by: Jorge E. Moreira <jemoreira@google.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-18 10:50:28 -07:00
Masahiro Yamada
9cc342f6c4 treewide: prefix header search paths with $(srctree)/
Currently, the Kbuild core manipulates header search paths in a crazy
way [1].

To fix this mess, I want all Makefiles to add explicit $(srctree)/ to
the search paths in the srctree. Some Makefiles are already written in
that way, but not all. The goal of this work is to make the notation
consistent, and finally get rid of the gross hacks.

Having whitespaces after -I does not matter since commit 48f6e3cf5b
("kbuild: do not drop -I without parameter").

[1]: https://patchwork.kernel.org/patch/9632347/

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-05-18 11:49:57 +09:00
David S. Miller
5593530e56 Revert "tipc: fix modprobe tipc failed after switch order of device registration"
This reverts commit 532b0f7ece.

More revisions coming up.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-17 12:15:05 -07:00
Stefano Garzarella
ac03046ece vsock/virtio: free packets during the socket release
When the socket is released, we should free all packets
queued in the per-socket list in order to avoid a memory
leak.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-17 11:46:36 -07:00
Junwei Hu
532b0f7ece tipc: fix modprobe tipc failed after switch order of device registration
Error message printed:
modprobe: ERROR: could not insert 'tipc': Address family not
supported by protocol.
when modprobe tipc after the following patch: switch order of
device registration, commit 7e27e8d613
("tipc: switch order of device registration to fix a crash")

Because sock_create_kern(net, AF_TIPC, ...) is called by
tipc_topsrv_create_listener() in the initialization process
of tipc_net_ops, tipc_socket_init() must be execute before that.

I move tipc_socket_init() into function tipc_init_net().

Fixes: 7e27e8d613
("tipc: switch order of device registration to fix a crash")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Reported-by: Wang Wang <wangwang2@huawei.com>
Reviewed-by: Kang Zhou <zhoukang7@huawei.com>
Reviewed-by: Suanming Mou <mousuanming@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-17 11:33:51 -07:00
swkhack
34dcf6a190 net: caif: fix the value of size argument of snprintf
Because the function snprintf write at most size bytes(including the
null byte).So the value of the argument size need not to minus one.

Signed-off-by: swkhack <swkhack@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-17 11:31:15 -07:00
Linus Torvalds
227747fb9e AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXN2BC/u3V2unywtrAQJm9w/+L7ufbRkj6XGVongmhf4n+auBQXMJ4jec
 zN6bjWrp/SN9kJfOqOKA+sk9s3cCOCV8SF/2eM5P8DJNtrB6aXlg590u1wSkOp99
 FdSM8Fy7v4bTwW9hCBhvcFpC+layVUEv/WAsCCIZi94W+H43XFY4QM79cqoqIx8r
 nTLu9EcjWFpUoBIAYEU0x/h4IA5Cyl6CUw3YZhZYaGoLLfi9EZkgBLlUU+6OXpDO
 Uepzn1gnpXMCNsiBE/Hr9LR0pfOTtzdJuNADrppRnbPfky8RsPE8tuk6kT6301U1
 IxG66SafYsvbQGzyIdfTydl022DFj5LOtCPFtfALviJqdBOGE/zPPnrBPinHg4oJ
 40P2tIJ/+Ksz5cPzmkA1KanSXaQ2v0sLBVdQJ7yt5EFuAMzj/roWpiPmEmQd6KqB
 ixZdZLehKFPaAB5cR41fHV1jB30HN7oakwqCoYmXd1Chu3AlB15yV9WZMSqjPS8P
 pkNC/X5mU5hDnZUx9e3Fbu8LqoGOjnGvDn5jOxihdKfaGu3A4OlbSerIUbRHvnT8
 u8XDPoq4j61f04MiI9z/bPDFTRYyycIQPcHYQpi4MJt9lSkkydP217P60BJsUv2n
 NIPYwgI7VIse0Gdo8shIg+RnSnJaKHT9Sf86h8pyDFO6wZp/GVVqPSdjjU+Lv5fv
 CZGJ7PCYcfs=
 =2q2Y
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull misc AFS fixes from David Howells:
 "This fixes a set of miscellaneous issues in the afs filesystem,
  including:

   - leak of keys on file close.

   - broken error handling in xattr functions.

   - missing locking when updating VL server list.

   - volume location server DNS lookup whereby preloaded cells may not
     ever get a lookup and regular DNS lookups to maintain server lists
     consume power unnecessarily.

   - incorrect error propagation and handling in the fileserver
     iteration code causes operations to sometimes apparently succeed.

   - interruption of server record check/update side op during
     fileserver iteration causes uninterruptible main operations to fail
     unexpectedly.

   - callback promise expiry time miscalculation.

   - over invalidation of the callback promise on directories.

   - double locking on callback break waking up file locking waiters.

   - double increment of the vnode callback break counter.

  Note that it makes some changes outside of the afs code, including:

   - an extra parameter to dns_query() to allow the dns_resolver key
     just accessed to be immediately invalidated. AFS is caching the
     results itself, so the key can be discarded.

   - an interruptible version of wait_var_event().

   - an rxrpc function to allow the maximum lifespan to be set on a
     call.

   - a way for an rxrpc call to be marked as non-interruptible"

* tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix double inc of vnode->cb_break
  afs: Fix lock-wait/callback-break double locking
  afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not set
  afs: Fix calculation of callback expiry time
  afs: Make dynamic root population wait uninterruptibly for proc_cells_lock
  afs: Make some RPC operations non-interruptible
  rxrpc: Allow the kernel to mark a call as being non-interruptible
  afs: Fix error propagation from server record check/update
  afs: Fix the maximum lifespan of VL and probe calls
  rxrpc: Provide kernel interface to set max lifespan on a call
  afs: Fix "kAFS: AFS vnode with undefined type 0"
  afs: Fix cell DNS lookup
  Add wait_var_event_interruptible()
  dns_resolver: Allow used keys to be invalidated
  afs: Fix afs_cell records to always have a VL server list record
  afs: Fix missing lock when replacing VL server list
  afs: Fix afs_xattr_get_yfs() to not try freeing an error value
  afs: Fix incorrect error handling in afs_xattr_get_acl()
  afs: Fix key leak in afs_release() and afs_evict_inode()
2019-05-16 17:00:13 -07:00
Linus Torvalds
1d9d7cbf28 On the filesystem side we have:
- a fix to enforce quotas set above the mount point (Luis Henriques)
 
 - support for exporting snapshots through NFS (Zheng Yan)
 
 - proper statx implementation (Jeff Layton).  statx flags are mapped
   to MDS caps, with AT_STATX_{DONT,FORCE}_SYNC taken into account.
 
 - some follow-up dentry name handling fixes, in particular elimination
   of our hand-rolled helper and the switch to __getname() as suggested
   by Al (Jeff Layton)
 
 - a set of MDS client cleanups in preparation for async MDS requests
   in the future (Jeff Layton)
 
 - a fix to sync the filesystem before remounting (Jeff Layton)
 
 On the rbd side, work is on-going on object-map and fast-diff image
 features.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlzdgEkTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi2w0B/9AsskuQezu8HP0NumCNfdgfI02r6d1
 1ZixMp6q8AAtOZYHP0bmiLzaETwC3+sRkD+8nX5DWuFISyjkTlRn8f7wnoziWkBT
 bBmL21fufkSKXN41VFCdolAbUPCKuA8+Fr7YE2hCl517ejbf47W+htv7+a56eTiR
 iAiDyVYokB8sj7WTVW6ET4HJTvJly1Z4QUNmy9Ljfzc8AvL2LFLOe6FRsJtIThdx
 aE00RX9EQsKO2v9ROd6jDmZocg50TvFmgF14A5GFfMmFrxJuri2yEI4iZd3hSKu2
 yZ+fBWmRy4E9w5E20qufrM+bSVjA+Zi7aiTMriaBm54aYtflgJ5gxhFI
 =68dZ
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.2-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "On the filesystem side we have:

   - a fix to enforce quotas set above the mount point (Luis Henriques)

   - support for exporting snapshots through NFS (Zheng Yan)

   - proper statx implementation (Jeff Layton). statx flags are mapped
     to MDS caps, with AT_STATX_{DONT,FORCE}_SYNC taken into account.

   - some follow-up dentry name handling fixes, in particular
     elimination of our hand-rolled helper and the switch to __getname()
     as suggested by Al (Jeff Layton)

   - a set of MDS client cleanups in preparation for async MDS requests
     in the future (Jeff Layton)

   - a fix to sync the filesystem before remounting (Jeff Layton)

  On the rbd side, work is on-going on object-map and fast-diff image
  features"

* tag 'ceph-for-5.2-rc1' of git://github.com/ceph/ceph-client: (29 commits)
  ceph: flush dirty inodes before proceeding with remount
  ceph: fix unaligned access in ceph_send_cap_releases
  libceph: make ceph_pr_addr take an struct ceph_entity_addr pointer
  libceph: fix unaligned accesses in ceph_entity_addr handling
  rbd: don't assert on writes to snapshots
  rbd: client_mutex is never nested
  ceph: print inode number in __caps_issued_mask debugging messages
  ceph: just call get_session in __ceph_lookup_mds_session
  ceph: simplify arguments and return semantics of try_get_cap_refs
  ceph: fix comment over ceph_drop_caps_for_unlink
  ceph: move wait for mds request into helper function
  ceph: have ceph_mdsc_do_request call ceph_mdsc_submit_request
  ceph: after an MDS request, do callback and completions
  ceph: use pathlen values returned by set_request_path_attr
  ceph: use __getname/__putname in ceph_mdsc_build_path
  ceph: use ceph_mdsc_build_path instead of clone_dentry_name
  ceph: fix potential use-after-free in ceph_mdsc_build_path
  ceph: dump granular cap info in "caps" debugfs file
  ceph: make iterate_session_caps a public symbol
  ceph: fix NULL pointer deref when debugging is enabled
  ...
2019-05-16 16:24:01 -07:00
Wei Wang
510e2ceda0 ipv6: fix src addr routing with the exception table
When inserting route cache into the exception table, the key is
generated with both src_addr and dest_addr with src addr routing.
However, current logic always assumes the src_addr used to generate the
key is a /128 host address. This is not true in the following scenarios:
1. When the route is a gateway route or does not have next hop.
   (rt6_is_gw_or_nonexthop() == false)
2. When calling ip6_rt_cache_alloc(), saddr is passed in as NULL.
This means, when looking for a route cache in the exception table, we
have to do the lookup twice: first time with the passed in /128 host
address, second time with the src_addr stored in fib6_info.

This solves the pmtu discovery issue reported by Mikael Magnusson where
a route cache with a lower mtu info is created for a gateway route with
src addr. However, the lookup code is not able to find this route cache.

Fixes: 2b760fcf5c ("ipv6: hook up exception table to store dst cache")
Reported-by: Mikael Magnusson <mikael.kernel@lists.m7n.se>
Bisected-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16 14:30:53 -07:00
Eric Dumazet
d7c04b05c9 net: avoid weird emergency message
When host is under high stress, it is very possible thread
running netdev_wait_allrefs() returns from msleep(250)
10 seconds late.

This leads to these messages in the syslog :

[...] unregister_netdevice: waiting for syz_tun to become free. Usage count = 0

If the device refcount is zero, the wait is over.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16 14:25:58 -07:00
Florian Westphal
858e5400e6 xfrm: ressurrect "Fix uninitialized memory read in _decode_session4"
This resurrects commit 8742dc86d0
("xfrm4: Fix uninitialized memory read in _decode_session4"),
which got lost during a merge conflict resolution between ipsec-next
and net-next tree.

c53ac41e37 ("xfrm: remove decode_session indirection from afinfo_policy")
in ipsec-next moved the (buggy) _decode_session4 from
net/ipv4/xfrm4_policy.c to net/xfrm/xfrm_policy.c.
In mean time, 8742dc86d0 was applied to ipsec.git and fixed the
problem in the "old" location.

When the trees got merged, the moved, old function was kept.
This applies the "lost" commit again, to the new location.

Fixes: a658a3f2ec ("Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16 14:14:47 -07:00
Junwei Hu
7e27e8d613 tipc: switch order of device registration to fix a crash
When tipc is loaded while many processes try to create a TIPC socket,
a crash occurs:
 PANIC: Unable to handle kernel paging request at virtual
 address "dfff20000000021d"
 pc : tipc_sk_create+0x374/0x1180 [tipc]
 lr : tipc_sk_create+0x374/0x1180 [tipc]
   Exception class = DABT (current EL), IL = 32 bits
 Call trace:
  tipc_sk_create+0x374/0x1180 [tipc]
  __sock_create+0x1cc/0x408
  __sys_socket+0xec/0x1f0
  __arm64_sys_socket+0x74/0xa8
 ...

This is due to race between sock_create and unfinished
register_pernet_device. tipc_sk_insert tries to do
"net_generic(net, tipc_net_id)".
but tipc_net_id is not initialized yet.

So switch the order of the two to close the race.

This can be reproduced with multiple processes doing socket(AF_TIPC, ...)
and one process doing module removal.

Fixes: a62fbccecd ("tipc: make subscriber server support net namespace")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Reported-by: Wang Wang <wangwang2@huawei.com>
Reviewed-by: Xiaogang Wang <wangxiaogang3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16 12:25:02 -07:00
Eric Dumazet
61fb0d0168 ipv6: prevent possible fib6 leaks
At ipv6 route dismantle, fib6_drop_pcpu_from() is responsible
for finding all percpu routes and set their ->from pointer
to NULL, so that fib6_ref can reach its expected value (1).

The problem right now is that other cpus can still catch the
route being deleted, since there is no rcu grace period
between the route deletion and call to fib6_drop_pcpu_from()

This can leak the fib6 and associated resources, since no
notifier will take care of removing the last reference(s).

I decided to add another boolean (fib6_destroying) instead
of reusing/renaming exception_bucket_flushed to ease stable backports,
and properly document the memory barriers used to implement this fix.

This patch has been co-developped with Wei Wang.

Fixes: 93531c6743 ("net/ipv6: separate handling of FIB entries from dst based routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Wei Wang <weiwan@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin Lau <kafai@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16 12:21:00 -07:00
Konstantin Khlebnikov
752beb5ec4 net: bpfilter: fallback to netfilter if failed to load bpfilter kernel module
If bpfilter is not available return ENOPROTOOPT to fallback to netfilter.

Function request_module() returns both errors and userspace exit codes.
Just ignore them. Rechecking bpfilter_ops is enough.

Fixes: d2ba09c17a ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16 12:12:40 -07:00
Sunil Muthuswamy
a9eeb998c2 hv_sock: Add support for delayed close
Currently, hvsock does not implement any delayed or background close
logic. Whenever the hvsock socket is closed, a FIN is sent to the peer, and
the last reference to the socket is dropped, which leads to a call to
.destruct where the socket can hang indefinitely waiting for the peer to
close it's side. The can cause the user application to hang in the close()
call.

This change implements proper STREAM(TCP) closing handshake mechanism by
sending the FIN to the peer and the waiting for the peer's FIN to arrive
for a given timeout. On timeout, it will try to terminate the connection
(i.e. a RST). This is in-line with other socket providers such as virtio.

This change does not address the hang in the vmbus_hvsock_device_unregister
where it waits indefinitely for the host to rescind the channel. That
should be taken up as a separate fix.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16 12:10:05 -07:00
Edward Cree
bae9ed6902 flow_offload: support CVLAN match
Plumb it through from the flow_dissector.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16 12:02:42 -07:00
David Howells
b960a34b73 rxrpc: Allow the kernel to mark a call as being non-interruptible
Allow kernel services using AF_RXRPC to indicate that a call should be
non-interruptible.  This allows kafs to make things like lock-extension and
writeback data storage calls non-interruptible.

If this is set, signals will be ignored for operations on that call where
possible - such as waiting to get a call channel on an rxrpc connection.

It doesn't prevent UDP sendmsg from being interrupted, but that will be
handled by packet retransmission.

rxrpc_kernel_recv_data() isn't affected by this since that never waits,
preferring instead to return -EAGAIN and leave the waiting to the caller.

Userspace initiated calls can't be set to be uninterruptible at this time.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16 16:25:20 +01:00
David Howells
bbd172e316 rxrpc: Provide kernel interface to set max lifespan on a call
Provide an interface to set max lifespan on a call from inside of the
kernel without having to call kernel_sendmsg().

Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16 15:48:30 +01:00
David S. Miller
c7d5ec26ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-05-16

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a use after free in __dev_map_entry_free(), from Eric.

2) Several sockmap related bug fixes: a splat in strparser if
   it was never initialized, remove duplicate ingress msg list
   purging which can race, fix msg->sg.size accounting upon
   skb to msg conversion, and last but not least fix a timeout
   bug in tcp_bpf_wait_data(), from John.

3) Fix LRU map to avoid messing with eviction heuristics upon
   syscall lookup, e.g. map walks from user space side will
   then lead to eviction of just recently created entries on
   updates as it would mark all map entries, from Daniel.

4) Don't bail out when libbpf feature probing fails. Also
   various smaller fixes to flow_dissector test, from Stanislav.

5) Fix missing brackets for BTF_INT_OFFSET() in UAPI, from Gary.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-15 18:28:44 -07:00
Linus Torvalds
700a800a94 This pull consists mostly of nfsd container work:
Scott Mayhew revived an old api that communicates with a userspace
 daemon to manage some on-disk state that's used to track clients across
 server reboots.  We've been using a usermode_helper upcall for that, but
 it's tough to run those with the right namespaces, so a daemon is much
 friendlier to container use cases.
 
 Trond fixed nfsd's handling of user credentials in user namespaces.  He
 also contributed patches that allow containers to support different sets
 of NFS protocol versions.
 
 The only remaining container bug I'm aware of is that the NFS reply
 cache is shared between all containers.  If anyone's aware of other gaps
 in our container support, let me know.
 
 The rest of this is miscellaneous bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAlzcWNcVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+DUEP/0WD3jKNAHFV3M5YQPAI9fz/iCND
 Db/A4oWP5qa6JmwmHe61il29QeGqkeFr/NPexgzM3Xw2E39d7RBXBeWyVDuqb0wr
 6SCXjXibTsuAHg11nR8Xf0P5Vej3rfGbG6up5lLCIDTEZxVpWoaBJnM8+3bewuCj
 XbeiDW54oiMbmDjon3MXqVAIF/z7LjorecJ+Yw5+0Jy7KZ6num9Kt8+fi7qkEfFd
 i5Bp9KWgzlTbJUJV4EX3ZKN3zlGkfOvjoo2kP3PODPVMB34W8jSLKkRSA1tDWYZg
 43WhBt5OODDlV6zpxSJXehYKIB4Ae469+RRaIL4F+ORRK+AzR0C/GTuOwJiG+P3J
 n95DX5WzX74nPOGQJgAvq4JNpZci85jM3jEK1TR2M7KiBDG5Zg+FTsPYVxx5Sgah
 Akl/pjLtHQPSdBbFGHn5TsXU+gqWNiKsKa9663tjxLb8ldmJun6JoQGkAEF9UJUn
 dzv0UxyHeHAblhSynY+WsUR+Xep9JDo/p5LyFK4if9Sd62KeA1uF/MFhAqpKZF81
 mrgRCqW4sD8aVTBNZI06pZzmcZx4TRr2o+Oj5KAXf6Yk6TJRSGfnQscoMMBsTLkw
 VK1rBQ/71TpjLHGZZZEx1YJrkVZAMmw2ty4DtK2f9jeKO13bWmUpc6UATzVufHKA
 C1rUZXJ5YioDbYDy
 =TUdw
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "This consists mostly of nfsd container work:

  Scott Mayhew revived an old api that communicates with a userspace
  daemon to manage some on-disk state that's used to track clients
  across server reboots. We've been using a usermode_helper upcall for
  that, but it's tough to run those with the right namespaces, so a
  daemon is much friendlier to container use cases.

  Trond fixed nfsd's handling of user credentials in user namespaces. He
  also contributed patches that allow containers to support different
  sets of NFS protocol versions.

  The only remaining container bug I'm aware of is that the NFS reply
  cache is shared between all containers. If anyone's aware of other
  gaps in our container support, let me know.

  The rest of this is miscellaneous bugfixes"

* tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux: (23 commits)
  nfsd: update callback done processing
  locks: move checks from locks_free_lock() to locks_release_private()
  nfsd: fh_drop_write in nfsd_unlink
  nfsd: allow fh_want_write to be called twice
  nfsd: knfsd must use the container user namespace
  SUNRPC: rsi_parse() should use the current user namespace
  SUNRPC: Fix the server AUTH_UNIX userspace mappings
  lockd: Pass the user cred from knfsd when starting the lockd server
  SUNRPC: Temporary sockets should inherit the cred from their parent
  SUNRPC: Cache the process user cred in the RPC server listener
  nfsd: Allow containers to set supported nfs versions
  nfsd: Add custom rpcbind callbacks for knfsd
  SUNRPC: Allow further customisation of RPC program registration
  SUNRPC: Clean up generic dispatcher code
  SUNRPC: Add a callback to initialise server requests
  SUNRPC/nfs: Fix return value for nfs4_callback_compound()
  nfsd: handle legacy client tracking records sent by nfsdcld
  nfsd: re-order client tracking method selection
  nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld
  nfsd: un-deprecate nfsdcld
  ...
2019-05-15 18:21:43 -07:00
John Fastabend
5fa2ca7c4a bpf, tcp: correctly handle DONT_WAIT flags and timeo == 0
The tcp_bpf_wait_data() routine needs to check timeo != 0 before
calling sk_wait_event() otherwise we may see unexpected stalls
on receiver.

Arika did all the leg work here I just formatted, posted and ran
a few tests.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: Arika Chen <eaglesora@gmail.com>
Suggested-by: Arika Chen <eaglesora@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-16 01:36:13 +02:00
David Howells
d0660f0b3b dns_resolver: Allow used keys to be invalidated
Allow used DNS resolver keys to be invalidated after use if the caller is
doing its own caching of the results.  This reduces the amount of resources
required.

Fix AFS to invalidate DNS results to kill off permanent failure records
that get lodged in the resolver keyring and prevent future lookups from
happening.

Fixes: 0a5143f2f8 ("afs: Implement VL server rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-15 17:35:54 +01:00
Eric Dumazet
858f501744 tcp: do not recycle cloned skbs
It is illegal to change arbitrary fields in skb_shared_info if the
skb is cloned.

Before calling skb_zcopy_clear() we need to ensure this rule,
therefore we need to move the test from sk_stream_alloc_skb()
to sk_wmem_free_skb()

Fixes: 4f661542a4 ("tcp: fix zerocopy and notsent_lowat issues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-15 09:22:41 -07:00
YueHaibing
80a316ff16 9p/xen: Add cleanup path in p9_trans_xen_init
If xenbus_register_frontend() fails in p9_trans_xen_init,
we should call v9fs_unregister_trans() to do cleanup.

Link: http://lkml.kernel.org/r/20190430143933.19368-1-yuehaibing@huawei.com
Cc: stable@vger.kernel.org
Fixes: 868eb12273 ("xen/9pfs: introduce Xen 9pfs transport driver")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2019-05-15 13:00:07 +00:00
YueHaibing
d4548543fc 9p/virtio: Add cleanup path in p9_virtio_init
KASAN report this:

BUG: unable to handle kernel paging request at ffffffffa0097000
PGD 3870067 P4D 3870067 PUD 3871063 PMD 2326e2067 PTE 0
Oops: 0000 [#1
CPU: 0 PID: 5340 Comm: modprobe Not tainted 5.1.0-rc7+ #25
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:__list_add_valid+0x10/0x70
Code: c3 48 8b 06 55 48 89 e5 5d 48 39 07 0f 94 c0 0f b6 c0 c3 90 90 90 90 90 90 90 55 48 89 d0 48 8b 52 08 48 89 e5 48 39 f2 75 19 <48> 8b 32 48 39 f0 75 3a

RSP: 0018:ffffc90000e23c68 EFLAGS: 00010246
RAX: ffffffffa00ad000 RBX: ffffffffa009d000 RCX: 0000000000000000
RDX: ffffffffa0097000 RSI: ffffffffa0097000 RDI: ffffffffa009d000
RBP: ffffc90000e23c68 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa0097000
R13: ffff888231797180 R14: 0000000000000000 R15: ffffc90000e23e78
FS:  00007fb215285540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa0097000 CR3: 000000022f144000 CR4: 00000000000006f0
Call Trace:
 v9fs_register_trans+0x2f/0x60 [9pnet
 ? 0xffffffffa0087000
 p9_virtio_init+0x25/0x1000 [9pnet_virtio
 do_one_initcall+0x6c/0x3cc
 ? kmem_cache_alloc_trace+0x248/0x3b0
 do_init_module+0x5b/0x1f1
 load_module+0x1db1/0x2690
 ? m_show+0x1d0/0x1d0
 __do_sys_finit_module+0xc5/0xd0
 __x64_sys_finit_module+0x15/0x20
 do_syscall_64+0x6b/0x1d0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb214d8e839
Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01

RSP: 002b:00007ffc96554278 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000055e67eed2aa0 RCX: 00007fb214d8e839
RDX: 0000000000000000 RSI: 000055e67ce95c2e RDI: 0000000000000003
RBP: 000055e67ce95c2e R08: 0000000000000000 R09: 000055e67eed2aa0
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
R13: 000055e67eeda500 R14: 0000000000040000 R15: 000055e67eed2aa0
Modules linked in: 9pnet_virtio(+) 9pnet gre rfkill vmw_vsock_virtio_transport_common vsock [last unloaded: 9pnet_virtio
CR2: ffffffffa0097000
---[ end trace 4a52bb13ff07b761

If register_virtio_driver() fails in p9_virtio_init,
we should call v9fs_unregister_trans() to do cleanup.

Link: http://lkml.kernel.org/r/20190430115942.41840-1-yuehaibing@huawei.com
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: b530cc7940 ("9p: add virtio transport")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2019-05-15 12:58:18 +00:00
Sinan Kaya
432d82200f net: replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
CONFIG_DEBUG_KERNEL should not impact code generation.  Use the newly
defined CONFIG_DEBUG_MISC instead to keep the current code.

Link: http://lkml.kernel.org/r/20190413224438.10802-6-okaya@kernel.org
Signed-off-by: Sinan Kaya <okaya@kernel.org>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc:  Chris Zankel <chris@zankel.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Sabrina Dubroca
feadc4b6cf rtnetlink: always put IFLA_LINK for links with a link-netnsid
Currently, nla_put_iflink() doesn't put the IFLA_LINK attribute when
iflink == ifindex.

In some cases, a device can be created in a different netns with the
same ifindex as its parent. That device will not dump its IFLA_LINK
attribute, which can confuse some userspace software that expects it.
For example, if the last ifindex created in init_net and foo are both
8, these commands will trigger the issue:

    ip link add parent type dummy                   # ifindex 9
    ip link add link parent netns foo type macvlan  # ifindex 9 in ns foo

So, in case a device puts the IFLA_LINK_NETNSID attribute in a dump,
always put the IFLA_LINK attribute as well.

Thanks to Dan Winship for analyzing the original OpenShift bug down to
the missing netlink attribute.

v2: change Fixes tag, it's been here forever, as Nicolas Dichtel said
    add Nicolas' ack
v3: change Fixes tag
    fix subject typo, spotted by Edward Cree

Analyzed-by: Dan Winship <danw@redhat.com>
Fixes: d8a5ec6727 ("[NET]: netlink support for moving devices between network namespaces.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-14 15:40:01 -07:00
Yuchung Cheng
cd736d8b67 tcp: fix retrans timestamp on passive Fast Open
Commit c7d13c8faa ("tcp: properly track retry time on
passive Fast Open") sets the start of SYNACK retransmission
time on passive Fast Open in "retrans_stamp". However the
timestamp is not reset upon the handshake has completed. As a
result, future data packet retransmission may not update it in
tcp_retransmit_skb(). This may lead to socket aborting earlier
unexpectedly by retransmits_timed_out() since retrans_stamp remains
the SYNACK rtx time.

This bug only manifests on passive TFO sender that a) suffered
SYNACK timeout and then b) stalls on very first loss recovery. Any
successful loss recovery would reset the timestamp to avoid this
issue.

Fixes: c7d13c8faa ("tcp: properly track retry time on passive Fast Open")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-14 15:17:49 -07:00
Linus Torvalds
318222a35b Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few misc things and hotfixes

 - ocfs2

 - almost all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (139 commits)
  kernel/memremap.c: remove the unused device_private_entry_fault() export
  mm: delete find_get_entries_tag
  mm/huge_memory.c: make __thp_get_unmapped_area static
  mm/mprotect.c: fix compilation warning because of unused 'mm' variable
  mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
  mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
  mm/Kconfig: update "Memory Model" help text
  mm/vmscan.c: don't disable irq again when count pgrefill for memcg
  mm: memblock: make keeping memblock memory opt-in rather than opt-out
  hugetlbfs: always use address space in inode for resv_map pointer
  mm/z3fold.c: support page migration
  mm/z3fold.c: add structure for buddy handles
  mm/z3fold.c: improve compression by extending search
  mm/z3fold.c: introduce helper functions
  mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
  mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
  mm/vmscan.c: simplify shrink_inactive_list()
  fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  xen/privcmd-buf.c: convert to use vm_map_pages_zero()
  xen/gntdev.c: convert to use vm_map_pages()
  ...
2019-05-14 10:10:55 -07:00
Ira Weiny
73b0140bf0 mm/gup: change GUP fast to use flags rather than a write 'bool'
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.

This patch does not change any functionality.  New functionality will
follow in subsequent patches.

Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.

NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted.  This breaks the current
GUP call site convention of having the returned pages be the final
parameter.  So the suggestion was rejected.

Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:46 -07:00
Ira Weiny
932f4a630a mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM
Pach series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages.  These pages can be held for a significant time.  But
get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks.  XDP has also
shown interest in using this functionality.[1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move.  I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem.  Then I think we can change the flag to a
better name.

Secondly, it depends on how often you are registering memory.  I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance.  I don't have the numbers as the
tests for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast().  Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.

Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality.  In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked.  However, callers of get_user_pages_fast()
were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages.  This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.

As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points.

First "longterm" is a relative thing and at this point is probably a
misnomer.  This is really flagging a pin which is going to be given to
hardware and can't move.  I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem.  Then I think we can
change the flag to a better name.

Second, It depends on how often you are registering memory.  I have spoken
with some RDMA users who consider MR in the performance path...  For the
overall application performance.  I don't have the numbers as the tests
for HFI1 were done a long time ago.  But there was a significant
advantage.  Some of which is probably due to the fact that you don't have
to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use
*_fast.  There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well.  Also
to this point others are looking to use *_fast.

As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same.  I agree and I think further cleanup
will be coming.  But I'm focused on getting the final solution for DAX at
the moment.

</quote>

[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[ira.weiny@intel.com: v3]
  Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:45 -07:00
Maxim Mikityanskiy
32b5a2c995 wireless: Skip directory when generating certificates
Commit 715a123347 ("wireless: don't write C files on failures") drops
the `test -f $$f` check. The list of targets contains the
CONFIG_CFG80211_EXTRA_REGDB_KEYDIR directory itself, and this check used
to filter it out. After the check was removed, the extra keydir option
no longer works, failing with the following message:

od: 'standard input': read error: Is a directory

This commit restores the check to make extra keydir work again.

Fixes: 715a123347 ("wireless: don't write C files on failures")
Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-14 10:27:57 +02:00
John Fastabend
cabede8b4f bpf: sockmap fix msg->sg.size account on ingress skb
When converting a skb to msg->sg we forget to set the size after the
latest ktls/tls code conversion. This patch can be reached by doing
a redir into ingress path from BPF skb sock recv hook. Then trying to
read the size fails.

Fix this by setting the size.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-14 01:31:43 +02:00
John Fastabend
c42253cc88 bpf: sockmap remove duplicate queue free
In tcp bpf remove we free the cork list and purge the ingress msg
list. However we do this before the ref count reaches zero so it
could be possible some other access is in progress. In this case
(tcp close and/or tcp_unhash) we happen to also hold the sock
lock so no path exists but lets fix it otherwise it is extremely
fragile and breaks the reference counting rules. Also we already
check the cork list and ingress msg queue and free them once the
ref count reaches zero so its wasteful to check twice.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-14 01:31:43 +02:00
John Fastabend
014894360e bpf: sockmap, only stop/flush strp if it was enabled at some point
If we try to call strp_done on a parser that has never been
initialized, because the sockmap user is only using TX side for
example we get the following error.

  [  883.422081] WARNING: CPU: 1 PID: 208 at kernel/workqueue.c:3030 __flush_work+0x1ca/0x1e0
  ...
  [  883.422095] Workqueue: events sk_psock_destroy_deferred
  [  883.422097] RIP: 0010:__flush_work+0x1ca/0x1e0

This had been wrapped in a 'if (psock->parser.enabled)' logic which
was broken because the strp_done() was never actually being called
because we do a strp_stop() earlier in the tear down logic will
set parser.enabled to false. This could result in a use after free
if work was still in the queue and was resolved by the patch here,
1d79895aef ("sk_msg: Always cancel strp work before freeing the
psock"). However, calling strp_stop(), done by the patch marked in
the fixes tag, only is useful if we never initialized a strp parser
program and never initialized the strp to start with. Because if
we had initialized a stream parser strp_stop() would have been called
by sk_psock_drop() earlier in the tear down process.  By forcing the
strp to stop we get past the WARNING in strp_done that checks
the stopped flag but calling cancel_work_sync on work that has never
been initialized is also wrong and generates the warning above.

To fix check if the parser program exists. If the program exists
then the strp work has been initialized and must be sync'd and
cancelled before free'ing any structures. If no program exists we
never initialized the stream parser in the first place so skip the
sync/cancel logic implemented by strp_done.

Finally, remove the strp_done its not needed and in the case where we
are using the stream parser has already been called.

Fixes: e8e3437762 ("bpf: Stop the psock parser before canceling its work")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-14 01:31:31 +02:00
Eric Dumazet
b1c17a9a35 flow_dissector: disable preemption around BPF calls
Various things in eBPF really require us to disable preemption
before running an eBPF program.

syzbot reported :

BUG: assuming atomic context at net/core/flow_dissector.c:737
in_atomic(): 0, irqs_disabled(): 0, pid: 24710, name: syz-executor.3
2 locks held by syz-executor.3/24710:
 #0: 00000000e81a4bf1 (&tfile->napi_mutex){+.+.}, at: tun_get_user+0x168e/0x3ff0 drivers/net/tun.c:1850
 #1: 00000000254afebd (rcu_read_lock){....}, at: __skb_flow_dissect+0x1e1/0x4bb0 net/core/flow_dissector.c:822
CPU: 1 PID: 24710 Comm: syz-executor.3 Not tainted 5.1.0+ #6
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 __cant_sleep kernel/sched/core.c:6165 [inline]
 __cant_sleep.cold+0xa3/0xbb kernel/sched/core.c:6142
 bpf_flow_dissect+0xfe/0x390 net/core/flow_dissector.c:737
 __skb_flow_dissect+0x362/0x4bb0 net/core/flow_dissector.c:853
 skb_flow_dissect_flow_keys_basic include/linux/skbuff.h:1322 [inline]
 skb_probe_transport_header include/linux/skbuff.h:2500 [inline]
 skb_probe_transport_header include/linux/skbuff.h:2493 [inline]
 tun_get_user+0x2cfe/0x3ff0 drivers/net/tun.c:1940
 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2037
 call_write_iter include/linux/fs.h:1872 [inline]
 do_iter_readv_writev+0x5fd/0x900 fs/read_write.c:693
 do_iter_write fs/read_write.c:970 [inline]
 do_iter_write+0x184/0x610 fs/read_write.c:951
 vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015
 do_writev+0x15b/0x330 fs/read_write.c:1058
 __do_sys_writev fs/read_write.c:1131 [inline]
 __se_sys_writev fs/read_write.c:1128 [inline]
 __x64_sys_writev+0x75/0xb0 fs/read_write.c:1128
 do_syscall_64+0x103/0x670 arch/x86/entry/common.c:298
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Petar Penkov <ppenkov@google.com>
Cc: Stanislav Fomichev <sdf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-13 09:53:42 -07:00
David S. Miller
3ebb41bf47 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Postpone chain policy update to drop after transaction is complete,
   from Florian Westphal.

2) Add entry to flowtable after confirmation to fix UDP flows with
   packets going in one single direction.

3) Reference count leak in dst object, from Taehee Yoo.

4) Check for TTL field in flowtable datapath, from Taehee Yoo.

5) Fix h323 conntrack helper due to incorrect boundary check,
   from Jakub Jankowski.

6) Fix incorrect rcu dereference when fetching basechain stats,
   from Florian Westphal.

7) Missing error check when adding new entries to flowtable,
   from Taehee Yoo.

8) Use version field in nfnetlink message to honor the nfgen_family
   field, from Kristian Evensen.

9) Remove incorrect configuration check for CONFIG_NF_CONNTRACK_IPV6,
   from Subash Abhinov Kasiviswanathan.

10) Prevent dying entries from being added to the flowtable,
    from Taehee Yoo.

11) Don't hit WARN_ON() with malformed blob in ebtables with
    trailing data after last rule, reported by syzbot, patch
    from Florian Westphal.

12) Remove NFT_CT_TIMEOUT enumeration, never used in the kernel
    code.

13) Fix incorrect definition for NFT_LOGLEVEL_MAX, from Florian
    Westphal.

This batch comes with a conflict that can be fixed with this patch:

diff --cc include/uapi/linux/netfilter/nf_tables.h
index 7bdb234f3d8c,f0cf7b0f4f35..505393c6e959
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@@ -966,6 -966,8 +966,7 @@@ enum nft_socket_keys
   * @NFT_CT_DST_IP: conntrack layer 3 protocol destination (IPv4 address)
   * @NFT_CT_SRC_IP6: conntrack layer 3 protocol source (IPv6 address)
   * @NFT_CT_DST_IP6: conntrack layer 3 protocol destination (IPv6 address)
 - * @NFT_CT_TIMEOUT: connection tracking timeout policy assigned to conntrack
+  * @NFT_CT_ID: conntrack id
   */
  enum nft_ct_keys {
  	NFT_CT_STATE,
@@@ -991,6 -993,8 +992,7 @@@
  	NFT_CT_DST_IP,
  	NFT_CT_SRC_IP6,
  	NFT_CT_DST_IP6,
 -	NFT_CT_TIMEOUT,
+ 	NFT_CT_ID,
  	__NFT_CT_MAX
  };
  #define NFT_CT_MAX		(__NFT_CT_MAX - 1)

That replaces the unused NFT_CT_TIMEOUT definition by NFT_CT_ID. If you prefer,
I can also solve this conflict here, just let me know.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-13 08:55:15 -07:00
Hariprasad Kelam
3285a9aa65 net: dccp : proto: remove Unneeded variable "err"
Fix below issue reported by coccicheck

net/dccp/proto.c:266:5-8: Unneeded variable: "err". Return "0" on line
310

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-12 13:21:30 -07:00
Vladimir Oltean
8767137510 net: dsa: Initialize DSA_SKB_CB(skb)->deferred_xmit variable
The sk_buff control block can have any contents on xmit put there by the
stack, so initialization is mandatory, since we are checking its value
after the actual DSA xmit (the tagger may have changed it).

The DSA_SKB_ZERO() macro could have been used for this purpose, but:
- Zeroizing a 48-byte memory region in the hotpath is best avoided.
- It would have triggered a warning with newer compilers since
  __dsa_skb_cb contains a structure within a structure, and the {0}
  initializer was incorrect for that purpose.

So simply remove the DSA_SKB_ZERO() macro and initialize the
deferred_xmit variable by hand (which should be done for all further
dsa_skb_cb variables which need initialization - currently none - to
avoid the performance penalty).

Fixes: 97a69a0dea ("net: dsa: Add support for deferred xmit")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-12 13:19:46 -07:00
Nicholas Mc Guire
8f5e24514c net: qrtr: use protocol endiannes variable
sparse was unable to verify endiannes correctness due to reassignment
from le32_to_cpu to the same variable - fix this warning up by providing
a proper __le32 type and initializing it. This is not actually fixing
any bug - rather just addressing the sparse warning.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-11 09:59:48 -07:00
YueHaibing
b96a54154b dsa: tag_brcm: Fix build error without CONFIG_NET_DSA_TAG_BRCM_PREPEND
Fix gcc build error:

net/dsa/tag_brcm.c:211:16: error: brcm_prepend_netdev_ops undeclared here (not in a function); did you mean brcm_netdev_ops?
 DSA_TAG_DRIVER(brcm_prepend_netdev_ops);
                ^
./include/net/dsa.h:708:10: note: in definition of macro DSA_TAG_DRIVER
  .ops = &__ops,       \
          ^~~~~
./include/net/dsa.h:701:36: warning: dsa_tag_driver_brcm_prepend_netdev_ops defined but not used [-Wunused-variable]
 #define DSA_TAG_DRIVER_NAME(__ops) dsa_tag_driver ## _ ## __ops
                                    ^
./include/net/dsa.h:707:30: note: in expansion of macro DSA_TAG_DRIVER_NAME
 static struct dsa_tag_driver DSA_TAG_DRIVER_NAME(__ops) = {  \
                              ^~~~~~~~~~~~~~~~~~~
net/dsa/tag_brcm.c:211:1: note: in expansion of macro DSA_TAG_DRIVER
 DSA_TAG_DRIVER(brcm_prepend_netdev_ops);

Like the CONFIG_NET_DSA_TAG_BRCM case,
brcm_prepend_netdev_ops and DSA_TAG_PROTO_BRCM_PREPEND
should be wrappeed by CONFIG_NET_DSA_TAG_BRCM_PREPEND.

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: b74b70c449 ("net: dsa: Support prepended Broadcom tag")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-10 15:06:45 -07:00
Tobin C. Harding
bdfad5aec1 bridge: Fix error path for kobject_init_and_add()
Currently error return from kobject_init_and_add() is not followed by a
call to kobject_put().  This means there is a memory leak.  We currently
set p to NULL so that kfree() may be called on it as a noop, the code is
arguably clearer if we move the kfree() up closer to where it is
called (instead of after goto jump).

Remove a goto label 'err1' and jump to call to kobject_put() in error
return from kobject_init_and_add() fixing the memory leak.  Re-name goto
label 'put_back' to 'err1' now that we don't use err1, following current
nomenclature (err1, err2 ...).  Move call to kfree out of the error
code at bottom of function up to closer to where memory was allocated.
Add comment to clarify call to kfree().

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-10 15:05:08 -07:00
Linus Torvalds
601e6bcc4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Several bug fixes, many are quick merge-window regression cures:

   - When NLM_F_EXCL is not set, allow same fib rule insertion. From
     Hangbin Liu.

   - Several cures in sja1105 DSA driver (while loop exit condition fix,
     return of negative u8, etc.) from Vladimir Oltean.

   - Handle tx/rx delays in realtek PHY driver properly, from Serge
     Semin.

   - Double free in cls_matchall, from Pieter Jansen van Vuuren.

   - Disable SIOCSHWTSTAMP in macvlan/vlan containers, from Hangbin Liu.

   - Endainness fixes in aqc111, from Oliver Neukum.

   - Handle errors in packet_init properly, from Haibing Yue.

   - Various W=1 warning fixes in kTLS, from Jakub Kicinski"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (34 commits)
  nfp: add missing kdoc
  net/tls: handle errors from padding_length()
  net/tls: remove set but not used variables
  docs/btf: fix the missing section marks
  nfp: bpf: fix static check error through tightening shift amount adjustment
  selftests: bpf: initialize bpf_object pointers where needed
  packet: Fix error path in packet_init
  net/tcp: use deferred jump label for TCP acked data hook
  net: aquantia: fix undefined devm_hwmon_device_register_with_info reference
  aqc111: fix double endianness swap on BE
  aqc111: fix writing to the phy on BE
  aqc111: fix endianness issue in aqc111_change_mtu
  vlan: disable SIOCSHWTSTAMP in container
  macvlan: disable SIOCSHWTSTAMP in container
  tipc: fix hanging clients using poll with EPOLLOUT flag
  tuntap: synchronize through tfiles array instead of tun->numqueues
  tuntap: fix dividing by zero in ebpf queue selection
  dwmac4_prog_mtl_tx_algorithms() missing write operation
  ptp_qoriq: fix NULL access if ptp dt node missing
  net/sched: avoid double free on matchall reoffload
  ...
2019-05-09 17:00:51 -07:00
Jakub Kicinski
b53f4976fb net/tls: handle errors from padding_length()
At the time padding_length() is called the record header
is still part of the message.  If malicious TLS 1.3 peer
sends an all-zero record padding_length() will stop at
the record header, and return full length of the data
including the tail_size.

Subsequent subtraction of prot->overhead_size from rxm->full_len
will cause rxm->full_len to turn negative.  skb accessors,
however, will always catch resulting out-of-bounds operation,
so in practice this fix comes down to returning the correct
error code.  It also fixes a set but not used warning.

This code was added by commit 130b392c6c ("net: tls: Add tls 1.3 support").

CC: Dave Watson <davejwatson@fb.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09 16:37:39 -07:00
Jakub Kicinski
88c80bee88 net/tls: remove set but not used variables
Commit 4504ab0e6e ("net/tls: Inform user space about send buffer availability")
made us report write_space regardless whether partial record
push was successful or not.  Remove the now unused return value
to clean up the following W=1 warning:

net/tls/tls_device.c: In function ‘tls_device_write_space’:
net/tls/tls_device.c:546:6: warning: variable ‘rc’ set but not used [-Wunused-but-set-variable]
  int rc = 0;
      ^~

CC: Vakul Garg <vakul.garg@nxp.com>
CC: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09 16:37:39 -07:00
Linus Torvalds
06cbd26d31 NFS client updates for Linux 5.2
Stable bugfixes:
 - Fall back to MDS if no deviceid is found rather than aborting   # v4.11+
 - NFS4: Fix v4.0 client state corruption when mount
 
 Features:
 - Much improved handling of soft mounts with NFS v4.0
   - Reduce risk of false positive timeouts
   - Faster failover of reads and writes after a timeout
   - Added a "softerr" mount option to return ETIMEDOUT instead of
     EIO to the application after a timeout
 - Increase number of xprtrdma backchannel requests
 - Add additional xprtrdma tracepoints
 - Improved send completion batching for xprtrdma
 
 Other bugfixes and cleanups:
 - Return -EINVAL when NFS v4.2 is passed an invalid dedup mode
 - Reduce usage of GFP_ATOMIC pages in SUNRPC
 - Various minor NFS over RDMA cleanups and bugfixes
 - Use the correct container namespace for upcalls
 - Don't share superblocks between user namespaces
 - Various other container fixes
 - Make nfs_match_client() killable to prevent soft lockups
 - Don't mark all open state for recovery when handling recallable state revoked flag
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlzUjdcACgkQ18tUv7Cl
 QOsUiw/+OirzlZI7XeHfpZ/CwS7A+tSk3AAg9PDS1gjbfylER0g++GpA08tXnmDt
 JdUnBKYC5ujLyAqxN1j7QK+EvmXZQro8rucJxhEdPJMIQDC65fQQnmW7efl2bAEv
 CAWNDCf9Xe4g6X8LSR5jrnaMV4kuOQBYX4wqrrmaV8I+g/A/GKXW262KWnAv+w1M
 Y1ZlX+d1Gm8hODXhvqz4lldW6bkyrpWpU9BKUtYSYnSR0x1fam6PLPuCTm74fEDR
 N/Tgy5XvJi4xgti4SOZ/dI2O/Oqu6ut81PEPlhs8sTX04G8bLhr+hl3rSksCZFlu
 Afz9Hcnxg6XYB3Va7j7AO67H5SbyX4Zyj5cRMipXQE7Ebc1iXo5lu3vdhAEOAtNx
 fdNJlqD86MC/XWbtM+DfWlD+KjtpZ+lkxN+xuMgC/kVaPTeFI7nEWM796hJP/4no
 EYtnSLbSpJyH6F7wH9IL5V2EJYFxbzTvnPSTxV+QNZ0HgF17gTY0AGmQBzDE5bF0
 tfQteOG6MYXMHg64pTEzjlowlXOWdnE5TnuaFpt64/yP+hVznZMepBMSkxZO1xYt
 jc1wQlJkv/SyVH7cMGsj5lw3A6zwTrLManDUUmrLjIsVVmh4dk8WKlNtWQmvf1v6
 nFBklUa2GzH8LWKRT2ftNGcUeEiCuw/QF9oE5T/V7/7SQ/wmmvA=
 =skb2
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.2-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Highlights include:

  Stable bugfixes:
   - Fall back to MDS if no deviceid is found rather than aborting   # v4.11+
   - NFS4: Fix v4.0 client state corruption when mount

  Features:
   - Much improved handling of soft mounts with NFS v4.0:
       - Reduce risk of false positive timeouts
       - Faster failover of reads and writes after a timeout
       - Added a "softerr" mount option to return ETIMEDOUT instead of
         EIO to the application after a timeout
   - Increase number of xprtrdma backchannel requests
   - Add additional xprtrdma tracepoints
   - Improved send completion batching for xprtrdma

  Other bugfixes and cleanups:
   - Return -EINVAL when NFS v4.2 is passed an invalid dedup mode
   - Reduce usage of GFP_ATOMIC pages in SUNRPC
   - Various minor NFS over RDMA cleanups and bugfixes
   - Use the correct container namespace for upcalls
   - Don't share superblocks between user namespaces
   - Various other container fixes
   - Make nfs_match_client() killable to prevent soft lockups
   - Don't mark all open state for recovery when handling recallable
     state revoked flag"

* tag 'nfs-for-5.2-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (69 commits)
  SUNRPC: Rebalance a kref in auth_gss.c
  NFS: Fix a double unlock from nfs_match,get_client
  nfs: pass the correct prototype to read_cache_page
  NFSv4: don't mark all open state for recovery when handling recallable state revoked flag
  SUNRPC: Fix an error code in gss_alloc_msg()
  SUNRPC: task should be exit if encode return EKEYEXPIRED more times
  NFS4: Fix v4.0 client state corruption when mount
  PNFS fallback to MDS if no deviceid found
  NFS: make nfs_match_client killable
  lockd: Store the lockd client credential in struct nlm_host
  NFS: When mounting, don't share filesystems between different user namespaces
  NFS: Convert NFSv2 to use the container user namespace
  NFSv4: Convert the NFS client idmapper to use the container user namespace
  NFS: Convert NFSv3 to use the container user namespace
  SUNRPC: Use namespace of listening daemon in the client AUTH_GSS upcall
  SUNRPC: Use the client user namespace when encoding creds
  NFS: Store the credential of the mount process in the nfs_server
  SUNRPC: Cache cred of process creating the rpc_client
  xprtrdma: Remove stale comment
  xprtrdma: Update comments that reference ib_drain_qp
  ...
2019-05-09 14:33:15 -07:00
YueHaibing
36096f2f4f packet: Fix error path in packet_init
kernel BUG at lib/list_debug.c:47!
invalid opcode: 0000 [#1
CPU: 0 PID: 12914 Comm: rmmod Tainted: G        W         5.1.0+ #47
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:__list_del_entry_valid+0x53/0x90
Code: 48 8b 32 48 39 fe 75 35 48 8b 50 08 48 39 f2 75 40 b8 01 00 00 00 5d c3 48
89 fe 48 89 c2 48 c7 c7 18 75 fe 82 e8 cb 34 78 ff <0f> 0b 48 89 fe 48 c7 c7 50 75 fe 82 e8 ba 34 78 ff 0f 0b 48 89 f2
RSP: 0018:ffffc90001c2fe40 EFLAGS: 00010286
RAX: 000000000000004e RBX: ffffffffa0184000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff888237a17788 RDI: 00000000ffffffff
RBP: ffffc90001c2fe40 R08: 0000000000000000 R09: 0000000000000000
R10: ffffc90001c2fe10 R11: 0000000000000000 R12: 0000000000000000
R13: ffffc90001c2fe50 R14: ffffffffa0184000 R15: 0000000000000000
FS:  00007f3d83634540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555c350ea818 CR3: 0000000231677000 CR4: 00000000000006f0
Call Trace:
 unregister_pernet_operations+0x34/0x120
 unregister_pernet_subsys+0x1c/0x30
 packet_exit+0x1c/0x369 [af_packet
 __x64_sys_delete_module+0x156/0x260
 ? lockdep_hardirqs_on+0x133/0x1b0
 ? do_syscall_64+0x12/0x1f0
 do_syscall_64+0x6e/0x1f0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

When modprobe af_packet, register_pernet_subsys
fails and does a cleanup, ops->list is set to LIST_POISON1,
but the module init is considered to success, then while rmmod it,
BUG() is triggered in __list_del_entry_valid which is called from
unregister_pernet_subsys. This patch fix error handing path in
packet_init to avoid possilbe issue if some error occur.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09 13:45:46 -07:00
Chuck Lever
5940d1cf9f SUNRPC: Rebalance a kref in auth_gss.c
Restore the kref_get that matches the gss_put_auth(gss_msg->auth)
done by gss_release_msg().

Fixes: ac83228a71 ("SUNRPC: Use namespace of listening daemon ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-09 16:27:24 -04:00
Dan Carpenter
fe31ce83cb SUNRPC: Fix an error code in gss_alloc_msg()
If kstrdup_const() then this function returns zero (success) but it
should return -ENOMEM.

Fixes: ac83228a71 ("SUNRPC: Use namespace of listening daemon in the client AUTH_GSS upcall")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-09 16:26:56 -04:00
ZhangXiaoxu
9c5948c248 SUNRPC: task should be exit if encode return EKEYEXPIRED more times
If the rpc.gssd always return cred success, but now the cred is
expired, then the task will loop in call_refresh and call_transmit.

Exit the rpc task after retry.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-09 16:26:56 -04:00
Jakub Kicinski
494bc1d281 net/tcp: use deferred jump label for TCP acked data hook
User space can flip the clean_acked_data_enabled static branch
on and off with TLS offload when CONFIG_TLS_DEVICE is enabled.
jump_label.h suggests we use the delayed version in this case.

Deferred branches now also don't take the branch mutex on
decrement, so we avoid potential locking issues.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09 11:13:57 -07:00
David S. Miller
d7e163ced4 This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich (we forgot to include
    this patch previously ...)
 
  - fix multicast tt/tvlv worker locking, by Linus Lüssing
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlzUKp4WHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoX1kD/oCq7+iLO1oaewGrD/gnrGJu2Wk
 FDlVRRBvUOwdJM2zo7o5xLCU1+Pohv0H5uBIXUw4vhcXoqhkqyysLz9lotPrxRG1
 KD1Jl8FwOPLkSiLPxj7QurfBU8G0V/A7fYkAt+RTTkk9iSniL8lCc6hE6hj6KimP
 GaqnkYJyudhOkimYyLb4exUfq4Ls7J95x3LhN+YzXk8PgN3qgrD/dVuSu7dmuyse
 bfTP1dyWBtQiByEZqNCqrWnj2nBTvCQW9UwuOCg1W1z5coh5GWXTvL61ePu35/Vn
 0LxUNNHbbkjVwbVz9T3ExMTHlrV2nEOaZnw4kiuVAbqDNDqG5XkqD1iG2H5wVh74
 BPTtk2hd2b8PrvBsyzj2Q+RaTEk0o3jIbP6A/jM+9GpVukT1M0SohcYEFISUeh1f
 PXMBpAc7qVrP8+VefCkXGFWbKyRg60k2cJQipS/Q9Aef7xOv8DyDbOrnd/a5xPUS
 zXFRAsKx16gA8SUxNgYz/67jwamVhplHSpgKG2KFtEh0CxJJBY7Z9hlEPM3uZLmT
 QHUbTS+FTlBNS/Go48pnGd5QTEGNeP5gQkiovtLmIRULRTfbwbI1/4VQCp9/XahT
 Vj+4RhhULC7i4joqEeKv+GyO36tydsU93yBt66Xo0JI8W9SLfga5zREdrmlGYRZr
 ECBszSxPpQ8zp3A2Hg==
 =Dq5n
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20190509' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich (we forgot to include
   this patch previously ...)

 - fix multicast tt/tvlv worker locking, by Linus Lüssing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09 09:44:17 -07:00
Hangbin Liu
873017af77 vlan: disable SIOCSHWTSTAMP in container
With NET_ADMIN enabled in container, a normal user could be mapped to
root and is able to change the real device's rx filter via ioctl on
vlan, which would affect the other ptp process on host. Fix it by
disabling SIOCSHWTSTAMP in container.

Fixes: a6111d3c93 ("vlan: Pass SIOC[SG]HWTSTAMP ioctls to real device")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09 09:31:16 -07:00
Parthasarathy Bhuvaragan
ff946833b7 tipc: fix hanging clients using poll with EPOLLOUT flag
commit 517d7c79bd ("tipc: fix hanging poll() for stream sockets")
introduced a regression for clients using non-blocking sockets.
After the commit, we send EPOLLOUT event to the client even in
TIPC_CONNECTING state. This causes the subsequent send() to fail
with ENOTCONN, as the socket is still not in TIPC_ESTABLISHED state.

In this commit, we:
- improve the fix for hanging poll() by replacing sk_data_ready()
  with sk_state_change() to wake up all clients.
- revert the faulty updates introduced by commit 517d7c79bd
  ("tipc: fix hanging poll() for stream sockets").

Fixes: 517d7c79bd ("tipc: fix hanging poll() for stream sockets")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09 09:26:09 -07:00
Linus Torvalds
dce45af5c2 5.2 Merge Window pull request
This has been a smaller cycle than normal. One new driver was accepted,
 which is unusual, and at least one more driver remains in review on the
 list.
 
 - Driver fixes for hns, hfi1, nes, rxe, i40iw, mlx5, cxgb4, vmw_pvrdma
 
 - Many patches from MatthewW converting radix tree and IDR users to use
   xarray
 
 - Introduction of tracepoints to the MAD layer
 
 - Build large SGLs at the start for DMA mapping and get the driver to
   split them
 
 - Generally clean SGL handling code throughout the subsystem
 
 - Support for restricting RDMA devices to net namespaces for containers
 
 - Progress to remove object allocation boilerplate code from drivers
 
 - Change in how the mlx5 driver shows representor ports linked to VFs
 
 - mlx5 uapi feature to access the on chip SW ICM memory
 
 - Add a new driver for 'EFA'. This is HW that supports user space packet
   processing through QPs in Amazon's cloud
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlzTIU0ACgkQOG33FX4g
 mxrGKQ/8CqpyvuCyZDW5ovO4DI4YlzYSPXehWlwxA4CWhU1AYTujutnNOdZdngnz
 atTthOlJpZWJV26orvvzwIOi4qX/5UjLXEY3HYdn07JP1Z4iT7E3P4W2sdU3vdl3
 j8bU7xM7ZWmnGxrBZ6yQlVRadEhB8+HJIZWMw+wx66cIPnvU+g9NgwouH67HEEQ3
 PU8OCtGBwNNR508WPiZhjqMDfi/3BED4BfCihFhMbZEgFgObjRgtCV0M33SSXKcR
 IO2FGNVuDAUBlND3vU9guW1+M77xE6p1GvzkIgdCp6qTc724NuO5F2ngrpHKRyZT
 CxvBhAJI6tAZmjBVnmgVJex7rA8p+y/8M/2WD6GE3XSO89XVOkzNBiO2iTMeoxXr
 +CX6VvP2BWwCArxsfKMgW3j0h/WVE9w8Ciej1628m1NvvKEV4AGIJC1g93lIJkRN
 i3RkJ5PkIrdBrTEdKwDu1FdXQHaO7kGgKvwzJ7wBFhso8BRMrMfdULiMbaXs2Bw1
 WdL5zoSe/bLUpPZxcT9IjXRxY5qR0FpIOoo6925OmvyYe/oZo1zbitS5GGbvV90g
 tkq6Jb+aq8ZKtozwCo+oMcg9QPLYNibQsnkL3QirtURXWCG467xdgkaJLdF6s5Oh
 cp+YBqbR/8HNMG/KQlCfnNQKp1ci8mG3EdthQPhvdcZ4jtbqnSI=
 =TS64
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This has been a smaller cycle than normal. One new driver was
  accepted, which is unusual, and at least one more driver remains in
  review on the list.

  Summary:

   - Driver fixes for hns, hfi1, nes, rxe, i40iw, mlx5, cxgb4,
     vmw_pvrdma

   - Many patches from MatthewW converting radix tree and IDR users to
     use xarray

   - Introduction of tracepoints to the MAD layer

   - Build large SGLs at the start for DMA mapping and get the driver to
     split them

   - Generally clean SGL handling code throughout the subsystem

   - Support for restricting RDMA devices to net namespaces for
     containers

   - Progress to remove object allocation boilerplate code from drivers

   - Change in how the mlx5 driver shows representor ports linked to VFs

   - mlx5 uapi feature to access the on chip SW ICM memory

   - Add a new driver for 'EFA'. This is HW that supports user space
     packet processing through QPs in Amazon's cloud"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (186 commits)
  RDMA/ipoib: Allow user space differentiate between valid dev_port
  IB/core, ipoib: Do not overreact to SM LID change event
  RDMA/device: Don't fire uevent before device is fully initialized
  lib/scatterlist: Remove leftover from sg_page_iter comment
  RDMA/efa: Add driver to Kconfig/Makefile
  RDMA/efa: Add the efa module
  RDMA/efa: Add EFA verbs implementation
  RDMA/efa: Add common command handlers
  RDMA/efa: Implement functions that submit and complete admin commands
  RDMA/efa: Add the ABI definitions
  RDMA/efa: Add the com service API definitions
  RDMA/efa: Add the efa_com.h file
  RDMA/efa: Add the efa.h header file
  RDMA/efa: Add EFA device definitions
  RDMA: Add EFA related definitions
  RDMA/umem: Remove hugetlb flag
  RDMA/bnxt_re: Use core helpers to get aligned DMA address
  RDMA/i40iw: Use core helpers to get aligned DMA address within a supported page size
  RDMA/verbs: Add a DMA iterator to return aligned contiguous memory blocks
  RDMA/umem: Add API to find best driver supported page size in an MR
  ...
2019-05-09 09:02:46 -07:00
Florian Westphal
680f6af533 netfilter: ebtables: CONFIG_COMPAT: reject trailing data after last rule
If userspace provides a rule blob with trailing data after last target,
we trigger a splat, then convert ruleset to 64bit format (with trailing
data), then pass that to do_replace_finish() which then returns -EINVAL.

Erroring out right away avoids the splat plus unneeded translation and
error unwind.

Fixes: 81e675c227 ("netfilter: ebtables: add CONFIG_COMPAT support")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-09 08:54:49 +02:00
Pieter Jansen van Vuuren
5f05836831 net/sched: avoid double free on matchall reoffload
Avoid freeing cls_mall.rule twice when failing to setup flow_action
offload used in the hardware intermediate representation. This is
achieved by returning 0 when the setup fails but the skip software
flag has not been set.

Fixes: f00cbf1968 ("net/sched: use the hardware intermediate representation for matchall")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-08 16:34:58 -07:00
Geert Uytterhoeven
f319ca6557 openvswitch: Replace removed NF_NAT_NEEDED with IS_ENABLED(CONFIG_NF_NAT)
Commit 4806e97572 ("netfilter: replace NF_NAT_NEEDED with
IS_ENABLED(CONFIG_NF_NAT)") removed CONFIG_NF_NAT_NEEDED, but a new user
popped up afterwards.

Fixes: fec9c271b8 ("openvswitch: load and reference the NAT helper.")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-08 09:43:15 -07:00
David Ahern
19e4e76806 ipv4: Fix raw socket lookup for local traffic
inet_iif should be used for the raw socket lookup. inet_iif considers
rt_iif which handles the case of local traffic.

As it stands, ping to a local address with the '-I <dev>' option fails
ever since ping was changed to use SO_BINDTODEVICE instead of
cmsg + IP_PKTINFO.

IPv6 works fine.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-08 09:38:30 -07:00
Hangbin Liu
e9919a24d3 fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied
With commit 153380ec4b ("fib_rules: Added NLM_F_EXCL support to
fib_nl_newrule") we now able to check if a rule already exists. But this
only works with iproute2. For other tools like libnl, NetworkManager,
it still could add duplicate rules with only NLM_F_CREATE flag, like

[localhost ~ ]# ip rule
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default
100000: from 192.168.7.5 lookup 5
100000: from 192.168.7.5 lookup 5

As it doesn't make sense to create two duplicate rules, let's just return
0 if the rule exists.

Fixes: 153380ec4b ("fib_rules: Added NLM_F_EXCL support to fib_nl_newrule")
Reported-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-08 09:32:10 -07:00
Linus Torvalds
80f232121b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.

   2) Add fib_sync_mem to control the amount of dirty memory we allow to
      queue up between synchronize RCU calls, from David Ahern.

   3) Make flow classifier more lockless, from Vlad Buslov.

   4) Add PHY downshift support to aquantia driver, from Heiner
      Kallweit.

   5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
      contention on SLAB spinlocks in heavy RPC workloads.

   6) Partial GSO offload support in XFRM, from Boris Pismenny.

   7) Add fast link down support to ethtool, from Heiner Kallweit.

   8) Use siphash for IP ID generator, from Eric Dumazet.

   9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
      entries, from David Ahern.

  10) Move skb->xmit_more into a per-cpu variable, from Florian
      Westphal.

  11) Improve eBPF verifier speed and increase maximum program size,
      from Alexei Starovoitov.

  12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
      spinlocks. From Neil Brown.

  13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.

  14) Improve link partner cap detection in generic PHY code, from
      Heiner Kallweit.

  15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
      Maguire.

  16) Remove SKB list implementation assumptions in SCTP, your's truly.

  17) Various cleanups, optimizations, and simplifications in r8169
      driver. From Heiner Kallweit.

  18) Add memory accounting on TX and RX path of SCTP, from Xin Long.

  19) Switch PHY drivers over to use dynamic featue detection, from
      Heiner Kallweit.

  20) Support flow steering without masking in dpaa2-eth, from Ioana
      Ciocoi.

  21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
      Pirko.

  22) Increase the strict parsing of current and future netlink
      attributes, also export such policies to userspace. From Johannes
      Berg.

  23) Allow DSA tag drivers to be modular, from Andrew Lunn.

  24) Remove legacy DSA probing support, also from Andrew Lunn.

  25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
      Haabendal.

  26) Add a generic tracepoint for TX queue timeouts to ease debugging,
      from Cong Wang.

  27) More indirect call optimizations, from Paolo Abeni"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
  cxgb4: Fix error path in cxgb4_init_module
  net: phy: improve pause mode reporting in phy_print_status
  dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
  net: macb: Change interrupt and napi enable order in open
  net: ll_temac: Improve error message on error IRQ
  net/sched: remove block pointer from common offload structure
  net: ethernet: support of_get_mac_address new ERR_PTR error
  net: usb: smsc: fix warning reported by kbuild test robot
  staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
  net: dsa: support of_get_mac_address new ERR_PTR error
  net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
  vrf: sit mtu should not be updated when vrf netdev is the link
  net: dsa: Fix error cleanup path in dsa_init_module
  l2tp: Fix possible NULL pointer dereference
  taprio: add null check on sched_nest to avoid potential null pointer dereference
  net: mvpp2: cls: fix less than zero check on a u32 variable
  net_sched: sch_fq: handle non connected flows
  net_sched: sch_fq: do not assume EDT packets are ordered
  net: hns3: use devm_kcalloc when allocating desc_cb
  net: hns3: some cleanup for struct hns3_enet_ring
  ...
2019-05-07 22:03:58 -07:00
David S. Miller
a9e41a5296 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor conflict with the DSA legacy code removal.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 17:22:09 -07:00
Linus Torvalds
cf482a49af Driver core/kobject patches for 5.2-rc1
Here is the "big" set of driver core patches for 5.2-rc1
 
 There are a number of ACPI patches in here as well, as Rafael said they
 should go through this tree due to the driver core changes they
 required.  They have all been acked by the ACPI developers.
 
 There are also a number of small subsystem-specific changes in here, due
 to some changes to the kobject core code.  Those too have all been acked
 by the various subsystem maintainers.
 
 As for content, it's pretty boring outside of the ACPI changes:
   - spdx cleanups
   - kobject documentation updates
   - default attribute groups for kobjects
   - other minor kobject/driver core fixes
 
 All have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXNHDbw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynDAgCfbb4LBR6I50wFXb8JM/R6cAS7qrsAn1unshKV
 8XCYcif2RxjtdJWXbjdm
 =/rLh
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core/kobject updates from Greg KH:
 "Here is the "big" set of driver core patches for 5.2-rc1

  There are a number of ACPI patches in here as well, as Rafael said
  they should go through this tree due to the driver core changes they
  required. They have all been acked by the ACPI developers.

  There are also a number of small subsystem-specific changes in here,
  due to some changes to the kobject core code. Those too have all been
  acked by the various subsystem maintainers.

  As for content, it's pretty boring outside of the ACPI changes:
   - spdx cleanups
   - kobject documentation updates
   - default attribute groups for kobjects
   - other minor kobject/driver core fixes

  All have been in linux-next for a while with no reported issues"

* tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits)
  kobject: clean up the kobject add documentation a bit more
  kobject: Fix kernel-doc comment first line
  kobject: Remove docstring reference to kset
  firmware_loader: Fix a typo ("syfs" -> "sysfs")
  kobject: fix dereference before null check on kobj
  Revert "driver core: platform: Fix the usage of platform device name(pdev->name)"
  init/config: Do not select BUILD_BIN2C for IKCONFIG
  Provide in-kernel headers to make extending kernel easier
  kobject: Improve doc clarity kobject_init_and_add()
  kobject: Improve docs for kobject_add/del
  driver core: platform: Fix the usage of platform device name(pdev->name)
  livepatch: Replace klp_ktype_patch's default_attrs with groups
  cpufreq: schedutil: Replace default_attrs field with groups
  padata: Replace padata_attr_type default_attrs field with groups
  irqdesc: Replace irq_kobj_type's default_attrs field with groups
  net-sysfs: Replace ktype default_attrs field with groups
  block: Replace all ktype default_attrs with groups
  samples/kobject: Replace foo_ktype's default_attrs field with groups
  kobject: Add support for default attribute groups to kobj_type
  driver core: Postpone DMA tear-down until after devres release for probe failure
  ...
2019-05-07 13:01:40 -07:00
Pieter Jansen van Vuuren
d6787147e1 net/sched: remove block pointer from common offload structure
Based on feedback from Jiri avoid carrying a pointer to the tcf_block
structure in the tc_cls_common_offload structure. Instead store
a flag in driver private data which indicates if offloads apply
to a shared block at block binding time.

Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 12:23:40 -07:00
Petr Štetiar
a51645f70f net: ethernet: support of_get_mac_address new ERR_PTR error
There was NVMEM support added to of_get_mac_address, so it could now
return ERR_PTR encoded error values, so we need to adjust all current
users of of_get_mac_address to this new fact.

While at it, remove superfluous is_valid_ether_addr as the MAC address
returned from of_get_mac_address is always valid and checked by
is_valid_ether_addr anyway.

Fixes: d01f449c00 ("of_net: add NVMEM support to of_get_mac_address")
Signed-off-by: Petr Štetiar <ynezz@true.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 12:22:47 -07:00
Petr Štetiar
4974f9b7e0 net: dsa: support of_get_mac_address new ERR_PTR error
There was NVMEM support added to of_get_mac_address, so it could now
return ERR_PTR encoded error values, so we need to adjust all current
users of of_get_mac_address to this new fact.

While at it, remove superfluous is_valid_ether_addr as the MAC address
returned from of_get_mac_address is always valid and checked by
is_valid_ether_addr anyway.

Fixes: d01f449c00 ("of_net: add NVMEM support to of_get_mac_address")
Signed-off-by: Petr Štetiar <ynezz@true.cz>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 12:22:47 -07:00
Stephen Suryaputra
ff6ab32bd4 vrf: sit mtu should not be updated when vrf netdev is the link
VRF netdev mtu isn't typically set and have an mtu of 65536. When the
link of a tunnel is set, the tunnel mtu is changed from 1480 to the link
mtu minus tunnel header. In the case of VRF netdev is the link, then the
tunnel mtu becomes 65516. So, fix it by not setting the tunnel mtu in
this case.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 12:19:19 -07:00
YueHaibing
68be930249 net: dsa: Fix error cleanup path in dsa_init_module
BUG: unable to handle kernel paging request at ffffffffa01c5430
PGD 3270067 P4D 3270067 PUD 3271063 PMD 230bc5067 PTE 0
Oops: 0000 [#1
CPU: 0 PID: 6159 Comm: modprobe Not tainted 5.1.0+ #33
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:raw_notifier_chain_register+0x16/0x40
Code: 63 f8 66 90 e9 5d ff ff ff 90 90 90 90 90 90 90 90 90 90 90 55 48 8b 07 48 89 e5 48 85 c0 74 1c 8b 56 10 3b 50 10 7e 07 eb 12 <39> 50 10 7c 0d 48 8d 78 08 48 8b 40 08 48 85 c0 75 ee 48 89 46 08
RSP: 0018:ffffc90001c33c08 EFLAGS: 00010282
RAX: ffffffffa01c5420 RBX: ffffffffa01db420 RCX: 4fcef45928070a8b
RDX: 0000000000000000 RSI: ffffffffa01db420 RDI: ffffffffa01b0068
RBP: ffffc90001c33c08 R08: 000000003e0a33d0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000094443661 R12: ffff88822c320700
R13: ffff88823109be80 R14: 0000000000000000 R15: ffffc90001c33e78
FS:  00007fab8bd08540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa01c5430 CR3: 00000002297ea000 CR4: 00000000000006f0
Call Trace:
 register_netdevice_notifier+0x43/0x250
 ? 0xffffffffa01e0000
 dsa_slave_register_notifier+0x13/0x70 [dsa_core
 ? 0xffffffffa01e0000
 dsa_init_module+0x2e/0x1000 [dsa_core
 do_one_initcall+0x6c/0x3cc
 ? do_init_module+0x22/0x1f1
 ? rcu_read_lock_sched_held+0x97/0xb0
 ? kmem_cache_alloc_trace+0x325/0x3b0
 do_init_module+0x5b/0x1f1
 load_module+0x1db1/0x2690
 ? m_show+0x1d0/0x1d0
 __do_sys_finit_module+0xc5/0xd0
 __x64_sys_finit_module+0x15/0x20
 do_syscall_64+0x6b/0x1d0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Cleanup allocated resourses if there are errors,
otherwise it will trgger memleak.

Fixes: c9eb3e0f87 ("net: dsa: Add support for learning FDB through notification")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 12:17:23 -07:00
YueHaibing
638a3a1e34 l2tp: Fix possible NULL pointer dereference
BUG: unable to handle kernel NULL pointer dereference at 0000000000000128
PGD 0 P4D 0
Oops: 0000 [#1
CPU: 0 PID: 5697 Comm: modprobe Tainted: G        W         5.1.0-rc7+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:__lock_acquire+0x53/0x10b0
Code: 8b 1c 25 40 5e 01 00 4c 8b 6d 10 45 85 e4 0f 84 bd 06 00 00 44 8b 1d 7c d2 09 02 49 89 fe 41 89 d2 45 85 db 0f 84 47 02 00 00 <48> 81 3f a0 05 70 83 b8 00 00 00 00 44 0f 44 c0 83 fe 01 0f 86 3a
RSP: 0018:ffffc90001c07a28 EFLAGS: 00010002
RAX: 0000000000000000 RBX: ffff88822f038440 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000128
RBP: ffffc90001c07a88 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000128 R15: 0000000000000000
FS:  00007fead0811540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000128 CR3: 00000002310da000 CR4: 00000000000006f0
Call Trace:
 ? __lock_acquire+0x24e/0x10b0
 lock_acquire+0xdf/0x230
 ? flush_workqueue+0x71/0x530
 flush_workqueue+0x97/0x530
 ? flush_workqueue+0x71/0x530
 l2tp_exit_net+0x170/0x2b0 [l2tp_core
 ? l2tp_exit_net+0x93/0x2b0 [l2tp_core
 ops_exit_list.isra.6+0x36/0x60
 unregister_pernet_operations+0xb8/0x110
 unregister_pernet_device+0x25/0x40
 l2tp_init+0x55/0x1000 [l2tp_core
 ? 0xffffffffa018d000
 do_one_initcall+0x6c/0x3cc
 ? do_init_module+0x22/0x1f1
 ? rcu_read_lock_sched_held+0x97/0xb0
 ? kmem_cache_alloc_trace+0x325/0x3b0
 do_init_module+0x5b/0x1f1
 load_module+0x1db1/0x2690
 ? m_show+0x1d0/0x1d0
 __do_sys_finit_module+0xc5/0xd0
 __x64_sys_finit_module+0x15/0x20
 do_syscall_64+0x6b/0x1d0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fead031a839
Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe8d9acca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 0000560078398b80 RCX: 00007fead031a839
RDX: 0000000000000000 RSI: 000056007659dc2e RDI: 0000000000000003
RBP: 000056007659dc2e R08: 0000000000000000 R09: 0000560078398b80
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
R13: 00005600783a04a0 R14: 0000000000040000 R15: 0000560078398b80
Modules linked in: l2tp_core(+) e1000 ip_tables ipv6 [last unloaded: l2tp_core
CR2: 0000000000000128
---[ end trace 8322b2b8bf83f8e1

If alloc_workqueue fails in l2tp_init, l2tp_net_ops
is unregistered on failure path. Then l2tp_exit_net
is called which will flush NULL workqueue, this patch
add a NULL check to fix it.

Fixes: 67e04c29ec ("l2tp: unregister l2tp_net_ops on failure path")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 12:16:47 -07:00
Colin Ian King
e4acf42741 taprio: add null check on sched_nest to avoid potential null pointer dereference
The call to nla_nest_start_noflag can return a null pointer and currently
this is not being checked and this can lead to a null pointer dereference
when the null pointer sched_nest is passed to function nla_nest_end. Fix
this by adding in a null pointer check.

Addresses-Coverity: ("Dereference null return value")
Fixes: a3d43c0d56 ("taprio: Add support adding an admin schedule")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 12:15:33 -07:00
Linus Torvalds
41bc10cabe stream_open related patches for Linux 5.2
https://lore.kernel.org/linux-fsdevel/CAHk-=wg1tFzcaX2v9Z91vPJiBR486ddW5MtgDL02-fOen2F0Aw@mail.gmail.com/T/#m5b2d9ad3aeacea4bd6aa1964468ac074bf3aa5bf
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAuFiEECVWwJCUO7/z+QjZbZsp4hBP2dUkFAlzR1UgQHGtpcnJAbmV4
 ZWRpLmNvbQAKCRBmyniEE/Z1SZBiEACGw1LzUmjV9eBYFjqaUkgX/Zfcu42D4Ek2
 8MuWnNdRabtpGQq0LccYlfoL3yH5xECp14IkCgJvkjqoZ3CcqWcv6uDxf0WtnUqZ
 wPx1RYZykb4RZj2A6/ndhInReP4AlXICyTVulKb+BquVkemMvmXX8k+bkr/msKfT
 9jdKWFIn+ANNABt3y2D7ywZvs9mkxIx+Fti+tVV4BFBeGfUuj4ArZBOHnngRnIk/
 XYlQ7FVzENSPSB+3GvL34jTGEzo8suPHKhHQlIhtcd5hwzVRZKE2sdVXsCc6/WbY
 YnT32gmT1/+cUuDl1mZSiQY5R4Xkb07k6/jNrdmjQpwmWbZu90cuRhb+JBXwnmjZ
 2Wgy3sfwYISDxtePukg1iYePlHlVlGTYqMo3AQrTBs/gEwCKWrsKQb98mRxlf1YK
 e2mdtmq6upYoorLFQesfRgrCg4GTBiPkrR3amXsFgJ2O5fhV6R98ZdGSv4kip19f
 ZNoc/t1EtKGwyAJwjINduv36E3RSHODWwSPtSnmSS1ieCGToY1SI3bVUkFM4C0tO
 5GMdSugHgXRGGVbTd/VftndJm6Wtj8b1j8c/1Vh04Q8qbKKJDRTDzAbK1v8oLaDh
 UXAKMIc8uY4caZy3/bTAB2Ou9dibrSi8Oc+LwZqJlwIcbkwn/IGNvmwtWv4ehorE
 N7EhCFZsFQ==
 =Mavg
 -----END PGP SIGNATURE-----

Merge tag 'stream_open-5.2' of https://lab.nexedi.com/kirr/linux

Pull stream_open conversion from Kirill Smelkov:

 - remove unnecessary double nonseekable_open from drivers/char/dtlk.c
   as noticed by Pavel Machek while reviewing nonseekable_open ->
   stream_open mass conversion.

 - the mass conversion patch promised in commit 10dce8af34 ("fs:
   stream_open - opener for stream-like files so that read and write can
   run simultaneously without deadlock") and is automatically generated
   by running

        $ make coccicheck MODE=patch COCCI=scripts/coccinelle/api/stream_open.cocci

   I've verified each generated change manually - that it is correct to
   convert - and each other nonseekable_open instance left - that it is
   either not correct to convert there, or that it is not converted due
   to current stream_open.cocci limitations. More details on this in the
   patch.

 - finally, change VFS to pass ppos=NULL into .read/.write for files
   that declare themselves streams. It was suggested by Rasmus Villemoes
   and makes sure that if ppos starts to be erroneously used in a stream
   file, such bug won't go unnoticed and will produce an oops instead of
   creating illusion of position change being taken into account.

   Note: this patch does not conflict with "fuse: Add FOPEN_STREAM to
   use stream_open()" that will be hopefully coming via FUSE tree,
   because fs/fuse/ uses new-style .read_iter/.write_iter, and for these
   accessors position is still passed as non-pointer kiocb.ki_pos .

* tag 'stream_open-5.2' of https://lab.nexedi.com/kirr/linux:
  vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
  *: convert stream-like files from nonseekable_open -> stream_open
  dtlk: remove double call to nonseekable_open
2019-05-07 12:15:13 -07:00
Eric Dumazet
37c0aead79 net_sched: sch_fq: handle non connected flows
FQ packet scheduler assumed that packets could be classified
based on their owning socket.

This means that if a UDP server uses one UDP socket to send
packets to different destinations, packets all land
in one FQ flow.

This is unfair, since each TCP flow has a unique bucket, meaning
that in case of pressure (fully utilised uplink), TCP flows
have more share of the bandwidth.

If we instead detect unconnected sockets, we can use a stochastic
hash based on the 4-tuple hash.

This also means a QUIC server using one UDP socket will properly
spread the outgoing packets to different buckets, and in-kernel
pacing based on EDT model will no longer risk having big rb-tree on
one flow.

Note that UDP application might provide the skb->hash in an
ancillary message at sendmsg() time to avoid the cost of a dissection
in fq packet scheduler.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 12:09:25 -07:00
Eric Dumazet
eeb84aa0d0 net_sched: sch_fq: do not assume EDT packets are ordered
TCP stack makes sure packets for a given flow are monotically
increasing, but we want to allow UDP packets to use EDT as
well, so that QUIC servers can use in-kernel pacing.

This patch adds a per-flow rb-tree on which packets might
be stored. We still try to use the linear list for the
typical cases where packets are queued with monotically
increasing skb->tstamp, since queue/dequeue packets on
a standard list is O(1).

Note that the ability to store packets in arbitrary EDT
order will allow us to implement later a per TCP socket
mechanism adding delays (with jitter eventually) and reorders,
to implement convenient network emulators.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 12:09:25 -07:00
Linus Torvalds
168e153d5e Merge branch 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs inode freeing updates from Al Viro:
 "Introduction of separate method for RCU-delayed part of
  ->destroy_inode() (if any).

  Pretty much as posted, except that destroy_inode() stashes
  ->free_inode into the victim (anon-unioned with ->i_fops) before
  scheduling i_callback() and the last two patches (sockfs conversion
  and folding struct socket_wq into struct socket) are excluded - that
  pair should go through netdev once davem reopens his tree"

* 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
  orangefs: make use of ->free_inode()
  shmem: make use of ->free_inode()
  hugetlb: make use of ->free_inode()
  overlayfs: make use of ->free_inode()
  jfs: switch to ->free_inode()
  fuse: switch to ->free_inode()
  ext4: make use of ->free_inode()
  ecryptfs: make use of ->free_inode()
  ceph: use ->free_inode()
  btrfs: use ->free_inode()
  afs: switch to use of ->free_inode()
  dax: make use of ->free_inode()
  ntfs: switch to ->free_inode()
  securityfs: switch to ->free_inode()
  apparmor: switch to ->free_inode()
  rpcpipe: switch to ->free_inode()
  bpf: switch to ->free_inode()
  mqueue: switch to ->free_inode()
  ufs: switch to ->free_inode()
  coda: switch to ->free_inode()
  ...
2019-05-07 10:57:05 -07:00
Jeff Layton
b726ec972c libceph: make ceph_pr_addr take an struct ceph_entity_addr pointer
GCC9 is throwing a lot of warnings about unaligned accesses by
callers of ceph_pr_addr. All of the current callers are passing a
pointer to the sockaddr inside struct ceph_entity_addr.

Fix it to take a pointer to a struct ceph_entity_addr instead,
and then have the function make a copy of the sockaddr before
printing it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07 19:43:05 +02:00
Jeff Layton
cede185b1b libceph: fix unaligned accesses in ceph_entity_addr handling
GCC9 is throwing a lot of warnings about unaligned access. This patch
fixes some of them by changing most of the sockaddr handling functions
to take a pointer to struct ceph_entity_addr instead of struct
sockaddr_storage.  The lower functions can then make copies or do
unaligned accesses as needed.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07 19:43:05 +02:00
David S. Miller
14cfbdac66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-05-06

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Two AF_XDP libbpf fixes for socket teardown; first one an invalid
   munmap and the other one an invalid skmap cleanup, both from Björn.

2) More graceful CONFIG_DEBUG_INFO_BTF handling when pahole is not
   present in the system to generate vmlinux btf info, from Andrii.

3) Fix libbpf and thus fix perf build error with uClibc on arc
   architecture, from Vineet.

4) Fix missing libbpf_util.h header install in libbpf, from William.

5) Exclude bash-completion/bpftool from .gitignore pattern, from Masahiro.

6) Fix up rlimit in test_libbpf_open kselftest test case, from Yonghong.

7) Minor misc cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 09:29:16 -07:00
Linus Torvalds
0968621917 Printk changes for 5.2
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA
 lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0
 m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO
 EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK
 r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k
 FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq
 uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj
 lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR
 waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w
 wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55
 Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10
 c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY=
 =WZfC
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Allow state reset of printk_once() calls.

 - Prevent crashes when dereferencing invalid pointers in vsprintf().
   Only the first byte is checked for simplicity.

 - Make vsprintf warnings consistent and inlined.

 - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf
   modifiers.

 - Some clean up of vsprintf and test_printf code.

* tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  lib/vsprintf: Make function pointer_string static
  vsprintf: Limit the length of inlined error messages
  vsprintf: Avoid confusion between invalid address and value
  vsprintf: Prevent crash when dereferencing invalid pointers
  vsprintf: Consolidate handling of unknown pointer specifiers
  vsprintf: Factor out %pO handler as kobject_string()
  vsprintf: Factor out %pV handler as va_format()
  vsprintf: Factor out %p[iI] handler as ip_addr_string()
  vsprintf: Do not check address of well-known strings
  vsprintf: Consistent %pK handling for kptr_restrict == 0
  vsprintf: Shuffle restricted_pointer()
  printk: Tie printk_once / printk_deferred_once into .data.once for reset
  treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
  lib/test_printf: Switch to bitmap_zalloc()
2019-05-07 09:18:12 -07:00
Linus Torvalds
81ff5d2cba Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "API:
   - Add support for AEAD in simd
   - Add fuzz testing to testmgr
   - Add panic_on_fail module parameter to testmgr
   - Use per-CPU struct instead multiple variables in scompress
   - Change verify API for akcipher

  Algorithms:
   - Convert x86 AEAD algorithms over to simd
   - Forbid 2-key 3DES in FIPS mode
   - Add EC-RDSA (GOST 34.10) algorithm

  Drivers:
   - Set output IV with ctr-aes in crypto4xx
   - Set output IV in rockchip
   - Fix potential length overflow with hashing in sun4i-ss
   - Fix computation error with ctr in vmx
   - Add SM4 protected keys support in ccree
   - Remove long-broken mxc-scc driver
   - Add rfc4106(gcm(aes)) cipher support in cavium/nitrox"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (179 commits)
  crypto: ccree - use a proper le32 type for le32 val
  crypto: ccree - remove set but not used variable 'du_size'
  crypto: ccree - Make cc_sec_disable static
  crypto: ccree - fix spelling mistake "protedcted" -> "protected"
  crypto: caam/qi2 - generate hash keys in-place
  crypto: caam/qi2 - fix DMA mapping of stack memory
  crypto: caam/qi2 - fix zero-length buffer DMA mapping
  crypto: stm32/cryp - update to return iv_out
  crypto: stm32/cryp - remove request mutex protection
  crypto: stm32/cryp - add weak key check for DES
  crypto: atmel - remove set but not used variable 'alg_name'
  crypto: picoxcell - Use dev_get_drvdata()
  crypto: crypto4xx - get rid of redundant using_sd variable
  crypto: crypto4xx - use sync skcipher for fallback
  crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues
  crypto: crypto4xx - fix ctr-aes missing output IV
  crypto: ecrdsa - select ASN1 and OID_REGISTRY for EC-RDSA
  crypto: ux500 - use ccflags-y instead of CFLAGS_<basename>.o
  crypto: ccree - handle tee fips error during power management resume
  crypto: ccree - add function to handle cryptocell tee fips error
  ...
2019-05-06 20:15:06 -07:00
Linus Torvalds
a0e928ed7c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
 "This cycle had the following changes:

   - Timer tracing improvements (Anna-Maria Gleixner)

   - Continued tasklet reduction work: remove the hrtimer_tasklet
     (Thomas Gleixner)

   - Fix CPU hotplug remove race in the tick-broadcast mask handling
     code (Thomas Gleixner)

   - Force upper bound for setting CLOCK_REALTIME, to fix ABI
     inconsistencies with handling values that are close to the maximum
     supported and the vagueness of when uptime related wraparound might
     occur. Make the consistent maximum the year 2232 across all
     relevant ABIs and APIs. (Thomas Gleixner)

   - various cleanups and smaller fixes"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: Fix typos in comments
  tick/broadcast: Fix warning about undefined tick_broadcast_oneshot_offline()
  timekeeping: Force upper bound for setting CLOCK_REALTIME
  timer/trace: Improve timer tracing
  timer/trace: Replace deprecated vsprintf pointer extension %pf by %ps
  timer: Move trace point to get proper index
  tick/sched: Update tick_sched struct documentation
  tick: Remove outgoing CPU from broadcast masks
  timekeeping: Consistently use unsigned int for seqcount snapshot
  softirq: Remove tasklet_hrtimer
  xfrm: Replace hrtimer tasklet with softirq hrtimer
  mac80211_hwsim: Replace hrtimer tasklet with softirq hrtimer
2019-05-06 14:50:46 -07:00
Linus Torvalds
5ba2a4b12f Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "This cycles's RCU changes include:

   - a couple of straggling RCU flavor consolidation updates

   - SRCU updates

   - RCU CPU stall-warning updates

   - torture-test updates

   - an LKMM commit adding support for synchronize_srcu_expedited()

   - documentation updates

   - miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  net/ipv4/netfilter: Update comment from call_rcu_bh() to call_rcu()
  tools/memory-model: Add support for synchronize_srcu_expedited()
  doc/kprobes: Update obsolete RCU update functions
  torture: Suppress false-positive CONFIG_INITRAMFS_SOURCE complaint
  locktorture: NULL cxt.lwsa and cxt.lrsa to allow bad-arg detection
  rcuperf: Fix cleanup path for invalid perf_type strings
  rcutorture: Fix cleanup path for invalid torture_type strings
  rcutorture: Fix expected forward progress duration in OOM notifier
  rcutorture: Remove ->ext_irq_conflict field
  rcutorture: Make rcutorture_extend_mask() comment match the code
  tools/.../rcutorture: Convert to SPDX license identifier
  torture: Don't try to offline the last CPU
  rcu: Fix nohz status in stall warning
  rcu: Move forward-progress checkers into tree_stall.h
  rcu: Move irq-disabled stall-warning checking to tree_stall.h
  rcu: Organize functions in tree_stall.h
  rcu: Move FAST_NO_HZ stall-warning code to tree_stall.h
  rcu: Inline RCU stall-warning info helper functions
  rcu: Move rcu_print_task_exp_stall() to tree_exp.h
  rcu: Inline RCU task stall-warning helper functions
  ...
2019-05-06 12:04:02 -07:00
Kirill Smelkov
c5bf68fe0c *: convert stream-like files from nonseekable_open -> stream_open
Using scripts/coccinelle/api/stream_open.cocci added in 10dce8af34
("fs: stream_open - opener for stream-like files so that read and write
can run simultaneously without deadlock"), search and convert to
stream_open all in-kernel nonseekable_open users for which read and
write actually do not depend on ppos and where there is no other methods
in file_operations which assume @offset access.

I've verified each generated change manually - that it is correct to convert -
and each other nonseekable_open instance left - that it is either not correct
to convert there, or that it is not converted due to current stream_open.cocci
limitations. The script also does not convert files that should be valid to
convert, but that currently have .llseek = noop_llseek or generic_file_llseek
for unknown reason despite file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c)

Among cases converted 14 were potentially vulnerable to read vs write deadlock
(see details in 10dce8af34):

	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/infiniband/core/user_mad.c:988:1-17: ERROR: umad_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/input/misc/uinput.c:401:1-17: ERROR: uinput_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.
	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write(); change nonseekable_open -> stream_open to fix.

and the rest were just safe to convert to stream_open because their read and
write do not use ppos at all and corresponding file_operations do not
have methods that assume @offset file access(*):

	arch/powerpc/platforms/52xx/mpc52xx_gpt.c:631:8-24: WARNING: mpc52xx_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_ibox_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_ibox_stat_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_mbox_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_mbox_stat_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_wbox_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/powerpc/platforms/cell/spufs/file.c:591:8-24: WARNING: spufs_wbox_stat_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/um/drivers/harddog_kern.c:88:8-24: WARNING: harddog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	arch/x86/kernel/cpu/microcode/core.c:430:33-49: WARNING: microcode_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/char/ds1620.c:215:8-24: WARNING: ds1620_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/char/dtlk.c:301:1-17: WARNING: dtlk_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/char/ipmi/ipmi_watchdog.c:840:9-25: WARNING: ipmi_wdog_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/char/pcmcia/scr24x_cs.c:95:8-24: WARNING: scr24x_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/char/tb0219.c:246:9-25: WARNING: tb0219_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/firewire/nosy.c:306:8-24: WARNING: nosy_ops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/hwmon/fschmd.c:840:8-24: WARNING: watchdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/hwmon/w83793.c:1344:8-24: WARNING: watchdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/infiniband/core/ucma.c:1747:8-24: WARNING: ucma_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/infiniband/core/ucm.c:1178:8-24: WARNING: ucm_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/infiniband/core/uverbs_main.c:1086:8-24: WARNING: uverbs_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/input/joydev.c:282:1-17: WARNING: joydev_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/pci/switch/switchtec.c:393:1-17: WARNING: switchtec_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/platform/chrome/cros_ec_debugfs.c:135:8-24: WARNING: cros_ec_console_log_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/rtc/rtc-ds1374.c:470:9-25: WARNING: ds1374_wdt_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/rtc/rtc-m41t80.c:805:9-25: WARNING: wdt_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/s390/char/tape_char.c:293:2-18: WARNING: tape_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/s390/char/zcore.c:194:8-24: WARNING: zcore_reipl_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/s390/crypto/zcrypt_api.c:528:8-24: WARNING: zcrypt_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/spi/spidev.c:594:1-17: WARNING: spidev_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/staging/pi433/pi433_if.c:974:1-17: WARNING: pi433_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/acquirewdt.c:203:8-24: WARNING: acq_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/advantechwdt.c:202:8-24: WARNING: advwdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/alim1535_wdt.c:252:8-24: WARNING: ali_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/alim7101_wdt.c:217:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ar7_wdt.c:166:8-24: WARNING: ar7_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/at91rm9200_wdt.c:113:8-24: WARNING: at91wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ath79_wdt.c:135:8-24: WARNING: ath79_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/bcm63xx_wdt.c:119:8-24: WARNING: bcm63xx_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/cpu5wdt.c:143:8-24: WARNING: cpu5wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/cpwd.c:397:8-24: WARNING: cpwd_fops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/eurotechwdt.c:319:8-24: WARNING: eurwdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/f71808e_wdt.c:528:8-24: WARNING: watchdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/gef_wdt.c:232:8-24: WARNING: gef_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/geodewdt.c:95:8-24: WARNING: geodewdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ib700wdt.c:241:8-24: WARNING: ibwdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ibmasr.c:326:8-24: WARNING: asr_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/indydog.c:80:8-24: WARNING: indydog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/intel_scu_watchdog.c:307:8-24: WARNING: intel_scu_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/iop_wdt.c:104:8-24: WARNING: iop_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/it8712f_wdt.c:330:8-24: WARNING: it8712f_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ixp4xx_wdt.c:68:8-24: WARNING: ixp4xx_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/ks8695_wdt.c:145:8-24: WARNING: ks8695wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/m54xx_wdt.c:88:8-24: WARNING: m54xx_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/machzwd.c:336:8-24: WARNING: zf_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/mixcomwd.c:153:8-24: WARNING: mixcomwd_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/mtx-1_wdt.c:121:8-24: WARNING: mtx1_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/mv64x60_wdt.c:136:8-24: WARNING: mv64x60_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/nuc900_wdt.c:134:8-24: WARNING: nuc900wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/nv_tco.c:164:8-24: WARNING: nv_tco_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pc87413_wdt.c:289:8-24: WARNING: pc87413_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd.c:698:8-24: WARNING: pcwd_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd.c:737:8-24: WARNING: pcwd_temp_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd_pci.c:581:8-24: WARNING: pcipcwd_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd_pci.c:623:8-24: WARNING: pcipcwd_temp_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd_usb.c:488:8-24: WARNING: usb_pcwd_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pcwd_usb.c:527:8-24: WARNING: usb_pcwd_temperature_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pika_wdt.c:121:8-24: WARNING: pikawdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/pnx833x_wdt.c:119:8-24: WARNING: pnx833x_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/rc32434_wdt.c:153:8-24: WARNING: rc32434_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/rdc321x_wdt.c:145:8-24: WARNING: rdc321x_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/riowd.c:79:1-17: WARNING: riowd_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sa1100_wdt.c:62:8-24: WARNING: sa1100dog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sbc60xxwdt.c:211:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sbc7240_wdt.c:139:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sbc8360.c:274:8-24: WARNING: sbc8360_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sbc_epx_c3.c:81:8-24: WARNING: epx_c3_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sbc_fitpc2_wdt.c:78:8-24: WARNING: fitpc2_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sb_wdog.c:108:1-17: WARNING: sbwdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sc1200wdt.c:181:8-24: WARNING: sc1200wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sc520_wdt.c:261:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/sch311x_wdt.c:319:8-24: WARNING: sch311x_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/scx200_wdt.c:105:8-24: WARNING: scx200_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/smsc37b787_wdt.c:369:8-24: WARNING: wb_smsc_wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/w83877f_wdt.c:227:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/w83977f_wdt.c:301:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wafer5823wdt.c:200:8-24: WARNING: wafwdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/watchdog_dev.c:828:8-24: WARNING: watchdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdrtas.c:379:8-24: WARNING: wdrtas_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdrtas.c:445:8-24: WARNING: wdrtas_temp_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt285.c:104:1-17: WARNING: watchdog_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt977.c:276:8-24: WARNING: wdt977_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt.c:424:8-24: WARNING: wdt_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt.c:484:8-24: WARNING: wdt_temp_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt_pci.c:464:8-24: WARNING: wdtpci_fops: .write() has stream semantic; safe to change nonseekable_open -> stream_open.
	drivers/watchdog/wdt_pci.c:527:8-24: WARNING: wdtpci_temp_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	net/batman-adv/log.c:105:1-17: WARNING: batadv_log_fops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	sound/core/control.c:57:7-23: WARNING: snd_ctl_f_ops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.
	sound/core/rawmidi.c:385:7-23: WARNING: snd_rawmidi_f_ops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	sound/core/seq/seq_clientmgr.c:310:7-23: WARNING: snd_seq_f_ops: .read() and .write() have stream semantic; safe to change nonseekable_open -> stream_open.
	sound/core/timer.c:1428:7-23: WARNING: snd_timer_f_ops: .read() has stream semantic; safe to change nonseekable_open -> stream_open.

One can also recheck/review the patch via generating it with explanation comments included via

	$ make coccicheck MODE=patch COCCI=scripts/coccinelle/api/stream_open.cocci SPFLAGS="-D explain"

(*) This second group also contains cases with read/write deadlocks that
stream_open.cocci don't yet detect, but which are still valid to convert to
stream_open since ppos is not used. For example drivers/pci/switch/switchtec.c
calls wait_for_completion_interruptible() in its .read, but stream_open.cocci
currently detects only "wait_event*" as blocking.

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Anatolij Gustschin <agust@denx.de>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James R. Van Zandt" <jrv@vanzandt.mv.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: Harald Welte <laforge@gnumonks.org>
Acked-by: Lubomir Rintel <lkundrak@v3.sk> [scr24x_cs]
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Johan Hovold <johan@kernel.org>
Cc: David Herrmann <dh.herrmann@googlemail.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: Jean Delvare <jdelvare@suse.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>	[watchdog/* hwmon/*]
Cc: Rudolf Marek <r.marek@assembler.cz>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Kurt Schwemmer <kurt.schwemmer@microsemi.com>
Acked-by: Logan Gunthorpe <logang@deltatee.com> [drivers/pci/switch/switchtec]
Acked-by: Bjorn Helgaas <bhelgaas@google.com> [drivers/pci/switch/switchtec]
Cc: Benson Leung <bleung@chromium.org>
Acked-by: Enric Balletbo i Serra <enric.balletbo@collabora.com> [platform/chrome]
Cc: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> [rtc/*]
Cc: Mark Brown <broonie@kernel.org>
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: bcm-kernel-feedback-list@broadcom.com
Cc: Wan ZongShun <mcuos.com@gmail.com>
Cc: Zwane Mwaikambo <zwanem@gmail.com>
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Antonio Quartulli <a@unstable.cc>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
2019-05-06 17:46:41 +03:00
Taehee Yoo
8cd2bc981c netfilter: nf_flow_table: do not flow offload deleted conntrack entries
Conntrack entries can be deleted by the masquerade module. In that case,
flow offload should be deleted too, but GC and data-path of flow offload
do not check for conntrack status bits, hence flow offload entries will
be removed only by the timeout.

Update garbage collector and data-path to check for ct->status. If
IPS_DYING_BIT is set, garbage collector removes flow offload entries and
data-path routine ignores them.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 15:15:09 +02:00
Subash Abhinov Kasiviswanathan
b33c448c4f netfilter: nf_conntrack_h323: Remove deprecated config check
CONFIG_NF_CONNTRACK_IPV6 has been deprecated so replace it with a check
for IPV6 instead.

Use nf_ip6_route6() instead of v6ops->route() and keep the IS_MODULE()
in nf_ipv6_ops as mentioned by Florian so that direct calls are used
when IPV6 is builtin and indirect calls are used only when IPV6 is a
module.

Fixes: a0ae2562c6 ("netfilter: conntrack: remove l3proto abstraction")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 15:15:09 +02:00
Kristian Evensen
f8e6089820 netfilter: ctnetlink: Resolve conntrack L3-protocol flush regression
Commit 59c08c69c2 ("netfilter: ctnetlink: Support L3 protocol-filter
on flush") introduced a user-space regression when flushing connection
track entries. Before this commit, the nfgen_family field was not used
by the kernel and all entries were removed. Since this commit,
nfgen_family is used to filter out entries that should not be removed.
One example a broken tool is conntrack. conntrack always sets
nfgen_family to AF_INET, so after 59c08c69c2 only IPv4 entries were
removed with the -F parameter.

Pablo Neira Ayuso suggested using nfgenmsg->version to resolve the
regression, and this commit implements his suggestion. nfgenmsg->version
is so far set to zero, so it is well-suited to be used as a flag for
selecting old or new flush behavior. If version is 0, nfgen_family is
ignored and all entries are used. If user-space sets the version to one
(or any other value than 0), then the new behavior is used. As version
only can have two valid values, I chose not to add a new
NFNETLINK_VERSION-constant.

Fixes: 59c08c69c2 ("netfilter: ctnetlink: Support L3 protocol-filter on flush")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 15:15:02 +02:00
Linus Lüssing
a3c7cd0cdf batman-adv: mcast: fix multicast tt/tvlv worker locking
Syzbot has reported some issues with the locking assumptions made for
the multicast tt/tvlv worker: It was able to trigger the WARN_ON() in
batadv_mcast_mla_tt_retract() and batadv_mcast_mla_tt_add().
While hard/not reproduceable for us so far it seems that the
delayed_work_pending() we use might not be quite safe from reordering.

Therefore this patch adds an explicit, new spinlock to protect the
update of the mla_list and flags in bat_priv and then removes the
WARN_ON(delayed_work_pending()).

Reported-by: syzbot+83f2d54ec6b7e417e13f@syzkaller.appspotmail.com
Reported-by: syzbot+050927a651272b145a5d@syzkaller.appspotmail.com
Reported-by: syzbot+979ffc89b87309b1b94b@syzkaller.appspotmail.com
Reported-by: syzbot+f9f3f388440283da2965@syzkaller.appspotmail.com
Fixes: cbebd363b2 ("batman-adv: Use own timer for multicast TT and TVLV updates")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-05-06 11:40:46 +02:00
Simon Wunderlich
bdc76fd299 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-05-06 11:40:46 +02:00
Cong Wang
b362487a3b sch_htb: redefine htb qdisc overlimits
In commit 3c75f6ee13 ("net_sched: sch_htb: add per class overlimits counter")
we added an overlimits counter for each HTB class which could
properly reflect how many times we use up all the bandwidth
on each class. However, the overlimits counter in HTB qdisc
does not, it is way bigger than the sum of each HTB class.
In fact, this qdisc overlimits counter increases when we have
no skb to dequeue, which happens more often than we run out of
bandwidth.

It makes more sense to make this qdisc overlimits counter just
be a sum of each HTB class, in case people still get confused.

I have verified this patch with one single HTB class, where HTB
qdisc counters now always match HTB class counters as expected.

Eric suggested we could fold this field into 'direct_pkts' as
we only use its 32bit on 64bit CPU, this saves one cache line.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:55:20 -07:00
Vladimir Oltean
227d07a07e net: dsa: sja1105: Add support for traffic through standalone ports
In order to support this, we are creating a make-shift switch tag out of
a VLAN trunk configured on the CPU port. Termination of normal traffic
on switch ports only works when not under a vlan_filtering bridge.
Termination of management (PTP, BPDU) traffic works under all
circumstances because it uses a different tagging mechanism
(incl_srcpt). We are making use of the generic CONFIG_NET_DSA_TAG_8021Q
code and leveraging it from our own CONFIG_NET_DSA_TAG_SJA1105.

There are two types of traffic: regular and link-local.

The link-local traffic received on the CPU port is trapped from the
switch's regular forwarding decisions because it matched one of the two
DMAC filters for management traffic.

On transmission, the switch requires special massaging for these
link-local frames. Due to a weird implementation of the switching IP, by
default it drops link-local frames that originate on the CPU port.
It needs to be told where to forward them to, through an SPI command
("management route") that is valid for only a single frame.
So when we're sending link-local traffic, we are using the
dsa_defer_xmit mechanism.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Vladimir Oltean
97a69a0dea net: dsa: Add support for deferred xmit
Some hardware needs to take work to get convinced to receive frames on
the CPU port (such as the sja1105 which takes temporary L2 forwarding
rules over SPI that last for a single frame). Such work needs a
sleepable context, and because the regular .ndo_start_xmit is atomic,
this cannot be done in the tagger. So introduce a generic DSA mechanism
that sets up a transmit skb queue and a workqueue for deferred
transmission.

The new driver callback (.port_deferred_xmit) is in dsa_switch and not
in the tagger because the operations that require sleeping typically
also involve interacting with the hardware, and not simply skb
manipulations. Therefore having it there simplifies the structure a bit
and makes it unnecessary to export functions from the driver to the
tagger.

The driver is responsible of calling dsa_enqueue_skb which transfers it
to the master netdevice. This is so that it has a chance of performing
some more work afterwards, such as cleanup or TX timestamping.

To tell DSA that skb xmit deferral is required, I have thought about
changing the return type of the tagger .xmit from struct sk_buff * into
a enum dsa_tx_t that could potentially encode a DSA_XMIT_DEFER value.

But the trailer tagger is reallocating every skb on xmit and therefore
making a valid use of the pointer return value. So instead of reworking
the API in complicated ways, right now a boolean property in the newly
introduced DSA_SKB_CB is set.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Vladimir Oltean
cc1939e4b3 net: dsa: Allow drivers to filter packets they can decode source port from
Frames get processed by DSA and redirected to switch port net devices
based on the ETH_P_XDSA multiplexed packet_type handler found by the
network stack when calling eth_type_trans().

The running assumption is that once the DSA .rcv function is called, DSA
is always able to decode the switch tag in order to change the skb->dev
from its master.

However there are tagging protocols (such as the new DSA_TAG_PROTO_SJA1105,
user of DSA_TAG_PROTO_8021Q) where this assumption is not completely
true, since switch tagging piggybacks on the absence of a vlan_filtering
bridge. Moreover, management traffic (BPDU, PTP) for this switch doesn't
rely on switch tagging, but on a different mechanism. So it would make
sense to at least be able to terminate that.

Having DSA receive traffic it can't decode would put it in an impossible
situation: the eth_type_trans() function would invoke the DSA .rcv(),
which could not change skb->dev, then eth_type_trans() would be invoked
again, which again would call the DSA .rcv, and the packet would never
be able to exit the DSA filter and would spiral in a loop until the
whole system dies.

This happens because eth_type_trans() doesn't actually look at the skb
(so as to identify a potential tag) when it deems it as being
ETH_P_XDSA. It just checks whether skb->dev has a DSA private pointer
installed (therefore it's a DSA master) and that there exists a .rcv
callback (everybody except DSA_TAG_PROTO_NONE has that). This is
understandable as there are many switch tags out there, and exhaustively
checking for all of them is far from ideal.

The solution lies in introducing a filtering function for each tagging
protocol. In the absence of a filtering function, all traffic is passed
to the .rcv DSA callback. The tagging protocol should see the filtering
function as a pre-validation that it can decode the incoming skb. The
traffic that doesn't match the filter will bypass the DSA .rcv callback
and be left on the master netdevice, which wasn't previously possible.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Vladimir Oltean
f9bbe4477c net: dsa: Optional VLAN-based port separation for switches without tagging
This patch provides generic DSA code for using VLAN (802.1Q) tags for
the same purpose as a dedicated switch tag for injection/extraction.
It is based on the discussions and interest that has been so far
expressed in https://www.spinics.net/lists/netdev/msg556125.html.

Unlike all other DSA-supported tagging protocols, CONFIG_NET_DSA_TAG_8021Q
does not offer a complete solution for drivers (nor can it). Instead, it
provides generic code that driver can opt into calling:
- dsa_8021q_xmit: Inserts a VLAN header with the specified contents.
  Can be called from another tagging protocol's xmit function.
  Currently the LAN9303 driver is inserting headers that are simply
  802.1Q with custom fields, so this is an opportunity for code reuse.
- dsa_8021q_rcv: Retrieves the TPID and TCI from a VLAN-tagged skb.
  Removing the VLAN header is left as a decision for the caller to make.
- dsa_port_setup_8021q_tagging: For each user port, installs an Rx VID
  and a Tx VID, for proper untagged traffic identification on ingress
  and steering on egress. Also sets up the VLAN trunk on the upstream
  (CPU or DSA) port. Drivers are intentionally left to call this
  function explicitly, depending on the context and hardware support.
  The expected switch behavior and VLAN semantics should not be violated
  under any conditions. That is, after calling
  dsa_port_setup_8021q_tagging, the hardware should still pass all
  ingress traffic, be it tagged or untagged.

For uniformity with the other tagging protocols, a module for the
dsa_8021q_netdev_ops structure is registered, but the typical usage is
to set up another tagging protocol which selects CONFIG_NET_DSA_TAG_8021Q,
and calls the API from tag_8021q.h. Null function definitions are also
provided so that a "depends on" is not forced in the Kconfig.

This tagging protocol only works when switch ports are standalone, or
when they are added to a VLAN-unaware bridge. It will probably remain
this way for the reasons below.

When added to a bridge that has vlan_filtering 1, the bridge core will
install its own VLANs and reset the pvids through switchdev. For the
bridge core, switchdev is a write-only pipe. All VLAN-related state is
kept in the bridge core and nothing is read from DSA/switchdev or from
the driver. So the bridge core will break this port separation because
it will install the vlan_default_pvid into all switchdev ports.

Even if we could teach the bridge driver about switchdev preference of a
certain vlan_default_pvid (task difficult in itself since the current
setting is per-bridge but we would need it per-port), there would still
exist many other challenges.

Firstly, in the DSA rcv callback, a driver would have to perform an
iterative reverse lookup to find the correct switch port. That is
because the port is a bridge slave, so its Rx VID (port PVID) is subject
to user configuration. How would we ensure that the user doesn't reset
the pvid to a different value (which would make an O(1) translation
impossible), or to a non-unique value within this DSA switch tree (which
would make any translation impossible)?

Finally, not all switch ports are equal in DSA, and that makes it
difficult for the bridge to be completely aware of this anyway.
The CPU port needs to transmit tagged packets (VLAN trunk) in order for
the DSA rcv code to be able to decode source information.
But the bridge code has absolutely no idea which switch port is the CPU
port, if nothing else then just because there is no netdevice registered
by DSA for the CPU port.
Also DSA does not currently allow the user to specify that they want the
CPU port to do VLAN trunking anyway. VLANs are added to the CPU port
using the same flags as they were added on the user port.

So the VLANs installed by dsa_port_setup_8021q_tagging per driver
request should remain private from the bridge's and user's perspective,
and should not alter the VLAN semantics observed by the user.

In the current implementation a VLAN range ending at 4095 (VLAN_N_VID)
is reserved for this purpose. Each port receives a unique Rx VLAN and a
unique Tx VLAN. Separate VLANs are needed for Rx and Tx because they
serve different purposes: on Rx the switch must process traffic as
untagged and process it with a port-based VLAN, but with care not to
hinder bridging. On the other hand, the Tx VLAN is where the
reachability restrictions are imposed, since by tagging frames in the
xmit callback we are telling the switch onto which port to steer the
frame.

Some general guidance on how this support might be employed for
real-life hardware (some comments made by Florian Fainelli):

- If the hardware supports VLAN tag stacking, it should somehow back
  up its private VLAN settings when the bridge tries to override them.
  Then the driver could re-apply them as outer tags. Dedicating an outer
  tag per bridge device would allow identical inner tag VID numbers to
  co-exist, yet preserve broadcast domain isolation.

- If the switch cannot handle VLAN tag stacking, it should disable this
  port separation when added as slave to a vlan_filtering bridge, in
  that case having reduced functionality.

- Drivers for old switches that don't support the entire VLAN_N_VID
  range will need to rework the current range selection mechanism.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Vladimir Oltean
146c1bed44 net: dsa: Export symbols for dsa_port_vid_{add, del}
This is needed so that the newly introduced tag_8021q may access these
core DSA functions when built as a module.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Vladimir Oltean
b2243b369c net: dsa: Call driver's setup callback after setting up its switchdev notifier
This allows the driver to perform some manipulations of its own during
setup, using generic switchdev calls. Having the notifiers registered at
setup time is important because otherwise any switchdev transaction
emitted during this time would be ignored (dispatched to an empty call
chain).

One current usage scenario is for the driver to request DSA to set up
802.1Q based switch tagging for its ports.

There is no danger for the driver setup code to start racing now with
switchdev events emitted from the network stack (such as bridge core)
even if the notifier is registered earlier. This is because the network
stack needs a net_device as a vehicle to perform switchdev operations,
and the slave net_devices are registered later than the core driver
setup anyway (ds->ops->setup in dsa_switch_setup vs dsa_port_setup).

Luckily DSA doesn't need a net_device to carry out switchdev callbacks,
and therefore drivers shouldn't assume either that net_devices are
available at the time their switchdev callbacks get invoked.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>-
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Pieter Jansen van Vuuren
88c44a5200 net/sched: add block pointer to tc_cls_common_offload structure
Some actions like the police action are stateful and could share state
between devices. This is incompatible with offloading to multiple devices
and drivers might want to test for shared blocks when offloading.
Store a pointer to the tcf_block structure in the tc_cls_common_offload
structure to allow drivers to determine when offloads apply to a shared
block.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
12f02b6b15 net/sched: allow stats updates from offloaded police actions
Implement the stats_update callback for the police action that
will be used by drivers for hardware offload.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
b7fe4ab8a6 net/sched: extend matchall offload for hardware statistics
Introduce a new command for matchall classifiers that allows hardware
to update statistics.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
8c8cfc6ed2 net/sched: add police action to the hardware intermediate representation
Add police action to the hardware intermediate representation which
would subsequently allow it to be used by drivers for offload.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
fa762da94d net/sched: move police action structures to header
Move tcf_police_params, tcf_police and tc_police_compat structures to a
header. Making them usable to other code for example drivers that would
offload police actions to hardware.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
dfcb19f0fa net/sched: remove unused functions for matchall offload
Cleanup unused functions and variables after porting to the newer
intermediate representation.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
9681e8b3ef net/dsa: use intermediate representation for matchall offload
Updates dsa hardware switch handling infrastructure to use the newer
intermediate representation for flow actions in matchall offloads.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
f00cbf1968 net/sched: use the hardware intermediate representation for matchall
Extends matchall offload to make use of the hardware intermediate
representation. More specifically, this patch moves the native TC
actions in cls_matchall offload to the newer flow_action
representation. This ultimately allows us to avoid a direct
dependency on native TC actions for matchall.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:23 -07:00
Pieter Jansen van Vuuren
a7a7be6087 net/sched: add sample action to the hardware intermediate representation
Add sample action to the hardware intermediate representation model which
would subsequently allow it to be used by drivers for offload.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:23 -07:00
David S. Miller
1ffad6d1af Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

===================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next, they are:

1) Move nft_expr_clone() to nft_dynset, from Paul Gortmaker.

2) Do not include module.h from net/netfilter/nf_tables.h,
   also from Paul.

3) Restrict conntrack sysctl entries to boolean, from Tonghao Zhang.

4) Several patches to add infrastructure to autoload NAT helper
   modules from their respective conntrack helper, this also includes
   the first client of this code in OVS, patches from Flavio Leitner.

5) Add support to match for conntrack ID, from Brett Mastbergen.

6) Spelling fix in connlabel, from Colin Ian King.

7) Use struct_size() from hashlimit, from Gustavo A. R. Silva.

8) Add optimized version of nf_inet_addr_mask(), from Li RongQing.
===================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:35:08 -07:00
Gustavo A. R. Silva
eabb478219 netfilter: xt_hashlimit: use struct_size() helper
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.

So, replace code of the following form:

sizeof(struct xt_hashlimit_htable) + sizeof(struct hlist_head) * size

with:

struct_size(hinfo, hash, size)

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 01:03:04 +02:00
Taehee Yoo
43c8f13118 netfilter: nf_flow_table: fix missing error check for rhashtable_insert_fast
rhashtable_insert_fast() may return an error value when memory
allocation fails, but flow_offload_add() does not check for errors.
This patch just adds missing error checking.

Fixes: ac2a66665e ("netfilter: add generic flow table infrastructure")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 00:36:30 +02:00
Florian Westphal
edbd82c5fb netfilter: nf_tables: fix base chain stat rcu_dereference usage
Following splat gets triggered when nfnetlink monitor is running while
xtables-nft selftests are running:

net/netfilter/nf_tables_api.c:1272 suspicious rcu_dereference_check() usage!
other info that might help us debug this:

1 lock held by xtables-nft-mul/27006:
 #0: 00000000e0f85be9 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x1a/0x50
Call Trace:
 nf_tables_fill_chain_info.isra.45+0x6cc/0x6e0
 nf_tables_chain_notify+0xf8/0x1a0
 nf_tables_commit+0x165c/0x1740

nf_tables_fill_chain_info() can be called both from dumps (rcu read locked)
or from the transaction path if a userspace process subscribed to nftables
notifications.

In the 'table dump' case, rcu_access_pointer() cannot be used: We do not
hold transaction mutex so the pointer can be NULLed right after the check.
Just unconditionally fetch the value, then have the helper return
immediately if its NULL.

In the notification case we don't hold the rcu read lock, but updates are
prevented due to transaction mutex. Use rcu_dereference_check() to make lockdep
aware of this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 00:36:29 +02:00
Jakub Jankowski
f5e85ce8e7 netfilter: nf_conntrack_h323: restore boundary check correctness
Since commit bc7d811ace ("netfilter: nf_ct_h323: Convert
CHECK_BOUND macro to function"), NAT traversal for H.323
doesn't work, failing to parse H323-UserInformation.
nf_h323_error_boundary() compares contents of the bitstring,
not the addresses, preventing valid H.323 packets from being
conntrack'd.

This looks like an oversight from when CHECK_BOUND macro was
converted to a function.

To fix it, stop dereferencing bs->cur and bs->end.

Fixes: bc7d811ace ("netfilter: nf_ct_h323: Convert CHECK_BOUND macro to function")
Signed-off-by: Jakub Jankowski <shasta@toxcorp.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-06 00:36:17 +02:00
David S. Miller
19ab5f4023 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2019-05-05

Here's one more bluetooth-next pull request for 5.2:

 - Fixed Command Complete event handling check for matching opcode
 - Added support for Qualcomm WCN3998 controller, along with DT bindings
 - Added default address for Broadcom BCM2076B1 controllers

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 13:10:36 -07:00
Paolo Abeni
8c3c447b3c net: use indirect calls helpers at the socket layer
This avoids an indirect call per {send,recv}msg syscall in
the common (IPv6 or IPv4 socket) case.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 10:38:04 -07:00
Paolo Abeni
97ff7ffb11 net: use indirect calls helpers at early demux stage
So that we avoid another indirect call per RX packet, if
early demux is enabled.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 10:38:04 -07:00
Paolo Abeni
0e219ae48c net: use indirect calls helpers for L3 handler hooks
So that we avoid another indirect call per RX packet in the common
case.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 10:38:04 -07:00
Paolo Abeni
f5737cbadb net: use indirect calls helpers for ptype hook
This avoids an indirect call per RX IPv6/IPv4 packet.
Note that we don't want to use the indirect calls helper for taps.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 10:38:04 -07:00
João Paulo Rechi Vita
f80c5dad7b Bluetooth: Ignore CC events not matching the last HCI command
This commit makes the kernel not send the next queued HCI command until
a command complete arrives for the last HCI command sent to the
controller. This change avoids a problem with some buggy controllers
(seen on two SKUs of QCA9377) that send an extra command complete event
for the previous command after the kernel had already sent a new HCI
command to the controller.

The problem was reproduced when starting an active scanning procedure,
where an extra command complete event arrives for the LE_SET_RANDOM_ADDR
command. When this happends the kernel ends up not processing the
command complete for the following commmand, LE_SET_SCAN_PARAM, and
ultimately behaving as if a passive scanning procedure was being
performed, when in fact controller is performing an active scanning
procedure. This makes it impossible to discover BLE devices as no device
found events are sent to userspace.

This problem is reproducible on 100% of the attempts on the affected
controllers. The extra command complete event can be seen at timestamp
27.420131 on the btmon logs bellow.

Bluetooth monitor ver 5.50
= Note: Linux version 5.0.0+ (x86_64)                                  0.352340
= Note: Bluetooth subsystem version 2.22                               0.352343
= New Index: 80:C5:F2:8F:87:84 (Primary,USB,hci0)               [hci0] 0.352344
= Open Index: 80:C5:F2:8F:87:84                                 [hci0] 0.352345
= Index Info: 80:C5:F2:8F:87:84 (Qualcomm)                      [hci0] 0.352346
@ MGMT Open: bluetoothd (privileged) version 1.14             {0x0001} 0.352347
@ MGMT Open: btmon (privileged) version 1.14                  {0x0002} 0.352366
@ MGMT Open: btmgmt (privileged) version 1.14                {0x0003} 27.302164
@ MGMT Command: Start Discovery (0x0023) plen 1       {0x0003} [hci0] 27.302310
        Address type: 0x06
          LE Public
          LE Random
< HCI Command: LE Set Random Address (0x08|0x0005) plen 6   #1 [hci0] 27.302496
        Address: 15:60:F2:91:B2:24 (Non-Resolvable)
> HCI Event: Command Complete (0x0e) plen 4                 #2 [hci0] 27.419117
      LE Set Random Address (0x08|0x0005) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Set Scan Parameters (0x08|0x000b) plen 7  #3 [hci0] 27.419244
        Type: Active (0x01)
        Interval: 11.250 msec (0x0012)
        Window: 11.250 msec (0x0012)
        Own address type: Random (0x01)
        Filter policy: Accept all advertisement (0x00)
> HCI Event: Command Complete (0x0e) plen 4                 #4 [hci0] 27.420131
      LE Set Random Address (0x08|0x0005) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Set Scan Enable (0x08|0x000c) plen 2      #5 [hci0] 27.420259
        Scanning: Enabled (0x01)
        Filter duplicates: Enabled (0x01)
> HCI Event: Command Complete (0x0e) plen 4                 #6 [hci0] 27.420969
      LE Set Scan Parameters (0x08|0x000b) ncmd 1
        Status: Success (0x00)
> HCI Event: Command Complete (0x0e) plen 4                 #7 [hci0] 27.421983
      LE Set Scan Enable (0x08|0x000c) ncmd 1
        Status: Success (0x00)
@ MGMT Event: Command Complete (0x0001) plen 4        {0x0003} [hci0] 27.422059
      Start Discovery (0x0023) plen 1
        Status: Success (0x00)
        Address type: 0x06
          LE Public
          LE Random
@ MGMT Event: Discovering (0x0013) plen 2             {0x0003} [hci0] 27.422067
        Address type: 0x06
          LE Public
          LE Random
        Discovery: Enabled (0x01)
@ MGMT Event: Discovering (0x0013) plen 2             {0x0002} [hci0] 27.422067
        Address type: 0x06
          LE Public
          LE Random
        Discovery: Enabled (0x01)
@ MGMT Event: Discovering (0x0013) plen 2             {0x0001} [hci0] 27.422067
        Address type: 0x06
          LE Public
          LE Random
        Discovery: Enabled (0x01)

Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-05-05 19:29:04 +02:00
Dan Carpenter
fdd1a8103a net: atm: clean up a range check
The code works fine but the problem is that check for negatives is a
no-op:

	if (arg < 0)
		i = 0;

The "i" value isn't used.  We immediately overwrite it with:

	i = array_index_nospec(arg, MAX_LEC_ITF);

The array_index_nospec() macro returns zero if "arg" is out of bounds so
this works, but the dead code is confusing and it doesn't look very
intentional.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 10:25:52 -07:00
Colin Ian King
d14a108d51 net: rds: fix spelling mistake "syctl" -> "sysctl"
There is a spelling mistake in a pr_warn warning. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 10:19:43 -07:00
Matteo Croce
594725db0c cls_cgroup: avoid panic when receiving a packet before filter set
When a cgroup classifier is added, there is a small time interval in
which tp->root is NULL. If we receive a packet in this small time slice
a NULL pointer dereference will happen, leading to a kernel panic:

    # mkdir /sys/fs/cgroup/net_cls/0
    # echo 0x100001 >  /sys/fs/cgroup/net_cls/0/net_cls.classid
    # echo $$ >/sys/fs/cgroup/net_cls/0/tasks
    # ping -qfb 255.255.255.255 -I eth0 &>/dev/null &
    # tc qdisc add dev eth0 root handle 10: htb
    # while : ; do
    > tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
    > tc filter delete dev eth0
    > done
    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
    Mem abort info:
      ESR = 0x96000005
      Exception class = DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
    Data abort info:
      ISV = 0, ISS = 0x00000005
      CM = 0, WnR = 0
    user pgtable: 4k pages, 39-bit VAs, pgdp = 0000000098a7ff91
    [0000000000000028] pgd=0000000000000000, pud=0000000000000000
    Internal error: Oops: 96000005 [#1] SMP
    Modules linked in: sch_htb cls_cgroup algif_hash af_alg nls_iso8859_1 nls_cp437 vfat fat xhci_plat_hcd m25p80 spi_nor xhci_hcd mtd usbcore usb_common spi_orion sfp i2c_mv64xxx phy_generic mdio_i2c marvell10g i2c_core mvpp2 mvmdio phylink sbsa_gwdt ip_tables x_tables autofs4
    Process ping (pid: 5421, stack limit = 0x00000000b20b1505)
    CPU: 3 PID: 5421 Comm: ping Not tainted 5.1.0-rc6 #31
    Hardware name: Marvell 8040 MACCHIATOBin Double-shot (DT)
    pstate: 60000005 (nZCv daif -PAN -UAO)
    pc : cls_cgroup_classify+0x80/0xec [cls_cgroup]
    lr : cls_cgroup_classify+0x34/0xec [cls_cgroup]
    sp : ffffff8012e6b850
    x29: ffffff8012e6b850 x28: ffffffc423dd3c00
    x27: ffffff801093ebc0 x26: ffffffc425a85b00
    x25: 0000000020000000 x24: 0000000000000000
    x23: ffffff8012e6b910 x22: ffffffc428db4900
    x21: ffffff8012e6b910 x20: 0000000000100001
    x19: 0000000000000000 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000000
    x15: 0000000000000000 x14: 0000000000000000
    x13: 0000000000000000 x12: 000000000000001c
    x11: 0000000000000018 x10: ffffff8012e6b840
    x9 : 0000000000003580 x8 : 000000000000009d
    x7 : 0000000000000002 x6 : ffffff8012e6b860
    x5 : 000000007cd66ffe x4 : 000000009742a193
    x3 : ffffff800865b4d8 x2 : ffffff8012e6b910
    x1 : 0000000000000400 x0 : ffffffc42c38f300
    Call trace:
     cls_cgroup_classify+0x80/0xec [cls_cgroup]
     tcf_classify+0x78/0x138
     htb_enqueue+0x74/0x320 [sch_htb]
     __dev_queue_xmit+0x3e4/0x9d0
     dev_queue_xmit+0x24/0x30
     ip_finish_output2+0x2e4/0x4d0
     ip_finish_output+0x1d8/0x270
     ip_mc_output+0xa8/0x240
     ip_local_out+0x58/0x68
     ip_send_skb+0x2c/0x88
     ip_push_pending_frames+0x44/0x50
     raw_sendmsg+0x458/0x830
     inet_sendmsg+0x54/0xe8
     sock_sendmsg+0x34/0x50
     __sys_sendto+0xd0/0x120
     __arm64_sys_sendto+0x30/0x40
     el0_svc_common.constprop.0+0x88/0xf8
     el0_svc_handler+0x2c/0x38
     el0_svc+0x8/0xc
    Code: 39496001 360002a1 b9425c14 34000274 (79405260)

Fixes: ed76f5edcc ("net: sched: protect filter_chain list with filter_chain_lock mutex")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 10:00:17 -07:00
Colin Ian King
ca96534630 openvswitch: check for null pointer return from nla_nest_start_noflag
The call to nla_nest_start_noflag can return null in the unlikely
event that nla_put returns -EMSGSIZE.  Check for this condition to
avoid a null pointer dereference on pointer nla_reply.

Addresses-Coverity: ("Dereference null return value")
Fixes: 11efd5cb04 ("openvswitch: Support conntrack zone limit")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 00:52:07 -07:00
David Ahern
a5995e7107 ipv4: Move exception bucket to nh_common
Similar to the cached routes, make IPv4 exceptions accessible when
using an IPv6 nexthop struct with IPv4 routes. Simplify the exception
functions by passing in fib_nh_common since that is all it needs,
and then cleanup the call sites that have extraneous fib_nh conversions.

As with the cached routes this is a change in location only, from fib_nh
up to fib_nh_common; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 00:47:16 -07:00
David Ahern
87063a1fa6 ipv4: Pass fib_nh_common to rt_cache_route
Now that the cached routes are in fib_nh_common, pass it to
rt_cache_route and simplify its callers. For rt_set_nexthop,
the tclassid becomes the last user of fib_nh so move the
container_of under the #ifdef CONFIG_IP_ROUTE_CLASSID.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 00:47:16 -07:00
David Ahern
0f457a3662 ipv4: Move cached routes to fib_nh_common
While the cached routes, nh_pcpu_rth_output and nh_rth_input, are IPv4
specific, a later patch wants to make them accessible for IPv6 nexthops
with IPv4 routes using a fib6_nh. Move the cached routes from fib_nh to
fib_nh_common and update references.

Initialization of the cached entries is moved to fib_nh_common_init,
and free is moved to fib_nh_common_release.

Change in location only, from fib_nh up to fib_nh_common; no functional
change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 00:47:16 -07:00
YueHaibing
71f150f4c2 bpf: Use PTR_ERR_OR_ZERO in bpf_fd_sk_storage_update_elem()
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-04 23:20:58 -07:00
David Ahern
7fcd1e033d ipmr_base: Do not reset index in mr_table_dump
e is the counter used to save the location of a dump when an
skb is filled. Once the walk of the table is complete, mr_table_dump
needs to return without resetting that index to 0. Dump of a specific
table is looping because of the reset because there is no way to
indicate the walk of the table is done.

Move the reset to the caller so the dump of each table starts at 0,
but the loop counter is maintained if a dump fills an skb.

Fixes: e1cedae1ba ("ipmr: Refactor mr_rtm_dumproute")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 01:38:15 -04:00
Eelco Chaudron
a734d1f4c2 net: openvswitch: return an error instead of doing BUG_ON()
For all other error cases in queue_userspace_packet() the error is
returned, so it makes sense to do the same for these two error cases.

Reported-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 01:36:36 -04:00
Michal Kubecek
05d7f547be genetlink: do not validate dump requests if there is no policy
Unlike do requests, dump genetlink requests now perform strict validation
by default even if the genetlink family does not set policy and maxtype
because it does validation and parsing on its own (e.g. because it wants to
allow different message format for different commands). While the null
policy will be ignored, maxtype (which would be zero) is still checked so
that any attribute will fail validation.

The solution is to only call __nla_validate() from genl_family_rcv_msg()
if family->maxtype is set.

Fixes: ef6243acb4 ("genetlink: optionally validate strictly/dumps")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 01:27:10 -04:00
Tuong Lien
c0b14a0854 tipc: fix missing Name entries due to half-failover
TIPC link can temporarily fall into "half-establish" that only one of
the link endpoints is ESTABLISHED and starts to send traffic, PROTOCOL
messages, whereas the other link endpoint is not up (e.g. immediately
when the endpoint receives ACTIVATE_MSG, the network interface goes
down...).

This is a normal situation and will be settled because the link
endpoint will be eventually brought down after the link tolerance time.

However, the situation will become worse when the second link is
established before the first link endpoint goes down,
For example:

   1. Both links <1A-2A>, <1B-2B> down
   2. Link endpoint 2A up, but 1A still down (e.g. due to network
      disturbance, wrong session, etc.)
   3. Link <1B-2B> up
   4. Link endpoint 2A down (e.g. due to link tolerance timeout)
   5. Node B starts failover onto link <1B-2B>

   ==> Node A does never start link failover.

When the "half-failover" situation happens, two consequences have been
observed:

a) Peer link/node gets stuck in FAILINGOVER state;
b) Traffic or user messages that peer node is trying to failover onto
the second link can be partially or completely dropped by this node.

The consequence a) was actually solved by commit c140eb166d ("tipc:
fix failover problem"), but that commit didn't cover the b). It's due
to the fact that the tunnel link endpoint has never been prepared for a
failover, so the 'l->drop_point' (and the other data...) is not set
correctly. When a TUNNEL_MSG from peer node arrives on the link,
depending on the inner message's seqno and the current 'l->drop_point'
value, the message can be dropped (- treated as a duplicate message) or
processed.
At this early stage, the traffic messages from peer are likely to be
NAME_DISTRIBUTORs, this means some name table entries will be missed on
the node forever!

The commit resolves the issue by starting the FAILOVER process on this
node as well. Another benefit from this solution is that we ensure the
link will not be re-established until the failover ends.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 00:59:51 -04:00
Matteo Croce
25426043ec cls_matchall: avoid panic when receiving a packet before filter set
When a matchall classifier is added, there is a small time interval in
which tp->root is NULL. If we receive a packet in this small time slice
a NULL pointer dereference will happen, leading to a kernel panic:

    # tc qdisc replace dev eth0 ingress
    # tc filter add dev eth0 parent ffff: matchall action gact drop
    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000034
    Mem abort info:
      ESR = 0x96000005
      Exception class = DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
    Data abort info:
      ISV = 0, ISS = 0x00000005
      CM = 0, WnR = 0
    user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000a623d530
    [0000000000000034] pgd=0000000000000000, pud=0000000000000000
    Internal error: Oops: 96000005 [#1] SMP
    Modules linked in: cls_matchall sch_ingress nls_iso8859_1 nls_cp437 vfat fat m25p80 spi_nor mtd xhci_plat_hcd xhci_hcd phy_generic sfp mdio_i2c usbcore i2c_mv64xxx marvell10g mvpp2 usb_common spi_orion mvmdio i2c_core sbsa_gwdt phylink ip_tables x_tables autofs4
    Process ksoftirqd/0 (pid: 9, stack limit = 0x0000000009de7d62)
    CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.1.0-rc6 #21
    Hardware name: Marvell 8040 MACCHIATOBin Double-shot (DT)
    pstate: 40000005 (nZcv daif -PAN -UAO)
    pc : mall_classify+0x28/0x78 [cls_matchall]
    lr : tcf_classify+0x78/0x138
    sp : ffffff80109db9d0
    x29: ffffff80109db9d0 x28: ffffffc426058800
    x27: 0000000000000000 x26: ffffffc425b0dd00
    x25: 0000000020000000 x24: 0000000000000000
    x23: ffffff80109dbac0 x22: 0000000000000001
    x21: ffffffc428ab5100 x20: ffffffc425b0dd00
    x19: ffffff80109dbac0 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000000
    x15: 0000000000000000 x14: 0000000000000000
    x13: ffffffbf108ad288 x12: dead000000000200
    x11: 00000000f0000000 x10: 0000000000000001
    x9 : ffffffbf1089a220 x8 : 0000000000000001
    x7 : ffffffbebffaa950 x6 : 0000000000000000
    x5 : 000000442d6ba000 x4 : 0000000000000000
    x3 : ffffff8008735ad8 x2 : ffffff80109dbac0
    x1 : ffffffc425b0dd00 x0 : ffffff8010592078
    Call trace:
     mall_classify+0x28/0x78 [cls_matchall]
     tcf_classify+0x78/0x138
     __netif_receive_skb_core+0x29c/0xa20
     __netif_receive_skb_one_core+0x34/0x60
     __netif_receive_skb+0x28/0x78
     netif_receive_skb_internal+0x2c/0xc0
     napi_gro_receive+0x1a0/0x1d8
     mvpp2_poll+0x928/0xb18 [mvpp2]
     net_rx_action+0x108/0x378
     __do_softirq+0x128/0x320
     run_ksoftirqd+0x44/0x60
     smpboot_thread_fn+0x168/0x1b0
     kthread+0x12c/0x130
     ret_from_fork+0x10/0x1c
    Code: aa0203f3 aa1e03e0 d503201f f9400684 (b9403480)
    ---[ end trace fc71e2ef7b8ab5a5 ]---
    Kernel panic - not syncing: Fatal exception in interrupt
    SMP: stopping secondary CPUs
    Kernel Offset: disabled
    CPU features: 0x002,00002000
    Memory Limit: none
    Rebooting in 1 seconds..

Fix this by adding a NULL check in mall_classify().

Fixes: ed76f5edcc ("net: sched: protect filter_chain list with filter_chain_lock mutex")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 00:57:48 -04:00
Gustavo A. R. Silva
e512fcf028 net: sched: cls_u32: use struct_size() helper
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.

So, replace code of the following form:

sizeof(*s) + s->nkeys*sizeof(struct tc_u32_key)

with:

struct_size(s, keys, s->nkeys)

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 00:46:27 -04:00
Cong Wang
141b6b2ad7 net: add a generic tracepoint for TX queue timeout
Although devlink health report does a nice job on reporting TX
timeout and other NIC errors, unfortunately it requires drivers
to support it but currently only mlx5 has implemented it.
Before other drivers could catch up, it is useful to have a
generic tracepoint to monitor this kind of TX timeout. We have
been suffering TX timeout with different drivers, we plan to
start to monitor it with rasdaemon which just needs a new tracepoint.

Sample output:

  ksoftirqd/1-16    [001] ..s2   144.043173: net_dev_xmit_timeout: dev=ens3 driver=e1000 queue=0

Cc: Eran Ben Elisha <eranbe@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 00:41:41 -04:00
David Ahern
4b2a2bfeb3 neighbor: Call __ipv4_neigh_lookup_noref in neigh_xmit
Commit cd9ff4de01 changed the key for IFF_POINTOPOINT devices to
INADDR_ANY but neigh_xmit which is used for MPLS encapsulations was not
updated to use the altered key. The result is that every packet Tx does
a lookup on the gateway address which does not find an entry, a new one
is created only to find the existing one in the table right before the
insert since arp_constructor was updated to reset the primary key. This
is seen in the allocs and destroys counters:
    ip -s -4 ntable show | head -10 | grep alloc

which increase for each packet showing the unnecessary overhread.

Fix by having neigh_xmit use __ipv4_neigh_lookup_noref for NEIGH_ARP_TABLE.

Fixes: cd9ff4de01 ("ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY")
Reported-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 00:38:53 -04:00
David Ahern
64c6f4bbca neighbor: Reset gc_entries counter if new entry is released before insert
Ian and Alan both reported seeing overflows after upgrades to 5.x kernels:
  neighbour: arp_cache: neighbor table overflow!

Alan's mpls script helped get to the bottom of this bug. When a new entry
is created the gc_entries counter is bumped in neigh_alloc to check if a
new one is allowed to be created. ___neigh_create then searches for an
existing entry before inserting the just allocated one. If an entry
already exists, the new one is dropped in favor of the existing one. In
this case the cleanup path needs to drop the gc_entries counter. There
is no memory leak, only a counter leak.

Fixes: 58956317c8 ("neighbor: Improve garbage collection")
Reported-by: Ian Kumlien <ian.kumlien@gmail.com>
Reported-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 00:36:19 -04:00
Parav Pandit
5102eca903 net/smc: Use rdma_read_gid_l2_fields to L2 fields
Use core provided API to fill the source MAC address and use
rdma_read_gid_attr_ndev_rcu() to get stable netdev.

This is preparation patch to allow gid attribute to become NULL when
associated net device is removed.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-03 11:10:03 -03:00
David S. Miller
ff24e4980a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three trivial overlapping conflicts.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02 22:14:21 -04:00
Al Viro
bef252fa19 rpcpipe: switch to ->free_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-01 22:43:26 -04:00
Eric Dumazet
4dd2b82d5a udp: fix GRO packet of death
syzbot was able to crash host by sending UDP packets with a 0 payload.

TCP does not have this issue since we do not aggregate packets without
payload.

Since dev_gro_receive() sets gso_size based on skb_gro_len(skb)
it seems not worth trying to cope with padded packets.

BUG: KASAN: slab-out-of-bounds in skb_gro_receive+0xf5f/0x10e0 net/core/skbuff.c:3826
Read of size 16 at addr ffff88808893fff0 by task syz-executor612/7889

CPU: 0 PID: 7889 Comm: syz-executor612 Not tainted 5.1.0-rc7+ #96
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 __asan_report_load16_noabort+0x14/0x20 mm/kasan/generic_report.c:133
 skb_gro_receive+0xf5f/0x10e0 net/core/skbuff.c:3826
 udp_gro_receive_segment net/ipv4/udp_offload.c:382 [inline]
 call_gro_receive include/linux/netdevice.h:2349 [inline]
 udp_gro_receive+0xb61/0xfd0 net/ipv4/udp_offload.c:414
 udp4_gro_receive+0x763/0xeb0 net/ipv4/udp_offload.c:478
 inet_gro_receive+0xe72/0x1110 net/ipv4/af_inet.c:1510
 dev_gro_receive+0x1cd0/0x23c0 net/core/dev.c:5581
 napi_gro_frags+0x36b/0xd10 net/core/dev.c:5843
 tun_get_user+0x2f24/0x3fb0 drivers/net/tun.c:1981
 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2027
 call_write_iter include/linux/fs.h:1866 [inline]
 do_iter_readv_writev+0x5e1/0x8e0 fs/read_write.c:681
 do_iter_write fs/read_write.c:957 [inline]
 do_iter_write+0x184/0x610 fs/read_write.c:938
 vfs_writev+0x1b3/0x2f0 fs/read_write.c:1002
 do_writev+0x15e/0x370 fs/read_write.c:1037
 __do_sys_writev fs/read_write.c:1110 [inline]
 __se_sys_writev fs/read_write.c:1107 [inline]
 __x64_sys_writev+0x75/0xb0 fs/read_write.c:1107
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441cc0
Code: 05 48 3d 01 f0 ff ff 0f 83 9d 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 83 3d 51 93 29 00 00 75 14 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 74 09 fc ff c3 48 83 ec 08 e8 ba 2b 00 00
RSP: 002b:00007ffe8c716118 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 00007ffe8c716150 RCX: 0000000000441cc0
RDX: 0000000000000001 RSI: 00007ffe8c716170 RDI: 00000000000000f0
RBP: 0000000000000000 R08: 000000000000ffff R09: 0000000000a64668
R10: 0000000020000040 R11: 0000000000000246 R12: 000000000000c2d9
R13: 0000000000402b50 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 5143:
 save_stack+0x45/0xd0 mm/kasan/common.c:75
 set_track mm/kasan/common.c:87 [inline]
 __kasan_kmalloc mm/kasan/common.c:497 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:505
 slab_post_alloc_hook mm/slab.h:437 [inline]
 slab_alloc mm/slab.c:3393 [inline]
 kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3555
 mm_alloc+0x1d/0xd0 kernel/fork.c:1030
 bprm_mm_init fs/exec.c:363 [inline]
 __do_execve_file.isra.0+0xaa3/0x23f0 fs/exec.c:1791
 do_execveat_common fs/exec.c:1865 [inline]
 do_execve fs/exec.c:1882 [inline]
 __do_sys_execve fs/exec.c:1958 [inline]
 __se_sys_execve fs/exec.c:1953 [inline]
 __x64_sys_execve+0x8f/0xc0 fs/exec.c:1953
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 5351:
 save_stack+0x45/0xd0 mm/kasan/common.c:75
 set_track mm/kasan/common.c:87 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:467
 __cache_free mm/slab.c:3499 [inline]
 kmem_cache_free+0x86/0x260 mm/slab.c:3765
 __mmdrop+0x238/0x320 kernel/fork.c:677
 mmdrop include/linux/sched/mm.h:49 [inline]
 finish_task_switch+0x47b/0x780 kernel/sched/core.c:2746
 context_switch kernel/sched/core.c:2880 [inline]
 __schedule+0x81b/0x1cc0 kernel/sched/core.c:3518
 preempt_schedule_irq+0xb5/0x140 kernel/sched/core.c:3745
 retint_kernel+0x1b/0x2d
 arch_local_irq_restore arch/x86/include/asm/paravirt.h:767 [inline]
 kmem_cache_free+0xab/0x260 mm/slab.c:3766
 anon_vma_chain_free mm/rmap.c:134 [inline]
 unlink_anon_vmas+0x2ba/0x870 mm/rmap.c:401
 free_pgtables+0x1af/0x2f0 mm/memory.c:394
 exit_mmap+0x2d1/0x530 mm/mmap.c:3144
 __mmput kernel/fork.c:1046 [inline]
 mmput+0x15f/0x4c0 kernel/fork.c:1067
 exec_mmap fs/exec.c:1046 [inline]
 flush_old_exec+0x8d9/0x1c20 fs/exec.c:1279
 load_elf_binary+0x9bc/0x53f0 fs/binfmt_elf.c:864
 search_binary_handler fs/exec.c:1656 [inline]
 search_binary_handler+0x17f/0x570 fs/exec.c:1634
 exec_binprm fs/exec.c:1698 [inline]
 __do_execve_file.isra.0+0x1394/0x23f0 fs/exec.c:1818
 do_execveat_common fs/exec.c:1865 [inline]
 do_execve fs/exec.c:1882 [inline]
 __do_sys_execve fs/exec.c:1958 [inline]
 __se_sys_execve fs/exec.c:1953 [inline]
 __x64_sys_execve+0x8f/0xc0 fs/exec.c:1953
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff88808893f7c0
 which belongs to the cache mm_struct of size 1496
The buggy address is located 600 bytes to the right of
 1496-byte region [ffff88808893f7c0, ffff88808893fd98)
The buggy address belongs to the page:
page:ffffea0002224f80 count:1 mapcount:0 mapping:ffff88821bc40ac0 index:0xffff88808893f7c0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea00025b4f08 ffffea00027b9d08 ffff88821bc40ac0
raw: ffff88808893f7c0 ffff88808893e440 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88808893fe80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88808893ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88808893ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                             ^
 ffff888088940000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888088940080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Fixes: e20cf8d3f1 ("udp: implement GRO for plain UDP sockets.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 22:29:56 -04:00
Martin KaFai Lau
886b7a5010 ipv6: A few fixes on dereferencing rt->from
It is a followup after the fix in
commit 9c69a13205 ("route: Avoid crash from dereferencing NULL rt->from")

rt6_do_redirect():
1. NULL checking is needed on rt->from because a parallel
   fib6_info delete could happen that sets rt->from to NULL.
   (e.g. rt6_remove_exception() and fib6_drop_pcpu_from()).

2. fib6_info_hold() is not enough.  Same reason as (1).
   Meaning, holding dst->__refcnt cannot ensure
   rt->from is not NULL or rt->from->fib6_ref is not 0.

   Instead of using fib6_info_hold_safe() which ip6_rt_cache_alloc()
   is already doing, this patch chooses to extend the rcu section
   to keep "from" dereference-able after checking for NULL.

inet6_rtm_getroute():
1. NULL checking is also needed on rt->from for a similar reason.
   Note that inet6_rtm_getroute() is using RTNL_FLAG_DOIT_UNLOCKED.

Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 17:17:54 -04:00
Nicholas Mc Guire
f3505745c0 rds: ib: force endiannes annotation
While the endiannes is being handled correctly as indicated by the comment
above the offending line - sparse was unhappy with the missing annotation
as be64_to_cpu() expects a __be64 argument. To mitigate this annotation
all involved variables are changed to a consistent __le64 and the
 conversion to uint64_t delayed to the call to rds_cong_map_updated().

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 17:15:36 -04:00
Shmulik Ladkani
d2f0c96114 ipv4: ip_do_fragment: Preserve skb_iif during fragmentation
Previously, during fragmentation after forwarding, skb->skb_iif isn't
preserved, i.e. 'ip_copy_metadata' does not copy skb_iif from given
'from' skb.

As a result, ip_do_fragment's creates fragments with zero skb_iif,
leading to inconsistent behavior.

Assume for example an eBPF program attached at tc egress (post
forwarding) that examines __sk_buff->ingress_ifindex:
 - the correct iif is observed if forwarding path does not involve
   fragmentation/refragmentation
 - a bogus iif is observed if forwarding path involves
   fragmentation/refragmentatiom

Fix, by preserving skb_iif during 'ip_copy_metadata'.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 13:28:34 -04:00
Vinicius Costa Gomes
c25031e993 taprio: Add support for cycle-time-extension
IEEE 802.1Q-2018 defines the concept of a cycle-time-extension, so the
last entry of a schedule before the start of a new schedule can be
extended, so "too-short" entries can be avoided.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:58:51 -04:00
Vinicius Costa Gomes
6ca6a66542 taprio: Add support for setting the cycle-time manually
IEEE 802.1Q-2018 defines that a the cycle-time of a schedule may be
overridden, so the schedule is truncated to a determined "width".

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:58:51 -04:00
Vinicius Costa Gomes
a3d43c0d56 taprio: Add support adding an admin schedule
The IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from
operational?) and "Admin" ones. Up until now, 'taprio' only had
support for the "Oper" one, added when the qdisc is created. This adds
support for the "Admin" one, which allows the .change() operation to
be supported.

Just for clarification, some quick (and dirty) definitions, the "Oper"
schedule is the currently (as in this instant) running one, and it's
read-only. The "Admin" one is the one that the system configurator has
installed, it can be changed, and it will be "promoted" to "Oper" when
it's 'base-time' is reached.

The idea behing this patch is that calling something like the below,
(after taprio is already configured with an initial schedule):

$ tc qdisc change taprio dev IFACE parent root 	     \
     	   base-time X 	     	   	       	     \
     	   sched-entry <CMD> <GATES> <INTERVAL>	     \
	   ...

Will cause a new admin schedule to be created and programmed to be
"promoted" to "Oper" at instant X. If an "Admin" schedule already
exists, it will be overwritten with the new parameters.

Up until now, there was some code that was added to ease the support
of changing a single entry of a schedule, but was ultimately unused.
Now, that we have support for "change" with more well thought
semantics, updating a single entry seems to be less useful.

So we remove what is in practice dead code, and return a "not
supported" error if the user tries to use it. If changing a single
entry would make the user's life easier we may ressurrect this idea,
but at this point, removing it simplifies the code.

For now, only the schedule specific bits are allowed to be added for a
new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues'
cannot be modified.

Example:

$ tc qdisc change dev IFACE parent root handle 100 taprio \
      base-time $BASE_TIME \
      sched-entry S 00 500000 \
      sched-entry S 0f 500000 \
      clockid CLOCK_TAI

The only change in the netlink API introduced by this change is the
introduction of an "admin" type in the response to a dump request,
that type allows userspace to separate the "oper" schedule from the
"admin" schedule. If userspace doesn't support the "admin" type, it
will only display the "oper" schedule.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:58:51 -04:00
Vinicius Costa Gomes
8c79f0ea5d taprio: Fix potencial use of invalid memory during dequeue()
Right now, this isn't a problem, but the next commit allows schedules
to be added during runtime. When a new schedule transitions from the
inactive to the active state ("admin" -> "oper") the previous one can
be freed, if it's freed just after the RCU read lock is released, we
may access an invalid entry.

So, we should take care to protect the dequeue() flow, so all the
places that access the entries are protected by the RCU read lock.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:58:51 -04:00
Yuchung Cheng
98fa6271cf tcp: refactor setting the initial congestion window
Relocate the congestion window initialization from tcp_init_metrics()
to tcp_init_transfer() to improve code readability.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:47:54 -04:00
Yuchung Cheng
6b94b1c88b tcp: refactor to consolidate TFO passive open code
Use a helper to consolidate two identical code block for passive TFO.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:47:54 -04:00
Yuchung Cheng
794200d662 tcp: undo cwnd on Fast Open spurious SYNACK retransmit
This patch makes passive Fast Open reverts the cwnd to default
initial cwnd (10 packets) if the SYNACK timeout is spurious.

Passive Fast Open uses a full socket during handshake so it can
use the existing undo logic to detect spurious retransmission
by recording the first SYNACK timeout in key state variable
retrans_stamp. Upon receiving the ACK of the SYNACK, if the socket
has sent some data before the timeout, the spurious timeout
is detected by tcp_try_undo_recovery() in tcp_process_loss()
in tcp_ack().

But if the socket has not send any data yet, tcp_ack() does not
execute the undo code since no data is acknowledged. The fix is to
check such case explicitly after tcp_ack() during the ACK processing
in SYN_RECV state. In addition this is checked in FIN_WAIT_1 state
in case the server closes the socket before handshake completes.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:47:54 -04:00
Yuchung Cheng
8c3cfe19fe tcp: lower congestion window on Fast Open SYNACK timeout
TCP sender would use congestion window of 1 packet on the second SYN
and SYNACK timeout except passive TCP Fast Open. This makes passive
TFO too aggressive and unfair during congestion at handshake. This
patch fixes this issue so TCP (fast open or not, passive or active)
always conforms to the RFC6298.

Note that tcp_enter_loss() is called only once during recurring
timeouts.  This is because during handshake, high_seq and snd_una
are the same so tcp_enter_loss() would incorrect set the undo state
variables multiple times.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:47:54 -04:00
Yuchung Cheng
336c39a031 tcp: undo init congestion window on false SYNACK timeout
Linux implements RFC6298 and use an initial congestion window
of 1 upon establishing the connection if the SYNACK packet is
retransmitted 2 or more times. In cellular networks SYNACK timeouts
are often spurious if the wireless radio was dormant or idle. Also
some network path is longer than the default SYNACK timeout. In
both cases falsely starting with a minimal cwnd are detrimental
to performance.

This patch avoids doing so when the final ACK's TCP timestamp
indicates the original SYNACK was delivered. It remembers the
original SYNACK timestamp when SYNACK timeout has occurred and
re-uses the function to detect spurious SYN timeout conveniently.

Note that a server may receives multiple SYNs from and immediately
retransmits SYNACKs without any SYNACK timeout. This often happens
on when the client SYNs have timed out due to wireless delay
above. In this case since the server will still use the default
initial congestion (e.g. 10) because tp->undo_marker is reset in
tcp_init_metrics(). This is an intentional design because packets
are not lost but delayed.

This patch only covers regular TCP passive open. Fast Open is
supported in the next patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:47:54 -04:00
Yuchung Cheng
9e450c1ecb tcp: better SYNACK sent timestamp
Detecting spurious SYNACK timeout using timestamp option requires
recording the exact SYNACK skb timestamp. Previously the SYNACK
sent timestamp was stamped slightly earlier before the skb
was transmitted. This patch uses the SYNACK skb transmission
timestamp directly.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:47:54 -04:00
Yuchung Cheng
7c1f08154c tcp: undo initial congestion window on false SYN timeout
Linux implements RFC6298 and use an initial congestion window of 1
upon establishing the connection if the SYN packet is retransmitted 2
or more times. In cellular networks SYN timeouts are often spurious
if the wireless radio was dormant or idle. Also some network path
is longer than the default SYN timeout. Having a minimal cwnd on
both cases are detrimental to TCP startup performance.

This patch extends TCP undo feature (RFC3522 aka TCP Eifel) to detect
spurious SYN timeout via TCP timestamps. Since tp->retrans_stamp
records the initial SYN timestamp instead of first retransmission, we
have to implement a different undo code additionally. The detection
also must happen before tcp_ack() as retrans_stamp is reset when
SYN is acknowledged.

Note this patch covers both active regular and fast open.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:47:54 -04:00
Yuchung Cheng
bc9f38c832 tcp: avoid unconditional congestion window undo on SYN retransmit
Previously if an active TCP open has SYN timeout, it always undo the
cwnd upon receiving the SYNACK. This is because tcp_clean_rtx_queue
would reset tp->retrans_stamp when SYN is acked, which fools then
tcp_try_undo_loss and tcp_packet_delayed. Addressing this issue is
required to properly support undo for spurious SYN timeout.

Fixing this is tricky -- for active TCP open tp->retrans_stamp
records the time when the handshake starts, not the first
retransmission time as the name may suggest. The simplest fix is
for tcp_packet_delayed to ensure it is valid before comparing with
other timestamp.

One side effect of this change is active TCP Fast Open that incurred
SYN timeout. Upon receiving a SYN-ACK that only acknowledged
the SYN, it would immediately retransmit unacknowledged data in
tcp_ack() because the data is marked lost after SYN timeout. But
the retransmission would have an incorrect ack sequence number since
rcv_nxt has not been updated yet tcp_rcv_synsent_state_process(), the
retransmission needs to properly handed by tcp_rcv_fastopen_synack()
like before.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:47:54 -04:00
Jakub Kicinski
2dcb003314 net/tls: avoid NULL pointer deref on nskb->sk in fallback
update_chksum() accesses nskb->sk before it has been set
by complete_skb(), move the init up.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:37:56 -04:00
Willem de Bruijn
486efdc8f6 packet: validate msg_namelen in send directly
Packet sockets in datagram mode take a destination address. Verify its
length before passing to dev_hard_header.

Prior to 2.6.14-rc3, the send code ignored sll_halen. This is
established behavior. Directly compare msg_namelen to dev->addr_len.

Change v1->v2: initialize addr in all paths

Fixes: 6b8d95f179 ("packet: validate address length if non-zero")
Suggested-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:28:35 -04:00
Willem de Bruijn
b2cf86e156 packet: in recvmsg msg_name return at least sizeof sockaddr_ll
Packet send checks that msg_name is at least sizeof sockaddr_ll.
Packet recv must return at least this length, so that its output
can be passed unmodified to packet send.

This ceased to be true since adding support for lladdr longer than
sll_addr. Since, the return value uses true address length.

Always return at least sizeof sockaddr_ll, even if address length
is shorter. Zero the padding bytes.

Change v1->v2: do not overwrite zeroed padding again. use copy_len.

Fixes: 0fb375fb9b ("[AF_PACKET]: Allow for > 8 byte hardware addresses.")
Suggested-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:27:55 -04:00
Moshe Shemesh
b587bdaf5f devlink: Change devlink health locking mechanism
The devlink health reporters create/destroy and user commands currently
use the devlink->lock as a locking mechanism. Different reporters have
different rules in the driver and are being created/destroyed during
different stages of driver load/unload/running. So during execution of a
reporter recover the flow can go through another reporter's destroy and
create. Such flow leads to deadlock trying to lock a mutex already
held.

With the new locking mechanism the different reporters share mutex lock
only to protect access to shared reporters list.
Added refcount per reporter, to protect the reporters from destroy while
being used.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:07:03 -04:00
Xin Long
fbd019737d sctp: avoid running the sctp state machine recursively
Ying triggered a call trace when doing an asconf testing:

  BUG: scheduling while atomic: swapper/12/0/0x10000100
  Call Trace:
   <IRQ>  [<ffffffffa4375904>] dump_stack+0x19/0x1b
   [<ffffffffa436fcaf>] __schedule_bug+0x64/0x72
   [<ffffffffa437b93a>] __schedule+0x9ba/0xa00
   [<ffffffffa3cd5326>] __cond_resched+0x26/0x30
   [<ffffffffa437bc4a>] _cond_resched+0x3a/0x50
   [<ffffffffa3e22be8>] kmem_cache_alloc_node+0x38/0x200
   [<ffffffffa423512d>] __alloc_skb+0x5d/0x2d0
   [<ffffffffc0995320>] sctp_packet_transmit+0x610/0xa20 [sctp]
   [<ffffffffc098510e>] sctp_outq_flush+0x2ce/0xc00 [sctp]
   [<ffffffffc098646c>] sctp_outq_uncork+0x1c/0x20 [sctp]
   [<ffffffffc0977338>] sctp_cmd_interpreter.isra.22+0xc8/0x1460 [sctp]
   [<ffffffffc0976ad1>] sctp_do_sm+0xe1/0x350 [sctp]
   [<ffffffffc099443d>] sctp_primitive_ASCONF+0x3d/0x50 [sctp]
   [<ffffffffc0977384>] sctp_cmd_interpreter.isra.22+0x114/0x1460 [sctp]
   [<ffffffffc0976ad1>] sctp_do_sm+0xe1/0x350 [sctp]
   [<ffffffffc097b3a4>] sctp_assoc_bh_rcv+0xf4/0x1b0 [sctp]
   [<ffffffffc09840f1>] sctp_inq_push+0x51/0x70 [sctp]
   [<ffffffffc099732b>] sctp_rcv+0xa8b/0xbd0 [sctp]

As it shows, the first sctp_do_sm() running under atomic context (NET_RX
softirq) invoked sctp_primitive_ASCONF() that uses GFP_KERNEL flag later,
and this flag is supposed to be used in non-atomic context only. Besides,
sctp_do_sm() was called recursively, which is not expected.

Vlad tried to fix this recursive call in Commit c078669340 ("sctp: Fix
oops when sending queued ASCONF chunks") by introducing a new command
SCTP_CMD_SEND_NEXT_ASCONF. But it didn't work as this command is still
used in the first sctp_do_sm() call, and sctp_primitive_ASCONF() will
be called in this command again.

To avoid calling sctp_do_sm() recursively, we send the next queued ASCONF
not by sctp_primitive_ASCONF(), but by sctp_sf_do_prm_asconf() in the 1st
sctp_do_sm() directly.

Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 09:18:57 -04:00
Eric Dumazet
0e23387491 ipv6: fix races in ip6_dst_destroy()
We had many syzbot reports that seem to be caused by use-after-free
of struct fib6_info.

ip6_dst_destroy(), fib6_drop_pcpu_from() and rt6_remove_exception()
are writers vs rt->from, and use non consistent synchronization among
themselves.

Switching to xchg() will solve the issues with no possible
lockdep issues.

BUG: KASAN: user-memory-access in atomic_dec_and_test include/asm-generic/atomic-instrumented.h:747 [inline]
BUG: KASAN: user-memory-access in fib6_info_release include/net/ip6_fib.h:294 [inline]
BUG: KASAN: user-memory-access in fib6_info_release include/net/ip6_fib.h:292 [inline]
BUG: KASAN: user-memory-access in fib6_drop_pcpu_from net/ipv6/ip6_fib.c:927 [inline]
BUG: KASAN: user-memory-access in fib6_purge_rt+0x4f6/0x670 net/ipv6/ip6_fib.c:960
Write of size 4 at addr 0000000000ffffb4 by task syz-executor.1/7649

CPU: 0 PID: 7649 Comm: syz-executor.1 Not tainted 5.1.0-rc6+ #183
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 kasan_report.cold+0x5/0x40 mm/kasan/report.c:321
 check_memory_region_inline mm/kasan/generic.c:185 [inline]
 check_memory_region+0x123/0x190 mm/kasan/generic.c:191
 kasan_check_write+0x14/0x20 mm/kasan/common.c:108
 atomic_dec_and_test include/asm-generic/atomic-instrumented.h:747 [inline]
 fib6_info_release include/net/ip6_fib.h:294 [inline]
 fib6_info_release include/net/ip6_fib.h:292 [inline]
 fib6_drop_pcpu_from net/ipv6/ip6_fib.c:927 [inline]
 fib6_purge_rt+0x4f6/0x670 net/ipv6/ip6_fib.c:960
 fib6_del_route net/ipv6/ip6_fib.c:1813 [inline]
 fib6_del+0xac2/0x10a0 net/ipv6/ip6_fib.c:1844
 fib6_clean_node+0x3a8/0x590 net/ipv6/ip6_fib.c:2006
 fib6_walk_continue+0x495/0x900 net/ipv6/ip6_fib.c:1928
 fib6_walk+0x9d/0x100 net/ipv6/ip6_fib.c:1976
 fib6_clean_tree+0xe0/0x120 net/ipv6/ip6_fib.c:2055
 __fib6_clean_all+0x118/0x2a0 net/ipv6/ip6_fib.c:2071
 fib6_clean_all+0x2b/0x40 net/ipv6/ip6_fib.c:2082
 rt6_sync_down_dev+0x134/0x150 net/ipv6/route.c:4057
 rt6_disable_ip+0x27/0x5f0 net/ipv6/route.c:4062
 addrconf_ifdown+0xa2/0x1220 net/ipv6/addrconf.c:3705
 addrconf_notify+0x19a/0x2260 net/ipv6/addrconf.c:3630
 notifier_call_chain+0xc7/0x240 kernel/notifier.c:93
 __raw_notifier_call_chain kernel/notifier.c:394 [inline]
 raw_notifier_call_chain+0x2e/0x40 kernel/notifier.c:401
 call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1753
 call_netdevice_notifiers_extack net/core/dev.c:1765 [inline]
 call_netdevice_notifiers net/core/dev.c:1779 [inline]
 dev_close_many+0x33f/0x6f0 net/core/dev.c:1522
 rollback_registered_many+0x43b/0xfd0 net/core/dev.c:8177
 rollback_registered+0x109/0x1d0 net/core/dev.c:8242
 unregister_netdevice_queue net/core/dev.c:9289 [inline]
 unregister_netdevice_queue+0x1ee/0x2c0 net/core/dev.c:9282
 unregister_netdevice include/linux/netdevice.h:2658 [inline]
 __tun_detach+0xd5b/0x1000 drivers/net/tun.c:727
 tun_detach drivers/net/tun.c:744 [inline]
 tun_chr_close+0xe0/0x180 drivers/net/tun.c:3443
 __fput+0x2e5/0x8d0 fs/file_table.c:278
 ____fput+0x16/0x20 fs/file_table.c:309
 task_work_run+0x14a/0x1c0 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x90a/0x2fa0 kernel/exit.c:876
 do_group_exit+0x135/0x370 kernel/exit.c:980
 __do_sys_exit_group kernel/exit.c:991 [inline]
 __se_sys_exit_group kernel/exit.c:989 [inline]
 __x64_sys_exit_group+0x44/0x50 kernel/exit.c:989
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x458da9
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffeafc2a6a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 000000000000001c RCX: 0000000000458da9
RDX: 0000000000412a80 RSI: 0000000000a54ef0 RDI: 0000000000000043
RBP: 00000000004be552 R08: 000000000000000c R09: 000000000004c0d1
R10: 0000000002341940 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00007ffeafc2a7f0 R14: 000000000004c065 R15: 00007ffeafc2a800

Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: David Ahern <dsahern@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:38:38 -04:00
Andrew Lunn
93e86b3bc8 net: dsa: Remove legacy probing support
Now that all drivers can be probed using more traditional methods,
remove the legacy probe code.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:15:35 -04:00
Vladimir Oltean
314f76d7a6 net: dsa: Add more convenient functions for installing port VLANs
This hides the need to perform a two-phase transaction and construct a
switchdev_obj_port_vlan struct.

Call graph (including a function that will be introduced in a follow-up
patch) looks like this now (same for the *_vlan_del function):

dsa_slave_vlan_rx_add_vid   dsa_port_setup_8021q_tagging
            |                        |
            |                        |
            |          +-------------+
            |          |
            v          v
           dsa_port_vid_add      dsa_slave_port_obj_add
                  |                         |
                  +-------+         +-------+
                          |         |
                          v         v
                       dsa_port_vlan_add

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:29 -04:00
Vladimir Oltean
ec9121e7d2 net: dsa: Skip calling .port_vlan_filtering on no change
Even if VLAN filtering is global, DSA will call this callback once per
each port. Drivers should not have to compare the global state with the
requested change. So let DSA do it.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:29 -04:00
Vladimir Oltean
145746765f net: dsa: Keep the vlan_filtering setting in dsa_switch if it's global
The current behavior is not as obvious as one would assume (which is
that, if the driver set vlan_filtering_is_global = 1, then checking any
dp->vlan_filtering would yield the same result). Only the ports which
are actively enslaved into a bridge would have vlan_filtering set.

This makes it tricky for drivers to check what the global state is.
So fix this and make the struct dsa_switch hold this global setting.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:29 -04:00
Vladimir Oltean
d371b7c92d net: dsa: Unset vlan_filtering when ports leave the bridge
When ports are standalone (after they left the bridge), they should have
no VLAN filtering semantics (they should pass all traffic to the CPU).
Currently this is not true for switchdev drivers, because the bridge
"forgets" to unset that.

Normally one would think that doing this at the bridge layer would be a
better idea, i.e. call br_vlan_filter_toggle() from br_del_if(), similar
to how nbp_vlan_init() is called from br_add_if().

However what complicates that approach, and makes this one preferable,
is the fact that for the bridge core, vlan_filtering is a per-bridge
setting, whereas for switchdev/DSA it is per-port. Also there are
switches where the setting is per the entire device, and unsetting
vlan_filtering one by one, for each leaving port, would not be possible
from the bridge core without a certain level of awareness. So do this in
DSA and let drivers be unaware of it.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:28 -04:00
Vladimir Oltean
8f5d16f638 net: dsa: Be aware of switches where VLAN filtering is a global setting
On some switches, the action of whether to parse VLAN frame headers and use
that information for ingress admission is configurable, but not per
port. Such is the case for the Broadcom BCM53xx and the NXP SJA1105
families, for example. In that case, DSA can prevent the bridge core
from trying to apply different VLAN filtering settings on net devices
that belong to the same switch.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:28 -04:00
Vladimir Oltean
33162e9a05 net: dsa: Store vlan_filtering as a property of dsa_port
This allows drivers to query the VLAN setting imposed by the bridge
driver directly from DSA, instead of keeping their own state based on
the .port_vlan_filtering callback.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:28 -04:00
Vladimir Oltean
85478d73c9 net: dsa: Fix pharse -> phase typo
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:28 -04:00
Eric Dumazet
a622b40035 l2ip: fix possible use-after-free
Before taking a refcount on a rcu protected structure,
we need to make sure the refcount is not zero.

syzbot reported :

refcount_t: increment on 0; use-after-free.
WARNING: CPU: 1 PID: 23533 at lib/refcount.c:156 refcount_inc_checked lib/refcount.c:156 [inline]
WARNING: CPU: 1 PID: 23533 at lib/refcount.c:156 refcount_inc_checked+0x61/0x70 lib/refcount.c:154
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 23533 Comm: syz-executor.2 Not tainted 5.1.0-rc7+ #93
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 panic+0x2cb/0x65c kernel/panic.c:214
 __warn.cold+0x20/0x45 kernel/panic.c:571
 report_bug+0x263/0x2b0 lib/bug.c:186
 fixup_bug arch/x86/kernel/traps.c:179 [inline]
 fixup_bug arch/x86/kernel/traps.c:174 [inline]
 do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
RIP: 0010:refcount_inc_checked lib/refcount.c:156 [inline]
RIP: 0010:refcount_inc_checked+0x61/0x70 lib/refcount.c:154
Code: 1d 98 2b 2a 06 31 ff 89 de e8 db 2c 40 fe 84 db 75 dd e8 92 2b 40 fe 48 c7 c7 20 7a a1 87 c6 05 78 2b 2a 06 01 e8 7d d9 12 fe <0f> 0b eb c1 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 41 57 41
RSP: 0018:ffff888069f0fba8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 000000000000f353 RSI: ffffffff815afcb6 RDI: ffffed100d3e1f67
RBP: ffff888069f0fbb8 R08: ffff88809b1845c0 R09: ffffed1015d23ef1
R10: ffffed1015d23ef0 R11: ffff8880ae91f787 R12: ffff8880a8f26968
R13: 0000000000000004 R14: dffffc0000000000 R15: ffff8880a49a6440
 l2tp_tunnel_inc_refcount net/l2tp/l2tp_core.h:240 [inline]
 l2tp_tunnel_get+0x250/0x580 net/l2tp/l2tp_core.c:173
 pppol2tp_connect+0xc00/0x1c70 net/l2tp/l2tp_ppp.c:702
 __sys_connect+0x266/0x330 net/socket.c:1808
 __do_sys_connect net/socket.c:1819 [inline]
 __se_sys_connect net/socket.c:1816 [inline]
 __x64_sys_connect+0x73/0xb0 net/socket.c:1816

Fixes: 54652eb12c ("l2tp: hold tunnel while looking up sessions in l2tp_netlink")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 11:35:48 -04:00
YueHaibing
c93ad1337a appletalk: Set error code if register_snap_client failed
If register_snap_client fails in atalk_init,
error code should be set, otherwise it will
triggers NULL pointer dereference while unloading
module.

Fixes: 9804501fa1 ("appletalk: Fix potential NULL pointer dereference in unregister_snap_client")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 11:09:28 -04:00
David Howells
b13023421b rxrpc: Fix net namespace cleanup
In rxrpc_destroy_all_calls(), there are two phases: (1) make sure the
->calls list is empty, emitting error messages if not, and (2) wait for the
RCU cleanup to happen on outstanding calls (ie. ->nr_calls becomes 0).

To avoid taking the call_lock, the function prechecks ->calls and if empty,
it returns to avoid taking the lock - this is wrong, however: it still
needs to go and do the second phase and wait for ->nr_calls to become 0.

Without this, the rxrpc_net struct may get deallocated before we get to the
RCU cleanup for the last calls.  This can lead to:

  Slab corruption (Not tainted): kmalloc-16k start=ffff88802b178000, len=16384
  050: 6b 6b 6b 6b 6b 6b 6b 6b 61 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkakkkkkkk

Note the "61" at offset 0x58.  This corresponds to the ->nr_calls member of
struct rxrpc_net (which is >9k in size, and thus allocated out of the 16k
slab).

Fix this by flipping the condition on the if-statement, putting the locked
section inside the if-body and dropping the return from there.  The
function will then always go on to wait for the RCU cleanup on outstanding
calls.

Fixes: 2baec2c3f8 ("rxrpc: Support network namespacing")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 10:50:50 -04:00
David S. Miller
a658a3f2ec Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2019-04-30

1) A lot of work to remove indirections from the xfrm code.
   From Florian Westphal.

2) Support ESP offload in combination with gso partial.
   From Boris Pismenny.

3) Remove some duplicated code from vti4.
   From Jeremy Sowden.

Please note that there is merge conflict

between commit:

8742dc86d0 ("xfrm4: Fix uninitialized memory read in _decode_session4")

from the ipsec tree and commit:

c53ac41e37 ("xfrm: remove decode_session indirection from afinfo_policy")

from the ipsec-next tree. The merge conflict will appear
when those trees get merged during the merge window.
The conflict can be solved as it is done in linux-next:

https://lkml.org/lkml/2019/4/25/1207

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 09:26:13 -04:00
David S. Miller
b145745fc8 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2019-04-30

1) Fix an out-of-bound array accesses in __xfrm_policy_unlink.
   From YueHaibing.

2) Reset the secpath on failure in the ESP GRO handlers
   to avoid dereferencing an invalid pointer on error.
   From Myungho Jung.

3) Add and revert a patch that tried to add rcu annotations
   to netns_xfrm. From Su Yanjun.

4) Wait for rcu callbacks before freeing xfrm6_tunnel_spi_kmem.
   From Su Yanjun.

5) Fix forgotten vti4 ipip tunnel deregistration.
   From Jeremy Sowden:

6) Remove some duplicated log messages in vti4.
   From Jeremy Sowden.

7) Don't use IPSEC_PROTO_ANY when flushing states because
   this will flush only IPsec portocol speciffic states.
   IPPROTO_ROUTING states may remain in the lists when
   doing net exit. Fix this by replacing IPSEC_PROTO_ANY
   with zero. From Cong Wang.

8) Add length check for UDP encapsulation to fix "Oversized IP packet"
   warnings on receive side. From Sabrina Dubroca.

9) Fix xfrm interface lookup when the interface is associated to
   a vrf layer 3 master device. From Martin Willi.

10) Reload header pointers after pskb_may_pull() in _decode_session4(),
    otherwise we may read from uninitialized memory.

11) Update the documentation about xfrm[46]_gc_thresh, it
    is not used anymore after the flowcache removal.
    From Nicolas Dichtel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 09:11:10 -04:00
Colin Ian King
1de6f33421 netfilter: connlabels: fix spelling mistake "trackling" -> "tracking"
There is a spelling mistake in the module description. Fix this.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:57 +02:00
Brett Mastbergen
3087c3f7c2 netfilter: nft_ct: Add ct id support
The 'id' key returns the unique id of the conntrack entry as returned
by nf_ct_get_id().

Signed-off-by: Brett Mastbergen <bmastbergen@untangle.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:57 +02:00
Flavio Leitner
fec9c271b8 openvswitch: load and reference the NAT helper.
This improves the original commit 17c357efe5 ("openvswitch: load
NAT helper") where it unconditionally tries to load the module for
every flow using NAT, so not efficient when loading multiple flows.
It also doesn't hold any references to the NAT module while the
flow is active.

This change fixes those problems. It will try to load the module
only if it's not present. It grabs a reference to the NAT module
and holds it while the flow is active. Finally, an error message
shows up if either actions above fails.

Fixes: 17c357efe5 ("openvswitch: load NAT helper")
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:56 +02:00
Flavio Leitner
53b11308a1 netfilter: nf_nat: register NAT helpers.
Register amanda, ftp, irc, sip and tftp NAT helpers.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:55 +02:00
Flavio Leitner
08010a2160 netfilter: add API to manage NAT helpers.
The API allows a conntrack helper to indicate its corresponding
NAT helper which then can be loaded and reference counted.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:55 +02:00
Flavio Leitner
e1f172e162 netfilter: use macros to create module aliases.
Each NAT helper creates a module alias which follows a pattern.
Use macros for consistency.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:54 +02:00
Tonghao Zhang
8f14c99c7e netfilter: conntrack: limit sysctl setting for boolean options
We use the zero and one to limit the boolean options setting.
After this patch we only set 0 or 1 to boolean options for nf
conntrack sysctl.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:18:56 +02:00
Taehee Yoo
33cc3c0cfa netfilter: nf_flow_table: check ttl value in flow offload data path
nf_flow_offload_ip_hook() and nf_flow_offload_ipv6_hook() do not check
ttl value. So, ttl value overflow may occur.

Fixes: 97add9f0d6 ("netfilter: flow table support for IPv4")
Fixes: 0995210753 ("netfilter: flow table support for IPv6")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:56:19 +02:00
Taehee Yoo
26a302afbe netfilter: nf_flow_table: fix netdev refcnt leak
flow_offload_alloc() calls nf_route() to get a dst_entry. Internally,
nf_route() calls ip_route_output_key() that allocates a dst_entry and
holds it. So, a dst_entry should be released by dst_release() if
nf_route() is successful.

Otherwise, netns exit routine cannot be finished and the following
message is printed:

[  257.490952] unregister_netdevice: waiting for lo to become free. Usage count = 1

Fixes: ac2a66665e ("netfilter: add generic flow table infrastructure")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:55:00 +02:00
Pablo Neira Ayuso
270a8a297f netfilter: nft_flow_offload: add entry to flowtable after confirmation
This is fixing flow offload for UDP traffic where packets only follow
one single direction.

The flow_offload_fixup_tcp() mechanism works fine in case that the
offloaded entry remains in SYN_RECV state, given sequence tracking is
reset and that conntrack handles syn+ack packets as a retransmission, ie.

	sES + synack => sIG

for reply traffic.

Fixes: a3c90f7a23 ("netfilter: nf_tables: flow offload expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:48:23 +02:00
Florian Westphal
66293c46c9 netfilter: nf_tables: delay chain policy update until transaction is complete
When we process a long ruleset of the form

chain input {
   type filter hook input priority filter; policy drop;
   ...
}

Then the base chain gets registered early on, we then continue to
process/validate the next messages coming in the same transaction.

Problem is that if the base chain policy is 'drop', it will take effect
immediately, which causes all traffic to get blocked until the
transaction completes or is aborted.

Fix this by deferring the policy until the transaction has been
processed and all of the rules have been flagged as active.

Reported-by: Jann Haber <jann.haber@selfnet.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:47:32 +02:00
Paul Gortmaker
c5f1931f66 netfilter: nf_tables: fix implicit include of module.h
This file clearly uses modular infrastructure but does not call
out the inclusion of <linux/module.h> explicitly.  We add that
include explicitly here, so we can tidy up some header usage
elsewhere w/o causing build breakage.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:35:26 +02:00
Paul Gortmaker
f1f86d09ca netfilter: nf_tables: relocate header content to consumer
The nf_tables.h header is used in a lot of files, but it turns out
that there is only one actual user of nft_expr_clone().

Hence we relocate that function to be with the one consumer of it
and avoid having to process it with CPP for all the other files.

This will also enable a reduction in the other headers that the
nf_tables.h itself has to include just to be stand-alone, hence
a pending further significant reduction in the CPP content that
needs to get processed for each netfilter file.

Note that the explicit "inline" has been dropped as part of this
relocation.  In similar changes to this, I believe Dave has asked
this be done, so we free up gcc to make the choice of whether to
inline or not.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:35:17 +02:00
Eric Dumazet
6c0afef5fb ipv6/flowlabel: wait rcu grace period before put_pid()
syzbot was able to catch a use-after-free read in pid_nr_ns() [1]

ip6fl_seq_show() seems to use RCU protection, dereferencing fl->owner.pid
but fl_free() releases fl->owner.pid before rcu grace period is started.

[1]

BUG: KASAN: use-after-free in pid_nr_ns+0x128/0x140 kernel/pid.c:407
Read of size 4 at addr ffff888094012a04 by task syz-executor.0/18087

CPU: 0 PID: 18087 Comm: syz-executor.0 Not tainted 5.1.0-rc6+ #89
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131
 pid_nr_ns+0x128/0x140 kernel/pid.c:407
 ip6fl_seq_show+0x2f8/0x4f0 net/ipv6/ip6_flowlabel.c:794
 seq_read+0xad3/0x1130 fs/seq_file.c:268
 proc_reg_read+0x1fe/0x2c0 fs/proc/inode.c:227
 do_loop_readv_writev fs/read_write.c:701 [inline]
 do_loop_readv_writev fs/read_write.c:688 [inline]
 do_iter_read+0x4a9/0x660 fs/read_write.c:922
 vfs_readv+0xf0/0x160 fs/read_write.c:984
 kernel_readv fs/splice.c:358 [inline]
 default_file_splice_read+0x475/0x890 fs/splice.c:413
 do_splice_to+0x12a/0x190 fs/splice.c:876
 splice_direct_to_actor+0x2d2/0x970 fs/splice.c:953
 do_splice_direct+0x1da/0x2a0 fs/splice.c:1062
 do_sendfile+0x597/0xd00 fs/read_write.c:1443
 __do_sys_sendfile64 fs/read_write.c:1498 [inline]
 __se_sys_sendfile64 fs/read_write.c:1490 [inline]
 __x64_sys_sendfile64+0x15a/0x220 fs/read_write.c:1490
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x458da9
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f300d24bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 0000000000458da9
RDX: 00000000200000c0 RSI: 0000000000000008 RDI: 0000000000000007
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000005a R11: 0000000000000246 R12: 00007f300d24c6d4
R13: 00000000004c5fa3 R14: 00000000004da748 R15: 00000000ffffffff

Allocated by task 17543:
 save_stack+0x45/0xd0 mm/kasan/common.c:75
 set_track mm/kasan/common.c:87 [inline]
 __kasan_kmalloc mm/kasan/common.c:497 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:470
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:505
 slab_post_alloc_hook mm/slab.h:437 [inline]
 slab_alloc mm/slab.c:3393 [inline]
 kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3555
 alloc_pid+0x55/0x8f0 kernel/pid.c:168
 copy_process.part.0+0x3b08/0x7980 kernel/fork.c:1932
 copy_process kernel/fork.c:1709 [inline]
 _do_fork+0x257/0xfd0 kernel/fork.c:2226
 __do_sys_clone kernel/fork.c:2333 [inline]
 __se_sys_clone kernel/fork.c:2327 [inline]
 __x64_sys_clone+0xbf/0x150 kernel/fork.c:2327
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 7789:
 save_stack+0x45/0xd0 mm/kasan/common.c:75
 set_track mm/kasan/common.c:87 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:459
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:467
 __cache_free mm/slab.c:3499 [inline]
 kmem_cache_free+0x86/0x260 mm/slab.c:3765
 put_pid.part.0+0x111/0x150 kernel/pid.c:111
 put_pid+0x20/0x30 kernel/pid.c:105
 fl_free+0xbe/0xe0 net/ipv6/ip6_flowlabel.c:102
 ip6_fl_gc+0x295/0x3e0 net/ipv6/ip6_flowlabel.c:152
 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
 expire_timers kernel/time/timer.c:1362 [inline]
 __run_timers kernel/time/timer.c:1681 [inline]
 __run_timers kernel/time/timer.c:1649 [inline]
 run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
 __do_softirq+0x266/0x95a kernel/softirq.c:293

The buggy address belongs to the object at ffff888094012a00
 which belongs to the cache pid_2 of size 88
The buggy address is located 4 bytes inside of
 88-byte region [ffff888094012a00, ffff888094012a58)
The buggy address belongs to the page:
page:ffffea0002500480 count:1 mapcount:0 mapping:ffff88809a483080 index:0xffff888094012980
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00018a3508 ffffea0002524a88 ffff88809a483080
raw: ffff888094012980 ffff888094012000 000000010000001b 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888094012900: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
 ffff888094012980: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
>ffff888094012a00: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
                   ^
 ffff888094012a80: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
 ffff888094012b00: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc

Fixes: 4f82f45730 ("net ip6 flowlabel: Make owner a union of struct pid * and kuid_t")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29 23:30:13 -04:00
Stephen Suryaputra
1d3fd8a10b vrf: Use orig netdev to count Ip6InNoRoutes and a fresh route lookup when sending dest unreach
When there is no route to an IPv6 dest addr, skb_dst(skb) points
to loopback dev in the case of that the IP6CB(skb)->iif is
enslaved to a vrf. This causes Ip6InNoRoutes to be incremented on the
loopback dev. This also causes the lookup to fail on icmpv6_send() and
the dest unreachable to not sent and Ip6OutNoRoutes gets incremented on
the loopback dev.

To reproduce:
* Gateway configuration:
        ip link add dev vrf_258 type vrf table 258
        ip link set dev enp0s9 master vrf_258
        ip addr add 66:1/64 dev enp0s9
        ip -6 route add unreachable default metric 8192 table 258
        sysctl -w net.ipv6.conf.all.forwarding=1
        sysctl -w net.ipv6.conf.enp0s9.forwarding=1
* Sender configuration:
        ip addr add 66::2/64 dev enp0s9
        ip -6 route add default via 66::1
and ping 67::1 for example from the sender.

Fix this by counting on the original netdev and reset the skb dst to
force a fresh lookup.

v2: Fix typo of destination address in the repro steps.
v3: Simplify the loopback check (per David Ahern) and use reverse
    Christmas tree format (per David Miller).

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29 23:28:52 -04:00
Eric Dumazet
ca2fe2956a tcp: add sanity tests in tcp_add_backlog()
Richard and Bruno both reported that my commit added a bug,
and Bruno was able to determine the problem came when a segment
wih a FIN packet was coalesced to a prior one in tcp backlog queue.

It turns out the header prediction in tcp_rcv_established()
looks back to TCP headers in the packet, not in the metadata
(aka TCP_SKB_CB(skb)->tcp_flags)

The fast path in tcp_rcv_established() is not supposed to
handle a FIN flag (it does not call tcp_fin())

Therefore we need to make sure to propagate the FIN flag,
so that the coalesced packet does not go through the fast path,
the same than a GRO packet carrying a FIN flag.

While we are at it, make sure we do not coalesce packets with
RST or SYN, or if they do not have ACK set.

Many thanks to Richard and Bruno for pinpointing the bad commit,
and to Richard for providing a first version of the fix.

Fixes: 4f693b55c3 ("tcp: implement coalescing on backlog queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Reported-by: Bruno Prémont <bonbons@sysophe.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29 23:20:37 -04:00
Willem de Bruijn
95c169251b ipv6: invert flowlabel sharing check in process and user mode
A request for a flowlabel fails in process or user exclusive mode must
fail if the caller pid or uid does not match. Invert the test.

Previously, the test was unsafe wrt PID recycling, but indeed tested
for inequality: fl1->owner != fl->owner

Fixes: 4f82f45730 ("net ip6 flowlabel: Make owner a union of struct pid* and kuid_t")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29 18:22:56 -04:00
David S. Miller
2ae7a39770 * fix use-after-free in mac80211 TXQs
* fix RX STBC byte order
  * fix debugfs rename crashing due to ERR_PTR()
  * fix missing regulatory notification
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlzCyg8ACgkQB8qZga/f
 l8TTzQ//UdRQL5Mt6GfTBrZ/V8gp0+urykENefErjQi7GzUpZC5+bHpm2IGKr6y3
 rf1q3Tr+vAr4WdHxQxjVP0AJIMd5+UAEGgevgMj0yY7Nu9+ktYryhnXrKPBbZRQ+
 mrnZ1u/t6RjA4rCHtEpaW8Xb8FT01VpdtZXbDOaNmGKLLwCbGJb9I9rCfEWczUQ+
 b/a6IFhSE7kxeoceiHXjA7X3eNex/DlV5oE8RCK5gcUtWj8gMzVWNjdiC3gt0VLo
 zLdteEv/C7e+XycWIeHHIWEdScJJFeFNY4e7mvwNIRnil+HrShyrkJUbuHCc/UC/
 USVc/PJX05UTjuAFn1wOdJerUzzQAOYPurZSxuFVl/v2YXjybMYLcCfk1Hplp1v6
 a6nkc/CYGCytHqKteyG9XzkgM3c1rmr8bUDnvJjTxhPh5htlln876VPYnjpEXgGD
 lJgyEgY7M65c3C33dAvilKS3MteZKABXYSCMA+K7mNY5PVcWrC69FYBh9oG4NQQk
 53h+4VxmOBBW4UkZpptbxp4HIHIJCoKR09K9fWuvqGNj1uyexgMY1k6C1g3q2SkA
 PXr+ZGgx5P/DFJsz7U4aqG3mi8vagAUwKEZ6Tt7/BuozrOlwTR/bKoaIZq+Q1Lcu
 SUlyEj18qBZVCKyD2KEveLIGXLjCbHOw0HHB6A4ICEqNcCW6jGE=
 =KjkZ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2019-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
 * fix use-after-free in mac80211 TXQs
 * fix RX STBC byte order
 * fix debugfs rename crashing due to ERR_PTR()
 * fix missing regulatory notification
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-29 11:00:58 -04:00
Andrew Lunn
0b9f9dfbfa dsa: Allow tag drivers to be built as modules
Make the CONFIG symbols tristate and add help text.

The broadcom and Microchip KSZ tag drivers support two different
tagging protocols in one driver. Add a configuration option for the
drivers, and then options to select the protocol.

Create a submenu for the tagging drivers.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>

v2:
tab/space cleanup
Help text wording
NET_DSA_TAG_BRCM_COMMON and NET_DSA_TAG_KZS_COMMON hidden

v3:
More tabification
Punctuation

v4:
trailler->trailer

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
3aa475e197 dsa: tag_brcm: Avoid unused symbols
It is possible that the driver is compiled with both
CONFIG_NET_DSA_TAG_BRCM and CONFIG_NET_DSA_TAG_BRCM_PREPEND disabled.
This results in warnings about unused symbols. Add some conditional
compilation to avoid this.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

v2
Reorder patch to before tag drivers can be modules

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
f81a43e8da dsa: Cleanup unneeded table and make tag structures static
Now that tag drivers dynamically register, we don't need the static
table. Remove it. This also means the tag driver structures can be
made static.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
3675617531 dsa: Make use of the list of tag drivers
Implement the _get and _put functions to make use of the list of tag
drivers. Also, trigger the loading of the module, based on the alias
information. The _get function takes a reference on the tag driver, so
it cannot be unloaded, and the _put function releases the reference.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

v2:
Make tag_driver_register void

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
4dad81ee14 dsa: Add stub tag driver put method
When a DSA switch driver is unloaded, the lock on the tag driver
should be released so the module can be unloaded. Add the needed calls,
but leave the actual release code as a stub.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

v2

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
c39e2a1d71 dsa: Rename dsa_resolve_tag_protocol() to _get ready for locking
dsa_resolve_tag_protocol() is used to find the tagging driver needed
by a switch driver. When the tagging drivers become modules, it will
be necassary to take a reference on the module to prevent it being
unloaded. So rename this function to _get() to indicate it has some
locking properties.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
409065b069 dsa: Register the none tagger ops
The none tagger is special in that it does not live in a tag_*.c file,
but is within the core. Register/unregister when DSA is
loaded/unloaded.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
bdc6fe5bb1 dsa: Keep link list of tag drivers
Let the tag drivers register themselves with the DSA core, keeping
them in a linked list.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>

v2

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
d3b8c04988 dsa: Add boilerplate helper to register DSA tag driver modules
A DSA tag driver module will need to register the tag protocols it
implements with the DSA core. Add macros containing this boiler plate.

The registration/unregistration code is currently just a stub. A Later
patch will add the real implementation.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

v2
Fix indent of #endif
Rewrite to move list pointer into a new structure
v3
Move kdoc next to macro
Fix THIS_MODULE indentation

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
056eed2fb0 dsa: Add TAG protocol to tag ops
In order that we can match the tagging protocol a switch driver
request to the tagger, we need to know what protocol the tagger
supports. Add this information to the ops structure.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

v2
More tag protocol to end of structure to keep hot members at the beginning.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
f18bba50d2 dsa: Add MODULE_LICENSE to tag drivers
All the tag drivers are some variant of GPL. Add a MODULE_LICENSE()
indicating this, so the drivers can later be compiled as modules.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
0b42f03363 dsa: Add MODULE_ALIAS to taggers in preparation to become modules
When the tag drivers become modules, we will need to dynamically load
them based on what the switch drivers need. Add aliases to map between
the TAG protocol and the driver.

In order to do this, we need the tag protocol number as something
which the C pre-processor can stringinfy. Only the compiler knows the
value of an enum, CPP cannot use them. So add #defines.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
875138f81d dsa: Move tagger name into its ops structure
Rather than keep a list to map a tagger ops to a name, place the name
into the ops structure. This removes the hard coded list, a step
towards making the taggers more dynamic.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

v2:
Move name to end of structure, keeping the hot entries at the beginning.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:00 -04:00
Andrew Lunn
dfedd3b624 dsa: Add SPDX header to tag drivers.
Add an SPDX header, and remove the license boilerplate text.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:00 -04:00
David S. Miller
5f0d736e7f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-04-28

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Introduce BPF socket local storage map so that BPF programs can store
   private data they associate with a socket (instead of e.g. separate hash
   table), from Martin.

2) Add support for bpftool to dump BTF types. This is done through a new
   `bpftool btf dump` sub-command, from Andrii.

3) Enable BPF-based flow dissector for skb-less eth_get_headlen() calls which
   was currently not supported since skb was used to lookup netns, from Stanislav.

4) Add an opt-in interface for tracepoints to expose a writable context
   for attached BPF programs, used here for NBD sockets, from Matt.

5) BPF xadd related arm64 JIT fixes and scalability improvements, from Daniel.

6) Change the skb->protocol for bpf_skb_adjust_room() helper in order to
   support tunnels such as sit. Add selftests as well, from Willem.

7) Various smaller misc fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 08:42:41 -04:00
Paolo Abeni
21f1b8a663 udp: fix GRO reception in case of length mismatch
Currently, the UDP GRO code path does bad things on some edge
conditions - Aggregation can happen even on packet with different
lengths.

Fix the above by rewriting the 'complete' condition for GRO
packets. While at it, note explicitly that we allow merging the
first packet per burst below gso_size.

Reported-by: Sean Tong <seantong114@gmail.com>
Fixes: e20cf8d3f1 ("udp: implement GRO for plain UDP sockets.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 22:07:24 -04:00
Jakub Kicinski
eb3d38d5ad net/tls: fix copy to fragments in reencrypt
Fragments may contain data from other records so we have to account
for that when we calculate the destination and max length of copy we
can perform.  Note that 'offset' is the offset within the message,
so it can't be passed as offset within the frag..

Here skb_store_bits() would have realised the call is wrong and
simply not copy data.

Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 20:17:19 -04:00
Jakub Kicinski
97e1caa517 net/tls: don't copy negative amounts of data in reencrypt
There is no guarantee the record starts before the skb frags.
If we don't check for this condition copy amount will get
negative, leading to reads and writes to random memory locations.
Familiar hilarity ensues.

Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 20:17:19 -04:00
Johannes Berg
ef6243acb4 genetlink: optionally validate strictly/dumps
Add options to strictly validate messages and dump messages,
sometimes perhaps validating dump messages non-strictly may
be required, so add an option for that as well.

Since none of this can really be applied to existing commands,
set the options everwhere using the following spatch:

    @@
    identifier ops;
    expression X;
    @@
    struct genl_ops ops[] = {
    ...,
     {
            .cmd = X,
    +       .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
            ...
     },
    ...
    };

For new commands one should just not copy the .validate 'opt-out'
flags and thus get strict validation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:22 -04:00
Johannes Berg
8cb081746c netlink: make validation more configurable for future strictness
We currently have two levels of strict validation:

 1) liberal (default)
     - undefined (type >= max) & NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted
     - garbage at end of message accepted
 2) strict (opt-in)
     - NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted

Split out parsing strictness into four different options:
 * TRAILING     - check that there's no trailing data after parsing
                  attributes (in message or nested)
 * MAXTYPE      - reject attrs > max known type
 * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
 * STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().

Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.

We end up with the following renames:
 * nla_parse           -> nla_parse_deprecated
 * nla_parse_strict    -> nla_parse_deprecated_strict
 * nlmsg_parse         -> nlmsg_parse_deprecated
 * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
 * nla_parse_nested    -> nla_parse_nested_deprecated
 * nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:21 -04:00
Michal Kubecek
f78c6032c4 net: fix two coding style issues
This is a simple cleanup addressing two coding style issues found by
checkpatch.pl in an earlier patch. It's submitted as a separate patch to
keep the original patch as it was generated by spatch.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:03:44 -04:00
Michal Kubecek
12ad5f65f0 ipset: drop ipset_nest_start() and ipset_nest_end()
After the previous commit, both ipset_nest_start() and ipset_nest_end() are
just aliases for nla_nest_start() and nla_nest_end() so that there is no
need to keep them.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:03:44 -04:00
Michal Kubecek
ae0be8de9a netlink: make nla_nest_start() add NLA_F_NESTED flag
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:03:44 -04:00
Jakub Kicinski
63a1c95f3f net/tls: byte swap device req TCP seq no upon setting
To avoid a sparse warning byteswap the be32 sequence number
before it's stored in the atomic value.  While at it drop
unnecessary brackets and use kernel's u64 type.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 16:52:21 -04:00
Jakub Kicinski
9e9957973c net/tls: remove old exports of sk_destruct functions
tls_device_sk_destruct being set on a socket used to indicate
that socket is a kTLS device one.  That is no longer true -
now we use sk_validate_xmit_skb pointer for that purpose.
Remove the export.  tls_device_attach() needs to be moved.

While at it, remove the dead declaration of tls_sk_destruct().

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 16:52:21 -04:00
Jakub Kicinski
e49d268db9 net/tls: don't log errors every time offload can't proceed
Currently when CONFIG_TLS_DEVICE is set each time kTLS
connection is opened and the offload is not successful
(either because the underlying device doesn't support
it or e.g. it's tables are full) a rate limited error
will be printed to the logs.

There is nothing wrong with failing TLS offload.  SW
path will process the packets just fine, drop the
noisy messages.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 16:52:21 -04:00
Martin KaFai Lau
6ac99e8f23 bpf: Introduce bpf sk local storage
After allowing a bpf prog to
- directly read the skb->sk ptr
- get the fullsock bpf_sock by "bpf_sk_fullsock()"
- get the bpf_tcp_sock by "bpf_tcp_sock()"
- get the listener sock by "bpf_get_listener_sock()"
- avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
  into different bpf running context.

this patch is another effort to make bpf's network programming
more intuitive to do (together with memory and performance benefit).

When bpf prog needs to store data for a sk, the current practice is to
define a map with the usual 4-tuples (src/dst ip/port) as the key.
If multiple bpf progs require to store different sk data, multiple maps
have to be defined.  Hence, wasting memory to store the duplicated
keys (i.e. 4 tuples here) in each of the bpf map.
[ The smallest key could be the sk pointer itself which requires
  some enhancement in the verifier and it is a separate topic. ]

Also, the bpf prog needs to clean up the elem when sk is freed.
Otherwise, the bpf map will become full and un-usable quickly.
The sk-free tracking currently could be done during sk state
transition (e.g. BPF_SOCK_OPS_STATE_CB).

The size of the map needs to be predefined which then usually ended-up
with an over-provisioned map in production.  Even the map was re-sizable,
while the sk naturally come and go away already, this potential re-size
operation is arguably redundant if the data can be directly connected
to the sk itself instead of proxy-ing through a bpf map.

This patch introduces sk->sk_bpf_storage to provide local storage space
at sk for bpf prog to use.  The space will be allocated when the first bpf
prog has created data for this particular sk.

The design optimizes the bpf prog's lookup (and then optionally followed by
an inline update).  bpf_spin_lock should be used if the inline update needs
to be protected.

BPF_MAP_TYPE_SK_STORAGE:
-----------------------
To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
be created to fit different bpf progs' needs.  The map enforces
BTF to allow printing the sk-local-storage during a system-wise
sk dump (e.g. "ss -ta") in the future.

The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete
a "sk-local-storage" data from a particular sk.
Think of the map as a meta-data (or "type") of a "sk-local-storage".  This
particular "type" of "sk-local-storage" data can then be stored in any sk.

The main purposes of this map are mostly:
1. Define the size of a "sk-local-storage" type.
2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
   map-id, map-btf...etc.)
3. Keep track of all sk's storages of this "type" and clean them up
   when the map is freed.

sk->sk_bpf_storage:
------------------
The main lookup/update/delete is done on sk->sk_bpf_storage (which
is a "struct bpf_sk_storage").  When doing a lookup,
the "map" pointer is now used as the "key" to search on the
sk_storage->list.  The "map" pointer is actually serving
as the "type" of the "sk-local-storage" that is being
requested.

To allow very fast lookup, it should be as fast as looking up an
array at a stable-offset.  At the same time, it is not ideal to
set a hard limit on the number of sk-local-storage "type" that the
system can have.  Hence, this patch takes a cache approach.
The last search result from sk_storage->list is cached in
sk_storage->cache[] which is a stable sized array.  Each
"sk-local-storage" type has a stable offset to the cache[] array.
In the future, a map's flag could be introduced to do cache
opt-out/enforcement if it became necessary.

The cache size is 16 (i.e. 16 types of "sk-local-storage").
Programs can share map.  On the program side, having a few bpf_progs
running in the networking hotpath is already a lot.  The bpf_prog
should have already consolidated the existing sock-key-ed map usage
to minimize the map lookup penalty.  16 has enough runway to grow.

All sk-local-storage data will be removed from sk->sk_bpf_storage
during sk destruction.

bpf_sk_storage_get() and bpf_sk_storage_delete():
------------------------------------------------
Instead of using bpf_map_(lookup|update|delete)_elem(),
the bpf prog needs to use the new helper bpf_sk_storage_get() and
bpf_sk_storage_delete().  The verifier can then enforce the
ARG_PTR_TO_SOCKET argument.  The bpf_sk_storage_get() also allows to
"create" new elem if one does not exist in the sk.  It is done by
the new BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
it has eliminated the potential use cases for an equivalent
bpf_map_update_elem() API (for bpf_prog) in this patch.

Misc notes:
----------
1. map_get_next_key is not supported.  From the userspace syscall
   perspective,  the map has the socket fd as the key while the map
   can be shared by pinned-file or map-id.

   Since btf is enforced, the existing "ss" could be enhanced to pretty
   print the local-storage.

   Supporting a kernel defined btf with 4 tuples as the return key could
   be explored later also.

2. The sk->sk_lock cannot be acquired.  Atomic operations is used instead.
   e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
   Please refer to the source code comments for the details in
   synchronization cases and considerations.

3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.

Benchmark:
---------
Here is the benchmark data collected by turning on
the "kernel.bpf_stats_enabled" sysctl.
Two bpf progs are tested:

One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
sk ptr as the key. (verifier is modified to support sk ptr as the key
That should have shortened the key lookup time.)

Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.

Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
each egress skb and then bump the cnt.  netperf is used to drive
data with 4096 connected UDP sockets.

BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run)
27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
    loaded_at 2019-04-15T13:46:39-0700  uid 0
    xlated 344B  jited 258B  memlock 4096B  map_ids 16
    btf_id 5

BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
    loaded_at 2019-04-15T13:47:54-0700  uid 0
    xlated 168B  jited 156B  memlock 4096B  map_ids 17
    btf_id 6

Here is a high-level picture on how are the objects organized:

       sk
    ┌──────┐
    │      │
    │      │
    │      │
    │*sk_bpf_storage─────▶ bpf_sk_storage
    └──────┘                 ┌───────┐
                 ┌───────────┤ list  │
                 │           │       │
                 │           │       │
                 │           │       │
                 │           └───────┘
                 │
                 │     elem
                 │  ┌────────┐
                 ├─▶│ snode  │
                 │  ├────────┤
                 │  │  data  │          bpf_map
                 │  ├────────┤        ┌─────────┐
                 │  │map_node│◀─┬─────┤  list   │
                 │  └────────┘  │     │         │
                 │              │     │         │
                 │     elem     │     │         │
                 │  ┌────────┐  │     └─────────┘
                 └─▶│ snode  │  │
                    ├────────┤  │
   bpf_map          │  data  │  │
 ┌─────────┐        ├────────┤  │
 │  list   ├───────▶│map_node│  │
 │         │        └────────┘  │
 │         │                    │
 │         │           elem     │
 └─────────┘        ┌────────┐  │
                 ┌─▶│ snode  │  │
                 │  ├────────┤  │
                 │  │  data  │  │
                 │  ├────────┤  │
                 │  │map_node│◀─┘
                 │  └────────┘
                 │
                 │
                 │          ┌───────┐
     sk          └──────────│ list  │
  ┌──────┐                  │       │
  │      │                  │       │
  │      │                  │       │
  │      │                  └───────┘
  │*sk_bpf_storage───────▶bpf_sk_storage
  └──────┘

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27 09:07:04 -07:00
Matt Mullins
e950e84336 selftests: bpf: test writable buffers in raw tps
This tests that:
  * a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE cannot be attached if it
    uses either:
    * a variable offset to the tracepoint buffer, or
    * an offset beyond the size of the tracepoint buffer
  * a tracer can modify the buffer provided when attached to a writable
    tracepoint in bpf_prog_test_run

Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26 19:04:19 -07:00
Trond Myklebust
ac83228a71 SUNRPC: Use namespace of listening daemon in the client AUTH_GSS upcall
When the client needs to talk to rpc.gssd, we should ensure that the
uid argument is encoded to match the user namespace of the daemon.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26 16:40:38 -04:00
Trond Myklebust
283ebe3ec4 SUNRPC: Use the client user namespace when encoding creds
When encoding AUTH_UNIX creds and AUTH_GSS upcalls, use the user namespace
of the process that created the rpc client.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26 16:24:32 -04:00
David S. Miller
30e5a9a5ba Various updates, notably:
* extended key ID support (from 802.11-2016)
  * per-STA TX power control support
  * mac80211 TX performance improvements
  * HE (802.11ax) updates
  * mesh link probing support
  * enhancements of multi-BSSID support (also related to HE)
  * OWE userspace processing support
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlzC5YYACgkQB8qZga/f
 l8QDWg/+N7wm+l7bTMx4hjJzZZ60n9fBvyGJx0gsnPVH8wdOiPoh/epuI04I8I4m
 pGNbGvPB9Z4z2tD56XsIQnXf88ab3R27bRupSSW1vtzVSbDhg8wQ7jg0nABrdyDS
 PgoTmDMfVERLewXdntqRANzVYGfoWSOzo1u6A0Xhys8FqxxX/eD+Vdo4dKzmeN47
 +LDfuCpInVPn0TOpFp5IJ4+B4a0dhkz2/Q1BOE7NquXVvk4X77VJohV/BgQJ04Io
 yt7mn5rzYM6j4o1XLACxUEHkXvht6h34abG0yHRnuoAEp/sdPz2jAXT4OxYqs6x0
 XdLdr8gZgkMnnYaOQef74uJ2Ku+4A1ootjXSPazA7BWX0X5GqHnET/INk2S6cQPj
 C95LYfKC0ICD0qfioBmmHx8icDGoovcaswCju2ozfqWaD4Lwr3BcesnNDFtkHD9o
 aYaTTGGSxFyr2bZWTDpv4D4H5g3V4srRJsXs+SokL54nvlwd/smUJ4PVTLomP9y2
 XswRtLdoiUsCrJy967CXfhsxnE5SRhmBQE38Jq8/pzetlRk2spvJJC5MGYF0O/nT
 0UHbrjBCFUT2s8jv+gWWabOBUovsNJlgaxFwrZ/eNVIk0DK0ERoMV3V4MktU8uza
 Y339T14kxw4wlY2z5pOmEgkxmKZbPb55dBba04JEZzz9zDTawTk=
 =JQOx
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2019-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Various updates, notably:
 * extended key ID support (from 802.11-2016)
 * per-STA TX power control support
 * mac80211 TX performance improvements
 * HE (802.11ax) updates
 * mesh link probing support
 * enhancements of multi-BSSID support (also related to HE)
 * OWE userspace processing support
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 16:05:52 -04:00
Trond Myklebust
79caa5fad4 SUNRPC: Cache cred of process creating the rpc_client
When converting kuids to AUTH_UNIX creds, etc we will want to use the
same user namespace as the process that created the rpc client.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26 16:00:48 -04:00
Eric Dumazet
0e58156d70 tipc: remove rcu_read_unlock() left in tipc_udp_recv()
I forgot to remove one rcu_read_unlock() before a return statement.

Joy of mixing goto and return styles in a function :)

Fixes: 4109a2c3b9 ("tipc: tipc_udp_recv() cleanup vs rcu verbs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 12:01:11 -04:00
Marcel Holtmann
4e43df38a2 genetlink: use idr_alloc_cyclic for family->id assignment
When allocating the next family->id it makes more sense to use
idr_alloc_cyclic to avoid re-using a previously used family->id as much
as possible.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 11:59:58 -04:00
David Ahern
e55449e71a ipv6: Initialize fib6_result in bpf_ipv6_fib_lookup
fib6_result is not initialized in bpf_ipv6_fib_lookup and potentially
passses garbage to the fib lookup which triggers a KASAN warning:

[  262.055450] ==================================================================
[  262.057640] BUG: KASAN: user-memory-access in fib6_rule_suppress+0x4b/0xce
[  262.059488] Read of size 8 at addr 00000a20000000b0 by task swapper/1/0
[  262.061238]
[  262.061673] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.1.0-rc5+ #56
[  262.063493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
[  262.065593] Call Trace:
[  262.066277]  <IRQ>
[  262.066848]  dump_stack+0x7e/0xbb
[  262.067764]  kasan_report+0x18b/0x1b5
[  262.069921]  __asan_load8+0x7f/0x81
[  262.070879]  fib6_rule_suppress+0x4b/0xce
[  262.071980]  fib_rules_lookup+0x275/0x2cd
[  262.073090]  fib6_lookup+0x119/0x218
[  262.076457]  bpf_ipv6_fib_lookup+0x39d/0x664
...

Initialize fib6_result to 0.

Fixes: b1d4099150 ("ipv6: Rename fib6_multipath_select and pass fib6_result")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 11:32:00 -04:00
Gustavo A. R. Silva
60747828ea net: socket: Fix missing break in switch statement
Add missing break statement in order to prevent the code from falling
through to cases SIOCGSTAMP_NEW and SIOCGSTAMPNS_NEW.

This bug was found thanks to the ongoing efforts to enable
-Wimplicit-fallthrough.

Fixes: 0768e17073 ("net: socket: implement 64-bit timestamps")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 11:28:47 -04:00
Eric Dumazet
c1c4772178 l2tp: use rcu_dereference_sk_user_data() in l2tp_udp_encap_recv()
Canonical way to fetch sk_user_data from an encap_rcv() handler called
from UDP stack in rcu protected section is to use rcu_dereference_sk_user_data(),
otherwise compiler might read it multiple times.

Fixes: d00fa9adc5 ("il2tp: fix races with tunnel socket close")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 11:18:23 -04:00
Rajkumar Manoharan
8828f81ad4 mac80211: probe unexercised mesh links
The requirement for mesh link metric refreshing, is that from one
mesh point we be able to send some data frames to other mesh points
which are not currently selected as a primary traffic path, but which
are only 1 hop away. The absence of the primary path to the chosen node
makes it necessary to apply some form of marking on a chosen packet
stream so that the packets can be properly steered to the selected node
for testing, and not by the regular mesh path lookup.

Tested-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Rajkumar Manoharan
060167729a mac80211: add option for setting control flags
Allows setting of control flags of skb cb - if needed -
when calling ieee80211_subif_start_xmit().

Tested-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Rajkumar Manoharan
5ab92e7fe4 cfg80211: add support to probe unexercised mesh link
Adding support to allow mesh HWMP to measure link metrics on unexercised
direct mesh path by sending some data frames to other mesh points which
are not currently selected as a primary traffic path but only 1 hop away.
The absence of the primary path to the chosen node makes it necessary to
apply some form of marking on a chosen packet stream so that the packets
can be properly steered to the selected node for testing, and not by the
regular mesh path lookup.

Tested-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Alexander Wetzel
387bc00225 mac80211: Set CAN_REPLACE_PTK0 for SW crypto only drivers
Mac80211 SW crypto handles replacing PTK keys correctly.

Don't trigger needless warnings or workarounds when the driver can only
use the known good SW crypto provided by mac80211.

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Luca Coelho
a680fe468d nl80211: do a struct assignment to radar_chandef instead of memcpy()
We are copying one entire structure to another of the same type in
nl80211_notify_radar_detection, so it's simpler and safer to do a
struct assignment instead of memcpy().

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Alexander Wetzel
622fce8128 mac80211: Fix Extended Key ID auto activation
Only enable Extended Key ID support for drivers which are not supporting
crypto offload and also do not support A-MPDU.

While any driver using SW crypto from mac80211 is generally able to also
support Extended Key ID these drivers are likely to mix keyIDs in
AMPDUs when rekeying.

According to IEEE 802.11-2016 "9.7.3 A-MPDU contents" this is not
allowed.

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
[reword comment a bit, move ! into logic expression]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Dan Carpenter
5809a5d54b cfg80211: don't pass pointer to pointer unnecessarily
The cfg80211_merge_profile() and ieee802_11_find_bssid_profile() are
a bit cleaner if we just pass the merged_ie pointer instead of a pointer
to the pointer.

This isn't a functional change, it's just a clean up.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Ashok Raj Nagarajan
ba905bf432 mac80211: store tx power value from user to station
This patch introduce a new driver callback drv_sta_set_txpwr. This API will
copy the transmit power value passed from user space and call the driver
callback to set the tx power for the station.

Co-developed-by: Balaji Pothunoori <bpothuno@codeaurora.org>
Signed-off-by: Ashok Raj Nagarajan <arnagara@codeaurora.org>
Signed-off-by: Balaji Pothunoori <bpothuno@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Ashok Raj Nagarajan
e96d1cd263 cfg80211: Add support to set tx power for a station associated
This patch adds support to set transmit power setting type and transmit
power level attributes to NL80211_CMD_SET_STATION in order to facilitate
adjusting the transmit power level of a station associated to the AP.

The added attributes allow selection of automatic and limited transmit
power level, with the level defined in dBm format.

Co-developed-by: Balaji Pothunoori <bpothuno@codeaurora.org>
Signed-off-by: Ashok Raj Nagarajan <arnagara@codeaurora.org>
Signed-off-by: Balaji Pothunoori <bpothuno@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Johannes Berg
dbd50a851c mac80211: only allocate one queue when using iTXQs
There's no need to allocate than one queue in the iTXQs case
now that we no longer use ndo_select_queue to assign the AC.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Gustavo A. R. Silva
391d132cbe nl80211: Use struct_size() in kzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

size = sizeof(struct foo) + count * sizeof(struct boo);
instance = kzalloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL)

Notice that, in this case, variable size_of_regd is not necessary,
hence it is removed.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Gustavo A. R. Silva
9f8c7136e8 cfg80211: Use struct_size() in kzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

size = sizeof(struct foo) + count * sizeof(struct boo);
instance = kzalloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL)

Notice that, in this case, variable size_of_regd is not necessary,
hence it is removed.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Julius Niedworok
276d9e82e0 mac80211: debugfs option to force TX status frames
At Technical University of Munich we use MAC 802.11 TX status frames to
perform several measurements in MAC 802.11 setups.

With ath based drivers this was possible until commit d94a461d7a
("ath9k: use ieee80211_tx_status_noskb where possible") as the driver
ignored the IEEE80211_TX_CTL_REQ_TX_STATUS flag and always delivered
tx_status frames. Since that commit, this behavior was changed and the
driver now adheres to IEEE80211_TX_CTL_REQ_TX_STATUS.

Due to performance reasons, IEEE80211_TX_CTL_REQ_TX_STATUS is not set for
data frames from interfaces in managed mode. Hence, frames that are sent
from a managed mode interface do never deliver tx_status frames. This
remains true even if a monitor mode interface (the measurement interface)
is added to the same ieee80211 physical device. Thus, there is no
possibility for receiving tx_status frames for frames sent on an interface
in managed mode, if the driver adheres to IEEE80211_TX_CTL_REQ_TX_STATUS.

In order to force delivery of tx_status frames for research and debugging
purposes, implement a debugfs option force_tx_status for ieee80211 physical
devices. When this option is set for a physical device,
IEEE80211_TX_CTL_REQ_TX_STATUS is enabled in all packets sent from that
device. This option can be set via
/sys/kernel/debug/ieee80211/<dev>/force_tx_status. The default is disabled.

Co-developed-by: Charlie Groh <ga58taw@mytum.de>
Signed-off-by: Charlie Groh <ga58taw@mytum.de>
Signed-off-by: Julius Niedworok <julius.n@gmx.net>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Colin Ian King
5e28042091 cfg80211: remove redundant zero check on variable 'changed'
The zero check on variable changed is redundant as it must be
between 1 and 3 at the end of the proceeding if statement block.
Remove the redundant check.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Sergey Matyukevich
5dc8cdce1d mac80211/cfg80211: update bss channel on channel switch
FullMAC STAs have no way to update bss channel after CSA channel switch
completion. As a result, user-space tools may provide inconsistent
channel info. For instance, consider the following two commands:
$ sudo iw dev wlan0 link
$ sudo iw dev wlan0 info
The latter command gets channel info from the hardware, so most probably
its output will be correct. However the former command gets channel info
from scan cache, so its output will contain outdated channel info.
In fact, current bss channel info will not be updated until the
next [re-]connect.

Note that mac80211 STAs have a workaround for this, but it requires
access to internal cfg80211 data, see ieee80211_chswitch_work:

	/* XXX: shouldn't really modify cfg80211-owned data! */
	ifmgd->associated->channel = sdata->csa_chandef.chan;

This patch suggests to convert mac80211 workaround into cfg80211 behavior
and to update current bss channel in cfg80211_ch_switch_notify.

Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Felix Fietkau
c2b17948fc mac80211: minstrel_ht: automatically calculate rate duration shift
A per-group shift was added to reduce the size of the per-rate transmit
duration field to u16 without sacrificing a lot of precision
This patch changes the macros to automatically calculate the best value for
this shift based on the lowest rate within the group.
This simplifies adding more groups and slightly improves accuracy for some of
the existing groups.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Felix Fietkau
7f2e12e1bf mac80211: minstrel_ht: add support for rates with 4 spatial streams
This is needed for the upcoming driver for MT7615 4x4 802.11ac chipsets

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Felix Fietkau
1974da8b31 mac80211: when using iTXQ, select the queue in ieee80211_subif_start_xmit
When using iTXQ, the network stack does not need the real queue number, since
mac80211 is using its internal queues anyway. In that case we can defer
selecting the queue and remove a redundant station lookup in the tx path to save
some CPU cycles.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Alexander Wetzel
96fc6efb9a mac80211: IEEE 802.11 Extended Key ID support
Add support for Extended Key ID as defined in IEEE 802.11-2016.

 - Implement the nl80211 API for Extended Key ID
 - Extend mac80211 API to allow drivers to support Extended Key ID
 - Enable Extended Key ID by default for drivers only supporting SW
   crypto (e.g. mac80211_hwsim)
 - Allow unicast Tx usage to be supressed (IEEE80211_KEY_FLAG_NO_AUTO_TX)
 - Select the decryption key based on the MPDU keyid
 - Enforce existing assumptions in the code that rekeys don't change the
   cipher

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
[remove module parameter]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Alexander Wetzel
6cdd3979a2 nl80211/cfg80211: Extended Key ID support
Add support for IEEE 802.11-2016 "Extended Key ID for Individually
Addressed Frames".

Extend cfg80211 and nl80211 to allow pairwise keys to be installed for
Rx only, enable Tx separately and allow Key ID 1 for pairwise keys.

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
[use NLA_POLICY_RANGE() for NL80211_KEY_MODE]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Alexander Wetzel
092c4098f2 mac80211: Optimize tailroom_needed update checks
Optimize/cleanup the delay tailroom checks and adds one missing tailroom
update.

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Felix Fietkau
8dbb000ee7 mac80211: set NETIF_F_LLTX when using intermediate tx queues
When using iTXQ, tx sequence number allocation and statistics are run at
dequeue time. Because of that, it is safe to enable NETIF_F_LLTX, which
allows tx handlers to run on multiple CPUs in parallel.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Felix Fietkau
ded4698b58 mac80211: run late dequeue late tx handlers without holding fq->lock
Reduces lock contention on enqueue/dequeue of iTXQ packets

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Felix Fietkau
f2af2df800 mac80211: calculate hash for fq without holding fq->lock in itxq enqueue
Reduces lock contention on enqueue/dequeue of iTXQ packets

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Felix Fietkau
ef618b1bd6 mac80211: mesh: drop redundant rcu_read_lock/unlock calls
The callers of these functions are all within RCU locked sections

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Sara Sharon
5023b14cf4 mac80211: support profile split between elements
Since an element is limited to 255 octets, a profile may be split
split to several elements. Support the split as defined in the 11ax
draft 3.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Sara Sharon
fe806e4992 cfg80211: support profile split between elements
Since an element is limited to 255 octets, a profile may be split
split to several elements. Support the split as defined in the 11ax
draft 3. Detect legacy split and print a net-rate limited warning,
since there is no ROI in supporting this probably non-existent
split.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Sara Sharon
671042a4fb mac80211: support non-inheritance element
Subelement profile may specify element IDs it doesn't inherit
from the management frame. Support it.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Sara Sharon
f7dacfb114 cfg80211: support non-inheritance element
Subelement profile may specify element IDs it doesn't inherit
from the management frame. Support it.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Sara Sharon
5bd9d1082d cfg80211: don't skip multi-bssid index element
When creating the IEs for the nontransmitted BSS, the index
element is skipped. However, we need to get DTIM values from
it, so don't skip it.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
David S. Miller
89eb6e0967 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2019-04-25

Here's the main bluetooth-next pull request for the 5.2 kernel.

 - Added support for Mediatek SDIO controllers
 - Added support for Broadcom BCM2076B1 UART controller
 - Added support for Marvel SD8987 chipset
 - Fix buffer overflow bug in hidp protocol
 - Various other smaller fixes & improvements

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 01:39:38 -04:00
David S. Miller
8b44836583 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two easy cases of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25 23:52:29 -04:00
Stanislav Fomichev
118c8e9ae6 bpf: support BPF_PROG_QUERY for BPF_FLOW_DISSECTOR attach_type
target_fd is target namespace. If there is a flow dissector BPF program
attached to that namespace, its (single) id is returned.

v5:
* drop net ref right after rcu unlock (Daniel Borkmann)

v4:
* add missing put_net (Jann Horn)

v3:
* add missing inline to skb_flow_dissector_prog_query static def
  (kbuild test robot <lkp@intel.com>)

v2:
* don't sleep in rcu critical section (Jakub Kicinski)
* check input prog_cnt (exit early)

Cc: Jann Horn <jannh@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-25 23:49:06 +02:00
Kimberly Brown
be0d692657 net-sysfs: Replace ktype default_attrs field with groups
The kobj_type default_attrs field is being replaced by the
default_groups field. Replace the default_attrs fields in rx_queue_ktype
and netdev_queue_ktype with default_groups. Use the ATTRIBUTE_GROUPS
macro to create rx_queue_default_groups and netdev_queue_default_groups.

This patch was tested by verifying that the sysfs files for the
attributes in the default groups were created.

Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-25 22:06:11 +02:00
Chuck Lever
2cfd11f16f xprtrdma: Remove stale comment
The comment hasn't been accurate for several years.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:42:04 -04:00
Chuck Lever
b8fe677fd0 xprtrdma: Update comments that reference ib_drain_qp
Commit e1ede312f1 ("xprtrdma: Fix helper that drains the
transport") replaced the ib_drain_qp() call, so update documenting
comments to reflect current operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:40:31 -04:00
Chuck Lever
5f2311f5bd xprtrdma: Remove pr_err() call sites from completion handlers
Clean up: rely on the trace points instead.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:40:09 -04:00
Chuck Lever
86c4ccd9b9 xprtrdma: Eliminate struct rpcrdma_create_data_internal
Clean up.

Move the remaining field in rpcrdma_create_data_internal so the
structure can be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:39:40 -04:00
Chuck Lever
94087e978e xprtrdma: Aggregate the inline settings in struct rpcrdma_ep
Clean up.

The inline settings are actually a characteristic of the endpoint,
and not related to the device. They are also modified after the
transport instance is created, so they do not belong in the cdata
structure either.

Lastly, let's use names that are more natural to RDMA than to NFS:
inline_write -> inline_send and inline_read -> inline_recv. The
/proc files retain their names to avoid breaking user space.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:39:16 -04:00
Chuck Lever
fd5951742d xprtrdma: Remove rpcrdma_create_data_internal::rsize and wsize
Clean up.

xprt_rdma_max_inline_{read,write} cannot be set to large values
by virtue of proc_dointvec_minmax. The current maximum is
RPCRDMA_MAX_INLINE, which is much smaller than RPCRDMA_MAX_SEGS *
PAGE_SIZE.

The .rsize and .wsize fields are otherwise unused in the current
code base, and thus can be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:38:30 -04:00
Chuck Lever
1f7d1c73c5 SUNRPC: Update comments based on recent changes
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:38:02 -04:00
Chuck Lever
f19bd0bbd3 xprtrdma: Eliminate rpcrdma_ia::ri_device
Clean up.

Since commit 54cbd6b0c6 ("xprtrdma: Delay DMA mapping Send and
Receive buffers"), a pointer to the device is now saved in each
regbuf when it is DMA mapped.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:21:18 -04:00
Chuck Lever
c209e49cea xprtrdma: More Send completion batching
Instead of using a fixed number, allow the amount of Send completion
batching to vary based on the client's maximum credit limit.

- A larger default gives a small boost to IOPS throughput

- Reducing it based on max_requests gives a safe result when the
  max credit limit is cranked down (eg. when the device has a small
  max_qp_wr).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:20:58 -04:00
Chuck Lever
dbcc53a52d xprtrdma: Clean up sendctx functions
Minor clean-ups I've stumbled on since sendctx was merged last year.
In particular, making Send completion processing more efficient
appears to have a measurable impact on IOPS throughput.

Note: test_and_clear_bit() returns a value, thus an explicit memory
barrier is not necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:20:21 -04:00
Chuck Lever
17e4c443c0 xprtrdma: Trace marshaling failures
Record an event when rpcrdma_marshal_req returns a non-zero return
value to help track down why an xprt close might have occurred.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:19:56 -04:00
Chuck Lever
4ba02e8d0e xprtrdma: Increase maximum number of backchannel requests
Reflects the change introduced in commit 067c469671 ("NFSv4.1:
Bump the default callback session slot count to 16").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:19:26 -04:00
Chuck Lever
3f9c7e7693 xprtrdma: Backchannel can use GFP_KERNEL allocations
The Receive handler runs in process context, thus can use on-demand
GFP_KERNEL allocations instead of pre-allocation.

This makes the xprtrdma backchannel independent of the number of
backchannel session slots provisioned by the Upper Layer protocol.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:18:58 -04:00
Chuck Lever
d2832af38d xprtrdma: Clean up regbuf helpers
For code legibility, clean up the function names to be consistent
with the pattern: "rpcrdma" _ object-type _ action

Also rpcrdma_regbuf_alloc and rpcrdma_regbuf_free no longer have any
callers outside of verbs.c, and can thus be made static.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:18:41 -04:00
Chuck Lever
0f665ceb71 xprtrdma: De-duplicate "allocate new, free old regbuf"
Clean up by providing an API to do this common task.

At this point, the difference between rpcrdma_get_sendbuf and
rpcrdma_get_recvbuf has become tiny. These can be collapsed into a
single helper.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:02:32 -04:00
Chuck Lever
bb93a1ae2b xprtrdma: Allocate req's regbufs at xprt create time
Allocating an rpcrdma_req's regbufs at xprt create time enables
a pair of micro-optimizations:

First, if these regbufs are always there, we can eliminate two
conditional branches from the hot xprt_rdma_allocate path.

Second, by allocating a 1KB buffer, it places a lower bound on the
size of these buffers, without adding yet another conditional
branch. The lower bound reduces the number of hardway re-
allocations. In fact, for some workloads it completely eliminates
hardway allocations.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:02:11 -04:00
Chuck Lever
8cec3dba76 xprtrdma: rpcrdma_regbuf alignment
Allocate the struct rpcrdma_regbuf separately from the I/O buffer
to better guarantee the alignment of the I/O buffer and eliminate
the wasted space between the rpcrdma_regbuf metadata and the buffer
itself.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:01:27 -04:00
Chuck Lever
23146500b3 xprtrdma: Clean up rpcrdma_create_rep() and rpcrdma_destroy_rep()
For code legibility, clean up the function names to be consistent
with the pattern: "rpcrdma" _ object-type _ action

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 15:00:06 -04:00
Chuck Lever
1769e6a816 xprtrdma: Clean up rpcrdma_create_req()
Eventually, I'd like to invoke rpcrdma_create_req() during the
call_reserve step. Memory allocation there probably needs to use
GFP_NOIO. Therefore a set of GFP flags needs to be passed in.

As an additional clean up, just return a pointer or NULL, because
the only error return code here is -ENOMEM.

Lastly, clean up the function names to be consistent with the
pattern: "rpcrdma" _ object-type _ action

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:59:44 -04:00
Chuck Lever
b2ca473b92 xprtrdma: Fix an frwr_map recovery nit
After a DMA map failure in frwr_map, mark the MR so that recycling
won't attempt to DMA unmap it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: e2f34e2671 ("xprtrdma: Yet another double DMA-unmap")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:59:25 -04:00
Chuck Lever
52db6f9a0c SUNRPC: Avoid digging into the ATOMIC pool
Page allocation requests made when the SPARSE_PAGES flag is set are
allowed to fail, and are not critical. No need to spend a rare
resource.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:58:35 -04:00
Trond Myklebust
ae6ec91847 SUNRPC: Add the 'softerr' rpc_client flag
Add the 'softerr' rpc client flag that sets the RPC_TASK_TIMEOUT
flag on all new rpc tasks that are attached to that rpc client.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
0729d995f2 SUNRPC: Ensure to ratelimit the "server not responding" syslog messages
In particular, the timeout messages can be very noisy, so we ought to
ratelimit them in order to avoid spamming the syslog.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:13 -04:00
Trond Myklebust
da953063bd SUNRPC: Start the first major timeout calculation at task creation
When calculating the major timeout for a new task, when we know that the
connection has been broken, use the task->tk_start to ensure that we also
take into account the time spent waiting for a slot or session slot. This
ensures that we fail over soft requests relatively quickly once the
connection has actually been broken, and the first requests have
started to fail.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:13 -04:00
Trond Myklebust
e4ec48d3cc SUNRPC: Make "no retrans timeout" soft tasks behave like softconn for timeouts
If a soft NFSv4 request is sent, then we don't need it to time out unless
the connection breaks. The reason is that as long as the connection is
unbroken, the protocol states that the server is not allowed to drop the
request.  IOW: as long as the connection remains unbroken, the client may
assume that all transmitted RPC requests are being processed by the server,
and that retransmissions and timeouts of those requests are unwarranted.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:13 -04:00
Trond Myklebust
5ad64b36dd SUNRPC: Add tracking of RPC level errors
Add variables to track RPC level errors so that we can distinguish
between issue that arose in the RPC transport layer as opposed to
those arising from the reply message.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:13 -04:00
Trond Myklebust
9e910bff74 SUNRPC: Ensure that the transport layer respect major timeouts
Ensure that when in the transport layer, we don't sleep past
a major timeout.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:13 -04:00
Trond Myklebust
431235818b SUNRPC: Declare RPC timers as TIMER_DEFERRABLE
Don't wake idle CPUs only for the purpose of servicing an RPC
queue timeout.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:13 -04:00
Trond Myklebust
24a9d9a21e SUNRPC: Simplify queue timeouts using timer_reduce()
Simplify the setting of queue timeouts by using the timer_reduce()
function.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:13 -04:00
Trond Myklebust
5efd1876e6 SUNRPC: Fix up tracking of timeouts
Add a helper to ensure that debugfs and friends print out the
correct current task timeout value.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:13 -04:00
Trond Myklebust
6b2e685627 SUNRPC: Add function rpc_sleep_on_timeout()
Clean up the RPC task sleep interfaces by replacing the task->tk_timeout
'hidden parameter' to rpc_sleep_on() with a new function that takes an
absolute timeout.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:13 -04:00
Trond Myklebust
8357a9b60f SUNRPC: Remove unused argument 'action' from rpc_sleep_on_priority()
None of the callers set the 'action' argument, so let's just remove it.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:12 -04:00
Trond Myklebust
87150aaed9 SUNRPC: Refactor rpc_sleep_on()
rpc_sleep_on() does not need to set the task->tk_callback under the
queue lock, so move that out.
Also refactor the check for whether the task is active.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:12 -04:00
Trond Myklebust
8ba6a92d01 SUNRPC: Refactor xprt_request_wait_receive()
Convert the transport callback to actually put the request to sleep
instead of just setting a timeout. This is in preparation for
rpc_sleep_on_timeout().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:12 -04:00
Trond Myklebust
9e6fa0bb84 SUNRPC: Refactor rpc_restart_call/rpc_restart_call_prepare
Clean up.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:12 -04:00
Trond Myklebust
ae67bd3821 SUNRPC: Fix up task signalling
The RPC_TASK_KILLED flag should really not be set from another context
because it can clobber data in the struct task when task->tk_flags is
changed non-atomically.
Let's therefore swap out RPC_TASK_KILLED with an atomic flag, and add
a function to set that flag and safely wake up the task.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:12 -04:00
Marcel Holtmann
d5bb334a8e Bluetooth: Align minimum encryption key size for LE and BR/EDR connections
The minimum encryption key size for LE connections is 56 bits and to
align LE with BR/EDR, enforce 56 bits of minimum encryption key size for
BR/EDR connections as well.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Cc: stable@vger.kernel.org
2019-04-25 06:18:07 -07:00
Eric Biggers
877b5691f2 crypto: shash - remove shash_desc::flags
The flags field in 'struct shash_desc' never actually does anything.
The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP.
However, no shash algorithm ever sleeps, making this flag a no-op.

With this being the case, inevitably some users who can't sleep wrongly
pass MAY_SLEEP.  These would all need to be fixed if any shash algorithm
actually started sleeping.  For example, the shash_ahash_*() functions,
which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP
from the ahash API to the shash API.  However, the shash functions are
called under kmap_atomic(), so actually they're assumed to never sleep.

Even if it turns out that some users do need preemption points while
hashing large buffers, we could easily provide a helper function
crypto_shash_update_large() which divides the data into smaller chunks
and calls crypto_shash_update() and cond_resched() for each chunk.  It's
not necessary to have a flag in 'struct shash_desc', nor is it necessary
to make individual shash algorithms aware of this at all.

Therefore, remove shash_desc::flags, and document that the
crypto_shash_*() functions can be called from any context.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-04-25 15:38:12 +08:00
Linus Torvalds
cd8dead0c3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Just the usual assortment of small'ish fixes:

   1) Conntrack timeout is sometimes not initialized properly, from
      Alexander Potapenko.

   2) Add a reasonable range limit to tcp_min_rtt_wlen to avoid
      undefined behavior. From ZhangXiaoxu.

   3) des1 field of descriptor in stmmac driver is initialized with the
      wrong variable. From Yue Haibing.

   4) Increase mlxsw pci sw reset timeout a little bit more, from Ido
      Schimmel.

   5) Match IOT2000 stmmac devices more accurately, from Su Bao Cheng.

   6) Fallback refcount fix in TLS code, from Jakub Kicinski.

   7) Fix max MTU check when using XDP in mlx5, from Maxim Mikityanskiy.

   8) Fix recursive locking in team driver, from Hangbin Liu.

   9) Fix tls_set_device_offload_Rx() deadlock, from Jakub Kicinski.

  10) Don't use napi_alloc_frag() outside of softiq context of socionext
      driver, from Ilias Apalodimas.

  11) MAC address increment overflow in ncsi, from Tao Ren.

  12) Fix a regression in 8K/1M pool switching of RDS, from Zhu Yanjun.

  13) ipv4_link_failure has to validate the headers that are actually
      there because RAW sockets can pass in arbitrary garbage, from Eric
      Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
  ipv4: add sanity checks in ipv4_link_failure()
  net/rose: fix unbound loop in rose_loopback_timer()
  rxrpc: fix race condition in rxrpc_input_packet()
  net: rds: exchange of 8K and 1M pool
  net: vrf: Fix operation not supported when set vrf mac
  net/ncsi: handle overflow when incrementing mac address
  net: socionext: replace napi_alloc_frag with the netdev variant on init
  net: atheros: fix spelling mistake "underun" -> "underrun"
  spi: ST ST95HF NFC: declare missing of table
  spi: Micrel eth switch: declare missing of table
  net: stmmac: move stmmac_check_ether_addr() to driver probe
  netfilter: fix nf_l4proto_log_invalid to log invalid packets
  netfilter: never get/set skb->tstamp
  netfilter: ebtables: CONFIG_COMPAT: drop a bogus WARN_ON
  Documentation: decnet: remove reference to CONFIG_DECNET_ROUTE_FWMARK
  dt-bindings: add an explanation for internal phy-mode
  net/tls: don't leak IV and record seq when offload fails
  net/tls: avoid potential deadlock in tls_set_device_offload_rx()
  selftests/net: correct the return value for run_afpackettests
  team: fix possible recursive locking when add slaves
  ...
2019-04-24 16:18:59 -07:00
Eric Dumazet
20ff83f10f ipv4: add sanity checks in ipv4_link_failure()
Before calling __ip_options_compile(), we need to ensure the network
header is a an IPv4 one, and that it is already pulled in skb->head.

RAW sockets going through a tunnel can end up calling ipv4_link_failure()
with total garbage in the skb, or arbitrary lengthes.

syzbot report :

BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:355 [inline]
BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
Write of size 69 at addr ffff888096abf068 by task syz-executor.4/9204

CPU: 0 PID: 9204 Comm: syz-executor.4 Not tainted 5.1.0-rc5+ #77
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 check_memory_region_inline mm/kasan/generic.c:185 [inline]
 check_memory_region+0x123/0x190 mm/kasan/generic.c:191
 memcpy+0x38/0x50 mm/kasan/common.c:133
 memcpy include/linux/string.h:355 [inline]
 __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
 __icmp_send+0x725/0x1400 net/ipv4/icmp.c:695
 ipv4_link_failure+0x29f/0x550 net/ipv4/route.c:1204
 dst_link_failure include/net/dst.h:427 [inline]
 vti6_xmit net/ipv6/ip6_vti.c:514 [inline]
 vti6_tnl_xmit+0x10d4/0x1c0c net/ipv6/ip6_vti.c:553
 __netdev_start_xmit include/linux/netdevice.h:4414 [inline]
 netdev_start_xmit include/linux/netdevice.h:4423 [inline]
 xmit_one net/core/dev.c:3292 [inline]
 dev_hard_start_xmit+0x1b2/0x980 net/core/dev.c:3308
 __dev_queue_xmit+0x271d/0x3060 net/core/dev.c:3878
 dev_queue_xmit+0x18/0x20 net/core/dev.c:3911
 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1527
 neigh_output include/net/neighbour.h:508 [inline]
 ip_finish_output2+0x949/0x1740 net/ipv4/ip_output.c:229
 ip_finish_output+0x73c/0xd50 net/ipv4/ip_output.c:317
 NF_HOOK_COND include/linux/netfilter.h:278 [inline]
 ip_output+0x21f/0x670 net/ipv4/ip_output.c:405
 dst_output include/net/dst.h:444 [inline]
 NF_HOOK include/linux/netfilter.h:289 [inline]
 raw_send_hdrinc net/ipv4/raw.c:432 [inline]
 raw_sendmsg+0x1d2b/0x2f20 net/ipv4/raw.c:663
 inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg+0xdd/0x130 net/socket.c:661
 sock_write_iter+0x27c/0x3e0 net/socket.c:988
 call_write_iter include/linux/fs.h:1866 [inline]
 new_sync_write+0x4c7/0x760 fs/read_write.c:474
 __vfs_write+0xe4/0x110 fs/read_write.c:487
 vfs_write+0x20c/0x580 fs/read_write.c:549
 ksys_write+0x14f/0x2d0 fs/read_write.c:599
 __do_sys_write fs/read_write.c:611 [inline]
 __se_sys_write fs/read_write.c:608 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:608
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x458c29
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f293b44bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458c29
RDX: 0000000000000014 RSI: 00000000200002c0 RDI: 0000000000000003
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f293b44c6d4
R13: 00000000004c8623 R14: 00000000004ded68 R15: 00000000ffffffff

The buggy address belongs to the page:
page:ffffea00025aafc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x1fffc0000000000()
raw: 01fffc0000000000 0000000000000000 ffffffff025a0101 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888096abef80: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
 ffff888096abf000: f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00
>ffff888096abf080: 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
                         ^
 ffff888096abf100: 00 00 00 00 f1 f1 f1 f1 00 00 f3 f3 00 00 00 00
 ffff888096abf180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Fixes: ed0de45a10 ("ipv4: recompile ip options in ipv4_link_failure")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-24 14:40:41 -07:00
Eric Dumazet
0453c68245 net/rose: fix unbound loop in rose_loopback_timer()
This patch adds a limit on the number of skbs that fuzzers can queue
into loopback_queue. 1000 packets for rose loopback seems more than enough.

Then, since we now have multiple cpus in most linux hosts,
we also need to limit the number of skbs rose_loopback_timer()
can dequeue at each round.

rose_loopback_queue() can be drop-monitor friendly, calling
consume_skb() or kfree_skb() appropriately.

Finally, use mod_timer() instead of del_timer() + add_timer()

syzbot report was :

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu:    0-...!: (10499 ticks this GP) idle=536/1/0x4000000000000002 softirq=103291/103291 fqs=34
rcu:     (t=10500 jiffies g=140321 q=323)
rcu: rcu_preempt kthread starved for 10426 jiffies! g140321 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
rcu: RCU grace-period kthread stack dump:
rcu_preempt     I29168    10      2 0x80000000
Call Trace:
 context_switch kernel/sched/core.c:2877 [inline]
 __schedule+0x813/0x1cc0 kernel/sched/core.c:3518
 schedule+0x92/0x180 kernel/sched/core.c:3562
 schedule_timeout+0x4db/0xfd0 kernel/time/timer.c:1803
 rcu_gp_fqs_loop kernel/rcu/tree.c:1971 [inline]
 rcu_gp_kthread+0x962/0x17b0 kernel/rcu/tree.c:2128
 kthread+0x357/0x430 kernel/kthread.c:253
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
NMI backtrace for cpu 0
CPU: 0 PID: 7632 Comm: kworker/0:4 Not tainted 5.1.0-rc5+ #172
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events iterate_cleanup_work
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101
 nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62
 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
 trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
 rcu_dump_cpu_stacks+0x183/0x1cf kernel/rcu/tree.c:1223
 print_cpu_stall kernel/rcu/tree.c:1360 [inline]
 check_cpu_stall kernel/rcu/tree.c:1434 [inline]
 rcu_pending kernel/rcu/tree.c:3103 [inline]
 rcu_sched_clock_irq.cold+0x500/0xa4a kernel/rcu/tree.c:2544
 update_process_times+0x32/0x80 kernel/time/timer.c:1635
 tick_sched_handle+0xa2/0x190 kernel/time/tick-sched.c:161
 tick_sched_timer+0x47/0x130 kernel/time/tick-sched.c:1271
 __run_hrtimer kernel/time/hrtimer.c:1389 [inline]
 __hrtimer_run_queues+0x33e/0xde0 kernel/time/hrtimer.c:1451
 hrtimer_interrupt+0x314/0x770 kernel/time/hrtimer.c:1509
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1035 [inline]
 smp_apic_timer_interrupt+0x120/0x570 arch/x86/kernel/apic/apic.c:1060
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807
RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x50 kernel/kcov.c:95
Code: 89 25 b4 6e ec 08 41 bc f4 ff ff ff e8 cd 5d ea ff 48 c7 05 9e 6e ec 08 00 00 00 00 e9 a4 e9 ff ff 90 90 90 90 90 90 90 90 90 <55> 48 89 e5 48 8b 75 08 65 48 8b 04 25 00 ee 01 00 65 8b 15 c8 60
RSP: 0018:ffff8880ae807ce0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
RAX: ffff88806fd40640 RBX: dffffc0000000000 RCX: ffffffff863fbc56
RDX: 0000000000000100 RSI: ffffffff863fbc1d RDI: ffff88808cf94228
RBP: ffff8880ae807d10 R08: ffff88806fd40640 R09: ffffed1015d00f8b
R10: ffffed1015d00f8a R11: 0000000000000003 R12: ffff88808cf941c0
R13: 00000000fffff034 R14: ffff8882166cd840 R15: 0000000000000000
 rose_loopback_timer+0x30d/0x3f0 net/rose/rose_loopback.c:91
 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
 expire_timers kernel/time/timer.c:1362 [inline]
 __run_timers kernel/time/timer.c:1681 [inline]
 __run_timers kernel/time/timer.c:1649 [inline]
 run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
 __do_softirq+0x266/0x95a kernel/softirq.c:293
 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1027

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-24 14:39:26 -07:00
Eric Dumazet
032be5f19a rxrpc: fix race condition in rxrpc_input_packet()
After commit 5271953cad ("rxrpc: Use the UDP encap_rcv hook"),
rxrpc_input_packet() is directly called from lockless UDP receive
path, under rcu_read_lock() protection.

It must therefore use RCU rules :

- udp_sk->sk_user_data can be cleared at any point in this function.
  rcu_dereference_sk_user_data() is what we need here.

- Also, since sk_user_data might have been set in rxrpc_open_socket()
  we must observe a proper RCU grace period before kfree(local) in
  rxrpc_lookup_local()

v4: @local can be NULL in xrpc_lookup_local() as reported by kbuild test robot <lkp@intel.com>
        and Julia Lawall <julia.lawall@lip6.fr>, thanks !

v3,v2 : addressed David Howells feedback, thanks !

syzbot reported :

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 19236 Comm: syz-executor703 Not tainted 5.1.0-rc6 #79
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__lock_acquire+0xbef/0x3fb0 kernel/locking/lockdep.c:3573
Code: 00 0f 85 a5 1f 00 00 48 81 c4 10 01 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 4a 21 00 00 49 81 7d 00 20 54 9c 89 0f 84 cf f4
RSP: 0018:ffff88809d7aef58 EFLAGS: 00010002
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000026 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88809d7af090 R08: 0000000000000001 R09: 0000000000000001
R10: ffffed1015d05bc7 R11: ffff888089428600 R12: 0000000000000000
R13: 0000000000000130 R14: 0000000000000001 R15: 0000000000000001
FS:  00007f059044d700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004b6040 CR3: 00000000955ca000 CR4: 00000000001406f0
Call Trace:
 lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:4211
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:152
 skb_queue_tail+0x26/0x150 net/core/skbuff.c:2972
 rxrpc_reject_packet net/rxrpc/input.c:1126 [inline]
 rxrpc_input_packet+0x4a0/0x5536 net/rxrpc/input.c:1414
 udp_queue_rcv_one_skb+0xaf2/0x1780 net/ipv4/udp.c:2011
 udp_queue_rcv_skb+0x128/0x730 net/ipv4/udp.c:2085
 udp_unicast_rcv_skb.isra.0+0xb9/0x360 net/ipv4/udp.c:2245
 __udp4_lib_rcv+0x701/0x2ca0 net/ipv4/udp.c:2301
 udp_rcv+0x22/0x30 net/ipv4/udp.c:2482
 ip_protocol_deliver_rcu+0x60/0x8f0 net/ipv4/ip_input.c:208
 ip_local_deliver_finish+0x23b/0x390 net/ipv4/ip_input.c:234
 NF_HOOK include/linux/netfilter.h:289 [inline]
 NF_HOOK include/linux/netfilter.h:283 [inline]
 ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:255
 dst_input include/net/dst.h:450 [inline]
 ip_rcv_finish+0x1e1/0x300 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:289 [inline]
 NF_HOOK include/linux/netfilter.h:283 [inline]
 ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0x115/0x1a0 net/core/dev.c:4987
 __netif_receive_skb+0x2c/0x1c0 net/core/dev.c:5099
 netif_receive_skb_internal+0x117/0x660 net/core/dev.c:5202
 napi_frags_finish net/core/dev.c:5769 [inline]
 napi_gro_frags+0xade/0xd10 net/core/dev.c:5843
 tun_get_user+0x2f24/0x3fb0 drivers/net/tun.c:1981
 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2027
 call_write_iter include/linux/fs.h:1866 [inline]
 do_iter_readv_writev+0x5e1/0x8e0 fs/read_write.c:681
 do_iter_write fs/read_write.c:957 [inline]
 do_iter_write+0x184/0x610 fs/read_write.c:938
 vfs_writev+0x1b3/0x2f0 fs/read_write.c:1002
 do_writev+0x15e/0x370 fs/read_write.c:1037
 __do_sys_writev fs/read_write.c:1110 [inline]
 __se_sys_writev fs/read_write.c:1107 [inline]
 __x64_sys_writev+0x75/0xb0 fs/read_write.c:1107
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 5271953cad ("rxrpc: Use the UDP encap_rcv hook")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-24 14:05:09 -07:00
Zhu Yanjun
4b9fc71462 net: rds: exchange of 8K and 1M pool
Before the commit 490ea5967b ("RDS: IB: move FMR code to its own file"),
when the dirty_count is greater than 9/10 of max_items of 8K pool,
1M pool is used, Vice versa. After the commit 490ea5967b ("RDS: IB: move
FMR code to its own file"), the above is removed. When we make the
following tests.

Server:
  rds-stress -r 1.1.1.16 -D 1M

Client:
  rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M

The following will appear.
"
connecting to 1.1.1.16:4000
negotiated options, tasks will start in 2 seconds
Starting up..header from 1.1.1.166:4001 to id 4001 bogus
..
tsks  tx/s  rx/s tx+rx K/s  mbi K/s  mbo K/s tx us/c  rtt us
cpu %
   1    0    0     0.00     0.00     0.00    0.00 0.00 -1.00
   1    0    0     0.00     0.00     0.00    0.00 0.00 -1.00
   1    0    0     0.00     0.00     0.00    0.00 0.00 -1.00
   1    0    0     0.00     0.00     0.00    0.00 0.00 -1.00
   1    0    0     0.00     0.00     0.00    0.00 0.00 -1.00
...
"
So this exchange between 8K and 1M pool is added back.

Fixes: commit 490ea5967b ("RDS: IB: move FMR code to its own file")
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-24 12:07:08 -07:00
Vlad Buslov
c049d56eb2 net: sched: flower: refactor reoffload for concurrent access
Recent changes that introduced unlocked flower did not properly account for
case when reoffload is initiated concurrently with filter updates. To fix
the issue, extend flower with 'hw_filters' list that is used to store
filters that don't have 'skip_hw' flag set. Filter is added to the list
when it is inserted to hardware and only removed from it after being
unoffloaded from all drivers that parent block is attached to. This ensures
that concurrent reoffload can still access filter that is being deleted and
prevents race condition when driver callback can be removed when filter is
no longer accessible trough idr, but is still present in hardware.

Refactor fl_change() to respect new filter reference counter and to release
filter reference with __fl_put() in case of error, instead of directly
deallocating filter memory. This allows for concurrent access to filter
from fl_reoffload() and protects it with reference counting. Refactor
fl_reoffload() to iterate over hw_filters list instead of idr. Implement
fl_get_next_hw_filter() helper function that is used to iterate over
hw_filters list with reference counting and skips filters that are being
concurrently deleted.

Fixes: 9214919006 ("net: sched: flower: set unlocked flag for flower proto ops")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-24 11:53:28 -07:00
Eric Dumazet
4109a2c3b9 tipc: tipc_udp_recv() cleanup vs rcu verbs
First thing tipc_udp_recv() does is to use rcu_dereference_sk_user_data(),
and this is really hinting we already own rcu_read_lock() from the caller
(UDP stack).

No need to add another rcu_read_lock()/rcu_read_unlock() pair.

Also use rcu_dereference() instead of rcu_dereference_rtnl()
in the data path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-24 11:46:54 -07:00
Trond Myklebust
e6667c73a2 SUNRPC: rsi_parse() should use the current user namespace
rsi_parse() is part of a downcall, so we must assume that the uids
and gids are encoded using the current user namespace.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24 09:46:35 -04:00
Trond Myklebust
ccfe51a516 SUNRPC: Fix the server AUTH_UNIX userspace mappings
gid_parse() is part of a downcall, so uids and gids should be assumed
encoded using the current user namespace.
svcauth_unix_accept() is, on the other hand, decoding uids and gids from
the wire, so we assume those are encoded to match the user namespace of
the server process.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24 09:46:35 -04:00
Trond Myklebust
1237d3545c SUNRPC: Temporary sockets should inherit the cred from their parent
Temporary sockets should inherit the credential (and hence the user
namespace) from the parent listener transport.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24 09:46:35 -04:00
Trond Myklebust
4df493a260 SUNRPC: Cache the process user cred in the RPC server listener
In order to be able to interpret uids and gids correctly in knfsd, we
should cache the user namespace of the process that created the RPC
server's listener. To do so, we refcount the credential of that process.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24 09:46:35 -04:00
Trond Myklebust
642ee6b209 SUNRPC: Allow further customisation of RPC program registration
Add a callback to allow customisation of the rpcbind registration.
When clients have the ability to turn on and off version support,
we want to allow them to also prevent registration of those
versions with the rpc portmapper.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24 09:46:35 -04:00
Trond Myklebust
4532608d71 SUNRPC: Clean up generic dispatcher code
Simplify the generic server dispatcher.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24 09:46:34 -04:00
Trond Myklebust
8e5b67731d SUNRPC: Add a callback to initialise server requests
Add a callback to help initialise server requests before they are
processed. This will allow us to clean up the NFS server version
support, and to make it container safe.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24 09:46:34 -04:00
Trond Myklebust
83dd59a0b9 SUNRPC/nfs: Fix return value for nfs4_callback_compound()
RPC server procedures are normally expected to return a __be32 encoded
status value of type 'enum rpc_accept_stat', however at least one function
wants to return an authentication status of type 'enum rpc_auth_stat'
in the case where authentication fails.
This patch adds functionality to allow this.

Fixes: a4e187d83d ("NFS: Don't drop CB requests with invalid principals")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24 09:46:34 -04:00
NeilBrown
9d69338c8c sunrpc/cache: handle missing listeners better.
If no handler (such as rpc.mountd) has opened
a cache 'channel', the sunrpc cache responds to
all lookup requests with -ENOENT.  This is particularly
important for the auth.unix.gid cache which is
optional.

If the channel was open briefly and an upcall was written to it,
this upcall remains pending even when the handler closes the
channel.  When an upcall is pending, the code currently
doesn't check if there are still listeners, it only performs
that check before sending an upcall.

As the cache treads a recently closes channel (closed less than
30 seconds ago) as "potentially still open", there is a
reasonable sized window when a request can become pending
in a closed channel, and thereby block lookups indefinitely.

This can easily be demonstrated by running
  cat /proc/net/rpc/auth.unix.gid/channel

and then trying to mount an NFS filesystem from this host.  It
will block indefinitely (unless mountd is run with --manage-gids,
or krb5 is used).

When cache_check() finds that an upcall is pending, it should
perform the "cache_listeners_exist()" exist test.  If no
listeners do exist, the request should be negated.

With this change in place, there can still be a 30second wait on
mount, until the cache gives up waiting for a handler to come
back, but this is much better than an indefinite wait.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-24 09:46:34 -04:00
Balakrishna Godavarthi
e668eb1e15 Bluetooth: hci_core: Don't stop BT if the BD address missing in dts
When flag HCI_QUIRK_USE_BDADDR_PROPERTY is set, we will read the
bluetooth address from dts. If the bluetooth address node is missing
from the dts we will enable it controller UNCONFIGURED state.
This patch enables the normal flow even if the BD address is missing
from the dts tree.

Signed-off-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
Tested-by: Harish Bandi <c-hbandi@codeaurora.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-04-24 09:12:47 +02:00
David Ahern
a65120bae4 ipv6: Use result arg in fib_lookup_arg consistently
arg.result is sometimes used as fib6_result and sometimes used to
hold the rt6_info. Add rt6_info to fib6_result and make the use
of arg.result consistent through ipv6 rules.

The rt6 entry is filled in for lookups returning a dst_entry, but not
for direct fib_lookups that just want a fib6_info.

Fixes: effda4dd97 ("ipv6: Pass fib6_result to fib lookups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 21:53:11 -07:00
David Ahern
b2f97f7de2 ipv6: fib6_rule_action_alt needs to return -EAGAIN
fib rule actions should return -EAGAIN for the rules to continue to the
next one. A recent change overwrote err with the lookup always returning
0 (future change will make it more like IPv4) which means the rules
stopped at the first (e.g., local table lookup only). Catch and reset err
to -EAGAIN.

Fixes: effda4dd97 ("ipv6: Pass fib6_result to fib lookups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 21:52:33 -07:00
Tao Ren
1c5c12ee30 net/ncsi: handle overflow when incrementing mac address
Previously BMC's MAC address is calculated by simply adding 1 to the
last byte of network controller's MAC address, and it produces incorrect
result when network controller's MAC address ends with 0xFF.

The problem can be fixed by calling eth_addr_inc() function to increment
MAC address; besides, the MAC address is also validated before assigning
to BMC.

Fixes: cb10c7c0df ("net/ncsi: Add NCSI Broadcom OEM command")
Signed-off-by: Tao Ren <taoren@fb.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 21:15:15 -07:00
Andre Guedes
6e734c82be net: sched: taprio: Fix taprio_dequeue()
In case we don't have 'guard' or 'budget' to transmit the skb, we should
continue traversing the qdisc list since the remaining guard/budget
might be enough to transmit a skb from other children qdiscs.

Fixes: 5a781ccbd1 (“tc: Add support for configuring the taprio scheduler”)
Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 19:52:32 -07:00
Andre Guedes
2684d1b75f net: sched: taprio: Fix taprio_peek()
While traversing taprio's children qdisc list, if the gate is closed for
a given traffic class, we should continue traversing the list since the
remaining qdiscs may have skb ready for transmission.

This patch also takes this opportunity and changes the function to use
the TAPRIO_ALL_GATES_OPEN macro instead of the magic number '-1'.

Fixes: 5a781ccbd1 (“tc: Add support for configuring the taprio scheduler”)
Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 19:52:32 -07:00
Andre Guedes
5175aafe71 net: sched: taprio: Remove should_restart_cycle()
The 'entry' argument from should_restart_cycle() cannot be NULL since it
is already checked by the caller so the WARN_ON() within should_
restart_cycle() could be removed.  By doing that, that function becomes
a dummy wrapper on list_is_last() so this patch simply gets rid of it
and call list_is_last() within advance_sched() instead.

Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 19:52:32 -07:00
Andre Guedes
8599099f0c net: sched: taprio: Refactor taprio_get_start_time()
This patch does a code refactoring to taprio_get_start_time() function
to improve readability and report error properly.

If 'base' time is later than 'now', the start time is equal to 'base'
and taprio_get_start_time() is done. That's the natural case so we move
that code to the beginning of the function. Also, if 'cycle' calculation
is zero, something went really wrong with taprio and we should log that
internal error properly.

Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 19:52:32 -07:00
Andre Guedes
59ab87f6eb net: sched: taprio: Remove pointless variable assigment
This patch removes a pointless variable assigment in taprio_change().
The 'err' variable is not used from this assignment to the next one so
this patch removes it.

Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 19:52:32 -07:00
David Ahern
ecc5663cce net: Change nhc_flags to unsigned char
nhc_flags holds the RTNH_F flags for a given nexthop (fib{6}_nh).
All of the RTNH_F_ flags fit in an unsigned char, and since the API to
userspace (rtnh_flags and lower byte of rtm_flags) is 1 byte it can not
grow. Make nhc_flags in fib_nh_common an unsigned char and shrink the
size of the struct by 8, from 56 to 48 bytes.

Update the flags arguments for up netdevice events and fib_nexthop_info
which determines the RTNH_F flags to return on a dump/event. The RTNH_F
flags are passed in the lower byte of rtm_flags which is an unsigned int
so use a temp variable for the flags to fib_nexthop_info and combine
with rtm_flags in the caller.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 19:44:18 -07:00
David Ahern
ffa8ce54be lwtunnel: Pass encap and encap type attributes to lwtunnel_fill_encap
Currently, lwtunnel_fill_encap hardcodes the encap and encap type
attributes as RTA_ENCAP and RTA_ENCAP_TYPE, respectively. The nexthop
objects want to re-use this code but the encap attributes passed to
userspace as NHA_ENCAP and NHA_ENCAP_TYPE. Since that is the only
difference, change lwtunnel_fill_encap to take the attribute type as
an input.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 19:42:29 -07:00
Stephen Rothwell
c98f4822ed net: fix sparc64 compilation of sock_gettstamp
net/core/sock.c: In function 'sock_gettstamp':
net/core/sock.c:3007:23: error: expected '}' before ';' token
    .tv_sec = ts.tv_sec;
                       ^
net/core/sock.c:3011:4: error: expected ')' before 'return'
    return -EFAULT;
    ^~~~~~
net/core/sock.c:3013:2: error: expected expression before '}' token
  }
  ^

Fixes: c7cbdbf29f ("net: rework SIOCGSTAMP ioctl handling")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 18:35:17 -07:00
Eric Dumazet
f05713e091 ipv6: convert fib6_ref to refcount_t
We suspect some issues involving fib6_ref 0 -> 1 transitions might
cause strange syzbot reports.

Lets convert fib6_ref to refcount_t to catch them earlier.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 17:19:48 -07:00
Eric Dumazet
5ea715289a ipv6: broadly use fib6_info_hold() helper
Instead of using atomic_inc(), prefer fib6_info_hold()
so that upcoming refcount_t conversion is simpler.

Only fib6_info_alloc() is using atomic_set() since we
just allocated a new object.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 17:19:48 -07:00
Eric Dumazet
b027055022 ipv6: fib6_info_destroy_rcu() cleanup
We do not need to clear f6i->rt6i_exception_bucket right before
freeing f6i.

Note that f6i->rt6i_exception_bucket is properly protected by
f6i->exception_bucket_flushed being set to one in rt6_flush_exceptions()
under the protection of rt6_exception_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 17:19:48 -07:00
Willem de Bruijn
1b00e0dfe7 bpf: update skb->protocol in bpf_skb_net_grow
Some tunnels, like sit, change the network protocol of packet.
If so, update skb->protocol to match the new type.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-24 01:32:26 +02:00
Linus Torvalds
12a54b150f Fix miscellaneous nfsd bugs, in NFSv4.1 callbacks, NFSv4.1
lock-notification callbacks, NFSv3 readdir encoding, and the
 cache/upcall code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcv3EfAAoJECebzXlCjuG+j8IQAIMJFCpMTSu3lw0X9BjJFwe/
 TELdvH4eXm+xtnz0Rbp4RTmlNgt/ox1ySp+sPQqcDutJcU6ImtJS1JY5BmoSU80p
 2ZTRzVOusSS1dP6EDwnYQlR3rI3ObCdDYtkjfboqDd4UJM0hvuCrdQols3DRZF1/
 gBMUDNk8Sch64JMs8EcpItmNWgZ1QDmL6vbKsYDB8HWfBcFkNOs/Q1OecSD7F9Dh
 k+GojPb4pXRCgN8heeFr2H2mXxJSDOPyyybBSQOtbv58rT7IXTpVvAt+WHpjy917
 ljOSD/EuCY+IvX6oY8xPrly9a3t4Mh4eQQFj1EZGSlNucqDgsi1B421QcxGZSkF4
 WZkTRC5BppairSx4dnW96Nx0BbCysj2dvZ0MOnUa59YRwRbyrnHKCv221SzFBidI
 0slxNBYW7Rw2ViFJ6fBPjsjH0XWM9Hq4wC84kj8XXNeDDh2KzXPtUtUzp3iDpKQg
 jP0UiEYnt6OXWvHOH0hfh52Mds48iIUSRmKL4huDT8Ijnt9KBZxnTunClJt6g1SH
 MsIttRsZ2fLo/M5SlWzvQ7XNoKNn/s49Z3SZjqYEKqL9aMwuMoMcIgolqlh5oFPu
 j3cqprDNXuij+2pl35IuQRbk57G2DZ1maShJ5n8I7l0uBe6DMX7T+4d4rHy75Tu9
 otqz4WYU2Yjnnmv5Kp/H
 =9hKY
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.1-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfixes from Bruce Fields:
 "Fix miscellaneous nfsd bugs, in NFSv4.1 callbacks, NFSv4.1
  lock-notification callbacks, NFSv3 readdir encoding, and the
  cache/upcall code"

* tag 'nfsd-5.1-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: wake blocked file lock waiters before sending callback
  nfsd: wake waiters blocked on file_lock before deleting it
  nfsd: Don't release the callback slot unless it was actually held
  nfsd/nfsd3_proc_readdir: fix buffer count and page pointers
  sunrpc: don't mark uninitialised items as VALID.
2019-04-23 13:40:55 -07:00
Gustavo A. R. Silva
5bec1fb866 Bluetooth: Use struct_size() helper
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

size = sizeof(struct foo) + count * sizeof(struct boo);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

size = struct_size(instance, entry, count);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-04-23 19:31:05 +02:00
Jaganath Kanakkassery
cd9151b618 Bluetooth: Fix incorrect pointer arithmatic in ext_adv_report_evt
In ext_adv_report_event rssi comes before data (not after data as
in legacy adv_report_evt) so "+ 1" is not required in the ptr arithmatic
to point to next report.

Signed-off-by: Jaganath Kanakkassery <jaganath.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-04-23 19:23:20 +02:00
Dan Carpenter
f57c4bbf34 6lowpan: Off by one handling ->nexthdr
NEXTHDR_MAX is 255.  What happens here is that we take a u8 value
"hdr->nexthdr" from the network and then look it up in
lowpan_nexthdr_nhcs[].  The problem is that if hdr->nexthdr is 0xff then
we read one element beyond the end of the array so the array needs to
be one element larger.

Fixes: 92aa7c65d2 ("6lowpan: add generic nhc layer interface")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Acked-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-04-23 19:09:58 +02:00
Young Xiao
a1616a5ac9 Bluetooth: hidp: fix buffer overflow
Struct ca is copied from userspace. It is not checked whether the "name"
field is NULL terminated, which allows local users to obtain potentially
sensitive information from kernel stack memory, via a HIDPCONNADD command.

This vulnerability is similar to CVE-2011-1079.

Signed-off-by: Young Xiao <YangX92@hotmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
2019-04-23 19:04:38 +02:00
Stanislav Fomichev
02ee065836 bpf/flow_dissector: don't adjust nhoff by ETH_HLEN in BPF_PROG_TEST_RUN
Now that we use skb-less flow dissector let's return true nhoff and
thoff. We used to adjust them by ETH_HLEN because that's how it was
done in the skb case. For VLAN tests that looks confusing: nhoff is
pointing to vlan parts :-\

Warning, this is an API change for BPF_PROG_TEST_RUN! Feel free to drop
if you think that it's too late at this point to fix it.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:35 +02:00
Stanislav Fomichev
c43f1255b8 net: pass net_device argument to the eth_get_headlen
Update all users of eth_get_headlen to pass network device, fetch
network namespace from it and pass it down to the flow dissector.
This commit is a noop until administrator inserts BPF flow dissector
program.

Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev
9b52e3f267 flow_dissector: handle no-skb use case
When called without skb, gather all required data from the
__skb_flow_dissect's arguments and use recently introduces
no-skb mode of bpf flow dissector.

Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev
3cbf4ffba5 net: plumb network namespace into __skb_flow_dissect
This new argument will be used in the next patches for the
eth_get_headlen use case. eth_get_headlen calls flow dissector
with only data (without skb) so there is currently no way to
pull attached BPF flow dissector program. With this new argument,
we can amend the callers to explicitly pass network namespace
so we can use attached BPF program.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev
7b8a130432 bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode
Now that we have bpf_flow_dissect which can work on raw data,
use it when doing BPF_PROG_TEST_RUN for flow dissector.

Simplifies bpf_prog_test_run_flow_dissector and allows us to
test no-skb mode.

Note, that previously, with bpf_flow_dissect_skb we used to call
eth_type_trans which pulled L2 (ETH_HLEN) header and we explicitly called
skb_reset_network_header. That means flow_keys->nhoff would be
initialized to 0 (skb_network_offset) in init_flow_keys.
Now we call bpf_flow_dissect with nhoff set to ETH_HLEN and need
to undo it once the dissection is done to preserve the existing behavior.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev
089b19a920 flow_dissector: switch kernel context to struct bpf_flow_dissector
struct bpf_flow_dissector has a small subset of sk_buff fields that
flow dissector BPF program is allowed to access and an optional
pointer to real skb. Real skb is used only in bpf_skb_load_bytes
helper to read non-linear data.

The real motivation for this is to be able to call flow dissector
from eth_get_headlen context where we don't have an skb and need
to dissect raw bytes.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:33 +02:00
Fugang Duan
bbb69b37be Bluetooth: Add return check for L2CAP security level set
Add return check for security level set for socket interface since
stack will check the return value.

Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-04-23 18:09:07 +02:00
Luiz Augusto von Dentz
ba8f5289f7 Bluetooth: Fix not initializing L2CAP tx_credits
l2cap_le_flowctl_init was reseting the tx_credits which works only for
outgoing connection since that set the tx_credits on the response, for
incoming connections that was not the case which leaves the channel
without any credits causing it to be suspended.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.20+
2019-04-23 18:09:07 +02:00
Johannes Berg
5178791474 mac80211: don't attempt to rename ERR_PTR() debugfs dirs
We need to dereference the directory to get its parent to
be able to rename it, so it's clearly not safe to try to
do this with ERR_PTR() pointers. Skip in this case.

It seems that this is most likely what was causing the
report by syzbot, but I'm not entirely sure as it didn't
come with a reproducer this time.

Cc: stable@vger.kernel.org
Reported-by: syzbot+4ece1a28b8f4730547c9@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-23 13:47:05 +02:00
Sriram R
8772eed9a9 cfg80211: Notify previous user request during self managed wiphy registration
Commit c82c06ce43d3("cfg80211: Notify all User Hints To self managed wiphys")
notified all new user hints to self managed wiphy's after device registration.
But it didn't do this for anything other than cell base hints done before
registration.

This needs to be done during wiphy registration of a self managed device also,
so that the previous user settings are retained.

Fixes: c82c06ce43 ("cfg80211: Notify all User Hints To self managed wiphys")
Signed-off-by: Sriram R <srirrama@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-23 13:45:30 +02:00
Bhagavathi Perumal S
f1267cf3c0 mac80211: Fix kernel panic due to use of txq after free
The txq of vif is added to active_txqs list for ATF TXQ scheduling
in the function ieee80211_queue_skb(), but it was not properly removed
before freeing the txq object. It was causing use after free of the txq
objects from the active_txqs list, result was kernel panic
due to invalid memory access.

Fix kernel invalid memory access by properly removing txq object
from active_txqs list before free the object.

Signed-off-by: Bhagavathi Perumal S <bperumal@codeaurora.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-23 13:45:30 +02:00
Florian Westphal
bb9cd077e2 xfrm: remove unneeded export_symbols
None of them have any external callers, make them static.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-23 07:42:20 +02:00
Florian Westphal
c53ac41e37 xfrm: remove decode_session indirection from afinfo_policy
No external dependencies, might as well handle this directly.
xfrm_afinfo_policy is now 40 bytes on x86_64.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-23 07:42:20 +02:00
Florian Westphal
2e8b4aa816 xfrm: remove init_path indirection from afinfo_policy
handle this directly, its only used by ipv6.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-23 07:42:20 +02:00
Florian Westphal
f24ea52873 xfrm: remove tos indirection from afinfo_policy
Only used by ipv4, we can read the fl4 tos value directly instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-23 07:42:20 +02:00
Ido Schimmel
697cd36cda bridge: Fix possible use-after-free when deleting bridge port
When a bridge port is being deleted, do not dereference it later in
br_vlan_port_event() as it can result in a use-after-free [1] if the RCU
callback was executed before invoking the function.

[1]
[  129.638551] ==================================================================
[  129.646904] BUG: KASAN: use-after-free in br_vlan_port_event+0x53c/0x5fd
[  129.654406] Read of size 8 at addr ffff8881e4aa1ae8 by task ip/483
[  129.663008] CPU: 0 PID: 483 Comm: ip Not tainted 5.1.0-rc5-custom-02265-ga946bd73daac #1383
[  129.672359] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[  129.682484] Call Trace:
[  129.685242]  dump_stack+0xa9/0x10e
[  129.689068]  print_address_description.cold.2+0x9/0x25e
[  129.694930]  kasan_report.cold.3+0x78/0x9d
[  129.704420]  br_vlan_port_event+0x53c/0x5fd
[  129.728300]  br_device_event+0x2c7/0x7a0
[  129.741505]  notifier_call_chain+0xb5/0x1c0
[  129.746202]  rollback_registered_many+0x895/0xe90
[  129.793119]  unregister_netdevice_many+0x48/0x210
[  129.803384]  rtnl_delete_link+0xe1/0x140
[  129.815906]  rtnl_dellink+0x2a3/0x820
[  129.844166]  rtnetlink_rcv_msg+0x397/0x910
[  129.868517]  netlink_rcv_skb+0x137/0x3a0
[  129.882013]  netlink_unicast+0x49b/0x660
[  129.900019]  netlink_sendmsg+0x755/0xc90
[  129.915758]  ___sys_sendmsg+0x761/0x8e0
[  129.966315]  __sys_sendmsg+0xf0/0x1c0
[  129.988918]  do_syscall_64+0xa4/0x470
[  129.993032]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  129.998696] RIP: 0033:0x7ff578104b58
...
[  130.073811] Allocated by task 479:
[  130.077633]  __kasan_kmalloc.constprop.5+0xc1/0xd0
[  130.083008]  kmem_cache_alloc_trace+0x152/0x320
[  130.088090]  br_add_if+0x39c/0x1580
[  130.092005]  do_set_master+0x1aa/0x210
[  130.096211]  do_setlink+0x985/0x3100
[  130.100224]  __rtnl_newlink+0xc52/0x1380
[  130.104625]  rtnl_newlink+0x6b/0xa0
[  130.108541]  rtnetlink_rcv_msg+0x397/0x910
[  130.113136]  netlink_rcv_skb+0x137/0x3a0
[  130.117538]  netlink_unicast+0x49b/0x660
[  130.121939]  netlink_sendmsg+0x755/0xc90
[  130.126340]  ___sys_sendmsg+0x761/0x8e0
[  130.130645]  __sys_sendmsg+0xf0/0x1c0
[  130.134753]  do_syscall_64+0xa4/0x470
[  130.138864]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

[  130.146195] Freed by task 0:
[  130.149421]  __kasan_slab_free+0x125/0x170
[  130.154016]  kfree+0xf3/0x310
[  130.157349]  kobject_put+0x1a8/0x4c0
[  130.161363]  rcu_core+0x859/0x19b0
[  130.165175]  __do_softirq+0x250/0xa26
[  130.170956] The buggy address belongs to the object at ffff8881e4aa1ae8
                which belongs to the cache kmalloc-1k of size 1024
[  130.184972] The buggy address is located 0 bytes inside of
                1024-byte region [ffff8881e4aa1ae8, ffff8881e4aa1ee8)

Fixes: 9c0ec2e718 ("bridge: support binding vlan dev link state to vlan member bridge ports")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Cc: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Mike Manning <mmanning@vyatta.att-mail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:17:47 -07:00
Ido Schimmel
f2ad1a522e net: devlink: Add extack to shared buffer operations
Add extack to shared buffer set operations, so that meaningful error
messages could be propagated to the user.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Paul Gortmaker
15253b4a71 net: strparser: make it explicitly non-modular
The Kconfig currently controlling compilation of this code is:

net/strparser/Kconfig:config STREAM_PARSER
net/strparser/Kconfig:  def_bool n

...meaning that it currently is not being built as a module by anyone.

Lets remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.

Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.  For
clarity, we change the fcn name mod_init to dev_init at the same time.

We replace module.h with init.h and export.h ; the latter since this
file exports some syms.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:50:54 -07:00
Paul Gortmaker
3557b3fdee net: bpfilter: dont use module_init in non-modular code
The Kconfig controlling this code is:

bpfilter/Kconfig:menuconfig BPFILTER
bpfilter/Kconfig:   bool "BPF based packet filtering framework (BPFILTER)"

Since it isn't a module, we shouldn't use module_init().  Instead we
use device_initcall() - which is exactly what module_init() defaults
to for non-modular code/builds.

We don't remove <linux/module.h> from the includes since this file does
a request_module() and hence is a valid user of that header file, even
though it is not modular itself.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:50:54 -07:00
Paul Gortmaker
9628495d50 cgroup: net: remove left over MODULE_LICENSE tag
The Kconfig currently controlling compilation of this code is:

net/Kconfig:config CGROUP_NET_PRIO
net/Kconfig:    bool "Network priority cgroup"

...meaning that it currently is not being built as a module by anyone,
as module support was discontinued in 2014.

We delete the MODULE_LICENSE tag since all that information is already
contained at the top of the file in the comments.

We don't delete module.h from the includes since it was no longer there
to begin with.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Rosen, Rami" <rami.rosen@intel.com>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:50:54 -07:00
David Ahern
3c618c1dbb net: Rename net/nexthop.h net/rtnh.h
The header contains rtnh_ macros so rename the file accordingly.
Allows a later patch to use the nexthop.h name for the new
nexthop code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:47:25 -07:00
David S. Miller
2843ba2ec7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-04-22

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) allow stack/queue helpers from more bpf program types, from Alban.

2) allow parallel verification of root bpf programs, from Alexei.

3) introduce bpf sysctl hook for trusted root cases, from Andrey.

4) recognize var/datasec in btf deduplication, from Andrii.

5) cpumap performance optimizations, from Jesper.

6) verifier prep for alu32 optimization, from Jiong.

7) libbpf xsk cleanup, from Magnus.

8) other various fixes and cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:35:55 -07:00
David S. Miller
acced9d2b4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree:

1) Add a selftest for icmp packet too big errors with conntrack, from
   Florian Westphal.

2) Validate inner header in ICMP error message does not lie to us
   in conntrack, also from Florian.

3) Initialize ct->timeout to calm down KASAN, from Alexander Potapenko.

4) Skip ICMP error messages from tunnels in IPVS, from Julian Anastasov.

5) Use a hash to expose conntrack and expectation ID, from Florian Westphal.

6) Prevent shift wrap in nft_chain_parse_hook(), from Dan Carpenter.

7) Fix broken ICMP ID randomization with NAT, also from Florian.

8) Remove WARN_ON in ebtables compat that is reached via syzkaller,
   from Florian Westphal.

9) Fix broken timestamps since fb420d5d91 ("tcp/fq: move back to
   CLOCK_MONOTONIC"), from Florian.

10) Fix logging of invalid packets in conntrack, from Andrei Vagin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:23:55 -07:00
Andrei Vagin
d48668052b netfilter: fix nf_l4proto_log_invalid to log invalid packets
It doesn't log a packet if sysctl_log_invalid isn't equal to protonum
OR sysctl_log_invalid isn't equal to IPPROTO_RAW. This sentence is
always true. I believe we need to replace OR to AND.

Cc: Florian Westphal <fw@strlen.de>
Fixes: c4f3db1595 ("netfilter: conntrack: add and use nf_l4proto_log_invalid")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-22 10:38:50 +02:00
Florian Westphal
916f6efae6 netfilter: never get/set skb->tstamp
setting net.netfilter.nf_conntrack_timestamp=1 breaks xmit with fq
scheduler.  skb->tstamp might be "refreshed" using ktime_get_real(),
but fq expects CLOCK_MONOTONIC.

This patch removes all places in netfilter that check/set skb->tstamp:

1. To fix the bogus "start" time seen with conntrack timestamping for
   outgoing packets, never use skb->tstamp and always use current time.
2. In nfqueue and nflog, only use skb->tstamp for incoming packets,
   as determined by current hook (prerouting, input, forward).
3. xt_time has to use system clock as well rather than skb->tstamp.
   We could still use skb->tstamp for prerouting/input/foward, but
   I see no advantage to make this conditional.

Fixes: fb420d5d91 ("tcp/fq: move back to CLOCK_MONOTONIC")
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-22 10:34:30 +02:00
Florian Westphal
7caa56f006 netfilter: ebtables: CONFIG_COMPAT: drop a bogus WARN_ON
It means userspace gave us a ruleset where there is some other
data after the ebtables target but before the beginning of the next rule.

Fixes: 81e675c227 ("netfilter: ebtables: add CONFIG_COMPAT support")
Reported-by: syzbot+659574e7bcc7f7eb4df7@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-22 10:34:30 +02:00
Fuqian Huang
966cddef20 net: ax25: fix misuse of %x
Pointers should be printed with %p or %px rather than
cast to long type and printed with %8.8lx.
Change %8.8lx to %p to print the pointer.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-21 10:37:26 -07:00
Jakub Kicinski
12c7686111 net/tls: don't leak IV and record seq when offload fails
When device refuses the offload in tls_set_device_offload_rx()
it calls tls_sw_free_resources_rx() to clean up software context
state.

Unfortunately, tls_sw_free_resources_rx() does not free all
the state tls_set_sw_offload() allocated - it leaks IV and
sequence number buffers.  All other code paths which lead to
tls_sw_release_resources_rx() (which tls_sw_free_resources_rx()
calls) free those right before the call.

Avoid the leak by moving freeing of iv and rec_seq into
tls_sw_release_resources_rx().

Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-20 20:36:51 -07:00
Jakub Kicinski
62ef81d563 net/tls: avoid potential deadlock in tls_set_device_offload_rx()
If device supports offload, but offload fails tls_set_device_offload_rx()
will call tls_sw_free_resources_rx() which (unhelpfully) releases
and reacquires the socket lock.

For a small fix release and reacquire the device_offload_lock.

Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-20 20:35:28 -07:00
Linus Torvalds
9e5de623a0 NFS client bugfixes for Linux 5.1
Highlights include:
 
 Bugfixes:
 - Fix a regression in which an RPC call can be tagged with an error despite
   the transmission being successful
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcu3IgAAoJEA4mA3inWBJcYEAP/jb/O5l2ANmUyMaxew6JMTBH
 ULxTyqM8R6B0eifWpBzFjoQjEhywyY6Y3WtY5SnPszHEAR9xZdkLtfTQRySNKOAG
 l19VbhIlSKFbBOvXAOXkcdaaIsk3XPl66Drk9SD3DMa5RYdBcF0Kc7bUfF1N4dS7
 SS9T97AelHSpMJLpy7CxMMccIXj/Z/65nHNtKlbG7IRqKBdcM6mJeiut+6R/CUv9
 NTq+f+ZgnJYsra1oIWbnn5CvQc0c0owl//NIqhos5GFwOAdfdKxNcEuDMcy7QM06
 QW+T51omJgQuLwP3EOsOge6apo8Y8FlmxeCCraLGRDKHUlvTJzqtBpO35bJSsJSE
 6qjpS8xu2luFmIDGumQib/xVYzVJJjmOJCMUYM6x+XV0TUhJ2l7WASmt7M+1OJCT
 j7MJqLBdl2rCM+iNa2tDWLHUElsLOqBCqyFHvngR45yYzyCfUPUM3J+BMij/3yGA
 7dO4R2Ii346Iv8iafxr2FfOIB24OxkkDb+I/cVI7GmzbyCTixKaCjrDcJVlsMJCf
 rGfZXrep2zrP6AmompOCxUeU3RzJZ4RJRdJ3gTm3aFmqxGlPA5U1MR7MIkRSN4MB
 HGShI3jW/Ph9ZoYj7nZ+afyu0V8WJDDtdyqDs/k7HCKeFv2/TL/XxMjlas4n0puB
 yhvwt1iW0ZO5f5RlJ6vy
 =3T3x
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.1-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfix from Trond Myklebust:
 "Fix a regression in which an RPC call can be tagged with an error
  despite the transmission being successful"

* tag 'nfs-for-5.1-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: Ignore queue transmission errors on successful transmission
2019-04-20 12:55:23 -07:00
Eric Dumazet
d7cc399e12 tcp: properly reset skb->truesize for tx recycling
tcp sendmsg() and sendpage() normally advance skb->data_len
and skb->truesize by the payload added to an skb.

But sendmsg(fd, ..., MSG_ZEROCOPY) has to account for whole pages,
even if a single byte of payload is used in the page.

This means that we can not assume skb->truesize can be adjusted
by skb->data_len. We must instead overwrite its value.

Otherwise skb->truesize is too big and can hit socket sndbuf limit,
especially if the skb is recycled multiple times :/

Fixes: 472c2e07ee ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19 16:42:33 -07:00
Tung Nguyen
42e5425aa0 tipc: introduce new socket option TIPC_SOCK_RECVQ_USED
When using TIPC_SOCK_RECVQ_DEPTH for getsockopt(), it returns the
number of buffers in receive socket buffer which is not so helpful
for user space applications.

This commit introduces the new option TIPC_SOCK_RECVQ_USED which
returns the current allocated bytes of the receive socket buffer.
This helps user space applications dimension its buffer usage to
avoid buffer overload issue.

Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19 14:59:05 -07:00
Arnd Bergmann
0768e17073 net: socket: implement 64-bit timestamps
The 'timeval' and 'timespec' data structures used for socket timestamps
are going to be redefined in user space based on 64-bit time_t in future
versions of the C library to deal with the y2038 overflow problem,
which breaks the ABI definition.

Unlike many modern ioctl commands, SIOCGSTAMP and SIOCGSTAMPNS do not
use the _IOR() macro to encode the size of the transferred data, so it
remains ambiguous whether the application uses the old or new layout.

The best workaround I could find is rather ugly: we redefine the command
code based on the size of the respective data structure with a ternary
operator. This lets it get evaluated as late as possible, hopefully after
that structure is visible to the caller. We cannot use an #ifdef here,
because inux/sockios.h might have been included before any libc header
that could determine the size of time_t.

The ioctl implementation now interprets the new command codes as always
referring to the 64-bit structure on all architectures, while the old
architecture specific command code still refers to the old architecture
specific layout. The new command number is only used when they are
actually different.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19 14:07:40 -07:00
Arnd Bergmann
c7cbdbf29f net: rework SIOCGSTAMP ioctl handling
The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
socket protocol handlers, and all of those end up calling the same
sock_get_timestamp()/sock_get_timestampns() helper functions, which
results in a lot of duplicate code.

With the introduction of 64-bit time_t on 32-bit architectures, this
gets worse, as we then need four different ioctl commands in each
socket protocol implementation.

To simplify that, let's add a new .gettstamp() operation in
struct proto_ops, and move ioctl implementation into the common
sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
through.

We can reuse the sock_get_timestamp() implementation, but generalize
it so it can deal with both native and compat mode, as well as
timeval and timespec structures.

Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19 14:07:40 -07:00
Mike Manning
8e1acd4fc5 bridge: update vlan dev link state for bridge netdev changes
If vlan bridge binding is enabled, then the link state of a vlan device
that is an upper device of the bridge tracks the state of bridge ports
that are members of that vlan. But this can only be done when the link
state of the bridge is up. If it is down, then the link state of the
vlan devices must also be down. This is to maintain existing behavior
for when STP is enabled and there are no live ports, in which case the
link state for the bridge and any vlan devices is down.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19 13:58:17 -07:00
Mike Manning
80900acd3a bridge: update vlan dev state when port added to or deleted from vlan
If vlan bridge binding is enabled, then the link state of a vlan device
that is an upper device of the bridge should track the state of bridge
ports that are members of that vlan. So if a bridge port becomes or
stops being a member of a vlan, then update the link state of the
vlan device if necessary.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19 13:58:17 -07:00
Mike Manning
9c0ec2e718 bridge: support binding vlan dev link state to vlan member bridge ports
In the case of vlan filtering on bridges, the bridge may also have the
corresponding vlan devices as upper devices. A vlan bridge binding mode
is added to allow the link state of the vlan device to track only the
state of the subset of bridge ports that are also members of the vlan,
rather than that of all bridge ports. This mode is set with a vlan flag
rather than a bridge sysfs so that the 8021q module is aware that it
should not set the link state for the vlan device.

If bridge vlan is configured, the bridge device event handling results
in the link state for an upper device being set, if it is a vlan device
with the vlan bridge binding mode enabled. This also sets a
vlan_bridge_binding flag so that subsequent UP/DOWN/CHANGE events for
the ports in that bridge result in a link state update of the vlan
device if required.

The link state of the vlan device is up if there is at least one bridge
port that is a vlan member that is admin & oper up, otherwise its oper
state is IF_OPER_LOWERLAYERDOWN.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19 13:58:17 -07:00
Mike Manning
76052d8c4f vlan: do not transfer link state in vlan bridge binding mode
In vlan bridge binding mode, the link state is no longer transferred
from the lower device. Instead it is set by the bridge module according
to the state of bridge ports that are members of the vlan.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19 13:58:17 -07:00
Mike Manning
8c8b3458d0 vlan: support binding link state to vlan member bridge ports
In the case of vlan filtering on bridges, the bridge may also have the
corresponding vlan devices as upper devices. Currently the link state
of vlan devices is transferred from the lower device. So this is up if
the bridge is in admin up state and there is at least one bridge port
that is up, regardless of the vlan that the port is a member of.

The link state of the vlan device may need to track only the state of
the subset of ports that are also members of the corresponding vlan,
rather than that of all ports.

Add a flag to specify a vlan bridge binding mode, by which the link
state is no longer automatically transferred from the lower device,
but is instead determined by the bridge ports that are members of the
vlan.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19 13:58:17 -07:00
Jakub Kicinski
23bddf692d net/sched: taprio: fix build without 64bit div
Recent changes to taprio did not use the correct div64 helpers,
leading to:

net/sched/sch_taprio.o: In function `taprio_dequeue':
sch_taprio.c:(.text+0x34a): undefined reference to `__divdi3'
net/sched/sch_taprio.o: In function `advance_sched':
sch_taprio.c:(.text+0xa0b): undefined reference to `__divdi3'
net/sched/sch_taprio.o: In function `taprio_init':
sch_taprio.c:(.text+0x1450): undefined reference to `__divdi3'
/home/jkicinski/devel/linux/Makefile:1032: recipe for target 'vmlinux' failed

Use math64 helpers.

Fixes: 7b9eba7ba0 ("net/sched: taprio: fix picos_per_byte miscalculation")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18 17:06:15 -07:00
Jakub Kicinski
503c018801 l2tp: fix set but not used variable
GCC complains:

net/l2tp/l2tp_ppp.c: In function ‘pppol2tp_ioctl’:
net/l2tp/l2tp_ppp.c:1073:6: warning: variable ‘val’ set but not used [-Wunused-but-set-variable]
  int val;
      ^~~

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18 17:06:15 -07:00
Stephen Suryaputra
0bc1998544 ipv6: Add rate limit mask for ICMPv6 messages
To make ICMPv6 closer to ICMPv4, add ratemask parameter. Since the ICMP
message types use larger numeric values, a simple bitmask doesn't fit.
I use large bitmap. The input and output are the in form of list of
ranges. Set the default to rate limit all error messages but Packet Too
Big. For Packet Too Big, use ratemask instead of hard-coded.

There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
aren't called. This patch only adds them to icmpv6_echo_reply().

Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
that it is also acceptable to rate limit informational messages. Thus,
I removed the current hard-coded behavior of icmpv6_mask_allow() that
doesn't rate limit informational messages.

v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
    isn't defined, expand the description in ip-sysctl.txt and remove
    unnecessary conditional before kfree().
v3: Inline the bitmap instead of dynamically allocated. Still is a
    pointer to it is needed because of the way proc_do_large_bitmap work.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18 16:58:37 -07:00
Jakub Kicinski
9188d5ca45 net/tls: fix refcount adjustment in fallback
Unlike atomic_add(), refcount_add() does not deal well
with a negative argument.  TLS fallback code reallocates
the skb and is very likely to shrink the truesize, leading to:

[  189.513254] WARNING: CPU: 5 PID: 0 at lib/refcount.c:81 refcount_add_not_zero_checked+0x15c/0x180
 Call Trace:
  refcount_add_checked+0x6/0x40
  tls_enc_skb+0xb93/0x13e0 [tls]

Once wmem_allocated count saturates the application can no longer
send data on the socket.  This is similar to Eric's fixes for GSO,
TCP:
commit 7ec318feee ("tcp: gso: avoid refcount_t warning from tcp_gso_segment()")
and UDP:
commit 575b65bc5b ("udp: avoid refcount_t saturation in __udp_gso_segment()").

Unlike the GSO case, for TLS fallback it's likely that the skb has
shrunk, so the "likely" annotation is the other way around (likely
branch being "sub").

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18 16:51:03 -07:00
Colin Ian King
d5f6db3538 net: ipv6: addrlabel: fix spelling mistake "requewst" -> "request"
There is a spelling mistake in a NL_SET_ERR_MSG_MOD error message,
fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18 10:44:17 -07:00
Ingo Molnar
94e4dcc75a Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU and LKMM commits from Paul E. McKenney:

 - An LKMM commit adding support for synchronize_srcu_expedited()
 - A couple of straggling RCU flavor consolidation updates
 - Documentation updates.
 - Miscellaneous fixes
 - SRCU updates
 - RCU CPU stall-warning updates
 - Torture-test updates

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-18 14:42:24 +02:00
David Ahern
b8fb1ab461 net ipv6: Prevent neighbor add if protocol is disabled on device
Disabling IPv6 on an interface removes existing entries but nothing prevents
new entries from being manually added. To that end, add a new neigh_table
operation, allow_add, that is called on RTM_NEWNEIGH to see if neighbor
entries are allowed on a given device. If IPv6 is disabled on the device,
allow_add returns false and passes a message back to the user via extack.

  $ echo 1 > /proc/sys/net/ipv6/conf/eth1/disable_ipv6
  $ ip -6 neigh add fe80::4c88:bff:fe21:2704 dev eth1 lladdr de:ad:be:ef:01:01
  Error: IPv6 is disabled on this device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:19:07 -07:00
David Ahern
7d21fec904 ipv6: Add fib6_type and fib6_flags to fib6_result
Add the fib6_flags and fib6_type to fib6_result. Update the lookup helpers
to set them and update post fib lookup users to use the version from the
result.

This allows nexthop objects to have blackhole nexthop.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:11:30 -07:00
David Ahern
effda4dd97 ipv6: Pass fib6_result to fib lookups
Change fib6_lookup and fib6_table_lookup to take a fib6_result and set
f6i and nh rather than returning a fib6_info. For now both always
return 0.

A later patch set can make these more like the IPv4 counterparts and
return EINVAL, EACCESS, etc based on fib6_type.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:47 -07:00
David Ahern
8ff2e5b26c ipv6: Pass fib6_result to fib6_table_lookup tracepoint
Change fib6_table_lookup tracepoint to take the fib6_result and use
the fib6_info and fib6_nh from it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:47 -07:00
David Ahern
b7bc4b6a62 ipv6: Pass fib6_result to rt6_select and find_rr_leaf
Pass fib6_result to rt6_select. Instead of returning the fib entry, it
will set f6i and nh based on the lookup.

find_rr_leaf is changed to remove the match option in favor of taking
fib6_result and having __find_rr_leaf set f6i in the result.

In the process, update fib6_info references in __find_rr_leaf to f6i names.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:47 -07:00
David Ahern
75ef7389dd ipv6: Pass fib6_result to rt6_device_match
Pass fib6_result to rt6_device_match with f6i set. rt6_device_match
updates f6i in the result if it finds a better match and sets nh.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:47 -07:00
David Ahern
b748f26092 ipv6: Pass fib6_result to ip6_mtu_from_fib6 and fib6_mtu
Change ip6_mtu_from_fib6 and fib6_mtu to take a fib6_result over a
fib6_info. Update both to use the fib6_nh from fib6_result.

Since the signature of ip6_mtu_from_fib6 is already changing, add const
to daddr and saddr.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:46 -07:00
David Ahern
5012f0a594 ipv6: Pass fib6_result to rt6_insert_exception
Update rt6_insert_exception to take a fib6_result over a fib6_info.
Change ort to f6i from the fib6_result and rename to better reflect
what it references (a fib6_info).

Since this function is already getting changed, update the comments
to reference fib6_info variables rather than the older rt6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:46 -07:00
David Ahern
0d16158149 ipv6: Pass fib6_result to ip6_rt_get_dev_rcu and ip6_rt_copy_init
Now that all callers are update to have a fib6_result, pass it down
to ip6_rt_get_dev_rcu, ip6_rt_copy_init, and ip6_rt_init_dst.

In the process, change ort to f6i in ip6_rt_copy_init to make it
clear it is a reference to a fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:46 -07:00
David Ahern
db3fedee0c ipv6: Pass fib6_result to pcpu route functions
Update ip6_rt_pcpu_alloc, rt6_get_pcpu_route and rt6_make_pcpu_route
to a fib6_result over a fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:46 -07:00
David Ahern
9b6b35abfb ipv6: Pass fib6_result to ip6_create_rt_rcu
Change ip6_create_rt_rcu to take fib6_result over a fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:46 -07:00
David Ahern
85bd05deb3 ipv6: Pass fib6_result to ip6_rt_cache_alloc
Change ip6_rt_cache_alloc to take a fib6_result over a fib6_info.

Since ip6_rt_cache_alloc is only the caller, update the
rt6_is_gw_or_nonexthop helper to take fib6_result.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:46 -07:00
David Ahern
7e4b512875 ipv6: Pass fib6_result to rt6_find_cached_rt
Simplify rt6_find_cached_rt for the fast path cases and pass fib6_result
to rt6_find_cached_rt. Rename the local return variable to ret to maintain
consisting with fib6_result name.

Update the comment in rt6_find_cached_rt to reference the new names in
a fib6_info vs the old name when fib entries were an rt6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:08:51 -07:00
David Ahern
b1d4099150 ipv6: Rename fib6_multipath_select and pass fib6_result
Add 'struct fib6_result' to hold the fib entry and fib6_nh from a fib
lookup as separate entries, similar to what IPv4 now has with fib_result.

Rename fib6_multipath_select to fib6_select_path, pass fib6_result to
it, and set f6i and nh in the result once a path selection is done.
Call fib6_select_path unconditionally for path selection which means
moving the sibling and oif check to fib6_select_path. To handle the two
different call paths (2 only call multipath_select if flowi6_oif == 0 and
the other always calls it), add a new have_oif_match that controls the
sibling walk if relevant.

Update callers of fib6_multipath_select accordingly and have them use the
fib6_info and fib6_nh from the result.

This is needed for multipath nexthop objects where a single f6i can
point to multiple fib6_nh (similar to IPv4).

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:08:51 -07:00
Jesper Dangaard Brouer
ba0509b688 net: core: introduce build_skb_around
The function build_skb() also have the responsibility to allocate and clear
the SKB structure. Introduce a new function build_skb_around(), that moves
the responsibility of allocation and clearing to the caller. This allows
caller to use kmem_cache (slab/slub) bulk allocation API.

Next patch use this function combined with kmem_cache_alloc_bulk.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-17 19:09:24 -07:00
ZhangXiaoxu
19fad20d15 ipv4: set the tcp_min_rtt_wlen range from 0 to one day
There is a UBSAN report as below:
UBSAN: Undefined behaviour in net/ipv4/tcp_input.c:2877:56
signed integer overflow:
2147483647 * 1000 cannot be represented in type 'int'
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.1.0-rc4-00058-g582549e #1
Call Trace:
 <IRQ>
 dump_stack+0x8c/0xba
 ubsan_epilogue+0x11/0x60
 handle_overflow+0x12d/0x170
 ? ttwu_do_wakeup+0x21/0x320
 __ubsan_handle_mul_overflow+0x12/0x20
 tcp_ack_update_rtt+0x76c/0x780
 tcp_clean_rtx_queue+0x499/0x14d0
 tcp_ack+0x69e/0x1240
 ? __wake_up_sync_key+0x2c/0x50
 ? update_group_capacity+0x50/0x680
 tcp_rcv_established+0x4e2/0xe10
 tcp_v4_do_rcv+0x22b/0x420
 tcp_v4_rcv+0xfe8/0x1190
 ip_protocol_deliver_rcu+0x36/0x180
 ip_local_deliver+0x15b/0x1a0
 ip_rcv+0xac/0xd0
 __netif_receive_skb_one_core+0x7f/0xb0
 __netif_receive_skb+0x33/0xc0
 netif_receive_skb_internal+0x84/0x1c0
 napi_gro_receive+0x2a0/0x300
 receive_buf+0x3d4/0x2350
 ? detach_buf_split+0x159/0x390
 virtnet_poll+0x198/0x840
 ? reweight_entity+0x243/0x4b0
 net_rx_action+0x25c/0x770
 __do_softirq+0x19b/0x66d
 irq_exit+0x1eb/0x230
 do_IRQ+0x7a/0x150
 common_interrupt+0xf/0xf
 </IRQ>

It can be reproduced by:
  echo 2147483647 > /proc/sys/net/ipv4/tcp_min_rtt_wlen

Fixes: f672258391 ("tcp: track min RTT using windowed min-filter")
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 13:57:11 -07:00
Trond Myklebust
a7b1a4839f SUNRPC: Ignore queue transmission errors on successful transmission
If a request transmission fails due to write space or slot unavailability
errors, but the queued task then gets transmitted before it has time to
process the error in call_transmit_status() or call_bc_transmit_status(),
we need to suppress the transmission error code to prevent it from leaking
out of the RPC layer.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2019-04-17 16:07:28 -04:00
David S. Miller
6b0a7f84ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflict resolution of af_smc.c from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 11:26:25 -07:00
Linus Torvalds
2a3a028fc6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle init flow failures properly in iwlwifi driver, from Shahar S
    Matityahu.

 2) mac80211 TXQs need to be unscheduled on powersave start, from Felix
    Fietkau.

 3) SKB memory accounting fix in A-MDSU aggregation, from Felix Fietkau.

 4) Increase RCU lock hold time in mlx5 FPGA code, from Saeed Mahameed.

 5) Avoid checksum complete with XDP in mlx5, also from Saeed.

 6) Fix netdev feature clobbering in ibmvnic driver, from Thomas Falcon.

 7) Partial sent TLS record leak fix from Jakub Kicinski.

 8) Reject zero size iova range in vhost, from Jason Wang.

 9) Allow pending work to complete before clcsock release from Karsten
    Graul.

10) Fix XDP handling max MTU in thunderx, from Matteo Croce.

11) A lot of protocols look at the sa_family field of a sockaddr before
    validating it's length is large enough, from Tetsuo Handa.

12) Don't write to free'd pointer in qede ptp error path, from Colin Ian
    King.

13) Have to recompile IP options in ipv4_link_failure because it can be
    invoked from ARP, from Stephen Suryaputra.

14) Doorbell handling fixes in qed from Denis Bolotin.

15) Revert net-sysfs kobject register leak fix, it causes new problems.
    From Wang Hai.

16) Spectre v1 fix in ATM code, from Gustavo A. R. Silva.

17) Fix put of BROPT_VLAN_STATS_PER_PORT in bridging code, from Nikolay
    Aleksandrov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (111 commits)
  socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW
  tcp: tcp_grow_window() needs to respect tcp_space()
  ocelot: Clean up stats update deferred work
  ocelot: Don't sleep in atomic context (irqs_disabled())
  net: bridge: fix netlink export of vlan_stats_per_port option
  qed: fix spelling mistake "faspath" -> "fastpath"
  tipc: set sysctl_tipc_rmem and named_timeout right range
  tipc: fix link established but not in session
  net: Fix missing meta data in skb with vlan packet
  net: atm: Fix potential Spectre v1 vulnerabilities
  net/core: work around section mismatch warning for ptp_classifier
  net: bridge: fix per-port af_packet sockets
  bnx2x: fix spelling mistake "dicline" -> "decline"
  route: Avoid crash from dereferencing NULL rt->from
  MAINTAINERS: normalize Woojung Huh's email address
  bonding: fix event handling for stacked bonds
  Revert "net-sysfs: Fix memory leak in netdev_register_kobject"
  rtnetlink: fix rtnl_valid_stats_req() nlmsg_len check
  qed: Fix the DORQ's attentions handling
  qed: Fix missing DORQ attentions
  ...
2019-04-17 09:57:45 -07:00
Arnd Bergmann
e6986423d2 socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW
It looks like the new socket options only work correctly
for native execution, but in case of compat mode fall back
to the old behavior as we ignore the 'old_timeval' flag.

Rework so we treat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW the
same way in compat and native 32-bit mode.

Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Fixes: a9beb86ae6 ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16 21:52:22 -07:00
Eric Dumazet
50ce163a72 tcp: tcp_grow_window() needs to respect tcp_space()
For some reason, tcp_grow_window() correctly tests if enough room
is present before attempting to increase tp->rcv_ssthresh,
but does not prevent it to grow past tcp_space()

This is causing hard to debug issues, like failing
the (__tcp_select_window(sk) >= tp->rcv_wnd) test
in __tcp_ack_snd_check(), causing ACK delays and possibly
slow flows.

Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio,
we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000"
after about 60 round trips, when the active side no longer sends
immediate acks.

This bug predates git history.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16 21:47:39 -07:00
Nikolay Aleksandrov
600bea7dba net: bridge: fix netlink export of vlan_stats_per_port option
Since the introduction of the vlan_stats_per_port option the netlink
export of it has been broken since I made a typo and used the ifla
attribute instead of the bridge option to retrieve its state.
Sysfs export is fine, only netlink export has been affected.

Fixes: 9163a0fc1f ("net: bridge: add support for per-port vlan stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16 21:40:29 -07:00
Jie Liu
4bcd4ec101 tipc: set sysctl_tipc_rmem and named_timeout right range
We find that sysctl_tipc_rmem and named_timeout do not have the right minimum
setting. sysctl_tipc_rmem should be larger than zero, like sysctl_tcp_rmem.
And named_timeout as a timeout setting should be not less than zero.

Fixes: cc79dd1ba9 ("tipc: change socket buffer overflow control to respect sk_rcvbuf")
Fixes: a5325ae5b8 ("tipc: add name distributor resiliency queue")
Signed-off-by: Jie Liu <liujie165@huawei.com>
Reported-by: Qiang Ning <ningqiang1@huawei.com>
Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16 21:32:02 -07:00
Tuong Lien
f7a937801b tipc: fix link established but not in session
According to the link FSM, when a link endpoint got RESET_MSG (- a
traditional one without the stopping bit) from its peer, it moves to
PEER_RESET state and raises a LINK_DOWN event which then resets the
link itself. Its state will become ESTABLISHING after the reset event
and the link will be re-established soon after this endpoint starts to
send ACTIVATE_MSG to the peer.

There is no problem with this mechanism, however the link resetting has
cleared the link 'in_session' flag (along with the other important link
data such as: the link 'mtu') that was correctly set up at the 1st step
(i.e. when this endpoint received the peer RESET_MSG). As a result, the
link will become ESTABLISHED, but the 'in_session' flag is not set, and
all STATE_MSG from its peer will be dropped at the link_validate_msg().
It means the link not synced and will sooner or later face a failure.

Since the link reset action is obviously needed for a new link session
(this is also true in the other situations), the problem here is that
the link is re-established a bit too early when the link endpoints are
not really in-sync yet. The commit forces a resync as already done in
the previous commit 91986ee166 ("tipc: fix link session and
re-establish issues") by simply varying the link 'peer_session' value
at the link_reset().

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16 21:31:26 -07:00
Yuya Kusakabe
d85e8be2a5 net: Fix missing meta data in skb with vlan packet
skb_reorder_vlan_header() should move XDP meta data with ethernet header
if XDP meta data exists.

Fixes: de8f3a83b0 ("bpf: add meta pointer for direct access")
Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com>
Signed-off-by: Takeru Hayasaka <taketarou2@gmail.com>
Co-developed-by: Takeru Hayasaka <taketarou2@gmail.com>
Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16 21:29:38 -07:00
Gustavo A. R. Silva
899537b735 net: atm: Fix potential Spectre v1 vulnerabilities
arg is controlled by user-space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

net/atm/lec.c:715 lec_mcast_attach() warn: potential spectre issue 'dev_lec' [r] (local cap)

Fix this by sanitizing arg before using it to index dev_lec.

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://lore.kernel.org/lkml/20180423164740.GY17484@dhcp22.suse.cz/

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16 21:01:45 -07:00
Ard Biesheuvel
ad910c7c01 net/core: work around section mismatch warning for ptp_classifier
The routine ptp_classifier_init() uses an initializer for an
automatic struct type variable which refers to an __initdata
symbol. This is perfectly legal, but may trigger a section
mismatch warning when running the compiler in -fpic mode, due
to the fact that the initializer may be emitted into an anonymous
.data section thats lack the __init annotation. So work around it
by using assignments instead.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16 20:46:17 -07:00
Nikolay Aleksandrov
3b2e2904de net: bridge: fix per-port af_packet sockets
When the commit below was introduced it changed two visible things:
 - the skb was no longer passed through the protocol handlers with the
   original device
 - the skb was passed up the stack with skb->dev = bridge

The first change broke af_packet sockets on bridge ports. For example we
use them for hostapd which listens for ETH_P_PAE packets on the ports.
We discussed two possible fixes:
 - create a clone and pass it through NF_HOOK(), act on the original skb
   based on the result
 - somehow signal to the caller from the okfn() that it was called,
   meaning the skb is ok to be passed, which this patch is trying to
   implement via returning 1 from the bridge link-local okfn()

Note that we rely on the fact that NF_QUEUE/STOLEN would return 0 and
drop/error would return < 0 thus the okfn() is called only when the
return was 1, so we signal to the caller that it was called by preserving
the return value from nf_hook().

Fixes: 8626c56c82 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16 20:30:40 -07:00
Magnus Karlsson
f63666de2b xsk: fix XDP socket ring buffer memory ordering
The ring buffer code of XDP sockets is missing a memory barrier on the
consumer side between the load of the data and the write that signals
that it is ok for the producer to put new data into the buffer. On
architectures that does not guarantee that stores are not reordered
with older loads, the producer might put data into the ring before the
consumer had the chance to read it. As IA does guarantee this
ordering, it would only need a compiler barrier here, but there are no
primitives in Linux for this specific case (hinder writes to be ordered
before older reads) so I had to add a smp_mb() here which will
translate into a run-time synch operation on IA.

Added a longish comment in the code explaining what each barrier in
the ring implementation accomplishes and what would happen if we
removed one of them.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-16 20:13:10 -07:00
Viet Hoang Tran
725721a650 bpf: allow clearing all sock_ops callback flags
The helper function bpf_sock_ops_cb_flags_set() can be used to both
set and clear the sock_ops callback flags. However, its current
behavior is not consistent. BPF program may clear a flag if more than
one were set, or replace a flag with another one, but cannot clear all
flags.

This patch also updates the documentation to clarify the ability to
clear flags of this helper function.

Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-16 19:24:20 -07:00
Willem de Bruijn
43537b8e2d bpf: reserve flags in bpf_skb_net_shrink
The ENCAP flags in bpf_skb_adjust_room are ignored on decap with
bpf_skb_net_shrink. Reserve these bits for future use.

Fixes: 868d523535 ("bpf: add bpf_skb_adjust_room encap flags")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-16 09:56:20 +02:00
Murali Karicheri
ee2c46f353 net: hsr: add tx stats for master interface
Add tx stats to hsr interface. Without this
ifconfig for hsr interface doesn't show tx packet stats.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 17:22:02 -07:00
Murali Karicheri
3271273388 net: hsr: fix debugfs path to support multiple interfaces
Fix the path of hsr debugfs root directory to use the net device
name so that it can work with multiple interfaces. While at it,
also fix some typos.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 17:22:02 -07:00
Murali Karicheri
9c5f8a19b2 net: hsr: fix naming of file and functions
Fix the file name and functions to match with existing implementation.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 17:22:01 -07:00
Xin Long
9dde27de3e sctp: implement memory accounting on rx path
sk_forward_alloc's updating is also done on rx path, but to be consistent
we change to use sk_mem_charge() in sctp_skb_set_owner_r().

In sctp_eat_data(), it's not enough to check sctp_memory_pressure only,
which doesn't work for mem_cgroup_sockets_enabled, so we change to use
sk_under_memory_pressure().

When it's under memory pressure, sk_mem_reclaim() and sk_rmem_schedule()
should be called on both RENEGE or CHUNK DELIVERY path exit the memory
pressure status as soon as possible.

Note that sk_rmem_schedule() is using datalen to make things easy there.

Reported-by: Matteo Croce <mcroce@redhat.com>
Tested-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:36:51 -07:00
Xin Long
1033990ac5 sctp: implement memory accounting on tx path
Now when sending packets, sk_mem_charge() and sk_mem_uncharge() have been
used to set sk_forward_alloc. We just need to call sk_wmem_schedule() to
check if the allocated should be raised, and call sk_mem_reclaim() to
check if the allocated should be reduced when it's under memory pressure.

If sk_wmem_schedule() returns false, which means no memory is allowed to
allocate, it will block and wait for memory to become available.

Note different from tcp, sctp wait_for_buf happens before allocating any
skb, so memory accounting check is done with the whole msg_len before it
too.

Reported-by: Matteo Croce <mcroce@redhat.com>
Tested-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:36:51 -07:00
Jonathan Lemon
9c69a13205 route: Avoid crash from dereferencing NULL rt->from
When __ip6_rt_update_pmtu() is called, rt->from is RCU dereferenced, but is
never checked for null - rt6_flush_exceptions() may have removed the entry.

[ 1913.989004] RIP: 0010:ip6_rt_cache_alloc+0x13/0x170
[ 1914.209410] Call Trace:
[ 1914.214798]  <IRQ>
[ 1914.219226]  __ip6_rt_update_pmtu+0xb0/0x190
[ 1914.228649]  ip6_tnl_xmit+0x2c2/0x970 [ip6_tunnel]
[ 1914.239223]  ? ip6_tnl_parse_tlv_enc_lim+0x32/0x1a0 [ip6_tunnel]
[ 1914.252489]  ? __gre6_xmit+0x148/0x530 [ip6_gre]
[ 1914.262678]  ip6gre_tunnel_xmit+0x17e/0x3c7 [ip6_gre]
[ 1914.273831]  dev_hard_start_xmit+0x8d/0x1f0
[ 1914.283061]  sch_direct_xmit+0xfa/0x230
[ 1914.291521]  __qdisc_run+0x154/0x4b0
[ 1914.299407]  net_tx_action+0x10e/0x1f0
[ 1914.307678]  __do_softirq+0xca/0x297
[ 1914.315567]  irq_exit+0x96/0xa0
[ 1914.322494]  smp_apic_timer_interrupt+0x68/0x130
[ 1914.332683]  apic_timer_interrupt+0xf/0x20
[ 1914.341721]  </IRQ>

Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:31:59 -07:00
Wang Hai
8ed633b9ba Revert "net-sysfs: Fix memory leak in netdev_register_kobject"
This reverts commit 6b70fc94af.

The reverted bugfix will cause another issue.
Reported by syzbot+6024817a931b2830bc93@syzkaller.appspotmail.com.
See https://syzkaller.appspot.com/x/log.txt?x=1737671b200000 for
details.

Signed-off-by: Wang Hai <wanghai26@huawei.com>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:10:27 -07:00
David S. Miller
95337b9821 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Remove the broute pseudo hook, implement this from the bridge
   prerouting hook instead. Now broute becomes real table in ebtables,
   from Florian Westphal. This also includes a size reduction patch for the
   bridge control buffer area via squashing boolean into bitfields and
   a selftest.

2) Add OS passive fingerprint version matching, from Fernando Fernandez.

3) Support for gue encapsulation for IPVS, from Jacky Hu.

4) Add support for NAT to the inet family, from Florian Westphal.
   This includes support for masquerade, redirect and nat extensions.

5) Skip interface lookup in flowtable, use device in the dst object.

6) Add jiffies64_to_msecs() and use it, from Li RongQing.

7) Remove unused parameter in nf_tables_set_desc_parse(), from Colin Ian King.

8) Statify several functions, patches from YueHaibing and Florian Westphal.

9) Add an optimized version of nf_inet_addr_cmp(), from Li RongQing.

10) Merge route extension to core, also from Florian.

11) Use IS_ENABLED(CONFIG_NF_NAT) instead of NF_NAT_NEEDED, from Florian.

12) Merge ip/ip6 masquerade extensions, from Florian. This includes
    netdevice notifier unification.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 12:07:35 -07:00
Stephen Rothwell
dc2f4189dc bridge: only include nf_queue.h if needed
After merging the netfilter-next tree, today's linux-next build (powerpc
ppc44x_defconfig) failed like this:

In file included from net/bridge/br_input.c:19:
include/net/netfilter/nf_queue.h:16:23: error: field 'state' has incomplete type
  struct nf_hook_state state;
                       ^~~~~

Fixes: 971502d77f ("bridge: netfilter: unroll NF_HOOK helper in bridge input path")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15 18:47:36 +02:00
Florian Westphal
e54d152765 xfrm: kconfig: make xfrm depend on inet
when CONFIG_INET is not enabled:
net/xfrm/xfrm_output.c: In function ‘xfrm4_tunnel_encap_add’:
net/xfrm/xfrm_output.c:234:2: error: implicit declaration of function ‘ip_select_ident’ [-Werror=implicit-function-declaration]
ip_select_ident(dev_net(dst->dev), skb, NULL);

XFRM only supports ipv4 and ipv6 so change dependency to INET and place
user-visible options (pfkey sockets, migrate support and the like)
under 'if INET' guard as well.

Fixes: 1de7083006 ("xfrm: remove output2 indirection from xfrm_mode")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-15 11:09:20 +02:00
Florian Westphal
5bdac418f3 netfilter: nat: fix icmp id randomization
Sven Auhagen reported that a 2nd ping request will fail if 'fully-random'
mode is used.

Reason is that if no proto information is given, min/max are both 0,
so we set the icmp id to 0 instead of chosing a random value between
0 and 65535.

Update test case as well to catch this, without fix this yields:
[..]
ERROR: cannot ping ns1 from ns2 with ip masquerade fully-random (attempt 2)
ERROR: cannot ping ns1 from ns2 with ipv6 masquerade fully-random (attempt 2)

... becaus 2nd ping clashes with existing 'id 0' icmp conntrack and gets
dropped.

Fixes: 203f2e7820 ("netfilter: nat: remove l4proto->unique_tuple")
Reported-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15 07:31:50 +02:00
Dan Carpenter
33d1c01817 netfilter: nf_tables: prevent shift wrap in nft_chain_parse_hook()
I believe that "hook->num" can be up to UINT_MAX.  Shifting more than
31 bits would is undefined in C but in practice it would lead to shift
wrapping.  That would lead to an array overflow in nf_tables_addchain():

	ops->hook       = hook.type->hooks[ops->hooknum];

Fixes: fe19c04ca1 ("netfilter: nf_tables: remove nhooks field from struct nft_af_info")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15 07:31:50 +02:00
Florian Westphal
3c79107631 netfilter: ctnetlink: don't use conntrack/expect object addresses as id
else, we leak the addresses to userspace via ctnetlink events
and dumps.

Compute an ID on demand based on the immutable parts of nf_conn struct.

Another advantage compared to using an address is that there is no
immediate re-use of the same ID in case the conntrack entry is freed and
reallocated again immediately.

Fixes: 3583240249 ("[NETFILTER]: nf_conntrack_expect: kill unique ID")
Fixes: 7f85f91472 ("[NETFILTER]: nf_conntrack: kill unique ID")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15 07:31:44 +02:00
Eric Dumazet
69f23a09da rtnetlink: fix rtnl_valid_stats_req() nlmsg_len check
Jakub forgot to either use nlmsg_len() or nlmsg_msg_size(),
allowing KMSAN to detect a possible uninit-value in rtnl_stats_get

BUG: KMSAN: uninit-value in rtnl_stats_get+0x6d9/0x11d0 net/core/rtnetlink.c:4997
CPU: 0 PID: 10428 Comm: syz-executor034 Not tainted 5.1.0-rc2+ #24
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x173/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x131/0x2a0 mm/kmsan/kmsan.c:619
 __msan_warning+0x7a/0xf0 mm/kmsan/kmsan_instr.c:310
 rtnl_stats_get+0x6d9/0x11d0 net/core/rtnetlink.c:4997
 rtnetlink_rcv_msg+0x115b/0x1550 net/core/rtnetlink.c:5192
 netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2485
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5210
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x127f/0x1300 net/netlink/af_netlink.c:1925
 sock_sendmsg_nosec net/socket.c:622 [inline]
 sock_sendmsg net/socket.c:632 [inline]
 ___sys_sendmsg+0xdb3/0x1220 net/socket.c:2137
 __sys_sendmsg net/socket.c:2175 [inline]
 __do_sys_sendmsg net/socket.c:2184 [inline]
 __se_sys_sendmsg+0x305/0x460 net/socket.c:2182
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2182
 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
 entry_SYSCALL_64_after_hwframe+0x63/0xe7

Fixes: 51bc860d4a ("rtnetlink: stats: validate attributes in get as well as dumps")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-14 14:10:08 -07:00
Eric Dumazet
c543cb4a5f ipv4: ensure rcu_read_lock() in ipv4_link_failure()
fib_compute_spec_dst() needs to be called under rcu protection.

syzbot reported :

WARNING: suspicious RCU usage
5.1.0-rc4+ #165 Not tainted
include/linux/inetdevice.h:220 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by swapper/0/0:
 #0: 0000000051b67925 ((&n->timer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:170 [inline]
 #0: 0000000051b67925 ((&n->timer)){+.-.}, at: call_timer_fn+0xda/0x720 kernel/time/timer.c:1315

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.0-rc4+ #165
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5162
 __in_dev_get_rcu include/linux/inetdevice.h:220 [inline]
 fib_compute_spec_dst+0xbbd/0x1030 net/ipv4/fib_frontend.c:294
 spec_dst_fill net/ipv4/ip_options.c:245 [inline]
 __ip_options_compile+0x15a7/0x1a10 net/ipv4/ip_options.c:343
 ipv4_link_failure+0x172/0x400 net/ipv4/route.c:1195
 dst_link_failure include/net/dst.h:427 [inline]
 arp_error_report+0xd1/0x1c0 net/ipv4/arp.c:297
 neigh_invalidate+0x24b/0x570 net/core/neighbour.c:995
 neigh_timer_handler+0xc35/0xf30 net/core/neighbour.c:1081
 call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
 expire_timers kernel/time/timer.c:1362 [inline]
 __run_timers kernel/time/timer.c:1681 [inline]
 __run_timers kernel/time/timer.c:1649 [inline]
 run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
 __do_softirq+0x266/0x95a kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:374 [inline]
 irq_exit+0x180/0x1d0 kernel/softirq.c:414
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807

Fixes: ed0de45a10 ("ipv4: recompile ip options in ipv4_link_failure")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-14 13:43:17 -07:00
Julian Anastasov
0261ea1bd1 ipvs: do not schedule icmp errors from tunnels
We can receive ICMP errors from client or from
tunneling real server. While the former can be
scheduled to real server, the latter should
not be scheduled, they are decapsulated only when
existing connection is found.

Fixes: 6044eeffaf ("ipvs: attempt to schedule icmp packets")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-13 14:52:57 +02:00
Alexander Potapenko
8176c83327 netfilter: conntrack: initialize ct->timeout
KMSAN started reporting an error when accessing ct->timeout for the
first time without initialization:

 BUG: KMSAN: uninit-value in __nf_ct_refresh_acct+0x1ae/0x470 net/netfilter/nf_conntrack_core.c:1765
 ...
 dump_stack+0x173/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x131/0x2a0 mm/kmsan/kmsan.c:624
 __msan_warning+0x7a/0xf0 mm/kmsan/kmsan_instr.c:310
 __nf_ct_refresh_acct+0x1ae/0x470 net/netfilter/nf_conntrack_core.c:1765
 nf_ct_refresh_acct ./include/net/netfilter/nf_conntrack.h:201
 nf_conntrack_udp_packet+0xb44/0x1040 net/netfilter/nf_conntrack_proto_udp.c:122
 nf_conntrack_handle_packet net/netfilter/nf_conntrack_core.c:1605
 nf_conntrack_in+0x1250/0x26c9 net/netfilter/nf_conntrack_core.c:1696
 ...
 Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:205
 kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:159
 kmsan_kmalloc+0xa9/0x130 mm/kmsan/kmsan_hooks.c:173
 kmem_cache_alloc+0x554/0xb10 mm/slub.c:2789
 __nf_conntrack_alloc+0x16f/0x690 net/netfilter/nf_conntrack_core.c:1342
 init_conntrack+0x6cb/0x2490 net/netfilter/nf_conntrack_core.c:1421

Signed-off-by: Alexander Potapenko <glider@google.com>
Fixes: cc16921351 ("netfilter: conntrack: avoid same-timeout update")
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-13 14:52:57 +02:00
Florian Westphal
1025ce7521 netfilter: conntrack: don't set related state for different outer address
Luca Moro says:
 ------
The issue lies in the filtering of ICMP and ICMPv6 errors that include an
inner IP datagram.
For these packets, icmp_error_message() extract the ICMP error and inner
layer to search of a known state.
If a state is found the packet is tagged as related (IP_CT_RELATED).

The problem is that there is no correlation check between the inner and
outer layer of the packet.
So one can encapsulate an error with an inner layer matching a known state,
while its outer layer is directed to a filtered host.
In this case the whole packet will be tagged as related.
This has various implications from a rule bypass (if a rule to related
trafic is allow), to a known state oracle.

Unfortunately, we could not find a real statement in a RFC on how this case
should be filtered.
The closest we found is RFC5927 (Section 4.3) but it is not very clear.

A possible fix would be to check that the inner IP source is the same than
the outer destination.

We believed this kind of attack was not documented yet, so we started to
write a blog post about it.
You can find it attached to this mail (sorry for the extract quality).
It contains more technical details, PoC and discussion about the identified
behavior.
We discovered later that
https://www.gont.com.ar/papers/filtering-of-icmp-error-messages.pdf
described a similar attack concept in 2004 but without the stateful
filtering in mind.
 -----

This implements above suggested fix:
In icmp(v6) error handler, take outer destination address, then pass
that into the common function that does the "related" association.

After obtaining the nf_conn of the matching inner-headers connection,
check that the destination address of the opposite direction tuple
is the same as the outer address and only set RELATED if thats the case.

Reported-by: Luca Moro <luca.moro@synacktiv.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-13 14:52:57 +02:00
Stephen Suryaputra
ed0de45a10 ipv4: recompile ip options in ipv4_link_failure
Recompile IP options since IPCB may not be valid anymore when
ipv4_link_failure is called from arp_error_report.

Refer to the commit 3da1ed7ac3 ("net: avoid use IPCB in cipso_v4_error")
and the commit before that (9ef6b42ad6) for a similar issue.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 17:23:46 -07:00
David Ahern
1deeb6408c ipv6: Remove flowi6_oif compare from __ip6_route_redirect
In the review of 0b34eb0043 ("ipv6: Refactor __ip6_route_redirect"),
Martin noted that the flowi6_oif compare is moved to the new helper and
should be removed from __ip6_route_redirect. Fix the oversight.

Fixes: 0b34eb0043 ("ipv6: Refactor __ip6_route_redirect")
Reported-by: Martin Lau <kafai@fb.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 17:05:38 -07:00
Jeffrey Altman
1a2391c30c rxrpc: Fix detection of out of order acks
The rxrpc packet serial number cannot be safely used to compute out of
order ack packets for several reasons:

 1. The allocation of serial numbers cannot be assumed to imply the order
    by which acks are populated and transmitted.  In some rxrpc
    implementations, delayed acks and ping acks are transmitted
    asynchronously to the receipt of data packets and so may be transmitted
    out of order.  As a result, they can race with idle acks.

 2. Serial numbers are allocated by the rxrpc connection and not the call
    and as such may wrap independently if multiple channels are in use.

In any case, what matters is whether the ack packet provides new
information relating to the bounds of the window (the firstPacket and
previousPacket in the ACK data).

Fix this by discarding packets that appear to wind back the window bounds
rather than on serial number procession.

Fixes: 298bc15b20 ("rxrpc: Only take the rwind and mtu values from latest ACK")
Signed-off-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 16:57:23 -07:00
David Howells
39ce675575 rxrpc: Trace received connection aborts
Trace received calls that are aborted due to a connection abort, typically
because of authentication failure.  Without this, connection aborts don't
show up in the trace log.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 16:57:23 -07:00
Marc Dionne
8e8715aaa9 rxrpc: Allow errors to be returned from rxrpc_queue_packet()
Change rxrpc_queue_packet()'s signature so that it can return any error
code it may encounter when trying to send the packet.

This allows the caller to eventually do something in case of error - though
it should be noted that the packet has been queued and a resend is
scheduled.

Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 16:57:23 -07:00
Marc Dionne
4611da30d6 rxrpc: Make rxrpc_kernel_check_life() indicate if call completed
Make rxrpc_kernel_check_life() pass back the life counter through the
argument list and return true if the call has not yet completed.

Suggested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 16:57:23 -07:00
Marc Dionne
56d282d9db rxrpc: Clear socket error
When an ICMP or ICMPV6 error is received, the error will be attached
to the socket (sk_err) and the report function will get called.
Clear any pending error here by calling sock_error().

This would cause the following attempt to use the socket to fail with
the error code stored by the ICMP error, resulting in unexpected errors
with various side effects depending on the context.

Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jonathan Billings <jsbillin@umich.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 16:57:23 -07:00
Karsten Graul
7a62725a50 net/smc: improve smc_conn_create reason codes
Rework smc_conn_create() to always return a valid DECLINE reason code.
This removes the need to translate the return codes on 4 different
places and allows to easily add more detailed return codes by changing
smc_conn_create() only.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:50:56 -07:00
Karsten Graul
9aa68d298c net/smc: improve smc_listen_work reason codes
Rework smc_listen_work() to provide improved reason codes when an
SMC connection is declined. This allows better debugging on user side.
This also adds 3 more detailed reason codes in smc_clc.h to indicate
what type of device was not found (ism or rdma or both), or if ism
cannot talk to the peer.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:50:56 -07:00
Karsten Graul
228bae05be net/smc: code cleanup smc_listen_work
In smc_listen_work() the variables rc and reason_code are defined which
have the same meaning. Eliminate reason_code in favor of the shorter
name rc. No functional changes.
Rename the functions smc_check_ism() and smc_check_rdma() into
smc_find_ism_device() and smc_find_rdma_device() to make there purpose
more clear. No functional changes.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:50:56 -07:00
Karsten Graul
fba7e8ef51 net/smc: cleanup of get vlan id
The vlan_id of the underlying CLC socket was retrieved two times
during processing of the listen handshaking. Change this to get the
vlan id one time in connect and in listen processing, and reuse the id.
And add a new CLC DECLINE return code for the case when the retrieval
of the vlan id failed.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:50:56 -07:00
Karsten Graul
bc36d2fc93 net/smc: consolidate function parameters
During initialization of an SMC socket a lot of function parameters need
to get passed down the function call path. Consolidate the parameters
in a helper struct so there are less enough parameters to get all passed
by register.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:50:56 -07:00
Karsten Graul
598866974c net/smc: check for ip prefix and subnet
The check for a matching ip prefix and subnet was only done for SMC-R
in smc_listen_rdma_check() but not when an SMC-D connection was
possible. Rename the function into smc_listen_prfx_check() and move its
call to a place where it is called for both SMC variants.
And add a new CLC DECLINE reason for the case when the IP prefix or
subnet check fails so the reason for the failing SMC connection can be
found out more easily.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:50:56 -07:00
Karsten Graul
4ada81fddf net/smc: fallback to TCP after connect problems
Correct the CLC decline reason codes for internal problems to not have
the sign bit set, negative reason codes are interpreted as not eligible
for TCP fallback.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:50:56 -07:00
Ursula Braun
50717a37db net/smc: nonblocking connect rework
For nonblocking sockets move the kernel_connect() from the connect
worker into the initial smc_connect part to return kernel_connect()
errors other than -EINPROGRESS to user space.

Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:50:56 -07:00
Tetsuo Handa
bddc028a4f udpv6: Check address length before reading address family
KMSAN will complain if valid address length passed to udpv6_pre_connect()
is shorter than sizeof("struct sockaddr"->sa_family) bytes.

(This patch is bogus if it is guaranteed that udpv6_pre_connect() is
always called after checking "struct sockaddr"->sa_family. In that case,
we want a comment why we don't need to check valid address length here.)

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:25:03 -07:00
Tetsuo Handa
ba024f2574 bpf: Check address length before reading address family
KMSAN will complain if valid address length passed to bpf_bind() is
shorter than sizeof("struct sockaddr"->sa_family) bytes.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:25:03 -07:00
Tetsuo Handa
c68e747d0a llc: Check address length before reading address field
KMSAN will complain if valid address length passed to bind() is shorter
than sizeof(struct sockaddr_llc) bytes.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:25:03 -07:00
Tetsuo Handa
bd7d46ddca Bluetooth: Check address length before reading address field
KMSAN will complain if valid address length passed to bind() is shorter
than sizeof(struct sockaddr_sco) bytes.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:25:03 -07:00
Tetsuo Handa
a9107a14a9 rxrpc: Check address length before reading srx_service field
KMSAN will complain if valid address length passed to bind() is shorter
than sizeof(struct sockaddr_rxrpc) bytes.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:25:03 -07:00
Tetsuo Handa
d852be8477 net: netlink: Check address length before reading groups field
KMSAN will complain if valid address length passed to bind() is shorter
than sizeof(struct sockaddr_nl) bytes.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:25:03 -07:00
Tetsuo Handa
175f7c1f01 sctp: Check address length before reading address family
KMSAN will complain if valid address length passed to connect() is shorter
than sizeof("struct sockaddr"->sa_family) bytes.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:25:03 -07:00
Tetsuo Handa
dd3ac9a684 net/rds: Check address length before reading address family
syzbot is reporting uninitialized value at rds_connect() [1] and
rds_bind() [2]. This is because syzbot is passing ulen == 0 whereas
these functions expect that it is safe to access sockaddr->family field
in order to determine minimal address length for validation.

[1] https://syzkaller.appspot.com/bug?id=f4e61c010416c1e6f0fa3ffe247561b60a50ad71
[2] https://syzkaller.appspot.com/bug?id=a4bf9e41b7e055c3823fdcd83e8c58ca7270e38f

Reported-by: syzbot <syzbot+0049bebbf3042dbd2e8f@syzkaller.appspotmail.com>
Reported-by: syzbot <syzbot+915c9f99f3dbc4bd6cd1@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 10:25:03 -07:00
David Miller
013b96ec64 sctp: Pass sk_buff_head explicitly to sctp_ulpq_tail_event().
Now the SKB list implementation assumption can be removed.

And now that we know that the list head is always non-NULL
we can remove the code blocks dealing with that as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 21:33:31 -07:00
David Miller
178ca044aa sctp: Make sctp_enqueue_event tak an skb list.
Pass this, instead of an event.  Then everything trickles down and we
always have events a non-empty list.

Then we needs a list creating stub to place into .enqueue_event for sctp_stream_interleave_1.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 21:33:31 -07:00
David Miller
5e8f641db6 sctp: Use helper for sctp_ulpq_tail_event() when hooked up to ->enqueue_event
This way we can make sure events sent this way to
sctp_ulpq_tail_event() are on a list as well.  Now all such code paths
are fully covered.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 21:33:31 -07:00
David Miller
925b937422 sctp: Always pass skbs on a list to sctp_ulpq_tail_event().
This way we can simplify the logic and remove assumptions
about the implementation of skb lists.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 21:33:31 -07:00
David Miller
0eff105243 sctp: Remove superfluous test in sctp_ulpq_reasm_drain().
Inside the loop, we always start with event non-NULL.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 21:33:31 -07:00
Vlad Buslov
9994677c96 net: sched: flower: fix filter net reference counting
Fix net reference counting in fl_change() and remove redundant call to
tcf_exts_get_net() from __fl_delete(). __fl_put() already tries to get net
before releasing exts and deallocating a filter, so this code caused flower
classifier to obtain net twice per filter that is being deleted.

Implementation of __fl_delete() called tcf_exts_get_net() to pass its
result as 'async' flag to fl_mask_put(). However, 'async' flag is redundant
and only complicates fl_mask_put() implementation. This functionality seems
to be copied from filter cleanup code, where it was added by Cong with
following explanation:

    This patchset tries to fix the race between call_rcu() and
    cleanup_net() again. Without holding the netns refcnt the
    tc_action_net_exit() in netns workqueue could be called before
    filter destroy works in tc filter workqueue. This patchset
    moves the netns refcnt from tc actions to tcf_exts, without
    breaking per-netns tc actions.

This doesn't apply to flower mask, which doesn't call any tc action code
during cleanup. Simplify fl_mask_put() by removing the flag parameter and
always use tcf_queue_work() to free mask objects.

Fixes: 061775583e ("net: sched: flower: introduce reference counting for filters")
Fixes: 1f17f7742e ("net: sched: flower: insert filter to ht before offloading it to hw")
Fixes: 05cd271fd6 ("cls_flower: Support multiple masks per priority")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 21:32:27 -07:00
Eric Dumazet
e305845096 dctcp: more accurate tracking of packets delivery
After commit e21db6f69a ("tcp: track total bytes delivered with ECN CE marks")
core TCP stack does a very good job tracking ECN signals.

The "sender's best estimate of CE information" Yuchung mentioned in his
patch is indeed the best we can do.

DCTCP can use tp->delivered_ce and tp->delivered to not duplicate the logic,
and use the existing best estimate.

This solves some problems, since current DCTCP logic does not deal with losses
and/or GRO or ack aggregation very well.

This also removes a dubious use of inet_csk(sk)->icsk_ack.rcv_mss
(this should have been tp->mss_cache), and a 64 bit divide.

Finally, we can see that the DCTCP logic, calling dctcp_update_alpha() for
every ACK could be done differently, calling it only once per RTT.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Abdul Kabbani <akabbani@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 21:31:03 -07:00
David S. Miller
bb23581b9b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-04-12

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Improve BPF verifier scalability for large programs through two
   optimizations: i) remove verifier states that are not useful in pruning,
   ii) stop walking parentage chain once first LIVE_READ is seen. Combined
   gives approx 20x speedup. Increase limits for accepting large programs
   under root, and add various stress tests, from Alexei.

2) Implement global data support in BPF. This enables static global variables
   for .data, .rodata and .bss sections to be properly handled which allows
   for more natural program development. This also opens up the possibility
   to optimize program workflow by compiling ELFs only once and later only
   rewriting section data before reload, from Daniel and with test cases and
   libbpf refactoring from Joe.

3) Add config option to generate BTF type info for vmlinux as part of the
   kernel build process. DWARF debug info is converted via pahole to BTF.
   Latter relies on libbpf and makes use of BTF deduplication algorithm which
   results in 100x savings compared to DWARF data. Resulting .BTF section is
   typically about 2MB in size, from Andrii.

4) Add BPF verifier support for stack access with variable offset from
   helpers and add various test cases along with it, from Andrey.

5) Extend bpf_skb_adjust_room() growth BPF helper to mark inner MAC header
   so that L2 encapsulation can be used for tc tunnels, from Alan.

6) Add support for input __sk_buff context in BPF_PROG_TEST_RUN so that
   users can define a subset of allowed __sk_buff fields that get fed into
   the test program, from Stanislav.

7) Add bpf fs multi-dimensional array tests for BTF test suite and fix up
   various UBSAN warnings in bpftool, from Yonghong.

8) Generate a pkg-config file for libbpf, from Luca.

9) Dump program's BTF id in bpftool, from Prashant.

10) libbpf fix to use smaller BPF log buffer size for AF_XDP's XDP
    program, from Magnus.

11) kallsyms related fixes for the case when symbols are not present in
    BPF selftests and samples, from Daniel
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 17:00:05 -07:00
Florian Westphal
223fd0adfa bridge: broute: make broute a real ebtables table
This makes broute a normal ebtables table, hooking at PREROUTING.
The broute hook is removed.

It uses skb->cb to signal to bridge rx handler that the skb should be
routed instead of being bridged.

This change is backwards compatible with ebtables as no userspace visible
parts are changed.

This means we can also remove the !ops test in ebt_register_table,
it was only there for broute table sake.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-12 01:47:50 +02:00
Florian Westphal
971502d77f bridge: netfilter: unroll NF_HOOK helper in bridge input path
Replace NF_HOOK() based invocation of the netfilter hooks with a private
copy of nf_hook_slow().

This copy has one difference: it can return the rx handler value expected
by the stack, i.e. RX_HANDLER_CONSUMED or RX_HANDLER_PASS.

This is needed by the next patch to invoke the ebtables
"broute" table via the standard netfilter hooks rather than the custom
"br_should_route_hook" indirection that is used now.

When the skb is to be "brouted", we must return RX_HANDLER_PASS from the
bridge rx input handler, but there is no way to indicate this via
NF_HOOK(), unless perhaps by some hack such as exposing bridge_cb in the
netfilter core or a percpu flag.

  text    data     bss     dec   filename
  3369      56       0    3425   net/bridge/br_input.o.before
  3458      40       0    3498   net/bridge/br_input.o.after

This allows removal of the "br_should_route_hook" in the next patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-12 01:47:39 +02:00
Florian Westphal
f12064d1b4 bridge: reduce size of input cb to 16 bytes
Reduce size of br_input_skb_cb from 24 to 16 bytes by
using bitfield for those values that can only be 0 or 1.

igmp is the igmp type value, so it needs to be at least u8.

Furthermore, the bridge currently relies on step-by-step initialization
of br_input_skb_cb fields as the skb passes through the stack.

Explicitly zero out the bridge input cb instead, this avoids having to
review/validate that no BR_INPUT_SKB_CB(skb)->foo test can see a
'random' value from previous protocol cb.

AFAICS all current fields are always set up before they are read again,
so this is not a bug fix.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-12 01:47:27 +02:00
Stanislav Fomichev
947e8b595b bpf: explicitly prohibit ctx_{in, out} in non-skb BPF_PROG_TEST_RUN
This should allow us later to extend BPF_PROG_TEST_RUN for non-skb case
and be sure that nobody is erroneously setting ctx_{in,out}.

Fixes: b0b9395d86 ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-12 00:53:00 +02:00
David Ahern
0b34eb0043 ipv6: Refactor __ip6_route_redirect
Move the nexthop evaluation of a fib entry to a helper that can be
leveraged for each fib6_nh in a multipath nexthop object.

In the move, 'continue' statements means the helper returns false
(loop should continue) and 'break' means return true (found the entry
of interest).

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:07 -07:00
David Ahern
0c59d00675 ipv6: Refactor rt6_device_match
Move the device and gateway checks in the fib6_next loop to a helper
that can be called per fib6_nh entry.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:07 -07:00
David Ahern
d83009d462 ipv6: Move fib6_multipath_select down in ip6_pol_route
Move the siblings and fib6_multipath_select after the null entry check
since a null entry can not have siblings.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:06 -07:00
David Ahern
af52a52cba ipv6: Be smarter with null_entry handling in ip6_pol_route_lookup
Clean up the fib6_null_entry handling in ip6_pol_route_lookup.
rt6_device_match can return fib6_null_entry, but fib6_multipath_select
can not. Consolidate the fib6_null_entry handling and on the final
null_entry check set rt and goto out - no need to defer to a second
check after rt6_find_cached_rt.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:06 -07:00
David Ahern
30c15f0338 ipv6: Refactor find_rr_leaf
find_rr_leaf has 3 loops over fib_entries calling find_match. The loops
are very similar with differences in start point and whether the metric
is evaluated:
    1. start at rr_head, no extra loop compare, check fib metric
    2. start at leaf, compare rt against rr_head, check metric
    3. start at cont (potential saved point from earlier loops), no
       extra loop compare, no metric check

Create 1 loop that is called 3 different times. This will make a
later change with multipath nexthop objects much simpler.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:06 -07:00
David Ahern
28679ed104 ipv6: Refactor find_match
find_match primarily needs a fib6_nh (and fib6_flags which it passes
through to rt6_score_route). Move fib6_check_expired up to the call
sites so find_match is only called for relevant entries. Remove the
match argument which is mostly a pass through and use the return
boolean to decide if match gets set in the call sites.

The end result is a helper that can be called per fib6_nh struct
which is needed once fib entries reference nexthop objects that
have more than one fib6_nh.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:06 -07:00
David Ahern
702cea5685 ipv6: Pass fib6_nh and flags to rt6_score_route
rt6_score_route only needs the fib6_flags and nexthop data. Change
it accordingly. Allows re-use later for nexthop based fib6_nh.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:06 -07:00
David Ahern
cc3a86c802 ipv6: Change rt6_probe to take a fib6_nh
rt6_probe sends probes for gateways in a nexthop. As such it really
depends on a fib6_nh, not a fib entry. Move last_probe to fib6_nh and
update rt6_probe to a fib6_nh struct.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:06 -07:00
David Ahern
6e1809a564 ipv6: Remove rt6_check_dev
rt6_check_dev is a simpler helper with only 1 caller. Fold the code
into rt6_score_route.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:06 -07:00
David Ahern
1ba9a89517 ipv6: Only call rt6_check_neigh for nexthop with gateway
Change rt6_check_neigh to take a fib6_nh instead of a fib entry.
Move the check on fib_flags and whether the nexthop has a gateway
up to the one caller.

Remove the inline from the definition as well. Not necessary.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:06 -07:00
Colin Ian King
62720b12d2 dns: remove redundant zero length namelen check
The zero namelen check is redundant as it has already been checked
for zero at the start of the function.  Remove the redundant check.

Addresses-Coverity: ("Logically Dead Code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:01:08 -07:00
YueHaibing
d3706566ae net: netrom: Fix error cleanup path of nr_proto_init
Syzkaller report this:

BUG: unable to handle kernel paging request at fffffbfff830524b
PGD 237fe8067 P4D 237fe8067 PUD 237e64067 PMD 1c9716067 PTE 0
Oops: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 4465 Comm: syz-executor.0 Not tainted 5.0.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:__list_add_valid+0x21/0xe0 lib/list_debug.c:23
Code: 8b 0c 24 e9 17 fd ff ff 90 55 48 89 fd 48 8d 7a 08 53 48 89 d3 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 48 83 ec 08 <80> 3c 02 00 0f 85 8b 00 00 00 48 8b 53 08 48 39 f2 75 35 48 89 f2
RSP: 0018:ffff8881ea2278d0 EFLAGS: 00010282
RAX: dffffc0000000000 RBX: ffffffffc1829250 RCX: 1ffff1103d444ef4
RDX: 1ffffffff830524b RSI: ffffffff85659300 RDI: ffffffffc1829258
RBP: ffffffffc1879250 R08: fffffbfff0acb269 R09: fffffbfff0acb269
R10: ffff8881ea2278f0 R11: fffffbfff0acb268 R12: ffffffffc1829250
R13: dffffc0000000000 R14: 0000000000000008 R15: ffffffffc187c830
FS:  00007fe0361df700(0000) GS:ffff8881f7300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff830524b CR3: 00000001eb39a001 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 __list_add include/linux/list.h:60 [inline]
 list_add include/linux/list.h:79 [inline]
 proto_register+0x444/0x8f0 net/core/sock.c:3375
 nr_proto_init+0x73/0x4b3 [netrom]
 ? 0xffffffffc1628000
 ? 0xffffffffc1628000
 do_one_initcall+0xbc/0x47d init/main.c:887
 do_init_module+0x1b5/0x547 kernel/module.c:3456
 load_module+0x6405/0x8c10 kernel/module.c:3804
 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898
 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x462e99
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fe0361dec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
RDX: 0000000000000000 RSI: 0000000020000100 RDI: 0000000000000003
RBP: 00007fe0361dec70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0361df6bc
R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
Modules linked in: netrom(+) ax25 fcrypt pcbc af_alg arizona_ldo1 v4l2_common videodev media v4l2_dv_timings hdlc ide_cd_mod snd_soc_sigmadsp_regmap snd_soc_sigmadsp intel_spi_platform intel_spi mtd spi_nor snd_usbmidi_lib usbcore lcd ti_ads7950 hi6421_regulator snd_soc_kbl_rt5663_max98927 snd_soc_hdac_hdmi snd_hda_ext_core snd_hda_core snd_soc_rt5663 snd_soc_core snd_pcm_dmaengine snd_compress snd_soc_rl6231 mac80211 rtc_rc5t583 spi_slave_time leds_pwm hid_gt683r hid industrialio_triggered_buffer kfifo_buf industrialio ir_kbd_i2c rc_core led_class_flash dwc_xlgmac snd_ymfpci gameport snd_mpu401_uart snd_rawmidi snd_ac97_codec snd_pcm ac97_bus snd_opl3_lib snd_timer snd_seq_device snd_hwdep snd soundcore iptable_security iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan
 bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun joydev mousedev ppdev tpm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ide_pci_generic piix aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ide_core psmouse input_leds i2c_piix4 serio_raw intel_agp intel_gtt ata_generic agpgart pata_acpi parport_pc rtc_cmos parport floppy sch_fq_codel ip_tables x_tables sha1_ssse3 sha1_generic ipv6 [last unloaded: rxrpc]
Dumping ftrace buffer:
   (ftrace buffer empty)
CR2: fffffbfff830524b
---[ end trace 039ab24b305c4b19 ]---

If nr_proto_init failed, it may forget to call proto_unregister,
tiggering this issue.This patch rearrange code of nr_proto_init
to avoid such issues.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 13:59:49 -07:00
Alan Maguire
58dfc900fa bpf: add layer 2 encap support to bpf_skb_adjust_room
commit 868d523535 ("bpf: add bpf_skb_adjust_room encap flags")
introduced support to bpf_skb_adjust_room for GSO-friendly GRE
and UDP encapsulation.

For GSO to work for skbs, the inner headers (mac and network) need to
be marked.  For L3 encapsulation using bpf_skb_adjust_room, the mac
and network headers are identical.  Here we provide a way of specifying
the inner mac header length for cases where L2 encap is desired.  Such
an approach can support encapsulated ethernet headers, MPLS headers etc.
For example to convert from a packet of form [eth][ip][tcp] to
[eth][ip][udp][inner mac][ip][tcp], something like the following could
be done:

	headroom = sizeof(iph) + sizeof(struct udphdr) + inner_maclen;

	ret = bpf_skb_adjust_room(skb, headroom, BPF_ADJ_ROOM_MAC,
				  BPF_F_ADJ_ROOM_ENCAP_L4_UDP |
				  BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
				  BPF_F_ADJ_ROOM_ENCAP_L2(inner_maclen));

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-11 22:50:56 +02:00
Jon Maloy
909620ff72 tipc: use standard write_lock & unlock functions when creating node
In the function tipc_node_create() we protect the peer capability field
by using the node rw_lock. However, we access the lock directly instead
of using the dedicated functions for this, as we do everywhere else in
node.c. This cosmetic spot is fixed here.

Fixes: 40999f11ce ("tipc: make link capability update thread safe")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 13:42:35 -07:00
Stanislav Fomichev
c695865c5c bpf: fix missing bpf_check_uarg_tail_zero in BPF_PROG_TEST_RUN
Commit b0b9395d86 ("bpf: support input __sk_buff context in
BPF_PROG_TEST_RUN") started using bpf_check_uarg_tail_zero in
BPF_PROG_TEST_RUN. However, bpf_check_uarg_tail_zero is not defined
for !CONFIG_BPF_SYSCALL:

net/bpf/test_run.c: In function ‘bpf_ctx_init’:
net/bpf/test_run.c:142:9: error: implicit declaration of function ‘bpf_check_uarg_tail_zero’ [-Werror=implicit-function-declaration]
   err = bpf_check_uarg_tail_zero(data_in, max_size, size);
         ^~~~~~~~~~~~~~~~~~~~~~~~

Let's not build net/bpf/test_run.c when CONFIG_BPF_SYSCALL is not set.

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: b0b9395d86 ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-11 21:50:20 +02:00
Trond Myklebust
af6b61d7ef Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
This reverts commit 009a82f643.

The ability to optimise here relies on compiler being able to optimise
away tail calls to avoid stack overflows. Unfortunately, we are seeing
reports of problems, so let's just revert.

Reported-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-04-11 15:41:14 -04:00
Chuck Lever
e1ede312f1 xprtrdma: Fix helper that drains the transport
We want to drain only the RQ first. Otherwise the transport can
deadlock on ->close if there are outstanding Send completions.

Fixes: 6d2d0ee27c ("xprtrdma: Replace rpcrdma_receive_wq ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-04-11 15:23:48 -04:00
Florian Westphal
610a43149c netfilter: nf_nat_masquerade: unify ipv4/6 notifier registration
Only reason for having two different register functions was because of
ipt_MASQUERADE and ip6t_MASQUERADE being two different modules.

Previous patch merged those into xt_MASQUERADE, so we can merge this too.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11 20:59:34 +02:00
Florian Westphal
adf82accc5 netfilter: x_tables: merge ip and ipv6 masquerade modules
No need to have separate modules for this.
before:
 text    data   bss    dec  filename
 2038    1168     0   3206  net/ipv4/netfilter/ipt_MASQUERADE.ko
 1526    1024     0   2550  net/ipv6/netfilter/ip6t_MASQUERADE.ko
after:
 text    data   bss    dec  filename
 2521    1296     0   3817  net/netfilter/xt_MASQUERADE.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11 20:59:29 +02:00
Florian Westphal
bf8981a2aa netfilter: nf_nat: merge ip/ip6 masquerade headers
Both are now implemented by nf_nat_masquerade.c, so no need to keep
different headers.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11 20:59:21 +02:00
Vlad Buslov
9e35552ae1 net: sched: flower: use correct ht function to prevent duplicates
Implementation of function rhashtable_insert_fast() check if its internal
helper function __rhashtable_insert_fast() returns non-NULL pointer and
seemingly return -EEXIST in such case. However, since
__rhashtable_insert_fast() is called with NULL key pointer, it never
actually checks for duplicates, which means that -EEXIST is never returned
to the user. Use rhashtable_lookup_insert_fast() hash table API instead. In
order to verify that it works as expected and prevent the problem from
happening in future, extend tc-tests with new test that verifies that no
new filters with existing key can be inserted to flower classifier.

Fixes: 1f17f7742e ("net: sched: flower: insert filter to ht before offloading it to hw")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 11:33:06 -07:00
Guillaume Nault
ecce39ec10 netns: read NETNSA_NSID as s32 attribute in rtnl_net_getid()
NETNSA_NSID is signed. Use nla_get_s32() to avoid confusion.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 11:26:27 -07:00
Nikolay Aleksandrov
c5b493ce19 net: bridge: multicast: use rcu to access port list from br_multicast_start_querier
br_multicast_start_querier() walks over the port list but it can be
called from a timer with only multicast_lock held which doesn't protect
the port list, so use RCU to walk over it.

Fixes: c83b8fab06 ("bridge: Restart queries when last querier expires")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 11:13:51 -07:00
Ursula Braun
f61bca58f6 net/smc: move unhash before release of clcsock
Commit <26d92e951fe0>
("net/smc: move unhash as early as possible in smc_release()")
fixes one occurrence in the smc code, but the same pattern exists
in other places. This patch covers the remaining occurrences and
makes sure, the unhash operation is done before the smc->clcsock is
released. This avoids a potential use-after-free in smc_diag_dump().

Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 11:04:08 -07:00
Karsten Graul
8ef659f1a8 net/smc: fix return code from FLUSH command
The FLUSH command is used to empty the pnet table. No return code is
expected from the command. Commit a9d8b0b1e3d6 added namespace support
for the pnet table and changed the FLUSH command processing to call
smc_pnet_remove_by_pnetid() to remove the pnet entries. This function
returns -ENOENT when no entry was deleted, which is now the return code
of the FLUSH command. As a result the FLUSH command will return an error
when the pnet table is already empty.
Restore the expected behavior and let FLUSH always return 0.

Fixes: a9d8b0b1e3d6 ("net/smc: add pnet table namespace support")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 11:04:08 -07:00
Ursula Braun
07603b2308 net/smc: propagate file from SMC to TCP socket
fcntl(fd, F_SETOWN, getpid()) selects the recipient of SIGURG signals
that are delivered when out-of-band data arrives on socket fd.
If an SMC socket program makes use of such an fcntl() call, it fails
in case of fallback to TCP-mode. In case of fallback the traffic is
processed with the internal TCP socket. Propagating field "file" from the
SMC socket to the internal TCP socket fixes the issue.

Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 11:04:08 -07:00
Kangjie Lu
e183d4e414 net/smc: fix a NULL pointer dereference
In case alloc_ordered_workqueue fails, the fix returns NULL
to avoid NULL pointer dereference.

Signed-off-by: Kangjie Lu <kjlu@umn.edu>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 11:04:08 -07:00
Karsten Graul
fd57770dd1 net/smc: wait for pending work before clcsock release_sock
When the clcsock is already released using sock_release() and a pending
smc_listen_work accesses the clcsock than that will fail. Solve this
by canceling and waiting for the work to complete first. Because the
work holds the sock_lock it must make sure that the lock is not hold
before the new helper smc_clcsock_release() is invoked. And before the
smc_listen_work starts working check if the parent listen socket is
still valid, otherwise stop the work early.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 11:04:08 -07:00
Stanislav Fomichev
b0b9395d86 bpf: support input __sk_buff context in BPF_PROG_TEST_RUN
Add new set of arguments to bpf_attr for BPF_PROG_TEST_RUN:
* ctx_in/ctx_size_in - input context
* ctx_out/ctx_size_out - output context

The intended use case is to pass some meta data to the test runs that
operate on skb (this has being brought up on recent LPC).

For programs that use bpf_prog_test_run_skb, support __sk_buff input and
output. Initially, from input __sk_buff, copy _only_ cb and priority into
skb, all other non-zero fields are prohibited (with EINVAL).
If the user has set ctx_out/ctx_size_out, copy the potentially modified
__sk_buff back to the userspace.

We require all fields of input __sk_buff except the ones we explicitly
support to be set to zero. The expectation is that in the future we might
add support for more fields and we want to fail explicitly if the user
runs the program on the kernel where we don't yet support them.

The API is intentionally vague (i.e. we don't explicitly add __sk_buff
to bpf_attr, but ctx_in) to potentially let other test_run types use
this interface in the future (this can be xdp_md for xdp types for
example).

v4:
  * don't copy more than allowed in bpf_ctx_init [Martin]

v3:
  * handle case where ctx_in is NULL, but ctx_out is not [Martin]
  * convert size==0 checks to ptr==NULL checks and add some extra ptr
    checks [Martin]

v2:
  * Addressed comments from Martin Lau

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-11 10:21:40 +02:00
Lorenzo Bianconi
526bb57a6a net: fou: remove redundant code in gue_udp_recv
Remove not useful protocol version check in gue_udp_recv since just
gue version 0 can hit that code. Moreover remove duplicated hdrlen
computation

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 00:08:51 -07:00
Lorenzo Bianconi
988dc4a9a3 net: fou: do not use guehdr after iptunnel_pull_offloads in gue_udp_recv
gue tunnels run iptunnel_pull_offloads on received skbs. This can
determine a possible use-after-free accessing guehdr pointer since
the packet will be 'uncloned' running pskb_expand_head if it is a
cloned gso skb (e.g if the packet has been sent though a veth device)

Fixes: a09a4c8dd1 ("tunnels: Remove encapsulation offloads on decap")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 23:02:23 -07:00
Hoang Le
d1841533e5 tipc: missing entries in name table of publications
When binding multiple services with specific type 1Ki, 2Ki..,
this leads to some entries in the name table of publications
missing when listed out via 'tipc name show'.

The problem is at identify zero last_type conditional provided
via netlink. The first is initial 'type' when starting name table
dummping. The second is continuously with zero type (node state
service type). Then, lookup function failure to finding node state
service type in next iteration.

To solve this, adding more conditional to marked as dirty type and
lookup correct service type for the next iteration instead of select
the first service as initial 'type' zero.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 22:58:09 -07:00
Simon Horman
c9d52f2169 fou: correct spelling of encapsulation
Correct spelling of encapsulation.
Found by inspection.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 22:47:40 -07:00
Si-Wei Liu
8065a779f1 failover: allow name change on IFF_UP slave interfaces
When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.

As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.

It's less risky to lift up the rename restriction on failover slave
which is already UP. Although it's possible this change may potentially
break userspace component (most likely configuration scripts or
management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to listen for the rename
events on failover slaves. Userspace component interacting with slaves
is expected to be changed to operate on failover master interface
instead, as the failover slave is dynamic in nature which may come and
go at any point.  The goal is to make the role of failover slaves less
relevant, and userspace components should only deal with failover master
in the long run.

Fixes: 30c8bd5aa8 ("net: Introduce generic failover module")
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 22:12:26 -07:00
Leandro Dorileo
e0a7683d30 net/sched: cbs: fix port_rate miscalculation
The Credit Based Shaper heavily depends on link speed to calculate
the scheduling credits, we can't properly calculate the credits if the
device has failed to report the link speed.

In that case we can't dequeue packets assuming a wrong port rate that will
result into an inconsistent credit distribution.

This patch makes sure we fail to dequeue case:

1) __ethtool_get_link_ksettings() reports error or 2) the ethernet driver
failed to set the ksettings' speed value (setting link speed to
SPEED_UNKNOWN).

Additionally we properly re calculate the port rate whenever the link speed
is changed.

Fixes: 3d0bd028ff ("net/sched: Add support for HW offloading for CBS")
Signed-off-by: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Reviewed-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 19:27:43 -07:00
Leandro Dorileo
7b9eba7ba0 net/sched: taprio: fix picos_per_byte miscalculation
The Time Aware Priority Scheduler is heavily dependent to link speed,
it relies on it to calculate transmission bytes per cycle, we can't
properly calculate the so called budget if the device has failed
to report the link speed.

In that case we can't dequeue packets assuming a wrong budget.
This patch makes sure we fail to dequeue case:

1) __ethtool_get_link_ksettings() reports error or 2) the ethernet
driver failed to set the ksettings' speed value (setting link speed
to SPEED_UNKNOWN).

Additionally we re calculate the budget whenever the link speed is
changed.

Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Signed-off-by: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Reviewed-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 19:27:43 -07:00
Jakub Kicinski
903f1a1877 net/tls: fix build without CONFIG_TLS_DEVICE
buildbot noticed that TLS_HW is not defined if CONFIG_TLS_DEVICE=n.
Wrap the cleanup branch into an ifdef, tls_device_free_resources_tx()
wouldn't be compiled either in this case.

Fixes: 35b71a34ad ("net/tls: don't leak partially sent record in device mode")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 17:23:26 -07:00
Jakub Kicinski
93e2125477 net: strparser: fix comment
Fix comment.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 14:12:45 -07:00
Jakub Kicinski
4a9c2e3746 net: strparser: partially revert "strparser: Call skb_unclone conditionally"
This reverts the first part of commit 4e485d06bb ("strparser: Call
skb_unclone conditionally").  To build a message with multiple
fragments we need our own root of frag_list.  We can't simply
use the frag_list of orig_skb, because it will lead to linking
all orig_skbs together creating very long frag chains, and causing
stack overflow on kfree_skb() (which is called recursively on
the frag_lists).

BUG: stack guard page was hit at 00000000d40fad41 (stack is 0000000029dde9f4..000000008cce03d5)
kernel stack overflow (double-fault): 0000 [#1] PREEMPT SMP
RIP: 0010:free_one_page+0x2b/0x490

Call Trace:
  __free_pages_ok+0x143/0x2c0
  skb_release_data+0x8e/0x140
  ? skb_release_data+0xad/0x140
  kfree_skb+0x32/0xb0

  [...]

  skb_release_data+0xad/0x140
  ? skb_release_data+0xad/0x140
  kfree_skb+0x32/0xb0
  skb_release_data+0xad/0x140
  ? skb_release_data+0xad/0x140
  kfree_skb+0x32/0xb0
  skb_release_data+0xad/0x140
  ? skb_release_data+0xad/0x140
  kfree_skb+0x32/0xb0
  skb_release_data+0xad/0x140
  ? skb_release_data+0xad/0x140
  kfree_skb+0x32/0xb0
  skb_release_data+0xad/0x140
  __kfree_skb+0xe/0x20
  tcp_disconnect+0xd6/0x4d0
  tcp_close+0xf4/0x430
  ? tcp_check_oom+0xf0/0xf0
  tls_sk_proto_close+0xe4/0x1e0 [tls]
  inet_release+0x36/0x60
  __sock_release+0x37/0xa0
  sock_close+0x11/0x20
  __fput+0xa2/0x1d0
  task_work_run+0x89/0xb0
  exit_to_usermode_loop+0x9a/0xa0
  do_syscall_64+0xc0/0xf0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Let's leave the second unclone conditional, as I'm not entirely
sure what is its purpose :)

Fixes: 4e485d06bb ("strparser: Call skb_unclone conditionally")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 13:07:02 -07:00
Jakub Kicinski
35b71a34ad net/tls: don't leak partially sent record in device mode
David reports that tls triggers warnings related to
sk->sk_forward_alloc not being zero at destruction time:

WARNING: CPU: 5 PID: 6831 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110
WARNING: CPU: 5 PID: 6831 at net/ipv4/af_inet.c:160 inet_sock_destruct+0x15b/0x170

When sender fills up the write buffer and dies from
SIGPIPE.  This is due to the device implementation
not cleaning up the partially_sent_record.

This is because commit a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
moved the partial record cleanup to the SW-only path.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 13:07:02 -07:00
Jakub Kicinski
5a03bc73ab net/tls: fix the IV leaks
Commit f66de3ee2c ("net/tls: Split conf to rx + tx") made
freeing of IV and record sequence number conditional to SW
path only, but commit e8f6979981 ("net/tls: Add generic NIC
offload infrastructure") also allocates that state for the
device offload configuration.  Remember to free it.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 13:07:02 -07:00
David Ahern
d73f80f921 ipv4: Handle RTA_GATEWAY set to 0
Govindarajulu reported a regression with Network Manager which sends an
RTA_GATEWAY attribute with the address set to 0. Fixup the handling of
RTA_GATEWAY to only set fc_gw_family if the gateway address is actually
set.

Fixes: f35b794b3b ("ipv4: Prepare fib_config for IPv6 gateway")
Reported-by: Govindarajulu Varadarajan <govind.varadar@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 12:39:25 -07:00
Paolo Abeni
73eb628ddf Revert: "net: sched: put back q.qlen into a single location"
This revert commit 46b1c18f9d ("net: sched: put back q.qlen into
a single location").
After the previous patch, when a NOLOCK qdisc is enslaved to a
locking qdisc it switches to global stats accounting. As a consequence,
when a classful qdisc accesses directly a child qdisc's qlen, such
qdisc is not doing per CPU accounting and qlen value is consistent.

In the control path nobody uses directly qlen since commit
e5f0e8f8e4 ("net: sched: introduce and use qdisc tree flush/purge
helpers"), so we can remove the contented atomic ops from the
datapath.

v1 -> v2:
 - complete the qdisc_qstats_atomic_qlen_dec() ->
   qdisc_qstats_cpu_qlen_dec() replacement, fix build issue
 - more descriptive commit message

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 12:20:46 -07:00
Paolo Abeni
8a53e616de net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too
Since stats updating is always consistent with TCQ_F_CPUSTATS flag,
we can disable it at qdisc creation time flipping such bit.

In my experiments, if the NOLOCK flag is cleared, per CPU stats
accounting does not give any measurable performance gain, but it
waste some memory.

Let's clear TCQ_F_CPUSTATS together with NOLOCK, when enslaving
a NOLOCK qdisc to 'lock' one.

Use stats update helper inside pfifo_fast, to cope correctly with
TCQ_F_CPUSTATS flag change.

As a side effect, q.qlen value for any child qdiscs is always
consistent for all lock classfull qdiscs.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 12:20:46 -07:00
Paolo Abeni
9c01c9f1f2 net: sched: always do stats accounting according to TCQ_F_CPUSTATS
The core sched implementation checks independently for NOLOCK flag
to acquire/release the root spin lock and for qdisc_is_percpu_stats()
to account per CPU values in many places.

This change update the last few places checking the TCQ_F_NOLOCK to
do per CPU stats accounting according to qdisc_is_percpu_stats()
value.

The above allows to clean dev_requeue_skb() implementation a bit
and makes stats update always consistent with a single flag.

v1 -> v2:
 - do not move qdisc_is_empty definition, fix build issue

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 12:20:46 -07:00
Paolo Abeni
b0a231a26d net: caif: avoid using qdisc_qlen()
Such helper does not cope correctly with NOLOCK qdiscs.
In the following patches we will move back qlen to per CPU
values for such qdiscs, so qdisc_qlen_sum() is not an option,
too.
Instead, use qlen only for lock qdiscs, and always set
flow off for NOLOCK qdiscs with a not empty tx queue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 12:20:46 -07:00
Johannes Berg
e9f33a8fee mac80211: fix RX STBC override byte order
The original patch neglected to take byte order conversions
into account, fix that.

Fixes: d9bb410888 ("mac80211: allow overriding HT STBC capabilities")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-10 14:13:54 +02:00
David S. Miller
c03fd0171b Various fixes:
* iTXQ fixes from Felix
  * tracing fix - increase message length
  * fix SW_CRYPTO_CONTROL enforcement
  * WMM rule handling for regdomain intersection
  * max_interfaces in hwsim - reported by syzbot
  * clear private data in some more commands
  * a clang compiler warning fix
 
 I added a patch with two new (unused) macros for
 rate-limited printing to simplify getting the users
 into the tree.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlyshYoACgkQB8qZga/f
 l8Q1EQ//YO53GM06uvgdOlQkoF3D3uSLtPbJyxWcZCaJXjJ6jOOZsSrZZBLn7avf
 i80UziS34ep0zdRHyVmGe6Q/VqufWUnzd6KBNdINEDdepOw6jFlBe9Ue+7mGwBnl
 bqzJJ+rxQj1wVqMKEQMtznqCEX58ULiDKcPUPTG2Szw7moqMoTVaw7u2d3eX5mTS
 Ka/rdsA/V9v4VOkBi//JqYmlZjdq83el8r4rshI2YbNdxwljTchK5YDsJreDawzF
 X7pDjAou2EAskot2Pahz1qAEew0XUnSliTDRVTsgcsellnUHFsEGXytXQ7sp8Myy
 kpGuBdeBqJuPWwKTbHfxiQb1VudpBWuMB2x/rotszgzrqSoVWot5j6/UyDSvMPyC
 U/drsGJrN1iF/mYTn6G/r8sKPIaCxN7h02Z75IV7LBokGmz9Yt7srEjrOKt0uuK+
 l5nb/YQNcN37lkREa6zLzec0f3wDzoqw1GWL0IuQyb+cFC2+6q5iFQ8dyXnu3FdQ
 1cSJGJdnyG3bM0b42WrOcEOFt3EKmhb4UeuusoxkOd4aKuhBXRjVEzM+X6ADqTdR
 mWvk7bEfYD0UMsqhtvrYBdBfquPuy3S4QzA0IMzxjk8bhCug233KpOPcLgVFjlzh
 g7wCgfahl0TRhH7R1D3hnRUtcdsjifoa75XMAaDnWfnR9PiN6uI=
 =2EV9
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2019-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Various fixes:
 * iTXQ fixes from Felix
 * tracing fix - increase message length
 * fix SW_CRYPTO_CONTROL enforcement
 * WMM rule handling for regdomain intersection
 * max_interfaces in hwsim - reported by syzbot
 * clear private data in some more commands
 * a clang compiler warning fix

I added a patch with two new (unused) macros for
rate-limited printing to simplify getting the users
into the tree.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-09 10:57:10 -07:00
Paul E. McKenney
bee58fe346 net/ipv4/netfilter: Update comment from call_rcu_bh() to call_rcu()
The RCU flavors have been consolidated, so this commit replaces a
comment's mention of call_rcu_bh() with call_rcu().

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: <netfilter-devel@vger.kernel.org>
Cc: <coreteam@netfilter.org>
Cc: <netdev@vger.kernel.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-09 08:06:41 -07:00
Sakari Ailus
d75f773c86 treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
%pF and %pf are functionally equivalent to %pS and %ps conversion
specifiers. The former are deprecated, therefore switch the current users
to use the preferred variant.

The changes have been produced by the following command:

	git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
	while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

And verifying the result.

Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: drbd-dev@lists.linbit.com
Cc: linux-block@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: linux-pci@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-btrfs@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: linux-mm@kvack.org
Cc: ceph-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-04-09 14:19:06 +02:00
Alexander Wetzel
78ad234152 mac80211: Honor SW_CRYPTO_CONTROL for unicast keys in AP VLAN mode
Restore SW_CRYPTO_CONTROL operation on AP_VLAN interfaces for unicast
keys, the original override was intended to be done for group keys as
those are treated specially by mac80211 and would always have been
rejected.

Now the situation is that AP_VLAN support must be enabled by the driver
if it can support it (meaning it can support software crypto GTK TX).

Thus, also simplify the code - if we get here with AP_VLAN and non-
pairwise key, software crypto must be used (driver doesn't know about
the interface) and can be used (driver must've advertised AP_VLAN if
it also uses SW_CRYPTO_CONTROL).

Fixes: db3bdcb9c3 ("mac80211: allow AP_VLAN operation on crypto controlled devices")
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-09 13:40:32 +02:00
David S. Miller
310655b07a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-04-08 23:39:36 -07:00
Lorenzo Bianconi
2a3cabae45 net: ip6_gre: fix possible use-after-free in ip6erspan_rcv
erspan_v6 tunnels run __iptunnel_pull_header on received skbs to remove
erspan header. This can determine a possible use-after-free accessing
pkt_md pointer in ip6erspan_rcv since the packet will be 'uncloned'
running pskb_expand_head if it is a cloned gso skb (e.g if the packet has
been sent though a veth device). Fix it resetting pkt_md pointer after
__iptunnel_pull_header

Fixes: 1d7e2ed22f ("net: erspan: refactor existing erspan code")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 16:16:47 -07:00
Lorenzo Bianconi
492b67e28e net: ip_gre: fix possible use-after-free in erspan_rcv
erspan tunnels run __iptunnel_pull_header on received skbs to remove
gre and erspan headers. This can determine a possible use-after-free
accessing pkt_md pointer in erspan_rcv since the packet will be 'uncloned'
running pskb_expand_head if it is a cloned gso skb (e.g if the packet has
been sent though a veth device). Fix it resetting pkt_md pointer after
__iptunnel_pull_header

Fixes: 1d7e2ed22f ("net: erspan: refactor existing erspan code")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 16:16:47 -07:00
David Ahern
d15662682d ipv4: Allow ipv6 gateway with ipv4 routes
Add support for RTA_VIA and allow an IPv6 nexthop for v4 routes:
   $ ip ro add 172.16.1.0/24 via inet6 2001:db8::1 dev eth0
   $ ip ro ls
   ...
   172.16.1.0/24 via inet6 2001:db8::1 dev eth0

For convenience and simplicity, userspace can use RTA_VIA to specify
AF_INET or AF_INET6 gateway.

The common fib_nexthop_info dump function compares the gateway address
family to the nh_common family to know if the gateway should be encoded
as RTA_VIA or RTA_GATEWAY.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
19a9d136f1 ipv4: Flag fib_info with a fib_nh using IPv6 gateway
Until support is added to the offload drivers, they need to be able to
reject routes with an IPv6 gateway. To that end add a flag to fib_info
that indicates if any fib_nh has a v6 gateway. The flag allows the drivers
to efficiently know the use of a v6 gateway without walking all fib_nh
tied to a fib_info each time a route is added.

Update mlxsw and rocker to reject the routes with extack message as to why.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
1a38c43d31 ipv4: Handle ipv6 gateway in fib_good_nh
Update fib_good_nh to handle an ipv6 gateway.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
619d182626 ipv4: Handle ipv6 gateway in fib_detect_death
Update fib_detect_death to handle an ipv6 gateway.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
6de9c0557e ipv4: Handle ipv6 gateway in ipv4_confirm_neigh
Update ipv4_confirm_neigh to handle an ipv6 gateway.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
6f5f68d05e bpf: Handle ipv6 gateway in bpf_ipv4_fib_lookup
Update bpf_ipv4_fib_lookup to handle an ipv6 gateway.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
5c9f7c1dfc ipv4: Add helpers for neigh lookup for nexthop
A common theme in the output path is looking up a neigh entry for a
nexthop, either the gateway in an rtable or a fallback to the daddr
in the skb:

        nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr);
        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
        if (unlikely(!neigh))
                neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);

To allow the nexthop to be an IPv6 address we need to consider the
family of the nexthop and then call __ipv{4,6}_neigh_lookup_noref based
on it.

To make this simpler, add a ip_neigh_gw4 helper similar to ip_neigh_gw6
added in an earlier patch which handles:

        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
        if (unlikely(!neigh))
                neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);

And then add a second one, ip_neigh_for_gw, that calls either
ip_neigh_gw4 or ip_neigh_gw6 based on the address family of the gateway.

Update the output paths in the VRF driver and core v4 code to use
ip_neigh_for_gw simplifying the family based lookup and making both
ready for a v6 nexthop.

ipv4_neigh_lookup has a different need - the potential to resolve a
passed in address in addition to any gateway in the rtable or skb. Since
this is a one-off, add ip_neigh_gw4 and ip_neigh_gw6 diectly. The
difference between __neigh_create used by the helpers and neigh_create
called by ipv4_neigh_lookup is taking a refcount, so add rcu_read_lock_bh
and bump the refcnt on the neigh entry.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
0353f28231 neighbor: Add skip_cache argument to neigh_output
A later patch allows an IPv6 gateway with an IPv4 route. The neighbor
entry will exist in the v6 ndisc table and the cached header will contain
the ipv6 protocol which is wrong for an IPv4 packet. For an IPv4 packet to
use the v6 neighbor entry, neigh_output needs to skip the cached header
and just use the output callback for the neigh entry.

A future patchset can look at expanding the hh_cache to handle 2
protocols. For now, IPv6 gateways with an IPv4 route will take the
extra overhead of generating the header.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
717a8f5b29 ipv4: Add fib_check_nh_v6_gw
Add helper to use fib6_nh_init to validate a nexthop spec with an IPv6
gateway.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
448d724819 ipv4: Refactor fib_check_nh
fib_check_nh is currently huge covering multiple uses cases - device only,
device + gateway, and device + gateway with ONLINK. The next patch adds
validation checks for IPv6 which only further complicates it. So, break
fib_check_nh into 2 helpers - one for gateway validation and one for device
only.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
a4ea5d43c8 ipv4: Add support to fib_config for IPv6 gateway
Add support for an IPv6 gateway to fib_config. Since a gateway is either
IPv4 or IPv6, make it a union with fc_gw4 where fc_gw_family decides
which address is in use. Update current checks on family and gw4 to
handle ipv6 as well.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
0f5f7d7bf6 ipv4: Add support to rtable for ipv6 gateway
Add support for an IPv6 gateway to rtable. Since a gateway is either
IPv4 or IPv6, make it a union with rt_gw4 where rt_gw_family decides
which address is in use.

When dumping the route data, encode an ipv6 nexthop using RTA_VIA.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
f35b794b3b ipv4: Prepare fib_config for IPv6 gateway
Similar to rtable, fib_config needs to allow the gateway to be either an
IPv4 or an IPv6 address. To that end, rename fc_gw to fc_gw4 to mean an
IPv4 address and add fc_gw_family. Checks on 'is a gateway set' are changed
to see if fc_gw_family is set. In the process prepare the code for a
fc_gw_family == AF_INET6.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
1550c17193 ipv4: Prepare rtable for IPv6 gateway
To allow the gateway to be either an IPv4 or IPv6 address, remove
rt_uses_gateway from rtable and replace with rt_gw_family. If
rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway
to rt_gw4 to represent the IPv4 version.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
bdf0046771 net: Replace nhc_has_gw with nhc_gw_family
Allow the gateway in a fib_nh_common to be from a different address
family than the outer fib{6}_nh. To that end, replace nhc_has_gw with
nhc_gw_family and update users of nhc_has_gw to check nhc_gw_family.
Now nhc_family is used to know if the nh_common is part of a fib_nh
or fib6_nh (used for container_of to get to route family specific data),
and nhc_gw_family represents the address family for the gateway.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
71df5777aa ipv6: Add neighbor helpers that use the ipv6 stub
Add ipv6 helpers to handle ndisc references via the stub. Update
bpf_ipv6_fib_lookup to use __ipv6_neigh_lookup_noref_stub instead of
the open code ___neigh_lookup_noref with the stub.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
1aefd3de7b ipv6: Add fib6_nh_init and release to stubs
Add fib6_nh_init and fib6_nh_release to ipv6_stubs. If fib6_nh_init fails,
callers should not invoke fib6_nh_release, so there is no reason to have
a dummy stub for the IPv6 is not enabled case.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
Florian Westphal
3b0a081db1 netfilter: make two functions static
They have no external callers anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:28:33 +02:00
Fernando Fernandez Mancera
22c7652cda netfilter: nft_osf: Add version option support
Add version option support to the nftables "osf" expression.

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:27:12 +02:00
Florian Westphal
63ce3940f3 netfilter: nft_redir: add inet support
allows to redirect both ipv4 and ipv6 with a single rule in an
inet nat table.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:03:01 +02:00
Florian Westphal
071657d2c3 netfilter: nft_masq: add inet support
This allows use of a single masquerade rule in nat inet family
to handle both ipv4 and ipv6.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:02:58 +02:00
Florian Westphal
4806e97572 netfilter: replace NF_NAT_NEEDED with IS_ENABLED(CONFIG_NF_NAT)
NF_NAT_NEEDED is true whenever nat support for either ipv4 or ipv6 is
enabled.  Now that the af-specific nat configuration switches have been
removed, IS_ENABLED(CONFIG_NF_NAT) has the same effect.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:02:52 +02:00
Florian Westphal
c1deb065cf netfilter: nf_tables: merge route type into core
very little code, so it really doesn't make sense to have extra
modules or even a kconfig knob for this.

Merge them and make functionality available unconditionally.
The merge makes inet family route support trivial, so add it
as well here.

Before:
   text	   data	    bss	    dec	    hex	filename
    835	    832	      0	   1667	    683 nft_chain_route_ipv4.ko
    870	    832	      0	   1702	    6a6	nft_chain_route_ipv6.ko
 111568	   2556	    529	 114653	  1bfdd	nf_tables.ko

After:
   text	   data	    bss	    dec	    hex	filename
 113133	   2556	    529	 116218	  1c5fa	nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:01:42 +02:00
Florian Westphal
d164385ec5 netfilter: nat: add inet family nat support
We need minimal support from the nat core for this, as we do not
want to register additional base hooks.

When an inet hook is registered, interally register ipv4 and ipv6
hooks for them and unregister those when inet hooks are removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:01:39 +02:00
Jacky Hu
84c0d5e96f ipvs: allow tunneling with gue encapsulation
ipip packets are blocked in some public cloud environments, this patch
allows gue encapsulation with the tunneling method, which would make
tunneling working in those environments.

Signed-off-by: Jacky Hu <hengqing.hu@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:57:59 +02:00
Pablo Neira Ayuso
227e1e4d0d netfilter: nf_flowtable: skip device lookup from interface index
Use the output device from the route that we cache in the flowtable
entry.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:57:02 +02:00
YueHaibing
b3dfee340a netfilter: nft_redir: Make nft_redir_dump static
Fix sparse warning:

net/netfilter/nft_redir.c:85:5:
 warning: symbol 'nft_redir_dump' was not declared. Should it be static?

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:57:02 +02:00
Colin Ian King
f7e840ee4d netfilter: nf_tables: remove unused parameter ctx
Function nf_tables_set_desc_parse parameter ctx is not being used
so remove it as it is redundant.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:56:52 +02:00
Li RongQing
3b15d09f7e time: Introduce jiffies64_to_msecs()
there is a similar helper in net/netfilter/nf_tables_api.c,
this maybe become a common request someday, so move it to
time.c

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:56:14 +02:00
Paolo Abeni
fd69c399c7 datagram: remove rendundant 'peeked' argument
After commit a297569fe0 ("net/udp: do not touch skb->peeked unless
really needed") the 'peeked' argument of __skb_try_recv_datagram()
and friends is always equal to !!'flags & MSG_PEEK'.

Since such argument is really a boolean info, and the callers have
already 'flags & MSG_PEEK' handy, we can remove it and clean-up the
code a bit.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 09:51:54 -07:00
Sunil Dutt
cb74e97758 cfg80211/nl80211: Offload OWE processing to user space in AP mode
This interface allows the host driver to offload OWE processing
to user space. This intends to support OWE (Opportunistic Wireless
Encryption) AKM by the drivers that implement SME but rely on the
user space for the cryptographic/OWE processing in AP mode. Such
drivers are not capable of processing/deriving the DH IE.

A new NL80211 command - NL80211_CMD_UPDATE_OWE_INFO is introduced
to send the request/event between the host driver and user space.

Driver shall provide the OWE info (MAC address and DH IE) of
the peer to user space for cryptographic processing of the DH IE
through the event. Accordingly, the user space shall update the
OWE info/DH IE to the driver.

Following is the sequence in AP mode for OWE authentication.

Driver passes the OWE info obtained from the peer in the
Association Request to the user space through the event
cfg80211_update_owe_info_event. User space shall process the
OWE info received and generate new OWE info. This OWE info is
passed to the driver through NL80211_CMD_UPDATE_OWE_INFO
request. Driver eventually uses this OWE info to send the
Association Response to the peer.

This OWE info in the command interface carries the IEs that include
PMKID of the peer if the PMKSA is still valid or an updated DH IE
for generating a new PMKSA with the peer.

Signed-off-by: Liangwei Dong <liangwei@codeaurora.org>
Signed-off-by: Sunil Dutt <usdutt@codeaurora.org>
Signed-off-by: Srinivas Dasari <dasaris@codeaurora.org>
[remove policy initialization - no longer exists]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-08 14:32:23 +02:00
Narayanraddi Masti
ab60633c71 mac80211: Add support for NL80211_STA_INFO_AIRTIME_LINK_METRIC
Add support for mesh airtime link metric attribute
NL80211_STA_INFO_AIRTIME_LINK_METRIC.

Signed-off-by: Narayanraddi Masti <team.nmasti@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-08 13:58:10 +02:00
vamsi krishna
1e1b11b6a1 nl80211/cfg80211: Specify band specific min RSSI thresholds with sched scan
This commit adds the support to specify the RSSI thresholds per
band for each match set. This enhances the current behavior which
specifies a single rssi_threshold across all the bands by
introducing the rssi_threshold_per_band. These per band rssi
thresholds are referred through NL80211_BAND_* (enum nl80211_band)
variables  as attribute types. Such attributes/values per each
band are nested through NL80211_ATTR_SCHED_SCAN_MIN_RSSI.
These band specific rssi thresholds shall take precedence over
the current rssi_thold per match set.
Drivers indicate this support through
%NL80211_EXT_FEATURE_SCHED_SCAN_BAND_SPECIFIC_RSSI_THOLD.
These per band rssi attributes/values does not specify
"default RSSI filter" as done by
NL80211_SCHED_SCAN_MATCH_ATTR_RSSI to stay backward compatible.
That said, these per band rssi values have to be specified for
the corresponding matchset.

Signed-off-by: vamsi krishna <vamsin@codeaurora.org>
Signed-off-by: Srinivas Dasari <dasaris@codeaurora.org>
[rebase on refactoring, add policy]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-08 13:44:40 +02:00
Johannes Berg
d39f3b4f33 nl80211: reindent some sched scan code
The sched scan code here is really deep - avoid one level
of indentation by short-circuiting the loop instead of
putting everything into the if block.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-08 13:40:47 +02:00
Felix Fietkau
2b4a669802 mac80211: make ieee80211_schedule_txq schedule empty TXQs
Currently there is no way for the driver to signal to mac80211 that it should
schedule a TXQ even if there are no packets on the mac80211 part of that queue.
This is problematic if the driver has an internal retry queue to deal with
software A-MPDU retry.

This patch changes the behavior of ieee80211_schedule_txq to always schedule
the queue, as its only user (ath9k) seems to expect such behavior already:
it calls this function on tx status and on powersave wakeup whenever its
internal retry queue is not empty.

Also add an extra argument to ieee80211_return_txq to get the same behavior.

This fixes an issue on ath9k where tx queues with packets to retry (and no
new packets in mac80211) would not get serviced.

Fixes: 89cea7493a ("ath9k: Switch to mac80211 TXQ scheduling and airtime APIs")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-08 13:31:31 +02:00
Florian Westphal
c9500d7b7d xfrm: store xfrm_mode directly, not its address
This structure is now only 4 bytes, so its more efficient
to cache a copy rather than its address.

No significant size difference in allmodconfig vmlinux.

With non-modular kernel that has all XFRM options enabled, this
series reduces vmlinux image size by ~11kb. All xfrm_mode
indirections are gone and all modes are built-in.

before (ipsec-next master):
    text      data      bss         dec   filename
21071494   7233140 11104324    39408958   vmlinux.master

after this series:
21066448   7226772 11104324    39397544   vmlinux.patched

With allmodconfig kernel, the size increase is only 362 bytes,
even all the xfrm config options removed in this series are
modular.

before:
    text      data     bss      dec   filename
15731286   6936912 4046908 26715106   vmlinux.master

after this series:
15731492   6937068  4046908  26715468 vmlinux

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:15:28 +02:00
Florian Westphal
4c145dce26 xfrm: make xfrm modes builtin
after previous changes, xfrm_mode contains no function pointers anymore
and all modules defining such struct contain no code except an init/exit
functions to register the xfrm_mode struct with the xfrm core.

Just place the xfrm modes core and remove the modules,
the run-time xfrm_mode register/unregister functionality is removed.

Before:

    text    data     bss      dec filename
    7523     200    2364    10087 net/xfrm/xfrm_input.o
   40003     628     440    41071 net/xfrm/xfrm_state.o
15730338 6937080 4046908 26714326 vmlinux

    7389     200    2364    9953  net/xfrm/xfrm_input.o
   40574     656     440   41670  net/xfrm/xfrm_state.o
15730084 6937068 4046908 26714060 vmlinux

The xfrm*_mode_{transport,tunnel,beet} modules are gone.

v2: replace CONFIG_INET6_XFRM_MODE_* IS_ENABLED guards with CONFIG_IPV6
    ones rather than removing them.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:15:17 +02:00
Florian Westphal
733a5fac2f xfrm: remove afinfo pointer from xfrm_mode
Adds an EXPORT_SYMBOL for afinfo_get_rcu, as it will now be called from
ipv6 in case of CONFIG_IPV6=m.

This change has virtually no effect on vmlinux size, but it reduces
afinfo size and allows followup patch to make xfrm modes const.

v2: mark if (afinfo) tests as likely (Sabrina)
    re-fetch afinfo according to inner_mode in xfrm_prepare_input().

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:15:09 +02:00
Florian Westphal
1de7083006 xfrm: remove output2 indirection from xfrm_mode
similar to previous patch: no external module dependencies,
so we can avoid the indirection by placing this in the core.

This change removes the last indirection from xfrm_mode and the
xfrm4|6_mode_{beet,tunnel}.c modules contain (almost) no code anymore.

Before:
   text    data     bss     dec     hex filename
   3957     136       0    4093     ffd net/xfrm/xfrm_output.o
    587      44       0     631     277 net/ipv4/xfrm4_mode_beet.o
    649      32       0     681     2a9 net/ipv4/xfrm4_mode_tunnel.o
    625      44       0     669     29d net/ipv6/xfrm6_mode_beet.o
    599      32       0     631     277 net/ipv6/xfrm6_mode_tunnel.o
After:
   text    data     bss     dec     hex filename
   5359     184       0    5543    15a7 net/xfrm/xfrm_output.o
    171      24       0     195      c3 net/ipv4/xfrm4_mode_beet.o
    171      24       0     195      c3 net/ipv4/xfrm4_mode_tunnel.o
    172      24       0     196      c4 net/ipv6/xfrm6_mode_beet.o
    172      24       0     196      c4 net/ipv6/xfrm6_mode_tunnel.o

v2: fold the *encap_add functions into xfrm*_prepare_output
    preserve (move) output2 comment (Sabrina)
    use x->outer_mode->encap, not inner
    fix a build breakage on ppc (kbuild robot)

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:15:02 +02:00
Florian Westphal
b3284df1c8 xfrm: remove input2 indirection from xfrm_mode
No external dependencies on any module, place this in the core.
Increase is about 1800 byte for xfrm_input.o.

The beet helpers get added to internal header, as they can be reused
from xfrm_output.c in the next patch (kernel contains several
copies of them in the xfrm{4,6}_mode_beet.c files).

Before:
   text    data     bss     dec filename
   5578     176    2364    8118 net/xfrm/xfrm_input.o
   1180      64       0    1244 net/ipv4/xfrm4_mode_beet.o
    171      40       0     211 net/ipv4/xfrm4_mode_transport.o
   1163      40       0    1203 net/ipv4/xfrm4_mode_tunnel.o
   1083      52       0    1135 net/ipv6/xfrm6_mode_beet.o
    172      40       0     212 net/ipv6/xfrm6_mode_ro.o
    172      40       0     212 net/ipv6/xfrm6_mode_transport.o
   1056      40       0    1096 net/ipv6/xfrm6_mode_tunnel.o

After:
   text    data     bss     dec filename
   7373     200    2364    9937 net/xfrm/xfrm_input.o
    587      44       0     631 net/ipv4/xfrm4_mode_beet.o
    171      32       0     203 net/ipv4/xfrm4_mode_transport.o
    649      32       0     681 net/ipv4/xfrm4_mode_tunnel.o
    625      44       0     669 net/ipv6/xfrm6_mode_beet.o
    172      32       0     204 net/ipv6/xfrm6_mode_ro.o
    172      32       0     204 net/ipv6/xfrm6_mode_transport.o
    599      32       0     631 net/ipv6/xfrm6_mode_tunnel.o

v2: pass inner_mode to xfrm_inner_mode_encap_remove to fix
    AF_UNSPEC selector breakage (bisected by Benedict Wong)

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:55 +02:00
Florian Westphal
7613b92b1a xfrm: remove gso_segment indirection from xfrm_mode
These functions are small and we only have versions for tunnel
and transport mode for ipv4 and ipv6 respectively.

Just place the 'transport or tunnel' conditional in the protocol
specific function instead of using an indirection.

Before:
    3226       12       0     3238   net/ipv4/esp4_offload.o
    7004      492       0     7496   net/ipv4/ip_vti.o
    3339       12       0     3351   net/ipv6/esp6_offload.o
   11294      460       0    11754   net/ipv6/ip6_vti.o
    1180       72       0     1252   net/ipv4/xfrm4_mode_beet.o
     428       48       0      476   net/ipv4/xfrm4_mode_transport.o
    1271       48       0     1319   net/ipv4/xfrm4_mode_tunnel.o
    1083       60       0     1143   net/ipv6/xfrm6_mode_beet.o
     172       48       0      220   net/ipv6/xfrm6_mode_ro.o
     429       48       0      477   net/ipv6/xfrm6_mode_transport.o
    1164       48       0     1212   net/ipv6/xfrm6_mode_tunnel.o
15730428  6937008 4046908 26714344   vmlinux

After:
    3461       12       0     3473   net/ipv4/esp4_offload.o
    7000      492       0     7492   net/ipv4/ip_vti.o
    3574       12       0     3586   net/ipv6/esp6_offload.o
   11295      460       0    11755   net/ipv6/ip6_vti.o
    1180       64       0     1244   net/ipv4/xfrm4_mode_beet.o
     171       40       0      211   net/ipv4/xfrm4_mode_transport.o
    1163       40       0     1203   net/ipv4/xfrm4_mode_tunnel.o
    1083       52       0     1135   net/ipv6/xfrm6_mode_beet.o
     172       40       0      212   net/ipv6/xfrm6_mode_ro.o
     172       40       0      212   net/ipv6/xfrm6_mode_transport.o
    1056       40       0     1096   net/ipv6/xfrm6_mode_tunnel.o
15730424  6937008 4046908 26714340   vmlinux

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:47 +02:00
Florian Westphal
303c5fab12 xfrm: remove xmit indirection from xfrm_mode
There are only two versions (tunnel and transport). The ip/ipv6 versions
are only differ in sizeof(iphdr) vs ipv6hdr.

Place this in the core and use x->outer_mode->encap type to call the
correct adjustment helper.

Before:
   text   data    bss     dec      filename
15730311  6937008 4046908 26714227 vmlinux

After:
15730428  6937008 4046908 26714344 vmlinux

(about 117 byte increase)

v2: use family from x->outer_mode, not inner

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:34 +02:00
Florian Westphal
0c620e97b3 xfrm: remove output indirection from xfrm_mode
Same is input indirection.  Only exception: we need to export
xfrm_outer_mode_output for pktgen.

Increases size of vmlinux by about 163 byte:
Before:
   text    data     bss     dec      filename
15730208  6936948 4046908 26714064   vmlinux

After:
15730311  6937008 4046908 26714227   vmlinux

xfrm_inner_extract_output has no more external callers, make it static.

v2: add IS_ENABLED(IPV6) guard in xfrm6_prepare_output
    add two missing breaks in xfrm_outer_mode_output (Sabrina Dubroca)
    add WARN_ON_ONCE for 'call AF_INET6 related output function, but
    CONFIG_IPV6=n' case.
    make xfrm_inner_extract_output static

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:28 +02:00
Florian Westphal
c2d305e510 xfrm: remove input indirection from xfrm_mode
No need for any indirection or abstraction here, both functions
are pretty much the same and quite small, they also have no external
dependencies.

xfrm_prepare_input can then be made static.

With allmodconfig build, size increase of vmlinux is 25 byte:

Before:
   text   data     bss     dec      filename
15730207  6936924 4046908 26714039  vmlinux

After:
15730208  6936948 4046908 26714064 vmlinux

v2: Fix INET_XFRM_MODE_TRANSPORT name in is-enabled test (Sabrina Dubroca)
    change copied comment to refer to transport and network header,
    not skb->{h,nh}, which don't exist anymore. (Sabrina)
    make xfrm_prepare_input static (Eyal Birger)

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:21 +02:00
Florian Westphal
b45714b164 xfrm: prefer family stored in xfrm_mode struct
Now that we have the family available directly in the
xfrm_mode struct, we can use that and avoid one extra dereference.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:12 +02:00
Florian Westphal
b262a69582 xfrm: place af number into xfrm_mode struct
This will be useful to know if we're supposed to decode ipv4 or ipv6.

While at it, make the unregister function return void, all module_exit
functions did just BUG(); there is never a point in doing error checks
if there is no way to handle such error.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:13:46 +02:00
Vlad Buslov
1f17f7742e net: sched: flower: insert filter to ht before offloading it to hw
John reports:

Recent refactoring of fl_change aims to use the classifier spinlock to
avoid the need for rtnl lock. In doing so, the fl_hw_replace_filer()
function was moved to before the lock is taken. This can create problems
for drivers if duplicate filters are created (commmon in ovs tc offload
due to filters being triggered by user-space matches).

Drivers registered for such filters will now receive multiple copies of
the same rule, each with a different cookie value. This means that the
drivers would need to do a full match field lookup to determine
duplicates, repeating work that will happen in flower __fl_lookup().
Currently, drivers do not expect to receive duplicate filters.

To fix this, verify that filter with same key is not present in flower
classifier hash table and insert the new filter to the flower hash table
before offloading it to hardware. Implement helper function
fl_ht_insert_unique() to atomically verify/insert a filter.

This change makes filter visible to fast path at the beginning of
fl_change() function, which means it can no longer be freed directly in
case of error. Refactor fl_change() error handling code to deallocate the
filter with rcu timeout.

Fixes: 620da48608 ("net: sched: flower: refactor fl_change")
Reported-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07 19:33:07 -07:00
NeilBrown
8f0db01800 rhashtable: use bit_spin_locks to protect hash bucket.
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
bucket pointer to lock the hash chain for that bucket.

The benefits of a bit spin_lock are:
 - no need to allocate a separate array of locks.
 - no need to have a configuration option to guide the
   choice of the size of this array
 - locking cost is often a single test-and-set in a cache line
   that will have to be loaded anyway.  When inserting at, or removing
   from, the head of the chain, the unlock is free - writing the new
   address in the bucket head implicitly clears the lock bit.
   For __rhashtable_insert_fast() we ensure this always happens
   when adding a new key.
 - even when lockings costs 2 updates (lock and unlock), they are
   in a cacheline that needs to be read anyway.

The cost of using a bit spin_lock is a little bit of code complexity,
which I think is quite manageable.

Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend of the same lock, one CPU can
easily be starved.  This is not a credible situation with rhashtable.
Multiple CPUs may want to repeatedly add or remove objects, but they
will typically do so at different buckets, so they will attempt to
acquire different locks.

As we have more bit-locks than we previously had spinlocks (by at
least a factor of two) we can expect slightly less contention to
go with the slightly better cache behavior and reduced memory
consumption.

To enhance type checking, a new struct is introduced to represent the
  pointer plus lock-bit
that is stored in the bucket-table.  This is "struct rhash_lock_head"
and is empty.  A pointer to this needs to be cast to either an
unsigned lock, or a "struct rhash_head *" to be useful.
Variables of this type are most often called "bkt".

Previously "pprev" would sometimes point to a bucket, and sometimes a
->next pointer in an rhash_head.  As these are now different types,
pprev is NULL when it would have pointed to the bucket. In that case,
'blk' is used, together with correct locking protocol.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-07 19:12:12 -07:00
Aaron Kramer
5150b45fd3 net: hsr: Fix node prune function for forget time expiry
HSR should forget nodes after configured node forget time expiry based
on HSR_NODE_FORGET_TIME. As part of hsr_prune_nodes(), code checks to
see if entries are to be flushed out if not heard for longer than forget
time. But currently hsr_prune_nodes() is called only once during device
creation. Restart the timer at the end of hsr_prune_nodes() so that
hsr_prune_nodes() gets called periodically and forgotten entries are
removed from node table.

Signed-off-by: Aaron Kramer <a-kramer@ti.com>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
fc4ecaeebd net: hsr: add debugfs support for display node list
This adds a debugfs interface to allow display the nodes learned
by the hsr master.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
0e7623bdf3 net: hsr: convert to SPDX identifier
Use SPDX-License-Identifier instead of a verbose license text.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
9f73c2bb46 net: hsr: add blank line after function declaration
Add a blank line after function declaration as suggested by
checkpatch.pl -f

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
b1b4aa9140 net: hsr: remove camel case usage in the code
Current driver code uses camel case in many places. This is
seen when ran checkpatch.pl -f on files under net/hsr. This
patch fixes the code to remove camel case usage.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
d131fcc690 net: hsr: add missing space around operator in code
This patch add missing space around operator in code. This is
seen when ran checkpatch.pl -f on files under net/hsr.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
0594778300 net: hsr: fix placement of logical operator in a multi-line statement
In a multi-line statement exceeding 80 characters, logical operator
should be at the end of a line instead of being at the start. This
is seen when ran checkpatch.pl -f on files under net/hsr. The change
is per suggestion from checkpatch.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
5fa9677803 net: hsr: remove unnecessary space after a cast
This patch removes unnecessary space after a cast. This is seen
when ran checkpatch.pl -f on files under net/hsr.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
05ca6e644d net: hsr: fix NULL checks in the code
This patch replaces all instance of NULL checks such as
    if (foo == NULL) with if (!foo)
Also
    if (foo != NULL) with if (foo)

This is seen when ran checkpatch.pl -f on files under net/hsr
and suggestion is to replace as above.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
0525fc069f net: hsr: fix lines that ends with a '('
This patch fixes function calls that ends with '(' in a line.
This is seen when ran checkpatch.pl -f option on files under
net/hsr.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
4fe25bd8c3 net: hsr: fix alignment issues in the code for functions
This patch fixes alignment issues in code for functions. This is
seen when ran checkpatch.pl -f option on files under net/hsr.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:21 -07:00
Murali Karicheri
5670342ced net: hsr: remove unnecessary paranthesis from the code
This patch fixes unnecessary paranthesis from the code. This is
seen when ran checkpatch.pl -f option on files under net/hsr.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:20 -07:00
Murali Karicheri
d4730775ed net: hsr: fix multiple blank lines in the code
This patch fixes multiple blank lines in the code. This is seen
when ran checkpatch.pl -f option for files under net/hsr

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:20 -07:00
Murali Karicheri
d595b85a6c net: hsr: fix lines exceeding 80 characters
This patch fixes lines exceeding 80 characters. This is seen
when ran checkpatch.pl with -f option for files under
net/hsr.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:32:20 -07:00
Colin Ian King
d1edc08555 tcp: remove redundant check on tskb
The non-null check on tskb is always false because it is in an else
path of a check on tskb and hence tskb is null in this code block.
This is check is therefore redundant and can be removed as well
as the label coalesc.

if (tsbk) {
        ...
} else {
        ...
        if (unlikely(!skb)) {
                if (tskb)       /* can never be true, redundant code */
                        goto coalesc;
                return;
        }
}

Addresses-Coverity: ("Logically dead code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 18:18:14 -07:00
Dan Carpenter
d7ee81ad09 NFC: nci: Add some bounds checking in nci_hci_cmd_received()
This is similar to commit 674d9de02a ("NFC: Fix possible memory
corruption when handling SHDLC I-Frame commands").

I'm not totally sure, but I think that commit description may have
overstated the danger.  I was under the impression that this data came
from the firmware?  If you can't trust your networking firmware, then
you're already in trouble.

Anyway, these days we add bounds checking where ever we can and we call
it kernel hardening.  Better safe than sorry.

Fixes: 11f54f2286 ("NFC: nci: Add HCI over NCI protocol support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 15:05:07 -07:00
NeilBrown
d58431eacb sunrpc: don't mark uninitialised items as VALID.
A recent commit added a call to cache_fresh_locked()
when an expired item was found.
The call sets the CACHE_VALID flag, so it is important
that the item actually is valid.
There are two ways it could be valid:
1/ If ->update has been called to fill in relevant content
2/ if CACHE_NEGATIVE is set, to say that content doesn't exist.

An expired item that is waiting for an update will be neither.
Setting CACHE_VALID will mean that a subsequent call to cache_put()
will be likely to dereference uninitialised pointers.

So we must make sure the item is valid, and we already have code to do
that in try_to_negate_entry().  This takes the hash lock and so cannot
be used directly, so take out the two lines that we need and use them.

Now cache_fresh_locked() is certain to be called only on
a valid item.

Cc: stable@kernel.org # 2.6.35
Fixes: 4ecd55ea07 ("sunrpc: fix cache_head leak due to queued request")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-04-05 19:57:24 -04:00
David S. Miller
f83f715195 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor comment merge conflict in mlx5.

Staging driver has a fixup due to the skb->xmit_more changes
in 'net-next', but was removed in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-05 14:14:19 -07:00
Linus Torvalds
0548740e53 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Several hash table refcount fixes in batman-adv, from Sven
    Eckelmann.

 2) Use after free in bpf_evict_inode(), from Daniel Borkmann.

 3) Fix mdio bus registration in ixgbe, from Ivan Vecera.

 4) Unbounded loop in __skb_try_recv_datagram(), from Paolo Abeni.

 5) ila rhashtable corruption fix from Herbert Xu.

 6) Don't allow upper-devices to be added to vrf devices, from Sabrina
    Dubroca.

 7) Add qmi_wwan device ID for Olicard 600, from Bjørn Mork.

 8) Don't leave skb->next poisoned in __netif_receive_skb_list_ptype,
    from Alexander Lobakin.

 9) Missing IDR checks in mlx5 driver, from Aditya Pakki.

10) Fix false connection termination in ktls, from Jakub Kicinski.

11) Work around some ASPM issues with r8169 by disabling rx interrupt
    coalescing on certain chips. From Heiner Kallweit.

12) Properly use per-cpu qstat values on NOLOCK qdiscs, from Paolo
    Abeni.

13) Fully initialize sockaddr_in structures in SCTP, from Xin Long.

14) Various BPF flow dissector fixes from Stanislav Fomichev.

15) Divide by zero in act_sample, from Davide Caratti.

16) Fix bridging multicast regression introduced by rhashtable
    conversion, from Nikolay Aleksandrov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
  ibmvnic: Fix completion structure initialization
  ipv6: sit: reset ip header pointer in ipip6_rcv
  net: bridge: always clear mcast matching struct on reports and leaves
  libcxgb: fix incorrect ppmax calculation
  vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2x
  sch_cake: Make sure we can write the IP header before changing DSCP bits
  sch_cake: Use tc_skb_protocol() helper for getting packet protocol
  tcp: Ensure DCTCP reacts to losses
  net/sched: act_sample: fix divide by zero in the traffic path
  net: thunderx: fix NULL pointer dereference in nicvf_open/nicvf_stop
  net: hns: Fix sparse: some warnings in HNS drivers
  net: hns: Fix WARNING when remove HNS driver with SMMU enabled
  net: hns: fix ICMP6 neighbor solicitation messages discard problem
  net: hns: Fix probabilistic memory overwrite when HNS driver initialized
  net: hns: Use NAPI_POLL_WEIGHT for hns driver
  net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()
  flow_dissector: rst'ify documentation
  ipv6: Fix dangling pointer when ipv6 fragment
  net-gro: Fix GRO flush when receiving a GSO packet.
  flow_dissector: document BPF flow dissector environment
  ...
2019-04-04 18:07:12 -10:00
Lorenzo Bianconi
bb9bd814eb ipv6: sit: reset ip header pointer in ipip6_rcv
ipip6 tunnels run iptunnel_pull_header on received skbs. This can
determine the following use-after-free accessing iph pointer since
the packet will be 'uncloned' running pskb_expand_head if it is a
cloned gso skb (e.g if the packet has been sent though a veth device)

[  706.369655] BUG: KASAN: use-after-free in ipip6_rcv+0x1678/0x16e0 [sit]
[  706.449056] Read of size 1 at addr ffffe01b6bd855f5 by task ksoftirqd/1/=
[  706.669494] Hardware name: HPE ProLiant m400 Server/ProLiant m400 Server, BIOS U02 08/19/2016
[  706.771839] Call trace:
[  706.801159]  dump_backtrace+0x0/0x2f8
[  706.845079]  show_stack+0x24/0x30
[  706.884833]  dump_stack+0xe0/0x11c
[  706.925629]  print_address_description+0x68/0x260
[  706.982070]  kasan_report+0x178/0x340
[  707.025995]  __asan_report_load1_noabort+0x30/0x40
[  707.083481]  ipip6_rcv+0x1678/0x16e0 [sit]
[  707.132623]  tunnel64_rcv+0xd4/0x200 [tunnel4]
[  707.185940]  ip_local_deliver_finish+0x3b8/0x988
[  707.241338]  ip_local_deliver+0x144/0x470
[  707.289436]  ip_rcv_finish+0x43c/0x14b0
[  707.335447]  ip_rcv+0x628/0x1138
[  707.374151]  __netif_receive_skb_core+0x1670/0x2600
[  707.432680]  __netif_receive_skb+0x28/0x190
[  707.482859]  process_backlog+0x1d0/0x610
[  707.529913]  net_rx_action+0x37c/0xf68
[  707.574882]  __do_softirq+0x288/0x1018
[  707.619852]  run_ksoftirqd+0x70/0xa8
[  707.662734]  smpboot_thread_fn+0x3a4/0x9e8
[  707.711875]  kthread+0x2c8/0x350
[  707.750583]  ret_from_fork+0x10/0x18

[  707.811302] Allocated by task 16982:
[  707.854182]  kasan_kmalloc.part.1+0x40/0x108
[  707.905405]  kasan_kmalloc+0xb4/0xc8
[  707.948291]  kasan_slab_alloc+0x14/0x20
[  707.994309]  __kmalloc_node_track_caller+0x158/0x5e0
[  708.053902]  __kmalloc_reserve.isra.8+0x54/0xe0
[  708.108280]  __alloc_skb+0xd8/0x400
[  708.150139]  sk_stream_alloc_skb+0xa4/0x638
[  708.200346]  tcp_sendmsg_locked+0x818/0x2b90
[  708.251581]  tcp_sendmsg+0x40/0x60
[  708.292376]  inet_sendmsg+0xf0/0x520
[  708.335259]  sock_sendmsg+0xac/0xf8
[  708.377096]  sock_write_iter+0x1c0/0x2c0
[  708.424154]  new_sync_write+0x358/0x4a8
[  708.470162]  __vfs_write+0xc4/0xf8
[  708.510950]  vfs_write+0x12c/0x3d0
[  708.551739]  ksys_write+0xcc/0x178
[  708.592533]  __arm64_sys_write+0x70/0xa0
[  708.639593]  el0_svc_handler+0x13c/0x298
[  708.686646]  el0_svc+0x8/0xc

[  708.739019] Freed by task 17:
[  708.774597]  __kasan_slab_free+0x114/0x228
[  708.823736]  kasan_slab_free+0x10/0x18
[  708.868703]  kfree+0x100/0x3d8
[  708.905320]  skb_free_head+0x7c/0x98
[  708.948204]  skb_release_data+0x320/0x490
[  708.996301]  pskb_expand_head+0x60c/0x970
[  709.044399]  __iptunnel_pull_header+0x3b8/0x5d0
[  709.098770]  ipip6_rcv+0x41c/0x16e0 [sit]
[  709.146873]  tunnel64_rcv+0xd4/0x200 [tunnel4]
[  709.200195]  ip_local_deliver_finish+0x3b8/0x988
[  709.255596]  ip_local_deliver+0x144/0x470
[  709.303692]  ip_rcv_finish+0x43c/0x14b0
[  709.349705]  ip_rcv+0x628/0x1138
[  709.388413]  __netif_receive_skb_core+0x1670/0x2600
[  709.446943]  __netif_receive_skb+0x28/0x190
[  709.497120]  process_backlog+0x1d0/0x610
[  709.544169]  net_rx_action+0x37c/0xf68
[  709.589131]  __do_softirq+0x288/0x1018

[  709.651938] The buggy address belongs to the object at ffffe01b6bd85580
                which belongs to the cache kmalloc-1024 of size 1024
[  709.804356] The buggy address is located 117 bytes inside of
                1024-byte region [ffffe01b6bd85580, ffffe01b6bd85980)
[  709.946340] The buggy address belongs to the page:
[  710.003824] page:ffff7ff806daf600 count:1 mapcount:0 mapping:ffffe01c4001f600 index:0x0
[  710.099914] flags: 0xfffff8000000100(slab)
[  710.149059] raw: 0fffff8000000100 dead000000000100 dead000000000200 ffffe01c4001f600
[  710.242011] raw: 0000000000000000 0000000000380038 00000001ffffffff 0000000000000000
[  710.334966] page dumped because: kasan: bad access detected

Fix it resetting iph pointer after iptunnel_pull_header

Fixes: a09a4c8dd1 ("tunnels: Remove encapsulation offloads on decap")
Tested-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 18:34:23 -07:00
Tuong Lien
58ee86b8c7 tipc: adapt link failover for new Gap-ACK algorithm
In commit 0ae955e2656d ("tipc: improve TIPC throughput by Gap ACK
blocks"), we enhance the link transmq by releasing as many packets as
possible with the multi-ACKs from peer node. This also means the queue
is now non-linear and the peer link deferdq becomes vital.

Whereas, in the case of link failover, all messages in the link transmq
need to be transmitted as tunnel messages in such a way that message
sequentiality and cardinality per sender is preserved. This requires us
to maintain the link deferdq somehow, so that when the tunnel messages
arrive, the inner user messages along with the ones in the deferdq will
be delivered to upper layer correctly.

The commit accomplishes this by defining a new queue in the TIPC link
structure to hold the old link deferdq when link failover happens and
process it upon receipt of tunnel messages.

Also, in the case of link syncing, the link deferdq will not be purged
to avoid unnecessary retransmissions that in the worst case will fail
because the packets might have been freed on the sending side.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 18:29:25 -07:00
Tuong Lien
382f598fb6 tipc: reduce duplicate packets for unicast traffic
For unicast transmission, the current NACK sending althorithm is over-
active that forces the sending side to retransmit a packet that is not
really lost but just arrived at the receiving side with some delay, or
even retransmit same packets that have already been retransmitted
before. As a result, many duplicates are observed also under normal
condition, ie. without packet loss.

One example case is: node1 transmits 1 2 3 4 10 5 6 7 8 9, when node2
receives packet #10, it puts into the deferdq. When the packet #5 comes
it sends NACK with gap [6 - 9]. However, shortly after that, when
packet #6 arrives, it pulls out packet #10 from the deferfq, but it is
still out of order, so it makes another NACK with gap [7 - 9] and so on
... Finally, node1 has to retransmit the packets 5 6 7 8 9 a number of
times, but in fact all the packets are not lost at all, so duplicates!

This commit reduces duplicates by changing the condition to send NACK,
also restricting the retransmissions on individual packets via a timer
of about 1ms. However, it also needs to say that too tricky condition
for NACKs or too long timeout value for retransmissions will result in
performance reducing! The criterias in this commit are found to be
effective for both the requirements to reduce duplicates but not affect
performance.

The tipc_link_rcv() is also improved to only dequeue skb from the link
deferdq if it is expected (ie. its seqno <= rcv_nxt).

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 18:29:25 -07:00
Tuong Lien
9195948fbf tipc: improve TIPC throughput by Gap ACK blocks
During unicast link transmission, it's observed very often that because
of one or a few lost/dis-ordered packets, the sending side will fastly
reach the send window limit and must wait for the packets to be arrived
at the receiving side or in the worst case, a retransmission must be
done first. The sending side cannot release a lot of subsequent packets
in its transmq even though all of them might have already been received
by the receiving side.
That is, one or two packets dis-ordered/lost and dozens of packets have
to wait, this obviously reduces the overall throughput!

This commit introduces an algorithm to overcome this by using "Gap ACK
blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
that describes the link deferdq where packets have been got by the
receiving side but with gaps, for example:

      link deferdq: [1 2 3 4      10 11      13 14 15       20]
--> Gap ACK blocks:       <4, 5>,   <11, 1>,      <15, 4>, <20, 0>

The Gap ACK blocks will be sent to the sending side along with the
traditional ACK or NACK message. Immediately when receiving the message
the sending side will now not only release from its transmq the packets
ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
be enqueued and transmitted.
In addition, the sending side can now do "multi-retransmissions"
according to the Gaps reported in the Gap ACK blocks.

The new algorithm as verified helps greatly improve the TIPC throughput
especially under packet loss condition.

So far, a maximum of 32 blocks is quite enough without any "Too few Gap
ACK blocks" reports with a 5.0% packet loss rate, however this number
can be increased in the furture if needed.

Also, the patch is backward compatible.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 18:29:25 -07:00
Nikolay Aleksandrov
e177163d36 net: bridge: mcast: remove unused br_ip_equal function
Since the mcast conversion to rhashtable this function has been unused, so
remove it.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:53:56 -07:00
Nikolay Aleksandrov
1515a63fc4 net: bridge: always clear mcast matching struct on reports and leaves
We need to be careful and always zero the whole br_ip struct when it is
used for matching since the rhashtable change. This patch fixes all the
places which didn't properly clear it which in turn might've caused
mismatches.

Thanks for the great bug report with reproducing steps and bisection.

Steps to reproduce (from the bug report):
ip link add br0 type bridge mcast_querier 1
ip link set br0 up

ip link add v2 type veth peer name v3
ip link set v2 master br0
ip link set v2 up
ip link set v3 up
ip addr add 3.0.0.2/24 dev v3

ip netns add test
ip link add v1 type veth peer name v1 netns test
ip link set v1 master br0
ip link set v1 up
ip -n test link set v1 up
ip -n test addr add 3.0.0.1/24 dev v1

# Multicast receiver
ip netns exec test socat
UDP4-RECVFROM:5588,ip-add-membership=224.224.224.224:3.0.0.1,fork -

# Multicast sender
echo hello | nc -u -s 3.0.0.2 224.224.224.224 5588

Reported-by: liam.mcbirnie@boeing.com
Fixes: 19e3a9c90c ("net: bridge: convert multicast to generic rhashtable")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:52:40 -07:00
Tilmans, Olivier (Nokia - BE/Antwerp)
f6fee16dbb tcp: Accept ECT on SYN in the presence of RFC8311
Linux currently disable ECN for incoming connections when the SYN
requests ECN and the IP header has ECT(0)/ECT(1) set, as some
networks were reportedly mangling the ToS byte, hence could later
trigger false congestion notifications.

RFC8311 §4.3 relaxes RFC3168's requirements such that ECT can be set
one TCP control packets (including SYNs). The main benefit of this
is the decreased probability of losing a SYN in a congested
ECN-capable network (i.e., it avoids the initial 1s timeout).
Additionally, this allows the development of newer TCP extensions,
such as AccECN.

This patch relaxes the previous check, by enabling ECN on incoming
connections using SYN+ECT if at least one bit of the reserved flags
of the TCP header is set. Such bit would indicate that the sender of
the SYN is using a newer TCP feature than what the host implements,
such as AccECN, and is thus implementing RFC8311. This enables
end-hosts not supporting such extensions to still negociate ECN, and
to have some of the benefits of using ECN on control packets.

Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Suggested-by: Bob Briscoe <research@bobbriscoe.net>
Cc: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:43:48 -07:00
Jiri Pirko
119c0b5721 net: devlink: add warning for ndo_get_port_parent_id set when not needed
Currently if the driver registers devlink port instance, he should set
the devlink port attributes as well. Then the devlink core is able to
obtain switch id itself, no need for driver to implement the ndo.
Once all drivers will implement devlink port registration, this ndo
should be removed. This warning guides new drivers to do things as
they should be done.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:42:36 -07:00
Jiri Pirko
15b04aceeb dsa: pass switch ID through devlink_port_attrs_set()
Pass the switch ID down the to devlink through devlink_port_attrs_set()
so it can be used by devlink_compat_switch_id_get(). Leave
ndo_get_port_parent_id implementation only for legacy.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:42:36 -07:00
Jiri Pirko
7e1146e8c1 net: devlink: introduce devlink_compat_switch_id_get() helper
Introduce devlink_compat_switch_id_get() helper which fills up switch_id
according to passed netdev pointer. Call it directly from
dev_get_port_parent_id() as a fallback when ndo_get_port_parent_id
is not defined for given netdev.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:42:36 -07:00
Jiri Pirko
bec5267cde net: devlink: extend port attrs for switch ID
Extend devlink_port_attrs_set() to pass switch ID for ports which are
part of switch and store it in port attrs. For other ports, this is
NULL.

Note that this allows the driver to group devlink ports into one or more
switches according to the actual topology.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:42:36 -07:00
Nikolay Aleksandrov
8dc350202d net: bridge: optimize backup_port fdb convergence
We can optimize the fdb convergence when a backup_port is present by not
immediately flushing the entries of the stopped port since traffic for
those entries will flow towards the backup_port.

There are 2 cases specifically that benefit most:
- when the stopped port comes up before the entries expire by themselves
- when there's an external entry refresh and they're kept while the
  backup_port is operating (e.g. mlag)

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:39:47 -07:00
Yuval Shaia
a0640e610f net: Remove inclusion of pci.h
This header is not in use - remove it.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:38:20 -07:00
Hoang Le
e1279ff7ae tipc: add NULL pointer check
skb somehow dequeued out of inputq before processing, it causes to
NULL pointer and kernel crashed.

Add checking skb valid before using.

Fixes: c55c8edafa ("tipc: smooth change between replicast and broadcast")
Reported-by: Tuong Lien Tong <tuong.t.lien@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:34:11 -07:00
John Hurley
95e27a4da6 net: sched: ensure tc flower reoffload takes filter ref
Recent changes to TC flower remove the requirement for rtnl lock when
accessing and modifying filters. Refcounts now ensure access and deletion
do not happen concurrently. However, the reoffload function which cycles
through all filters and replays them to registered hw drivers is not
protected.

Use the fl_get_next_filter() function to cycle the filters for reoffload
and ensure the ref taken by this function is put when done with each
filter.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:19:43 -07:00
Chris Leech
0a89eb92d8 vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2x
Way back in 3c9c36bced the
ndo_fcoe_get_wwn pointer was switched from depending on CONFIG_FCOE to
CONFIG_LIBFCOE in order to allow building FCoE support into the bnx2x
driver and used by bnx2fc without including the generic software fcoe
module.

But, FCoE is generally used over an 802.1q VLAN, and the implementation
of ndo_fcoe_get_wwn in the 8021q module was not similarly changed.  The
result is that if CONFIG_FCOE is disabled, then bnz2fc cannot make a
call to ndo_fcoe_get_wwn through the 8021q interface to the underlying
bnx2x interface.  The bnx2fc driver then falls back to a potentially
different mapping of Ethernet MAC to Fibre Channel WWN, creating an
incompatibility with the fabric and target configurations when compared
to the WWNs used by pre-boot firmware and differently-configured
kernels.

So make the conditional inclusion of FCoE code in 8021q match the
conditional inclusion in netdevice.h

Signed-off-by: Chris Leech <cleech@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:18:34 -07:00
David S. Miller
5ba5780117 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-04-04

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Batch of fixes to the existing BPF flow dissector API to support
   calling BPF programs from the eth_get_headlen context (support for
   latter is planned to be added in bpf-next), from Stanislav.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 13:30:55 -07:00
Toke Høiland-Jørgensen
c87b4ecdbe sch_cake: Make sure we can write the IP header before changing DSCP bits
There is not actually any guarantee that the IP headers are valid before we
access the DSCP bits of the packets. Fix this using the same approach taken
in sch_dsmark.

Reported-by: Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 10:55:59 -07:00
Toke Høiland-Jørgensen
b2100cc56f sch_cake: Use tc_skb_protocol() helper for getting packet protocol
We shouldn't be using skb->protocol directly as that will miss cases with
hardware-accelerated VLAN tags. Use the helper instead to get the right
protocol number.

Reported-by: Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 10:55:59 -07:00
Koen De Schepper
aecfde2310 tcp: Ensure DCTCP reacts to losses
RFC8257 §3.5 explicitly states that "A DCTCP sender MUST react to
loss episodes in the same way as conventional TCP".

Currently, Linux DCTCP performs no cwnd reduction when losses
are encountered. Optionally, the dctcp_clamp_alpha_on_loss resets
alpha to its maximal value if a RTO happens. This behavior
is sub-optimal for at least two reasons: i) it ignores losses
triggering fast retransmissions; and ii) it causes unnecessary large
cwnd reduction in the future if the loss was isolated as it resets
the historical term of DCTCP's alpha EWMA to its maximal value (i.e.,
denoting a total congestion). The second reason has an especially
noticeable effect when using DCTCP in high BDP environments, where
alpha normally stays at low values.

This patch replace the clamping of alpha by setting ssthresh to
half of cwnd for both fast retransmissions and RTOs, at most once
per RTT. Consequently, the dctcp_clamp_alpha_on_loss module parameter
has been removed.

The table below shows experimental results where we measured the
drop probability of a PIE AQM (not applying ECN marks) at a
bottleneck in the presence of a single TCP flow with either the
alpha-clamping option enabled or the cwnd halving proposed by this
patch. Results using reno or cubic are given for comparison.

                          |  Link   |   RTT    |    Drop
                 TCP CC   |  speed  | base+AQM | probability
        ==================|=========|==========|============
                    CUBIC |  40Mbps |  7+20ms  |    0.21%
                     RENO |         |          |    0.19%
        DCTCP-CLAMP-ALPHA |         |          |   25.80%
         DCTCP-HALVE-CWND |         |          |    0.22%
        ------------------|---------|----------|------------
                    CUBIC | 100Mbps |  7+20ms  |    0.03%
                     RENO |         |          |    0.02%
        DCTCP-CLAMP-ALPHA |         |          |   23.30%
         DCTCP-HALVE-CWND |         |          |    0.04%
        ------------------|---------|----------|------------
                    CUBIC | 800Mbps |   1+1ms  |    0.04%
                     RENO |         |          |    0.05%
        DCTCP-CLAMP-ALPHA |         |          |   18.70%
         DCTCP-HALVE-CWND |         |          |    0.06%

We see that, without halving its cwnd for all source of losses,
DCTCP drives the AQM to large drop probabilities in order to keep
the queue length under control (i.e., it repeatedly faces RTOs).
Instead, if DCTCP reacts to all source of losses, it can then be
controlled by the AQM using similar drop levels than cubic or reno.

Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Cc: Bob Briscoe <research@bobbriscoe.net>
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <borkmann@iogearbox.net>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 10:51:16 -07:00
Pablo Neira Ayuso
847d44efad net: bridge: update multicast stats from maybe_deliver()
Simplify this code by updating bridge multicast stats from
maybe_deliver().

Note that commit 6db6f0eae6 ("bridge: multicast to unicast"), in case
the port flag BR_MULTICAST_TO_UNICAST is set, never updates the previous
port pointer, therefore it is always going to be different from the
existing port in this deduplicated list iteration.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 10:49:27 -07:00
Pablo Neira Ayuso
942f146a63 net: use kfree_skb_list() from ip_do_fragment()
Just like 46cfd725c3 ("net: use kfree_skb_list() helper in more places").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 10:48:26 -07:00
Davide Caratti
fae2708174 net/sched: act_sample: fix divide by zero in the traffic path
the control path of 'sample' action does not validate the value of 'rate'
provided by the user, but then it uses it as divisor in the traffic path.
Validate it in tcf_sample_init(), and return -EINVAL with a proper extack
message in case that value is zero, to fix a splat with the script below:

 # tc f a dev test0 egress matchall action sample rate 0 group 1 index 2
 # tc -s a s action sample
 total acts 1

         action order 0: sample rate 1/0 group 1 pipe
          index 2 ref 1 bind 1 installed 19 sec used 19 sec
         Action statistics:
         Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
         backlog 0b 0p requeues 0
 # ping 192.0.2.1 -I test0 -c1 -q

 divide error: 0000 [#1] SMP PTI
 CPU: 1 PID: 6192 Comm: ping Not tainted 5.1.0-rc2.diag2+ #591
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:tcf_sample_act+0x9e/0x1e0 [act_sample]
 Code: 6a f1 85 c0 74 0d 80 3d 83 1a 00 00 00 0f 84 9c 00 00 00 4d 85 e4 0f 84 85 00 00 00 e8 9b d7 9c f1 44 8b 8b e0 00 00 00 31 d2 <41> f7 f1 85 d2 75 70 f6 85 83 00 00 00 10 48 8b 45 10 8b 88 08 01
 RSP: 0018:ffffae320190ba30 EFLAGS: 00010246
 RAX: 00000000b0677d21 RBX: ffff8af1ed9ec000 RCX: 0000000059a9fe49
 RDX: 0000000000000000 RSI: 000000000c7e33b7 RDI: ffff8af23daa0af0
 RBP: ffff8af1ee11b200 R08: 0000000074fcaf7e R09: 0000000000000000
 R10: 0000000000000050 R11: ffffffffb3088680 R12: ffff8af232307f80
 R13: 0000000000000003 R14: ffff8af1ed9ec000 R15: 0000000000000000
 FS:  00007fe9c6d2f740(0000) GS:ffff8af23da80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fff6772f000 CR3: 00000000746a2004 CR4: 00000000001606e0
 Call Trace:
  tcf_action_exec+0x7c/0x1c0
  tcf_classify+0x57/0x160
  __dev_queue_xmit+0x3dc/0xd10
  ip_finish_output2+0x257/0x6d0
  ip_output+0x75/0x280
  ip_send_skb+0x15/0x40
  raw_sendmsg+0xae3/0x1410
  sock_sendmsg+0x36/0x40
  __sys_sendto+0x10e/0x140
  __x64_sys_sendto+0x24/0x30
  do_syscall_64+0x60/0x210
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [...]
  Kernel panic - not syncing: Fatal exception in interrupt

Add a TDC selftest to document that 'rate' is now being validated.

Reported-by: Matteo Croce <mcroce@redhat.com>
Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Yotam Gigi <yotam.gi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 10:46:33 -07:00
David Ahern
c0a720770c ipv6: Flip to fib_nexthop_info
Export fib_nexthop_info and fib_add_nexthop for use by IPv6 code.
Remove rt6_nexthop_info and rt6_add_nexthop in favor of the IPv4
versions. Update fib_nexthop_info for IPv6 linkdown check and
RTA_GATEWAY for AF_INET6.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03 21:50:20 -07:00
David Ahern
c236419981 ipv4: Change fib_nexthop_info and fib_add_nexthop to take fib_nh_common
With the exception of the nexthop weight, the nexthop attributes used by
fib_nexthop_info and fib_add_nexthop come from the fib_nh_common struct.
Update both to use it and change fib_nexthop_info to check the family
as needed.

nexthop weight comes from the common struct for existing use cases, but
for nexthop groups the weight is outside of the fib_nh_common to allow
the same nexthop definition to be used in multiple groups with different
weights.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03 21:50:20 -07:00
David Ahern
b0f6019363 ipv4: Refactor nexthop attributes in fib_dump_info
Similar to ipv6, move addition of nexthop attributes to dump
message into helpers that are called for both single path and
multipath routes. Align the new helpers to the IPv6 variant
which most notably means computing the flags argument based on
settings in nh_flags.

The RTA_FLOW argument is unique to IPv4, so it is appended after
the new fib_nexthop_info helper. The intent of a later patch is to
make both fib_nexthop_info and fib_add_nexthop usable for both IPv4
and IPv6. This patch is stepping stone in that direction.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03 21:50:20 -07:00
David Ahern
eba618abac ipv4: Add fib_nh_common to fib_result
Most of the ipv4 code only needs data from fib_nh_common. Add
fib_nh_common selection to fib_result and update users to use it.

Right now, fib_nh_common in fib_result will point to a fib_nh struct
that is embedded within a fib_info:

        fib_info  --> fib_nh
                      fib_nh
                      ...
                      fib_nh
                        ^
    fib_result->nhc ----+

Later, nhc can point to a fib_nh within a nexthop struct:

        fib_info --> nexthop --> fib_nh
                                   ^
    fib_result->nhc ---------------+

or for a nexthop group:

        fib_info --> nexthop --> nexthop --> fib_nh
                                 nexthop --> fib_nh
                                 ...
                                 nexthop --> fib_nh
                                               ^
    fib_result->nhc ---------------------------+

In all cases nhsel within fib_result will point to which leg in the
multipath route is used.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03 21:50:20 -07:00
David Ahern
0af7e7c128 ipv4: Update fib_table_lookup tracepoint to take common nexthop
Update fib_table_lookup tracepoint to take a fib_nh_common struct and
dump the v6 gateway address if the nexthop uses it.

Over the years saddr has not proven useful and the output of the
tracepoint produces very long lines. Since saddr is not part of
fib_nh_common, drop it. If it needs to be added later, fib_nh which
contains saddr can be obtained from a fib_nh_common via container_of.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03 21:50:20 -07:00
Junwei Hu
ef0efcd3bd ipv6: Fix dangling pointer when ipv6 fragment
At the beginning of ip6_fragment func, the prevhdr pointer is
obtained in the ip6_find_1stfragopt func.
However, all the pointers pointing into skb header may change
when calling skb_checksum_help func with
skb->ip_summed = CHECKSUM_PARTIAL condition.
The prevhdr pointe will be dangling if it is not reloaded after
calling __skb_linearize func in skb_checksum_help func.

Here, I add a variable, nexthdr_offset, to evaluate the offset,
which does not changes even after calling __skb_linearize func.

Fixes: 405c92f7a5 ("ipv6: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Reported-by: Wenhao Zhang <zhangwenhao8@huawei.com>
Reported-by: syzbot+e8ce541d095e486074fc@syzkaller.appspotmail.com
Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03 21:42:20 -07:00
Steffen Klassert
0ab03f353d net-gro: Fix GRO flush when receiving a GSO packet.
Currently we may merge incorrectly a received GSO packet
or a packet with frag_list into a packet sitting in the
gro_hash list. skb_segment() may crash case because
the assumptions on the skb layout are not met.
The correct behaviour would be to flush the packet in the
gro_hash list and send the received GSO packet directly
afterwards. Commit d61d072e87 ("net-gro: avoid reorders")
sets NAPI_GRO_CB(skb)->flush in this case, but this is not
checked before merging. This patch makes sure to check this
flag and to not merge in that case.

Fixes: d61d072e87 ("net-gro: avoid reorders")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03 21:40:52 -07:00
Gustavo A. R. Silva
936ee65ffc rxrpc: Mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

This patch fixes the following warning:

net/rxrpc/local_object.c: In function ‘rxrpc_open_socket’:
net/rxrpc/local_object.c:175:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (ret < 0) {
      ^
net/rxrpc/local_object.c:184:2: note: here
  case AF_INET:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

Currently, GCC is expecting to find the fall-through annotations
at the very bottom of the case and on its own line. That's why
I had to add the annotation, although the intentional fall-through
is already mentioned in a few lines above.

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03 21:37:58 -07:00
Stanislav Fomichev
2ee7fba0d6 flow_dissector: allow access only to a subset of __sk_buff fields
Use whitelist instead of a blacklist and allow only a small set of
fields that might be relevant in the context of flow dissector:
  * data
  * data_end
  * flow_keys

This is required for the eth_get_headlen case where we have only a
chunk of data to dissect (i.e. trying to read the other skb fields
doesn't make sense).

Note, that it is a breaking API change! However, we've provided
flow_keys->n_proto as a substitute for skb->protocol; and there is
no need to manually handle skb->vlan_present. So even if we
break somebody, the migration is trivial. Unfortunately, we can't
support eth_get_headlen use-case without those breaking changes.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-03 16:49:48 +02:00
Stanislav Fomichev
b9e9c8599f flow_dissector: fix clamping of BPF flow_keys for non-zero nhoff
Don't allow BPF program to set flow_keys->nhoff to less than initial
value. We currently don't read the value afterwards in anything but
the tests, but it's still a good practice to return consistent
values to the test programs.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-03 16:49:48 +02:00
Stanislav Fomichev
822fe61795 net/flow_dissector: pass flow_keys->n_proto to BPF programs
This is a preparation for the next commit that would prohibit access to
the most fields of __sk_buff from the BPF programs.

Instead of requiring BPF flow dissector programs to look into skb,
pass all input data in the flow_keys.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-03 16:49:48 +02:00
Vlad Buslov
3eed52842b net: sched: don't set tunnel for decap action
Action tunnel_key doesn't have a metadata/tunnel for release(decap) action.
Drivers do not dereference entry->tunnel pointer for that action type, so
this behavior doesn't result in a crash at the moment. However, this needs
to be corrected as a preparation for updating hardware offloads API to not
rely on rtnl lock, for which flow_action code will copy the tunnel data to
temporary buffer to prevent concurrent action overwrite from
invalidating/freeing it.

Fixes: 3a7b68617d ("cls_api: add translator to flow_action representation")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-02 13:20:30 -07:00
Sheena Mira-ato
b2e54b09a3 ip6_tunnel: Match to ARPHRD_TUNNEL6 for dev type
The device type for ip6 tunnels is set to
ARPHRD_TUNNEL6. However, the ip4ip6_err function
is expecting the device type of the tunnel to be
ARPHRD_TUNNEL.  Since the device types do not
match, the function exits and the ICMP error
packet is not sent to the originating host. Note
that the device type for IPv4 tunnels is set to
ARPHRD_TUNNEL.

Fix is to expect a tunnel device type of
ARPHRD_TUNNEL6 instead.  Now the tunnel device
type matches and the ICMP error packet is sent
to the originating host.

Signed-off-by: Sheena Mira-ato <sheena.mira-ato@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-02 13:19:34 -07:00
Dan Carpenter
6d670497e0 openvswitch: use after free in __ovs_ct_free_action()
We free "ct_info->ct" and then use it on the next line when we pass it
to nf_ct_destroy_timeout().  This patch swaps the order to avoid the use
after free.

Fixes: 06bd2bdf19 ("openvswitch: Add timeout support to ct action")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-02 10:56:21 -07:00
Steffen Klassert
8742dc86d0 xfrm4: Fix uninitialized memory read in _decode_session4
We currently don't reload pointers pointing into skb header
after doing pskb_may_pull() in _decode_session4(). So in case
pskb_may_pull() changed the pointers, we read from random
memory. Fix this by putting all the needed infos on the
stack, so that we don't need to access the header pointers
after doing pskb_may_pull().

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-02 08:18:39 +02:00
Florian Westphal
97cdcf37b5 net: place xmit recursion in softnet data
This fills a hole in softnet data, so no change in structure size.

Also prepares for xmit_more placement in the same spot;
skb->xmit_more will be removed in followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 18:35:02 -07:00
YueHaibing
1d3ff0950e dccp: Fix memleak in __feat_register_sp
If dccp_feat_push_change fails, we forget free the mem
which is alloced by kmemdup in dccp_feat_clone_sp_val.

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: e8ef967a54 ("dccp: Registration routines for changing feature values")
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 18:15:10 -07:00
Xin Long
5869b8fada net: use rcu_dereference_protected to fetch sk_dst_cache in sk_destruct
As Eric noticed, in .sk_destruct, sk->sk_dst_cache update is prevented, and
no barrier is needed for this. So change to use rcu_dereference_protected()
instead of rcu_dereference_check() to fetch sk_dst_cache in there.

v1->v2:
  - no change, repost after net-next is open.

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 18:10:51 -07:00
Xin Long
09279e615c sctp: initialize _pad of sockaddr_in before copying to user memory
Syzbot report a kernel-infoleak:

  BUG: KMSAN: kernel-infoleak in _copy_to_user+0x16b/0x1f0 lib/usercopy.c:32
  Call Trace:
    _copy_to_user+0x16b/0x1f0 lib/usercopy.c:32
    copy_to_user include/linux/uaccess.h:174 [inline]
    sctp_getsockopt_peer_addrs net/sctp/socket.c:5911 [inline]
    sctp_getsockopt+0x1668e/0x17f70 net/sctp/socket.c:7562
    ...
  Uninit was stored to memory at:
    sctp_transport_init net/sctp/transport.c:61 [inline]
    sctp_transport_new+0x16d/0x9a0 net/sctp/transport.c:115
    sctp_assoc_add_peer+0x532/0x1f70 net/sctp/associola.c:637
    sctp_process_param net/sctp/sm_make_chunk.c:2548 [inline]
    sctp_process_init+0x1a1b/0x3ed0 net/sctp/sm_make_chunk.c:2361
    ...
  Bytes 8-15 of 16 are uninitialized

It was caused by that th _pad field (the 8-15 bytes) of a v4 addr (saved in
struct sockaddr_in) wasn't initialized, but directly copied to user memory
in sctp_getsockopt_peer_addrs().

So fix it by calling memset(addr->v4.sin_zero, 0, 8) to initialize _pad of
sockaddr_in before copying it to user memory in sctp_v4_addr_to_user(), as
sctp_v6_addr_to_user() does.

Reported-by: syzbot+86b5c7c236a22616a72f@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Tested-by: Alexander Potapenko <glider@google.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 18:08:19 -07:00
Jiri Slaby
3c446e6f96 kcm: switch order of device registration to fix a crash
When kcm is loaded while many processes try to create a KCM socket, a
crash occurs:
 BUG: unable to handle kernel NULL pointer dereference at 000000000000000e
 IP: mutex_lock+0x27/0x40 kernel/locking/mutex.c:240
 PGD 8000000016ef2067 P4D 8000000016ef2067 PUD 3d6e9067 PMD 0
 Oops: 0002 [#1] SMP KASAN PTI
 CPU: 0 PID: 7005 Comm: syz-executor.5 Not tainted 4.12.14-396-default #1 SLE15-SP1 (unreleased)
 RIP: 0010:mutex_lock+0x27/0x40 kernel/locking/mutex.c:240
 RSP: 0018:ffff88000d487a00 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 000000000000000e RCX: 1ffff100082b0719
 ...
 CR2: 000000000000000e CR3: 000000004b1bc003 CR4: 0000000000060ef0
 Call Trace:
  kcm_create+0x600/0xbf0 [kcm]
  __sock_create+0x324/0x750 net/socket.c:1272
 ...

This is due to race between sock_create and unfinished
register_pernet_device. kcm_create tries to do "net_generic(net,
kcm_net_id)". but kcm_net_id is not initialized yet.

So switch the order of the two to close the race.

This can be reproduced with mutiple processes doing socket(PF_KCM, ...)
and one process doing module removal.

Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 14:59:20 -07:00
Xiaofei Shen
a2c7023f70 net: dsa: read mac address from DT for slave device
Before creating a slave netdevice, get the mac address from DTS and
apply in case it is valid.

Signed-off-by: Xiaofei Shen <xiaofeis@codeaurora.org>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 14:57:32 -07:00
Paolo Abeni
e5f0e8f8e4 net: sched: introduce and use qdisc tree flush/purge helpers
The same code to flush qdisc tree and purge the qdisc queue
is duplicated in many places and in most cases it does not
respect NOLOCK qdisc: the global backlog len is used and the
per CPU values are ignored.

This change addresses the above, factoring-out the relevant
code and using the helpers introduced by the previous patch
to fetch the correct backlog len.

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 14:50:13 -07:00
Paolo Abeni
5dd431b6b9 net: sched: introduce and use qstats read helpers
Classful qdiscs can't access directly the child qdiscs backlog
length: if such qdisc is NOLOCK, per CPU values should be
accounted instead.

Most qdiscs no not respect the above. As a result, qstats fetching
for most classful qdisc is currently incorrect: if the child qdisc is
NOLOCK, it always reports 0 len backlog.

This change introduces a pair of helpers to safely fetch
both backlog and qlen and use them in stats class dumping
functions, fixing the above issue and cleaning a bit the code.

DRR needs also to access the child qdisc queue length, so it
needs custom handling.

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 14:50:13 -07:00
Nicolas Dichtel
0db6f8befc net/sched: fix ->get helper of the matchall cls
It returned always NULL, thus it was never possible to get the filter.

Example:
$ ip link add foo type dummy
$ ip link add bar type dummy
$ tc qdisc add dev foo clsact
$ tc filter add dev foo protocol all pref 1 ingress handle 1234 \
	matchall action mirred ingress mirror dev bar

Before the patch:
$ tc filter get dev foo protocol all pref 1 ingress handle 1234 matchall
Error: Specified filter handle not found.
We have an error talking to the kernel

After:
$ tc filter get dev foo protocol all pref 1 ingress handle 1234 matchall
filter ingress protocol all pref 1 matchall chain 0 handle 0x4d2
  not_in_hw
        action order 1: mirred (Ingress Mirror to device bar) pipe
        index 1 ref 1 bind 1

CC: Yotam Gigi <yotamg@mellanox.com>
CC: Jiri Pirko <jiri@mellanox.com>
Fixes: fd62d9f5c5 ("net/sched: matchall: Fix configuration race")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 14:13:25 -07:00
Stephen Suryaputra
8c83f2df9c vrf: check accept_source_route on the original netdevice
Configuration check to accept source route IP options should be made on
the incoming netdevice when the skb->dev is an l3mdev master. The route
lookup for the source route next hop also needs the incoming netdev.

v2->v3:
- Simplify by passing the original netdevice down the stack (per David
  Ahern).

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 10:44:58 -07:00
Eric Dumazet
f5d547676c tcp: fix tcp_inet6_sk() for 32bit kernels
It turns out that struct ipv6_pinfo is not located as we think.

inet6_sk_generic() and tcp_inet6_sk() disagree on 32bit kernels by 4-bytes,
because struct tcp_sock has 8-bytes alignment,
but ipv6_pinfo size is not a multiple of 8.

sizeof(struct ipv6_pinfo): 116 (not padded to 8)

I actually first coded tcp_inet6_sk() as this patch does, but thought
that "container_of(tcp_sk(sk), struct tcp6_sock, tcp)" was cleaner.

As Julian told me : Nobody should use tcp6_sock.inet6
directly, it should be accessed via tcp_inet6_sk() or inet6_sk().

This happened when we added the first u64 field in struct tcp_sock.

Fixes: 93a77c11ae ("tcp: add tcp_inet6_sk() helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 10:21:21 -07:00
Dust Li
b506bc975f tcp: fix a potential NULL pointer dereference in tcp_sk_exit
When tcp_sk_init() failed in inet_ctl_sock_create(),
 'net->ipv4.tcp_congestion_control' will be left
 uninitialized, but tcp_sk_exit() hasn't check for
 that.

 This patch add checking on 'net->ipv4.tcp_congestion_control'
 in tcp_sk_exit() to prevent NULL-ptr dereference.

Fixes: 6670e15244 ("tcp: Namespace-ify sysctl_tcp_default_congestion_control")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 10:11:41 -07:00
Xin Long
2ac695d1d6 tipc: handle the err returned from cmd header function
Syzbot found a crash:

  BUG: KMSAN: uninit-value in tipc_nl_compat_name_table_dump+0x54f/0xcd0 net/tipc/netlink_compat.c:872
  Call Trace:
    tipc_nl_compat_name_table_dump+0x54f/0xcd0 net/tipc/netlink_compat.c:872
    __tipc_nl_compat_dumpit+0x59e/0xda0 net/tipc/netlink_compat.c:215
    tipc_nl_compat_dumpit+0x63a/0x820 net/tipc/netlink_compat.c:280
    tipc_nl_compat_handle net/tipc/netlink_compat.c:1226 [inline]
    tipc_nl_compat_recv+0x1b5f/0x2750 net/tipc/netlink_compat.c:1265
    genl_family_rcv_msg net/netlink/genetlink.c:601 [inline]
    genl_rcv_msg+0x185f/0x1a60 net/netlink/genetlink.c:626
    netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
    genl_rcv+0x63/0x80 net/netlink/genetlink.c:637
    netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
    netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1336
    netlink_sendmsg+0x127f/0x1300 net/netlink/af_netlink.c:1917
    sock_sendmsg_nosec net/socket.c:622 [inline]
    sock_sendmsg net/socket.c:632 [inline]

  Uninit was created at:
    __alloc_skb+0x309/0xa20 net/core/skbuff.c:208
    alloc_skb include/linux/skbuff.h:1012 [inline]
    netlink_alloc_large_skb net/netlink/af_netlink.c:1182 [inline]
    netlink_sendmsg+0xb82/0x1300 net/netlink/af_netlink.c:1892
    sock_sendmsg_nosec net/socket.c:622 [inline]
    sock_sendmsg net/socket.c:632 [inline]

It was supposed to be fixed on commit 974cb0e3e7 ("tipc: fix uninit-value
in tipc_nl_compat_name_table_dump") by checking TLV_GET_DATA_LEN(msg->req)
in cmd->header()/tipc_nl_compat_name_table_dump_header(), which is called
ahead of tipc_nl_compat_name_table_dump().

However, tipc_nl_compat_dumpit() doesn't handle the error returned from cmd
header function. It means even when the check added in that fix fails, it
won't stop calling tipc_nl_compat_name_table_dump(), and the issue will be
triggered again.

So this patch is to add the process for the err returned from cmd header
function in tipc_nl_compat_dumpit().

Reported-by: syzbot+3ce8520484b0d4e260a5@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-31 16:45:57 -07:00
Xin Long
8c63bf9ab4 tipc: check link name with right length in tipc_nl_compat_link_set
A similar issue as fixed by Patch "tipc: check bearer name with right
length in tipc_nl_compat_bearer_enable" was also found by syzbot in
tipc_nl_compat_link_set().

The length to check with should be 'TLV_GET_DATA_LEN(msg->req) -
offsetof(struct tipc_link_config, name)'.

Reported-by: syzbot+de00a87b8644a582ae79@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-31 16:45:57 -07:00
Xin Long
6f07e5f06c tipc: check bearer name with right length in tipc_nl_compat_bearer_enable
Syzbot reported the following crash:

BUG: KMSAN: uninit-value in memchr+0xce/0x110 lib/string.c:961
  memchr+0xce/0x110 lib/string.c:961
  string_is_valid net/tipc/netlink_compat.c:176 [inline]
  tipc_nl_compat_bearer_enable+0x2c4/0x910 net/tipc/netlink_compat.c:401
  __tipc_nl_compat_doit net/tipc/netlink_compat.c:321 [inline]
  tipc_nl_compat_doit+0x3aa/0xaf0 net/tipc/netlink_compat.c:354
  tipc_nl_compat_handle net/tipc/netlink_compat.c:1162 [inline]
  tipc_nl_compat_recv+0x1ae7/0x2750 net/tipc/netlink_compat.c:1265
  genl_family_rcv_msg net/netlink/genetlink.c:601 [inline]
  genl_rcv_msg+0x185f/0x1a60 net/netlink/genetlink.c:626
  netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
  genl_rcv+0x63/0x80 net/netlink/genetlink.c:637
  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
  netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1336
  netlink_sendmsg+0x127f/0x1300 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:622 [inline]
  sock_sendmsg net/socket.c:632 [inline]

Uninit was created at:
  __alloc_skb+0x309/0xa20 net/core/skbuff.c:208
  alloc_skb include/linux/skbuff.h:1012 [inline]
  netlink_alloc_large_skb net/netlink/af_netlink.c:1182 [inline]
  netlink_sendmsg+0xb82/0x1300 net/netlink/af_netlink.c:1892
  sock_sendmsg_nosec net/socket.c:622 [inline]
  sock_sendmsg net/socket.c:632 [inline]

It was triggered when the bearer name size < TIPC_MAX_BEARER_NAME,
it would check with a wrong len/TLV_GET_DATA_LEN(msg->req), which
also includes priority and disc_domain length.

This patch is to fix it by checking it with a right length:
'TLV_GET_DATA_LEN(msg->req) - offsetof(struct tipc_bearer_config, name)'.

Reported-by: syzbot+8b707430713eb46e1e45@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-31 16:45:57 -07:00
Linus Torvalds
7376e39ad9 A patch to avoid choking on multipage bvecs in the messenger and
a small use-after-free fix.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlyeRsgTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi8i6B/9wP90ZLGzdAZDIlfWKXjGB1PUrFdeN
 WCA5p68Hl7yh1RbY6cvbZcTF5Bo3DhjxjxTFjXHPXLxsARlxbCXon9R6Lo2lDgA4
 Bk/W8dcR3onU3nspifG91Him/WnImWB80pyVgZog2PTiwsZJ0rRknXXbRU9ARCpk
 8vjg19O4wHwXgtMXAN3vxjQ7v8T8wk8vDb08efPcmMPLDYMaTUL1z2JoqyRfMTbo
 OpZoXSjHXqVFfz0mJ5EN7+92eK39oDcQIDSuuqePDCI09ZmrcQd/xSvG5tBfPoXr
 1mR3ojkKRURW5RKGClbSoAt90vIuYJH5Cncmemzsr6m4FETH6XthGbJl
 =twzl
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.1-rc3' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A patch to avoid choking on multipage bvecs in the messenger and a
  small use-after-free fix"

* tag 'ceph-for-5.1-rc3' of git://github.com/ceph/ceph-client:
  ceph: fix use-after-free on symlink traversal
  libceph: fix breakage caused by multipage bvecs
2019-03-29 14:41:09 -07:00
Julian Wiedmann
35f861e3c5 net: bridge: use netif_is_bridge_port()
Replace the br_port_exists() macro with its twin from netdevice.h

CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 13:48:40 -07:00
Li RongQing
3d8830266f net: ethtool: not call vzalloc for zero sized memory request
NULL or ZERO_SIZE_PTR will be returned for zero sized memory
request, and derefencing them will lead to a segfault

so it is unnecessory to call vzalloc for zero sized memory
request and not call functions which maybe derefence the
NULL allocated memory

this also fixes a possible memory leak if phy_ethtool_get_stats
returns error, memory should be freed before exit

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Wang Li <wangli39@baidu.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 13:41:55 -07:00
Jakub Kicinski
c43ac97bac net: tls: prevent false connection termination with offload
Only decrypt_internal() performs zero copy on rx, all paths
which don't hit decrypt_internal() must set zc to false,
otherwise tls_sw_recvmsg() may return 0 causing the application
to believe that that connection got closed.

Currently this happens with device offload when new record
is first read from.

Fixes: d069b780e3 ("tls: Fix tls_device receive")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 13:38:50 -07:00
wenxu
18b6f71748 openvswitch: Make metadata_dst tunnel work in IP_TUNNEL_INFO_BRIDGE mode
There is currently no support for the multicast/broadcast aspects
of VXLAN in ovs. In the datapath flow the tun_dst must specific.
But in the IP_TUNNEL_INFO_BRIDGE mode the tun_dst can not be specific.
And the packet can forward through the fdb table of vxlan devcice. In
this mode the broadcast/multicast packet can be sent through the
following ways in ovs.

ovs-vsctl add-port br0 vxlan -- set in vxlan type=vxlan \
        options:key=1000 options:remote_ip=flow
ovs-ofctl add-flow br0 in_port=LOCAL,dl_dst=ff:ff:ff:ff:ff:ff, \
        action=output:vxlan

bridge fdb append ff:ff:ff:ff:ff:ff dev vxlan_sys_4789 dst 172.168.0.1 \
        src_vni 1000 vni 1000 self
bridge fdb append ff:ff:ff:ff:ff:ff dev vxlan_sys_4789 dst 172.168.0.2 \
        src_vni 1000 vni 1000 self

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 13:29:15 -07:00
Eric Dumazet
eb70a1ae23 tcp: cleanup sk_tx_skb_cache before reuse
TCP stack relies on the fact that a freshly allocated skb
has skb->cb[] and skb_shinfo(skb)->tx_flags cleared.

When recycling tx skb, we must ensure these fields are cleared.

Fixes: 472c2e07ee ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 13:16:44 -07:00
David Ahern
3616d08bcb ipv6: Move ipv6 stubs to a separate header file
The number of stubs is growing and has nothing to do with addrconf.
Move the definition of the stubs to a separate header file and update
users. In the move, drop the vxlan specific comment before ipv6_stub.

Code move only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:53:45 -07:00
David Ahern
979e276ebe net: Use common nexthop init and release helpers
With fib_nh_common in place, move common initialization and release
code into helpers used by both ipv4 and ipv6. For the moment, the init
is just the lwt encap and the release is both the netdev reference and
the the lwt state reference. More will be added later.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
f1741730dd net: Add fib_nh_common and update fib_nh and fib6_nh
Add fib_nh_common struct with common nexthop attributes. Convert
fib_nh and fib6_nh to use it. Use macros to move existing
fib_nh_* references to the new nh_common.nhc_*.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
ad1601ae02 ipv6: Rename fib6_nh entries
Rename fib6_nh entries that will be moved to a fib_nh_common struct.
Specifically, the device, gateway, flags, and lwtstate are common
with all nexthop definitions. In some places new temporary variables
are declared or local variables renamed to maintain line lengths.

Rename only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
b75ed8b1aa ipv4: Rename fib_nh entries
Rename fib_nh entries that will be moved to a fib_nh_common struct.
Specifically, the device, oif, gateway, flags, scope, lwtstate,
nh_weight and nh_upper_bound are common with all nexthop definitions.
In the process shorten fib_nh_lwtstate to fib_nh_lws to avoid really
long lines.

Rename only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
572bf4dd71 ipv6: Change rt6_add_nexthop and rt6_nexthop_info to take fib6_nh
rt6_add_nexthop and rt6_nexthop_info only need the fib6_info for the
gateway flag and the nexthop weight, and the presence of a gateway is now
per-nexthop. Update the signatures to take a fib6_nh and nexthop weight
and better align with the ipv4 versions.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
6d3d07b45c ipv6: Refactor fib6_ignore_linkdown
fib6_ignore_linkdown takes a fib6_info but only looks at the net_device
and its IPv6 config. Change it to take a net_device over a fib6_info as
its input argument.

In addition, move it to a header file to make the check inline and usable
later with IPv4 code without going through the ipv6 stub, and rename to
ip6_ignore_linkdown since it is only checking the setting based on the
ipv6 struct on a device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
2b2450ca4a ipv6: Move gateway checks to a fib6_nh setting
The gateway setting is not per fib6_info entry but per-fib6_nh. Add a new
fib_nh_has_gw flag to fib6_nh and convert references to RTF_GATEWAY to
the new flag. For IPv6 address the flag is cheaper than checking that
nh_gw is non-0 like IPv4 does.

While this increases fib6_nh by 8-bytes, the effective allocation size of
a fib6_info is unchanged. The 8 bytes is recovered later with a
fib_nh_common change.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
dac7d0f270 ipv6: Create cleanup helper for fib6_nh
Move the fib6_nh cleanup code to a new helper, fib6_nh_release.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
83c4425159 ipv6: Create init helper for fib6_nh
Similar to IPv4, consolidate the fib6_nh initialization into a helper.
As a new standalone function, add a cleanup path to put lwtstate on
error.

To avoid modifying fib6_config flags, move the reject check to a helper
that is invoked once by fib6_nh_init to reset the device and then
again in ip6_route_info_create to set the fib6_flags.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
faa041a40b ipv4: Create cleanup helper for fib_nh
Move the fib_nh cleanup code from free_fib_info_rcu into a new helper,
fib_nh_release. Move classid accounting into fib_nh_release which is
called per fib_nh to make accounting symmetrical with fib_nh_init.
Export the helper to allow for use with nexthop objects in the
future.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
e4516ef654 ipv4: Create init helper for fib_nh
Consolidate the fib_nh initialization which is duplicated between
fib_create_info for single path and fib_get_nhs for multipath.
Export the helper to allow for use with nexthop objects in the
future.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
331c7a4023 ipv4: Move IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN to helper
in_dev lookup followed by IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN check
is called in several places, some with the rcu lock and others with the
rtnl held.

Move the check to a helper similar to what IPv6 has. Since the helper
can be invoked from either context use rcu_dereference_rtnl to
dereference ip_ptr.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
8373c6c84e ipv4: Define fib_get_nhs when CONFIG_IP_ROUTE_MULTIPATH is disabled
Define fib_get_nhs to return EINVAL when CONFIG_IP_ROUTE_MULTIPATH is
not enabled and remove the ifdef check for CONFIG_IP_ROUTE_MULTIPATH
in fib_create_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
Felix Fietkau
5b989c18da mac80211: rework locking for txq scheduling / airtime fairness
Holding the lock around the entire duration of tx scheduling can create
some nasty lock contention, especially when processing airtime information
from the tx status or the rx path.
Improve locking by only holding the active_txq_lock for lookups / scheduling
list modifications.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:27:44 +01:00
Sunil Dutt
d6db02a88a nl80211: Add NL80211_FLAG_CLEAR_SKB flag for other NL commands
This commit adds NL80211_FLAG_CLEAR_SKB flag to other NL commands
that carry key data to ensure they do not stick around on heap
after the SKB is freed.

Also introduced this flag for NL80211_CMD_VENDOR as there are sub
commands which configure the keys.

Signed-off-by: Sunil Dutt <usdutt@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:23:09 +01:00
YueHaibing
90abf96abd cfg80211: Use kmemdup in cfg80211_gen_new_ie()
Use kmemdup rather than duplicating its implementation

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:22:55 +01:00
Felix Fietkau
4856bfd230 mac80211: do not call driver wake_tx_queue op during reconfig
There are several scenarios in which mac80211 can call drv_wake_tx_queue
after ieee80211_restart_hw has been called and has not yet completed.
Driver private structs are considered uninitialized until mac80211 has
uploaded the vifs, stations and keys again, so using private tx queue
data during that time is not safe.

The driver can also not rely on drv_reconfig_complete to figure out when
it is safe to accept drv_wake_tx_queue calls again, because it is only
called after all tx queues are woken again.

To fix this, bail out early in drv_wake_tx_queue if local->in_reconfig
is set.

Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:22:41 +01:00
Nathan Chancellor
344c9719c5 cfg80211: Change an 'else if' into an 'else' in cfg80211_calculate_bitrate_he
When building with -Wsometimes-uninitialized, Clang warns:

net/wireless/util.c:1223:11: warning: variable 'result' is used
uninitialized whenever 'if' condition is false
[-Wsometimes-uninitialized]

Clang can't evaluate at this point that WARN(1, ...) always returns true
because __ret_warn_on is defined as !!(condition), which isn't
immediately evaluated as 1. Change this branch to else so that it's
clear to Clang that we intend to bail out here.

Link: https://github.com/ClangBuiltLinux/linux/issues/382
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:22:23 +01:00
Felix Fietkau
eb9b64e3a9 mac80211: fix memory accounting with A-MSDU aggregation
skb->truesize can change due to memory reallocation or when adding extra
fragments. Adjust fq->memory_usage accordingly

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:21:25 +01:00
Ilan Peer
08a75a887e cfg80211: Handle WMM rules in regulatory domain intersection
The support added for regulatory WMM rules did not handle
the case of regulatory domain intersections. Fix it.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Fixes: 230ebaa189 ("cfg80211: read wmm rules from regulatory database")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:20:58 +01:00
Andrei Otcheretianski
78be2d21cc mac80211: Increase MAX_MSG_LEN
Looks that 100 chars isn't enough for messages, as we keep getting
warnings popping from different places due to message shortening.
Instead of trying to shorten the prints, just increase the buffer size.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:20:36 +01:00
Felix Fietkau
40586e3fc4 mac80211: fix unaligned access in mesh table hash function
The pointer to the last four bytes of the address is not guaranteed to be
aligned, so we need to use __get_unaligned_cpu32 here

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:18:48 +01:00
Felix Fietkau
b49c15e121 mac80211: un-schedule TXQs on powersave start
Once a station enters powersave, its queues should not be returned by
ieee80211_next_txq() anymore. They will be re-scheduled again after the
station has woken up again

Fixes: 1866760096 ("mac80211: Add TXQ scheduling API")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:14:11 +01:00
Alexander Lobakin
9a5a90d167 net: core: netif_receive_skb_list: unlist skb before passing to pt->func
__netif_receive_skb_list_ptype() leaves skb->next poisoned before passing
it to pt_prev->func handler, what may produce (in certain cases, e.g. DSA
setup) crashes like:

[ 88.606777] CPU 0 Unable to handle kernel paging request at virtual address 0000000e, epc == 80687078, ra == 8052cc7c
[ 88.618666] Oops[#1]:
[ 88.621196] CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-rc2-dlink-00206-g4192a172-dirty #1473
[ 88.630885] $ 0 : 00000000 10000400 00000002 864d7850
[ 88.636709] $ 4 : 87c0ddf0 864d7800 87c0ddf0 00000000
[ 88.642526] $ 8 : 00000000 49600000 00000001 00000001
[ 88.648342] $12 : 00000000 c288617b dadbee27 25d17c41
[ 88.654159] $16 : 87c0ddf0 85cff080 80790000 fffffffd
[ 88.659975] $20 : 80797b20 ffffffff 00000001 864d7800
[ 88.665793] $24 : 00000000 8011e658
[ 88.671609] $28 : 80790000 87c0dbc0 87cabf00 8052cc7c
[ 88.677427] Hi : 00000003
[ 88.680622] Lo : 7b5b4220
[ 88.683840] epc : 80687078 vlan_dev_hard_start_xmit+0x1c/0x1a0
[ 88.690532] ra : 8052cc7c dev_hard_start_xmit+0xac/0x188
[ 88.696734] Status: 10000404	IEp
[ 88.700422] Cause : 50000008 (ExcCode 02)
[ 88.704874] BadVA : 0000000e
[ 88.708069] PrId : 0001a120 (MIPS interAptiv (multi))
[ 88.713005] Modules linked in:
[ 88.716407] Process swapper (pid: 0, threadinfo=(ptrval), task=(ptrval), tls=00000000)
[ 88.725219] Stack : 85f61c28 00000000 0000000e 80780000 87c0ddf0 85cff080 80790000 8052cc7c
[ 88.734529] 87cabf00 00000000 00000001 85f5fb40 807b0000 864d7850 87cabf00 807d0000
[ 88.743839] 864d7800 8655f600 00000000 85cff080 87c1c000 0000006a 00000000 8052d96c
[ 88.753149] 807a0000 8057adb8 87c0dcc8 87c0dc50 85cfff08 00000558 87cabf00 85f58c50
[ 88.762460] 00000002 85f58c00 864d7800 80543308 fffffff4 00000001 85f58c00 864d7800
[ 88.771770] ...
[ 88.774483] Call Trace:
[ 88.777199] [<80687078>] vlan_dev_hard_start_xmit+0x1c/0x1a0
[ 88.783504] [<8052cc7c>] dev_hard_start_xmit+0xac/0x188
[ 88.789326] [<8052d96c>] __dev_queue_xmit+0x6e8/0x7d4
[ 88.794955] [<805a8640>] ip_finish_output2+0x238/0x4d0
[ 88.800677] [<805ab6a0>] ip_output+0xc8/0x140
[ 88.805526] [<805a68f4>] ip_forward+0x364/0x560
[ 88.810567] [<805a4ff8>] ip_rcv+0x48/0xe4
[ 88.815030] [<80528d44>] __netif_receive_skb_one_core+0x44/0x58
[ 88.821635] [<8067f220>] dsa_switch_rcv+0x108/0x1ac
[ 88.827067] [<80528f80>] __netif_receive_skb_list_core+0x228/0x26c
[ 88.833951] [<8052ed84>] netif_receive_skb_list+0x1d4/0x394
[ 88.840160] [<80355a88>] lunar_rx_poll+0x38c/0x828
[ 88.845496] [<8052fa78>] net_rx_action+0x14c/0x3cc
[ 88.850835] [<806ad300>] __do_softirq+0x178/0x338
[ 88.856077] [<8012a2d4>] irq_exit+0xbc/0x100
[ 88.860846] [<802f8b70>] plat_irq_dispatch+0xc0/0x144
[ 88.866477] [<80105974>] handle_int+0x14c/0x158
[ 88.871516] [<806acfb0>] r4k_wait+0x30/0x40
[ 88.876462] Code: afb10014 8c8200a0 00803025 <9443000c> 94a20468 00000000 10620042 00a08025 9605046a
[ 88.887332]
[ 88.888982] ---[ end trace eb863d007da11cf1 ]---
[ 88.894122] Kernel panic - not syncing: Fatal exception in interrupt
[ 88.901202] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Fix this by pulling skb off the sublist and zeroing skb->next pointer
before calling ptype callback.

Fixes: 88eb1944e1 ("net: core: propagate SKB lists through packet_type lookup")
Reviewed-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 17:28:05 -07:00
Mao Wenan
cb66ddd156 net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock().
When it is to cleanup net namespace, rds_tcp_exit_net() will call
rds_tcp_kill_sock(), if t_sock is NULL, it will not call
rds_conn_destroy(), rds_conn_path_destroy() and rds_tcp_conn_free() to free
connection, and the worker cp_conn_w is not stopped, afterwards the net is freed in
net_drop_ns(); While cp_conn_w rds_connect_worker() will call rds_tcp_conn_path_connect()
and reference 'net' which has already been freed.

In rds_tcp_conn_path_connect(), rds_tcp_set_callbacks() will set t_sock = sock before
sock->ops->connect, but if connect() is failed, it will call
rds_tcp_restore_callbacks() and set t_sock = NULL, if connect is always
failed, rds_connect_worker() will try to reconnect all the time, so
rds_tcp_kill_sock() will never to cancel worker cp_conn_w and free the
connections.

Therefore, the condition !tc->t_sock is not needed if it is going to do
cleanup_net->rds_tcp_exit_net->rds_tcp_kill_sock, because tc->t_sock is always
NULL, and there is on other path to cancel cp_conn_w and free
connection. So this patch is to fix this.

rds_tcp_kill_sock():
...
if (net != c_net || !tc->t_sock)
...
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

==================================================================
BUG: KASAN: use-after-free in inet_create+0xbcc/0xd28
net/ipv4/af_inet.c:340
Read of size 4 at addr ffff8003496a4684 by task kworker/u8:4/3721

CPU: 3 PID: 3721 Comm: kworker/u8:4 Not tainted 5.1.0 #11
Hardware name: linux,dummy-virt (DT)
Workqueue: krdsd rds_connect_worker
Call trace:
 dump_backtrace+0x0/0x3c0 arch/arm64/kernel/time.c:53
 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:152
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x120/0x188 lib/dump_stack.c:113
 print_address_description+0x68/0x278 mm/kasan/report.c:253
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x21c/0x348 mm/kasan/report.c:409
 __asan_report_load4_noabort+0x30/0x40 mm/kasan/report.c:429
 inet_create+0xbcc/0xd28 net/ipv4/af_inet.c:340
 __sock_create+0x4f8/0x770 net/socket.c:1276
 sock_create_kern+0x50/0x68 net/socket.c:1322
 rds_tcp_conn_path_connect+0x2b4/0x690 net/rds/tcp_connect.c:114
 rds_connect_worker+0x108/0x1d0 net/rds/threads.c:175
 process_one_work+0x6e8/0x1700 kernel/workqueue.c:2153
 worker_thread+0x3b0/0xdd0 kernel/workqueue.c:2296
 kthread+0x2f0/0x378 kernel/kthread.c:255
 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1117

Allocated by task 687:
 save_stack mm/kasan/kasan.c:448 [inline]
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xd4/0x180 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x14/0x20 mm/kasan/kasan.c:490
 slab_post_alloc_hook mm/slab.h:444 [inline]
 slab_alloc_node mm/slub.c:2705 [inline]
 slab_alloc mm/slub.c:2713 [inline]
 kmem_cache_alloc+0x14c/0x388 mm/slub.c:2718
 kmem_cache_zalloc include/linux/slab.h:697 [inline]
 net_alloc net/core/net_namespace.c:384 [inline]
 copy_net_ns+0xc4/0x2d0 net/core/net_namespace.c:424
 create_new_namespaces+0x300/0x658 kernel/nsproxy.c:107
 unshare_nsproxy_namespaces+0xa0/0x198 kernel/nsproxy.c:206
 ksys_unshare+0x340/0x628 kernel/fork.c:2577
 __do_sys_unshare kernel/fork.c:2645 [inline]
 __se_sys_unshare kernel/fork.c:2643 [inline]
 __arm64_sys_unshare+0x38/0x58 kernel/fork.c:2643
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
 el0_svc_common+0x168/0x390 arch/arm64/kernel/syscall.c:83
 el0_svc_handler+0x60/0xd0 arch/arm64/kernel/syscall.c:129
 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:960

Freed by task 264:
 save_stack mm/kasan/kasan.c:448 [inline]
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:521
 kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:528
 slab_free_hook mm/slub.c:1370 [inline]
 slab_free_freelist_hook mm/slub.c:1397 [inline]
 slab_free mm/slub.c:2952 [inline]
 kmem_cache_free+0xb8/0x3a8 mm/slub.c:2968
 net_free net/core/net_namespace.c:400 [inline]
 net_drop_ns.part.6+0x78/0x90 net/core/net_namespace.c:407
 net_drop_ns net/core/net_namespace.c:406 [inline]
 cleanup_net+0x53c/0x6d8 net/core/net_namespace.c:569
 process_one_work+0x6e8/0x1700 kernel/workqueue.c:2153
 worker_thread+0x3b0/0xdd0 kernel/workqueue.c:2296
 kthread+0x2f0/0x378 kernel/kthread.c:255
 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1117

The buggy address belongs to the object at ffff8003496a3f80
 which belongs to the cache net_namespace of size 7872
The buggy address is located 1796 bytes inside of
 7872-byte region [ffff8003496a3f80, ffff8003496a5e40)
The buggy address belongs to the page:
page:ffff7e000d25a800 count:1 mapcount:0 mapping:ffff80036ce4b000
index:0x0 compound_mapcount: 0
flags: 0xffffe0000008100(slab|head)
raw: 0ffffe0000008100 dead000000000100 dead000000000200 ffff80036ce4b000
raw: 0000000000000000 0000000080040004 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8003496a4580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8003496a4600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8003496a4680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff8003496a4700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8003496a4780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Fixes: 467fa15356ac("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 17:17:18 -07:00
Andrea Righi
f28cd2af22 openvswitch: fix flow actions reallocation
The flow action buffer can be resized if it's not big enough to contain
all the requested flow actions. However, this resize doesn't take into
account the new requested size, the buffer is only increased by a factor
of 2x. This might be not enough to contain the new data, causing a
buffer overflow, for example:

[   42.044472] =============================================================================
[   42.045608] BUG kmalloc-96 (Not tainted): Redzone overwritten
[   42.046415] -----------------------------------------------------------------------------

[   42.047715] Disabling lock debugging due to kernel taint
[   42.047716] INFO: 0x8bf2c4a5-0x720c0928. First byte 0x0 instead of 0xcc
[   42.048677] INFO: Slab 0xbc6d2040 objects=29 used=18 fp=0xdc07dec4 flags=0x2808101
[   42.049743] INFO: Object 0xd53a3464 @offset=2528 fp=0xccdcdebb

[   42.050747] Redzone 76f1b237: cc cc cc cc cc cc cc cc                          ........
[   42.051839] Object d53a3464: 6b 6b 6b 6b 6b 6b 6b 6b 0c 00 00 00 6c 00 00 00  kkkkkkkk....l...
[   42.053015] Object f49a30cc: 6c 00 0c 00 00 00 00 00 00 00 00 03 78 a3 15 f6  l...........x...
[   42.054203] Object acfe4220: 20 00 02 00 ff ff ff ff 00 00 00 00 00 00 00 00   ...............
[   42.055370] Object 21024e91: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   42.056541] Object 070e04c3: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   42.057797] Object 948a777a: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   42.059061] Redzone 8bf2c4a5: 00 00 00 00                                      ....
[   42.060189] Padding a681b46e: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ

Fix by making sure the new buffer is properly resized to contain all the
requested data.

BugLink: https://bugs.launchpad.net/bugs/1813244
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 17:15:44 -07:00
Eric Dumazet
355b985537 netns: provide pure entropy for net_hash_mix()
net_hash_mix() currently uses kernel address of a struct net,
and is used in many places that could be used to reveal this
address to a patient attacker, thus defeating KASLR, for
the typical case (initial net namespace, &init_net is
not dynamically allocated)

I believe the original implementation tried to avoid spending
too many cycles in this function, but security comes first.

Also provide entropy regardless of CONFIG_NET_NS.

Fixes: 0b4419162a ("netns: introduce the net_hash_mix "salt" for hashes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 17:00:45 -07:00
xiaofeis
6e57d72a84 net: dsa: Implement flow_dissect callback for tag_qca
Add flow_dissect for qca tagged packet to get the right hash.

Signed-off-by: Xiaofei Shen <xiaofeis@codeaurora.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vinod Koul <vkoul@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 16:57:19 -07:00
Yi-Hung Wei
06bd2bdf19 openvswitch: Add timeout support to ct action
Add support for fine-grain timeout support to conntrack action.
The new OVS_CT_ATTR_TIMEOUT attribute of the conntrack action
specifies a timeout to be associated with this connection.
If no timeout is specified, it acts as is, that is the default
timeout for the connection will be automatically applied.

Example usage:
$ nfct timeout add timeout_1 inet tcp syn_sent 100 established 200
$ ovs-ofctl add-flow br0 in_port=1,ip,tcp,action=ct(commit,timeout=timeout_1)

CC: Pravin Shelar <pshelar@ovn.org>
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 16:53:29 -07:00
Yi-Hung Wei
717700d183 netfilter: Export nf_ct_{set,destroy}_timeout()
This patch exports nf_ct_set_timeout() and nf_ct_destroy_timeout().
The two functions are derived from xt_ct_destroy_timeout() and
xt_ct_set_timeout() in xt_CT.c, and moved to nf_conntrack_timeout.c
without any functional change.
It would be useful for other users (i.e. OVS) that utilizes the
finer-grain conntrack timeout feature.

CC: Pablo Neira Ayuso <pablo@netfilter.org>
CC: Pravin Shelar <pshelar@ovn.org>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 16:53:29 -07:00
Jiri Pirko
746364f298 net: devlink: add warning for ndo_get_phys_port_name set when not needed
Currently if the driver registers devlink port instance, it should set
the devlink port attributes as well. Then the devlink core is able to
obtain physical port name itself, no need for driver to implement
the ndo. Once all drivers will implement devlink port registration,
this ndo should be removed. This warning guides new
drivers to do things as they should be done.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:31 -07:00
Jiri Pirko
d484210bf7 dsa: do not support ndo_get_phys_port_name for non-legacy ports
Since each non-legacy slave has its own devlink port instance
correctly set, rely on devlink core to generate correct phys port name.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:31 -07:00
Jiri Pirko
716efee200 dsa: implement ndo_get_devlink_port
In order for devlink compat functions to work, implement
ndo_get_devlink_port. Legacy slaves does not have devlink port instances
created for themselves.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:31 -07:00
Jiri Pirko
14c03ac4c1 net: devlink: remove unused devlink_port_get_phys_port_name() function
Now it is unused, remove it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:30 -07:00
Jiri Pirko
af3836df9a net: devlink: introduce devlink_compat_phys_port_name_get()
Introduce devlink_compat_phys_port_name_get() helper that
gets the physical port name for specified netdevice
according to devlink port attributes.
Call this helper from dev_get_phys_port_name()
in case ndo_get_phys_port_name is not defined.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:30 -07:00
David S. Miller
ede1fd1851 This feature/cleanup patchset includes the following patches:
- Drop license boilerplate (obsoleted by SPDX license IDs),
    by Sven Eckelmann
 
  - Drop documentation for sysfs and debugfs Documentation,
    by Sven Eckelmann (2 patches)
 
  - Mark sysfs as optional and deprecated, by Sven Eckelmann (3 patches)
 
  - Update MAINTAINERS Tree, Chat and Bugtracker,
    by Sven Eckelmann (3 patches)
 
  - Rename batadv_dat_send_data, by Sven Eckelmann
 
  - update DAT entries with incoming ARP replies, by Linus Luessing
 
  - add multicast-to-unicast support for limited destinations,
    by Linus Luessing
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlyc6uEWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoSA9D/sEpVY0qOITIwzbttcyeDU5PPSD
 OF4dVCf6Za6CqfnPRCdViKAGtC1FOz+X2BXtedrIxgsjSPFoRvRoi1XBdu4Bobv2
 /4wx56rz3AeMoBZ1UyziUIS6Qam1x7vVYSRXk+QHqBYVc16YiIePpCqTuryrzuk4
 4MMqXz+V0dqm7z7irRDe7W9/CdFRtZEDAS8o6cgw4IlL56Ul3Yz6xP6p3PRA+H6V
 OWtVwmwcbX2KzZnrWDgql5NBhJ1bOfn2oDp1Y4RpLRmBp0iwg1qZdNZK2+MD2TTw
 xxuz5lsZFhTBXNqGgeoGk87m2z0wNkvnj9UnkMPl3gb7j+FyyaAgvVY4M2s2qJv/
 ++wKDPPun/aGDOuo/rJdBTdlnToH17KS3jsDwhj4TooroI8uCCLWZQaYWkgjcugD
 ZKsZlIqFrfH3rPAzOBwRZodoYkOPpz/+xHp3p/cg9ANifwqpxqq3PY35BoP4ZXRi
 xUy79QgNIFxYXwrrqTrt3UrY8AGo1/OOHmA6nFQGZT79S648ZoG5vPDKFKRzTmcj
 Mj2GXuBzMIkWayHgnH69Kv9vVZc7mZPi7lartsVq/aZtMCh3HbPNfKtNOYsu4QEq
 6c2966jvFB+LdTibiJQWbe0s5Z96UaFQUxH5+gGdM5TS5TCIaG3udXoI1ou4YVJI
 q6eOdAgblbD7oaNY4w==
 =WB31
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20190328' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - Drop license boilerplate (obsoleted by SPDX license IDs),
   by Sven Eckelmann

 - Drop documentation for sysfs and debugfs Documentation,
   by Sven Eckelmann (2 patches)

 - Mark sysfs as optional and deprecated, by Sven Eckelmann (3 patches)

 - Update MAINTAINERS Tree, Chat and Bugtracker,
   by Sven Eckelmann (3 patches)

 - Rename batadv_dat_send_data, by Sven Eckelmann

 - update DAT entries with incoming ARP replies, by Linus Luessing

 - add multicast-to-unicast support for limited destinations,
   by Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 09:52:42 -07:00
David S. Miller
d3332184f1 Here are some batman-adv bugfixes:
- Fix refcount underflows in bridge loop avoidance code,
    by Sven Eckelmann (3 patches)
 
  - Fix warning when CFG80211 isn't enabled, by Anders Roxell
 
  - Fix genl notification for throughput override, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlyc6H0WHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoQh1D/sHYiFZsMDvwZ6s1ryUSwfj5ThX
 3hgYsWzjN6MW2WMFpM/nQ448bLqvXveYOA0R+IpVWB5LOOmRFs7lgjx6J+Vq8SQk
 dGx+dBrywqTLRzf5al5RPIW2sO4Z6JLXyGU0KioywSd61qj7ii8sRf0fK0tDy/Mr
 EVvTsjs2nywj43b4dy5szJ1hELW5CEUAtNzGDZA0DogH9clhsTi1yG07eC8UrGpj
 n0YrbB7SZFFQkNCpbN0DR7qeKwDEPedRQTlDxUtxXCE0acRyHai1qKFoMzEWMJZ0
 tLqTMYNHRxCYTv3D8qLQVmq2cO6rCUodfjrMKdyiZs6pM5QR39sBhWfzH1WObo90
 gMXM6hAkRGnLgUSh0Wbz1TkRMio8NgPNr9hAdUsuF5yYSni0RxZtGfCeUiby8Znk
 IGLNHdmITgfFcxHuVtFpBspl3Bdsbnk0+PR4RMkiMtoPhz8RG2oltGpvZ9sG8hJ3
 W8ZSI0f2LWueygwKEaiC4Kb7Kz1pgKiB5xPWdBK6+WfujRV+zK3+n+JSygLuJhlf
 tYpRZk0A2JBHPzDC3B865FFo9089SBBN/WekohsVibnWN7MQnolnGhlFyekjic04
 aPMvn+OX7vS9lnSSkBcsen0GFtqZFd7pzt9hWQkyHuJZ8b48KebjuL3Z4w0Uaz6c
 WqUXAKKs2KPFIKEgFg==
 =+6Ww
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20190328' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - Fix refcount underflows in bridge loop avoidance code,
   by Sven Eckelmann (3 patches)

 - Fix warning when CFG80211 isn't enabled, by Anders Roxell

 - Fix genl notification for throughput override, by Sven Eckelmann

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 09:51:03 -07:00
Herbert Xu
b5f9bd15b8 ila: Fix rhashtable walker list corruption
ila_xlat_nl_cmd_flush uses rhashtable walkers allocated from the
stack but it never frees them.  This corrupts the walker list of
the hash table.

This patch fixes it.

Reported-by: syzbot+dae72a112334aa65a159@syzkaller.appspotmail.com
Fixes: b6e71bdebb ("ila: Flush netlink command to clear xlat...")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27 22:46:16 -07:00